Month: July 2019

CHAR Padding in Snowflake

For strings that need to be fixed-width, it’s common to use the CHAR data type, for example CHAR(10). Many databases will right-pad a CHAR(10) field that has fewer than 10 characters with spaces.

If we define a CHAR(10) column and store “ABCDE” in it, some databases will add spaces on the right to fill the field out to 10 characters. In the example below, the “.” character represents a space:

Column_Name
1234567890
ABCDE.....

A side effect of the way Snowflake stores strings internally is that a column defined as CHAR(10) holding only five characters such as “ABCDE” will return just those five characters instead of the full, space-padded 10-character value. From the Snowflake documentation:

Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#char-character

For use cases where fixed-width strings are important, there are a couple of options. One option is to right-pad the strings on ingest, but this adds work and storage overhead.
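
As a rough sketch, padding on ingest might look like this (the table and column names here are illustrative, not from the original post):

--Pad values out to the full width as they are inserted.
create or replace table FIXED_WIDTH_DEMO (COL10 char(10));
insert into FIXED_WIDTH_DEMO (COL10) values (RPAD('ABCDE', 10, ' '));

--LEN returns 10 because the value was stored already padded.
select COL10, LENGTH(COL10) as LEN from FIXED_WIDTH_DEMO;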

I think a better way is to create a simple view that returns right-padded strings for columns defined as CHAR. Fortunately, the Snowflake RPAD function makes this easy.

The following sample shows how to create a view that ensures fixed-width columns. Note that since CHAR is a synonym for VARCHAR in Snowflake, this technique works with columns defined as CHAR, CHARACTER, VARCHAR, STRING, and TEXT.

--Create a simple test table with a CHAR(10)
create table TEST_CHAR
    (
    TEST_COLUMN char(10)
    );

--Insert some values 
insert into TEST_CHAR (TEST_COLUMN) values ('1234');
insert into TEST_CHAR (TEST_COLUMN) values ('12345');
insert into TEST_CHAR (TEST_COLUMN) values ('1234567890');

--Create a view that right pads the CHAR fields to width 10
--Change the space to another character to see the effect.
create or replace view V_TEST_CHAR as
select RPAD(TEST_COLUMN, 10, ' ') as TEST_COLUMN from TEST_CHAR;

--Select from the view.
select * from V_TEST_CHAR;
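
To confirm that the view returns fixed-width values, check the lengths; LEN should be 10 for every row, including the rows that were stored with fewer than 10 characters:

--Verify that every value comes back at the full width of 10.
select TEST_COLUMN, LENGTH(TEST_COLUMN) as LEN from V_TEST_CHAR;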

Creating a Hello World Stored Procedure in Snowflake

Snowflake makes creating stored procedures easy. Make sure the database where you want to create the stored procedure is the one currently in context.

--Create a hello world Snowflake Stored Procedure.
create or replace procedure MyProcedure()
--A stored procedure can optionally return a value (only one).
returns VARCHAR
language JavaScript
as
  -- The "$$" indicates the beginning and end of the JavaScript code
  -- in the stored procedure
  $$  
  
  // Since we're in Javascript now, use Javascript comment markers.
  return "Hello world."
  
  $$;
  

Now, let’s run the stored procedure.

  --Call the stored procedure.
  call MyProcedure();

The output is a single row containing the procedure’s return value: Hello world.
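
As a next step, a stored procedure can also accept arguments. Here is a minimal variant (the procedure name and argument are illustrative, not from the original post). Note that inside the JavaScript body, arguments are referenced as uppercase identifiers:

--Create a hello world stored procedure that takes an argument.
create or replace procedure Greet(NAME varchar)
returns VARCHAR
language JavaScript
as
  $$
  // Snowflake exposes arguments to JavaScript as uppercase identifiers.
  return "Hello, " + NAME + ".";
  $$;

--Call it with an argument.
call Greet('world');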

Conditional Column Masking Based on Role

Snowflake secure views offer a powerful way to control who gets to see what. One way to secure data is to mask a column’s value based on who is accessing the data set.

First, a bit of background on Snowflake views: they are not dynamic. Here is the section of the documentation explaining this:

View definitions are not dynamic, i.e. a view is not automatically updated if the underlying sources are modified such that they no longer match the view definition, particularly when columns are dropped. For example:

A view is created referencing a specific column in a source table and the column is subsequently dropped from the table.

A view is created using SELECT * from a table and any column is subsequently dropped from the table.

In practice, what this means for data security and data privacy is that a Snowflake secure view masks column values rather than suppressing the column entirely. Here’s an example of how to mask a column based on who is viewing the data. This example uses the current role of the logged-in session to determine visibility rights, which is a best practice, but it could also check the current user instead.

-- Log in as DBA or a login with appropriate permissions

-- Set the context. We'll be using the TEST warehouse (Extra Small
-- size is fine for this test), and a database named TEST. 
-- Create them if not already present.
use warehouse TEST;
use database TEST;

--Create and use schema TPCH_SF1, where we will copy some data from
--the Snowflake Sample data 
create schema TPCH_SF1;
use schema TPCH_SF1;

-- Create a writable table from the Snowflake sample data.
create or replace table ORDERS as 
  select * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;

--Make sure there are 1.5 million rows in the order table:
select count(*) as ROW_COUNT from ORDERS;

-- Mask the sensitive column from anyone not logged in using the DBA role:
create or replace secure view ORDERS_SECURE_VIEW as
select
O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY,
case current_role()
    when 'DBA' then O_CLERK
    else '*****'
end as O_CLERK,
O_SHIPPRIORITY, O_COMMENT
from ORDERS;

--Grant permissions to the Analyst role.
--Note: do not grant select on the base ORDERS table, or the Analyst
--could bypass the mask by querying the table directly.
grant usage on warehouse TEST to role ANALYST;
grant usage on database TEST to role ANALYST;
grant usage on schema TEST.TPCH_SF1 to role ANALYST;
grant select on view ORDERS_SECURE_VIEW to role ANALYST;

--Test the following sections with the DBA and Analyst roles.
--The DBA will see unmasked data in the O_CLERK column.
--The Analyst will see masked data.

--Confirm the currently logged-in session's role:
select current_role() as current_role;

--Test the view using different roles; only the DBA role
--should see the unmasked data:
use warehouse TEST;
use database TEST;
use schema TPCH_SF1;
select * from ORDERS_SECURE_VIEW limit 10;
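
To compare the two behaviors in one session (assuming your user has been granted both the DBA and ANALYST roles), switch roles and re-run the select:

--As the DBA role, O_CLERK is visible.
use role DBA;
select * from ORDERS_SECURE_VIEW limit 10;

--As the ANALYST role, O_CLERK shows '*****'.
use role ANALYST;
select * from ORDERS_SECURE_VIEW limit 10;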

--Suspend our warehouse to save credits.
alter warehouse TEST suspend;

Geolocation of IP Addresses in Snowflake – Part 2

One of the things needed to test geolocating a weblog is, well, a weblog. As it turns out, such a thing is not easy to find. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC), available here:

https://www.sec.gov/dera/data/edgar-log-file-data-set.html

The SEC’s weblogs obfuscate the final quad of each IPv4 dotted quad, so addresses look like this: 192.168.1.jjr.

According to the SEC’s documentation, the obfuscation always replaces a given final number (0 through 255) with the same three-letter value. The SEC does not disclose which number a value such as “jjr” maps to, but for the purposes of this test it is acceptable to assign the 256 unique three-letter replacements numbers from 0 through 255 in an arbitrary order.

After downloading the log file named “log20170630.zip”, we need to load it to a stage and use a file format to parse it. Here is the file format I used:

ALTER FILE FORMAT "TEST"."IP2LOCATION".IP_LOG SET
    COMPRESSION = 'AUTO'
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
    TRIM_SPACE = TRUE
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
    ESCAPE = 'NONE'
    ESCAPE_UNENCLOSED_FIELD = '\134'
    DATE_FORMAT = 'AUTO'
    TIMESTAMP_FORMAT = 'AUTO'
    NULL_IF = ('\\N');
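
With a stage and the file format in place, loading the file might look something like the following sketch (the stage name and column types are illustrative guesses; extract the CSV from the zip before staging it):

--Create the target table; column names match the log layout used later.
create or replace table IP_LOG (
    IP varchar, REQUEST_DATE date, REQUEST_TIME time, ZONE varchar,
    CIK number, ACCESSION varchar, EXTENSION varchar, CODE number,
    SIZE number, IDX number, NOREFER number, NOAGENT number,
    FIND number, CRAWLER number, BROWSER varchar
    );

--Stage the extracted file from SnowSQL, then load it with the file format.
--put file:///tmp/log20170630.csv @MY_STAGE;
copy into IP_LOG
from @MY_STAGE/log20170630.csv
file_format = (format_name = 'TEST.IP2LOCATION.IP_LOG');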

After loading the data into a table (I called mine IP_LOG), the next thing to do is replace the three-letter obfuscations with numbers from 0 through 255. To do this, I ran an ELT process using these steps:

Step 1: Create a new table with all the columns from the original web log, plus a new column holding the last three characters of the IP address (the three-letter replacement).

Step 2: Create a table with each of the distinct values in the new column. There will be 256 of them, representing the numbers 0 through 255. For the purposes of this exercise, it is not important which numbers map to which three-letter replacements, so I assigned them sequentially from 0 through 255 in the order returned by the select distinct query. I call these IP numbers simulated, even though only the final quad of each dotted quad is simulated. (A sketch of Steps 1 and 2 appears after the Step 3 SQL below.)

Step 3: Convert the IP dotted quads into IP numbers using the UDF from Part 1 of this series. The SQL looks like this:

create table IP_INT_LOG as (
select
    IpToInteger(SIMULATED_IP) as IP_NUMBER,
    IP, SIMULATED_QUAD, SIMULATED_IP, REQUEST_DATE, REQUEST_TIME, ZONE, CIK,
    ACCESSION, EXTENSION, CODE, SIZE, IDX, NOREFER, NOAGENT, FIND, CRAWLER, BROWSER
from IP_LOG
);
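
For reference, here is a minimal sketch of what Steps 1 and 2 might look like (the intermediate table and column names are illustrative, not from the original process):

--Step 1: copy the log and capture the three-letter replacement
--(the last three characters of the obfuscated IP).
create or replace table IP_LOG_STAGED as
select L.*, RIGHT(L.IP, 3) as QUAD_CODE
from IP_LOG L;

--Step 2: assign each of the 256 distinct codes a number from
--0 through 255 (sorted order here), then rebuild a simulated dotted quad.
create or replace table QUAD_MAP as
select QUAD_CODE,
       ROW_NUMBER() over (order by QUAD_CODE) - 1 as SIMULATED_QUAD
from (select distinct QUAD_CODE from IP_LOG_STAGED);

create or replace table IP_LOG_SIM as
select S.*, M.SIMULATED_QUAD,
       LEFT(S.IP, LENGTH(S.IP) - 3) || M.SIMULATED_QUAD as SIMULATED_IP
from IP_LOG_STAGED S
join QUAD_MAP M on M.QUAD_CODE = S.QUAD_CODE;

The IpToInteger UDF itself is defined in Part 1. Conceptually, it combines the four octets of a dotted quad into a single number, along the lines of this sketch (see Part 1 for the authoritative version):

--Convert a dotted quad a.b.c.d to a*2^24 + b*2^16 + c*2^8 + d.
create or replace function IpToInteger(IP varchar)
returns float
language javascript
as
  $$
  // Split the dotted quad and combine the four octets into one number.
  var p = IP.split(".");
  return (+p[0]) * 16777216 + (+p[1]) * 65536 + (+p[2]) * 256 + (+p[3]);
  $$;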

I was concerned about the performance of running a JavaScript UDF 20,488,579 times to convert IP dotted quads into IP numbers. It turned out I needn’t have been concerned. Snowflake converted all 20.5 million rows in 16.97 seconds using an Extra Small (XS) warehouse. Out of curiosity, I dropped the resulting table and increased the warehouse size to Medium (M), and it ran in 7.85 seconds. This provides empirical evidence that increasing the warehouse size improves the performance of queries, including those that call UDFs.

This led me to the final part of the experiment: resolving the IP numbers in the web log to geolocations. We have the IP numbers in one table and the IP geolocations in another. My first thought was to join the tables, but this is an unusual join. A standard join matches keys exactly; here we need inequalities to join the tables. In other words, join the IP log information with the geolocation data where the log’s IP number falls between the lower and upper bounds of a location’s address range.

select 
L.COUNTRY_ABBREVIATION, L.COUNTRY, L.STATE_OR_PROVINCE, L.CITY, L.LATITUDE, L.LONGITUDE, L.POSTAL_CODE, 
I.IP_NUMBER, I.IP, I.SIMULATED_QUAD, I.SIMULATED_IP, I.REQUEST_DATE, I.REQUEST_TIME, I.ZONE, I.CIK, I.ACCESSION, I.EXTENSION, I.CODE, I.SIZE, I.IDX, I.NOREFER, I.NOAGENT, I.FIND, I.CRAWLER, I.BROWSER
from IP_INT_LOG I, IP_LOCATION L
WHERE I.IP_NUMBER BETWEEN L.START_NUMBER AND L.END_NUMBER
limit 10;

Note: This procedure is missing one important change that dramatically improves performance. That will be the topic of Part 3 of this series.
