Category: UDF

Converting camelCase and SNAKE_CASE

In programming and string matching, both camelCase and SNAKE_CASE are common. Here are two simple Snowflake UDFs to convert between the two.

create or replace function camelToSnake(s string)
returns string
language sql
as
$$
    upper(regexp_replace(s,'([A-Z])', '_\\1', 2))
$$;

select camelToSnake('quickBrownFox');  -- returns QUICK_BROWN_FOX

create or replace function snakeToCamel("s" string)
returns string
language javascript
strict immutable
as
$$
const snakeToCamel = str =>
  str.toLowerCase().replace(/([-_][a-z])/g, group => group
      .toUpperCase()
      .replace('-', '')
      .replace('_', '')
  );
return snakeToCamel(s);
$$;

select snakeToCamel('QUICK_BROWN_FOX');  -- returns quickBrownFox

Getting Snowflake’s Current Timezone

Snowflake’s built-in way to get the current timezone is using the SHOW statement, like this:

show parameters like 'TIMEZONE';

If you're running the statement from a command line, this isn't a problem. If you need it programmatically, the SHOW command has two key limitations. First, you can't use it in a stored procedure. Second, if you use it in a SQL script running externally, you have to get the results of the SHOW statement in a second query using RESULT_SCAN.
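
For reference, here's what that two-step pattern looks like in a script (a minimal sketch; the lowercase "value" column name comes from the standard SHOW PARAMETERS output):

show parameters like 'TIMEZONE';
select "value" as CURRENT_TIMEZONE from table(result_scan(last_query_id()));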

One of my customers needed a way to get the current timezone in a single call and use it in a stored procedure. Since JavaScript has a built-in way to read the timezone (the Intl API), it's possible to get the current time zone from a JavaScript UDF:

create or replace function GET_CURRENT_TIMEZONE()
returns string
language javascript
as
$$
    return Intl.DateTimeFormat().resolvedOptions().timeZone;
$$;

-- Test the UDF:
select get_current_timezone();
alter session set TIMEZONE = 'America/Chicago';
select get_current_timezone();
alter session set TIMEZONE = 'America/New_York';
select get_current_timezone();
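
Since the goal was to use this inside a stored procedure, here is a hedged sketch of one way to call the UDF from a JavaScript stored procedure (the procedure name and logic are illustrative, not from the customer's actual code):

create or replace procedure USE_CURRENT_TIMEZONE()
returns string
language javascript
as
$$
    // Stored procedures can execute SQL, so they can call the UDF.
    var rs = snowflake.execute({sqlText: "select get_current_timezone()"});
    rs.next();
    return "Session timezone is " + rs.getColumnValue(1);
$$;

call use_current_timezone();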

Regex Non-Capturing Groups and Lookarounds in Snowflake

If you don't need the background or discussion of how they work and just want Snowflake UDFs that support regex non-capturing groups, lookaheads, and lookbehinds, you can download them here:

https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions

Now for the background:

Snowflake supports regular expressions (regex) for string matching and replacements. If your regex skills are like mine, Snowflake’s regex implementation provides more than you’ll ever need.

For regex ninjas and people who want to use regular expression libraries, there are two commonly used capabilities that Snowflake's regex functions do not currently support: non-capturing groups and lookarounds.

Every once in a while I run into a customer who’s a regex ninja or wants to use a regex from a library that requires one of these capabilities.

It occurred to me that JavaScript supports regexes with these features, and Snowflake supports JavaScript user defined functions (UDFs). To use a regex with non-capturing groups or lookarounds in Snowflake, it's a simple matter of writing a UDF.

The problem is that writing a new UDF for each regex undermines some of the main advantages of regular expressions, including compactness and simplicity.

This led me to write two general-purpose UDFs that approximate Snowflake's REGEXP_REPLACE and RLIKE (synonym REGEXP_LIKE) as closely as possible while enabling non-capturing groups and lookarounds.

I named the JavaScript UDFs after the Snowflake functions they approximate: REGEXP_REPLACE2 and RLIKE2 (synonym REGEXP_LIKE2). I also overloaded the UDFs so that you can call them with minimal parameters or with the optional parameters, the same as their base Snowflake functions.

Here’s an example of their usage:

-- Running the base function returns this error:
-- Invalid regular expression: 'bar(?=bar)', no argument for repetition operator: ?
select regexp_replace('foobarbarfoo', 'bar(?=bar)', '***');

-- Running the UDF approximating the base function returns foo***barfoo
select regexp_replace2('foobarbarfoo', 'bar(?=bar)', '***');

-- Running the base function returns this error:
-- Invalid regular expression: 'bar(?=bar)', no argument for repetition operator: ?
select rlike('foobarbarfoo', 'bar(?=bar)');

-- Running the UDF approximating the base function returns TRUE
select rlike2('foobarbarfoo', 'bar(?=bar)');
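
To see the core idea, here is a minimal sketch of how such a wrapper can work. It is not the downloadable implementation, which also handles the optional position, occurrence, and parameter arguments and Snowflake-style backreferences:

create or replace function REGEXP_REPLACE2(SUBJECT string, PATTERN string, REPLACEMENT string)
returns string
language javascript
as
$$
    // JavaScript's regex engine supports non-capturing groups and lookarounds.
    // The 'g' flag approximates REGEXP_REPLACE's default of replacing all occurrences.
    // Note: backreferences in the replacement use JavaScript's $1 syntax here, not Snowflake's \1.
    return SUBJECT.replace(new RegExp(PATTERN, "g"), REPLACEMENT);
$$;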

You can download the UDFs on my Github here: https://github.com/GregPavlik/SnowflakeUDFs/tree/main/RegularExpressions

Getting Snowflake Primary Key Columns as a Table

One of my customers had an interesting requirement. In order to dynamically create merge statements, they needed a way to collect the primary key columns for any given table. After discussing some options (returning them as a delimited string, an array, etc.), we agreed that returning the columns as a table was the best option.

This User Defined Table Function (UDTF) returns the columns in a table's primary key. The UDTF returns a table with a single column; each row is one of the columns in the input table's primary key. If there is no primary key, the result has no rows. For a single-column primary key it has one row, and for a composite primary key it returns all columns in the key.

One thing you may notice is that the input to the UDTF is the table's DDL, not just the table's name. The reason is that UDTFs cannot execute SQL. The simplest way to handle this is to nest the GET_DDL function as the parameter for the GET_PK_COLUMNS function. You can see how this works in the code sample below.

create database DB_Primary_Keys;

/**************************************************************************************************************
*                                                                                                             *
*  Set up test tables with four types of primary key: Named composite, unnamed composite, inline, and none.   * 
*                                                                                                             *
**************************************************************************************************************/
 
-- Named multi-column PK
create or replace temporary table table1
(
column_name1 number NOT NULL,
column_name2 number NOT NULL,
column_name3 string,
CONSTRAINT Constraint_name PRIMARY KEY (column_name1, column_name2)
);
 
-- Unnamed multi-column PK
create temporary table table2
(
column_name1 number NOT NULL,
column_name2 number NOT NULL,
column_name3 string,
PRIMARY KEY (column_name1, column_name2)
);
 
-- Inline single-column PK
create or replace temporary table table3
(
column_name1 number primary key,
column_name2 number NOT NULL,
column_name3 string
);
 
-- No PK defined
create or replace temporary table table4
(
column_name1 number,
column_name2 number,
column_name3 string
);
 
/********************************************************************************************************
*                                                                                                       *
* User defined table function (UDTF) to get primary keys for a table.                                   *
*                                                                                                       *
* @param  {string}:  TABLE_DDL    The DDL for the table to get the PKs. Usually use get_ddl.            *
* @return {table}:                A table with the columns comprising the table's primary key           *
*                                                                                                       *
********************************************************************************************************/
create or replace function GET_PK_COLUMNS(TABLE_DDL string)
returns table (PK_COLUMN string)
language javascript
as
$$
{
    processRow: function get_params(row, rowWriter, context){
        var pkCols = getPKs(row.TABLE_DDL);
        for (var i = 0; i < pkCols.length; i++) {
            rowWriter.writeRow({PK_COLUMN: pkCols[i]}); 
        }
         
        function getPKs(tableDDL) {
            var keyword = "primary key";
            var ins = -1;
            var s = tableDDL.split("\n");
            for (var i = 0; i < s.length; i++) {  
                ins = s[i].indexOf(keyword);
                if (ins != -1) {
                    var colList = s[i].substring(ins + keyword.length);
                    colList = colList.replace("(", "");
                    colList = colList.replace(")", "");
                    var colArray = colList.split(",");
                    for (var pkc = 0; pkc < colArray.length; pkc++) {
                        colArray[pkc] = colArray[pkc].trim();
                    }
                    return colArray;
                }
            }
            return [];  // No PK
        }
    }
}
$$;
 
/**************************************************************************************************************
*                                                                                                             *
*  Test execution of the UDTF.                                                                                * 
*                                                                                                             *
**************************************************************************************************************/
 
select * from table(get_pk_columns(get_ddl('table', 'table1'))) PKS;  -- Multi-column PK with named constraint
select * from table(get_pk_columns(get_ddl('table', 'table2'))) PKS;  -- Multi-column PK with no name for constraint
select * from table(get_pk_columns(get_ddl('table', 'table3'))) PKS;  -- Single column PK inline definition
select * from table(get_pk_columns(get_ddl('table', 'table4'))) PKS;  -- No PKs
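
Since the original goal was building merge statements dynamically, here is a hedged sketch of how the UDTF output could feed that. The target/source aliases and predicate format are illustrative:

-- Collapse the PK columns into a join predicate for a dynamically built MERGE.
select listagg('T.' || PK_COLUMN || ' = S.' || PK_COLUMN, ' and ') as MERGE_ON_CLAUSE
from table(get_pk_columns(get_ddl('table', 'table1')));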

Overloading JavaScript UDFs in Snowflake

A Base Function with Two Overloads

Snowflake supports overloading user defined functions. It’s a great way to handle function calls with parameters of different data types or different numbers of parameters. Developers often overload functions to let users send only relevant parameters.

Consider the common SUBSTRING function. You can call it using one of two overloads:

select substr('Hello, world.', 8);     --This returns "world."

select substr('Hello, world.', 8, 5);  --This returns "world"

In the first statement, the caller sent the string to use and the start position. Omitting the final parameter uses the overload with default behavior, returning to the end of the string.

In the second statement, the caller decided to get rid of the final period. Adding the third parameter for length used the other overload to return five characters instead of the default behavior.

This is a common design with overloaded functions. Mandatory parameters go on the left and optional parameters follow. Each allowable combination of parameters becomes an overload of the function. In this design, developers typically write one base function with all parameters. For the overloads with missing parameters, they’ll call the base function using a default value for the missing parameter(s).

This design ensures that there's only one place to maintain and debug code. The problem is that Snowflake JavaScript UDFs cannot call other UDFs. One way to deal with this is to write the same code in all the overloads, but that means three places to maintain, improve, and debug code. Fortunately, there's a way to write one base function and call it from overloaded functions using defaults.

The solution is to write the base UDF with all parameters in JavaScript. For the overloads that simply call the base function using defaults for missing parameters, call the base JavaScript UDF using an overloaded SQL UDF. This works because SQL UDFs can call other UDFs, which JavaScript UDFs cannot do.

In this example, the JavaScript UDF returns a column displaying a graphical progress bar. The options are typical for progress bars: percent complete, number of decimal places to display on the percentage, and number of segments to display.

The only one that can't be defaulted is the percent complete. It's okay to default to two decimal places and ten segments.

-- Make a progress bar function that looks like this:   ⬛⬛⬜⬜⬜⬜⬜⬜⬜⬜ 24.53%

-- This is the main JavaScript function with all parameters.
create or replace function PROGRESS_BAR(PERCENTAGE float, DECIMALS float, SEGMENTS float)
returns string
language javascript
as
$$

    var percent = PERCENTAGE;
    
    if (isNaN(percent)) percent =   0;
    if (percent < 0)    percent =   0;
    if (percent > 100)  percent = 100;

    percent        = percent.toFixed(DECIMALS);

    var filledSegments = Math.round(SEGMENTS * (percent / 100));
    var emptySegments  = SEGMENTS - filledSegments;

    var bar = '⬛'.repeat(filledSegments) + '⬜'.repeat(emptySegments);
 
    return bar + " " + percent + "%";

$$;

-- This is an overload with only the percentage, using defaults for 
-- number of segments and decimal points to display on percentage.
create or replace function PROGRESS_BAR(PERCENTAGE float)
returns string
language sql
as
$$
    select progress_bar(PERCENTAGE, 2, 10)
$$;

-- This is an overload with the percentage and the option set for the
-- number of decimals to display. It uses a default for number of segments.
create or replace function PROGRESS_BAR(PERCENTAGE float, DECIMALS float)
returns string
language sql
as
$$
    select progress_bar(PERCENTAGE, DECIMALS, 10)
$$;

-- Call the main JavaScript function by sending all three parameters:
select progress_bar(24.5293, 0, 100) as PROGRESS;

-- Call the overloaded SQL function by omitting the number of segments (segments defaults to 10):
select progress_bar(24.5293, 1) as PROGRESS;

-- Call the overloaded SQL function specifying only the percentage 
-- (segments defaults to 10 and decimals to 2)
-- It should display like this:   ⬛⬛⬜⬜⬜⬜⬜⬜⬜⬜ 24.53%
select progress_bar(24.5293) as PROGRESS;

By the way, this UDF progress bar is fully functional. If you have a long-running process such as loading a large number of files, you can use it to monitor progress by refreshing the query periodically. Here’s an example using a long progress bar and 24.41%:

select progress_bar(24.41, 2, 100) as PROGRESS;
Progress Bar from Overloaded Snowflake UDF

Geolocation of IP Addresses in Snowflake – Part 2

One of the things needed to test geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to find. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here:

https://www.sec.gov/dera/data/edgar-log-file-data-set.html

Their weblogs obfuscate the final quad of the IPv4 dotted quad, so they look like this: 192.168.1.jjr.

According to their documentation, their method always replaces a given final number from 0 to 255 with the same three-letter value. The SEC does not disclose which number "jjr" maps to, but for the purposes of this test it's acceptable to assign each of the 256 unique three-letter replacements a number from 0 to 255.

After downloading the log file named “log20170630.zip”, we need to load it to a stage and use a file format to parse it. Here is the file format I used:

ALTER FILE FORMAT "TEST"."IP2LOCATION".IP_LOG SET
    COMPRESSION = 'AUTO'
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
    TRIM_SPACE = TRUE
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
    ESCAPE = 'NONE'
    ESCAPE_UNENCLOSED_FIELD = '\134'
    DATE_FORMAT = 'AUTO'
    TIMESTAMP_FORMAT = 'AUTO'
    NULL_IF = ('\\N');

After loading the data into a table (I called mine IP_LOCATION), the next thing to do is replace the three-letter obfuscated values with numbers from 0 through 255. To do this, I ran an ELT process to replace these values using these steps:

Step 1: Create a new table with all the columns from the original web log, plus a new column to hold the last three characters of the IP number (the three-letter replacement).

Step 2: Create a table with each of the distinct values in the new column. There will be 256 of them, representing the numbers 0 through 255. For the purposes of this exercise it is not important which number maps to which three-letter replacement, so I assigned them sequentially from 0 through 255 in the order returned by the select distinct query. I call these IP addresses simulated, even though only the final quad of each dotted quad is simulated.
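
Here is a hedged sketch of what Steps 1 and 2 can look like in SQL. The table names IP_LOG_RAW, IP_LOG_STAGE, and QUAD_MAP are illustrative, and the pattern assumes the obfuscated quad is always three lowercase letters:

-- Step 1: copy the raw log and capture the trailing three-letter replacement.
create or replace table IP_LOG_STAGE as
select *, right(IP, 3) as SIMULATED_QUAD
from IP_LOG_RAW;

-- Step 2: assign each distinct three-letter value a number from 0 through 255,
-- then swap that number in for the final quad to build the simulated IP.
create or replace table QUAD_MAP as
select SIMULATED_QUAD,
       row_number() over (order by SIMULATED_QUAD) - 1 as QUAD_NUMBER
from (select distinct SIMULATED_QUAD from IP_LOG_STAGE);

create or replace table IP_LOG as
select s.*,
       regexp_replace(s.IP, '[a-z]{3}$', to_varchar(m.QUAD_NUMBER)) as SIMULATED_IP
from IP_LOG_STAGE s
join QUAD_MAP m on s.SIMULATED_QUAD = m.SIMULATED_QUAD;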

Step 3: Convert the IP dotted quads into IP numbers using the UDF in Part 1 of this series. The SQL looks like this:

create table IP_INT_LOG as (
    select IpToInteger(SIMULATED_IP) as IP_NUMBER,
           IP, SIMULATED_QUAD, SIMULATED_IP, REQUEST_DATE, REQUEST_TIME, ZONE, CIK, ACCESSION,
           EXTENSION, CODE, SIZE, IDX, NOREFER, NOAGENT, FIND, CRAWLER, BROWSER
    from IP_LOG
);

I was concerned about the performance of running a JavaScript UDF 20,488,579 times to convert IP dotted quads into IP numbers. It turned out I needn't have been concerned. Snowflake converted all 20.5 million rows in 16.97 seconds using an Extra Small (XS) warehouse. Out of curiosity, I dropped the resulting table and increased the warehouse size to a Medium (M), and it ran in 7.85 seconds. This provides empirical evidence that increasing the warehouse size improves the performance of queries, including those that use UDFs.

This led me to the final part of the experiment: resolving the IP numbers in the web log to geolocations. We have the IP numbers in one table and the IP geolocations in another. My first thought was to join the tables, but this is an unusual join. A standard join matches keys; in this example, we need inequalities to join the tables. In other words, join the IP log information with the geolocation data where the IP log's IP number falls between the lower and upper bounds of a location's address range.

select 
L.COUNTRY_ABBREVIATION, L.COUNTRY, L.STATE_OR_PROVINCE, L.CITY, L.LATITUDE, L.LONGITUDE, L.POSTAL_CODE, 
I.IP_NUMBER, I.IP, I.SIMULATED_QUAD, I.SIMULATED_IP, I.REQUEST_DATE, I.REQUEST_TIME, I.ZONE, I.CIK, I.ACCESSION, I.EXTENSION, I.CODE, I.SIZE, I.IDX, I.NOREFER, I.NOAGENT, I.FIND, I.CRAWLER, I.BROWSER
from IP_INT_LOG I, IP_LOCATION L
WHERE I.IP_NUMBER BETWEEN L.START_NUMBER AND L.END_NUMBER
limit 10;

Note: This procedure is missing one important change that dramatically improves performance. This will be the topic of a follow-on Part 3 of this post.

Geolocation of IP Addresses in Snowflake

It’s probably a safe assumption that business intelligence and data science teams can get valuable insights knowing the geolocation of website visitors. Suppose a product launch gets lots of web traffic, but the only source of information on visitors is the web log. Some web server statistics report on traffic grouped by nation, but what if we want to get much more granular information and incorporate this information with the main data warehouse?

Let’s take one sample web server among many, Apache Web Server, and quickly examine the structure of a log entry. Here’s a line in a sample Apache Web Server log.

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

In the Apache documentation, we can get more detailed information on the meaning of each field in the line, but for now we’re going to concentrate on 1) how to load the web logs into Snowflake, and 2) the key aspects for business intelligence and geolocation.

Loading the data is a quick proposition. Even without reading the Apache documentation it’s clear that the web log is space delimited and wraps any fields with spaces inside double quotes. Snowflake provides a very simple way to ingest structured data in flat files using File Formats. You can create a file format using SnowSQL (documented here: https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html) or you can use the Snowflake Web UI (documented here: https://docs.snowflake.net/manuals/user-guide/data-load-web-ui.html#step-1-open-the-load-data-wizard).

Note: Although the Apache Web Log is space delimited, we will use the CSV option for the Snowflake File Format — simply change the delimiter from a comma to a space.
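
For instance, a hedged sketch of such a file format (the name is illustrative):

create or replace file format APACHE_LOG_FORMAT
    type = 'CSV'
    field_delimiter = ' '
    field_optionally_enclosed_by = '"';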

After loading the data, we now need to geolocate the web hits. Problem: the Apache Web Server log, like most web logs, does not record anything for geolocation. Fortunately, in most cases we can use the visitor's IP address to get fairly accurate geolocation.

Third-party services keep up-to-date databases of IPv4 and IPv6 geolocation data. One such service, which I found at https://lite.ip2location.com, includes free databases with less rich information than the paid versions. In my testing I found the free databases accurate and useful, though production BI or data science work should consider the paid versions.

The next question is how to resolve the IP addresses in millions or billions of web log lines to approximate geolocations. This is where Snowflake shines. The IP2Location LITE data comes as a flat, structured file with millions of rows. After creating another Snowflake File Format, it's an easy matter to turn the IP2Location flat file into a Snowflake table. From there, Snowflake's powerful massive-scale joins make it simple to create a joined view that shows each IP's approximate location.

There's one final issue to get this all working. The IP2Location data represents IPs as 32-bit integers, not the familiar dotted quad notation. Integers make it much easier for code and database routines to search for ranges of IPs that all happen to be in the same area, by specifying a lower and upper bound for the IP number.

IP dotted quad      IP as 32-bit integer (decimal)
192.168.1.1         3232235777

This means we need to convert our dotted quad values into 32-bit integer values. Fortunately, Snowflake makes that easy with a UDF (User Defined Function):

--Convert an IPv4 Dotted Quad into an IP Number
create or replace function IpToInteger(ipAddress varchar)
  returns double 
  language javascript
  strict
  as '
    
    var ipQuad = IPADDRESS.split(".");
 
    var quad1 = parseInt(ipQuad[0]);
    var quad2 = parseInt(ipQuad[1]);
    var quad3 = parseInt(ipQuad[2]);
    var quad4 = parseInt(ipQuad[3]);
    return (quad1 * 16777216) + (quad2 * 65536) + (quad3 * 256) + quad4;

  ';

You may have noticed that the function returns a double, when we know the return will be a 32-bit integer. The reason is that all numeric values in JavaScript are floating point. Fortunately we won't encounter any precision issues (e.g., 0.9999999999 != 1) since we're dealing only with multiplication and addition of integers well within the exactly representable range.

This is one possible approach, but running the JavaScript millions or billions of times to get the value concerns me from a performance perspective, so in my next post I'll explore the possibility of using straight SnowSQL to convert a dotted quad into a 32-bit integer.
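
As a preview, here is a hedged sketch of one possible pure SQL version (not necessarily the approach the follow-up post takes):

select (split_part(IP, '.', 1)::int * 16777216)
     + (split_part(IP, '.', 2)::int * 65536)
     + (split_part(IP, '.', 3)::int * 256)
     +  split_part(IP, '.', 4)::int as IP_NUMBER
from (select '192.168.1.1' as IP);   -- returns 3232235777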

Grouping Numbers in Snowflake – Part 2

In a previous post, I showed a way to format numbers using locale-specific grouping characters. One of the advantages of that approach is it will work everywhere across Snowflake, since it does not require creating a User Defined Function (UDF).

Using a UDF provides a lot more power and flexibility, so let's see how to do that. First, you need to be working in a specific Snowflake database and have an active warehouse to do any processing, including creating a UDF.

use database TEST_DATABASE;
use warehouse TEST_WAREHOUSE;

Next we need to create the User Defined Function (UDF):

--Return a formatted number with locale-specific grouping and decimal characters.
create or replace function FormatNumber(d double)
  returns string
  language javascript
  strict
  as '
	if (!isNaN(D)){
    	return D.toLocaleString("en-US");
    } else {
    	return "Not a number.";
    }
  ';

Of course, we could create a much more sophisticated function with overloads that handle an optional locale input, etc., but for the purposes of this function that's overkill. If we need to adjust the locale, we can do that easily by changing the parameter sent to "toLocaleString". Note that if you omit this parameter, the function uses the default for the current machine. In my case, the machine is in the US-East region of AWS, so it's already set to US English, "en-US", but to be on the safe side it's best to specify the right value.
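
For example, a German-grouping variant would only change the one line inside the UDF (a hypothetical tweak, not part of the function above):

    return D.toLocaleString("de-DE");  // formats 123456789.012 as 123.456.789,012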

To test the function, we can issue this SnowSQL statement (note that we need to be using the same database as the one where we created the UDF):

--Optionally set the database context if it's changed.
use database TEST_DATABASE;

select formatnumber(123456789.012) as FORMATTED_NUMBER;

With the locale set to US English, "en-US", it returns 123,456,789.012.

To get a list of possible values for the locale parameter worldwide, refer to the Mozilla Developer Network documentation here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toLocaleString

One of the options on that page is Han decimal numerals. I can't read Han Chinese, but when I set the locale in the Snowflake UDF to zh-Hans-CN-u-nu-hanidec, I'll assume it's working just fine when it returns something along the lines of 一二三,四五六,七八九.〇一二.

Perhaps someone who can read Han Chinese can confirm this for me.
