Category: Functions

Overloading JavaScript UDFs in Snowflake

Overloading JavaScript UDFs in Snowflake

A Base Function with Two Overloads

Snowflake supports overloading user defined functions. It’s a great way to handle function calls with parameters of different data types or different numbers of parameters. Developers often overload functions to let users send only relevant parameters.

Consider the common SUBSTRING function. You can call it using one of two overloads:

select substr('Hello, world.', 8);     --This returns "world."

select substr('Hello, world.', 8, 5);  --This returns "world"

In the first statement, the caller sent the string to use and the start position. Omitting the final parameter uses the overload with default behavior, returning to the end of the string.

In the second statement, the caller decided to get rid of the final period. Adding the third parameter for length used the other overload to return five characters instead of the default behavior.

This is a common design with overloaded functions. Mandatory parameters go on the left and optional parameters follow. Each allowable combination of parameters becomes an overload of the function. In this design, developers typically write one base function with all parameters. For the overloads with missing parameters, they’ll call the base function using a default value for the missing parameter(s).

This design ensures that there’s only one place to maintain and debug code. The problem is Snowflake JavaScript UDFs cannot call other UDFs. While one way to deal with this is to write the same code in all overloads, it means three places to maintain, improve, and debug code. Fortunately, there’s a way to write once base function and call it from overloaded functions using defaults.

The solution is to write the base UDF with all parameters in JavaScript. For the overloads that simply call the base function using defaults for missing parameters, call the base JavaScript UDF using an overloaded SQL UDF. This works because SQL UDFs can call other UDFs, which JavaScript UDFs cannot do.

In this example, the JavaScript UDF returns a column displaying a graphical progress bar. The opens are typical for progress bars: percent completion, number of decimal places to display on the percentage, and number of segments to display.

The only one that can’t be defaulted is the percent complete. It’s okay to default to two decimal points and ten segments.

-- Make a progress bar function that looks like this:   ⬛⬛⬜⬜⬜⬜⬜⬜⬜⬜ 24.53%

-- This is the main JavaScript function with all parameters.
create or replace function PROGRESS_BAR(PERCENTAGE float, DECIMALS float, SEGMENTS float)
returns string
language javascript
as
$$

    var percent = PERCENTAGE;
    
    if (isNaN(percent)) percent =   0;
    if (percent < 0)    percent =   0;
    if (percent > 100)  percent = 100;

    percent        = percent.toFixed(DECIMALS);

    var filledSegments = Math.round(SEGMENTS * (percent / 100));
    var emptySegments  = SEGMENTS - filledSegments;

    var bar = '⬛'.repeat(filledSegments) + '⬜'.repeat(emptySegments);
 
    return bar + " " + percent + "%";

$$;

-- This is an overload with only the percentage, using defaults for 
-- number of segments and decimal points to display on percentage.
create or replace function PROGRESS_BAR(PERCENTAGE float)
returns string
language sql
as
$$
    select progress_bar(PERCENTAGE, 2, 10)
$$;

-- This is an overload with the percentage and the option set for the
-- number of decimals to display. It uses a default for number of segments.
create or replace function PROGRESS_BAR(PERCENTAGE float, DECIMALS float)
returns string
language sql
as
$$
    select progress_bar(PERCENTAGE, DECIMALS, 10)
$$;

-- Call the main JavaScript function by sending all three parameters:
select progress_bar(24.5293, 0, 100) as PROGRESS;

-- Call the overloaded SQL function by omitting the number of segments (segments defaults to 10):
select progress_bar(24.5293, 1) as PROGRESS;

-- Call the overloaded SQL function specifying only the percentage 
-- (segments defaults to 10 and decimals to 2)
-- It should display like this:   ⬛⬛⬜⬜⬜⬜⬜⬜⬜⬜ 24.53%
select progress_bar(24.5293) as PROGRESS;

By the way, this UDF progress bar is fully functional. If you have a long-running process such as loading a large number of files, you can use it to monitor progress by refreshing the query periodically. Here’s an example using a long progress bar and 24.41%:

select progress_bar(24.41, 2, 100) as PROGRESS;
Progress Bar from Overloaded Snowflake UDF
Snowflake UDF to Get Payment Card Type

Snowflake UDF to Get Payment Card Type

Payment Cards

This User Defined Function (UDF) doesn’t require much explanation. Payment card number goes in; payment card type comes out. Since it is designed for speed, it does not validate the check digit. A subsequent post will provide a UDF to validate the check digit using the Luhn algorithm.

/********************************************************************************************************************

Function:    PaymentCardType
Description: Decodes the type of payment card from Visa, Mastercard, AMEX, etc.
Parameters:  A string indicating the type of payment card, or a blank string if not identified. 

*********************************************************************************************************************/
create or replace function PaymentCardType(cardNumber string)
  returns string 
  language javascript
  strict
  as '
     
    //Remove all spaces and dashes. Simply ignore them.
    NUMBER = CARDNUMBER.replace(/ /g, "");
    NUMBER = NUMBER.replace(/-/g, "");
     
     
    // Visa
    var re = new RegExp("(4[0-9]{15})");
    if (NUMBER.match(re) != null)
        return "Visa";

    // Mastercard
    re = new RegExp("(5[1-5][0-9]{14})");
    if (NUMBER.match(re) != null)
        return "Mastercard";

    // AMEX
    re = new RegExp("^3[47]");
    if (NUMBER.match(re) != null)
        return "AMEX";

    // Discover
    re = new RegExp("^(6011|622(12[6-9]|1[3-9][0-9]|[2-8][0-9]{2}|9[0-1][0-9]|92[0-5]|64[4-9])|65)");
    if (NUMBER.match(re) != null)
        return "Discover";

    // Diners
    re = new RegExp("^36");
    if (NUMBER.match(re) != null)
        return "Diners";

    // Diners - Carte Blanche
    re = new RegExp("^30[0-5]");
    if (NUMBER.match(re) != null)
        return "Diners - Carte Blanche";

    // JCB
    re = new RegExp("^35(2[89]|[3-8][0-9])");
    if (NUMBER.match(re) != null)
        return "JCB";

    // Visa Electron
    re = new RegExp("^(4026|417500|4508|4844|491(3|7))");
    if (NUMBER.match(re) != null)
        return "Visa Electron";

    return "";
 
  ';

-- Test the UDF:
select PaymentCardType('4470653497431234');

CHAR Padding in Snowflake

CHAR Padding in Snowflake

For strings that need to be fixed-width, it’s common to use the CHAR data type, for example CHAR(10). In this example, often databases will right pad a field has fewer than 10 characters with spaces.

If we define a CHAR(10) and store “ABCDE” in the field, some databases will add spaces on the right to fill the 10 width. In the example below, the “.” character represents the space:

Column_Name
1234567890
ABCDE.....

A side effect of the advanced way Snowflake stores strings is that a column defined as CHAR(10) with only 5 characters such as “ABCDE” will return only the five characters instead of 10 such as “ABCDE “. From the Snowflake documentation:

Snowflake currently deviates from common CHAR semantics in that strings shorter than the maximum length are not space-padded at the end.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html#char-character

For use cases where fixed-width of strings is important, there are a couple of options. One option is to right pad the strings on ingest, but this adds work and storage overhead.

I think a better way is to create a simple view that returns right padded strings for columns defined as CHAR. Fortunately, the Snowflake RPAD function makes this easy.

The following sample explains how to create a view to ensure fixed-width columns. Note that since CHAR is a synonym for VARCHAR, this technique will work with columns defined as CHAR, VARCHAR, STRING, CHARACTER, and TEXT.

--Create a simple test table with a CHAR(10)
create table TEST_CHAR
    (
    TEST_COLUMN char(10)
    );

--Insert some values 
insert into TEST_CHAR (TEST_COLUMN) values ('1234');
insert into TEST_CHAR (TEST_COLUMN) values ('12345');
insert into TEST_CHAR (TEST_COLUMN) values ('1234567890');

--Create a view that right pads the CHAR fields to width 10
--Change the space to another character to see the effect.
create or replace view V_TEST_CHAR as
select RPAD(TEST_COLUMN, 10, ' ') as TEST_COLUMN from TEST_CHAR;

--Select from the view.
select * from V_TEST_CHAR;
Geolocation of IP Addresses in Snowflake – Part 2

Geolocation of IP Addresses in Snowflake – Part 2

One of the thing needed to test Geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to do. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here:

https://www.sec.gov/dera/data/edgar-log-file-data-set.html

Their weblogs obfuscate the final quad of the IPv4 dotted quad, so they look like this: 192.168.1.jjr.

According to their documentation, their method will always replace the final number from 0 to 255 with the same three-letter value. The SEC does not disclose that “jjr” maps to something like 134, but for the purposes of this test it’s acceptable to assign all 256 unique three letter replacements with numbers from 0 to 255.

After downloading the log file named “log20170630.zip”, we need to load it to a stage and use a file format to parse it. Here is the file format I used:

ALTER FILE FORMAT "TEST"."IP2LOCATION".IP_LOG SET COMPRESSION = 'AUTO' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' TRIM_SPACE = TRUE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');

After loading the data into a table (I called mine IP_LOCATION), the next thing to do is replace the three-letter obfuscated replacements with numbers from 0 through 255. To do this, I ran an ELT process replace these values using these steps:

Step 1: Create a new table with all the columns from the original web log, plus a new column to hold the last three characters of the IP number (the three-letter replacement).

Step 2: Create a table with each of the distinct values in the new column. There will be 256 representing numbers from 0 through 255. For the purposes of this exercise, it is not important which numbers map to which three-letter replacements, so I assigned them in sequential order from 0 through 255 in the order returned in the select distinct query. I called these IP numbers simulated, even though only the final quad of the dotted quads is simulated.

Step 3: Convert the IP dotted quads into IP numbers using the UDF in Part 1 of this series. The SQL looks like this:

create table IP_INT_LOG as(
select IpToInteger(SIMULATED_IP) as IP_NUMBER, IP, SIMULATED_QUAD, SIMULATED_IP, REQUEST_DATE, REQUEST_TIME, ZONE, CIK, ACCESSION, EXTENSION, CODE, SIZE, IDX, NOREFER, NOAGENT, FIND, CRAWLER, BROWSER from IP_LOG);

I was concerned about performance running a Javascript UDF millions of 20,488,579 times to convert IP dotted quads into IP numbers. It turned out I needn’t have been concerned. Snowflake converted all 20.5 million in 16.97 seconds using an Extra Small (XS) warehouse. Out of curiosity, I dropped the resulting table and increased the size of the warehouse to a medium (M) and it ran in 7.85 seconds. This provides empirical evidence that increasing the warehouse size improves performance including those with UDFs.

This led me to the final part of the experiment — resolving the IP numbers in the web log to the geolocations. We have the IP numbers in the a log table, and the IP geolocations in another. My first thought was to join the tables, but this is an unusual join. The standard join matches keys. In this example, we need to use inequalities to join the table. In other words, join the IP log information with the geolocation data where the IP log’s IP number falls between the lower and upper bounds of a locations address range.

select 
L.COUNTRY_ABBREVIATION, L.COUNTRY, L.STATE_OR_PROVINCE, L.CITY, L.LATITUDE, L.LONGITUDE, L.POSTAL_CODE, 
I.IP_NUMBER, I.IP, I.SIMULATED_QUAD, I.SIMULATED_IP, I.REQUEST_DATE, I.REQUEST_TIME, I.ZONE, I.CIK, I.ACCESSION, I.EXTENSION, I.CODE, I.SIZE, I.IDX, I.NOREFER, I.NOAGENT, I.FIND, I.CRAWLER, I.BROWSER
from IP_INT_LOG I, IP_LOCATION L
WHERE I.IP_NUMBER BETWEEN L.START_NUMBER AND L.END_NUMBER
limit 10;

Note: This procedure is missing one important change that dramatically improves performance. This will be the topic of a follow-on Part 3 of this post.

Grouping Numbers in Snowflake – Part 2

Grouping Numbers in Snowflake – Part 2

In a previous post, I showed a way to format numbers using local grouping characters. One of the advantages of the approach is it will work everywhere across Snowflake since it does not require creation of a User Defined Function (UDF).

Using a UDF provides a lot more power and flexibility, so let’s see how we do that. First, you need to be working with a specific Snowflake Database and have an active Data Warehouse to do any processing including creation of a UDF.

use database TEST_DATABASE;
use warehouse TEST_WAREHOUSE;

Next we need to create the User Defined Function (UDF):

--Return a formatted number with locale-specific grouping and decimal characters.
create or replace function FormatNumber(d double)
  returns string
  language javascript
  strict
  as '
	if (!isNaN(D)){
    	return D.toLocaleString("en-US");
    } else {
    	return "Not a number.";
    }
  ';

Of course, we could create a much more sophisticated function with overloads that handles optional local input, etc., but for the purposes of this function that’s overkill. If we need to adjust the locale, we can do that easily in the function by changing the parameter sent to the “toLocaleString”. Note that if you omit this parameter, the function will use the default for the current machine. In my case, the machine is in the US-East region of AWS, so it’s already set to US English, “en-US”, but to be on the safe side, it’s best to specify the right value.

To test the function, we can issue this SnowSQL statement (note that we need to be using the same database as the one where we created the UDF):

--Optionally set the database context if it's changed.
use database TEST_DATABASE;

select formatnumber(123456789.012) as FORMATTED_NUMBER;

When the locale specified US English, “us-EN”, it will return the following:

To get a list of possible values for the locale parameter worldwide, refer to the Mozilla Developer Network documentation here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toLocaleString

One of the options on that page is for Han Chinese. I don’t know how to read that, but when I set the locale in the Snowflake UDF to zh-Hans-CN-u-nu-hanidec I’ll assume it’s working just fine when it returns the following result:

Perhaps someone who can read Han Chinese can confirm this for me.

Grouping Numbers in Snowflake

Grouping Numbers in Snowflake

As I started working with Snowflake, one of the things I noticed is that large integers appear without grouping symbols. When counting rows in a table, it may display as 752941241 rows. Without counting the digits, it’s hard to see at a glance if this is about 75 million rows or 750 million rows.

After reading the docs, I found a simple way to format long numbers with grouping digits.

select to_varchar(count(*), '999,999,999') as TRIP_COUNT from trips;

I’m using a large number in the format string because when a the first argument in to_varchar exceeded the capacity of the number in the format string, the function will return ###,###. Be sure to use a format string large enough for any return. I have tested it with format strings as large as 999,999,999,999,999,999,999 (that’s twenty one 9’s). If the format string is larger than the required number of digits, the function will return the right sized formatted number without leading spaces or zeros.

You can also specify a decimal point using dot character. For example, to show a number with grouped digits and a decimal point:

select to_varchar(some_column, '999,999,999.999999999') from some_table;

This will show up to the maximum number of digits specified in the format string. but will not show more than are necessary. The decimal point will always show up, even when there’s no decimal portion of the number. I will see if there’s a way to format this better and update here.

Theme: Overlay by Kaira