Category: SnowSQL

Geolocation of IP Addresses in Snowflake – Part 2

Geolocation of IP Addresses in Snowflake – Part 2

One of the thing needed to test Geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to do. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here:

https://www.sec.gov/dera/data/edgar-log-file-data-set.html

Their weblogs obfuscate the final quad of the IPv4 dotted quad, so they look like this: 192.168.1.jjr.

According to their documentation, their method will always replace the final number from 0 to 255 with the same three-letter value. The SEC does not disclose that “jjr” maps to something like 134, but for the purposes of this test it’s acceptable to assign all 256 unique three letter replacements with numbers from 0 to 255.

After downloading the log file named “log20170630.zip”, we need to load it to a stage and use a file format to parse it. Here is the file format I used:

ALTER FILE FORMAT "TEST"."IP2LOCATION".IP_LOG SET COMPRESSION = 'AUTO' FIELD_DELIMITER = ',' RECORD_DELIMITER = '\n' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE' TRIM_SPACE = TRUE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = 'NONE' ESCAPE_UNENCLOSED_FIELD = '\134' DATE_FORMAT = 'AUTO' TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('\\N');

After loading the data into a table (I called mine IP_LOCATION), the next thing to do is replace the three-letter obfuscated replacements with numbers from 0 through 255. To do this, I ran an ELT process replace these values using these steps:

Step 1: Create a new table with all the columns from the original web log, plus a new column to hold the last three characters of the IP number (the three-letter replacement).

Step 2: Create a table with each of the distinct values in the new column. There will be 256 representing numbers from 0 through 255. For the purposes of this exercise, it is not important which numbers map to which three-letter replacements, so I assigned them in sequential order from 0 through 255 in the order returned in the select distinct query. I called these IP numbers simulated, even though only the final quad of the dotted quads is simulated.

Step 3: Convert the IP dotted quads into IP numbers using the UDF in Part 1 of this series. The SQL looks like this:

create table IP_INT_LOG as(
select IpToInteger(SIMULATED_IP) as IP_NUMBER, IP, SIMULATED_QUAD, SIMULATED_IP, REQUEST_DATE, REQUEST_TIME, ZONE, CIK, ACCESSION, EXTENSION, CODE, SIZE, IDX, NOREFER, NOAGENT, FIND, CRAWLER, BROWSER from IP_LOG);

I was concerned about performance running a Javascript UDF millions of 20,488,579 times to convert IP dotted quads into IP numbers. It turned out I needn’t have been concerned. Snowflake converted all 20.5 million in 16.97 seconds using an Extra Small (XS) warehouse. Out of curiosity, I dropped the resulting table and increased the size of the warehouse to a medium (M) and it ran in 7.85 seconds. This provides empirical evidence that increasing the warehouse size improves performance including those with UDFs.

This led me to the final part of the experiment — resolving the IP numbers in the web log to the geolocations. We have the IP numbers in the a log table, and the IP geolocations in another. My first thought was to join the tables, but this is an unusual join. The standard join matches keys. In this example, we need to use inequalities to join the table. In other words, join the IP log information with the geolocation data where the IP log’s IP number falls between the lower and upper bounds of a locations address range.

select 
L.COUNTRY_ABBREVIATION, L.COUNTRY, L.STATE_OR_PROVINCE, L.CITY, L.LATITUDE, L.LONGITUDE, L.POSTAL_CODE, 
I.IP_NUMBER, I.IP, I.SIMULATED_QUAD, I.SIMULATED_IP, I.REQUEST_DATE, I.REQUEST_TIME, I.ZONE, I.CIK, I.ACCESSION, I.EXTENSION, I.CODE, I.SIZE, I.IDX, I.NOREFER, I.NOAGENT, I.FIND, I.CRAWLER, I.BROWSER
from IP_INT_LOG I, IP_LOCATION L
WHERE I.IP_NUMBER BETWEEN L.START_NUMBER AND L.END_NUMBER
limit 10;

Note: This procedure is missing one important change that dramatically improves performance. This will be the topic of a follow-on Part 3 of this post.

Geolocation of IP Addresses in Snowflake

Geolocation of IP Addresses in Snowflake

It’s probably a safe assumption that business intelligence and data science teams can get valuable insights knowing the geolocation of website visitors. Suppose a product launch gets lots of web traffic, but the only source of information on visitors is the web log. Some web server statistics report on traffic grouped by nation, but what if we want to get much more granular information and incorporate this information with the main data warehouse?

Let’s take one sample web server among many, Apache Web Server, and quickly examine the structure of a log entry. Here’s a line in a sample Apache Web Server log.

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

In the Apache documentation, we can get more detailed information on the meaning of each field in the line, but for now we’re going to concentrate on 1) how to load the web logs into Snowflake, and 2) the key aspects for business intelligence and geolocation.

Loading the data is a quick proposition. Even without reading the Apache documentation it’s clear that the web log is space delimited and wraps any fields with spaces inside double quotes. Snowflake provides a very simple way to ingest structured data in flat files using File Formats. You can create a file format using SnowSQL (documented here: https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html) or you can use the Snowflake Web UI (documented here: https://docs.snowflake.net/manuals/user-guide/data-load-web-ui.html#step-1-open-the-load-data-wizard).

Note: Although the Apache Web Log is space delimited, we will use the CSV option for the Snowflake File Format — simply change the delimiter from a comma to a space.

After loading the data, we now need to geolocate the web hits. Problem: the Apache Web Server log, as most web logs, does not show anything for geolocation. Fortunately in most cases, we can use the visitor’s IP address to get fairly accurate geolocation.

Third party services keep up to date databases of IPv4 and IPv5 geolocation data. Once such service I found at https://lite.ip2location.com includes free databases with less rich information than the paid versions. In my testing I found the free databases accurate and useful, though production BI or data science work should consider the paid versions.

The next question is how to resolve millions or billions of web log lines’ IP address to approximate geolocation. This is where Snowflake shines. The IP2Location LITE comes as a flat structured file with millions of rows After creating another Snowflake File Format, it’s an easy matter to turn the IP2Location flat file into a Snowflake table. From there, Snowflake’s powerful massive-scale join make it a simple matter to create a joined view that shows the IP’s approximate location.

There’s one final issue to get this all working. The IP2Location data comes with IPs represented by 32-bit integers, not the familiar dotted quad notation. This makes it much easier to use code and database routines that search for ranges of IPs that all happen to be in the same area by specifying a lower and upper range for the IP number.

IP Dotted Quad:     IP 32-bit Integer, Decimal
192.168.1.1         3232235777

This means we need to convert our dotted quad values into 32-bit integer values. Fortunately, Snowflake makes that easy with a UDF (User Defined Function):

--Convert an IPv4 Dotted Quad into an IP Number
create or replace function IpToInteger(ipAddress varchar)
  returns double 
  language javascript
  strict
  as '
    
    var ipQuad = IPADDRESS.split(".");
 
    var quad1 = parseInt(ipQuad[0]);
    var quad2 = parseInt(ipQuad[1]);
    var quad3 = parseInt(ipQuad[2]);
    var quad4 = parseInt(ipQuad[3]);
    return (quad1 * 16777216) + (quad2 * 65536) + (quad3 * 256) + quad4;

  ';

You may have noticed that the function returns a double, when we know the return will be a 32-bit integer. The reason is that all numeric values in Javascript are floating point, and fortunately we won’t be encountering any issues with precision (i.e. 0.9999999999 != 1) since we’re dealing with only multiplication and addition.

This is one possible approach, but running the Javascript millions or billions of times to get the value concerns me from a performance perspective, so in my next post I’ll explore the possibility of using straight SnowSQL to convert a dotted quad into a 32-bit integer.

Grouping Numbers in Snowflake – Part 2

Grouping Numbers in Snowflake – Part 2

In a previous post, I showed a way to format numbers using local grouping characters. One of the advantages of the approach is it will work everywhere across Snowflake since it does not require creation of a User Defined Function (UDF).

Using a UDF provides a lot more power and flexibility, so let’s see how we do that. First, you need to be working with a specific Snowflake Database and have an active Data Warehouse to do any processing including creation of a UDF.

use database TEST_DATABASE;
use warehouse TEST_WAREHOUSE;

Next we need to create the User Defined Function (UDF):

--Return a formatted number with locale-specific grouping and decimal characters.
create or replace function FormatNumber(d double)
  returns string
  language javascript
  strict
  as '
	if (!isNaN(D)){
    	return D.toLocaleString("en-US");
    } else {
    	return "Not a number.";
    }
  ';

Of course, we could create a much more sophisticated function with overloads that handles optional local input, etc., but for the purposes of this function that’s overkill. If we need to adjust the locale, we can do that easily in the function by changing the parameter sent to the “toLocaleString”. Note that if you omit this parameter, the function will use the default for the current machine. In my case, the machine is in the US-East region of AWS, so it’s already set to US English, “en-US”, but to be on the safe side, it’s best to specify the right value.

To test the function, we can issue this SnowSQL statement (note that we need to be using the same database as the one where we created the UDF):

--Optionally set the database context if it's changed.
use database TEST_DATABASE;

select formatnumber(123456789.012) as FORMATTED_NUMBER;

When the locale specified US English, “us-EN”, it will return the following:

To get a list of possible values for the locale parameter worldwide, refer to the Mozilla Developer Network documentation here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toLocaleString

One of the options on that page is for Han Chinese. I don’t know how to read that, but when I set the locale in the Snowflake UDF to zh-Hans-CN-u-nu-hanidec I’ll assume it’s working just fine when it returns the following result:

Perhaps someone who can read Han Chinese can confirm this for me.

Grouping Numbers in Snowflake

Grouping Numbers in Snowflake

As I started working with Snowflake, one of the things I noticed is that large integers appear without grouping symbols. When counting rows in a table, it may display as 752941241 rows. Without counting the digits, it’s hard to see at a glance if this is about 75 million rows or 750 million rows.

After reading the docs, I found a simple way to format long numbers with grouping digits.

select to_varchar(count(*), '999,999,999') as TRIP_COUNT from trips;

I’m using a large number in the format string because when a the first argument in to_varchar exceeded the capacity of the number in the format string, the function will return ###,###. Be sure to use a format string large enough for any return. I have tested it with format strings as large as 999,999,999,999,999,999,999 (that’s twenty one 9’s). If the format string is larger than the required number of digits, the function will return the right sized formatted number without leading spaces or zeros.

You can also specify a decimal point using dot character. For example, to show a number with grouped digits and a decimal point:

select to_varchar(some_column, '999,999,999.999999999') from some_table;

This will show up to the maximum number of digits specified in the format string. but will not show more than are necessary. The decimal point will always show up, even when there’s no decimal portion of the number. I will see if there’s a way to format this better and update here.

Theme: Overlay by Kaira