It’s probably a safe assumption that business intelligence and data science teams can get valuable insights knowing the geolocation of website visitors. Suppose a product launch gets lots of web traffic, but the only source of information on visitors is the web log. Some web server statistics report on traffic grouped by nation, but what if we want to get much more granular information and incorporate this information with the main data warehouse?
Let’s take one sample web server among many, Apache Web Server, and quickly examine the structure of a log entry. Here’s a line in a sample Apache Web Server log.
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
In the Apache documentation, we can get more detailed information on the meaning of each field in the line, but for now we’re going to concentrate on 1) how to load the web logs into Snowflake, and 2) the key aspects for business intelligence and geolocation.
Loading the data is a quick proposition. Even without reading the Apache documentation it’s clear that the web log is space delimited and wraps any fields with spaces inside double quotes. Snowflake provides a very simple way to ingest structured data in flat files using File Formats. You can create a file format using SnowSQL (documented here: https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html) or you can use the Snowflake Web UI (documented here: https://docs.snowflake.net/manuals/user-guide/data-load-web-ui.html#step-1-open-the-load-data-wizard).
Note: Although the Apache Web Log is space delimited, we will use the CSV option for the Snowflake File Format — simply change the delimiter from a comma to a space.
After loading the data, we now need to geolocate the web hits. Problem: the Apache Web Server log, as most web logs, does not show anything for geolocation. Fortunately in most cases, we can use the visitor’s IP address to get fairly accurate geolocation.
Third party services keep up to date databases of IPv4 and IPv5 geolocation data. Once such service I found at https://lite.ip2location.com includes free databases with less rich information than the paid versions. In my testing I found the free databases accurate and useful, though production BI or data science work should consider the paid versions.
The next question is how to resolve millions or billions of web log lines’ IP address to approximate geolocation. This is where Snowflake shines. The IP2Location LITE comes as a flat structured file with millions of rows After creating another Snowflake File Format, it’s an easy matter to turn the IP2Location flat file into a Snowflake table. From there, Snowflake’s powerful massive-scale join make it a simple matter to create a joined view that shows the IP’s approximate location.
There’s one final issue to get this all working. The IP2Location data comes with IPs represented by 32-bit integers, not the familiar dotted quad notation. This makes it much easier to use code and database routines that search for ranges of IPs that all happen to be in the same area by specifying a lower and upper range for the IP number.
IP Dotted Quad: IP 32-bit Integer, Decimal
192.168.1.1 3232235777
This means we need to convert our dotted quad values into 32-bit integer values. Fortunately, Snowflake makes that easy with a UDF (User Defined Function):
--Convert an IPv4 Dotted Quad into an IP Number
create or replace function IpToInteger(ipAddress varchar)
returns double
language javascript
strict
as '
var ipQuad = IPADDRESS.split(".");
var quad1 = parseInt(ipQuad[0]);
var quad2 = parseInt(ipQuad[1]);
var quad3 = parseInt(ipQuad[2]);
var quad4 = parseInt(ipQuad[3]);
return (quad1 * 16777216) + (quad2 * 65536) + (quad3 * 256) + quad4;
';
You may have noticed that the function returns a double, when we know the return will be a 32-bit integer. The reason is that all numeric values in Javascript are floating point, and fortunately we won’t be encountering any issues with precision (i.e. 0.9999999999 != 1) since we’re dealing with only multiplication and addition.
This is one possible approach, but running the Javascript millions or billions of times to get the value concerns me from a performance perspective, so in my next post I’ll explore the possibility of using straight SnowSQL to convert a dotted quad into a 32-bit integer.