Month: May 2019

Geolocation of IP Addresses in Snowflake

Geolocation of IP Addresses in Snowflake

It’s probably a safe assumption that business intelligence and data science teams can get valuable insights knowing the geolocation of website visitors. Suppose a product launch gets lots of web traffic, but the only source of information on visitors is the web log. Some web server statistics report on traffic grouped by nation, but what if we want to get much more granular information and incorporate this information with the main data warehouse?

Let’s take one sample web server among many, Apache Web Server, and quickly examine the structure of a log entry. Here’s a line in a sample Apache Web Server log.

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846

In the Apache documentation, we can get more detailed information on the meaning of each field in the line, but for now we’re going to concentrate on 1) how to load the web logs into Snowflake, and 2) the key aspects for business intelligence and geolocation.

Loading the data is a quick proposition. Even without reading the Apache documentation it’s clear that the web log is space delimited and wraps any fields with spaces inside double quotes. Snowflake provides a very simple way to ingest structured data in flat files using File Formats. You can create a file format using SnowSQL (documented here: https://docs.snowflake.net/manuals/sql-reference/sql/create-file-format.html) or you can use the Snowflake Web UI (documented here: https://docs.snowflake.net/manuals/user-guide/data-load-web-ui.html#step-1-open-the-load-data-wizard).

Note: Although the Apache Web Log is space delimited, we will use the CSV option for the Snowflake File Format — simply change the delimiter from a comma to a space.

After loading the data, we now need to geolocate the web hits. Problem: the Apache Web Server log, as most web logs, does not show anything for geolocation. Fortunately in most cases, we can use the visitor’s IP address to get fairly accurate geolocation.

Third party services keep up to date databases of IPv4 and IPv5 geolocation data. Once such service I found at https://lite.ip2location.com includes free databases with less rich information than the paid versions. In my testing I found the free databases accurate and useful, though production BI or data science work should consider the paid versions.

The next question is how to resolve millions or billions of web log lines’ IP address to approximate geolocation. This is where Snowflake shines. The IP2Location LITE comes as a flat structured file with millions of rows After creating another Snowflake File Format, it’s an easy matter to turn the IP2Location flat file into a Snowflake table. From there, Snowflake’s powerful massive-scale join make it a simple matter to create a joined view that shows the IP’s approximate location.

There’s one final issue to get this all working. The IP2Location data comes with IPs represented by 32-bit integers, not the familiar dotted quad notation. This makes it much easier to use code and database routines that search for ranges of IPs that all happen to be in the same area by specifying a lower and upper range for the IP number.

IP Dotted Quad:     IP 32-bit Integer, Decimal
192.168.1.1         3232235777

This means we need to convert our dotted quad values into 32-bit integer values. Fortunately, Snowflake makes that easy with a UDF (User Defined Function):

--Convert an IPv4 Dotted Quad into an IP Number
create or replace function IpToInteger(ipAddress varchar)
  returns double 
  language javascript
  strict
  as '
    
    var ipQuad = IPADDRESS.split(".");
 
    var quad1 = parseInt(ipQuad[0]);
    var quad2 = parseInt(ipQuad[1]);
    var quad3 = parseInt(ipQuad[2]);
    var quad4 = parseInt(ipQuad[3]);
    return (quad1 * 16777216) + (quad2 * 65536) + (quad3 * 256) + quad4;

  ';

You may have noticed that the function returns a double, when we know the return will be a 32-bit integer. The reason is that all numeric values in Javascript are floating point, and fortunately we won’t be encountering any issues with precision (i.e. 0.9999999999 != 1) since we’re dealing with only multiplication and addition.

This is one possible approach, but running the Javascript millions or billions of times to get the value concerns me from a performance perspective, so in my next post I’ll explore the possibility of using straight SnowSQL to convert a dotted quad into a 32-bit integer.

Grouping Numbers in Snowflake – Part 2

Grouping Numbers in Snowflake – Part 2

In a previous post, I showed a way to format numbers using local grouping characters. One of the advantages of the approach is it will work everywhere across Snowflake since it does not require creation of a User Defined Function (UDF).

Using a UDF provides a lot more power and flexibility, so let’s see how we do that. First, you need to be working with a specific Snowflake Database and have an active Data Warehouse to do any processing including creation of a UDF.

use database TEST_DATABASE;
use warehouse TEST_WAREHOUSE;

Next we need to create the User Defined Function (UDF):

--Return a formatted number with locale-specific grouping and decimal characters.
create or replace function FormatNumber(d double)
  returns string
  language javascript
  strict
  as '
	if (!isNaN(D)){
    	return D.toLocaleString("en-US");
    } else {
    	return "Not a number.";
    }
  ';

Of course, we could create a much more sophisticated function with overloads that handles optional local input, etc., but for the purposes of this function that’s overkill. If we need to adjust the locale, we can do that easily in the function by changing the parameter sent to the “toLocaleString”. Note that if you omit this parameter, the function will use the default for the current machine. In my case, the machine is in the US-East region of AWS, so it’s already set to US English, “en-US”, but to be on the safe side, it’s best to specify the right value.

To test the function, we can issue this SnowSQL statement (note that we need to be using the same database as the one where we created the UDF):

--Optionally set the database context if it's changed.
use database TEST_DATABASE;

select formatnumber(123456789.012) as FORMATTED_NUMBER;

When the locale specified US English, “us-EN”, it will return the following:

To get a list of possible values for the locale parameter worldwide, refer to the Mozilla Developer Network documentation here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number/toLocaleString

One of the options on that page is for Han Chinese. I don’t know how to read that, but when I set the locale in the Snowflake UDF to zh-Hans-CN-u-nu-hanidec I’ll assume it’s working just fine when it returns the following result:

Perhaps someone who can read Han Chinese can confirm this for me.

Grouping Numbers in Snowflake

Grouping Numbers in Snowflake

As I started working with Snowflake, one of the things I noticed is that large integers appear without grouping symbols. When counting rows in a table, it may display as 752941241 rows. Without counting the digits, it’s hard to see at a glance if this is about 75 million rows or 750 million rows.

After reading the docs, I found a simple way to format long numbers with grouping digits.

select to_varchar(count(*), '999,999,999') as TRIP_COUNT from trips;

I’m using a large number in the format string because when a the first argument in to_varchar exceeded the capacity of the number in the format string, the function will return ###,###. Be sure to use a format string large enough for any return. I have tested it with format strings as large as 999,999,999,999,999,999,999 (that’s twenty one 9’s). If the format string is larger than the required number of digits, the function will return the right sized formatted number without leading spaces or zeros.

You can also specify a decimal point using dot character. For example, to show a number with grouped digits and a decimal point:

select to_varchar(some_column, '999,999,999.999999999') from some_table;

This will show up to the maximum number of digits specified in the format string. but will not show more than are necessary. The decimal point will always show up, even when there’s no decimal portion of the number. I will see if there’s a way to format this better and update here.

Privacy Policy

Who we are

Although I am a Snowflake Computing employee, this is a subdomain of my personal website dedicated to tips, tricks, random thoughts, and all things Snowflake Computing. My website address is: https://snowflake.pavlik.us.

What personal data we collect and why we collect it

Comments

When visitors leave comments on the site we collect the data shown in the comments form, and also the visitor’s IP address and browser user agent string to help spam detection.

An anonymized string created from your email address (also called a hash) may be provided to the Gravatar service to see if you are using it. The Gravatar service privacy policy is available here: https://automattic.com/privacy/. After approval of your comment, your profile picture is visible to the public in the context of your comment.

Media

If you upload images to the website, you should avoid uploading images with embedded location data (EXIF GPS) included. Visitors to the website can download and extract any location data from images on the website.

Contact forms

If you leave contact details in a form, it will be used only to contact you. Your contact information will not be shared with or sold to third parties.

Cookies

If you leave a comment on our site you may opt-in to saving your name, email address and website in cookies. These are for your convenience so that you do not have to fill in your details again when you leave another comment. These cookies will last for one year.

If you visit our login page, we will set a temporary cookie to determine if your browser accepts cookies. This cookie contains no personal data and is discarded when you close your browser.

When you log in, we will also set up several cookies to save your login information and your screen display choices. Login cookies last for two days, and screen options cookies last for a year. If you select “Remember Me”, your login will persist for two weeks. If you log out of your account, the login cookies will be removed.

If you edit or publish an article, an additional cookie will be saved in your browser. This cookie includes no personal data and simply indicates the post ID of the article you just edited. It expires after 1 day.

Embedded content from other websites

Articles on this site may include embedded content (e.g. videos, images, articles, etc.). Embedded content from other websites behaves in the exact same way as if the visitor has visited the other website.

These websites may collect data about you, use cookies, embed additional third-party tracking, and monitor your interaction with that embedded content, including tracking your interaction with the embedded content if you have an account and are logged in to that website.

Analytics

The website collects clickstream data, IP addresses, times of visits, page impressions, etc. This data will be analyzed to improve the site, and will not be shared with or sold to third parties.

Who we share your data with

No one. If you wish to contact other Snowflake Computing staff in sales, service, or support, I will initiate contact outside of this website.

How long we retain your data

If you leave a comment, the comment and its metadata are retained indefinitely. This is so we can recognize and approve any follow-up comments automatically instead of holding them in a moderation queue.

For users that register on our website (if any), we also store the personal information they provide in their user profile. All users can see, edit, or delete their personal information at any time (except they cannot change their username). Website administrators can also see and edit that information.

What rights you have over your data

If you have an account on this site, or have left comments, you can request to receive an exported file of the personal data we hold about you, including any data you have provided to us. You can also request that we erase any personal data we hold about you. This does not include any data we are obliged to keep for administrative, legal, or security purposes.

Where we send your data

Visitor comments may be checked through an automated spam detection service.

Your contact information

We do not collect personal contact information and will delete or redact any found in a post or comment. If you need someone from Snowflake Computing to contact you, please visit https://snowflake.com and request contact with the appropriate department.

What third parties we receive data from

This website may share data or links to data from third parties from open-data repositories or sample datasets. All datasets have an attribution of the source.

Posted in Uncategorized
Snowflake in Raleigh/Durham

Snowflake in Raleigh/Durham

Hi there. My name is Greg Pavlik, and I’m a Sr. Sales Engineer at Snowflake Computing based in Raleigh-Durham. I created this site to share tips, tricks, and random thoughts on all things Snowflake Computing.

Snowflake is the only data warehouse built from the ground up for the cloud. You can find more about Snowflake at https://snowflake.com.

Although I am a Snowflake employee, this is not an official Snowflake Computing source of information. It’s intended to help others with SQL syntax, project ideas, data integration and more.

Posted in Uncategorized
Theme: Overlay by Kaira