{"id":60,"date":"2019-07-12T19:04:43","date_gmt":"2019-07-12T19:04:43","guid":{"rendered":"https:\/\/snowflake.pavlik.us\/?p=60"},"modified":"2020-02-02T03:44:18","modified_gmt":"2020-02-02T03:44:18","slug":"geolocation-of-ip-addresses-in-snowflake-part-2","status":"publish","type":"post","link":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/","title":{"rendered":"Geolocation of IP Addresses in Snowflake &#8211; Part 2"},"content":{"rendered":"\n<p>One of the thing needed to test Geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to do. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here:<\/p>\n\n\n\n<p><a href=\"https:\/\/www.sec.gov\/dera\/data\/edgar-log-file-data-set.html\">https:\/\/www.sec.gov\/dera\/data\/edgar-log-file-data-set.html<\/a><\/p>\n\n\n\n<p>Their weblogs obfuscate the final quad of the IPv4 dotted quad, so they look like this: 192.168.1.jjr.<\/p>\n\n\n\n<p>According to their documentation, their method will always replace the final number from 0 to 255 with the same three-letter value. The SEC does not disclose that &#8220;jjr&#8221; maps to something like 134, but for the purposes of this test it&#8217;s acceptable to assign all 256 unique three letter replacements with numbers from 0 to 255.<\/p>\n\n\n\n<p>After downloading the log file named &#8220;log20170630.zip&#8221;, we need to load it to a stage and use a file format to parse it. Here is the file format I used:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nALTER FILE FORMAT &quot;TEST&quot;.&quot;IP2LOCATION&quot;.IP_LOG SET COMPRESSION = &#039;AUTO&#039; FIELD_DELIMITER = &#039;,&#039; RECORD_DELIMITER = &#039;\\n&#039; SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = &#039;NONE&#039; TRIM_SPACE = TRUE ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE ESCAPE = &#039;NONE&#039; ESCAPE_UNENCLOSED_FIELD = &#039;\\134&#039; DATE_FORMAT = &#039;AUTO&#039; TIMESTAMP_FORMAT = &#039;AUTO&#039; NULL_IF = (&#039;\\\\N&#039;);\n<\/pre><\/div>\n\n\n<p>After loading the data into a table (I called mine IP_LOCATION), the next thing to do is replace the three-letter obfuscated replacements with numbers from 0 through 255. To do this, I ran an ELT process replace these values using these steps:<\/p>\n\n\n\n<p>Step 1: Create a new table with all the columns from the original web log, plus a new column to hold the last three characters of the IP number (the three-letter replacement).<\/p>\n\n\n\n<p>Step 2: Create a table with each of the distinct values in the new column. There will be 256 representing numbers from 0 through 255. For the purposes of this exercise, it is not important which numbers map to which three-letter replacements, so I assigned them in sequential order from 0 through 255 in the order returned in the select distinct query. I called these IP numbers simulated, even though only the final quad of the dotted quads is simulated.<\/p>\n\n\n\n<p>Step 3: Convert the IP dotted quads into IP numbers using the UDF in Part 1 of this series. The SQL looks like this:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\ncreate table IP_INT_LOG as(\nselect IpToInteger(SIMULATED_IP) as IP_NUMBER, IP, SIMULATED_QUAD, SIMULATED_IP, REQUEST_DATE, REQUEST_TIME, ZONE, CIK, ACCESSION, EXTENSION, CODE, SIZE, IDX, NOREFER, NOAGENT, FIND, CRAWLER, BROWSER from IP_LOG);\n<\/pre><\/div>\n\n\n<p>I was concerned about performance running a Javascript UDF millions of 20,488,579 times to convert IP dotted quads into IP numbers. It turned out I needn&#8217;t have been concerned. Snowflake converted all 20.5 million in 16.97 seconds using an Extra Small (XS) warehouse. Out of curiosity, I dropped the resulting table and increased the size of the warehouse to a medium (M) and it ran in 7.85 seconds. This provides empirical evidence that increasing the warehouse size improves performance including those with UDFs.<\/p>\n\n\n\n<p>This led me to the final part of the experiment &#8212; resolving the IP numbers in the web log to the geolocations. We have the IP numbers in the a log table, and the IP geolocations in another. My first thought was to join the tables, but this is an unusual join. The standard join matches keys. In this example, we need to use inequalities to join the table. In other words, join the IP log information with the geolocation data where the IP log&#8217;s IP number falls between the lower and upper bounds of a locations address range.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: sql; title: ; notranslate\" title=\"\">\nselect \nL.COUNTRY_ABBREVIATION, L.COUNTRY, L.STATE_OR_PROVINCE, L.CITY, L.LATITUDE, L.LONGITUDE, L.POSTAL_CODE, \nI.IP_NUMBER, I.IP, I.SIMULATED_QUAD, I.SIMULATED_IP, I.REQUEST_DATE, I.REQUEST_TIME, I.ZONE, I.CIK, I.ACCESSION, I.EXTENSION, I.CODE, I.SIZE, I.IDX, I.NOREFER, I.NOAGENT, I.FIND, I.CRAWLER, I.BROWSER\nfrom IP_INT_LOG I, IP_LOCATION L\nWHERE I.IP_NUMBER BETWEEN L.START_NUMBER AND L.END_NUMBER\nlimit 10;\n<\/pre><\/div>\n\n\n<p>Note: This procedure is missing one important change that dramatically improves performance. This will be the topic of a follow-on Part 3 of this post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the thing needed to test Geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to do. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here: https:\/\/www.sec.gov\/dera\/data\/edgar-log-file-data-set.html Their weblogs obfuscate the final quad of [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,2,7,1],"tags":[9,10,8],"class_list":["post-60","post","type-post","status-publish","format-standard","hentry","category-functions","category-sql","category-udf-sql","category-uncategorized","tag-geolocation","tag-table-joins","tag-udf"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\r\n<title>Geolocation of IP Addresses in Snowflake - Part 2 - Snowflake in the Carolinas<\/title>\r\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\r\n<link rel=\"canonical\" href=\"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/\" \/>\r\n<meta property=\"og:locale\" content=\"en_US\" \/>\r\n<meta property=\"og:type\" content=\"article\" \/>\r\n<meta property=\"og:title\" content=\"Geolocation of IP Addresses in Snowflake - Part 2 - Snowflake in the Carolinas\" \/>\r\n<meta property=\"og:description\" content=\"One of the thing needed to test Geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to do. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here: https:\/\/www.sec.gov\/dera\/data\/edgar-log-file-data-set.html Their weblogs obfuscate the final quad of [&hellip;]\" \/>\r\n<meta property=\"og:url\" content=\"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/\" \/>\r\n<meta property=\"og:site_name\" content=\"Snowflake in the Carolinas\" \/>\r\n<meta property=\"article:published_time\" content=\"2019-07-12T19:04:43+00:00\" \/>\r\n<meta property=\"article:modified_time\" content=\"2020-02-02T03:44:18+00:00\" \/>\r\n<meta name=\"author\" content=\"Greg Pavlik\" \/>\r\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\r\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Greg Pavlik\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\r\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/\"},\"author\":{\"name\":\"Greg Pavlik\",\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/#\\\/schema\\\/person\\\/019455f4675665b6cf5edea31ec44d7b\"},\"headline\":\"Geolocation of IP Addresses in Snowflake &#8211; Part 2\",\"datePublished\":\"2019-07-12T19:04:43+00:00\",\"dateModified\":\"2020-02-02T03:44:18+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/\"},\"wordCount\":534,\"commentCount\":0,\"keywords\":[\"Geolocation\",\"Table Joins\",\"UDF\"],\"articleSection\":[\"Functions\",\"SnowSQL\",\"UDF\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/\",\"url\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/\",\"name\":\"Geolocation of IP Addresses in Snowflake - Part 2 - Snowflake in the Carolinas\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/#website\"},\"datePublished\":\"2019-07-12T19:04:43+00:00\",\"dateModified\":\"2020-02-02T03:44:18+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/#\\\/schema\\\/person\\\/019455f4675665b6cf5edea31ec44d7b\"},\"breadcrumb\":{\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/index.php\\\/2019\\\/07\\\/12\\\/geolocation-of-ip-addresses-in-snowflake-part-2\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/snowflake.pavlik.us\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Geolocation of IP Addresses in Snowflake &#8211; Part 2\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/#website\",\"url\":\"https:\\\/\\\/snowflake.pavlik.us\\\/\",\"name\":\"Snowflake in the Carolinas\",\"description\":\"Random thoughts on all things Snowflake in the Carolinas\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/snowflake.pavlik.us\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/snowflake.pavlik.us\\\/#\\\/schema\\\/person\\\/019455f4675665b6cf5edea31ec44d7b\",\"name\":\"Greg Pavlik\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/d81df729eebf37a042922b17d4a4c834b1e0ccfa9fea1c2c78cb8e95c7e91701?s=96&d=mm&r=g\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/d81df729eebf37a042922b17d4a4c834b1e0ccfa9fea1c2c78cb8e95c7e91701?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/d81df729eebf37a042922b17d4a4c834b1e0ccfa9fea1c2c78cb8e95c7e91701?s=96&d=mm&r=g\",\"caption\":\"Greg Pavlik\"},\"description\":\"Greg is a Senior Sales Engineer at Snowflake Computing, in the Raleigh-Durham area. He's been in data management and security for the twenty years.\"}]}<\/script>\r\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Geolocation of IP Addresses in Snowflake - Part 2 - Snowflake in the Carolinas","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/","og_locale":"en_US","og_type":"article","og_title":"Geolocation of IP Addresses in Snowflake - Part 2 - Snowflake in the Carolinas","og_description":"One of the thing needed to test Geolocating a weblog is, well, a weblog. As it turns out, this is not an easy thing to do. After some research, I located a partially obfuscated Apache web log from the US Securities and Exchange Commission (SEC) available here: https:\/\/www.sec.gov\/dera\/data\/edgar-log-file-data-set.html Their weblogs obfuscate the final quad of [&hellip;]","og_url":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/","og_site_name":"Snowflake in the Carolinas","article_published_time":"2019-07-12T19:04:43+00:00","article_modified_time":"2020-02-02T03:44:18+00:00","author":"Greg Pavlik","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Greg Pavlik","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/#article","isPartOf":{"@id":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/"},"author":{"name":"Greg Pavlik","@id":"https:\/\/snowflake.pavlik.us\/#\/schema\/person\/019455f4675665b6cf5edea31ec44d7b"},"headline":"Geolocation of IP Addresses in Snowflake &#8211; Part 2","datePublished":"2019-07-12T19:04:43+00:00","dateModified":"2020-02-02T03:44:18+00:00","mainEntityOfPage":{"@id":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/"},"wordCount":534,"commentCount":0,"keywords":["Geolocation","Table Joins","UDF"],"articleSection":["Functions","SnowSQL","UDF"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/","url":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/","name":"Geolocation of IP Addresses in Snowflake - Part 2 - Snowflake in the Carolinas","isPartOf":{"@id":"https:\/\/snowflake.pavlik.us\/#website"},"datePublished":"2019-07-12T19:04:43+00:00","dateModified":"2020-02-02T03:44:18+00:00","author":{"@id":"https:\/\/snowflake.pavlik.us\/#\/schema\/person\/019455f4675665b6cf5edea31ec44d7b"},"breadcrumb":{"@id":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/snowflake.pavlik.us\/index.php\/2019\/07\/12\/geolocation-of-ip-addresses-in-snowflake-part-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/snowflake.pavlik.us\/"},{"@type":"ListItem","position":2,"name":"Geolocation of IP Addresses in Snowflake &#8211; Part 2"}]},{"@type":"WebSite","@id":"https:\/\/snowflake.pavlik.us\/#website","url":"https:\/\/snowflake.pavlik.us\/","name":"Snowflake in the Carolinas","description":"Random thoughts on all things Snowflake in the Carolinas","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/snowflake.pavlik.us\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/snowflake.pavlik.us\/#\/schema\/person\/019455f4675665b6cf5edea31ec44d7b","name":"Greg Pavlik","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/d81df729eebf37a042922b17d4a4c834b1e0ccfa9fea1c2c78cb8e95c7e91701?s=96&d=mm&r=g","url":"https:\/\/secure.gravatar.com\/avatar\/d81df729eebf37a042922b17d4a4c834b1e0ccfa9fea1c2c78cb8e95c7e91701?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d81df729eebf37a042922b17d4a4c834b1e0ccfa9fea1c2c78cb8e95c7e91701?s=96&d=mm&r=g","caption":"Greg Pavlik"},"description":"Greg is a Senior Sales Engineer at Snowflake Computing, in the Raleigh-Durham area. He's been in data management and security for the twenty years."}]}},"_links":{"self":[{"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/posts\/60","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/comments?post=60"}],"version-history":[{"count":4,"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/posts\/60\/revisions"}],"predecessor-version":[{"id":301,"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/posts\/60\/revisions\/301"}],"wp:attachment":[{"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/media?parent=60"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/categories?post=60"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/snowflake.pavlik.us\/index.php\/wp-json\/wp\/v2\/tags?post=60"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}