Quick Sample of Fuzzy Matching in Snowflake
Quick Sample of Fuzzy Matching in Snowflake
Introduction
This post walks through a quick, practical example of fuzzy name matching using Snowflake SQL. The goal is to identify approximate matches based on phonetic similarity and spelling distance. We’ll progressively build up a simple pattern using SOUNDEX
for fast phonetic filtering and EDITDISTANCE
for final scoring. This isn’t a production-ready pipeline—it’s a conceptual starting point. A more advanced post will follow with a normalized nickname mapping approach.
Step 1: Input Parameters
We begin by setting the target name to match against. These could be passed in dynamically or used in ad hoc analysis:
SET FIRST_NAME = 'Greg';
SET LAST_NAME = 'Smith';
Step 2: Sample Data
Next, we define a small set of names with intentional variations—nicknames, spelling shifts, and common soundalikes. This is your test data:
WITH NAMES AS (
SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
VALUES
('Greg', 'Smith'),
('Gray', 'Smith'),
('Greg', 'Smyth'),
('Craig', 'Smythe'),
('Gregory', 'Smithe'),
('Mike', 'Smith'),
('Gregg', 'Smith'),
('Gregg', 'Smithe')
)
),
Step 3: Phonetic Projection with SOUNDEX
We calculate the phonetic representation of first and last names using SOUNDEX
. This lets us filter out clearly unrelated candidates before calculating edit distance:
SOUNDEX_PROJECTION AS (
SELECT FIRST_NAME,
LAST_NAME,
SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
SOUNDEX(LAST_NAME) AS SOUNDEX_LAST
FROM NAMES
)
Step 4: Match by Edit Distance
Finally, we compare names that share a soundex prefix, ranking them by EDITDISTANCE
from the target name:
SELECT FIRST_NAME,
LAST_NAME,
EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM SOUNDEX_PROJECTION
WHERE SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
AND SOUNDEX_LAST = SOUNDEX($LAST_NAME)
AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY DISTANCE ASC;
This query filters and scores results, favoring names with both phonetic and lexical similarity. You can tune the distance threshold based on your precision-recall tradeoff.
Final: Minimal Reproducible Example
Here’s the entire working example in one block for easy copy-paste and experimentation:
SET FIRST_NAME = 'Greg';
SET LAST_NAME = 'Smith';
WITH NAMES AS (
SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
VALUES
('Greg', 'Smith'),
('Gray', 'Smith'),
('Greg', 'Smyth'),
('Craig', 'Smythe'),
('Gregory', 'Smithe'),
('Mike', 'Smith'),
('Gregg', 'Smith'),
('Gregg', 'Smithe')
)
),
SOUNDEX_PROJECTION AS (
SELECT FIRST_NAME,
LAST_NAME,
SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
SOUNDEX(LAST_NAME) AS SOUNDEX_LAST
FROM NAMES
)
SELECT FIRST_NAME,
LAST_NAME,
EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM SOUNDEX_PROJECTION
WHERE SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
AND SOUNDEX_LAST = SOUNDEX($LAST_NAME)
AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY DISTANCE ASC;
A follow-up article will extend this logic using a normalized mapping table (e.g., mapping Catherine
to Cat
, Katie
, etc.) for formal/informal name handling.
Stay tuned.