Quick Sample of Fuzzy Matching in Snowflake
Introduction
This post walks through a quick, practical example of fuzzy name matching using Snowflake SQL. The goal is to identify approximate matches based on phonetic similarity and spelling distance. We’ll progressively build up a simple pattern using SOUNDEX for fast phonetic filtering and EDITDISTANCE for final scoring. This isn’t a production-ready pipeline—it’s a conceptual starting point. A more advanced post will follow with a normalized nickname mapping approach.
Step 1: Input Parameters
We begin by setting the target name to match against. These could be passed in dynamically or used in ad hoc analysis:
SET FIRST_NAME = 'Greg';
SET LAST_NAME  = 'Smith';
Step 2: Sample Data
Next, we define a small set of names with intentional variations—nicknames, spelling shifts, and common soundalikes. This is your test data:
WITH NAMES AS (
    SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
        VALUES 
            ('Greg', 'Smith'),
            ('Gray', 'Smith'),
            ('Greg', 'Smyth'),
            ('Craig', 'Smythe'),
            ('Gregory', 'Smithe'),
            ('Mike', 'Smith'),
            ('Gregg', 'Smith'),
            ('Gregg', 'Smithe')
    )
),
Step 3: Phonetic Projection with SOUNDEX
We calculate the phonetic representation of first and last names using SOUNDEX. This lets us filter out clearly unrelated candidates before calculating edit distance:
SOUNDEX_PROJECTION AS (
    SELECT   FIRST_NAME,
             LAST_NAME,
             SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
             SOUNDEX(LAST_NAME)  AS SOUNDEX_LAST
    FROM NAMES
)
Step 4: Match by Edit Distance
Finally, we compare names that share a soundex prefix, ranking them by EDITDISTANCE from the target name:
SELECT    FIRST_NAME,
          LAST_NAME,
          EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM      SOUNDEX_PROJECTION
WHERE     SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
      AND SOUNDEX_LAST  = SOUNDEX($LAST_NAME)
      AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY  DISTANCE ASC;
This query filters and scores results, favoring names with both phonetic and lexical similarity. You can tune the distance threshold based on your precision-recall tradeoff.
Final: Minimal Reproducible Example
Here’s the entire working example in one block for easy copy-paste and experimentation:
SET FIRST_NAME = 'Greg';
SET LAST_NAME  = 'Smith';
WITH NAMES AS (
    SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
        VALUES 
            ('Greg', 'Smith'),
            ('Gray', 'Smith'),
            ('Greg', 'Smyth'),
            ('Craig', 'Smythe'),
            ('Gregory', 'Smithe'),
            ('Mike', 'Smith'),
            ('Gregg', 'Smith'),
            ('Gregg', 'Smithe')
    )
),
SOUNDEX_PROJECTION AS (
    SELECT   FIRST_NAME,
             LAST_NAME,
             SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
             SOUNDEX(LAST_NAME)  AS SOUNDEX_LAST
    FROM NAMES
)
SELECT    FIRST_NAME,
          LAST_NAME,
          EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM      SOUNDEX_PROJECTION
WHERE     SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
      AND SOUNDEX_LAST  = SOUNDEX($LAST_NAME)
      AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY  DISTANCE ASC;
A follow-up article will extend this logic using a normalized mapping table (e.g., mapping Catherine to Cat, Katie, etc.) for formal/informal name handling.
Stay tuned.
