Quick Sample of Fuzzy Matching in Snowflake

Quick Sample of Fuzzy Matching in Snowflake

Introduction

This post walks through a quick, practical example of fuzzy name matching using Snowflake SQL. The goal is to identify approximate matches based on phonetic similarity and spelling distance. We’ll progressively build up a simple pattern using SOUNDEX for fast phonetic filtering and EDITDISTANCE for final scoring. This isn’t a production-ready pipeline—it’s a conceptual starting point. A more advanced post will follow with a normalized nickname mapping approach.

Step 1: Input Parameters

We begin by setting the target name to match against. These could be passed in dynamically or used in ad hoc analysis:

SET FIRST_NAME = 'Greg';
SET LAST_NAME  = 'Smith';

Step 2: Sample Data

Next, we define a small set of names with intentional variations—nicknames, spelling shifts, and common soundalikes. This is your test data:

WITH NAMES AS (
    SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
        VALUES 
            ('Greg', 'Smith'),
            ('Gray', 'Smith'),
            ('Greg', 'Smyth'),
            ('Craig', 'Smythe'),
            ('Gregory', 'Smithe'),
            ('Mike', 'Smith'),
            ('Gregg', 'Smith'),
            ('Gregg', 'Smithe')
    )
),

Step 3: Phonetic Projection with SOUNDEX

We calculate the phonetic representation of first and last names using SOUNDEX. This lets us filter out clearly unrelated candidates before calculating edit distance:

SOUNDEX_PROJECTION AS (
    SELECT   FIRST_NAME,
             LAST_NAME,
             SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
             SOUNDEX(LAST_NAME)  AS SOUNDEX_LAST
    FROM NAMES
)

Step 4: Match by Edit Distance

Finally, we compare names that share a soundex prefix, ranking them by EDITDISTANCE from the target name:

SELECT    FIRST_NAME,
          LAST_NAME,
          EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM      SOUNDEX_PROJECTION
WHERE     SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
      AND SOUNDEX_LAST  = SOUNDEX($LAST_NAME)
      AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY  DISTANCE ASC;

This query filters and scores results, favoring names with both phonetic and lexical similarity. You can tune the distance threshold based on your precision-recall tradeoff.

Final: Minimal Reproducible Example

Here’s the entire working example in one block for easy copy-paste and experimentation:

SET FIRST_NAME = 'Greg';
SET LAST_NAME  = 'Smith';

WITH NAMES AS (
    SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
        VALUES 
            ('Greg', 'Smith'),
            ('Gray', 'Smith'),
            ('Greg', 'Smyth'),
            ('Craig', 'Smythe'),
            ('Gregory', 'Smithe'),
            ('Mike', 'Smith'),
            ('Gregg', 'Smith'),
            ('Gregg', 'Smithe')
    )
),
SOUNDEX_PROJECTION AS (
    SELECT   FIRST_NAME,
             LAST_NAME,
             SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
             SOUNDEX(LAST_NAME)  AS SOUNDEX_LAST
    FROM NAMES
)
SELECT    FIRST_NAME,
          LAST_NAME,
          EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM      SOUNDEX_PROJECTION
WHERE     SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
      AND SOUNDEX_LAST  = SOUNDEX($LAST_NAME)
      AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY  DISTANCE ASC;

A follow-up article will extend this logic using a normalized mapping table (e.g., mapping Catherine to Cat, Katie, etc.) for formal/informal name handling.

Stay tuned.