2025 - Snowflake in the Carolinas

Quick Sample of Fuzzy Matching in Snowflake

Posted on April 23, 2025April 25, 2025 by Greg Pavlik

Quick Sample of Fuzzy Matching in Snowflake

Introduction

This post walks through a quick, practical example of fuzzy name matching using Snowflake SQL. The goal is to identify approximate matches based on phonetic similarity and spelling distance. We’ll progressively build up a simple pattern using SOUNDEX for fast phonetic filtering and EDITDISTANCE for final scoring. This isn’t a production-ready pipeline—it’s a conceptual starting point. A more advanced post will follow with a normalized nickname mapping approach.

Step 1: Input Parameters

We begin by setting the target name to match against. These could be passed in dynamically or used in ad hoc analysis:

SET FIRST_NAME = 'Greg';
SET LAST_NAME  = 'Smith';

Step 2: Sample Data

Next, we define a small set of names with intentional variations—nicknames, spelling shifts, and common soundalikes. This is your test data:

WITH NAMES AS (
    SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
        VALUES 
            ('Greg', 'Smith'),
            ('Gray', 'Smith'),
            ('Greg', 'Smyth'),
            ('Craig', 'Smythe'),
            ('Gregory', 'Smithe'),
            ('Mike', 'Smith'),
            ('Gregg', 'Smith'),
            ('Gregg', 'Smithe')
    )
),

Step 3: Phonetic Projection with SOUNDEX

We calculate the phonetic representation of first and last names using SOUNDEX. This lets us filter out clearly unrelated candidates before calculating edit distance:

SOUNDEX_PROJECTION AS (
    SELECT   FIRST_NAME,
             LAST_NAME,
             SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
             SOUNDEX(LAST_NAME)  AS SOUNDEX_LAST
    FROM NAMES
)

Step 4: Match by Edit Distance

Finally, we compare names that share a soundex prefix, ranking them by EDITDISTANCE from the target name:

SELECT    FIRST_NAME,
          LAST_NAME,
          EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM      SOUNDEX_PROJECTION
WHERE     SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
      AND SOUNDEX_LAST  = SOUNDEX($LAST_NAME)
      AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) &lt;= 10
ORDER BY  DISTANCE ASC;

This query filters and scores results, favoring names with both phonetic and lexical similarity. You can tune the distance threshold based on your precision-recall tradeoff.

Final: Minimal Reproducible Example

Here’s the entire working example in one block for easy copy-paste and experimentation:

SET FIRST_NAME = 'Greg';
SET LAST_NAME  = 'Smith';

WITH NAMES AS (
    SELECT COLUMN1 AS FIRST_NAME, COLUMN2 AS LAST_NAME FROM (
        VALUES 
            ('Greg', 'Smith'),
            ('Gray', 'Smith'),
            ('Greg', 'Smyth'),
            ('Craig', 'Smythe'),
            ('Gregory', 'Smithe'),
            ('Mike', 'Smith'),
            ('Gregg', 'Smith'),
            ('Gregg', 'Smithe')
    )
),
SOUNDEX_PROJECTION AS (
    SELECT   FIRST_NAME,
             LAST_NAME,
             SOUNDEX(FIRST_NAME) AS SOUNDEX_FIRST,
             SOUNDEX(LAST_NAME)  AS SOUNDEX_LAST
    FROM NAMES
)
SELECT    FIRST_NAME,
          LAST_NAME,
          EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) AS DISTANCE
FROM      SOUNDEX_PROJECTION
WHERE     SOUNDEX_FIRST = SOUNDEX($FIRST_NAME)
      AND SOUNDEX_LAST  = SOUNDEX($LAST_NAME)
      AND EDITDISTANCE(CONCAT(FIRST_NAME, ' ', LAST_NAME),
                       CONCAT($FIRST_NAME, ' ', $LAST_NAME)) <= 10
ORDER BY  DISTANCE ASC;

A follow-up article will extend this logic using a normalized mapping table (e.g., mapping Catherine to Cat, Katie, etc.) for formal/informal name handling.

Stay tuned.