Many customers have asked me how to separate good rows from bad rows during a data load. You can use the validate table function to return all the errors encountered during the load. This may not be exactly what you need though.
What you may be looking for is a design pattern like this:
- Load data from files into a raw table
- Except for major errors, insert rows with minor data quality issues into the raw table
- After loading the raw table, insert good rows to staging (if more processing to do) or production
- At the same time, insert bad rows into a separate table for examination of data quality problems
You usually load Snowflake tables from files. Files are string data, so you can define all columns in your raw table as string type. This ensures simple errors will not disrupt the load process. Major errors such as an improper number of columns in a row will generate an error during the load. You can specify the appropriate copy option to set how you want Snowflake to handle major errors like this.
After defining a raw table, you can create a staging table or a production table. Either option uses proper data types instead of all strings. You’ll insert new rows to the target table while sending bad ones to a table containing the original bad values. You can then examine the bad rows to see why they failed to convert to the proper data types.
You can use the following SQL script as a template for how to use Snowflake multi-table Inserts to do this.
-- Create a staging table with all columns defined as strings.
-- This will hold all raw values from the load files.
create or replace table SALES_RAW
( -- Actual Data Type
SALE_TIMESTAMP string, -- timestamp
ITEM_SKU string, -- int
PRICE string, -- number(10,2)
IS_TAXABLE string, -- boolean
COMMENTS string -- string
);
-- Create the production table with actual data types.
create or replace table SALES_STAGE
(
SALE_TIMESTAMP timestamp,
ITEM_SKU int,
PRICE number(10,2),
IS_TAXABLE boolean,
COMMENTS string
);
-- Simulate adding some rows from a load file. Two rows are good.
-- Four rows generate errors when converting to the data types.
insert into SALES_RAW
(SALE_TIMESTAMP, ITEM_SKU, PRICE, IS_TAXABLE, COMMENTS)
values
('2020-03-17 18:21:34', '23289', '3.42', 'TRUE', 'Good row.'),
('2020-17-03 18:21:56', '91832', '1.41', 'FALSE', 'Bad row: SALE_TIMESTAMP has the month and day transposed.'),
('2020-03-17 18:22:03', '7O242', '2.99', 'T', 'Bad row: ITEM_SKU has a capital "O" instead of a zero.'),
('2020-03-17 18:22:10', '53921', '$6.25', 'F', 'Bad row: PRICE should not have a dollar sign.'),
('2020-03-17 18:22:17', '90210', '2.49', 'Foo', 'Bad row: IS_TAXABLE cannot be converted to true or false'),
('2020-03-17 18:22:24', '80386', '1.89', '1', 'Good row.');
-- Make sure the rows inserted okay.
select * from SALES_RAW;
-- Create a table to hold the bad rows.
create or replace table SALES_BAD_ROWS like SALES_RAW;
-- Using multi-table inserts (https://docs.snowflake.net/manuals/sql-reference/sql/insert-multi-table.html)
-- Insert good rows into SALES_STAGE and bad rows into SALES_BAD_ROWS
insert first
when SALE_TIMESTAMP_X is null and SALE_TIMESTAMP is not null or
ITEM_SKU_X is null and SALE_TIMESTAMP is not null or
PRICE_X is null and PRICE is not null or
IS_TAXABLE_X is null and IS_TAXABLE is not null
then
into SALES_BAD_ROWS
(SALE_TIMESTAMP, ITEM_SKU, PRICE, IS_TAXABLE, COMMENTS)
values
(SALE_TIMESTAMP, ITEM_SKU, PRICE, IS_TAXABLE, COMMENTS)
else
into SALES_STAGE
(SALE_TIMESTAMP, ITEM_SKU, PRICE, IS_TAXABLE, COMMENTS)
values
(SALE_TIMESTAMP_X, ITEM_SKU_X, PRICE_X, IS_TAXABLE_X, COMMENTS)
select try_to_timestamp (SALE_TIMESTAMP) as SALE_TIMESTAMP_X,
try_to_number (ITEM_SKU, 10, 0) as ITEM_SKU_X,
try_to_number (PRICE, 10, 2) as PRICE_X,
try_to_boolean (IS_TAXABLE) as IS_TAXABLE_X,
COMMENTS,
SALE_TIMESTAMP,
ITEM_SKU,
PRICE,
IS_TAXABLE
from SALES_RAW;
-- Examine the two good rows
select * from SALES_STAGE;
-- Examine the four bad rows
select * from SALES_BAD_ROWS;
If your incoming values have no nulls or a default value, you can eliminate the null check from the SQL. Now
when SALE_TIMESTAMP_X is null or
ITEM_SKU_X is null or
PRICE_X is null or
IS_TAXABLE_X is null
This works because if the original value isn’t null, the only reason it would be null is type cast failure. There’s one final note on this section of the SQL. Why doesn’t the “when” section with several “and” and “or” operators need parenthesis?
The short answer is that AND has a higher operator precedence than the OR operator. This is true in SQL and most programming languages and seems familiar to many people. If it improves clarity you can add parenthesis. As an academic exercise, this shows the operator precedence of “and” and “or”.
select TRUE and TRUE or FALSE and FALSE; -- In operator precedence, AND comes first, then left to right. This evaluates to TRUE.
-- Processing AND first
select (TRUE and TRUE) or (FALSE and FALSE); -- This is functionally equivalent to the above statement.
-- Processing OR first
select TRUE and (TRUE or FALSE) and FALSE; -- This shows what would happen if OR had higher operator precedence
-- Processing only left to right
select ((TRUE and TRUE) or FALSE) and FALSE; -- This shows what would happen with no operator precedence, just left to right