Duplicate records can result in unintended outcomes when working with data. Despite data professionals disliking it, this issue is often encountered in data warehouse systems. When loading data from sources into a staging table, the data load job may execute multiple times due to various reasons, such as network connectivity failures or runtime exceptions, which can lead to the insertion of duplicate records.
How to delete only excessive rows from a table?
Consider the following image, data from 6/10/2017 loaded four times into this table while there should be only one record.
Using Window function and CTE
The window function ROW_NUMBER() and Common Table Expression (CTE) come in handy to remove duplicate records. Refer to the following query, where the ROW_NUMBER() function is used to generate a unique sequence number for each row within the partition group. The records are partitioned based on the columns ‘StoreId, SaleDate, SalesAmount’ where duplicates are found. The result of this query is shown in the following image.
;WITH DUPLICATE AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum FROM StoreSale ) SELECT * FROM DUPLICATE
DELETE Records
It is clear that records with RowNum greater than 1 are confirmed to be duplicates and it is safe to delete them using the following query
;WITH DUPLICATE AS ( SELECT ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum FROM StoreSale ) DELETE FROM DUPLICATE WHERE RowNum >1
Complete script used in this exercise given here.
USE TEMPDB GO --------------------------------------------------------------------------------------------------------------- --Prep --------------------------------------------------------------------------------------------------------------- DROP TABLE IF EXISTS [dbo].[StoreSale] GO CREATE TABLE [dbo].[StoreSale] ( [StoreSaleId] [bigint] IDENTITY(1,1) NOT NULL, [StoreId] [int] NOT NULL, [SaleDate] [date] NOT NULL, [SalesAmount] [money] NOT NULL, [DataLoadDate] [datetime] NULL DEFAULT (GETDATE()), CONSTRAINT [PK_StoreSaleId] PRIMARY KEY CLUSTERED ( [StoreSaleId] ASC) ) ON [PRIMARY] GO DECLARE @StoreId SMALLINT = 1 DECLARE @SaleDate DATE = '6/1/2017' WHILE @StoreId <=1 BEGIN WHILE @SaleDate <='6/10/2017' BEGIN INSERT INTO [dbo].[StoreSale] ([StoreId],[SaleDate],[SalesAmount]) VALUES (@StoreId,@SaleDate,ROUND(RAND()*1000,2)) SET @SaleDate=DATEADD(DD,1,@SaleDate) END SET @StoreId+=1 SET @SaleDate = '6/1/2017' END GO INSERT INTO [dbo].[StoreSale] ([StoreId],[SaleDate],[SalesAmount]) SELECT [StoreId],[SaleDate],[SalesAmount] FROM [StoreSale] WHERE StoreId=1 AND SaleDate = '6/10/2017' GO 2 --------------------------------------------------------------------------------------------------------------- --Select and delete duplicate records --------------------------------------------------------------------------------------------------------------- GO ;WITH DUPLICATE AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum FROM StoreSale ) SELECT * FROM DUPLICATE GO ;WITH DUPLICATE AS ( SELECT ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum FROM StoreSale ) DELETE FROM DUPLICATE WHERE RowNum >1 GO ;WITH DUPLICATE AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY StoreId, SaleDate, SalesAmount ORDER BY StoreSaleId) RowNum FROM StoreSale ) SELECT * FROM DUPLICATE