Detect and / or Delete Duplicates

Ok - I got tired of repeatedly modifying my favorite duplicate deletion algorithm for whatever table I happen to be working with at the time. This code is the result of that frustration. It provides hassle free locating and reporting (or deletion) of duplicate rows in a SQL table based on the uniqueness of a column or columns that the user provides with a few nifty hands-off features. The actual deletion method employed is a correlated subquery, so I must locate or create a column of unique values. To execute, this code only requires that you specify the table name and which column or columns make up a unique row in that table. It will examine the table to discover if it has a Row-GUID or IDENTITY column. If not, then each column is examined to determine if it contains all unique values. Failing that it will create a temporary IDENTITY column. Alternatively, the unique column can be specified.

Syntax:

sp_RemoveDups 'TableName', 'DupQualifierFieldNameList', ['DeleteDups'], ['UniqueColName'], ['CreateIdentColIfNeeded'], <'StoredProcedureResult'>

Inputs:

  • @TableName: Required - String - Name of the table to detect duplicate records in.
  • @DupQualifierFieldNameList: Required - String - Comma seperated list of columns that make up the unique record within TableName.
  • @DeleteDups: Optional - Bit, set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.
  • @UniqueColName: Optional - Bit - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided, then an attempt will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the IDENTITY column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.
  • @CreateIdentColIfNeeded: Optional - Bit - By default this SP will create an IDENTITY column if no unique column can be located. Pass in a 1 here to turn this feature off.
  • @StoredProcedureResult: Optional - OUTPUT - Int - Returns a 3 if an error occured, otherwise returns a 0.

Returns:

  • If a the first two parameters are either missing or invalid and the user is making the stored procedure call from SQL Query Analyzer then help text is provided.
  • If the DeleteDups argument is not passed in or it is set TO 1 then a recordset of the duplicate records to be deleted will be returned.
  • If the DeleteDups argument is set to 0 then the default SQL success or failure behavior should occur.

Assumes:

  • By default, this code does not delete records - it just reports them. To delete records you must pass in the appropriate parameters.

Code:

/*     Description: UTILITY - Locate IN MASTER
         
      Syntax: EXEC sp_RemoveDups TableName, DupQualifierFieldNameList, DeleteDups, UniqueColName, CreateIdentColIfNeeded, StoredProcedureResult

               ONLY the first two arguments are required
               FOR HELP, enter the stored procedures name without ANY arguments or see the PRINT statements below

               NO DELETION WILL OCCUR BY DEFAULT - ONLY duplicate recordset returned. TO DELETE records, pass IN a 0 FOR the DeleteDups argument.

      Example: EXEC sp_RemoveDups 'MyTable','LastName,FirstName,HomePhone'
       Purpose: Allow removal OF duplicate rows WHERE
               1. We define what fields qualify the duplicate

               2. We SELECT the UNIQUE rowid OR it IS detected automatically ELSE no action takes place
     Method:          DELETE BY RowID ALL duplicate rows EXCEPT the highest RowID (in alpha-sort sequence)

               OF each GROUP of duplicates.
      DATE          BY                    CHANGE
     09-23-2002     Frank                    Original v1.0
     09-23-2002     Frank                    Changed the name FROM sp_RemoveDupsByRowID TO sp_RemoveDups

     10-8-2002     Sean P. O. MacCath-Moran     Made @UniqueColName optional
                                   Added logic TO auto-detect RowGUID AND IDENTITY columns
                                   Added logic TO CHECK FOR UNIQUE value COLUMN IF no RowGUID or IDENTITY column EXISTS

                                   Added logic TO CREATE a TEMPORARY ID field AS a last resort IF no UNIQUE COLUMN could be located

                                   Added HELP
                                   
*/
CREATE PROCEDURE sp_RemoveDups
@TableName AS varchar(50) = NULL,
@DupQualifierFieldNameList AS varchar(200) = NULL,

@DeleteDups AS bit = NULL,
@UniqueColName AS varchar(50) = NULL,
@CreateIdentColIfNeeded AS bit = NULL,
@StoredProcedureResult int = NULL OUTPUT

AS
SET NOCOUNT ON
DECLARE @SQL nvarchar(2000)
DECLARE @SQL_DetermineUniqueTemplate nvarchar(2000)
DECLARE @TempIdentColName varchar(20)
DECLARE @HostName varchar(50)
DECLARE @ActionText varchar(10)

DECLARE @SUM int
DECLARE @COUNT int
DECLARE @NextColumn varchar(50)
/*==================================================================================*/
/*========================VARIABLE INITIALIZATION AND SETUP========================*/
/*=================================================================================*/
/*If no UNIQUE COLUMN IS located THEN a TEMPORARY IDENTITY column will be created useing the name specified in this TempIdentColName string*/

SET @TempIdentColName = 'TempIdentityColXY123'
SET @SQL_DetermineUniqueTemplate = 'SELECT @COUNT = COUNT(Count), @SUM = sum(Count) FROM '
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + '(SELECT TOP 100 PERCENT , COUNT(*) AS Count FROM ' + @TableName

SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + ' GROUP BY  ORDER BY ) a'
/*Retrieve the Host Name. This will be used later IN this SP AS a test TO determine IF the USER is making this call FROM within SQL Query Analyzer*/

SELECT @HostName = hostname FROM master..sysprocesses WHERE spid = @@SPID AND program_name = 'SQL Query Analyzer'
/*Set ActionText TO be used IN message output*/

IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
     SET @ActionText = 'Selection'
ELSE
     SET @ActionText = 'Deletion'

/*If a value IS specified FOR USE BY UniqueColName it cannot exist IN the columns FROM the DupQualifierFieldNameList*/

IF CHARINDEX(@UniqueColName, @DupQualifierFieldNameList) > 0
     BEGIN
          /*The value IN UniqueColName was detected in DupQualifierFieldNameList.*/
          IF NOT (@HostName IS NULL) PRINT 'The UniqueColName provided (' + @UniqueColName + ') must not exist in DupQualifierFieldNameList (' + @DupQualifierFieldNameList + '). Other solutions will be sought automatically.'

          SET @UniqueColName = NULL
     END
/*If UniqueColName IS provided THEN perform CHECK TO ensure that ALL the VALUES IN that COLUMN are, in fact, unique.*/

IF NOT (@UniqueColName IS NULL)
     BEGIN
          SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate,'', @UniqueColName)
          /*Perform a CHECK OF this COLUMN TO determine IF ALL of it's VALUES are unique*/

          EXEC sp_executesql @SQL, N'@SUM AS int OUTPUT,@COUNT as int OUTPUT',@SUM OUTPUT,@COUNT OUTPUT
          /*Test TO determine IF this COLUMN CONTAINS UNIQUE values*/

          IF @SUM <> @COUNT
               BEGIN
                    /*The COLUMN specified BY UniqueColName does NOT contain unique values.*/
                    IF NOT (@HostName IS NULL) PRINT 'The UniqueColName provided (' + @UniqueColName + ') does not contain unique values. Other solutions will be sought automatically.'

                    SET @UniqueColName = NULL
               END
     END
/*==============================================================*/
/*======================HELP OUTPUT TEXT======================*/
/*==============================================================*/
IF (@TableName IS NULL) OR (@TableName = '/?') OR (@TableName = '?') OR (@DupQualifierFieldNameList IS NULL) OR (@DupQualifierFieldNameList = '/?') OR (@DupQualifierFieldNameList = '?')

     BEGIN
          IF NOT (@HostName IS NULL)
               BEGIN
                    PRINT 'sp_RemoveDups ''TableName'', ''DupQualifierFieldNameList'', [''DeleteDups''], [''UniqueColName''], [''CreateIdentColIfNeeded''], <''StoredProcedureResult''>'
                    PRINT '====================================================================================================================================================================='

                    PRINT 'TableName: Required - String - Name OF the table TO detect duplicate records in.'
                    PRINT 'DupQualifierFieldNameList: Required - String - Comma seperated list OF columns that make up the UNIQUE record within TableName.'

                    PRINT 'DeleteDups: Optional - Bit, SET TO 0 to delete duplicate records. A value OF NULL or 1 will RETURN the duplicate records to be deleted.'

                    PRINT 'UniqueColName: Optional - Bit - A TABLE must have a unique COLUMN value in it TO perform the deletion logic. IF no UniqueColName IS provided THEN an attemp will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the IDENTITY column. If that fails then ALL OF the columns of the table will be examined AND the first one WITH all unique VALUES will be selected.'

                    PRINT 'CreateIdentColIfNeeded: Optional - Bit - BY DEFAULT this SP will create an IDENTITY COLUMN if no UNIQUE column can be located. Pass in a 1 here TO run this feature off.'

                    PRINT 'StoredProcedureResult: Optional - OUTPUT - Int - Returns a 3 IF an error occured, otherwise returns a 0.'

               END
          SET @StoredProcedureResult = 3
          RETURN
     END
/*========================================================================*/
/*======================DETECT USABLE UniqueColName======================*/
/*========================================================================*/
IF @UniqueColName IS NULL
     BEGIN

          /*Check FOR a RowGUID or IDENTITY COLUMN IN this table. IF one exists, THEN utilze it AS the UNIQUE value for the purposes OF this deletion*/

          IF EXISTS(SELECT * FROM SysColumns WHERE ID = Object_ID(@TableName) AND ColumnProperty(ID,Name,'IsRowGUIDCol') = 1) SET @UniqueColName = 'RowGUIDCol'

          IF EXISTS(SELECT * FROM SysColumns WHERE ID = Object_ID(@TableName) AND ColumnProperty(ID,Name,'IsIdentity') = 1) SET @UniqueColName = 'IdentityCol'

         
          IF @UniqueColName IS NULL
          /*If no RowGUID OR IDENTITY COLUMN was found THEN CHECK ALL OF the columns IN this TABLE TO see if one of them can be utilized as a UNIQUE value column*/

               BEGIN
                    /*Select ALL OF the columns FROM the TABLE IN question...*/
                    DECLARE MyCursor CURSOR LOCAL SCROLL STATIC FOR SELECT name FROM syscolumns WHERE OBJECT_ID(@TableName)=ID

                   
                    OPEN MyCursor
                    FETCH NEXT FROM MyCursor INTO @NextColumn
                    WHILE @@fetch_status = 0
                         BEGIN
                              /*Create SQL string WITH correct COLUMN name in place.*/

                              SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate,'', @NextColumn)
                              /*Perform a CHECK OF this COLUMN TO determine IF ALL of it's VALUES are unique*/

                              EXEC sp_executesql @SQL, N'@SUM AS int OUTPUT,@COUNT as int OUTPUT',@SUM OUTPUT,@COUNT OUTPUT
                              /*Test TO determine IF this COLUMN CONTAINS UNIQUE values*/

                              IF @SUM = @COUNT
                                   BEGIN
                                        /*A UNIQUE VALUES COLUMN IS detected. USE it AND BREAK out OF the loop UNLESS column is specified IN DupQualifierFieldNameList*/

                                        IF CHARINDEX(@NextColumn, @DupQualifierFieldNameList) = 0
                                             BEGIN
                                                  /*NextColumn was NOT detected IN DupQualifierFieldNameList, so this is the column we will use.*/

                                                  SET @UniqueColName = @NextColumn
                                                  BREAK
                                             END
                                   END
                              ELSE
                              FETCH NEXT FROM MyCursor INTO @NextColumn
                         END
                    CLOSE MyCursor

                    DEALLOCATE MyCursor
                   
               END
     END
/*If no UniqueColName has been found THEN CREATE one UNLESS @CreateIdentColIfNeeded = 1*/

IF (@UniqueColName IS NULL) AND ( (@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0) )
     BEGIN
          /*Add a sequence COLUMN TO the table...*/

          IF NOT (@HostName IS NULL) PRINT 'Creating TEMPORARY IDENTITY COLUMN in the ' + @TableName + ' table named ' + @TempIdentColName + ' FOR USE in this ' + LOWER(@ActionText) + ' process...'

          EXEC('ALTER TABLE ' + @TableName + ' ADD ' + @TempIdentColName + ' [int] IDENTITY (1, 1)')

          SET @UniqueColName = @TempIdentColName
     END
/*============================================================================*/
/*======================EXECUTE DELETION OR SELECTION======================*/
/*===========================================================================*/
IF @UniqueColName IS NULL
     BEGIN
          /*No UniqueColName was provided BY the USER AND none were detected by the script. This deletion algorythm cannot run.*/

          IF NOT (@HostName IS NULL) PRINT 'Could not perform ' + LOWER(@ActionText) + ' process. No UNIQUE columns were located AND the UniqueColName flag is SET TO 1 (False).'

          SET @StoredProcedureResult = 3
          RETURN
     END
ELSE
     BEGIN
          IF NOT (@HostName IS NULL) PRINT 'Performing ' + LOWER(@ActionText) + ' utilizing the UNIQUE VALUES in the ' + @UniqueColName + ' COLUMN AS a reference...'

          /*
          CREATE AND EXECUTE an SQL statement IN the form of:
         
          SELECT * (or DELETE)
               FROM TableName WHERE UniqueColName IN

               (
               SELECT UniqueColName FROM TableName WHERE UniqueColName NOT IN
                    (
                    SELECT MAX(Cast(UniqueColName AS varchar(36))) FROM TableName GROUP BY DupQualifierFieldNameList, DupQualifierFieldNameList, etc

                    )
               )
          */
          /*Delete ALL duplicate records useing @UniqueColName AS a unique ID COLUMN */
          IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)

               SET @SQL = 'SELECT * '
          ELSE
               SET @SQL = 'DELETE '
          SET @SQL = @SQL + 'FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' IN '

          SET @SQL = @SQL + CHAR(13) + CHAR(9) + '(' + CHAR(13) + CHAR(9)
          SET @SQL = @SQL + 'SELECT ' + @UniqueColName + ' FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' NOT IN '

          SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + '(' + CHAR(13) + CHAR(9)+CHAR(9)
          SET @SQL = @SQL + 'SELECT MAX(Cast(' + @UniqueColName + ' AS varchar(36))) FROM '

          SET @SQL = @SQL + @TableName + ' GROUP BY ' + @DupQualifierFieldNameList
          SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + ')' + CHAR(13) + CHAR(9) + ')'

               
          EXEC (@SQL)
          IF @@ERROR <> 0
               BEGIN
                    IF NOT (@HostName IS NULL) PRINT @ActionText + ' process failed.'

                    SET @StoredProcedureResult = 3
               END
          ELSE
               BEGIN
                    IF NOT (@HostName IS NULL) PRINT @ActionText + ' completed successfully WITH this SQL: ' + CHAR(13) + @SQL

                    SET @StoredProcedureResult = 0
               END
     END
IF (@UniqueColName = @TempIdentColName) AND ( (@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0) )

     BEGIN
          /*Remove the sequence COLUMN FROM the table...*/
          IF NOT (@HostName IS NULL) PRINT 'Removing TEMPORARY IDENTITY COLUMN named ' + @TempIdentColName + ' FROM the ' + @TableName + ' table...'

          EXEC('ALTER TABLE ' + @TableName + ' DROP COLUMN ' + @TempIdentColName)
     END
GO