Data Consistency. I was tasked with finding variations of our default values. End-users were complaining that some default values get listed twice because their system was case-sensitive despite the collation of the SQL Server being case-insensitive.
What?
End-users said that they could see the options "Data Not Yet Available" as well as "Data not yet available". These are obviously the same values, so I should go through the database and change all variations to be typed the same way in the same case.
Wouldn't it be nice to search through your entire database for similar sounding values (so not just case), and in this example, find typos as well, eg. "Daat ont Ety Aviable".
How?
Mine's a bit of a copout as it is a two-step process. I wanted a query that searched every column containing strings in the database and we are somewhat limited in Sql Server 2008 R2 by 4000 characters in a variable (so I couldn't store it in a local variable). So the idea here is to run the below example which outputs a huge SQL query which you then copy & paste and run against your database.
Step 1 of 2
The following stored procedure will generate the full query to your output panel:
CREATE PROCEDURE [usp_GenerateSearchbySoundexQuery] ( @p_SearchString VARCHAR(max), @p_TableSchema VARCHAR(max) ) AS /***************************************************************************************************************** ** Author: Joel Lipman ** ** Created On: 12 October 2012 ** ** Revision: v1.2 ** ** Description: Script to create a SQL query which searches a database with SOUNDEX (detects typos). ** ** ** ** Usage: ** ** Step #1: EXEC [usp_GenerateSearchbySoundexQuery] 'mySearchString', 'tableSchemaToSearch'; ** ** Step #2: Run the query this generates against the database you want to search. ** ** ** ** Example: ** ** EXEC [usp_GenerateSearchbySoundexQuery] 'Not Available', 'dbo'; ** ** ** ** Return Values : Table of two columns: ** ** - Value: Your search term and matching variations ** ** - Source: The table_name.column_name it found this value (Maximum) ** ** ** ** Additional: ** ** - For performance, restrict this by schema otherwise occurrences in tables where the column is ** ** the referred to in a view, will appear twice: ** ** Change: ** ** WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext') ** ** To: ** ** WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext') ** ** AND TABLE_SCHEMA='dbo' ** ** ** ** - For performance, return results based on string length which is 1 greater or lesser than the original ** ** string value: ** ** Automatic: ** ** @minStringLength and @maxStringLength are calculated based on original value. ** ** ** ** Issues : ** ** Will pick up variations with trailing spaces but this is not easily recognized from the results ** ** ** ** Only finds the first table_name.column_name where the match occurred (ordered alphabetically). ** ** -> Example: if a table has two rows, both with the typo: ** ** Row1: EthnicCode DnYa ** ** Row2: RegionCode DnYa ** ** -> Only Row1 will display in column source as this is first alphabetically. ** ** Correct the value for EthnicCode and re-run the code. ** ** ** ** Link to other SQL Server collations: http://msdn.microsoft.com/en-us/library/ms180175.aspx ** ** ** ****************************************************************************************************************** ** Change History: ** ****************************************************************************************************************** ** Date: By: Rev: Desc: ** ** ------------------------------------------------------------------------------------------------------------ ** ** 11/10/2012 Jlipman 1.0 Created ** ** 12/10/2012 Jlipman 1.1 Merged into one query. ** ** Added schema to column_source. ** ** 15/10/2012 Jlipman 1.2 Added count column to final results. ** ** Added minStrLen and maxStrLen for more accurate matches. ** ** Changed order by MAX(column_source) to MIN(column_source). ** ** 26/10/2012 Jlipman 1.3 Changed into a Stored Procedure. ** ** ** *****************************************************************************************************************/ BEGIN -- Declare some variables to use DECLARE @SqlToExecute nvarchar(max); DECLARE @FullSqlToExecute VARCHAR(max); DECLARE @myCounter int; DECLARE @minStrLen int; DECLARE @maxStrLen int; -- Set the values of some variables SET @myCounter = 0; SET @minStrLen = LEN(@p_SearchString) - 1; SET @maxStrLen = LEN(@p_SearchString) + 1; -- Populate Cursor1 (used to generate final SQL query to run) DECLARE Cursor1 CURSOR FOR SELECT 'SELECT ' + COLUMN_NAME + ' COLLATE Latin1_General_CS_AS AS column_value, ''' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME + ''' AS column_source FROM ' + (TABLE_SCHEMA + '.' + TABLE_NAME) + ' WHERE SOUNDEX(' + COLUMN_NAME + ')=SOUNDEX('''+ @p_SearchString +''') AND LEN(' + COLUMN_NAME + ')=' + CAST(@minStrLen AS varchar(11)) FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext') AND TABLE_SCHEMA=@p_TableSchema ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME; PRINT ' SELECT DISTINCT t1.column_value COLLATE Latin1_General_CS_AS AS Value, MIN(t1.column_source) AS Source, COUNT(t1.column_value) AS Count FROM ( '; -- Iterate through cursor1 OPEN Cursor1 FETCH NEXT FROM Cursor1 INTO @SqlToExecute WHILE @@FETCH_STATUS = 0 BEGIN IF @myCounter=0 PRINT @SqlToExecute; ELSE PRINT ' union all ' + @SqlToExecute; SET @myCounter = @myCounter + 1; FETCH NEXT FROM Cursor1 INTO @SqlToExecute END CLOSE Cursor1; DEALLOCATE Cursor1; PRINT ' ) AS t1 GROUP BY column_value ORDER BY MIN(column_source)'; END
- CREATE PROCEDURE [usp_GenerateSearchbySoundexQuery]
- (
- @p_SearchString VARCHAR(max),
- @p_TableSchema VARCHAR(max)
- )
- AS
- /*****************************************************************************************************************
- ** Author: Joel Lipman **
- ** Created On: 12 October 2012 **
- ** Revision: v1.2 **
- ** Description: Script to create a SQL query which searches a database with SOUNDEX (detects typos). **
- ** **
- ** Usage: **
- ** Step #1: EXEC [usp_GenerateSearchbySoundexQuery] 'mySearchString', 'tableSchemaToSearch'; **
- ** Step #2: Run the query this generates against the database you want to search. **
- ** **
- ** Example: **
- ** EXEC [usp_GenerateSearchbySoundexQuery] 'Not Available', 'dbo'; **
- ** **
- ** Return Values : Table of two columns: **
- ** - Value: Your search term and matching variations **
- ** - Source: The table_name.column_name it found this value (Maximum) **
- ** **
- ** Additional: **
- ** - For performance, restrict this by schema otherwise occurrences in tables where the column is **
- ** the referred to in a view, will appear twice: **
- ** Change: **
- ** WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext') **
- ** To: **
- ** WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext') **
- ** AND TABLE_SCHEMA='dbo' **
- ** **
- ** - For performance, return results based on string length which is 1 greater or lesser than the original **
- ** string value: **
- ** Automatic: **
- ** @minStringLength and @maxStringLength are calculated based on original value. **
- ** **
- ** Issues : **
- ** Will pick up variations with trailing spaces but this is not easily recognized from the results **
- ** **
- ** Only finds the first table_name.column_name where the match occurred (ordered alphabetically). **
- ** -> Example: if a table has two rows, both with the typo: **
- ** Row1: EthnicCode DnYa **
- ** Row2: RegionCode DnYa **
- ** -> Only Row1 will display in column source as this is first alphabetically. **
- ** Correct the value for EthnicCode and re-run the code. **
- ** **
- ** Link to other SQL Server collations: http://msdn.microsoft.com/en-us/library/ms180175.aspx **
- ** **
- ******************************************************************************************************************
- ** Change History: **
- ******************************************************************************************************************
- ** Date: By: Rev: Desc: **
- ** ------------------------------------------------------------------------------------------------------------ **
- ** 11/10/2012 Jlipman 1.0 Created **
- ** 12/10/2012 Jlipman 1.1 Merged into one query. **
- ** Added schema to column_source. **
- ** 15/10/2012 Jlipman 1.2 Added count column to final results. **
- ** Added minStrLen and maxStrLen for more accurate matches. **
- ** Changed order by MAX(column_source) to MIN(column_source). **
- ** 26/10/2012 Jlipman 1.3 Changed into a Stored Procedure. **
- ** **
- *****************************************************************************************************************/
- BEGIN
- -- Declare some variables to use
- DECLARE @SqlToExecute nvarchar(max);
- DECLARE @FullSqlToExecute VARCHAR(max);
- DECLARE @myCounter int;
- DECLARE @minStrLen int;
- DECLARE @maxStrLen int;
- -- Set the values of some variables
- SET @myCounter = 0;
- SET @minStrLen = LEN(@p_SearchString) - 1;
- SET @maxStrLen = LEN(@p_SearchString) + 1;
- -- Populate Cursor1 (used to generate final SQL query to run)
- DECLARE Cursor1 CURSOR FOR
- SELECT
- 'SELECT
- ' + COLUMN_NAME + ' COLLATE Latin1_General_CS_AS AS column_value,
- ''' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME + ''' AS column_source
- FROM
- ' + (TABLE_SCHEMA + '.' + TABLE_NAME) + '
- WHERE
- SOUNDEX(' + COLUMN_NAME + ')=SOUNDEX('''+ @p_SearchString +''')
- AND LEN(' + COLUMN_NAME + ')=' + CAST(@minStrLen AS varchar(11))
- FROM
- INFORMATION_SCHEMA.COLUMNS
- WHERE
- DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')
- AND TABLE_SCHEMA=@p_TableSchema
- ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
- PRINT '
- SELECT DISTINCT
- t1.column_value COLLATE Latin1_General_CS_AS AS Value,
- MIN(t1.column_source) AS Source,
- COUNT(t1.column_value) AS Count
- FROM
- (
- ';
- -- Iterate through cursor1
- OPEN Cursor1
- FETCH NEXT FROM Cursor1 INTO @SqlToExecute
- WHILE @@FETCH_STATUS = 0
- BEGIN
- IF @myCounter=0
- PRINT @SqlToExecute;
- ELSE
- PRINT ' union all ' + @SqlToExecute;
- SET @myCounter = @myCounter + 1;
- FETCH NEXT FROM Cursor1 INTO @SqlToExecute
- END
- CLOSE Cursor1;
- DEALLOCATE Cursor1;
- PRINT '
- ) AS t1
- GROUP BY column_value
- ORDER BY
- MIN(column_source)';
- END
Step 2 of 2
You now have a big SELECT query which you can copy, paste and run against your SQL database. Simply run the query that the above outputted and you should get something like the following:
Value Source Count ------- ------------------------- -------- dnya myTable1.Title 2 DNya myTable1.OccupationCode 1 DnYa myTable2.FeePayerCode 1 dnYA myTable2.DisabilityCode 1 DyNa myTable3.EthnicityCode 1 DñYà myTable3.OriginalSurName 1 DNYA myTable4.RegionCode 4093 -- note how it picked up DyNa as matching DNYA. -- values it did NOT pick up were words like "dayname" and "dynamite".
- Value Source Count
- ------- ------------------------- --------
- dnya myTable1.Title 2
- DNya myTable1.OccupationCode 1
- DnYa myTable2.FeePayerCode 1
- dnYA myTable2.DisabilityCode 1
- DyNa myTable3.EthnicityCode 1
- DñYà myTable3.OriginalSurName 1
- DNYA myTable4.RegionCode 4093
- -- note how it picked up DyNa as matching DNYA.
- -- values it did NOT pick up were words like "dayname" and "dynamite".
Additional Notes
Our database and environment need to be in British English, the collation used in the example above is where the server is set to "Latin1_General_CI_AS" [General Latin, case-insensitive, accent-sensitive] and because we want our search in the example above to be case-sensitive, I need to add "COLLATE Latin1_General_CS_AS" after the column this affects.
More collations can be found by going to MSDN's SQL Server Collation Name page.
Limitations
- Trailing spaces or prefixing spaces are not obvious in the results, the Value may look the same but if you get your search result twice, this may be why.
Performance Average
- Run against a database of circa 30'000 students and their details (incl. courses ~ 700'000 records) took just under 2 minutes (found over 7 million occurrences similar to "DNYA").