Search a database with SOUNDEX

Why?
Data Consistency. I was tasked with finding variations of our default values. End-users were complaining that some default values get listed twice because their system was case-sensitive despite the collation of the SQL Server being case-insensitive.

What?
End-users said that they could see the options "Data Not Yet Available" as well as "Data not yet available". These are obviously the same values, so I should go through the database and change all variations to be typed the same way in the same case.

Wouldn't it be nice to search through your entire database for similar sounding values (so not just case), and in this example, find typos as well, eg. "Daat ont Ety Aviable".

How?
Mine's a bit of a copout as it is a two-step process. I wanted a query that searched every column containing strings in the database and we are somewhat limited in Sql Server 2008 R2 by 4000 characters in a variable (so I couldn't store it in a local variable). So the idea here is to run the below example which outputs a huge SQL query which you then copy & paste and run against your database.

Step 1 of 2
The following stored procedure will generate the full query to your output panel:
copyraw
CREATE PROCEDURE [usp_GenerateSearchbySoundexQuery]
(
          @p_SearchString VARCHAR(max),
          @p_TableSchema VARCHAR(max)
)
AS
/*****************************************************************************************************************
** Author:      Joel Lipman                                                                                     **
** Created On:  12 October 2012                                                                                 **
** Revision:    v1.2                                                                                            **
** Description: Script to create a SQL query which searches a database with SOUNDEX (detects typos).            **
**                                                                                                              **
** Usage:                                                                                                       **
**    Step #1:      EXEC [usp_GenerateSearchbySoundexQuery] 'mySearchString', 'tableSchemaToSearch';            **
**    Step #2:      Run the query this generates against the database you want to search.                       **
**                                                                                                              **
** Example:                                                                                                     **
**    EXEC [usp_GenerateSearchbySoundexQuery] 'Not Available', 'dbo';                                           **
**                                                                                                              **
** Return Values : Table of two columns:                                                                        **
**    - Value: Your search term and matching variations                                                         **
**    - Source: The table_name.column_name it found this value (Maximum)                                        **
**                                                                                                              **
** Additional:                                                                                                  **
**      - For performance, restrict this by schema otherwise occurrences in tables where the column is          **
**      the referred to in a view, will appear twice:                                                           **
**           Change:                                                                                            **
**                 WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')                 **
**           To:                                                                                                **
**                 WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')                 **
**                                 AND TABLE_SCHEMA='dbo'                                                       **
**                                                                                                              **
**      - For performance, return results based on string length which is 1 greater or lesser than the original **
**      string value:                                                                                           **
**           Automatic:                                                                                         **
**                 @minStringLength and @maxStringLength are calculated based on original value.                **
**                                                                                                              **
** Issues :                                                                                                     **
**       Will pick up variations with trailing spaces but this is not easily recognized from the results        **
**                                                                                                              **
**       Only finds the first table_name.column_name where the match occurred (ordered alphabetically).         **
**           -> Example: if a table has two rows, both with the typo:                                           **
**                 Row1:   EthnicCode      DnYa                                                                 **
**                 Row2:   RegionCode      DnYa                                                                 **
**                 -> Only Row1 will display in column source as this is first alphabetically.                  **
**                    Correct the value for EthnicCode and re-run the code.                                     **
**                                                                                                              **
**       Link to other SQL Server collations: http://msdn.microsoft.com/en-us/library/ms180175.aspx             **
**                                                                                                              **
******************************************************************************************************************
** Change History:                                                                                              **
******************************************************************************************************************
** Date:                By:             Rev:        Desc:                                                       **
** ------------------------------------------------------------------------------------------------------------ **
** 11/10/2012           Jlipman         1.0         Created                                                     **
** 12/10/2012           Jlipman         1.1         Merged into one query.                                      **
**                                                  Added schema to column_source.                              **
** 15/10/2012           Jlipman         1.2         Added count column to final results.                        **
**                                                  Added minStrLen and maxStrLen for more accurate matches.    **
**                                                  Changed order by MAX(column_source) to MIN(column_source).  **
** 26/10/2012           Jlipman         1.3         Changed into a Stored Procedure.                            **
**                                                                                                              **
*****************************************************************************************************************/

BEGIN   

-- Declare some variables to use
DECLARE @SqlToExecute nvarchar(max);
DECLARE @FullSqlToExecute VARCHAR(max);
DECLARE @myCounter int;
DECLARE @minStrLen int;
DECLARE @maxStrLen int;

-- Set the values of some variables
SET @myCounter = 0;
SET @minStrLen = LEN(@p_SearchString) - 1;
SET @maxStrLen = LEN(@p_SearchString) + 1;

-- Populate Cursor1 (used to generate final SQL query to run)
DECLARE Cursor1 CURSOR FOR
        SELECT
                'SELECT 
                        ' + COLUMN_NAME + ' COLLATE Latin1_General_CS_AS AS column_value, 
                        ''' + TABLE_SCHEMA + '.' + TABLE_NAME + '.' + COLUMN_NAME + ''' AS column_source
                FROM 
                        ' + (TABLE_SCHEMA + '.' + TABLE_NAME) + ' 
                WHERE 
                        SOUNDEX(' + COLUMN_NAME + ')=SOUNDEX('''+ @p_SearchString +''')
                        AND LEN(' + COLUMN_NAME + ')=' + CAST(@minStrLen AS varchar(11))
        FROM
                INFORMATION_SCHEMA.COLUMNS
        WHERE
                DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')
                AND TABLE_SCHEMA=@p_TableSchema  
        ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;

PRINT '
                SELECT DISTINCT 
                        t1.column_value COLLATE Latin1_General_CS_AS AS Value, 
                        MIN(t1.column_source) AS Source,
                        COUNT(t1.column_value) AS Count
                FROM
                (
                        ';
                        


-- Iterate through cursor1
OPEN Cursor1
FETCH NEXT FROM Cursor1 INTO @SqlToExecute
WHILE @@FETCH_STATUS = 0
BEGIN
                IF @myCounter=0
                        PRINT @SqlToExecute;        
                ELSE
                        PRINT ' union all ' + @SqlToExecute;        
        SET @myCounter = @myCounter + 1;
    FETCH NEXT FROM Cursor1 INTO @SqlToExecute
END
CLOSE Cursor1;
DEALLOCATE Cursor1;

PRINT '
        ) AS t1 
        GROUP BY column_value 
        ORDER BY 
                MIN(column_source)';


END
  1.  CREATE PROCEDURE [usp_GenerateSearchbySoundexQuery] 
  2.  ( 
  3.            @p_SearchString VARCHAR(max), 
  4.            @p_TableSchema VARCHAR(max) 
  5.  ) 
  6.  AS 
  7.  /***************************************************************************************************************** 
  8.  ** Author:      Joel Lipman                                                                                     ** 
  9.  ** Created On:  12 October 2012                                                                                 ** 
  10.  ** Revision:    v1.2                                                                                            ** 
  11.  ** Description: Script to create a SQL query which searches a database with SOUNDEX (detects typos).            ** 
  12.  **                                                                                                              ** 
  13.  ** Usage:                                                                                                       ** 
  14.  **    Step #1:      EXEC [usp_GenerateSearchbySoundexQuery] 'mySearchString', 'tableSchemaToSearch';            ** 
  15.  **    Step #2:      Run the query this generates against the database you want to search.                       ** 
  16.  **                                                                                                              ** 
  17.  ** Example:                                                                                                     ** 
  18.  **    EXEC [usp_GenerateSearchbySoundexQuery] 'Not Available', 'dbo';                                           ** 
  19.  **                                                                                                              ** 
  20.  ** Return Values : Table of two columns:                                                                        ** 
  21.  **    - Value: Your search term and matching variations                                                         ** 
  22.  **    - Source: The table_name.column_name it found this value (Maximum)                                        ** 
  23.  **                                                                                                              ** 
  24.  ** Additional:                                                                                                  ** 
  25.  **      - For performance, restrict this by schema otherwise occurrences in tables where the column is          ** 
  26.  **      the referred to in a view, will appear twice:                                                           ** 
  27.  **           Change:                                                                                            ** 
  28.  **                 WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')                 ** 
  29.  **           To:                                                                                                ** 
  30.  **                 WHERE DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext')                 ** 
  31.  **                                 AND TABLE_SCHEMA='dbo'                                                       ** 
  32.  **                                                                                                              ** 
  33.  **      - For performance, return results based on string length which is 1 greater or lesser than the original ** 
  34.  **      string value:                                                                                           ** 
  35.  **           Automatic:                                                                                         ** 
  36.  **                 @minStringLength and @maxStringLength are calculated based on original value.                ** 
  37.  **                                                                                                              ** 
  38.  ** Issues :                                                                                                     ** 
  39.  **       Will pick up variations with trailing spaces but this is not easily recognized from the results        ** 
  40.  **                                                                                                              ** 
  41.  **       Only finds the first table_name.column_name where the match occurred (ordered alphabetically).         ** 
  42.  **           -> Example: if a table has two rows, both with the typo:                                           ** 
  43.  **                 Row1:   EthnicCode      DnYa                                                                 ** 
  44.  **                 Row2:   RegionCode      DnYa                                                                 ** 
  45.  **                 -> Only Row1 will display in column source as this is first alphabetically.                  ** 
  46.  **                    Correct the value for EthnicCode and re-run the code.                                     ** 
  47.  **                                                                                                              ** 
  48.  **       Link to other SQL Server collations: http://msdn.microsoft.com/en-us/library/ms180175.aspx             ** 
  49.  **                                                                                                              ** 
  50.  ****************************************************************************************************************** 
  51.  ** Change History:                                                                                              ** 
  52.  ****************************************************************************************************************** 
  53.  ** Date:                By:             Rev:        Desc:                                                       ** 
  54.  ** ------------------------------------------------------------------------------------------------------------ ** 
  55.  ** 11/10/2012           Jlipman         1.0         Created                                                     ** 
  56.  ** 12/10/2012           Jlipman         1.1         Merged into one query.                                      ** 
  57.  **                                                  Added schema to column_source.                              ** 
  58.  ** 15/10/2012           Jlipman         1.2         Added count column to final results.                        ** 
  59.  **                                                  Added minStrLen and maxStrLen for more accurate matches.    ** 
  60.  **                                                  Changed order by MAX(column_source) to MIN(column_source).  ** 
  61.  ** 26/10/2012           Jlipman         1.3         Changed into a Stored Procedure.                            ** 
  62.  **                                                                                                              ** 
  63.  *****************************************************************************************************************/ 
  64.   
  65.  BEGIN 
  66.   
  67.  -- Declare some variables to use 
  68.  DECLARE @SqlToExecute nvarchar(max)
  69.  DECLARE @FullSqlToExecute VARCHAR(max)
  70.  DECLARE @myCounter int; 
  71.  DECLARE @minStrLen int; 
  72.  DECLARE @maxStrLen int; 
  73.   
  74.  -- Set the values of some variables 
  75.  SET @myCounter = 0
  76.  SET @minStrLen = LEN(@p_SearchString) - 1
  77.  SET @maxStrLen = LEN(@p_SearchString) + 1
  78.   
  79.  -- Populate Cursor1 (used to generate final SQL query to run) 
  80.  DECLARE Cursor1 CURSOR FOR 
  81.          SELECT 
  82.                  'SELECT 
  83.                          + COLUMN_NAME + ' COLLATE Latin1_General_CS_AS AS column_value, 
  84.                          ''+ TABLE_SCHEMA + '.+ TABLE_NAME + '.+ COLUMN_NAME + ''' AS column_source 
  85.                  FROM 
  86.                          + (TABLE_SCHEMA + '.+ TABLE_NAME) + ' 
  87.                  WHERE 
  88.                          SOUNDEX(+ COLUMN_NAME + ')=SOUNDEX('''+ @p_SearchString +''') 
  89.                          AND LEN(+ COLUMN_NAME + ')=' + CAST(@minStrLen AS varchar(11)) 
  90.          FROM 
  91.                  INFORMATION_SCHEMA.COLUMNS 
  92.          WHERE 
  93.                  DATA_TYPE IN ('char', 'nchar', 'varchar', 'nvarchar', 'text', 'ntext') 
  94.                  AND TABLE_SCHEMA=@p_TableSchema 
  95.          ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME; 
  96.   
  97.  PRINT ' 
  98.                  SELECT DISTINCT 
  99.                          t1.column_value COLLATE Latin1_General_CS_AS AS Value, 
  100.                          MIN(t1.column_source) AS Source, 
  101.                          COUNT(t1.column_value) AS Count 
  102.                  FROM 
  103.                  ( 
  104.                          '
  105.   
  106.   
  107.   
  108.  -- Iterate through cursor1 
  109.  OPEN Cursor1 
  110.  FETCH NEXT FROM Cursor1 INTO @SqlToExecute 
  111.  WHILE @@FETCH_STATUS = 0 
  112.  BEGIN 
  113.                  IF @myCounter=0 
  114.                          PRINT @SqlToExecute; 
  115.                  ELSE 
  116.                          PRINT ' union all ' + @SqlToExecute; 
  117.          SET @myCounter = @myCounter + 1
  118.      FETCH NEXT FROM Cursor1 INTO @SqlToExecute 
  119.  END 
  120.  CLOSE Cursor1; 
  121.  DEALLOCATE Cursor1; 
  122.   
  123.  PRINT ' 
  124.          ) AS t1 
  125.          GROUP BY column_value 
  126.          ORDER BY 
  127.                  MIN(column_source)'
  128.   
  129.   
  130.  END 

Step 2 of 2
You now have a big SELECT query which you can copy, paste and run against your SQL database. Simply run the query that the above outputted and you should get something like the following:
copyraw
Value   Source                    Count
------- ------------------------- --------
dnya	myTable1.Title            2
DNya	myTable1.OccupationCode   1 
DnYa	myTable2.FeePayerCode     1
dnYA	myTable2.DisabilityCode   1
DyNa	myTable3.EthnicityCode    1
DñYà	myTable3.OriginalSurName  1
DNYA	myTable4.RegionCode       4093

-- note how it picked up DyNa as matching DNYA.
-- values it did NOT pick up were words like "dayname" and "dynamite".
  1.  Value   Source                    Count 
  2.  ------- ------------------------- -------- 
  3.  dnya    myTable1.Title            2 
  4.  DNya    myTable1.OccupationCode   1 
  5.  DnYa    myTable2.FeePayerCode     1 
  6.  dnYA    myTable2.DisabilityCode   1 
  7.  DyNa    myTable3.EthnicityCode    1 
  8.  DñYà    myTable3.OriginalSurName  1 
  9.  DNYA    myTable4.RegionCode       4093 
  10.   
  11.  -- note how it picked up DyNa as matching DNYA. 
  12.  -- values it did NOT pick up were words like "dayname" and "dynamite". 


Additional Notes
Our database and environment need to be in British English, the collation used in the example above is where the server is set to "Latin1_General_CI_AS" [General Latin, case-insensitive, accent-sensitive] and because we want our search in the example above to be case-sensitive, I need to add "COLLATE Latin1_General_CS_AS" after the column this affects.

More collations can be found by going to MSDN's SQL Server Collation Name page.

Limitations
  1. Trailing spaces or prefixing spaces are not obvious in the results, the Value may look the same but if you get your search result twice, this may be why.


Performance Average
- Run against a database of circa 30'000 students and their details (incl. courses ~ 700'000 records) took just under 2 minutes (found over 7 million occurrences similar to "DNYA").
Category: Databases :: Article: 439

Credit where Credit is Due:


Feel free to copy, redistribute and share this information. All that we ask is that you attribute credit and possibly even a link back to this website as it really helps in our search engine rankings.

Disclaimer: Please note that the information provided on this website is intended for informational purposes only and does not represent a warranty. The opinions expressed are those of the author only. We recommend testing any solutions in a development environment before implementing them in production. The articles are based on our good faith efforts and were current at the time of writing, reflecting our practical experience in a commercial setting.

Thank you for visiting and, as always, we hope this website was of some use to you!

Kind Regards,

Joel Lipman
www.joellipman.com

Related Articles

Joes Revolver Map

Accreditation

Badge - Certified Zoho Creator Associate
Badge - Certified Zoho Creator Associate

Donate & Support

If you like my content, and would like to support this sharing site, feel free to donate using a method below:

Paypal:
Donate to Joel Lipman via PayPal

Bitcoin:
Donate to Joel Lipman with Bitcoin bc1qf6elrdxc968h0k673l2djc9wrpazhqtxw8qqp4

Ethereum:
Donate to Joel Lipman with Ethereum 0xb038962F3809b425D661EF5D22294Cf45E02FebF
© 2024 Joel Lipman .com. All Rights Reserved.