Find Duplicate Input With MySQL

Back to basic, let's brush up on some SQL :)

At my company we have employees creating customer accounts every day. Sometimes we make mistakes, for instance, we forget to check if the company already was a customer (maybe 10y ago they may have had a product).

Duplicate accounts can cause all sorts of problems, so I wanted way to detect them with SQL.

The problem was, the company names may have been entered with different punctuation, whitespace, etc. So I needed similar names to surface from the depths of our database, not just exact matches (that would have been too easy :)

For the solution I turned to SOUNDEX for fuzzy matching similar sounding company names, and then review the results myself (false positives are possible, but since they would be few, it becomes a simple task to doublecheck) and report back to our company.

I thought I'd share

partly because it could be useful to others (obviously this could be used to detect all kinds of user generated typos and similar entries);
mostly because I'm curious to find if there is a better (more performant) way to write this query.

Do you know how? Leave a comment :)

-- Select all the individual company names that have a
-- soundex_code that occurs more than once (I now use a subquery for that)
SELECT
  `id`,
  `customer_name`,
  SOUNDEX(`customer_name`) AS soundex_code
FROM `customers`
WHERE SOUNDEX(`customer_name`) IN (
  -- Subquery: select all soundex_codes that occur more than once,
  -- (this does not return the individual company names that share them)
  SELECT SOUNDEX(`customer_name`) AS soundex_code
  FROM `customers`
  WHERE 1 = 1
    AND `is_active` = 1
    -- More specific criteria to define who you want to compare
  GROUP BY soundex_code
  HAVING COUNT(*) > 1
)
ORDER BY
  soundex_code,
  `customer_name`

This e.g. returns:

`id` `customer_name`  `soundex_code`
 291  F.S. Hosting     F2352
 1509 FS hosting       F2352
 9331 R  Schmit        R253
 9332 R Schmit         R253

By the way: The SQL is formatted according to my old SQL Formatting blogpost. Exactly 4 years after I published it, I still find it a pretty useful code convention.