kvz.io
Published on

Find Duplicate Input With MySQL

Authors
  • avatar
    Name
    Kevin van Zonneveld
    Twitter
    @kvz

Back to basic, let's brush up on some SQL :)

At my company we have employees creating customer accounts every day. Sometimes we make mistakes, for instance, we forget to check if the company already was a customer (maybe 10y ago they may have had a product).

Duplicate accounts can cause all sorts of problems, so I wanted way to detect them with SQL.

The problem was, the company names may have been entered with different punctuation, whitespace, etc. So I needed similar names to surface from the depths of our database, not just exact matches (that would have been too easy :)

For the solution I turned to SOUNDEX for fuzzy matching similar sounding company names, and then review the results myself (false positives are possible, but since they would be few, it becomes a simple task to doublecheck) and report back to our company.

I thought I'd share

  • partly because it could be useful to others (obviously this could be used to detect all kinds of user generated typos and similar entries);
  • mostly because I'm curious to find if there is a better (more performant) way to write this query.

Do you know how? Leave a comment :)

-- Select all the individual company names that have a
-- soundex_code that occurs more than once (I now use a subquery for that)
SELECT
  `id`,
  `customer_name`,
  SOUNDEX(`customer_name`) AS soundex_code
FROM `customers`
WHERE SOUNDEX(`customer_name`) IN (
  -- Subquery: select all soundex_codes that occur more than once,
  -- (this does not return the individual company names that share them)
  SELECT SOUNDEX(`customer_name`) AS soundex_code
  FROM `customers`
  WHERE 1 = 1
    AND `is_active` = 1
    -- More specific criteria to define who you want to compare
  GROUP BY soundex_code
  HAVING COUNT(*) > 1
)
ORDER BY
  soundex_code,
  `customer_name`

This e.g. returns:

`id` `customer_name`  `soundex_code`
 291  F.S. Hosting     F2352
 1509 FS hosting       F2352
 9331 R  Schmit        R253
 9332 R Schmit         R253

By the way: The SQL is formatted according to my old SQL Formatting blogpost. Exactly 4 years after I published it, I still find it a pretty useful code convention.

Legacy Comments (4)

These comments were imported from the previous blog system (Disqus).

tersmitten
tersmitten·

Useful post. I've been struggling with the same problem in our application.

There's a minor bug in the GROUP BY clause. soundex should be soundex_code.

Kev van Zonneveld
Kev van Zonneveld·

Ah thanks for noticing, I fixed the post!

Robert Eisele
Robert Eisele·

If you want to rely on soundex, MySQL can't optimize this query too much, even if it's a deterministic function. From a complexity point of view, I would rewrite the query like so (dk if it's that much faster on your data set) on MySQL < 5.6. Beginning with 5.6, the optimizer can rewrite the whole thing on it's own:

SELECT
id,
customer_name,
SOUNDEX(customer_name) AS soundex_code
FROM customers A
JOIN (
-- Subquery: select all soundex_codes that occur more than once,
-- (this does not return the individual company names that share them)
SELECT SOUNDEX(customer_name) AS soundex_codeX
FROM customers
WHERE is_active = 1
-- More specific criteria to define who you want to compare
GROUP BY soundex_code
HAVING COUNT(*) > 1
) B ON SOUNDEX(A.customer_name) = soundex_codeX
ORDER BY
soundex_code,
customer_name;

Another thing you could try is using a temporary table where you add an index:
CREATE TEMPORARY TABLE customer_tmp AS
SELECT SOUNDEX(customer_name) AS soundex_codeX
FROM customers
WHERE is_active = 1
-- More specific criteria to define who you want to compare
GROUP BY soundex_code
HAVING COUNT(*) > 1;

-- Add the index

ALTER TABLE customer_tmp ADD INDEX(soundex_codeX);

-- Get the result
SELECT
id,
customer_name,
SOUNDEX(customer_name) AS soundex_code
FROM customers A
JOIN customer_tmp B ON SOUNDEX(A.customer_name) = soundex_codeX
ORDER BY
soundex_code,
customer_name;

Robert
http://www.xarg.org/

Kev van Zonneveld
Kev van Zonneveld·

Thanks for the insightful comment Robert, appreciated!