- Published on
Find Duplicate Input With MySQL
- Authors

- Name
- Kevin van Zonneveld
- @kvz
Back to basic, let's brush up on some SQL :)
At my company we have employees creating customer accounts every day. Sometimes we make mistakes, for instance, we forget to check if the company already was a customer (maybe 10y ago they may have had a product).
Duplicate accounts can cause all sorts of problems, so I wanted
way to detect them with SQL.
The problem was, the company names may have been entered with different punctuation, whitespace, etc. So I needed similar names to surface from the depths of our database, not just exact matches (that would have been too easy :)
For the solution I turned to SOUNDEX for fuzzy matching similar sounding company names, and then review the results myself (false positives are possible, but since they would be few, it becomes a simple task to doublecheck) and report back to our company.
I thought I'd share
- partly because it could be useful to others (obviously this could be used to detect all kinds of user generated typos and similar entries);
- mostly because I'm curious to find if there is a better (more performant) way to write this query.
Do you know how? Leave a comment :)
-- Select all the individual company names that have a
-- soundex_code that occurs more than once (I now use a subquery for that)
SELECT
`id`,
`customer_name`,
SOUNDEX(`customer_name`) AS soundex_code
FROM `customers`
WHERE SOUNDEX(`customer_name`) IN (
-- Subquery: select all soundex_codes that occur more than once,
-- (this does not return the individual company names that share them)
SELECT SOUNDEX(`customer_name`) AS soundex_code
FROM `customers`
WHERE 1 = 1
AND `is_active` = 1
-- More specific criteria to define who you want to compare
GROUP BY soundex_code
HAVING COUNT(*) > 1
)
ORDER BY
soundex_code,
`customer_name`
This e.g. returns:
`id` `customer_name` `soundex_code`
291 F.S. Hosting F2352
1509 FS hosting F2352
9331 R Schmit R253
9332 R Schmit R253
By the way: The SQL is formatted according to my old SQL Formatting blogpost. Exactly 4 years after I published it, I still find it a pretty useful code convention.
Legacy Comments (4)
These comments were imported from the previous blog system (Disqus).
Useful post. I've been struggling with the same problem in our application.
There's a minor bug in the GROUP BY clause. soundex should be soundex_code.
Ah thanks for noticing, I fixed the post!
If you want to rely on soundex, MySQL can't optimize this query too much, even if it's a deterministic function. From a complexity point of view, I would rewrite the query like so (dk if it's that much faster on your data set) on MySQL < 5.6. Beginning with 5.6, the optimizer can rewrite the whole thing on it's own:
SELECT
id,
customer_name,
SOUNDEX(customer_name) AS soundex_code
FROM customers A
JOIN (
-- Subquery: select all soundex_codes that occur more than once,
-- (this does not return the individual company names that share them)
SELECT SOUNDEX(customer_name) AS soundex_codeX
FROM customers
WHERE is_active = 1
-- More specific criteria to define who you want to compare
GROUP BY soundex_code
HAVING COUNT(*) > 1
) B ON SOUNDEX(A.customer_name) = soundex_codeX
ORDER BY
soundex_code,
customer_name;
Another thing you could try is using a temporary table where you add an index:
CREATE TEMPORARY TABLE customer_tmp AS
SELECT SOUNDEX(customer_name) AS soundex_codeX
FROM customers
WHERE is_active = 1
-- More specific criteria to define who you want to compare
GROUP BY soundex_code
HAVING COUNT(*) > 1;
-- Add the index
ALTER TABLE customer_tmp ADD INDEX(soundex_codeX);
-- Get the result
SELECT
id,
customer_name,
SOUNDEX(customer_name) AS soundex_code
FROM customers A
JOIN customer_tmp B ON SOUNDEX(A.customer_name) = soundex_codeX
ORDER BY
soundex_code,
customer_name;
Robert
http://www.xarg.org/
Thanks for the insightful comment Robert, appreciated!