Back to basic, let's brush up on some
At my company we have employees creating customer accounts every day. Sometimes we make mistakes, for instance, we forget to check if the company already was a customer (maybe 10y ago they may have had a product).
Duplicate accounts can cause all sorts of problems, so I wanted
way to detect them with
The problem was, the company names may have been entered with different punctuation, whitespace, etc. So I needed similar names to surface from the depths of our database, not just exact matches (that would have been too easy :)
For the solution I turned to SOUNDEX for fuzzy matching similar sounding company names, and then review the results myself (false positives are possible, but since they would be few, it becomes a simple task to doublecheck) and report back to our company.
I thought I'd share
- partly because it could be useful to others (obviously this could be used to detect all kinds of user generated typos and similar entries);
- mostly because I'm curious to find if there is a better (more performant) way to write this query.
Do you know how? Leave a comment :)
-- Select all the individual company names that have a -- soundex_code that occurs more than once (I now use a subquery for that) SELECT `id`, `customer_name`, SOUNDEX(`customer_name`) AS soundex_code FROM `customers` WHERE SOUNDEX(`customer_name`) IN ( -- Subquery: select all soundex_codes that occur more than once, -- (this does not return the individual company names that share them) SELECT SOUNDEX(`customer_name`) AS soundex_code FROM `customers` WHERE 1 = 1 AND `is_active` = 1 -- More specific criteria to define who you want to compare GROUP BY soundex_code HAVING COUNT(*) > 1 ) ORDER BY soundex_code, `customer_name`
This e.g. returns:
`id` `customer_name` `soundex_code` 291 F.S. Hosting F2352 1509 FS hosting F2352 9331 R Schmit R253 9332 R Schmit R253
By the way: The SQL is formatted according to my old SQL Formatting blogpost. Exactly 4 years after I published it, I still find it a pretty useful code convention.