0. The problem
When I hear the term fuzzy searching, I think of high computation cost, but this is not always the case. In the great book Art of Computer Programming by D. Knuth, vol. 3, ch. 6: Searching we can read about the soundex algorithm. It was originally used to deal with misspelled English surnames. The other usages could involve for example:
- Validation of user input data upon registration against the database. Maybe this user already exists, but they misspelled their name. Maybe there’s a typo in the name of the city, etc.
- Clearing out the meaning or finding the name during speech recognition. Some words are spelled similarly and it is easy to confuse them.
The soundex algorithm takes a string as an input and produces 4-character-long code as the result. It is obvious that this method is faster than computing the distance between each value and the reference. The question is – how faster? It turns out postgres already has a package for fuzzy searching, that includes everything we need. And even more…
I have used the sakila database to conduct the benchmarks. The customer table I wanted to use is small – 599 rows. To make it more significant I have created a new table and populated it with all the names cross-joined with the surnames from the customer table:
create table huge_cust ( id bigint primary key, first_name character varying, last_name character varying ); insert into huge_cust select nextval('customer_customer_id_seq'), c1.first_name, c2.last_name from customer c1 cross join customer c2;
Now the table huge_cust has 358801 rows, which is simply 599 squared.
To use the fuzzy search package we need to create an extension:
create extension if not exists fuzzystrmatch;
levenshtein methods the package contains implementation of yet another algorithm that turns any string into a code –
levenshtein method mentioned computes the Levenshtein distance between two strings fed as parameters.
There are two surnames with similar sound in our database: Burke and Bourque. What is the cost of finding the latter using the former considering mentioned functions? We can find out using
explain analyse select * from huge_cust where soundex(last_name) = soundex('BURKE'); explain analyse select * from huge_cust where metaphone(last_name, 5) = metaphone('BURKE', 5); explain analyse select * from huge_cust where levenshtein(last_name, 'BURKE') <= 3;
Notice several things:
metaphoneis parameterized, so the cost of the second query might differ depending on the length of the output (second parameter).
- Levenshtein distance between Burke and Bourque is 3 (2 insertions and 1 edit), so the threshold is minimal for the strings to be matched. The greater distance we set, the more fuzziness we can allow in our search. Of course this will impact the performance, because there are more operations to check.
- Both soundex and metaphone are independent from the pattern we are matching against. This gives us an opportunity to introduce an index if the performance of this query is critical:
create index if not exists cust_sound on huge_cust (soundex(last_name));
Having these points in mind I have benchmarked the queries and the results are as follows:
|Method||Execution time [ms]|
|soundex||47.004 ± 2.477|
|metaphone (length 5)||49.011 ± 0.674|
|metaphone (length 10)||48.615 ± 1.874|
|levenshtein (distance 3)||95.896 ± 0.655|
|levenshtein (distance 5)||111.059 ± 7.840|
|soundex (with index)||2.864 ± 0.197|
3. Other databases
Soundex algorithm is implemented in many relational databases, for example Oracle, DB2, MySQL, MariaDB, SQL Server and so on. Oracle provides Levenshtein distance implementation with UTL_MATCH package, DB2 provides all presented implementations out of the box. As of this writing I have not found other built-in implementations among the databases mentioned.
4. Experimental setup
The benchmarks were conducted 10 times for each query and the results were averaged. The standard deviation was used as the error measurement.
The results of using
metaphone with index would be very similar to those of using
soundex with index, hence they are not presented in the table.
I am running postgres 13 in a docker container.
If we have the use case presented in introduction, we can consider using functions like
soundex to perform fuzzy searching. There are several arguments for that:
metaphone) is faster than fuzzy searching by an order of magnitude. Combined with index, we can achieve two orders of magnitude faster query. This is not possible for methods like computing Levenshtein distance.
soundexis a built-in function in many databases. No additional packages or implementations are required.
Of course these methods are not as powerful as Levenshtein distance, the most apparent drawbacks include:
- Comparison of only single words.
soundexresult code consists of the first letter of the word followed by three digits. This means, that for all the words starting with a given letter, there are only 1000 distinct results possible. We can run into collisions, for example Burke – Brooks and it is up to us whether they are allowed and if not – how to handle them. There is nothing against combining multiple functions like
metaphonefor a better match.