Poor man’s fuzzy search

0. The problem

When I hear the term fuzzy searching, I think of high computation cost, but this is not always the case. In the great book Art of Computer Programming by D. Knuth, vol. 3, ch. 6: Searching we can read about the soundex algorithm. It was originally used to deal with misspelled English surnames. The other usages could involve for example:

  • Validation of user input data upon registration against the database. Maybe this user already exists, but they misspelled their name. Maybe there’s a typo in the name of the city, etc.
  • Clearing out the meaning or finding the name during speech recognition. Some words are spelled similarly and it is easy to confuse them.

1. Soundex

The soundex algorithm takes a string as an input and produces 4-character-long code as the result. It is obvious that this method is faster than computing the distance between each value and the reference. The question is – how faster? It turns out postgres already has a package for fuzzy searching, that includes everything we need. And even more…

2. Benchmark

I have used the sakila database to conduct the benchmarks. The customer table I wanted to use is small – 599 rows. To make it more significant I have created a new table and populated it with all the names cross-joined with the surnames from the customer table:

create table huge_cust
(
    id         bigint primary key,
    first_name character varying,
    last_name  character varying
);
insert into huge_cust
select nextval('customer_customer_id_seq'), c1.first_name, c2.last_name
from customer c1
         cross join customer c2;

Now the table huge_cust has 358801 rows, which is simply 599 squared.

To use the fuzzy search package we need to create an extension:

create extension if not exists fuzzystrmatch;

Besides soundex and levenshtein methods the package contains implementation of yet another algorithm that turns any string into a code – metaphone. The levenshtein method mentioned computes the Levenshtein distance between two strings fed as parameters.

There are two surnames with similar sound in our database: Burke and Bourque. What is the cost of finding the latter using the former considering mentioned functions? We can find out using explain analyse:

explain analyse
select *
from huge_cust
where soundex(last_name) = soundex('BURKE');

explain analyse
select *
from huge_cust
where metaphone(last_name, 5) = metaphone('BURKE', 5);

explain analyse
select *
from huge_cust
where levenshtein(last_name, 'BURKE') <= 3;

Notice several things:

  • metaphone is parameterized, so the cost of the second query might differ depending on the length of the output (second parameter).
  • Levenshtein distance between Burke and Bourque is 3 (2 insertions and 1 edit), so the threshold is minimal for the strings to be matched. The greater distance we set, the more fuzziness we can allow in our search. Of course this will impact the performance, because there are more operations to check.
  • Both soundex and metaphone are independent from the pattern we are matching against. This gives us an opportunity to introduce an index if the performance of this query is critical:
create index 
if not exists 
cust_sound 
on huge_cust (soundex(last_name));

Having these points in mind I have benchmarked the queries and the results are as follows:

MethodExecution time [ms]
soundex47.004 ± 2.477
metaphone (length 5)49.011 ± 0.674
metaphone (length 10)48.615 ± 1.874
levenshtein (distance 3)95.896 ± 0.655
levenshtein (distance 5)111.059 ± 7.840
soundex (with index)2.864 ± 0.197
Execution times of different fuzzy search functions

3. Other databases

Soundex algorithm is implemented in many relational databases, for example Oracle, DB2, MySQL, MariaDB, SQL Server and so on. Oracle provides Levenshtein distance implementation with UTL_MATCH package, DB2 provides all presented implementations out of the box. As of this writing I have not found other built-in implementations among the databases mentioned.

4. Experimental setup

The benchmarks were conducted 10 times for each query and the results were averaged. The standard deviation was used as the error measurement.

The results of using metaphone with index would be very similar to those of using soundex with index, hence they are not presented in the table.

I am running postgres 13 in a docker container.

5. Conclusions

If we have the use case presented in introduction, we can consider using functions like soundex to perform fuzzy searching. There are several arguments for that:

  • soundex (and metaphone) is faster than fuzzy searching by an order of magnitude. Combined with index, we can achieve two orders of magnitude faster query. This is not possible for methods like computing Levenshtein distance.
  • soundex is a built-in function in many databases. No additional packages or implementations are required.

Of course these methods are not as powerful as Levenshtein distance, the most apparent drawbacks include:

  • Comparison of only single words.
  • soundex result code consists of the first letter of the word followed by three digits. This means, that for all the words starting with a given letter, there are only 1000 distinct results possible. We can run into collisions, for example Burke – Brooks and it is up to us whether they are allowed and if not – how to handle them. There is nothing against combining multiple functions like soundex and metaphone for a better match.

3 thoughts on “Poor man’s fuzzy search

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website at WordPress.com
Get started