When I hear the term *fuzzy searching*, I think of high computation cost, but this is not always the case. In the great book *The Art of Computer Programming* by D. Knuth (vol. 3, ch. 6: *Searching*) we can read about the **soundex** algorithm. It was originally used to deal with misspelled English surnames. Other uses include, for example:

- Validating user input against the database upon registration. Maybe the user already exists but misspelled their name, or there is a typo in the name of the city, etc.
- Disambiguating words or finding names during speech recognition. Some words sound similar and are easy to confuse.

The soundex algorithm takes a string as input and produces a four-character code as the result. Comparing such codes is obviously **faster** than computing the distance between each value and the reference. The question is: how much faster? It turns out Postgres already has a package for fuzzy searching that includes everything we need, and even more.

I have used the Sakila database to conduct the benchmarks. The *customer* table I wanted to use is small – 599 rows. To make the results more significant, I created a new table and populated it with all the first names cross-joined with the last names from the *customer* table:

```sql
create table huge_cust (
  id bigint primary key,
  first_name character varying,
  last_name character varying
);

insert into huge_cust
select nextval('customer_customer_id_seq'), c1.first_name, c2.last_name
from customer c1
cross join customer c2;
```

Now the table *huge_cust* has 358801 rows, which is simply 599 squared.

To use the fuzzy search package we need to create an extension:

```sql
create extension if not exists fuzzystrmatch;
```
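As a quick illustration of what these functions return (a sketch; the soundex codes follow from the coding rules discussed later in this post):

```sql
-- Similar-sounding surnames collapse to the same soundex code:
select soundex('Burke');    -- B620
select soundex('Bourque');  -- B620
-- levenshtein measures the edit distance instead:
select levenshtein('Burke', 'Bourque');  -- 3
```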

Besides `soundex` and `levenshtein`, the package contains an implementation of yet another algorithm that turns any string into a code: `metaphone`. The `levenshtein` function computes the Levenshtein distance between the two strings passed as parameters.

There are two similar-sounding surnames in our database: Burke and Bourque. What is the cost of finding the latter using the former with each of the mentioned functions? We can find out using `explain analyse`:

```sql
explain analyse select * from huge_cust
where soundex(last_name) = soundex('BURKE');

explain analyse select * from huge_cust
where metaphone(last_name, 5) = metaphone('BURKE', 5);

explain analyse select * from huge_cust
where levenshtein(last_name, 'BURKE') <= 3;
```

Notice several things:

- `metaphone` is **parameterized**, so the cost of the second query might differ depending on the length of the output (the second parameter).
- The Levenshtein distance between Burke and Bourque is 3 (2 insertions and 1 substitution), so this threshold is the minimum for the strings to be matched. **The greater the distance** we set, **the more fuzziness** we allow in our search. Of course, this impacts performance, because there are more operations to check.
- Both `soundex` and `metaphone` are **independent** of the pattern we are matching against. This gives us an opportunity to introduce an **index** if the performance of this query is critical:

```sql
create index if not exists cust_sound on huge_cust (soundex(last_name));
```

With these points in mind, I have benchmarked the queries; the results are as follows:

| Method | Execution time [ms] |
|---|---|
| soundex | 47.004 ± 2.477 |
| metaphone (length 5) | 49.011 ± 0.674 |
| metaphone (length 10) | 48.615 ± 1.874 |
| levenshtein (distance 3) | 95.896 ± 0.655 |
| levenshtein (distance 5) | 111.059 ± 7.840 |
| soundex (with index) | 2.864 ± 0.197 |

The soundex algorithm is implemented in many relational databases, for example Oracle, DB2, MySQL, MariaDB and SQL Server. Oracle provides a Levenshtein distance implementation in the UTL_MATCH package, and DB2 provides all the presented implementations out of the box. As of this writing, I have not found other built-in implementations among the databases mentioned.

The benchmarks were conducted 10 times for each query and the results were averaged. The standard deviation was used as the error measurement.

The results of using `metaphone` with an index would be very similar to those of `soundex` with an index, hence they are not presented in the table.

I am running Postgres 13 in a Docker container.

If we have a use case like the ones presented in the introduction, we can consider using functions like `soundex` to perform fuzzy searching. There are several arguments for that:

- `soundex` (and `metaphone`) is **faster** than distance-based fuzzy searching by an **order of magnitude**. Combined with an **index**, we can achieve a query that is **two orders of magnitude** faster. This is not possible for methods like computing the Levenshtein distance.
- `soundex` is a **built-in** function in many databases. No additional packages or implementations are required.

Of course, these methods are not as powerful as the Levenshtein distance; the most apparent drawbacks include:

- Comparison of only **single** words.
- A `soundex` result code consists of the first letter of the word followed by three digits. This means that for all the words starting with a given letter, there are only 1000 distinct results possible. We can run into **collisions**, for example Burke – Brooks, and it is up to us whether they are allowed and, if not, how to handle them.

There is nothing against combining multiple functions like `soundex` and `metaphone` for a better match.
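Combining the two codes could look like this sketch (reusing the `huge_cust` table from the benchmarks):

```sql
-- Both phonetic codes must agree; Burke and Brooks share the soundex
-- code B620 but differ in metaphone, so this filters that collision out.
select *
from huge_cust
where soundex(last_name) = soundex('BURKE')
  and metaphone(last_name, 5) = metaphone('BURKE', 5);
```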

But let us start with a slightly different thing. Once upon a time there was a question: what is the difference between `? extends Sometype` and `T extends Sometype`? For example, `<? extends Number>` and `<T extends Number>` as type parameters of a method. The answers given there are correct, but they can be a bit misleading, especially when they indicate that in some cases the two are equivalent.

Coming back to the original question, the confusion arises when we want to run the following code:

```java
static void doesntCompile(Map<Integer, List<? extends Number>> map) {}

static <T extends Number> void compiles(Map<Integer, List<T>> map) {}

static void function(List<? extends Number> outer) {
    doesntCompile(new HashMap<Integer, List<Integer>>());
    compiles(new HashMap<Integer, List<Integer>>());
}
```

It fails with the following error:

```
Example.java:9: error: incompatible types: HashMap<Integer,List<Integer>> cannot be converted to Map<Integer,List<? extends Number>>
        doesntCompile(new HashMap<Integer, List<Integer>>());
                      ^
```

As always with this kind of question, the answer can be found in the JLS. The `doesntCompile` method is easy to explain. In JLS §4.5.1 we read that:

> A type argument T1 is said to *contain* another type argument T2, written T2 `<=` T1, if the set of types denoted by T2 is provably a subset of the set of types denoted by T1 under the reflexive and transitive closure of the following rules (where `<:` denotes subtyping (§4.10)):

> - `? extends T <= ? extends S` if `T <: S`
> - `? extends T <= ?`
> - `? super T <= ? super S` if `S <: T`
> - `? super T <= ?`
> - `? super T <= ? extends Object`
> - `T <= T`
> - `T <= ? extends T`
> - `T <= ? super T`

This means that `? extends Number` indeed contains `Integer`, and even, keeping in mind the variance of `List`, `List<? extends Number>` contains `List<Integer>`. But that is not the case for `Map<Integer, List<? extends Number>>` and `Map<Integer, List<Integer>>`. However, after taking a closer look at the rules above, we can see that `Map<Integer, ? extends List<? extends Number>>` does contain `Map<Integer, List<Integer>>`.
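To see the containment rule in practice, here is a sketch (class and method names are mine, not from the original question) of a signature that does accept the map from the failing example:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WildcardDemo {
    // Adding an upper-bounded wildcard on the nested List makes the call legal:
    // Map<Integer, ? extends List<? extends Number>> contains
    // Map<Integer, List<Integer>> under the JLS §4.5.1 rules.
    static int alsoCompiles(Map<Integer, ? extends List<? extends Number>> map) {
        return map.size();
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> m = new HashMap<>();
        m.put(1, List.of(1, 2, 3));
        System.out.println(alsoCompiles(m)); // prints "1"
    }
}
```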

What about the `compiles` method? We will find the answer in another paragraph of the JLS, namely §8.1.2:

> A generic class declaration defines a set of parameterized types (§4.5), one for each possible parameterization of the type parameter section by type arguments. All of these parameterized types share the same class at run time.

The type parameter in the method signature, `T`, is matched against the input type, hence it is effectively assigned `Integer`. Unfortunately, we need to remember that generics in Java are just a compile-time feature.
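The consequence of §8.1.2 can be observed directly; a short sketch (the class name is mine) showing that differently parameterized lists share one runtime class:

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        List<Integer> ints = new ArrayList<>();
        List<String> strings = new ArrayList<>();
        // Both parameterizations are represented by the same class object,
        // because type arguments are erased during compilation.
        System.out.println(ints.getClass() == strings.getClass()); // prints "true"
    }
}
```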

Some time ago I stumbled upon an interesting question on Stack Overflow (as stated in the title). I could not find any satisfactory answer at that time, so I came up with this solution. I think the problem is interesting enough to make it into a blog post.

The following table shows the expected results. The `x` column contains the original data, whereas `mdn_x` contains the median computed over the current row and up to 3 preceding rows.

| x | mdn_x |
|---|---|
| 1 | 1 |
| 2 | 1.5 |
| 3 | 2 |
| 5 | 2.5 |
| 8 | 4 |
| 13 | 6.5 |
| 21 | 10.5 |

Unfortunately, **ordered-set aggregate functions do not support windows**, which would be the most intuitive approach. However, the window size in this example is fixed at 4, so the median can be easily calculated using functions like `lead` and `lag`. A possible solution looks like this:

```sql
select x, (lag(x, 2) over w + lag(x) over w) / 2. as mdn_x
from tmp t
window w as (rows between 3 preceding and current row)
order by 1;
```

It reads pretty easily: calculate the average of the two previous values. It works for all rows except the first three, because `lag` returns `null` there. We can easily fix this by introducing a `null` check:

```sql
select x,
       case
         when lag(x) over w is null then x
         when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
         when lag(x, 3) over w is null then lag(x) over w
         else (lag(x, 2) over w + lag(x) over w) / 2.
       end
from tmp t
window w as (rows between 3 preceding and current row)
order by 1;
```

This is a bit more verbose. The query can be read as:

- If this is the first value, it is also the median.
- If this is the second value, the median is the average of this and the previous value.
- If it is the third value, take the previous one.
- Otherwise, the moving median is the average of the two previous values.

We can easily see the pattern, so writing a moving median for any window size should be easy. For `N` preceding rows it would be:

```sql
select x, (lag(x, (N + 1) / 2) over w + lag(x, N / 2) over w) / 2. as mdn_x
from tmp t
window w as (rows between N preceding and current row)
order by 1;
```

Or, if we were to handle the corner cases (note the `else` branch uses the same general formula as above):

```sql
select x,
       case
         when lag(x) over w is null then x
         when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
         -- other terms
         when lag(x, N) over w is null then (lag(x, (N - 1) / 2) over w + lag(x, N / 2) over w) / 2.
         else (lag(x, (N + 1) / 2) over w + lag(x, N / 2) over w) / 2.
       end
from tmp t
window w as (rows between N preceding and current row)
order by 1;
```

The previous solution is flawed, because it needs to be rewritten for every window size. Instead of this quick-and-dirty approach, we need a general one. We can achieve this by using a **custom aggregate**.

As we move along the window, we collect each value into an array. That is exactly what the following state-transition function does:

```sql
create function median_sfunc (
  state integer[], data integer
) returns integer[] as
$$
begin
  if state is null then
    return array[data];  -- first value in the window: start a fresh array
  else
    return state || data;  -- otherwise append to the existing array
  end if;
end;
$$ language plpgsql;
```

Once the array is populated, we sort it and take the middle value if its length is odd, or the average of the two middle values otherwise. The implementation is quite straightforward:

```sql
create function median_ffunc (
  state integer[]
) returns double precision as
$$
begin
  -- sort the collected values before picking the middle element(s)
  state := array(select unnest(state) order by 1);
  return (state[(array_length(state, 1) + 1) / 2] +
          state[(array_length(state, 1) + 2) / 2]) / 2.;
end;
$$ language plpgsql;
```

Now we can compose the aggregate from these two functions (the `null` initial state is already handled by `median_sfunc`):

```sql
create aggregate median (integer) (
  sfunc = median_sfunc,
  stype = integer[],
  finalfunc = median_ffunc
);
```

Finally, we can use the `median` function in a beautiful, compact, yet expressive way:

```sql
select x, median(x) over w as mdn_x
from tmp t
window w as (order by x rows between 3 preceding and current row);
```

Two things can be noticed about this solution:

- The `median` function can be parameterized in any way using the `window` clause.
- The entire implementation can be further generalized for any input type by replacing `integer[]` with `anyarray` and `integer` with `anyelement`.
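A sketch of that generalization for the transition function (untested; the final function still averages two elements, so the aggregate only makes sense for numeric element types):

```sql
-- Polymorphic variant of the transition function: collects any element type.
create function median_sfunc (state anyarray, data anyelement)
returns anyarray as
$$
begin
  if state is null then
    return array[data];
  else
    return state || data;
  end if;
end;
$$ language plpgsql;
```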

Other constraints and fine-tuning, such as marking the functions `immutable`, were omitted for brevity.

The presented solution can be further improved by using moving-aggregate mode, but I’ll leave that for another post.

There is nothing scary about implementing your own aggregates in Postgres. This way we can make our code **more readable and faster**.

My name is Jędrzej. Currently, I live in Wrocław, Poland. I'm a full-stack developer at iteratec, where my main responsibility is developing the backend of a web application. I'm also a PhD candidate at Wrocław University of Science and Technology; my major is classifier integration.

I’m interested in functional programming, performance, machine learning and mathematics. I’m experienced in Java (Java EE, Spring, Hibernate), Scala (Spark), Python (Pandas, Numpy, Scikit-learn, Keras), SQL (Postgres, Oracle).

In this blog I will try to post everything that I find interesting.
