0. The problem
Some time ago I’ve stumbled on this interesting question on stackoverflow (as stated in the title). I could not find any satisfactory answer at that time, so I came up with this solution. I think the problem is interesting enough to make it into a blog post.
The following table shows the expected results. The
x column contains the original data whereas
mdn_x contains the median computed from current up to 3 preceding rows.
1. Quick solution
Unfortunately ordered-set aggregate functions do not support windows, which would be the most intuitive approach. However, the window size in this example is fixed as 4 – it can be easily calculated using functions like
lag. A possible solution would look like this:
select x, (lag(x, 2) over w + lag(x) over w) / 2. as mdn_x from tmp t window w as (rows between 3 preceding and current row) order by 1;
It reads pretty easily – calculate the average of two previous values. It works for all rows except for the first three, because of the
null. We can easily fix this by introducing a
select x, case when lag(x) over w is null then x when lag(x, 2) over w is null then (x + lag(x) over w) / 2. when lag(x, 3) over w is null then lag(x) over w else (lag(x, 2) over w + lag(x) over w) / 2. end from tmp t window w as (rows between 3 preceding and current row) order by 1;
This is a bit more verbose. Suppose for . The query can be read as:
- If this is the first value – it’s also a median ().
- If this is the second value – the median is the average of both this and previous values ().
- If it’s the third value – take the previous one ().
- Otherwise the moving median is the average of two previous values ().
We can easily see the pattern and writing moving median for any window size should be easy. For
N it would be:
select x, (lag(x, (N + 1) / 2) over w + lag(x, N / 2) over w) / 2. as mdn_x from tmp t window w as (rows between N preceding and current row) order by 1;
Or, if we were to consider the corner case:
select x, case when lag(x) over w is null then x when lag(x, 2) over w is null then (x + lag(x) over w) / 2. -- other terms when lag(x, N) over w is null then (lag(x, (N - 1) / 2) over w + lag(x, N / 2) over w) / 2. else (lag(x, 2) over w + lag(x) over w) / 2. end from tmp t window w as (rows between N preceding and current row) order by 1;
2. General solution
The previous solution is flawed, because it needs to be rewritten for every query. Instead of this quick and dirty solution, we need a general one. We can achieve this by using a custom aggregator.
When moving along the window, we have to put each value in an ordered array. That is exactly what the following method does.
create function median_sfunc ( state integer, data integer ) returns integer as $$ begin if state is null then return array[data]; -- if the array is null, return a singleton else return state || data; -- otherwise append to the existing array end if; end; $$ language plpgsql;
Once the array is populated, we need to take the middle value if the array’s length is odd and the average of two middle values otherwise. The implementation is quite straightforward.
create function median_ffunc ( state integer ) returns double precision as $$ begin return (state[(array_length(state, 1) + 1)/ 2] + state[(array_length(state, 1) + 2) / 2]) / 2.; end; $$ language plpgsql;
Now we can compose the aggregate using an empty array as the initial value:
create aggregate median (integer) ( sfunc = median_sfunc, stype = integer, finalfunc = median_ffunc );
Finally, we can use the
median function in a beautiful, compact yet expressive way:
select x, median(x) over w as mdn_x from tmp t window w as (order by x rows between 3 preceding and current row)
Two things can be noticed about the solution:
medianfunction can be parameterized in any way using the
- The entire implementation can be further generalized for any input type by replacing
Other constraints and fine-tuning such as making the functions
immutable was omitted for brevity.
The presented solution can be further improved by using moving-aggregate mode, but I’ll leave that for another post.
There is nothing scary in implementing own aggregations in Postgres. This way we can make our code more readable and faster.