Calculating moving median in Postgres

0. The problem

Some time ago I’ve stumbled on this interesting question on stackoverflow (as stated in the title). I could not find any satisfactory answer at that time, so I came up with this solution. I think the problem is interesting enough to make it into a blog post.

The following table shows the expected results. The x column contains the original data whereas mdn_x contains the median computed from current up to 3 preceding rows.

xmdn_x
11
21.5
32
52.5
84
136.5
2110.5

1. Quick solution

Unfortunately ordered-set aggregate functions do not support windows, which would be the most intuitive approach. However, the window size in this example is fixed as 4 – it can be easily calculated using functions like lead and lag. A possible solution would look like this:

select x,
       (lag(x, 2) over w + lag(x) over w) / 2. as mdn_x
from tmp t
    window w as (rows between 3 preceding and current row)
order by 1;

It reads pretty easily – calculate the average of two previous values. It works for all rows except for the first three, because of the lag returning null. We can easily fix this by introducing a null-check:

select x,
       case
           when lag(x) over w is null then x
           when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
           when lag(x, 3) over w is null then lag(x) over w
           else (lag(x, 2) over w + lag(x) over w) / 2.
           end
from tmp t
    window w as (rows between 3 preceding and current row)
order by 1;

This is a bit more verbose. Suppose x_i \leq x_j for i < j. The query can be read as:

  1. If this is the first value – it’s also a median (median(x_1) = x_1).
  2. If this is the second value – the median is the average of both this and previous values (median(x_1, x_2) = \frac{x_1 + x_2}{2}).
  3. If it’s the third value – take the previous one (median(x_1, x_2, x_3) = x_2).
  4. Otherwise the moving median is the average of two previous values (median(x_{i - 3}, x_{i - 2}, x_{i - 1}, x_i) = \frac{x_{i - 2} + x_{i - 1}}{2}).

We can easily see the pattern and writing moving median for any window size should be easy. For N it would be:

select x,
       (lag(x, (N + 1) / 2) over w + lag(x, N / 2) over w) / 2. as mdn_x
from tmp t
    window w as (rows between N preceding and current row)
order by 1;

Or, if we were to consider the corner case:

select x,
       case
           when lag(x) over w is null then x
           when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
           -- other terms
           when lag(x, N) over w is null then (lag(x, (N - 1) / 2) over w + lag(x, N / 2) over w) / 2.
           else (lag(x, 2) over w + lag(x) over w) / 2.
           end
from tmp t
    window w as (rows between N preceding and current row)
order by 1;

2. General solution

The previous solution is flawed, because it needs to be rewritten for every query. Instead of this quick and dirty solution, we need a general one. We can achieve this by using a custom aggregator.

When moving along the window, we have to put each value in an ordered array. That is exactly what the following method does.

create function median_sfunc (
    state integer[], data integer
) returns integer[] as
$$
begin
    if state is null then
        return array[data]; -- if the array is null, return a singleton
    else
        return state || data; -- otherwise append to the existing array
    end if;
end;
$$ language plpgsql;

Once the array is populated, we need to take the middle value if the array’s length is odd and the average of two middle values otherwise. The implementation is quite straightforward.

create function median_ffunc (
    state integer[]
) returns double precision as
$$
begin
    return (state[(array_length(state, 1) + 1)/ 2] 
        + state[(array_length(state, 1) + 2) / 2]) / 2.;
end;
$$ language plpgsql;

Now we can compose the aggregate using an empty array as the initial value:

create aggregate median (integer) (
    sfunc     = median_sfunc,
    stype     = integer[],
    finalfunc = median_ffunc
    );

Finally, we can use the median function in a beautiful, compact yet expressive way:

select x,
       median(x) over w as mdn_x
from tmp t
    window w as (order by x rows between 3 preceding and current row)

3. Conclusions

Two things can be noticed about the solution:

  1. The median function can be parameterized in any way using the window clause.
  2. The entire implementation can be further generalized for any input type by replacing integer[] with anyarray and integer with anyelement.

Other constraints and fine-tuning such as making the functions immutable was omitted for brevity.

The presented solution can be further improved by using moving-aggregate mode, but I’ll leave that for another post.

There is nothing scary in implementing own aggregations in Postgres. This way we can make our code more readable and faster.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website at WordPress.com
Get started