Calculating moving median in Postgres

0. The problem

Some time ago I’ve stumbled on this interesting question on stackoverflow (as stated in the title). I could not find any satisfactory answer at that time, so I came up with this solution. I think the problem is interesting enough to make it into a blog post.

The following table shows the expected results. The x column contains the original data whereas mdn_x contains the median computed from current up to 3 preceding rows.

x	mdn_x
1	1
2	1.5
3	2
5	2.5
8	4
13	6.5
21	10.5

1. Quick solution

Unfortunately ordered-set aggregate functions do not support windows, which would be the most intuitive approach. However, the window size in this example is fixed as 4 – it can be easily calculated using functions like lead and lag. A possible solution would look like this:

select x,
       (lag(x, 2) over w + lag(x) over w) / 2. as mdn_x
from tmp t
    window w as (rows between 3 preceding and current row)
order by 1;

It reads pretty easily – calculate the average of two previous values. It works for all rows except for the first three, because of the lag returning null. We can easily fix this by introducing a null-check:

select x,
       case
           when lag(x) over w is null then x
           when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
           when lag(x, 3) over w is null then lag(x) over w
           else (lag(x, 2) over w + lag(x) over w) / 2.
           end
from tmp t
    window w as (rows between 3 preceding and current row)
order by 1;

This is a bit more verbose. Suppose $x_i \leq x_j$ for $i < j$ . The query can be read as:

If this is the first value – it’s also a median ( $median(x_1) = x_1$ ).
If this is the second value – the median is the average of both this and previous values ( $median(x_1, x_2) = \frac{x_1 + x_2}{2}$ ).
If it’s the third value – take the previous one ( $median(x_1, x_2, x_3) = x_2$ ).
Otherwise the moving median is the average of two previous values ( $median(x_{i - 3}, x_{i - 2}, x_{i - 1}, x_i) = \frac{x_{i - 2} + x_{i - 1}}{2}$ ).

We can easily see the pattern and writing moving median for any window size should be easy. For N it would be:

select x,
       (lag(x, (N + 1) / 2) over w + lag(x, N / 2) over w) / 2. as mdn_x
from tmp t
    window w as (rows between N preceding and current row)
order by 1;

Or, if we were to consider the corner case:

select x,
       case
           when lag(x) over w is null then x
           when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
           -- other terms
           when lag(x, N) over w is null then (lag(x, (N - 1) / 2) over w + lag(x, N / 2) over w) / 2.
           else (lag(x, 2) over w + lag(x) over w) / 2.
           end
from tmp t
    window w as (rows between N preceding and current row)
order by 1;

2. General solution

The previous solution is flawed, because it needs to be rewritten for every query. Instead of this quick and dirty solution, we need a general one. We can achieve this by using a custom aggregator.

When moving along the window, we have to put each value in an ordered array. That is exactly what the following method does.

create function median_sfunc (
    state integer[], data integer
) returns integer[] as
$$
begin
    if state is null then
        return array[data]; -- if the array is null, return a singleton
    else
        return state || data; -- otherwise append to the existing array
    end if;
end;
$$ language plpgsql;

Once the array is populated, we need to take the middle value if the array’s length is odd and the average of two middle values otherwise. The implementation is quite straightforward.

create function median_ffunc (
    state integer[]
) returns double precision as
$$
begin
    return (state[(array_length(state, 1) + 1)/ 2] 
        + state[(array_length(state, 1) + 2) / 2]) / 2.;
end;
$$ language plpgsql;

Now we can compose the aggregate using an empty array as the initial value:

create aggregate median (integer) (
    sfunc     = median_sfunc,
    stype     = integer[],
    finalfunc = median_ffunc
    );

Finally, we can use the median function in a beautiful, compact yet expressive way:

select x,
       median(x) over w as mdn_x
from tmp t
    window w as (order by x rows between 3 preceding and current row)

3. Conclusions

Two things can be noticed about the solution:

The median function can be parameterized in any way using the window clause.
The entire implementation can be further generalized for any input type by replacing integer[] with anyarray and integer with anyelement.

Other constraints and fine-tuning such as making the functions immutable was omitted for brevity.

The presented solution can be further improved by using moving-aggregate mode, but I’ll leave that for another post.

There is nothing scary in implementing own aggregations in Postgres. This way we can make our code more readable and faster.

Thx! This was a nice solution. I’ve tweaked median_ffunc to be able to so a other given column, like eg this:

select *, median(rand) over w as mdn_x
from
(
select *, (random()*10 )::int::numeric rand
FROM
generate_series(1,10,1) as id)as z
window w as (ORDER BY id rows between 4 preceding and current row
)
order by 1;

create function median_sfunc (
state numeric[], data numeric
) returns numeric[] as
$$
begin
if state is null then
return array[data]; — if the array is null, return a singleton
else
return state || data; — otherwise append to the existing array
end if;
end;
$$ language plpgsql;

create or replace function median_ffunc (
state NUMERIC[]
) returns numeric as
$$
DECLARE
VARarrlen INT = ARRAY_LENGTH(state, 1);
VARdocalc INT = CASE WHEN VARarrlen=1 OR MOD(VARarrlen , 2) = 1 THEN 1 ELSE 2 END;
begin
/*
Tests:
SELECT sct.median_ffunc(‘{-1,1}’::NUMERIC[]) = 0 UNION ALL
SELECT sct.median_ffunc(‘{1,1}’::NUMERIC[]) = 1 UNION ALL
SELECT sct.median_ffunc(‘{1,2,4}’::NUMERIC[]) = 2 UNION ALL
SELECT sct.median_ffunc(‘{1,2}’::NUMERIC[]) = 1.5 UNION ALL
SELECT sct.median_ffunc(‘{1,1}’::NUMERIC[]) = 1 UNION ALL
SELECT sct.median_ffunc(‘{1}’::NUMERIC[]) = 1;

Notes:
– This does not handle NULLs
*/
return
(
SELECT SUM(val) / VARdocalc
FROM
(
SELECT val
FROM UNNEST(state) val
ORDER BY 1
OFFSET (CASE WHEN COALESCE(( (VARarrlen+1 )/ 2) -1, 0) <= 0 THEN 0 ELSE ((VARarrlen+1)/ 2) -1 END)
LIMIT VARdocalc
) d
)::NUMERIC;
end;
$$ language plpgsql;

create aggregate median (numeric) (
sfunc = median_sfunc,
stype = numeric[],
finalfunc = median_ffunc
);

LikeLike

4 thoughts on “Calculating moving median in Postgres”

Hairstyles says:

January 26, 2022 at 6:32 am

This is very fascinating, You are an overly skilled blogger. I have joined your rss feed and sit up for seeking more of your excellent post. Also, I’ve shared your website in my social networks!

LikeLike

Anonymous says:

May 31, 2022 at 4:18 pm

Nice solution. But are you not missing the sorting of the array in median_ffunc(). Without sorting, the median will not be returned.

LikeLike

1. andronicust says:
  
  May 31, 2022 at 4:56 pm
  
  Thank you, I hope it helped. You’re right that the function only picks the middle element, so you need to specify the ordering when you’re using the window function as I did with `window w as (order by x rows between 3 preceding and current row)`
  
  LikeLike
  
Oscar van Baten says:

January 8, 2024 at 10:18 pm

Thx! This was a nice solution. I’ve tweaked median_ffunc to be able to so a other given column, like eg this:

select *, median(rand) over w as mdn_x
from
(
select *, (random()*10 )::int::numeric rand
FROM
generate_series(1,10,1) as id)as z
window w as (ORDER BY id rows between 4 preceding and current row
)
order by 1;

create function median_sfunc (
state numeric[], data numeric
) returns numeric[] as
$$
begin
if state is null then
return array[data]; — if the array is null, return a singleton
else
return state || data; — otherwise append to the existing array
end if;
end;
$$ language plpgsql;

create or replace function median_ffunc (
state NUMERIC[]
) returns numeric as
$$
DECLARE
VARarrlen INT = ARRAY_LENGTH(state, 1);
VARdocalc INT = CASE WHEN VARarrlen=1 OR MOD(VARarrlen , 2) = 1 THEN 1 ELSE 2 END;
begin
/*
Tests:
SELECT sct.median_ffunc(‘{-1,1}’::NUMERIC[]) = 0 UNION ALL
SELECT sct.median_ffunc(‘{1,1}’::NUMERIC[]) = 1 UNION ALL
SELECT sct.median_ffunc(‘{1,2,4}’::NUMERIC[]) = 2 UNION ALL
SELECT sct.median_ffunc(‘{1,2}’::NUMERIC[]) = 1.5 UNION ALL
SELECT sct.median_ffunc(‘{1,1}’::NUMERIC[]) = 1 UNION ALL
SELECT sct.median_ffunc(‘{1}’::NUMERIC[]) = 1;

Notes:
– This does not handle NULLs
*/
return
(
SELECT SUM(val) / VARdocalc
FROM
(
SELECT val
FROM UNNEST(state) val
ORDER BY 1
OFFSET (CASE WHEN COALESCE(( (VARarrlen+1 )/ 2) -1, 0) <= 0 THEN 0 ELSE ((VARarrlen+1)/ 2) -1 END)
LIMIT VARdocalc
) d
)::NUMERIC;
end;
$$ language plpgsql;

create aggregate median (numeric) (
sfunc = median_sfunc,
stype = numeric[],
finalfunc = median_ffunc
);

LikeLike

0. The problem

1. Quick solution

2. General solution

3. Conclusions

Share this:

Related

4 thoughts on “Calculating moving median in Postgres”

Leave a comment Cancel reply