0. The problem
Some time ago I’ve stumbled on this interesting question on stackoverflow (as stated in the title). I could not find any satisfactory answer at that time, so I came up with this solution. I think the problem is interesting enough to make it into a blog post.
The following table shows the expected results. The x
column contains the original data whereas mdn_x
contains the median computed from current up to 3 preceding rows.
x | mdn_x |
---|---|
1 | 1 |
2 | 1.5 |
3 | 2 |
5 | 2.5 |
8 | 4 |
13 | 6.5 |
21 | 10.5 |
1. Quick solution
Unfortunately ordered-set aggregate functions do not support windows, which would be the most intuitive approach. However, the window size in this example is fixed as 4 – it can be easily calculated using functions like lead
and lag
. A possible solution would look like this:
select x,
(lag(x, 2) over w + lag(x) over w) / 2. as mdn_x
from tmp t
window w as (rows between 3 preceding and current row)
order by 1;
It reads pretty easily – calculate the average of two previous values. It works for all rows except for the first three, because of the lag
returning null
. We can easily fix this by introducing a null
-check:
select x,
case
when lag(x) over w is null then x
when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
when lag(x, 3) over w is null then lag(x) over w
else (lag(x, 2) over w + lag(x) over w) / 2.
end
from tmp t
window w as (rows between 3 preceding and current row)
order by 1;
This is a bit more verbose. Suppose for
. The query can be read as:
- If this is the first value – it’s also a median (
).
- If this is the second value – the median is the average of both this and previous values (
).
- If it’s the third value – take the previous one (
).
- Otherwise the moving median is the average of two previous values (
).
We can easily see the pattern and writing moving median for any window size should be easy. For N
it would be:
select x,
(lag(x, (N + 1) / 2) over w + lag(x, N / 2) over w) / 2. as mdn_x
from tmp t
window w as (rows between N preceding and current row)
order by 1;
Or, if we were to consider the corner case:
select x,
case
when lag(x) over w is null then x
when lag(x, 2) over w is null then (x + lag(x) over w) / 2.
-- other terms
when lag(x, N) over w is null then (lag(x, (N - 1) / 2) over w + lag(x, N / 2) over w) / 2.
else (lag(x, 2) over w + lag(x) over w) / 2.
end
from tmp t
window w as (rows between N preceding and current row)
order by 1;
2. General solution
The previous solution is flawed, because it needs to be rewritten for every query. Instead of this quick and dirty solution, we need a general one. We can achieve this by using a custom aggregator.
When moving along the window, we have to put each value in an ordered array. That is exactly what the following method does.
create function median_sfunc (
state integer[], data integer
) returns integer[] as
$$
begin
if state is null then
return array[data]; -- if the array is null, return a singleton
else
return state || data; -- otherwise append to the existing array
end if;
end;
$$ language plpgsql;
Once the array is populated, we need to take the middle value if the array’s length is odd and the average of two middle values otherwise. The implementation is quite straightforward.
create function median_ffunc (
state integer[]
) returns double precision as
$$
begin
return (state[(array_length(state, 1) + 1)/ 2]
+ state[(array_length(state, 1) + 2) / 2]) / 2.;
end;
$$ language plpgsql;
Now we can compose the aggregate using an empty array as the initial value:
create aggregate median (integer) (
sfunc = median_sfunc,
stype = integer[],
finalfunc = median_ffunc
);
Finally, we can use the median
function in a beautiful, compact yet expressive way:
select x,
median(x) over w as mdn_x
from tmp t
window w as (order by x rows between 3 preceding and current row)
3. Conclusions
Two things can be noticed about the solution:
- The
median
function can be parameterized in any way using thewindow
clause. - The entire implementation can be further generalized for any input type by replacing
integer[]
withanyarray
andinteger
withanyelement
.
Other constraints and fine-tuning such as making the functions immutable
was omitted for brevity.
The presented solution can be further improved by using moving-aggregate mode, but I’ll leave that for another post.
There is nothing scary in implementing own aggregations in Postgres. This way we can make our code more readable and faster.
This is very fascinating, You are an overly skilled blogger. I have joined your rss feed and sit up for seeking more of your excellent post. Also, I’ve shared your website in my social networks!
LikeLike
Nice solution. But are you not missing the sorting of the array in median_ffunc(). Without sorting, the median will not be returned.
LikeLike
Thank you, I hope it helped. You’re right that the function only picks the middle element, so you need to specify the ordering when you’re using the window function as I did with `window w as (order by x rows between 3 preceding and current row)`
LikeLike