0. Problem statement
In my opinion, one of the features most sorely missing from Postgres has always been the ability to ignore nulls in analytic functions. Other databases support an ignore nulls clause, which makes the task much easier. Let’s start by creating a simple two-column table and populating it with data:
create table tbl (
  id bigint primary key,
  a  int
);

insert into tbl (id, a)
select s, case when random() < .5 then s end
from generate_series(0, 100) s;
The problem is to generate a column (lnn – last non-null) containing the last non-null value of a, with rows ordered by id.
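To make the goal concrete, here is a hypothetical sample of the desired output (the actual values will differ, since the table is populated randomly; note that a equals id whenever it is not null):

```
 id |  a | lnn
----+----+-----
  0 |    |
  1 |  1 |   1
  2 |    |   1
  3 |  3 |   3
  4 |    |   3
```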
1. How we would do it in Oracle
As mentioned, in Oracle the solution is straightforward:
select id, a, last_value(a) ignore nulls over (order by id) lnn
from tbl;
The query passes through all the rows ordered by id, starting from the first one up to the current row (according to the over clause), and retrieves the last non-null value (or null if there is none).
2. Nested query in Postgres
One of the possible solutions in Postgres is a nested query:
select *,
       ( select a
         from tbl it
         where it.id <= t.id and it.a is not null
         order by id desc
         limit 1 ) lnn
from tbl t;
This solution works and yields the expected results, but has two major drawbacks:
- It’s much more convoluted than the Oracle version – you need to think harder about what it does, whereas the intent of the ignore nulls version is obvious.
- It is not reusable.
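For completeness, the same correlated lookup can also be phrased as a lateral join – a sketch that shares the drawbacks above, though some readers may find the shape more familiar (the alias last_a is my own choice):

```sql
select t.*, last_a.a as lnn
from tbl t
left join lateral (
  select it.a
  from tbl it
  where it.id <= t.id and it.a is not null
  order by it.id desc
  limit 1
) last_a on true;
```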
Fortunately, there is yet another possible solution.
3. Aggregates in Postgres
Although there is no ignore nulls syntax, Postgres gives us a way to write an equivalent analytic function using create aggregate. One of the parameters it takes is sfunc, a reducing function. In our case it should simply take the second value if it’s not null, and the first one otherwise:
create function coalesce_r_sfunc(state anyelement, value anyelement)
returns anyelement
immutable parallel safe
as $$
  select coalesce(value, state);
$$ language sql;
I’ve named the function coalesce_r_sfunc because the built-in coalesce returns the first non-null argument, whereas our function returns the last non-null one of exactly two. Now we can define our custom analytic function:
create aggregate find_last_ignore_nulls(anyelement) (
  sfunc = coalesce_r_sfunc,
  stype = anyelement
);
We can use it just like an ordinary analytic function (notice how similar the query is to the one for Oracle):
select *, find_last_ignore_nulls(a) over (order by id) from tbl;
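For clarity, the default window frame implied by order by id can also be spelled out. By default Postgres uses range between unbounded preceding and current row; since id is unique here, the rows variant below is equivalent (a sketch – the alias lnn is my own choice):

```sql
select *,
       find_last_ignore_nulls(a) over (
         order by id
         rows between unbounded preceding and current row
       ) lnn
from tbl;
```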
You might have noticed a notational inconsistency – I keep talking about analytic functions, whereas the syntax is create aggregate. That is because the defined function can be used as either an analytic or an aggregate function, depending on the context. Since our function here processes multiple rows and returns a result for every row, it is clearly analytic.
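To see the aggregate side of the same definition, the function can also be called in a plain (non-windowed) aggregate context – here it folds the entire column and returns a single value, the last non-null a in id order:

```sql
select find_last_ignore_nulls(a order by id) from tbl;
```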
4. Performance considerations
To run the benchmarks, let us populate the table with a million rows:
insert into tbl (id, a)
select s, case when random() < .5 then s end
from generate_series(101, 1000000) s; -- ids 0–100 are already present
We also need to prepare two scripts: ns.sql, containing the nested-select query, and a.sql, containing the analytic query. The benchmarks yielded the following results:
root@5686cdefbcdf:/# pgbench -f ns.sql -U jb -t 10 postgres
starting vacuum...
transaction type: ns.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 7104.174 ms
tps = 0.140762 (including connections establishing)
tps = 0.140769 (excluding connections establishing)
root@5686cdefbcdf:/# pgbench -f a.sql -U jb -t 10 postgres
starting vacuum...
transaction type: a.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 3442.054 ms
tps = 0.290524 (including connections establishing)
tps = 0.290603 (excluding connections establishing)
The number of transactions per second (tps) for the analytic function is roughly twice that of the nested query, and the latency is roughly halved.
5. Summary
Despite the lack of ignore nulls syntax in Postgres, we can mimic the required behavior by defining a custom aggregate. The code is really simple, and the approach has several advantages:
- The code is very readable.
- Aggregates are reusable – there is no need to rewrite the whole query each time.
- It’s about twice as fast as the equivalent nested select.
This was just a sample of what can be done with custom aggregates; I will dive deeper into this topic in upcoming posts.