Ignore nulls in Postgres

0. Problem statement

One of the features I have always missed most in Postgres is the ability to ignore nulls in analytic functions. Other databases support an ignore nulls clause, which makes the task much easier. Let’s start by creating a simple, two-column table and populating it with data:

create table tbl (
    id bigint primary key,
    a int
);
insert into tbl (id, a)
select s, case when random() < .5 then s end
from generate_series(0, 100) s;

The problem is to generate a column (lnn – last non-null) containing, for each row, the last non-null value of a when ordered by id:

id | a      | lnn
---+--------+--------
 0 | <null> | <null>
 1 | <null> | <null>
 2 | 2      | 2
 3 | <null> | 2
 4 | <null> | 2
 5 | <null> | 2
 6 | 6      | 6
 7 | 7      | 7
 8 | <null> | 7
 9 | 9      | 9
Table with a nullable column a and a column lnn whose gaps are filled with the last non-null value

1. How we would do it in Oracle

As mentioned, in Oracle the solution is straightforward:

select id, a, last_value(a) ignore nulls over (order by id) lnn
from tbl;

The query passes through all the rows ordered by id, from the first one up to the current row (as specified by the over clause), and retrieves the last non-null value (or null if none is present).

2. Nested query in Postgres

One of the possible solutions in Postgres is a nested query:

select *, (
    select a
    from tbl it
    where it.id <= t.id
    and it.a is not null
    order by id desc
    limit 1
) lnn
from tbl t;

This solution works and yields the expected results, but has two major drawbacks:

  • It’s much more convoluted than the Oracle version – you need to think harder about what it does, whereas the intent of the ignore nulls version is obvious.
  • It is not reusable.

Fortunately, there is yet another possible solution.

3. Aggregates in Postgres

Although there is no ignore nulls syntax, Postgres gives us a way to write an equivalent analytic function using create aggregate. One of the parameters it takes is sfunc – the state transition (reducing) function. In our case it should simply take the second value if it’s not null and keep the first one otherwise:

create function coalesce_r_sfunc(state anyelement, value anyelement)
    returns anyelement
    immutable parallel safe
as
$$
select coalesce(value, state);
$$ language sql;
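
Before wiring it into an aggregate, the transition function can be sanity-checked directly (the values here are arbitrary):

```sql
select coalesce_r_sfunc(1, null::int);  -- 1: a null value keeps the previous state
select coalesce_r_sfunc(1, 2);          -- 2: a non-null value replaces the state
```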

I’ve named the function coalesce_r_sfunc because the built-in coalesce returns the first non-null argument, whereas our function returns the last non-null one of exactly two. Now we can define our custom analytic function:

create aggregate find_last_ignore_nulls(anyelement) (
    sfunc = coalesce_r_sfunc,
    stype = anyelement
);

We can use it just like an ordinary analytic function (notice how similar the query is to the one for Oracle):

select *, find_last_ignore_nulls(a) over (order by id)
from tbl;
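
As a side note, over (order by id) relies on Postgres’s default window frame for an ordered window – range between unbounded preceding and current row – so the query above is equivalent to spelling the frame out explicitly:

```sql
select *, find_last_ignore_nulls(a) over (
    order by id
    range between unbounded preceding and current row
) lnn
from tbl;
```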

You might have noticed a notational inconsistency – I keep talking about analytic functions whereas the syntax is create aggregate. That’s because the defined function can be used both as an analytic and as an aggregate function, depending on the context. Since our function here processes multiple rows and returns a result for every row, it is clearly analytic.
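
In an aggregate context – without an over clause – the same definition collapses the whole table to a single row; for example, to get the last non-null a overall (a sketch, using the aggregate defined above):

```sql
select find_last_ignore_nulls(a order by id) as last_non_null_a
from tbl;
```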

4. Performance considerations

To run the benchmarks, let us populate the table with a million rows (continuing from the ids already inserted, to avoid a primary-key conflict):

insert into tbl (id, a)
select s, case when random() < .5 then s end
from generate_series(101, 1000000) s;
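
For reference, the two script files might be prepared like this (the file names come from the text; the queries are the ones shown earlier):

```shell
# Save the nested-select query as ns.sql
cat > ns.sql <<'SQL'
select *, (
    select a
    from tbl it
    where it.id <= t.id
    and it.a is not null
    order by id desc
    limit 1
) lnn
from tbl t;
SQL

# Save the analytic query as a.sql
cat > a.sql <<'SQL'
select *, find_last_ignore_nulls(a) over (order by id) lnn
from tbl;
SQL
```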

We also need to prepare two scripts: ns.sql, containing the nested-select query, and a.sql, containing the analytic query. The benchmarks yielded the following results:

root@5686cdefbcdf:/# pgbench -f ns.sql -U jb -t 10 postgres
starting vacuum...
transaction type: ns.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 7104.174 ms
tps = 0.140762 (including connections establishing)
tps = 0.140769 (excluding connections establishing)
root@5686cdefbcdf:/# pgbench -f a.sql -U jb -t 10 postgres
starting vacuum...
transaction type: a.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
latency average = 3442.054 ms
tps = 0.290524 (including connections establishing)
tps = 0.290603 (excluding connections establishing)

The number of transactions per second (tps) for the analytic function is roughly twice that of the nested query, and the latency is roughly half.

5. Conclusions

Despite the lack of ignore nulls syntax in Postgres, we can mimic the required behavior by defining a custom aggregate. The code is really simple, and it has several advantages:

  • The code is very readable.
  • Aggregates are reusable – there is no need to rewrite whole queries.
  • It’s twice as fast as the equivalent nested selects.

This was just a sample of what can be done using custom aggregates; I will dive deeper into this topic in upcoming posts.

Here are the db-fiddles for Oracle and Postgres. The code is also available on github.

11 thoughts on “Ignore nulls in Postgres”

  1. Hi,

    I think there is a better solution (faster than coalesce): try using IMMUTABLE STRICT, e.g.

    CREATE FUNCTION fiind_last ( anyelement, anyelement )
    RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE AS $$
    SELECT $2;
    $$;


    1. Thank you for your comment! Actually making it immutable and so on is just a matter of optimization. I’m avoiding that to not clutter the code – I want it to be readable to explain the concept.
      When it comes to your query – it doesn’t work. It always selects the second element, that’s not what we need. `select fiind_last(1, null);` returns `null`.


  2. Hi,

    Yes, you are right – this function doesn’t work outside the aggregate; it should only be used inside one. The trick is the STRICT parameter. But of course your post is very good and helpful. Many people don’t try to find solutions beyond the default functions, procedures, and aggregates, so I think you did good work.
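
    For reference, the STRICT trick works because Postgres skips input rows containing nulls when the transition function is declared strict, retaining the previous state. A sketch of the full aggregate (the name find_last_strict is mine; the function is the one from the comment above):

```sql
create aggregate find_last_strict(anyelement) (
    sfunc = fiind_last,  -- the STRICT function from the comment
    stype = anyelement
);
-- Rows where the input is null never reach fiind_last, so the previous
-- state survives -- the same "ignore nulls" behavior as before:
-- select *, find_last_strict(a) over (order by id) from tbl;
```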


  3. Hi! Thank you for the post, the solution is very nice and clean.

    What I like even more is that on big tables (like 1M rows) it becomes obvious that your approach needs much less memory.

    With `explain (analyze, buffers)` (sorry, WordPress is awful at posting blocks of code, so I skip the full output) I found a huge difference: 8073 vs 4015470 blocks (~64 MB vs ~32 GB) read.


      1. Oh, and as the discussion pointed out by Emre shows, it’s unlikely that PostgreSQL will ever implement the IGNORE NULLS syntax, due to its bad syntactic design.
