Postgres Upserts

Since version 9.5, Postgres has supported a useful feature called UPSERT. For a reason I can’t figure out, this feature is referred to as UPSERT, even though there is no UPSERT SQL command. In addition to being a useful feature, UPSERT is fairly interesting from a “behind the scenes” perspective as well.

If you haven’t noticed yet, the word “upsert” is a portmanteau of the words “update” and “insert”. As a feature, UPSERT allows you to insert new data if that data does not already exist, and to specify an action to be performed instead if that data does already exist. More specifically, when there is a unique constraint on a column (a constraint specifying that all values of the column are distinct from each other), UPSERT allows you to say “insert this row if it does not violate the unique constraint, otherwise perform this action to resolve the conflict”.

As an example, let’s say we have a counters table where each row represents a counter. The table has two columns, id and value, where the id specifies the counter we are referring to, and value is the number of times the counter has been incremented. It would be nice if we could increment a counter without needing to create the counter in advance. This is exactly the kind of problem UPSERT solves. First let’s create the table:

CREATE TABLE counters (id bigint UNIQUE, value bigint);

It’s important that the id column is marked as unique. Without that, we would be unable to use UPSERT.

To write an UPSERT query, you first write a normal INSERT for the case when the constraint is not violated. In this case, when a counter with a given id does not already exist, we want to create a new counter with the given id and the value 1. An INSERT that does this looks like:

INSERT INTO counters (id, value)
SELECT <id> AS id, 1 AS value;

Then to make it an UPSERT, you add ON CONFLICT (<unique column>) DO <action> to the end of it. The action can either be NOTHING, in which case the conflicting row simply isn’t inserted, or it can be UPDATE SET <column1> = <expr1>, <column2> = <expr2> …, which modifies the existing row, setting the listed columns to the new values. In this case we want to use the UPDATE form to increment the value of the counter. The whole query winds up looking like:

INSERT INTO counters
SELECT <id> AS id, 1 AS value
    ON CONFLICT (id) DO 
    UPDATE SET value = counters.value + 1;

When you run the above command with a given id, it will create a new counter with the value 1 if a counter with that id does not already exist. Otherwise it will increment the value of the existing counter. Here are some examples of it in use:

> SELECT * FROM counters;
 id | value
----+-------
(0 rows)

> INSERT INTO counters
  SELECT 0 AS id, 1 AS VALUE
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + 1;

> SELECT * FROM counters;
 id | value
----+-------
  0 |     1
(1 row)

> INSERT INTO counters
  SELECT 0 AS id, 1 AS VALUE
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + 1;

> SELECT * FROM counters;
 id | value
----+-------
  0 |     2
(1 row)

> INSERT INTO counters
  SELECT 0 AS id, 1 AS VALUE
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + 1;

> SELECT * FROM counters;
 id | value
----+-------
  0 |     3
(1 row)

> INSERT INTO counters
  SELECT 1 AS id, 1 AS VALUE
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + 1;

> SELECT * FROM counters;
 id | value
----+-------
  0 |     3
  1 |     1
(2 rows)

> INSERT INTO counters
  SELECT 1 AS id, 1 AS VALUE
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + 1;

> SELECT * FROM counters;
 id | value
----+-------
  0 |     3
  1 |     2
(2 rows)

One last bit about UPSERT: you can use the faux table excluded to refer to the new row being inserted. This is useful if you either want to replace the values of the old row with the values of the new row, or make the values of the row a combination of the old and new values. As an example, let’s say we want to extend the counter example to increment by an arbitrary amount. That can be done with:

INSERT INTO counters
SELECT <id> AS id, <amount> AS VALUE
    ON CONFLICT (id) DO
    UPDATE SET value = counters.value + excluded.value;

This even works if you are incrementing multiple counters simultaneously all by different amounts.
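For instance, here is a sketch of a single query that bumps several counters at once, each by its own amount (the specific ids and amounts are made up for illustration):

INSERT INTO counters (id, value)
VALUES (0, 5), (1, 2), (2, 7)
    ON CONFLICT (id) DO
    UPDATE SET value = counters.value + excluded.value;

For each row that conflicts, excluded.value is that row’s proposed value, so every counter is incremented by its own amount.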

What makes UPSERT so interesting to me is that it works even in concurrent situations. UPSERT still works even if other INSERT and UPDATE queries are all running simultaneously! Prior to the UPSERT feature, there was a fairly complex method for emulating it. That method involved using PL/pgSQL to alternate between running INSERT and UPDATE statements until one of them succeeded. The statements need to be run in a loop because another session could insert the same row just before the INSERT runs, or delete the row just before the UPDATE runs. The UPSERT feature takes care of all of this for you, while at the same time providing a single command for the common pattern of inserting data if it does not already exist and otherwise modifying the old data!
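For comparison, here is a rough sketch of what that pre-9.5 emulation might look like for the counters table, wrapped in a hypothetical increment_counter function (the function name and exact body are just for illustration):

CREATE FUNCTION increment_counter(counter_id bigint) RETURNS void AS
$$
BEGIN
    LOOP
        -- First try to increment an existing counter.
        UPDATE counters SET value = value + 1 WHERE id = counter_id;
        IF found THEN
            RETURN;
        END IF;
        -- No counter yet, so try to create one. Another session may
        -- insert the same id first, in which case we catch the unique
        -- violation and loop around to retry the UPDATE.
        BEGIN
            INSERT INTO counters (id, value) VALUES (counter_id, 1);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL; -- Fall through and retry.
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

You would then increment a counter with something like SELECT increment_counter(0);. The single UPSERT statement above replaces all of this.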

Postgres Transaction Isolation Levels

This post is part 3 in a three part series exploring transaction isolation issues in Postgres. Part 1 introduced some of the common problems that come up with transaction isolation in Postgres. Part 2 introduced row level locks, one way of dealing with some of the problems around transaction isolation. This part introduces the different transaction isolation levels and how they affect the different problems introduced in part 1.

The main, and easiest, way to deal with the transaction isolation issues introduced in part 1 is to change the transaction isolation level. In Postgres, each transaction has its own isolation level. The isolation level of a transaction determines the isolation semantics of the transaction. In other words, it determines what state of the database the statements run within the transaction see, as well as how concurrent modifications to the same row are resolved.

Postgres provides a total of three different transaction isolation levels: read committed, repeatable read, and serializable. Read committed is the default and provides the semantics we’ve been discussing so far. The isolation semantics described in part 1 around SELECT, UPDATE, and DELETE are the semantics provided under read committed. The repeatable read and serializable isolation levels each provide a different set of semantics for SELECT, UPDATE, and DELETE. The semantics under repeatable read are as follows:

SELECT

Whenever a repeatable read transaction starts, it takes a snapshot of the current state of the database. All queries within the transaction use this snapshot, as opposed to the behavior of read committed, in which each query has its own snapshot. The effects of all transactions that committed before the transaction started are visible to all queries within the transaction, while all transactions that either failed or committed after the transaction started are invisible to all statements run in the transaction.

UPDATE and DELETE

Under repeatable read, UPDATE and DELETE use the snapshot created when the transaction started to find all rows to be modified. When an UPDATE or DELETE statement attempts to modify a row currently being modified by another transaction, it waits until the other transaction commits or aborts. If the other transaction aborts, the statement modifies the row and continues. On the other hand, if the other transaction commits, the UPDATE/DELETE statement will abort.

This behavior is completely different from that provided by read committed and somewhat surprising. In read committed, if two statements attempt to modify the same row at the same time, the modifications will be performed one after the other. In repeatable read, one of the statements will be aborted to prevent any isolation issues! This is generally the reason why people prefer read committed over repeatable read. When someone uses repeatable read, their code has to be prepared to retry transactions if any of them fail.


On its own, repeatable read prevents all of the transaction isolation problems but one. The only class of issues repeatable read does not prevent are serialization anomalies. That is what the serializable transaction isolation level is for. The serializable isolation level behaves exactly like repeatable read except it specifically detects when two transactions will not serialize correctly and will abort one of them to prevent a serialization anomaly. Like repeatable read, if you use serializable you should have code ready to retry aborted transactions.

To change the isolation level Postgres uses, you can set default_transaction_isolation to the desired level. All of the examples in the rest of this post make use of the repeatable read isolation level, unless explicitly mentioned otherwise.
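As a quick sketch, here are two ways you might do that, either session-wide or for a single transaction:

-- Make repeatable read the default for new transactions in this session.
SET default_transaction_isolation TO 'repeatable read';

-- Or set the level for just one transaction.
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- ... statements here run under repeatable read ...
COMMIT;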

Now that we’ve got an understanding of the semantics behind the other isolation levels, let’s take a look at how they affect the examples in part 1.

Non-Repeatable Reads

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n 
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = 2;
S2> COMMIT;

S1> SELECT * FROM ints;
 n 
---
 1

S1> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n 
---
 2

S3> COMMIT;

With the repeatable read isolation level, S1 avoids the non-repeatable read since, unlike read committed, the second query makes use of the snapshot created at the start of the transaction. This is where repeatable read gets its name: it can be used to avoid non-repeatable reads.

Lost Updates

Due to the semantics around conflicting updates, repeatable read does prevent lost updates, but it does so in a less than ideal way:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints;
 n
---
 1

S2> UPDATE ints SET n = 2; -- Computed server side.
S2> COMMIT;

S1> UPDATE ints SET n = 2; -- Computed server side.
ERROR:  could not serialize access due to concurrent UPDATE

S1> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2

S3> COMMIT;

What happened here is that since two UPDATEs attempted to modify the same row at the same time, Postgres aborted one of them to prevent a lost update. For this reason, if you are just trying to avoid lost updates, you should prefer to use row level locks with SELECT … FOR UPDATE under read committed. That will allow both UPDATEs to be performed, without either UPDATE being lost or aborted.

Phantom Reads

Repeatable read eliminates phantom reads for the same reason it eliminates non-repeatable reads:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0

S1> BEGIN;
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> SELECT count(*) FROM ints;
 count
-------
     0

S0> COMMIT;

S2> BEGIN;
S2> SELECT COUNT(*) FROM ints;
 count
-------
     1

S2> COMMIT;

Preventing phantom reads is one reason why you would prefer to use the repeatable read isolation level instead of row level locks.

Skipped Modification

Just like with a lost update, repeatable read will abort one of the transactions in a skipped modification:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> INSERT INTO ints SELECT 2;
S0> SELECT * FROM ints;
 n
---
 1
 2

S0> COMMIT;

S1> BEGIN;
S1> UPDATE ints SET n = n+1;

S2> BEGIN;
S2> DELETE FROM ints WHERE n = 2;
-- S2 blocks since the DELETE is trying to modify a row
-- currently being updated.

S1> COMMIT;
-- S2 aborts with the error:
ERROR:  could not serialize access due to concurrent update

S2> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2
 3

S3> COMMIT;

S2 is aborted for the same reason S1 is aborted in the lost update example. The DELETE tries to modify a row which was modified after the snapshot was taken. Since the version of the row in the snapshot is out of date, Postgres aborts the transaction to prevent any isolation issues.

Serialization Anomalies

Unlike all of the other interactions, repeatable read does not eliminate serialization anomalies:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)
 
S1> BEGIN;
S1> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)
 
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;
 
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

Fortunately, if you do want to completely prevent serialization anomalies, you can use the serializable isolation level. If we use serializable in the example instead of repeatable read, here is what happens:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0

S1> BEGIN;
S1> SELECT count(*) FROM ints;
 count
-------
     0

S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> INSERT INTO ints SELECT 1;
ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.

S0> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 1

S3> COMMIT;

What happened here is that Postgres detected that the pattern of reads and writes wouldn’t serialize properly, so it aborted one of the transactions.


Like row level locks, the repeatable read and serializable isolation levels are simple, yet at the same time they introduce a lot of complexity. Use of either repeatable read or serializable dramatically increases the chances that any given transaction will fail, making code that interacts with the database much more complicated, and database performance much less predictable.

In general, if you can, you should try to use the read committed isolation level and write your code in such a way that you don’t run into the different isolation issues mentioned in part 1. If you absolutely have to, you can use the tools mentioned in these last two posts to fend off all of the isolation issues.

Postgres Row Level Locks

This post is part 2 in a three part series exploring transaction isolation in Postgres. Part 1 introduced some of the problems that come up under the default isolation settings. This part and the next one introduce common ways of eliminating those problems.

As mentioned in part 1, there are many different cases where concurrent transactions can interact with each other. One of the most basic ways to avoid a few of the interactions is to use row level locks. Row level locks are a way of preventing concurrent modification to a row. Let’s take a look at how row-level locks change the behavior of each of the examples in the part 1:

Non-Repeatable Reads

To recap, a non-repeatable read is when a single transaction rereads a single row twice and finds it to be different the second time. Here’s the example from part 1 demonstrating non-repeatable reads:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = 2;
S2> COMMIT;

S1> SELECT * FROM ints;
 n
---
 2

S1> COMMIT;

If S1 were to acquire a row level lock on the row as soon as it first read the row, S2 would be unable to update the row as long as S1 holds the row level lock. To acquire row level locks with a SELECT statement, you just add FOR SHARE to the end of the SELECT. The lock will be released once the transaction commits. Rewriting the example with FOR SHARE, it now looks like:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints FOR SHARE;
 n
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = n+1;
-- Blocks because the row being updated is locked by 
-- another transaction.

S1> SELECT * FROM ints;
 n
---
 1

S1> COMMIT;
-- The UPDATE in S2 completes since S1 released its lock
-- on the row.

S2> COMMIT;

What FOR SHARE does is acquire a “read lock” on each of the rows returned by the SELECT statement. Once the read locks are acquired, no other transaction will be able to update or delete the rows until the transaction holding the read locks commits and releases the locks. Multiple transactions can possess read locks on the same row, and it is impossible for any transaction to update a row as long as at least one other transaction has a read lock on the row.

Lost Updates

Since multiple transactions can acquire read locks on a single row at the same time, FOR SHARE doesn’t exactly handle lost updates well. Here’s the example of a lost update from part 1 rewritten to use FOR SHARE:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints;
 n
---
 1

S2> UPDATE ints SET n = 2; -- Computed server side.
-- Blocks because S1 has a read lock on the row.

S1> UPDATE ints SET n = 2; -- Computed server side.
ERROR:  deadlock detected

S1> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2

S3> COMMIT;

What happened here is the read lock acquired by S1 is preventing the UPDATE in S2 from running and the read lock acquired by S2 is preventing the UPDATE in S1 from running. Postgres detects that this is a deadlock and aborts the transaction in S1.

To get around this issue, you’ll want to use a variation of FOR SHARE called FOR UPDATE. FOR UPDATE also acquires locks on the rows being selected, but instead of acquiring read locks on each row, it acquires write locks. A write lock is similar to a read lock, but as long as a transaction has a write lock on a row, no other transaction can have a read or write lock on that row. In fact, a write lock is what UPDATE and DELETE grab before they modify a row. Let’s take a look at what happens when we use FOR UPDATE instead of FOR SHARE:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints FOR UPDATE;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints FOR UPDATE;
-- Blocks because S1 has a write lock.

S1> UPDATE ints SET n = 2; -- Computed server side.
S1> COMMIT;
-- S1 releases the write lock and S2 unblocks. 
-- The SELECT in S2 returns:
 n
---
 2

S2> UPDATE ints SET n = 3; -- Computed server side.
S2> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 3

S3> COMMIT;

By using FOR UPDATE, S1 signals it’s about to modify the rows it is selecting. S2 also wants to modify the row, so when it sees S1 is about to modify the row, it waits for S1 to complete before reading the row. This guarantees that S2 sees an up to date value. In effect, by using FOR UPDATE, S1 makes the read and write performed on the row happen in a single step.

Phantom Reads

Although row level locks are able to prevent non-repeatable reads, they are unable to prevent phantom reads, even though the two seem to be similar issues. If we add FOR SHARE to the example of a phantom read from part 1:

S0> BEGIN;

S0> SELECT count(*) 
S0> FROM (SELECT * FROM ints FOR SHARE) sub;
 count
-------
     0

S1> BEGIN;
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> SELECT count(*) FROM ints;
 count
-------
     1

S0> COMMIT;

The phantom read still happens because FOR SHARE only acquires locks on the rows already in the table. It does not prevent new rows from being inserted into the table.

Skipped Modification

Row level locks don’t help here. The UPDATE and DELETE statements already grab row level write locks on the rows being modified. FOR SHARE and FOR UPDATE cannot be used since no SELECT statement is involved in a skipped modification.

Serialization Anomalies

Row level locks do not help here for the same reason they don’t help phantom reads. There’s nothing to prevent the INSERT statements from inserting the new rows.


Overall, row level locks are an easy way of handling two of the isolation issues that come up. Fortunately, those two are the ones that come up most commonly. The downside of row level locks is that, while easy to use, they do add a fair amount of complexity and overhead. You now have to worry about one transaction holding a lock for too long, which you mostly wouldn’t have to worry about otherwise. Next, we’ll take a look at a different approach to solving some of the interactions – the other transaction isolation levels.

Postgres Transactions Aren’t Fully Isolated

This post is part 1 of a three part series examining transaction isolation in Postgres. This post covers the issues that come up under the default Postgres settings. Parts 2 and 3 will cover the different ways to eliminate the problems.

Isolation is one of the main tenets of ACID. A database’s transactions are considered to be fully isolated if two transactions performed at the same time do not interact with each other in any way. A formal definition of full isolation is that a database’s transactions are fully isolated if, when multiple transactions are performed at the same time, there exists an ordering of the transactions such that executing the transactions concurrently behaves the same as executing them one by one in that order.

By default, Postgres does not provide full transaction isolation (the same holds true for most SQL databases under their default settings). That means there are a variety of cases where two transactions can interact with each other. These cases are usually unexpected by Postgres users and commonly result in bugs in those users’ code. Let’s first take a look at the exact isolation semantics Postgres provides. Then we can take a look at the different interactions and use the isolation semantics to explain why each interaction occurs.

For the isolation semantics, you will only need to understand how Postgres handles SELECT statements and how Postgres handles statements that modify data such as UPDATE and DELETE.

SELECT

Whenever a SELECT statement is run in Postgres, Postgres takes a snapshot of the current state of the database and runs the SELECT against that snapshot. The effects of all transactions that committed before the snapshot are visible to the SELECT statement. On the other hand, the effects of all transactions that either failed or committed after the snapshot was taken are completely invisible to the SELECT statement. It’s important to note that a new snapshot is created for each SELECT statement and not for each transaction. That means two SELECT statements in the same transaction will each have their own snapshot of the database.

UPDATE and DELETE

UPDATE and DELETE statements first create a snapshot of the database and use that snapshot to find all rows that match the WHERE clause associated with the statement. Once a row is found, the statement will try to lock it. If any of the rows found are already locked by a different UPDATE or DELETE statement in another transaction, the UPDATE or DELETE statement will wait for the other transaction to commit or abort.

If the other transaction aborts, the UPDATE or DELETE statement will act as if the other transaction had never existed. Otherwise, if the other transaction commits, the UPDATE or DELETE will reevaluate the WHERE clause on the new version of the row and update/delete the new version if it satisfies the predicate. If the new version does not satisfy the WHERE clause, it will be left unmodified. This behavior is used because it is a reasonable way to handle multiple concurrent modifications to the same row.


Now that we’ve got an idea of the semantics behind SELECT, UPDATE, and DELETE, let’s take a look at a few of the different cases where two transactions can interact with each other. In the examples below, I use S0, S1, … to refer to different sessions running concurrently. All of the examples make use of a table ints of bigints that starts out empty at the beginning of each example.

Non-Repeatable Reads

A non-repeatable read is when a transaction reads the exact same row twice, but finds the row to be different the second time it is read. As an example:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = 2;
S2> COMMIT;

S1> SELECT * FROM ints;
 n
---
 2

S1> COMMIT;

In this case, even though both of the SELECT *‘s were run in the same transaction, they returned different results. This is due to each SELECT statement in the transaction creating a new snapshot instead of there being a single snapshot per transaction. The first SELECT creates a snapshot where n=1 and returns that row. Then a second transaction runs an UPDATE command that changes n to 2. Then the second SELECT creates a new snapshot where n=2 and returns that.

Lost Updates

It is a somewhat common pattern (which should be avoided if possible) for a server to retrieve a value from the database, compute a new value based on the old value in the database, and write that new value back into the database. If two of these operations happen simultaneously, it can result in what’s known as a “lost update”1. As a simple example, let’s say two different servers try to increment the value in the ints table at the same time:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints;
 n
---
 1

S2> UPDATE ints SET n = 2; -- Computed server side.
S2> COMMIT;

S1> UPDATE ints SET n = 2; -- Computed server side.
S1> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2

S3> COMMIT;

With the example above, session S1 had read the value 1. Then S2 increments the value in ints and sets it to 2. When S1 goes to set the value in ints, it doesn’t know that the value had been incremented by S2 after S1 had read it, so once S1 sets the value in ints to 2, the increment done by S2 was completely lost.

Phantom Reads

A phantom read occurs when the set of rows found by a statement change due to another transaction inserting new rows. A simple example is the following:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0

S1> BEGIN;
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> SELECT count(*) FROM ints;
 count
-------
     1

S0> COMMIT;

In Postgres, phantom reads have the same cause as non-repeatable reads, namely snapshots being created per query instead of per transaction. Historically phantom reads are considered different from non-repeatable reads because in some databases the causes between the two are completely different. Phantom reads are also a slightly more difficult problem to solve, including in Postgres.

Skipped Modification

(I couldn’t find the actual name for this kind of interaction, so I’m just calling it a skipped modification. If you know the actual name, I would appreciate you leaving a comment at the bottom of this page.)

A skipped modification can happen due to how Postgres handles concurrent modifications of a single row. It’s easiest if we start with an example and examine what exactly is going on. This example was taken from the Postgres docs:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> INSERT INTO ints SELECT 2;
S0> SELECT * FROM ints;
 n
---
 1
 2

S0> COMMIT;

S1> BEGIN;
S1> UPDATE ints SET n = n+1;

S2> BEGIN;
S2> DELETE FROM ints WHERE n = 2;
-- S2 blocks since the DELETE is trying to modify a row 
-- currently being updated.

S1> COMMIT;
-- S2 unblocks.

S2> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2
 3

S3> COMMIT;

When I first saw this, I couldn’t believe that this was possible! The session S0 inserts two rows, one with the value 1 and one with the value 2. Then session S1 runs an UPDATE that increments the value of all of the rows by 1. Then session S2 runs a DELETE and deletes all rows with the value 2. Intuitively, the DELETE statement should delete at least one row. It should either run before the UPDATE statement and delete the row that started with 2, or it should run after the UPDATE statement and delete the row that started with 1. Once both transactions commit and S3 displays the results, it appears the DELETE statement deleted neither row!

This is due to how concurrent UPDATE/DELETE‘s are handled. The DELETE first looks for rows with n=2. It finds the one row that started with n=2 and attempts to lock it. Since that row is currently being updated by another transaction and is already locked, the DELETE waits for the transaction containing the UPDATE to finish. Once that transaction finishes, the row now has n=3. The DELETE statement now rechecks the WHERE clause against the row, sees that it’s now false, and ignores the row and finishes. Because of the rules for concurrent UPDATE/DELETE‘s, a row is only touched if it satisfies the condition both before and after it’s updated!

Serialization Anomalies

A serialization anomaly occurs when the results of two transactions are correct independently of each other, but it’s impossible to execute the transactions in any order one by one and achieve the same results. As an example:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)

S1> BEGIN;
S1> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)

S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

There is no possible way to serialize S0 and S1. There are only two possible orderings: either S0 ran first and then S1 did, or S1 ran first and then S0 did. If S0 ran first, the INSERT at the end of S0 came before the count(*) at the beginning of S1, which returned zero rows. If S1 ran first, the INSERT in S1 came before the count(*) in S0, which returned zero rows. Either way, a count(*) reported zero rows after a row had been inserted into the table.


When I first saw these examples, I was astounded at how unexpected some of the results were. In my next posts, we’ll take a look at some of the different tools for dealing with each of these different cases. That includes both the stronger isolation levels than the default provided by Postgres, as well as explicit row-level locking.

Postgres Window Functions

Postgres window functions are an extremely useful feature. They let you calculate a value for each row based on the values of other rows. For our examples, let’s say we have a table ints full of random numbers:

CREATE TABLE ints (n bigint);

INSERT INTO ints 
SELECT floor(random() * 1000000) 
FROM generate_series(1,1000);

The most basic window functions behave similarly to regular aggregations. The difference is that with a window function, the result of the aggregation is included in every row instead of the rows being collapsed into a single result. Since the result is included in each row, you can also select regular columns out of each row. To change a regular aggregation function into a window function, you just add “OVER ()” after the aggregation function. For example, here is a query that includes the sum across all rows in each row, along with the value in each row:

> SELECT n, sum(n) OVER () FROM ints;

   n    |    sum    
--------+-----------
 481023 | 498397678
 772520 | 498397678
 709081 | 498397678
 292436 | 498397678
...

This allows you to easily express certain calculations. For example, you can calculate what fraction of the total sum each row is:

> SELECT n, n / (sum(n) OVER ()) AS fraction
 FROM ints;

   n    |          fraction          
--------+----------------------------
 481023 |     0.00096513892667052113
 772520 |     0.00155000722134182977
 709081 |     0.00142272131532683425
 292436 |     0.00058675233234132363
...

Window functions also have an equivalent to GROUP BY. By adding “PARTITION BY <expression>” inside of the parens after the OVER, you can calculate the window function over different subsets of the data. As an example, here is the above query, but it instead tells you what fraction each value is of the total sum of all numbers with the same last digit:

> SELECT n, n / (sum(n) OVER (PARTITION BY n % 10)) AS fraction 
FROM ints;

   n    |          fraction
--------+----------------------------
 457940 |     0.00951268915957674204
 595290 |     0.01236583117832999688
 111670 |     0.00231969690013961389
 830300 |     0.01724764337947453579
...

From this query, we can tell that 595290 is ~1.2% of the total sum of all numbers in the table that end in 0.

So far we have seen examples that aren’t too difficult to replicate with ordinary SQL. Window functions become harder to replicate with ordinary SQL when you introduce ORDER BY into the OVER clause. First of all, introducing ORDER BY completely changes the behavior of all regular aggregates. Now, instead of including the total aggregation in each row, a rolling aggregation is included. For example, using ORDER BY with sum gives you a rolling sum:

> SELECT n, SUM(n) OVER (ORDER BY n ASC) FROM ints;

   n    |    sum    
--------+-----------
    689 |       689
   1197 |      1886
   1201 |      3087
   3405 |      6492
...

You can also combine ORDER BY and PARTITION BY to get a rolling sum of the numbers with a given last digit:

> SELECT n, SUM(n) OVER (PARTITION BY n % 10 ORDER BY n ASC) FROM ints;

   n    |   sum    
--------+----------
  16900 |    16900
  23230 |    40130
  26540 |    66670
  42310 |   108980
...
   1201 |     1201
  18371 |    19572
  19221 |    38793
  36371 |    75164
...

I haven’t mentioned it yet, but there are actually many functions that are designed specifically for use as window functions. These functions aren’t available as regular aggregation functions; instead, they return values based on the row’s position relative to the other rows. Two of these functions that deserve special attention are row_number and lag. The function row_number tells you each row’s ranking among the other rows based on the ordering used. For example, for the first row according to the ordering, row_number will return 1, for the second it will return 2, and so on:

> SELECT n, row_number() OVER (ORDER BY n ASC) FROM ints;

   n    | row_number 
--------+------------
    689 |          1
   1197 |          2
   1201 |          3
   3405 |          4
...

The function lag, on the other hand, evaluates an expression against the nth previous row according to the ordering. A simple calculation you can perform with lag is the difference between each row and the previous row:

> SELECT n, n - lag(n, 1) OVER (ORDER BY n ASC) AS diff FROM ints;

   n    | diff 
--------+------
    689 |     
   1197 |  508
   1201 |    4
   3405 | 2204
...

Overall, window functions are a pretty awesome feature. They dramatically increase the querying power of SQL and make it so much easier to express many different queries. Ultimately, window functions are another tool in the toolbox that Postgres provides that makes Postgres so great.

SELECT DISTINCT ON … ORDER BY

A common problem I’ve seen that many people are unfamiliar with how to solve is obtaining the rows that have the maximum of one value grouped by another value. For example, you may have a pets table with columns name, age, and owner_id, and you may want to obtain the name of the oldest pet for each owner. This seems like a problem that could be solved by a GROUP BY, but it actually cannot be solved using the built-in aggregation functions1. If you try writing a query with a GROUP BY, you’ll wind up with something like:

SELECT owner_id, <agg for the name of the oldest pet>
FROM pets
GROUP BY owner_id;

Believe it or not, there is no built-in aggregation that can be used to get the desired result. The easiest way to solve this problem is with SELECT DISTINCT ON … ORDER BY2. The general format of such a query looks like:

SELECT DISTINCT ON (<group by columns>) <selected columns>
FROM <table>
ORDER BY <group by columns>, <order by columns>;

In this specific case, a query that solves the problem is:

SELECT DISTINCT ON (owner_id) owner_id, name
FROM pets
ORDER BY owner_id, age DESC;

Let’s break down this query and see how it works. The first part you need to understand is the behavior of SELECT DISTINCT ON. SELECT DISTINCT ON returns one row for every unique combination of values of the columns in the parens. For example, the query:

SELECT DISTINCT ON (owner_id) owner_id, name FROM pets;

Will return the owner_id and name of one pet belonging to each distinct pet owner. One important detail of the behavior of SELECT DISTINCT ON is that it will always return the first row it encounters for each distinct pet owner. That means you can use SELECT DISTINCT ON with ORDER BY to get the row with the highest/lowest value of a given column for each combination of the values being distincted on. The query above uses this fact to get the pet with the oldest age for each owner.

One additional detail with the query is you need to include the columns being distincted on in the ORDER BY. This is necessary because of how Postgres executes a query with DISTINCT ON internally. Postgres executes such a query by first sorting by the fields being distincted on. Sorting the rows this way places all rows with the same values of the fields being distincted on together. Postgres can then just iterate through all of the rows and add a new row to the result of the query whenever it sees a row whose values of the columns differ from the row before it. SELECT DISTINCT ON … ORDER BY takes advantage of this behavior by first sorting by the columns being distincted on and then sorting by the columns being ordered by. That way all rows with the same values of the fields being distincted on are clumped together, and the first row in each clump is the row desired in the output.
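To make this concrete, here is a quick sketch using made-up data (the table definition and rows below are hypothetical, and the output is what you should expect to see):

CREATE TABLE pets (name text, age bigint, owner_id bigint);

INSERT INTO pets (name, age, owner_id) VALUES
    ('Rex',      3, 1),
    ('Fido',     7, 1),
    ('Whiskers', 2, 2);

SELECT DISTINCT ON (owner_id) owner_id, name
FROM pets
ORDER BY owner_id, age DESC;

 owner_id |   name
----------+----------
        1 | Fido
        2 | Whiskers
(2 rows)

For owner 1, the rows are sorted so that Fido, the oldest pet, comes first, so Fido is the row kept for that owner.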

The Missing Postgres Scan: The Loose Index Scan

Postgres has many different types of scans built in. The list includes sequential scans, index scans, and bitmap scans. One useful scan type Postgres does not have built in that other databases do is the loose index scan. Both MySQL and Oracle support loose index scans. Fortunately for Postgres users, loose index scans can be emulated through a recursive CTE.

At a high level, the idea behind the loose index scan is that the rows necessary for the query are at predictable locations in the index. All of the index scans Postgres currently supports require reading all of the values from the index between two values. Although Postgres does not do it, for certain queries it is possible to skip over large parts of the index because the values in those parts can be determined not to affect the result of the query. For a specific example, let’s say we want to calculate the number of distinct elements in a table:

CREATE TABLE ints (n BIGINT);

INSERT INTO ints SELECT floor(random() * 10) 
FROM generate_series(1, 10000000);

CREATE INDEX ON ints (n);

EXPLAIN ANALYZE SELECT COUNT(DISTINCT n) FROM ints;

This example creates a table with 10,000,000 rows, all of which contain a random number from 0 to 9. Then it counts the number of distinct elements using a naive COUNT(DISTINCT n). The plan for the query is available here.

As you can see from the query plan, using a naive COUNT(DISTINCT n) takes six seconds. Of this time, 650 milliseconds is used to read the data from disk while the rest of the time is used to perform the aggregation and determine how many distinct elements there are. A sequential scan is only performed because the query requires reading all of the rows from the database and a sequential scan is the fastest way to do it.

Most of the work Postgres is doing for this query is unnecessary. For example, once Postgres sees a row with a zero, it knows all other rows that contain a zero will not affect the result. With this idea in mind, we can write the following query which performs a loose index scan:

> EXPLAIN ANALYZE
> WITH RECURSIVE temp (i) AS (
>     (SELECT n FROM ints ORDER BY n ASC LIMIT 1)
>   UNION ALL
>     (SELECT n FROM temp, 
>                    LATERAL (SELECT n
>                             FROM ints
>                             WHERE n > i
>                             ORDER BY n ASC
>                             LIMIT 1) sub)
> )
> SELECT COUNT(*) FROM temp;

The plan is available here.

This new query takes less than one millisecond! Let’s try dissecting this query. First of all, this query makes use of a recursive CTE. If you aren’t familiar with how to interpret recursive CTEs, I suggest you read my blog post on how to understand them. The first part of the recursive CTE is:

SELECT n FROM ints ORDER BY n ASC LIMIT 1

This simply gets the smallest number from the table. As for the recursive part of the recursive CTE:

SELECT n FROM temp, 
              LATERAL (SELECT n
                       FROM ints
                       WHERE n > i
                       ORDER BY n ASC
                       LIMIT 1) sub

This query takes the current element in the working table and uses a lateral join to fetch the smallest element greater than the current element, in other words, the next largest element. By first fetching the smallest value in the table and then repeatedly fetching the next largest, the recursive CTE selects the distinct elements from the table! From there, we just need to count the rows produced by the CTE to determine the number of distinct elements in the table.

By using an index to find each element, only a single index scan needs to be performed per distinct element. Since there are only 10 distinct elements in the table, only 10 index scans need to be performed1. The loose index scan is so much faster than the regular aggregation because once it sees a given value, it skips over all other rows with the same value.

Besides counting the number of distinct elements in a table, there is one other main use case where loose index scans are helpful: if you ever want to calculate the min of one column grouped by another column. If you have a specific value of the column you are grouping by, you can use a compound index on (grouping column, other column) to quickly find the smallest value of the other column for that one value. By iterating over the distinct values of the column you are grouping by, you can answer the query fairly quickly, as shown in the sketch below. Without using a loose index scan here, the query would read the entire table, which, as we already know, is pretty slow.
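Here is a rough sketch of what that might look like, assuming a hypothetical table pairs (g, v) with a compound index on (g, v). The recursive CTE walks the distinct values of g, and the correlated subquery grabs the minimum v for each one using the index:

CREATE TABLE pairs (g bigint, v bigint);
CREATE INDEX ON pairs (g, v);

WITH RECURSIVE distinct_gs (g) AS (
    (SELECT g FROM pairs ORDER BY g ASC LIMIT 1)
  UNION ALL
    (SELECT sub.g
     FROM distinct_gs,
          LATERAL (SELECT g
                   FROM pairs
                   WHERE g > distinct_gs.g
                   ORDER BY g ASC
                   LIMIT 1) sub)
)
SELECT g,
       (SELECT min(v) FROM pairs WHERE pairs.g = distinct_gs.g) AS min_v
FROM distinct_gs;

Each distinct value of g costs only a couple of small index scans, rather than a scan over every row in the table.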

Postgres Lateral Joins

Personally, lateral joins are one of my favorite Postgres features. They are simple, while at the same time they let you write queries that would be nearly impossible to write otherwise. I find it surprising lateral joins were only introduced into Postgres four years ago given how useful they are. I also find it surprising there are no good explanations of lateral joins online given how simple they are.

A lateral join is simply a foreach loop in SQL. There are actually two different ways of writing a lateral join. The simplest form, which is the one you probably want, looks like the following:

SELECT <columns>
FROM <table reference>,
     LATERAL <inner subquery>;

(For context, a table reference is either a table or a subquery.)

The above query will iterate through each row in the table reference and evaluate the inner subquery for each row, exactly like a foreach loop would. The rows returned by the inner subquery are then added to the result of the lateral join. The primary reason for using a lateral join is that the inner subquery can refer to the fields of the row from the table reference and use that to determine what rows to return. To help illustrate what this means, here is Python code that imitates a lateral join:

result = []
for row1 in table_reference():          # each row from the table reference
    for row2 in inner_subquery(row1):   # the inner subquery sees the current row
        result.append((row1, row2))     # each pair becomes a row of the join result

As you should be able to see, the rows iterated over in the inner for loop are a function of the current row being iterated over in the outer for loop.

As an example of where a lateral join is useful, let’s say we have a table people with an id column and an age column, and a table pets with columns id, owner_id and age. For the query, let’s say for each person over 30, we want to obtain the id of the oldest pet owned by that person. Anytime you hear the words “for each”, you are probably looking for a lateral join. This specific example can be done with the following query:

SELECT people_sub.id, pets_sub.id
FROM (SELECT id FROM people WHERE age > 30) people_sub,
     LATERAL (SELECT pets.id
              FROM pets
              WHERE pets.owner_id = people_sub.id
              ORDER BY pets.age DESC
              LIMIT 1) pets_sub;

For each person over 30, this query will evaluate the inner subquery and add the results of the subquery to the results of the full query. The inner subquery, given a row for a person, will find the id of the oldest pet owned by that person, if they own a pet. That’s all there is to the lateral join. Believe it or not, most of the other ways of writing the above query are much trickier and usually are not as efficient1.

The other syntax for writing a lateral join is the following:

SELECT <columns>
FROM <table reference>
     JOIN LATERAL <inner subquery>
ON TRUE;

The advantage of this form is that you can use any join qualifier (e.g. LEFT JOIN LATERAL). If, in the example above, we also wanted to include people who don’t have a pet at all, we would need to use a lateral left join to get those people. Most of the time though, the first lateral join form is good enough.
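For instance, a sketch of a left lateral join version of the earlier query might look like the following; people without a pet show up with a NULL pet id:

SELECT people_sub.id, pets_sub.id
FROM (SELECT id FROM people WHERE age > 30) people_sub
     LEFT JOIN LATERAL (SELECT pets.id
                        FROM pets
                        WHERE pets.owner_id = people_sub.id
                        ORDER BY pets.age DESC
                        LIMIT 1) pets_sub
ON TRUE;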

To recap, a lateral join is basically a foreach loop in SQL. Anytime you want to perform some calculation for each row in a subquery, you should try to see if a lateral join suits your use case. Once you understand a lateral join, you’ll realize it makes all sorts of queries much easier to write.

Understanding Postgres Recursive CTEs

Unlike ordinary CTEs, which can be thought of as syntactic sugar, recursive CTEs make SQL a more powerful language. As the name suggests, a recursive CTE makes it possible to express recursion in SQL. The ability to express recursion makes recursive CTEs useful for querying any kind of hierarchical data. Additionally, recursive CTEs even make SQL Turing complete!

In general, most queries that make use of a recursive CTEs look similar to the following template1:

WITH RECURSIVE <name> (<columns>) AS (
    <initial query>
  UNION ALL
    <recursive query>
)
<query>

As an example, here is a query that makes use of a recursive CTE to generate the integers from 1 to 100:

WITH RECURSIVE ints (n) AS (
    SELECT 1
  UNION ALL
    SELECT n+1 FROM ints WHERE n+1 <= 100
)
SELECT n FROM ints;

A recursive CTE executes according to the following steps:

  1. Evaluate the initial query and store the results in the “working table”. The working table is simply a temporary table that is used during the evaluation of the recursive CTE.
  2. Exit if the working table is empty.
  3. Add all of the rows in the working table to the result of the recursive CTE2.
  4. Evaluate the recursive query on the working table and replace the working table with the results of the recursive query. In the example above, that means the working table is substituted in for ints in the recursive query.
  5. Goto step 2.

Once the recursive CTE has been evaluated, the results of it can be used in the rest of the query like an ordinary CTE3.

Let’s walk through the example I gave above. First the recursive CTE evaluates the initial query which is just SELECT 1, and stores the results, a single row with the value 1, in the working table. Then since the working table is not empty, the recursive CTE proceeds to step 3 which adds the row with value 1 to the result of the recursive CTE. Next, the recursive query:

SELECT n+1 FROM ints WHERE n+1 <= 100

Is run on the working table. That means the recursive query is run with ints referring to the current working table. In this case, since the working table is only a single row with the value 1, the recursive query returns a single row with the value 2. This new row becomes the new working table. Steps 2-4 then repeat with the working table containing only a single row with the value 2. This process repeats with the values 3, 4, … until the working table has a single row with the value 100.

This time, the recursive query returns no rows, due to the filter condition in the recursive query. It’s at this point the recursive CTE terminates. During the execution of the CTE, each of the values 1 to 100 was added to the result set, so the recursive CTE terminates with the values 1-100 as the result. I also wrote a Python snippet equivalent to the above recursive CTE which you can find here. Looking at it may help with understanding how to think about recursive CTEs.

Let’s now look at a more complicated example. An example that actually includes hierarchical data. Let’s say we have an employees table that models the management hierarchy at a business. The table has three columns: an id column, a name column, and a manager_id column. Let’s say we want to find everyone who works under Alice (someone who reports to someone who reports to someone … who reports to Alice), including Alice herself. That can be done with the following query:

WITH RECURSIVE temp (name, id) AS (
    SELECT name, id
    FROM employees 
    WHERE name = 'Alice'
  UNION ALL
    SELECT employees.name, employees.id 
    FROM temp, employees
    WHERE employees.manager_id = temp.id
)
SELECT name FROM temp;

Let’s look at the individual queries that make up the recursive CTE in the example. The initial query finds the row for Alice and adds it to the working table. The recursive query takes all people in the working table and uses a join to find all people who report to someone in the working table. Going through the steps above for the execution of a recursive CTE, what this query does is first find the row for Alice and puts it in the working table. Then the recursive query finds all people who report to Alice. Next the recursive query finds all people who report to someone who reports to Alice. This process repeats until the query has found all people who work under Alice.
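As a quick illustration with made-up data (the ids and names below are hypothetical), the query above would behave like this:

CREATE TABLE employees (id bigint, name text, manager_id bigint);

INSERT INTO employees (id, name, manager_id) VALUES
    (1, 'Alice', NULL),
    (2, 'Bob',   1),
    (3, 'Carol', 1),
    (4, 'Dave',  2);

-- Running the recursive query above then returns Alice and
-- everyone who works under her (the exact row order may vary):
 name
-------
 Alice
 Bob
 Carol
 Dave
(4 rows)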

There are all sorts of surprising computations you can perform with recursive CTEs. One of my favorite examples is a SQL query that generates the Mandelbrot Set. This is mainly possible because recursive CTEs make SQL Turing complete. With recursive CTEs, it is possible to perform any computation you want with SQL!

Postgres CTEs

CTEs, also known as Common Table Expressions, are a way of refactoring a query. They let you give the result of a subquery a name and reuse the result multiple times. The general syntax of a CTE looks like the following:

WITH <name1> AS (<subquery1>),
     <name2> AS (<subquery2>),
     ...
     <nameN> AS (<subqueryN>)
  <query>

This will first evaluate each of the subqueries and assign the results of each query to the corresponding name. Then within the final query, you can use each name to refer to the results of the corresponding query.

As an example, I frequently do experiments where I try different combinations of indexes. To test which configuration is better, I’ll run many different queries against both configurations. To analyze the results, I create a table runs which contains the result of each run of each query. The table has the columns query_id, which is the id of the query being run, config, which specifies which set of indexes was used, and duration, which is the duration of that run of the query. To get the ten queries which sped up the most going from one configuration to the other, I’ll use a query similar to the following:

WITH a_runs AS (SELECT * FROM runs WHERE config = 'a'),
     b_runs AS (SELECT * FROM runs WHERE config = 'b')
  SELECT a_runs.query_id, a_runs.duration, b_runs.duration
  FROM a_runs, b_runs
  WHERE a_runs.query_id = b_runs.query_id
  ORDER BY a_runs.duration - b_runs.duration DESC
  LIMIT 10;

This query uses a CTE to define a_runs and b_runs as the runs of queries under configuration a, and the runs under configuration b respectively. From then on the query uses a_runs and b_runs to join the results of each configuration together.

There are two main differences between a CTE and just substituting the subqueries into the final query. First of all, a CTE can usually make a query easier to read. This alone is a pretty good reason for using CTEs. Second, using a CTE is currently equivalent to creating new tables with the results of each subquery, and then using the corresponding table every time you use the name of that subquery. This behavior has both some pros and some cons.

This can be advantageous, because it guarantees that each subquery is only evaluated once. This allows you to reuse the results of each subquery without reevaluating the subquery. The main downside of this behavior of CTEs is that it prevents Postgres from making use of certain query plans. There are some cases where Postgres can use information outside of the subquery to optimize the subquery and avoid doing some work. CTEs prevent Postgres from doing any of this1. If you want to read more about CTEs as an optimization fence, 2nd Quadrant has a blog post that goes into more detail.