Postgres JSONB

JSONB is a nifty Postgres type that allows you to store unstructured data inside your database. A common use case of JSONB is to represent a mapping from a set of keys to arbitrary values; JSONB is nice for this because the set of keys can be completely different for each stored value. It is also possible to express hierarchical data through JSONB.
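For instance, here is a small, purely illustrative example (the document and keys below are made up, not from any real schema) of storing a nested document and pulling a field back out with the JSONB path operator:

SELECT '{"user": {"name": "ada", "tags": ["admin", "beta"]}}'::jsonb
           #>> '{user,name}' AS name;
-- Returns ada. The #>> operator extracts the value at the given path
-- as text, while ->> does the same for a single top-level key.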

As an example of where JSONB is incredibly useful, the company I work for, Heap, makes heavy use of it. At Heap, we use JSONB to store events that happen on our customers’ websites. These events include pageviews and clicks, as well as custom events created by our customers. All of these different kinds of events have completely different properties, which makes JSONB a great tool for our use case. More concretely, we have an events table with a fairly simple schema:

CREATE TABLE events (
    user_id bigint,
    time bigint, 
    data jsonb
);

With JSONB, this simple schema is able to take care of most of our use cases. For example, a click event on the “login” button may look something like the following:

INSERT INTO events
SELECT 0 AS user_id,
       1498800692837 AS time,
       '{"type": "click",
         "target_text": "login",
         "page": "/login"}'::jsonb AS data;

And a pageview on the homepage may look like:

INSERT INTO events
SELECT 1 AS user_id,
       1498800692837 AS time,
       '{"type": "pageview",
         "page": "/home",
         "referrer": "www.google.com"}'::jsonb AS data;

JSONB lets us easily express all of these different kinds of events. Then when we want to query the data, it’s fairly easy to get the values out of the data column. For example, if we want to see which pages are viewed most frequently, we can run a query such as:

SELECT (data ->> 'page'), count(*)
FROM events
WHERE (data ->> 'type') = 'pageview'
GROUP BY (data ->> 'page');

We use this same general idea to power all of the analysis Heap is able to perform. This includes funnels (of people that did A, how many later did B) as well as retention queries (of people that did A, how many people did B within N weeks).

Of course, JSONB isn’t free. Due to our heavy use of JSONB, we’ve run into a decent number of issues with it. One problem is that the keys need to be repeated in every event, which winds up wasting a lot of space. I did an experiment where I pulled out most of the fields we store in JSONB and found that we could save ~30% of our disk usage by not using JSONB!

Another problem that is much worse is the lack of statistics. Normally, Postgres collects statistics about the different columns of a table, including a histogram of each column and an estimate of the number of distinct elements in the column. At query time, Postgres uses these statistics to determine what query plan to use. Currently, Postgres has no way of collecting statistics over the contents of a JSONB column. In certain cases, this leads Postgres to choose some very bad query plans. My manager goes into both of these issues in more depth in a blog post he wrote on our company blog.
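One hedged way to see the effect yourself is to compare the planner’s row estimate against the actual count for a JSONB predicate (using the events table and keys from earlier; the exact numbers will depend entirely on your data):

EXPLAIN ANALYZE
SELECT count(*)
FROM events
WHERE (data ->> 'type') = 'pageview';
-- Without statistics on the expression, the planner falls back to a
-- default selectivity estimate, so the estimated row count can be
-- wildly different from the actual count reported by ANALYZE.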

Depending on your exact needs, JSONB can be a godsend. It makes it easy to store whatever data you want in Postgres without worrying about an overarching common format for all of your data.

Avoiding Torn Pages

As covered in the post on the Postgres file layout, Postgres reads and writes data to disk 8kb at a time. Most operating systems use a smaller page size, such as 4kb. If Postgres is running on one of these operating systems, an interesting edge case can occur: since Postgres writes to disk in units of 8kb and the OS writes in units of 4kb, if the power goes out at just the right time, it is possible that only 4kb of an 8kb write Postgres was performing actually made it to disk. This edge case is sometimes referred to as “torn pages”. Postgres does have a way of working around torn pages, but it does increase the amount of I/O Postgres needs to perform.

Under normal circumstances, Postgres uses a technique called write-ahead logging (WAL) to prevent data loss. At a high level, WAL works by recording every change a transaction makes to the database in a log on disk before the changes themselves are persisted to disk. Since appending to a single sequential log is much cheaper than performing random writes all over the disk, WAL reduces the amount of I/O Postgres needs to perform without risking data loss. If Postgres crashes, it can recover all of the changes that weren’t yet persisted to disk by replaying the WAL.

Although keeping track of all of the changes made does allow Postgres to recover from common crashes, in which every write was either performed completely or not at all, it does not let Postgres recover from a torn page. Due to the specifics of how the WAL is implemented, in the case of a torn page the logged changes alone do not give Postgres enough information to determine which changes should be applied to each half of the page.

To recover from torn pages, Postgres does something called “full-page writes”. Whenever Postgres modifies a page for the first time after a checkpoint, it writes a full copy of the page to the WAL. That way, when using the WAL to recover from a crash, Postgres does not need to trust the contents of the page stored on disk; it can recover the entire state of the page from the WAL alone, sidestepping the problem of torn pages entirely! To avoid constantly writing full copies of every page, Postgres only logs a full copy the first time a page is modified after each checkpoint; until the next checkpoint, the ordinary WAL records for that page are enough to recover it.

There is actually a parameter, full_page_writes, that allows you to disable this behavior. If you care about preventing data corruption, there are very few cases in which you should disable it. The only real case is when the OS/filesystem has built-in protection against torn pages. For example, the ZFS filesystem provides its own transactional guarantees and prevents torn pages, largely due to its copy-on-write design.
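If you’re curious how your own server is configured, you can peek at the relevant settings (these are standard Postgres settings; full_page_writes defaults to on):

SHOW full_page_writes;
-- Full-page images only need to be written for the first change to a
-- page after each checkpoint, so checkpoint frequency affects how much
-- extra WAL this generates.
SHOW checkpoint_timeout;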

Postgres TOAST

TOAST, aka The Oversized Attribute Storage Technique, aka the best thing since sliced bread, is the technique Postgres uses to store large values. Under the normal Postgres data layout, every row is stored on an 8kb page, with no row spanning multiple pages. To support rows that can be larger than 8kb, Postgres makes use of TOAST.

If Postgres is about to store a row and the row is over 2kb, TOAST kicks in: Postgres first attempts to shrink the row by compressing its large variable-width fields. If the row is still over 2kb after compression, Postgres then moves the largest fields out of the row, one at a time, until the row is under 2kb. To store a field outside of the row, Postgres splits the (possibly compressed) value into chunks of ~2kb. Each of those chunks is then stored in a “TOAST table”. Every regular Postgres table has a corresponding TOAST table in which the TOASTed values are stored, and a value stored in this manner is commonly referred to as “TOASTed”.

Every TOAST table has three columns. The first is chunk_id, which identifies which TOASTed value each chunk belongs to; all chunks of the same TOASTed value share the same chunk_id. The chunk_id is what is stored in the original row, and it is what allows Postgres to find the chunks for a given value. The second column is chunk_seq, which determines the ordering of the chunks with the same chunk_id. The first chunk of a TOASTed value has chunk_seq=0, the second has chunk_seq=1, and so on. The last column is chunk_data, which contains the actual data for the TOASTed value.

At query time, when a TOASTed value is needed, Postgres uses an index on the TOAST table on (chunk_id, chunk_seq) to look up all of the chunks with a given chunk_id, sorted by chunk_seq. From there, it can stitch the chunks back together and decompress the result to obtain the original value.

Under certain circumstances, TOAST can actually make queries faster. If a TOASTed field isn’t needed to answer a query, Postgres doesn’t have to read the chunks for the TOASTed value, and can skip reading the value into memory. In some cases, this will dramatically reduce the amount of disk I/O Postgres needs to perform to answer a query.
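As a quick, hedged illustration (the documents table below is made up for this example, not something from earlier in the post), a query that never touches the large column never has to read its TOAST chunks:

CREATE TABLE documents (id bigint, title text, body text);

-- Only the main table's pages are read; the TOASTed body values are
-- never fetched or decompressed.
SELECT count(*), max(id) FROM documents;

-- This forces Postgres to look up and reassemble the chunks for every
-- large body value it returns.
SELECT body FROM documents;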

You can actually access the TOAST table for a given table directly and inspect the values stored in it. As a demonstration, let’s create a table messages that has a single column message:

CREATE TABLE messages (message text);

We can then insert a few random strings to be TOASTed:

-- Insert 10 rows, each containing a 10,000 character string of random
-- uppercase letters. Values this large will be TOASTed.
INSERT INTO messages
SELECT (SELECT
        string_agg(chr(floor(random() * 26)::int + 65), '')
        FROM generate_series(1,10000))
FROM generate_series(1,10);

Now that we have a table with values we know are TOASTed, we need to find the name of the TOAST table for the messages table. We can look it up with the following query:

> SELECT reltoastrelid::regclass 
> FROM pg_class 
> WHERE relname = 'messages';
      reltoastrelid      
-------------------------
 pg_toast.pg_toast_59611
(1 row)

The query pulls information from the pg_class Postgres table, which is a table where Postgres stores metadata about tables.

Now that we have the name of the TOAST table, we can read from it just like any other table:

> SELECT * FROM pg_toast.pg_toast_59611;
 chunk_id | chunk_seq | chunk_data
----------+-----------+------------
    59617 |         0 | \x4c4457...
    59617 |         1 | \x424d4b...
...

Note that chunk_data contains the binary data of the compressed field, so it isn’t exactly human readable.
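If you want to dig a little deeper, a query like the following (against the same TOAST table; the OID will be different on your system) shows how many chunks each message was split into and how many bytes it occupies after compression:

SELECT chunk_id,
       count(*) AS chunks,
       sum(length(chunk_data)) AS compressed_bytes
FROM pg_toast.pg_toast_59611
GROUP BY chunk_id;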

Overall, TOAST is a clever technique that reuses the ordinary Postgres storage technique to store larger values. It’s completely transparent to the user, yet if you really want to, you can dig into your TOAST tables and see exactly how Postgres is storing your data.

The File Layout of Postgres Tables

Although Postgres may seem magical, it really isn’t. When data is stored in Postgres, Postgres in turn stores that data in regular files in the filesystem. In this blog post, we’ll take a look at how Postgres uses files to represent data stored in the database.

First of all, each table in Postgres is represented by one or more underlying files. Each 1GB chunk of the table is stored in a separate file. It is actually pretty easy to find the underlying files for a table. To do so, you first need to find the Postgres data directory, which is the directory in which Postgres keeps all of your data. You can find where the data directory is by running SHOW DATA_DIRECTORY;. When I run it locally, I see the following:

> SHOW DATA_DIRECTORY;
        data_directory        
------------------------------
 /var/lib/postgresql/9.5/main
(1 row)

Now that you know where the Postgres data directory is, you need to find where the files for the specific table you are looking for are located. To do so, you can use the pg_relation_filepath function with the name of the table. The function returns the path of the table’s first file, relative to the data directory. Here is what I see when I run it on a table I have locally:

> SELECT pg_relation_filepath('people');
 pg_relation_filepath 
----------------------
 base/16387/51330
(1 row)

Together with the location of the data directory, this gives us the location of the files for the people table. All of the files are stored in /var/lib/postgresql/9.5/main/base/16387/. The first GB of the table is stored in a file called 51330, the second in a file called 51330.1, the third in 51330.2, and so on. You can actually read and write data to these files yourself, but I heavily suggest not doing so, as you will most likely wind up corrupting your database.
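If you’d rather not poke around the filesystem at all, you can also ask Postgres for the size of those files directly (a small aside, using the same people table):

-- Size of the table's main data files (the 51330, 51330.1, ... files).
SELECT pg_size_pretty(pg_relation_size('people'));

-- Size including the table's indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('people'));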

Now that we’ve found the actual files, let’s walk through how each file is laid out. Each file is broken up into 8kb chunks called “pages”. For example, a 1.5GB table will be stored across two files and 196,608 pages, and will look like the following:

Each row is stored on a single page (except when a row is too large, in which case a technique called TOAST is used). Pages are the unit in which Postgres reads and writes data to the filesystem. Whenever Postgres needs to read a row from disk to answer a query, it reads the entire page the row is on. When Postgres writes to a row on a page, it writes a whole new copy of the entire page to disk at once. Postgres operates this way for numerous reasons, which are outside the scope of this blog post.

Pages themselves have the following format:

The header is 24 bytes and contains various metadata about the page, including a checksum and information necessary for WAL. The row offsets section is a list of pointers into the rows section, with the Nth pointer pointing to the Nth row. The offsets can be used to quickly look up an arbitrary row of a page. If we emphasize the individual rows on the page, the page winds up looking like:

The first thing you likely noticed is that the first rows are stored at the back of the page. That is so the offsets and the actual row data can both grow towards the middle. If a new row is inserted, we can allocate a new offset from the front of the free space, and allocate the space for the row from the back of the free space.

As for the rows themselves, each row has a format that looks like the following:

The header of each row is 23 bytes and includes the transaction ids for MVCC as well as other metadata about the row. Based on the table schema, each field of the row is either a fixed width type or a variable width type. If the field is fixed width, Postgres already knows how long the field is and just stores the field data directly in the row.

If the field is variable width, there are two possibilities for how it is stored. Under normal circumstances, it is stored directly in the row with a header detailing how large the field is. In certain special cases, or when it’s impossible to store the field directly in the row, the field will be stored outside of the row using a technique called TOAST, which we will take a look at in my next post.
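If you want to see this layout for yourself, the pageinspect extension that ships with Postgres can dump the structures described above. Here’s a quick sketch, assuming the people table from earlier has at least one page of data and that you are allowed to create extensions:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The 24 byte page header, including the pointers marking where the row
-- offsets end (lower) and where the row data begins (upper).
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('people', 0));

-- One line per row offset: where each row starts on the page, how long
-- it is, and the MVCC transaction ids from its 23 byte row header.
SELECT lp, lp_off, lp_len, t_xmin, t_xmax
FROM heap_page_items(get_raw_page('people', 0));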

To recap, each row is stored on an 8kb page along with several other rows. Each page in turn is part of a 1GB file. While processing a query, when Postgres needs to fetch a row from disk, Postgres will read the entire page the row is stored on. This is, at a high level, how Postgres represents data stored in it on disk.

How to Write a Postgres SQL UDF

Being able to write a Postgres UDF (user-defined function) is a simple skill that goes a long way. SQL UDFs let you give a name to part or all of a SQL query and use that name to refer to that SQL code. It works just like any user-defined function in your favorite programming language.

As a simple example, in the last post we came up with a query for incrementing a counter in a table of counters:

INSERT INTO counters
SELECT <id> AS id, <amount> AS value
    ON CONFLICT (id) DO
    UPDATE SET value = counters.value + excluded.value;

When we used this query multiple times, we had to copy and paste it once for each time we used it. To avoid this problem, we could define a UDF that runs the query and then only increment the counters through the UDF. In general, most of the time when you define a SQL UDF, you’ll use code like the following:

CREATE OR REPLACE FUNCTION <function name>(<arguments>)
RETURNS <return type> AS $$
  <queries to run>
$$ LANGUAGE SQL;

This will define a UDF with the given name that runs the queries in the body whenever it is called. Inside of the queries, you’ll be able to refer to any of the arguments passed to the function. If we convert the query we had for incrementing a counter into a UDF, we wind up with the following UDF definition:

CREATE OR REPLACE FUNCTION 
increment_counter(counter_id bigint, amount bigint)
-- Use void as the return type because this function 
-- returns no value.
RETURNS void AS $$
  INSERT INTO counters
  SELECT counter_id AS id, amount AS value
      ON CONFLICT (id) DO
      UPDATE SET value = counters.value + excluded.value;
$$ LANGUAGE SQL;

With this UDF we can now use the UDF instead of the original query:

> SELECT * FROM counters;
 id | value 
----+-------
(0 rows)

> SELECT increment_counter(1, 10);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    10
(1 row)

> SELECT increment_counter(1, 5);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    15
(1 row)

> SELECT increment_counter(2, 5);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    15
  2 |     5
(2 rows)

> SELECT increment_counter(3, 20);

> SELECT * FROM counters;
 id | value 
----+-------
  1 |    15
  2 |     5
  3 |    20
(3 rows)

This is much better than what we had before.

One of the more interesting classes of UDFs is those that return rows instead of a single result. To define such a UDF, you specify TABLE (<columns>) as the return type. For example, if we wanted a UDF that returned the top N counters, we could define one as such:

CREATE OR REPLACE FUNCTION top_counters(n bigint)
RETURNS TABLE (id bigint, value bigint) AS $$
  SELECT * FROM counters ORDER BY value DESC LIMIT n;
$$ LANGUAGE SQL;

Then we can use it like:

> SELECT * FROM top_counters(2);
 id | value 
----+-------
  3 |    20
  1 |    15
(2 rows)

You can then use the function as part of a larger SQL query. For example, if you wanted to find the sum of the values of the top 10 counters, you could do that with the following straightforward SQL query:

SELECT sum(value) FROM top_counters(10);

To recap, UDFs are a great way to simplify SQL queries. I find them to be especially useful when I am reusing the same subquery in a bunch of different places.

Postgres Transaction Isolation Levels

This post is part 3 in a three part series exploring transaction isolation issues in Postgres. Part 1 introduced some of the common problems that come up with transaction isolation in Postgres. Part 2 introduced row level locks, one way of dealing with some of the problems around transaction isolation. This part introduces the different transaction isolation levels and how they affect the different problems introduced in part 1.

The main and easiest way to deal with the transaction isolation issues introduced in part 1 is to change the transaction isolation level. In Postgres, each transaction has its own isolation level, which determines the isolation semantics of the transaction. In other words, the transaction isolation level determines what state of the database the statements run within the transaction see, as well as how concurrent modifications to the same row are resolved.

Postgres provides a total of three different transaction isolation levels: read committed, repeatable read, and serializable. Read committed is the default and provides the semantics we’ve been discussing so far. The isolation semantics described in part 1 around SELECT, UPDATE, and DELETE are the semantics provided under read committed. The repeatable read and serializable isolation levels each provide a different set of semantics for SELECT, UPDATE, and DELETE. The semantics under repeatable read are as follows:

SELECT

Whenever a repeatable read transaction starts, it takes a snapshot of the current state of the database, and all queries within the transaction use this snapshot. This is in contrast to read committed, where each query gets its own snapshot. The effects of all transactions that committed before the transaction started are visible to all queries within the transaction, while all transactions that either failed or committed after the transaction started are invisible to all statements run in the transaction.

UPDATE and DELETE

Under repeatable read, UPDATE and DELETE use the snapshot created when the transaction started to find all rows to be modified. When an UPDATE or DELETE statement attempts to modify a row currently being modified by another transaction, it waits until the other transaction commits or aborts. If the other transaction aborts, the statement modifies the row and continues. On the other hand, if the other transaction commits, the UPDATE/DELETE statement will abort.

This behavior is completely different from that provided by read committed and somewhat surprising. In read committed, if two statements attempt to modify the same row at the same time, the modifications will be performed one after the other. In repeatable read, one of the statements will be aborted to prevent any isolation issues! This is generally the reason why people prefer read committed over repeatable read. When someone uses repeatable read, their code has to be prepared to retry transactions if any of them fail.


On its own, repeatable read prevents all of the transaction isolation problems but one. The only class of issues repeatable read does not prevent is serialization anomalies. That is what the serializable transaction isolation level is for. The serializable isolation level behaves exactly like repeatable read, except it specifically detects when two transactions will not serialize correctly and aborts one of them to prevent a serialization anomaly. Like repeatable read, if you use serializable you should have code ready to retry aborted transactions.

To change the isolation level Postgres uses, you can set default_transaction_isolation to the desired level. All of the examples in the rest of this post use the repeatable read isolation level unless explicitly mentioned otherwise.
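As a quick sketch of how to do that (both forms below are standard Postgres syntax), you can change the session default or pick a level for a single transaction:

-- Change the default for the current session (set it in postgresql.conf
-- to change the server-wide default).
SET default_transaction_isolation = 'repeatable read';

-- Or choose the level for just one transaction.
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM ints;
COMMIT;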

Now that we’ve got an understanding of the semantics behind the other isolation levels, let’s take a look at how they affect the examples in part 1.

Non-Repeatable Reads

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n 
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = 2;
S2> COMMIT;

S1> SELECT * FROM ints;
 n 
---
 1

S1> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n 
---
 2

S3> COMMIT;

With the repeatable read isolation level, S1 avoids the non-repeatable read since, unlike under read committed, the second query uses the snapshot created at the start of the transaction. This is where repeatable read gets its name: it can be used to avoid non-repeatable reads.

Lost Updates

Due to the semantics around conflicting updates, repeatable read does prevent lost updates, but it does so in a less than ideal way:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints;
 n
---
 1

S2> UPDATE ints SET n = 2; -- Computed server side.
S2> COMMIT;

S1> UPDATE ints SET n = 2; -- Computed server side.
ERROR:  could not serialize access due to concurrent UPDATE

S1> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2

S3> COMMIT;

What happened here is that since two UPDATEs attempted to modify the same row at the same time, Postgres aborted one of them to prevent a lost update. For this reason, if you are just trying to avoid lost updates, you should prefer to use row level locks with SELECT … FOR UPDATE under read committed. That will allow both UPDATEs to be performed, without either UPDATE being lost or aborted.

Phantom Reads

Repeatable read eliminates phantom reads for the same reason it eliminates non-repeatable reads:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0

S1> BEGIN;
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> SELECT count(*) FROM ints;
 count
-------
     0

S0> COMMIT;

S2> BEGIN;
S2> SELECT COUNT(*) FROM ints;
 count
-------
     1

S2> COMMIT;

Preventing phantom reads is one reason why you would prefer to use the repeatable read isolation level instead of row level locks.

Skipped Modification

Just like with a lost update, repeatable read will abort one of the transactions in a skipped modification:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> INSERT INTO ints SELECT 2;
S0> SELECT * FROM ints;
 n
---
 1
 2

S0> COMMIT;

S1> BEGIN;
S1> UPDATE ints SET n = n+1;

S2> BEGIN;
S2> DELETE FROM ints WHERE n = 2;
-- S2 blocks since the DELETE is trying to modify a row
-- currently being updated.

S1> COMMIT;
-- S2 aborts with the error:
ERROR:  could not serialize access due to concurrent update

S2> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2
 3

S3> COMMIT;

S2 is aborted for the same reason S1 is aborted in the lost update example. The DELETE tries to modify a row which was modified after the snapshot was taken. Since the version of the row in the snapshot is out of date, Postgres aborts the transaction to prevent any isolation issues.

Serialization Anomalies

Unlike all of the other interactions, repeatable read does not eliminate serialization anomalies:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)
 
S1> BEGIN;
S1> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)
 
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;
 
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

Fortunately, if you do want to completely prevent serialization anomalies, you can use the serializable isolation level. If we use serializable in the example instead of repeatable read, here is what happens:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0

S1> BEGIN;
S1> SELECT count(*) FROM ints;
 count
-------
     0

S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> INSERT INTO ints SELECT 1;
ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.

S0> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 1

S3> COMMIT;

What happened here is that Postgres detected that the pattern of reads and writes wouldn’t serialize properly, so it aborted one of the transactions.


Like row level locks, the repeatable read and serializable isolation levels are simple, yet at the same time they introduce a lot of complexity. Use of either repeatable read or serializable dramatically increases the chances that any given transaction will fail, making code that interacts with the database much more complicated, and database performance much less predictable.

In general, if you can, you should try to use the read committed isolation level and write your code in such a way that you don’t run into the different isolation issues mentioned in part 1. If you absolutely have to, you can use the tools mentioned in these last two posts to fend off all of the isolation issues.

Postgres Row Level Locks

This post is part 2 in a three part series exploring transaction isolation in Postgres. Part 1 introduced some of the problems that come up under the default isolation settings. This part and the next one introduce common ways of eliminating those problems.

As mentioned in part 1, there are many different cases where concurrent transactions can interact with each other. One of the most basic ways to avoid a few of these interactions is to use row level locks, which are a way of preventing concurrent modification of a row. Let’s take a look at how row level locks change the behavior of each of the examples in part 1:

Non-Repeatable Reads

To recap, a non-repeatable read is when a single transaction reads the same row twice and finds it to be different the second time. Here’s the example from part 1 demonstrating non-repeatable reads:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = 2;
S2> COMMIT;

S1> SELECT * FROM ints;
 n
---
 2

S1> COMMIT;

If S1 were to acquire a row level lock on the row as soon as it first read it, S2 would be unable to update the row as long as S1 holds the lock. To acquire row level locks with a SELECT statement, you just add FOR SHARE to the end of the SELECT. The lock is released once the transaction commits. Rewriting the example with FOR SHARE, it now looks like:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints FOR SHARE;
 n
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = n+1;
-- Blocks because the row being updated is locked by 
-- another transaction.

S1> SELECT * FROM ints;
 n
---
 1

S1> COMMIT;
-- The UPDATE in S2 completes since S1 released its lock
-- on the row.

S2> COMMIT;

What FOR SHARE does is acquire a “read lock” on each of the rows returned by the SELECT statement. Once the read locks are acquired, no other transaction will be able to update or delete the rows until the transaction holding the read locks commits and releases the locks. Multiple transactions can possess read locks on the same row, and it is impossible for any transaction to update a row as long as at least one other transaction has a read lock on the row.

Lost Updates

Since multiple transactions can acquire read locks on a single row at the same time, FOR SHARE doesn’t exactly handle lost updates well. Here’s the example of a lost update from part 1 rewritten to use FOR SHARE:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints;
 n
---
 1

S2> UPDATE ints SET n = 2; -- Computed server side.
-- Blocks because S1 has a read lock on the row.

S1> UPDATE ints SET n = 2; -- Computed server side.
ERROR:  deadlock detected

S1> ROLLBACK;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2

S3> COMMIT;

What happened here is the read lock acquired by S1 is preventing the UPDATE in S2 from running and the read lock acquired by S2 is preventing the UPDATE in S1 from running. Postgres detects that this is a deadlock and aborts the transaction in S1.

To get around this issue, you’ll want to use a variation of FOR SHARE called FOR UPDATE. FOR UPDATE also acquires locks on the rows being selected, but instead of acquiring read locks on each row, it acquires write locks. A write lock is similar to a read lock, but as long as a transaction has a write lock on a row, no other transaction can have a read or write lock on that row. In fact, a write lock is what UPDATE and DELETE grab before they modify a row. Let’s take a look at what happens when we use FOR UPDATE instead of FOR SHARE:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints FOR UPDATE;
 n
---
 1

S2> SELECT * FROM ints FOR UPDATE;
-- Blocks because S1 has a write lock.

S1> UPDATE ints SET n = 2; -- Computed server side.
S1> COMMIT;
-- S1 releases the write lock and S2 unblocks. 
-- The SELECT in S2 returns:
 n
---
 2

S2> UPDATE ints SET n = 3; -- Computed server side.
S2> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 3

S3> COMMIT;

By using FOR UPDATE, S1 signals it’s about to modify the rows it is selecting. S2 also wants to modify the row, so when it sees S1 is about to modify the row, it waits for S1 to complete before reading the row. This guarantees that S2 sees an up to date value. In effect, by using FOR UPDATE, S1 makes the read and write performed on the row happen in a single step.

Phantom Reads

Although row level locks are able to prevent non-repeatable reads, they are unable to prevent phantom reads, even though they seem to be similar issues. If we add FOR SHARE to the example of a phantom read in part 1:

S0> BEGIN;

S0> SELECT count(*) 
S0> FROM (SELECT * FROM ints FOR SHARE) sub;
 count
-------
     0

S1> BEGIN;
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> SELECT count(*) FROM ints;
 count
-------
     1

S0> COMMIT;

The phantom read still happens because FOR SHARE only acquires locks on the rows already in the table. It does not prevent new rows from being inserted into the table.

Skipped Modification

Row level locks don’t help here. The UPDATE and DELETE statements already grab row level write locks on the rows being modified. FOR SHARE and FOR UPDATE cannot be used since no SELECT statement is involved in a skipped modification.

Serialization Anomalies

Row level locks do not help here for the same reason they don’t help phantom reads. There’s nothing to prevent the INSERT statements from inserting the new rows.


Overall, row level locks are an easy way of handling two of the isolation issues that come up. Fortunately, those two are the ones that come up most often. The downside of row level locks is that, while easy, they add a fair amount of complexity and overhead. You now have to worry about one transaction holding a lock for too long, which you mostly wouldn’t have to worry about otherwise. Next, we’ll take a look at a different approach to solving some of these interactions: the other transaction isolation levels.

Postgres Transactions Aren’t Fully Isolated

This post is part 1 of a three part series examining transaction isolation in Postgres. This post covers the issues that come up under the default Postgres settings. Parts 2 and 3 will cover the different ways to eliminate those problems.

Isolation is one of the main tenets of ACID. A database’s transactions are considered to be fully isolated if two transactions performed at the same time do not interact with each other in any way. A formal definition of full isolation is that a database’s transactions are fully isolated if, when multiple transactions are performed at the same time, there exists an ordering of the transactions such that executing them concurrently behaves the same as executing them one by one in that order.

By default, Postgres does not provide full transaction isolation (the same holds true for most SQL databases under their default settings). That means there are a variety of cases where two transactions can interact with each other. These cases are usually unexpected by Postgres users and commonly result in bugs in those users’ code. Let’s first take a look at the exact isolation semantics Postgres provides. Then we can take a look at the different interactions and use the isolation semantics to explain why each interaction occurs.

For the isolation semantics, you will only need to understand how Postgres handles SELECT statements and how Postgres handles statements that modify data such as UPDATE and DELETE.

SELECT

Whenever a SELECT statement is run in Postgres, Postgres takes a snapshot of the current state of the database and runs the SELECT against that snapshot. The effects of all transactions that committed before the snapshot are visible to the SELECT statement. On the other hand, the effects of all transactions that either failed or committed after the snapshot was taken are completely invisible to the SELECT statement. It’s important to note that a new snapshot is created for each SELECT statement and not for each transaction. That means two SELECT statements in the same transaction will each have their own snapshot of the database.

UPDATE and DELETE

UPDATE and DELETE statements first create a snapshot of the database and use that snapshot to find all rows that match the WHERE clause associated with the statement. Once a row is found, the statement will try to lock it. If any of the rows found are already locked by a different UPDATE or DELETE statement in another transaction, the UPDATE or DELETE statement will wait for the other transaction to commit or abort.

If the other transaction aborts, the UPDATE or DELETE statement will act as if the other transaction had never existed. Otherwise, if the other transaction commits, the UPDATE or DELETE will reevaluate the WHERE clause on the new version of the row and update/delete the new version if it satisfies the predicate. If the new version does not satisfy the WHERE clause, it will be left unmodified. This behavior is used because it is a reasonable way to handle multiple concurrent modifications to the same row.


Now that we’ve got an idea of the semantics behind SELECT, UPDATE, and DELETE, let’s take a look at a few of the different cases where two transactions can interact with each other. In the examples below, I use S0, S1, … to refer to different sessions running concurrently. All of the examples make use of a table ints of bigints that starts out empty at the beginning of each example.

Non-Repeatable Reads

A non-repeatable read is when a transaction reads the exact same row twice, but finds the row to be different the second time it is read. As an example:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> UPDATE ints SET n = 2;
S2> COMMIT;

S1> SELECT * FROM ints;
 n
---
 2

S1> COMMIT;

In this case, even though both of the SELECT *‘s were run in the same transaction, they returned different results. This is due to each SELECT statement in the transaction creating a new snapshot, instead of there being a single snapshot per transaction. The first SELECT creates a snapshot where n=1 and returns that row. Then a second transaction runs an UPDATE command that changes n to 2. Then the second SELECT creates a new snapshot where n=2 and returns that.

Lost Updates

It is a somewhat common pattern (which should be avoided if possible) for a server to retrieve a value from the database, compute a new value based on the old value, and write that new value back into the database. If two of these operations happen simultaneously, it can result in what’s known as a “lost update”. As a simple example, let’s say two different servers try to increment the value in the ints table at the same time:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

S1> BEGIN;
S1> SELECT * FROM ints;
 n
---
 1

S2> BEGIN;
S2> SELECT * FROM ints;
 n
---
 1

S2> UPDATE ints SET n = 2; -- Computed server side.
S2> COMMIT;

S1> UPDATE ints SET n = 2; -- Computed server side.
S1> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2

S3> COMMIT;

With the example above, session S1 had read the value 1. Then S2 increments the value in ints and sets it to 2. When S1 goes to set the value in ints, it doesn’t know that the value had been incremented by S2 after S1 had read it, so once S1 sets the value in ints to 2, the increment done by S2 was completely lost.

Phantom Reads

A phantom read occurs when the set of rows found by a statement change due to another transaction inserting new rows. A simple example is the following:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0

S1> BEGIN;
S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> SELECT count(*) FROM ints;
 count
-------
     1

S0> COMMIT;

In Postgres, phantom reads have the same cause as non-repeatable reads, namely snapshots being created per query instead of per transaction. Historically phantom reads are considered different from non-repeatable reads because in some databases the causes between the two are completely different. Phantom reads are also a slightly more difficult problem to solve, including in Postgres.

Skipped Modification

(I couldn’t find the actual name for this kind of interaction, so I’m just calling it a skipped modification. If you know the actual name, I would appreciate you leaving a comment at the bottom of this page.)

A skipped modification can happen due to how Postgres handles concurrent modifications of a single row. It’s easiest if we start with an example and examine what exactly is going on. This example was taken from the Postgres docs:

S0> BEGIN;
S0> INSERT INTO ints SELECT 1;
S0> INSERT INTO ints SELECT 2;
S0> SELECT * FROM ints;
 n
---
 1
 2

S0> COMMIT;

S1> BEGIN;
S1> UPDATE ints SET n = n+1;

S2> BEGIN;
S2> DELETE FROM ints WHERE n = 2;
-- S2 blocks since the DELETE is trying to modify a row 
-- currently being updated.

S1> COMMIT;
-- S2 unblocks.

S2> COMMIT;

S3> BEGIN;
S3> SELECT * FROM ints;
 n
---
 2
 3

S3> COMMIT;

When I first saw this, I couldn’t believe that this was possible! The session S0 inserts two rows, one with the value 1 and one with the value 2. Then session S1 runs an UPDATE that increments the value of all of the rows by 1. Then session S2 runs a DELETE and deletes all rows with the value 2. Intuitively, the DELETE statement should delete at least one row. It should either run before the UPDATE statement and delete the row that started with 2, or it should run after the UPDATE statement and delete the row that started with 1. Once both transactions commit and S3 displays the results, it appears the DELETE statement deleted neither row!

This is due to how concurrent UPDATE/DELETE‘s are handled. The DELETE first looks for rows with n=2. It finds the one row that started with n=2 and attempts to lock it. Since that row is currently being updated by another transaction and is already locked, the DELETE waits for transaction containing the UPDATE to finish. Once that transaction finishes, the row now has n=3. The DELETE statement now rechecks the WHERE clause against the row and sees that it’s now false, so it ignores the row and finishes. Because of the rules for concurrent UPDATE/DELETE‘s, a row is only touched if it satisfies the condition both before and after it’s updated!

Serialization Anomalies

A serialization anomaly occurs when the results of two transactions are correct independently of each other, but it’s impossible to execute the transactions in any order one by one and achieve the same results. As an example:

S0> BEGIN;
S0> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)

S1> BEGIN;
S1> SELECT count(*) FROM ints;
 count
-------
     0
(1 row)

S1> INSERT INTO ints SELECT 1;
S1> COMMIT;

S0> INSERT INTO ints SELECT 1;
S0> COMMIT;

There is no possible way to serialize S0 and S1. There are only two possible orderings: either S0 ran first or S1 ran first. If S0 ran first, the INSERT at the end of S0 came before the count(*) at the beginning of S1, which returned zero rows. If S1 ran first, the INSERT in S1 came before the count(*) in S0, which returned zero rows. Either way, a count(*) said there were zero rows after a row had already been inserted into the table.


When I first saw these examples, I was astounded at how unexpected some of the results were. In my next posts, we’ll take a look at some of the different tools for dealing with each of these cases, including the isolation levels stronger than the Postgres default, as well as explicit row level locking.

Postgres Window Functions

Postgres window functions are an extremely useful feature. They let you calculate a value for each row based on the values of other rows. For our examples, let’s say we have a table ints full of random numbers:

CREATE TABLE ints (n bigint);

INSERT INTO ints 
SELECT floor(random() * 1000000) 
FROM generate_series(1,1000);

The most basic window functions behave similarly to regular aggregations. The difference is that for window functions, the result of the aggregation is included in each row. Since the result is included in each row, you can also select regular columns out of each row. To change a regular aggregation function into a window function, you just add “OVER ()” after the aggregation function. For example, here is a query that includes the sum across all rows in each row, along with the value in each row:

> SELECT n, sum(n) OVER () FROM ints;

   n    |    sum    
--------+-----------
 481023 | 498397678
 772520 | 498397678
 709081 | 498397678
 292436 | 498397678
...

This allows you to easily express certain calculations. For example, you can calculate what fraction of the total sum each row is:

> SELECT n, n / (sum(n) OVER ()) AS fraction
 FROM ints;

   n    |          fraction          
--------+----------------------------
 481023 |     0.00096513892667052113
 772520 |     0.00155000722134182977
 709081 |     0.00142272131532683425
 292436 |     0.00058675233234132363
...

Window functions also have an equivalent to GROUP BY. By adding “PARTITION BY <expression>” inside of the parens after the OVER, you can calculate the window function over different subsets of the data. As an example, here is the above query, modified so that it instead tells you what fraction each value is of the total sum of all numbers with the same last digit:

> SELECT n, n / (sum(n) OVER (PARTITION BY n % 10)) AS fraction 
FROM ints;

   n    |          fraction
--------+----------------------------
 457940 |     0.00951268915957674204
 595290 |     0.01236583117832999688
 111670 |     0.00231969690013961389
 830300 |     0.01724764337947453579
...

From this query, we can tell that 595290 is ~1.2% of the total sum of all numbers in the table that end in 0.

So far we have seen examples that aren’t too difficult to replicate with ordinary SQL. Window functions become much harder to replicate once you introduce ORDER BY into the OVER clause. First of all, introducing ORDER BY completely changes the behavior of all regular aggregates. Now, instead of the total aggregation being included in each row, a rolling aggregation is included. For example, using ORDER BY with sum gives you a rolling sum:

> SELECT n, SUM(n) OVER (ORDER BY n ASC) FROM ints;

   n    |    sum    
--------+-----------
    689 |       689
   1197 |      1886
   1201 |      3087
   3405 |      6492
...

You can also combine ORDER BY and PARTITION BY to get a rolling sum for the sum of numbers with a given last digit:

> SELECT n, SUM(n) OVER (PARTITION BY n % 10 ORDER BY n ASC) FROM ints;

   n    |   sum    
--------+----------
  16900 |    16900
  23230 |    40130
  26540 |    66670
  42310 |   108980
...
   1201 |     1201
  18371 |    19572
  19221 |    38793
  36371 |    75164
...

I haven’t mentioned it yet, but there are actually many functions that are designed specifically to work as window functions. These functions aren’t available as regular aggregation functions, and they return values based on the row’s position relative to the other rows. Two of these functions that deserve special attention are row_number and lag. The function row_number tells you each row’s ranking among the other rows based on the ordering used. For example, for the first row according to the ordering, row_number will return 1. For the second it will return 2, and so on:

> SELECT n, row_number() OVER (ORDER BY n ASC) FROM ints;

   n    | row_number 
--------+------------
    689 |          1
   1197 |          2
   1201 |          3
   3405 |          4
...

The function lag, on the other hand, evaluates an expression on the nth previous row according to the ordering. A simple calculation you can perform with lag is the difference between each row and the previous row:

> SELECT n, n - lag(n, 1) OVER (ORDER BY n ASC) AS diff FROM ints;

   n    | diff 
--------+------
    689 |     
   1197 |  508
   1201 |    4
   3405 | 2204
...

Overall, window functions are a pretty awesome feature. They dramatically increase the querying power of SQL and make it much easier to express many different queries. Ultimately, window functions are another tool in the toolbox Postgres provides, and part of what makes Postgres so great.

SELECT DISTINCT ON … ORDER BY

A common problem I’ve seen that many people are unfamiliar with how to solve is obtaining the row with the maximum value of one column for each value of another column. For example, you may have a pets table with columns name, age, and owner_id, and you may want to obtain the name of the oldest pet for each owner. This seems like a problem that could be solved with a group by, but it actually cannot be solved using the built-in aggregation functions. If you try writing a query with a group by, you’ll find yourself trying to write something like:

SELECT owner_id, <agg for the name of the oldest pet>
FROM pets
GROUP BY owner_id;

Believe it or not, there is no built-in aggregation that can be used to get the desired result. The easiest way to solve this problem is with SELECT DISTINCT ON … ORDER BY. The general format of such a query looks like:

SELECT DISTINCT ON (<group by columns>) <selected columns>
FROM <table>
ORDER BY <group by columns>, <order by columns>;

In this specific case, a query that solves the problem is:

SELECT DISTINCT ON (owner_id) owner_id, name
FROM pets
ORDER BY owner_id, age DESC;

Let’s break down this query and see how it works. The first part you need to understand is the behavior of SELECT DISTINCT ON. SELECT DISTINCT ON returns one row for every unique combination of values of the expressions in the parens. For example, the query:

SELECT DISTINCT ON (owner_id) owner_id, name FROM pets;

Will return the owner_id and name of one pet belonging to each distinct pet owner. One important detail of the behavior of SELECT DISTINCT ON is that it will always return the first row it encounters for each distinct pet owner. That means you can use SELECT DISTINCT ON with ORDER BY to get the row with the highest/lowest value of a given column for each combination of the values being distincted on. The query above uses this fact to get the pet with the oldest age for each owner.

One additional detail with this query is that you need to include the columns being distincted on in the ORDER BY. This is necessary because of how Postgres executes a query with DISTINCT ON internally. Postgres executes such a query by first sorting the rows by the fields being distincted on, which places all rows with the same values of those fields next to each other. Postgres can then iterate through the sorted rows and add a row to the result whenever it sees a row whose values of those columns differ from the row before it. SELECT DISTINCT ON … ORDER BY takes advantage of this behavior by sorting first by the columns being distincted on and then by the columns being ordered by. That way, all rows with the same values of the fields being distincted on are clumped together, and the first row in each clump is the row desired in the output.
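If you’re curious, you can see this strategy in the query plan yourself (assuming the pets table described above exists):

EXPLAIN
SELECT DISTINCT ON (owner_id) owner_id, name
FROM pets
ORDER BY owner_id, age DESC;
-- The plan typically shows a Unique node sitting on top of a Sort,
-- matching the sort-then-deduplicate strategy described above.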