A common problem I’ve seen that many people are unfamiliar with how to solve is obtaining all of the rows that are the maximum of one value grouped by another value. For example, you may have a pets table with columns nameage, and owner_id, and you may want to obtain the name of the oldest pet for each owner. This seems like a problem that could be solved by a group by, but it actually cannot be using the built-in aggregation functions1. If you try writing a query with a group by, you’ll try to write something like:

SELECT owner_id, <agg for the name of the oldest pet>
GROUP BY owner_id;

Believe it or not, there is no built-in aggregation that can be used to get the desired result. The easiest way to solve this problem is with SELECT DISTINCT ON … ORDER BY2. The general format of such a query looks like:

SELECT DISTINCT ON (<group by columns>) <selected columns>
FROM <table>
ORDER BY <group by columns>, <order by columns>;

In this specific case, a query that solves the problem is:

SELECT DISTINCT ON (owner_id) owner_id, name
FROM pets
ORDER BY owner_id, age DESC;

Let’s breakdown this query and see how it works. The first part you need to understand is the behavior of SELECT DISTINCT ONSELECT DISTINCT ON returns one row for every unique combination of values in the parens. For example, the query:

SELECT DISTINCT ON (owner_id) owner_id, name FROM pets;

Will return the owner_id and name of one pet belonging to each distinct pet owner. One important detail of the behavior of SELECT DISTINCT ON is that it will always return the first row it encounters for each distinct pet owner. That means you can use SELECT DISTINCT ON with ORDER BY to get the row with the highest/lowest value of a given column for each combination of the values being distincted on. The query above uses this fact to get the pet with the oldest age for each owner.

One additional detail with the query is you need to include the columns being distincted on in the ORDER BY. This is necessary because of how Postgres executes a query with DISTINCT ON internally. Postgres executes such a query by first sorting by the fields being distincted on. Sorting the rows this way places all rows with the same values of the fields being distincted on together. Postgres can then just iterate through all of the rows and add a new row to the result of the query whenever it sees a row whose values of the columns differ from the row before it. SELECT DISTINCT ON … ORDER BY takes advantage of this behavior by first sorting by the columns being distincted on and then sorting by the columns being ordered by. That way all rows with the same values of the fields being distincted on are clumped together, and the first row in each clump is the row desired in the output.

Postgres Lateral Joins

Personally, lateral joins are one of my favorite Postgres features. They are simple, while at the same time they let you write queries that would be nearly impossible to write otherwise. I find it surprising lateral joins were only introduced into Postgres four years ago given how useful they are. I also find it surprising there are no good explanations of lateral joins online given how simple they are.

A lateral join is simply a foreach loop in SQL. There are actually two different ways of writing a lateral join. The simplest form, which is the one you probably want looks like the following:

SELECT <columns>
FROM <table reference>,
     LATERAL <inner subquery>;

(For context, a table reference is either a table or a subquery.)

The above query will iterate through each row in the table reference and evaluate the inner subquery for each row, exactly like a foreach loop would. The rows returned by the inner subquery are then added to the result of the lateral join. The primary reason for using a lateral join is that the inner subquery can refer to the fields of the row from the table reference and use that to determine what rows to return. To help illustrate what this means, here is Python code that imitates a lateral join:

result = []
for row1 in table_reference():
    for row2 in inner_subquery(row1):
        result += (row1, row2)

As you should be able to see, the rows iterated over in the inner for loop are a function of the current row being iterated over in the outer for loop.

As an example of where a lateral join is useful, let’s say we have a table people with an id column and an age column, and a table pets with columns id, owner_id and age. For the query, let’s say for each person over 30, we want to obtain the id of the oldest pet owned by that person. Anytime you hear the words “for each”, you are probably looking for a lateral join. This specific example can be done with the following query:

FROM (SELECT id FROM people WHERE age > 30) people_sub,
              FROM pets
              WHERE pets.owner_id =
              ORDER BY pets.age DESC
              LIMIT 1) pets_sub;

For each person over 30, this query will evaluate the inner subquery and add the results of the subquery to the results of the full query. The inner subquery, given a row for a person, will find the id of the oldest pet owned by that person, if they own a pet. That’s all there is to the lateral join. Believe it or not, most of the other ways of writing the above query are much trickier and usually are not as efficient3.

The other syntax for writing a lateral join is the following:

SELECT <columns>
FROM <table reference>
     JOIN LATERAL <outer subquery>

The advantage of this form is you can use any join qualifier (e.g. LEFT JOIN LATERAL). If in the example above, we also wanted to obtain people who didn’t have a pet at all, we would have needed to use a lateral left join to get those people. Most of the time though, the first lateral join form is good enough.

To recap, a lateral join is basically a foreach loop in SQL. Anytime you want to perform some calculation for each row in a subquery, you should try to see if a lateral join suits your use case. Once you understand a lateral join, you’ll realize it makes all sorts of queries much easier to write.

Understanding Postgres Recursive CTEs

Unlike ordinary CTEs, which can be thought of as syntactic sugar, recursive CTEs make SQL a more powerful language. As the name suggests a recursive CTE makes it possible to express recursion in SQL. The ability to express recursion makes recursive CTEs useful for querying any hierarchical data. Additionally, recursive CTEs even make SQL Turing complete!

In general, most queries that make use of a recursive CTEs look similar to the following template4:

WITH RECURSIVE <name> (<columns>) AS (
    <initial query>
    <recursive query>

As an example, here is a query that makes use of a recursive CTE to generate the integers from 1 to 100:

    SELECT 1
    SELECT n+1 FROM ints WHERE n+1<= 100

A recursive CTE executes according to the following steps:

  1. Evaluate the initial query and store the results in the “working table”. The working table is simply a temporary table that is used during the evaluation of the recursive CTE.
  2. Exit if the working table is empty.
  3. Add all of the rows in the working table to the result of the recursive CTE5.
  4. Evaluate the recursive query on the working table and replace the working table with the results of the recursive query. In the example above, that means the working table is substituted in for ints in the recursive query.
  5. Goto step 2.

Once the recursive CTE has been evaluated, the results of it can be used in the rest of the query like an ordinary CTE6.

Let’s walk through the example I gave above. First the recursive CTE evaluates the initial query which is just SELECT 1, and stores the results, a single row with the value 1, in the working table. Then since the working table is not empty, the recursive CTE proceeds to step 3 which adds the row with value 1 to the result of the recursive CTE. Next, the recursive query:

SELECT n+1 FROM ints WHERE n+1 <= 100

Is ran on the working table. That means the recursive query is ran with ints referring to the current working table. In this case, since the working table is only a single row with the value 1, the recursive query returns a single row with the value 2. This new row becomes the new working table. Steps 2-4 then repeat with the working table containing only a single row with the value 2. This process repeats with the values 3, 4, … until the working table has a single row with the value 100.

This time, the recursive query returns no rows, due to the filter condition in the recursive query. It’s at this point the recursive CTE terminates. During the execution of the CTE, each of the values 1 to 100 was added to the result set, so the recursive CTE terminates with the values 1-100 as the result. I also wrote a Python snippet equivalent to above recursive CTE which you can find here. Looking at it may help with understanding how to think about recursive CTEs.

Let’s now look at a more complicated example. An example that actually includes hierarchical data. Let’s say we have an employees table that models the management hierarchy at a business. The table has three columns: an id column, a name column, and a manager_id column. Let’s say we want to find everyone who works under Alice (someone who reports to someone who reports to someone … who reports to Alice), including Alice herself. That can be done with the following query:

WITH RECURSIVE temp (name, id) AS (
    SELECT name, id
    FROM employees 
    WHERE name = 'Alice'
    FROM temp, employees
    WHERE employees.manager_id =
SELECT name FROM temp;

Let’s look at the individual queries that make up the recursive CTE in the example. The initial query finds the row for Alice and adds it to the working table. The recursive query takes all people in the working table and uses a join to find all people who report to someone in the working table. Going through the steps above for the execution of a recursive CTE, what this query does is first find the row for Alice and puts it in the working table. Then the recursive query finds all people who report to Alice. Next the recursive query finds all people who report to someone who reports to Alice. This process repeats until the query has found all people who work under Alice.

There are all sorts of surprising computations you can perform with recursive CTEs. One of my favorite examples is a SQL query that generates the Mandelbrot Set. This is mainly possible recursive CTEs make SQL turing complete. With recursive CTEs, it is possible to perform any computation you want with SQL!

Postgres CTEs

CTEs, also know as Common Table Expressions, are a way of refactoring a query. They let you give the result of a subquery a name and reuse the result multiple times. The general syntax of a CTE looks like the following:

WITH <name1> AS (<subquery1>),
     <name2> AS (<subquery2>),
     <nameN> AS (<subqueryN>)

This will first evaluate each of the subqueries and assign the results of each query to the corresponding name. Then within the final query, you can use each name to refer to the results of the corresponding query.

As an example, I frequently do experiments where I try different combinations of indexes. To test which configuration is better, I’ll run many different queries against both configuration. To analyze the results I create a table runs which contains result of each run of each query. The table has the columns query_id which is the id of the query being ran, config which specifies which version of indexes were used, and duration which is the duration of that run of the query. To get the ten queries where which sped up the most going from one configuration to another, I’ll use a query similar to the following:

WITH a_runs AS (SELECT * FROM runs WHERE config = 'a'),
     b_runs AS (SELECT * FROM runs WHERE config = 'b')
  SELECT a_runs.query_id, a_runs.duration, b_runs.duration
  FROM a_runs, b_runs
  WHERE a_runs.query_id = b_runs.query_id
  ORDER BY a_runs.duration - b_runs.duration DESC
  LIMIT 10;

This query uses a CTE to define a_runs and b_runs as the runs of queries under configuration a, and the runs under configuration b respectively. From then on the query uses a_runs and b_runs to join the results of each configuration together.

There are two main differences between a CTE and just substituting the subqueries into the final query. First of all, a CTE can usually make a query easier to read. This alone is a pretty good reason for using CTEs. Second, using a CTE is currently equivalent equivalent to creating new tables with the results of each subquery, and then using the corresponding table every time you use the name corresponding to that query. This behavior has both some pros and some cons.

This can be advantageous, because it guarantees that each subquery is only evaluated once. This allows you to reuse the results of each subquery without reevaluating the subquery. The main downside of this behavior of CTEs is that it prevents Postgres from making use of certain query plans. There are some cases where Postgres can use information outside of the subquery to optimize the subquery and avoid doing some work. CTEs prevent Postgres from doing any of this1. If you want to read more about CTEs as an optimization fence, 2nd Quadrant has a blog post that goes into more detail.

Postgres Heap Only Tuples

Heap only tuples, otherwise known as HOT, is an optimization Postgres uses to reduce the amount of I/O necessary for updates. Due to MVCC, an update in Postgres consists of finding the row being updated, and inserting a new version of the row back into the database. The main downside to doing this is the need to readd the row to every index. This requires a lot more I/O because the row has to be reinserted into every index over the table. Readding the row to each index is necessary because the physical location of the new version of the row is different from the physical location of the old version.

To reduce the amount of I/O necessary for UPDATE, the Postgres team added HOT to Postgres. The idea behind HOT is relatively simple. When updating a row, if it is possible, Postgres will place the new copy of the row right next to the old copy of the row. By doing this and marking the old copy row so that Postgres knows the new copy of the row is right next it, Postgres can avoid updating all of the indexes. During an index scan in which the new copy of the row satisfies the filter, Postgres will encounter the old copy of the row. Since the old copy has been specially marked, Postgres will know the new copy is right next to it and can use that instead. This way, Postgres can pretend that all of the indexes point to the new copy of the row, without ever needing to modify the indexes.

Currently HOT is only possible when none of the columns being updated are indexed. If any of the columns being updated are indexed, HOT is no longer possible. There are several issues that come up if HOT were attempted in this situation. Specially when an index scan is performed on the index on the column which was updated, and the old copy of the row satisfied the predicate of the scan, but not the new one. In this situation, Postgres will try to use the index to quickly find all rows that satisfy the predicate of the query, and in the case of the tuple updated with HOT, will get back the new copy of the row which does not satisfy the predicate. By adding this restriction, Postgres can guarantee that when it tries to find rows that satisfy a predicate through an index, if the predicate is satisfied by the old copy of the row, it is also satisfied by the new copy and vice/versa.

There is currently in development, an extension to HOT, called WARM, that works even when an index on an updated column exists. The idea behind WARM, is to place the new row next to the old row and reinsert the row in the indexes on the columns that changed. This dramatically complicates the situation mentioned above as now Postgres needs someway to determine whether the row actually satisfies the filter or not.

Postgres Vacuum Full

One detail to note about the Postgres vacuum is that it doesn’t not return the space it frees back to the OS. Instead, it only marks the space as free, so it can later be used new rows. That means if you have a 100GB and delete half the rows from it, it will still take up 100GB, even if you run VACUUM on it. To work around this problem, Postgres has a command VACUUM FULL.

VACUUM FULL takes an existing table and creates a completely new copy of the table. This will allow the OS to reclaim all of the free space previously used by rows, but it does have a few downsides. The current implementation of VACUUM FULL completely locks the table being vacuumed. That means you won’t be able to run queries or insert data into the table until the VACUUM FULL completes. In general, if you want to reclaim all of the empty space in a table, instead of using VACUUM FULL, you should look at pg_repack. The pg_repack extension is a Postgres extension that provides a command equivalent to VACUUM FULL that still allows you to run queries on the table being compacted.

Transaction Id Wraparound in Postgres

Transaction id wraparound has been the cause of numerous outages. Both Sentry and Joyent have blog posts detailing day long outages caused by transaction id wraparound. There are also many other companies that have ran into transaction id wraparound, but haven’t said so publicly.

Txid wraparound is a problem due to MVCC. MVCC relies on being able to take the txids of two transactions and determine which of the transactions came first. One detail of how Postgres implements MVCC is that txids in Postgres are only 32-bit integers. That means there are only 232, or about four billion, possible txids1. Four billion may sound like a lot, but high write volume workloads can reach four billion transactions within a matter of weeks.

After four billion transactions, if Postgres didn’t do anything about it, Postgres would start to have to use duplicate txids. To enable the reuse of txids, Postgres assigns txids sequentially from a cycle. The cycle wraps back around to zero, going something like: 0, 1, 2, …, 232-1, 0, 1, … To determine which of two txids is older, the 231 txids before a txid in the cycle are considered older, and the 231 after it are considered larger. Under this scheme, Postgres has to make sure that all txids currently in use are within a range of 231 of each other. That way all of the txids in use form a consistent ordering.

Postgres makes sure only a range of txids are in use at a time by periodically removing all of the old txids still in use. If the old txids aren’t cleared quickly enough, there will eventually be a new txid that is both newer than the newest txids and simultaneously appears older than the oldest txids. This is txid wraparound. Txid wraparound gets its name because the cycle of txids has completely wrapped around. If Postgres let txid wraparound happen, some transactions in the past would appear to be in the future, causing massive data corruption. Instead, if Postgres ever gets close to txid wraparound, it will cease normal operations to prevent data corruption. This is usually what people mean when they say txid wraparound caused an outage. That Postgres got close enough to txid wraparound that it stopped accepting new transactions.

To clear out old txids, Postgres uses a special vacuum sometimes called the anti-wraparound vacuum. If a table has a txid in it that is older than a certain age, Postgres will run an anti-wraparound vacuum on that table. All the anti-wraparound vacuum does is scan through the entire table and replace every reference to an old txid with a special txid2. The special txid is automatically considered older than every other txid. This scheme works because Postgres never needs to determine which of two old txids is older. In MVCC, at least one of the two txids in any comparison will be relatively new.

As long as the anti-wraparound vacuum is able to keep up, you do not need to worry about txid wraparound. If for any reason, the anti-wraparound vacuum is unable to complete its job quickly enough (the settings could be such that it cannot keep up with the rate of new transactions) the database can get close to txid wraparound. This will cause Postgres to cease normal operations, likely causing an outage.

Ultimately, you usually don’t have to worry about txid wraparound. As long as you have reasonable autovacuum settings, and aren’t creating tons of transactions all of the time, the anti-wraparound vacuum should be able to prevent txid wraparound just fine. You especially shouldn’t need to worry about it if you are running a recent version of Postgres. It’s somewhat technical, but in Postgres 9.6, the anti-wraparound vacuum now has a bit in the visibility map which allows it to determine whether or not a page possibly has old txids. This makes it possible for the anti-wraparound vacuum to skip all pages that Postgres knows for certain don’t have any old txids, making it much faster and require dramatically less I/O. This makes the anti-wraparound vacuum much less likely to fall behind. Robert Haas, one of the Postgres core contributors, has a blog post detailing the change.

The Postgres Vacuum

As mentioned in my post on MVCC, when a row is deleted from Postgres, the physical row is not actually deleted. All that really happens is the physical row is marked as deleted. That way, queries that started before the row was deleted can pretend the row was not deleted, which provides a simpler model for understanding how queries behave when multiple queries are executing concurrently (if you want to understand the reasoning for this behavior, you should read my posts on transactions and MVCC). One additional step necessarily is the physical removal of the row from disk. This is done by the Postgres vacuum.

The vacuum is a fairly simple idea. Once a certain number of rows are updated or deleted from a table1, the vacuum will automatically run on the table. The vacuum works by iterating through all of the rows of the table and freeing the space for rows that are no longer needed. The vacuum determines if a row can be freed by checking that there are no queries that started before the row was deleted. Once every query started before the row was deleted has finished, it is impossible for any query in the future to see the row, so Postgres is free to reclaim the space. Once the vacuum has freed the space for a row, Postgres is allowed to insert a new row into that space.

In order to reduce the amount of work needed to be performed by the vacuum, Postgres uses something called the “visibility map”, which also happens to be used for other purposes such as index-only scans. The visibility map stores an array of bits, one for each page in each table2. (For context, every row is stored on a page with multiple rows being stored on each page.) The bit for a given page is set if Postgres knows, for certain, that every row on the page is visible to every currently running query, and will be to every new query. The vacuum is able to skip any pages that have their bit set to one in the visibility map, since that implies there are no rows on the page that can be freed.

Currently, the only way bits in the visibility map are flipped on is through the vacuum if the vacuum determines every row on the page is visible. The bit for a page is flipped off if a new row is put on the page, or an existing row is deleted. In either of these cases, it is possible there are some transactions for which the row is not visible.

Postgres WAL

Write-ahead logging, or as it’s commonly referred to, WAL, is an optimization Postgres uses to minimize disk I/O while still preventing data loss. Intuitively, whenever a transaction completes, a record of every single change that transaction made must have been written out to persistent storage. The changes that need to be written out include every single row that was inserted, deleted, or updated. If transactions completed before writing all of the changes made to disk and the power ever went out before they were, Postgres would suffer data loss. When Postgres would come back up, it will see the transaction completed1, but Postgres will have no record of some of the changes made by the transaction.

The naive way to make sure all changes are written to disk is to actually perform those changes on disk. Every time a transaction modifies a row it would perform a write to the location of that row on disk. The downside to this approach is it requires performing tons of random writes to disk, one to each row that was changed by the transaction. That many random writes winds up being inefficient. To reduce the amount of necessary I/O, Postgres instead uses write-ahead logging.

For WAL a transaction has to do two things to modify a row. First it has to modify a copy of the row in memory2. Queries running know to first check if there is a copy of the row in memory first, so queries will always use the new copy instead of using the old copy on disk. Second, a record describing the change made (what the change was and where) is added to the “WAL log”. The WAL log is simply a record of all changes that have been made and is stored on disk.

WAL prevents data loss since the WAL log can be used to restore the proper state of the database. When Postgres is coming back up from a crash, the first thing it does is go through all of the records in the WAL log that may not have been propagated to disk and perform those changes if they weren’t performs already. WAL also performs less disk I/O since writes to the WAL log can be batched together and flushed to disk in a single write. By writing multiple WAL records to disk at the same time, the total amount of disk I/O needed is dramatically reduced. A transaction just has to make sure to flush the WAL log to disk before completing3. That way there is a record of all changes made by that transaction on disk.

One last detail of WAL is how the data eventually gets written to disk. There is a background job called the “checkpointer”. The checkpointer is constantly going through all of the changes made in memory and not on disk yet, and performing those changes on disk. Since the checkpointer is writing changes to disk after the fact, it can perform changes to multiple rows near each other on disk all at the same time.

The Postgres Commit Log

In the last post, I discussed MVCC and how it allows Postgres to create snapshots of the current state of the database. One piece I didn’t mention is how Postgres handles failed transactions. The answer is through an object called the commit log (commonly abbreviated as “clog”). Use of the commit log also helps Postgres implement atomicity. In other words, through the commit log, Postgres is able to guarantee a transaction will either complete successfully or have no effect on the database at all.

As mentioned in the last post, MVCC is a technique Postgres uses to determine whether a row should be visible to a query. A requisite for MVCC is the capability to determine whether a transaction completed successfully before a given query started. MVCC as described in the last post only checks whether, for a given query, the transaction had started before the query and was not running when the query started. Based on that information alone, the query either could have either finished successfully or failed. By adding the capability to check whether a finished transaction had completed successfully or not, MVCC as previously described works. Postgres supports this capability through the commit log.

The commit log itself is fairly simple, it’s just a log of the txids (transaction ids) of all transactions that successfully completed. The last step a transaction must accomplish before successfully completing is writing its own txid to the commit log. Once a transaction has written it’s txid to the commit log, that transaction is considered to have successfully completed. If a transaction fails for any reason (manually rolled back, power outage, etc), it’s txid is never written to the commit log. If a transaction is no longer running, Postgres can determine whether the query succeeded or failed by looking for it’s txid in the commit log. If the txid of the transaction is in the commit log, the transaction succeeded, otherwise the transaction failed.

In addition to performing the checks for MVCC, by adding a check to the commit log, Postgres now handles failed transactions properly. The process for determining the visibility of a row with respect to a given query now looks like the following. First check that the transaction that inserted the row started before the query, was not running when the query started, and wrote its txid to the commit log. If all of these checks succeed, the row is visible unless all of the checks also hold true for the transaction that deleted the row. If the insert transaction started before the query, was not running when the query started, but did not insert its txid in the commit log, Postgres knows the transaction failed and the row should not be visible. If the insert transaction succeeded, but the deletion transaction failed, the row should still be visible.

One issue with this scheme is that it is inefficient. It’s hugely inefficient for every query to check the commit log multiple times for each row it wants to determine the visibility of. To get around this inefficiency, Postgres introduces the concepts of “hint bits”. Every row has some additional bits for keeping track of the success or failure of the transactions that inserted the row and deleted the row. Whenever Postgres looks up the status of a transaction that inserted or deleted a row in the commit log, it will then set a bit in the row corresponding to the result of the lookup. That way, any future query can just look at the hint bit and avoid a lookup to the commit log. There are a total of four bits, one for each pair of success/failure and insertion/deletion.

Personally, I find the commit log to be an extremely simple, but clever idea. It not only allows Postgres to handle failed transactions, but allows it to handle transactions that failed as a result of a crash! As we’ll see, there are many more clever ideas throughout the rest of Postgres.