Understanding Postgres Recursive CTEs

Unlike ordinary CTEs, which can be thought of as syntactic sugar, recursive CTEs make SQL a more powerful language. As the name suggests a recursive CTE makes it possible to express recursion in SQL. The ability to express recursion makes recursive CTEs useful for querying any hierarchical data. Additionally, recursive CTEs even make SQL Turing complete!

In general, most queries that make use of a recursive CTE look similar to the following template:

WITH RECURSIVE <name> (<columns>) AS (
    <initial query>
    UNION ALL
    <recursive query>
)
<query using name>

As an example, here is a query that makes use of a recursive CTE to generate the integers from 1 to 100:

WITH RECURSIVE ints (n) AS (
    SELECT 1
    UNION ALL
    SELECT n+1 FROM ints WHERE n+1 <= 100
)
SELECT n FROM ints;

A recursive CTE executes according to the following steps:

  1. Evaluate the initial query and store the results in the “working table”. The working table is simply a temporary table that is used during the evaluation of the recursive CTE.
  2. Exit if the working table is empty.
  3. Add all of the rows in the working table to the result of the recursive CTE.
  4. Evaluate the recursive query on the working table and replace the working table with the results of the recursive query. In the example above, that means the working table is substituted in for ints in the recursive query.
  5. Goto step 2.
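
The steps above can be sketched in a few lines of Python. This is purely illustrative (the `recursive_cte` helper is my own invention, not how Postgres is implemented), but it captures the evaluation loop:

```python
# A minimal sketch of the recursive CTE evaluation loop described above.
def recursive_cte(initial_query, recursive_query):
    # Step 1: evaluate the initial query into the working table.
    working = initial_query()
    result = []
    # Step 2: exit once the working table is empty.
    while working:
        # Step 3: add all rows in the working table to the result.
        result.extend(working)
        # Step 4: evaluate the recursive query on the working table;
        # its output becomes the new working table.
        working = recursive_query(working)
    # Step 5 is just the loop jumping back to step 2.
    return result

# The integers-from-1-to-100 example: the initial query plays the role
# of SELECT 1, the recursive query of SELECT n+1 FROM ints WHERE n+1 <= 100.
ints = recursive_cte(
    lambda: [1],
    lambda rows: [n + 1 for n in rows if n + 1 <= 100],
)
```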

Once the recursive CTE has been evaluated, its results can be used in the rest of the query like an ordinary CTE.

Let’s walk through the example I gave above. First, the recursive CTE evaluates the initial query, which is just SELECT 1, and stores the results, a single row with the value 1, in the working table. Then, since the working table is not empty, the recursive CTE proceeds to step 3, which adds the row with value 1 to the result of the recursive CTE. Next, the recursive query:

SELECT n+1 FROM ints WHERE n+1 <= 100

is run on the working table. That means the recursive query is run with ints referring to the current working table. In this case, since the working table is only a single row with the value 1, the recursive query returns a single row with the value 2. This new row becomes the new working table. Steps 2-4 then repeat with the working table containing only a single row with the value 2. This process repeats with the values 3, 4, … until the working table has a single row with the value 100.

This time, the recursive query returns no rows, due to the filter condition in the recursive query. It’s at this point the recursive CTE terminates. During the execution of the CTE, each of the values 1 to 100 was added to the result set, so the recursive CTE terminates with the values 1-100 as the result. I also wrote a Python snippet equivalent to the above recursive CTE which you can find here. Looking at it may help with understanding how to think about recursive CTEs.

Let’s now look at a more complicated example, one that actually includes hierarchical data. Let’s say we have an employees table that models the management hierarchy at a business. The table has three columns: an id column, a name column, and a manager_id column. Let’s say we want to find everyone who works under Alice (someone who reports to someone who reports to someone … who reports to Alice), including Alice herself. That can be done with the following query:

WITH RECURSIVE temp (name, id) AS (
    SELECT name, id
    FROM employees
    WHERE name = 'Alice'
    UNION ALL
    SELECT employees.name, employees.id
    FROM temp, employees
    WHERE employees.manager_id = temp.id
)
SELECT name FROM temp;

Let’s look at the individual queries that make up the recursive CTE in the example. The initial query finds the row for Alice and adds it to the working table. The recursive query takes all people in the working table and uses a join to find all people who report to someone in the working table. Going through the steps above for the execution of a recursive CTE, what this query does is first find the row for Alice and puts it in the working table. Then the recursive query finds all people who report to Alice. Next the recursive query finds all people who report to someone who reports to Alice. This process repeats until the query has found all people who work under Alice.

There are all sorts of surprising computations you can perform with recursive CTEs. One of my favorite examples is a SQL query that generates the Mandelbrot set. This is possible mainly because recursive CTEs make SQL Turing complete. With recursive CTEs, it is possible to perform any computation you want in SQL!

Postgres CTEs

CTEs, also known as Common Table Expressions, are a way of refactoring a query. They let you give the result of a subquery a name and reuse the result multiple times. The general syntax of a CTE looks like the following:

WITH <name1> AS (<subquery1>),
     <name2> AS (<subquery2>),
     ...
     <nameN> AS (<subqueryN>)
<final query>

This will first evaluate each of the subqueries and assign the results of each query to the corresponding name. Then within the final query, you can use each name to refer to the results of the corresponding query.

As an example, I frequently do experiments where I try different combinations of indexes. To test which configuration is better, I’ll run many different queries against both configurations. To analyze the results, I create a table runs which contains the result of each run of each query. The table has the columns query_id, which is the id of the query being run, config, which specifies which version of the indexes was used, and duration, which is the duration of that run of the query. To get the ten queries which sped up the most going from one configuration to the other, I’ll use a query similar to the following:

WITH a_runs AS (SELECT * FROM runs WHERE config = 'a'),
     b_runs AS (SELECT * FROM runs WHERE config = 'b')
  SELECT a_runs.query_id, a_runs.duration, b_runs.duration
  FROM a_runs, b_runs
  WHERE a_runs.query_id = b_runs.query_id
  ORDER BY a_runs.duration - b_runs.duration DESC
  LIMIT 10;

This query uses a CTE to define a_runs and b_runs as the runs of queries under configuration a, and the runs under configuration b respectively. From then on the query uses a_runs and b_runs to join the results of each configuration together.

There are two main differences between a CTE and just substituting the subqueries into the final query. First of all, a CTE can usually make a query easier to read. This alone is a pretty good reason for using CTEs. Second, using a CTE is currently equivalent to creating new tables with the results of each subquery, and then using the corresponding table every time you use the name corresponding to that query. This behavior has both some pros and some cons.

This can be advantageous, because it guarantees that each subquery is only evaluated once. This allows you to reuse the results of each subquery without reevaluating the subquery. The main downside of this behavior of CTEs is that it prevents Postgres from making use of certain query plans. There are some cases where Postgres can use information outside of the subquery to optimize the subquery and avoid doing some work. CTEs prevent Postgres from doing any of this. If you want to read more about CTEs as an optimization fence, 2nd Quadrant has a blog post that goes into more detail.

Postgres Heap Only Tuples

Heap only tuples, otherwise known as HOT, is an optimization Postgres uses to reduce the amount of I/O necessary for updates. Due to MVCC, an update in Postgres consists of finding the row being updated and inserting a new version of the row back into the database. The main downside of this approach is that the new row has to be re-added to every index over the table, which requires a lot more I/O. Re-adding the row to each index is necessary because the physical location of the new version of the row is different from the physical location of the old version.

To reduce the amount of I/O necessary for UPDATE, the Postgres team added HOT to Postgres. The idea behind HOT is relatively simple. When updating a row, if it is possible, Postgres will place the new copy of the row right next to the old copy of the row. By doing this and marking the old copy of the row so that Postgres knows the new copy is right next to it, Postgres can avoid updating all of the indexes. During an index scan in which the new copy of the row satisfies the filter, Postgres will encounter the old copy of the row. Since the old copy has been specially marked, Postgres will know the new copy is right next to it and can use that instead. This way, Postgres can pretend that all of the indexes point to the new copy of the row, without ever needing to modify the indexes.
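
The redirect idea can be pictured with a toy model (plain Python objects standing in for tuples on a page; none of this reflects Postgres's actual on-disk format):

```python
# Toy model of a HOT chain: the index keeps pointing at the old tuple,
# and the old tuple is marked with a pointer to the new copy that was
# placed right next to it on the same page.
class HeapTuple:
    def __init__(self, data):
        self.data = data
        self.hot_next = None  # set when a HOT update writes a newer copy

def hot_update(old_tuple, new_data):
    # Place the new copy "next to" the old one and mark the old copy,
    # instead of re-adding the row to every index.
    new_tuple = HeapTuple(new_data)
    old_tuple.hot_next = new_tuple
    return new_tuple

def fetch_from_index(tuple_pointed_to_by_index):
    # An index scan lands on the old copy; following the chain yields
    # the newest version without the index ever being modified.
    t = tuple_pointed_to_by_index
    while t.hot_next is not None:
        t = t.hot_next
    return t.data

row = HeapTuple({"id": 1, "balance": 100})
hot_update(row, {"id": 1, "balance": 200})
latest = fetch_from_index(row)  # the scan transparently sees the new copy
```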

Currently, HOT is only possible when none of the columns being updated are indexed. If any of the columns being updated are indexed, HOT is no longer possible. There are several issues that come up if HOT were attempted in this situation. Specifically, consider an index scan on the index over the updated column, where the old copy of the row satisfies the predicate of the scan but the new copy does not. In this situation, Postgres will try to use the index to quickly find all rows that satisfy the predicate of the query, and in the case of the tuple updated with HOT, will get back the new copy of the row, which does not satisfy the predicate. By adding this restriction, Postgres can guarantee that when it tries to find rows that satisfy a predicate through an index, if the predicate is satisfied by the old copy of the row, it is also satisfied by the new copy and vice versa.

There is an extension to HOT currently in development, called WARM, that works even when an index on an updated column exists. The idea behind WARM is to place the new row next to the old row and reinsert the row only into the indexes on the columns that changed. This dramatically complicates the situation mentioned above, as Postgres now needs some way to determine whether the row actually satisfies the filter or not.

Postgres Vacuum Full

One detail to note about the Postgres vacuum is that it does not return the space it frees back to the OS. Instead, it only marks the space as free, so the space can later be used for new rows. That means if you have a 100GB table and delete half the rows from it, the table will still take up 100GB, even if you run VACUUM on it. To work around this problem, Postgres has a command VACUUM FULL.

VACUUM FULL takes an existing table and creates a completely new copy of the table. This allows the OS to reclaim all of the free space previously used by dead rows, but it does have a few downsides. The current implementation of VACUUM FULL completely locks the table being vacuumed. That means you won’t be able to run queries against, or insert data into, the table until the VACUUM FULL completes. In general, if you want to reclaim all of the empty space in a table, instead of using VACUUM FULL, you should look at pg_repack. pg_repack is a Postgres extension that provides a command equivalent to VACUUM FULL that still allows you to run queries on the table being compacted.

Transaction Id Wraparound in Postgres

Transaction id wraparound has been the cause of numerous outages. Both Sentry and Joyent have blog posts detailing day long outages caused by transaction id wraparound. There are also many other companies that have run into transaction id wraparound, but haven’t said so publicly.

Txid wraparound is a problem due to MVCC. MVCC relies on being able to take the txids of two transactions and determine which of the transactions came first. One detail of how Postgres implements MVCC is that txids in Postgres are only 32-bit integers. That means there are only 2^32, or about four billion, possible txids. Four billion may sound like a lot, but high write volume workloads can reach four billion transactions within a matter of weeks.

After four billion transactions, if Postgres didn’t do anything about it, Postgres would have to start using duplicate txids. To enable the reuse of txids, Postgres assigns txids sequentially from a cycle. The cycle wraps back around to zero, going something like: 0, 1, 2, …, 2^32-1, 0, 1, … To determine which of two txids is older, the 2^31 txids before a txid in the cycle are considered older, and the 2^31 after it are considered newer. Under this scheme, Postgres has to make sure that all txids currently in use are within a range of 2^31 of each other. That way all of the txids in use form a consistent ordering.
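
The circular comparison can be sketched with modular arithmetic. This is a simplification (the real comparison also has to special-case a few reserved txids), but it shows the idea:

```python
# Sketch of comparing 32-bit txids on a circle: a is considered older
# than b when a sits within the 2^31 ids "behind" b on the cycle.
def txid_precedes(a, b):
    diff = (b - a) % 2**32  # distance going forward from a to b on the cycle
    return 0 < diff < 2**31

# Near wraparound, a txid that is numerically huge still counts as
# older than a small txid that was issued just after the wrap.
old_txid = 2**32 - 5
new_txid = 3
```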

Postgres makes sure only a range of txids is in use at a time by periodically removing all of the old txids still in use. If the old txids aren’t cleared quickly enough, there will eventually be a new txid that is both newer than the newest txids and simultaneously appears older than the oldest txids. This is txid wraparound. Txid wraparound gets its name because the cycle of txids has completely wrapped around. If Postgres let txid wraparound happen, some transactions in the past would appear to be in the future, causing massive data corruption. Instead, if Postgres ever gets close to txid wraparound, it will cease normal operations to prevent data corruption. This is usually what people mean when they say txid wraparound caused an outage: Postgres got close enough to txid wraparound that it stopped accepting new transactions.

To clear out old txids, Postgres uses a special vacuum sometimes called the anti-wraparound vacuum. If a table has a txid in it that is older than a certain age, Postgres will run an anti-wraparound vacuum on that table. All the anti-wraparound vacuum does is scan through the entire table and replace every reference to an old txid with a special txid. The special txid is automatically considered older than every other txid. This scheme works because Postgres never needs to determine which of two old txids is older. In MVCC, at least one of the two txids in any comparison will be relatively new.

As long as the anti-wraparound vacuum is able to keep up, you do not need to worry about txid wraparound. If, for any reason, the anti-wraparound vacuum is unable to complete its job quickly enough (the settings could be such that it cannot keep up with the rate of new transactions), the database can get close to txid wraparound. This will cause Postgres to cease normal operations, likely causing an outage.

Ultimately, you usually don’t have to worry about txid wraparound. As long as you have reasonable autovacuum settings, and aren’t creating tons of transactions all of the time, the anti-wraparound vacuum should be able to prevent txid wraparound just fine. You especially shouldn’t need to worry about it if you are running a recent version of Postgres. It’s somewhat technical, but in Postgres 9.6, the anti-wraparound vacuum now has a bit in the visibility map which allows it to determine whether or not a page possibly has old txids. This makes it possible for the anti-wraparound vacuum to skip all pages that Postgres knows for certain don’t have any old txids, making it much faster and requiring dramatically less I/O. This makes the anti-wraparound vacuum much less likely to fall behind. Robert Haas, one of the Postgres core contributors, has a blog post detailing the change.

The Postgres Vacuum

As mentioned in my post on MVCC, when a row is deleted from Postgres, the physical row is not actually deleted. All that really happens is the physical row is marked as deleted. That way, queries that started before the row was deleted can pretend the row was not deleted, which provides a simpler model for understanding how queries behave when multiple queries are executing concurrently (if you want to understand the reasoning for this behavior, you should read my posts on transactions and MVCC). One additional step that is still necessary is the physical removal of the row from disk. This is done by the Postgres vacuum.

The vacuum is a fairly simple idea. Once a certain number of rows are updated or deleted from a table, the vacuum will automatically run on the table. The vacuum works by iterating through all of the rows of the table and freeing the space for rows that are no longer needed. The vacuum determines if a row can be freed by checking that there are no queries that started before the row was deleted. Once every query started before the row was deleted has finished, it is impossible for any query in the future to see the row, so Postgres is free to reclaim the space. Once the vacuum has freed the space for a row, Postgres is allowed to insert a new row into that space.

In order to reduce the amount of work needed to be performed by the vacuum, Postgres uses something called the “visibility map”, which also happens to be used for other purposes such as index-only scans. The visibility map stores an array of bits, one for each page in each table. (For context, every row is stored on a page, with multiple rows being stored on each page.) The bit for a given page is set if Postgres knows, for certain, that every row on the page is visible to every currently running query, and will be to every new query. The vacuum is able to skip any pages that have their bit set to one in the visibility map, since that implies there are no rows on the page that can be freed.
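
As a rough sketch, with one Python bool per page standing in for the bitmap (a toy model, not Postgres's actual representation):

```python
# Toy visibility map: one bit per page; a set bit means every row on
# the page is visible to all queries, so the vacuum can skip the page.
class VisibilityMap:
    def __init__(self, n_pages):
        self.all_visible = [False] * n_pages

def pages_vacuum_must_scan(vm):
    # The vacuum only needs to look at pages whose bit is not set.
    return [page for page, bit in enumerate(vm.all_visible) if not bit]

vm = VisibilityMap(4)
vm.all_visible[1] = True  # e.g. a previous vacuum found page 1 all-visible
vm.all_visible[3] = True
to_scan = pages_vacuum_must_scan(vm)  # only pages 0 and 2 need scanning
```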

Currently, the only way bits in the visibility map are flipped on is through the vacuum if the vacuum determines every row on the page is visible. The bit for a page is flipped off if a new row is put on the page, or an existing row is deleted. In either of these cases, it is possible there are some transactions for which the row is not visible.

Postgres WAL

Write-ahead logging, or as it’s commonly referred to, WAL, is an optimization Postgres uses to minimize disk I/O while still preventing data loss. Intuitively, whenever a transaction completes, a record of every single change that transaction made must have been written out to persistent storage. The changes that need to be written out include every single row that was inserted, deleted, or updated. If transactions completed before all of their changes were written to disk and the power went out, Postgres would suffer data loss. When Postgres came back up, it would see that the transaction completed, but would have no record of some of the changes made by the transaction.

The naive way to make sure all changes are written to disk is to actually perform those changes on disk. Every time a transaction modifies a row it would perform a write to the location of that row on disk. The downside to this approach is it requires performing tons of random writes to disk, one to each row that was changed by the transaction. That many random writes winds up being inefficient. To reduce the amount of necessary I/O, Postgres instead uses write-ahead logging.

With WAL, a transaction has to do two things to modify a row. First, it has to modify a copy of the row in memory. Running queries know to check for a copy of the row in memory first, so queries will always use the new copy instead of the old copy on disk. Second, a record describing the change made (what the change was and where) is added to the “WAL log”. The WAL log is simply a record of all changes that have been made and is stored on disk.

WAL prevents data loss since the WAL log can be used to restore the proper state of the database. When Postgres is coming back up from a crash, the first thing it does is go through all of the records in the WAL log that may not have been propagated to disk and perform those changes if they weren’t performed already. WAL also results in less disk I/O since writes to the WAL log can be batched together and flushed to disk in a single write. By writing multiple WAL records to disk at the same time, the total amount of disk I/O needed is dramatically reduced. A transaction just has to make sure to flush the WAL log to disk before completing. That way there is a record of all changes made by that transaction on disk.
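
The whole arrangement can be sketched as a toy key-value store (my own simplification; nothing here reflects Postgres's actual buffer manager or WAL record format):

```python
# Toy write-ahead log: updates touch an in-memory cache and append a
# record to the log; only the log needs to be flushed at commit time.
class ToyDB:
    def __init__(self):
        self.disk = {}    # rows as they exist "on disk"
        self.cache = {}   # newer copies of rows, in memory only
        self.wal = []     # append-only log of (key, value) changes

    def update(self, key, value):
        self.cache[key] = value        # modify the in-memory copy
        self.wal.append((key, value))  # record what changed and where

    def read(self, key):
        # readers check memory first, then fall back to disk
        return self.cache.get(key, self.disk.get(key))

    def crash_and_recover(self):
        self.cache = {}                # a crash wipes memory
        for key, value in self.wal:    # replay the log onto disk
            self.disk[key] = value

db = ToyDB()
db.update("row-1", "new value")
db.crash_and_recover()
# db.read("row-1") still returns "new value", restored from the log
```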

One last detail of WAL is how the data eventually gets written to disk. There is a background job called the “checkpointer”. The checkpointer is constantly going through all of the changes made in memory and not on disk yet, and performing those changes on disk. Since the checkpointer is writing changes to disk after the fact, it can perform changes to multiple rows near each other on disk all at the same time.

The Postgres Commit Log

In the last post, I discussed MVCC and how it allows Postgres to create snapshots of the current state of the database. One piece I didn’t mention is how Postgres handles failed transactions. The answer is through an object called the commit log (commonly abbreviated as “clog”). Use of the commit log also helps Postgres implement atomicity. In other words, through the commit log, Postgres is able to guarantee a transaction will either complete successfully or have no effect on the database at all.

As mentioned in the last post, MVCC is a technique Postgres uses to determine whether a row should be visible to a query. A requisite for MVCC is the capability to determine whether a transaction completed successfully before a given query started. MVCC as described in the last post only checks whether, for a given query, the transaction had started before the query and was not running when the query started. Based on that information alone, the transaction could have either finished successfully or failed. By adding the capability to check whether a finished transaction completed successfully or not, MVCC as previously described works. Postgres supports this capability through the commit log.

The commit log itself is fairly simple: it’s just a log of the txids (transaction ids) of all transactions that successfully completed. The last step a transaction must accomplish before successfully completing is writing its own txid to the commit log. Once a transaction has written its txid to the commit log, that transaction is considered to have successfully completed. If a transaction fails for any reason (manually rolled back, power outage, etc.), its txid is never written to the commit log. If a transaction is no longer running, Postgres can determine whether the transaction succeeded or failed by looking for its txid in the commit log. If the txid of the transaction is in the commit log, the transaction succeeded; otherwise the transaction failed.

In addition to performing the checks for MVCC, by adding a check to the commit log, Postgres now handles failed transactions properly. The process for determining the visibility of a row with respect to a given query now looks like the following. First check that the transaction that inserted the row started before the query, was not running when the query started, and wrote its txid to the commit log. If all of these checks succeed, the row is visible unless all of the checks also hold true for the transaction that deleted the row. If the insert transaction started before the query, was not running when the query started, but did not insert its txid in the commit log, Postgres knows the transaction failed and the row should not be visible. If the insert transaction succeeded, but the deletion transaction failed, the row should still be visible.

One issue with this scheme is that it is inefficient. It’s hugely inefficient for every query to check the commit log multiple times for each row it wants to determine the visibility of. To get around this inefficiency, Postgres introduces the concept of “hint bits”. Every row has some additional bits for keeping track of the success or failure of the transactions that inserted the row and deleted the row. Whenever Postgres looks up the status of a transaction that inserted or deleted a row in the commit log, it will then set a bit in the row corresponding to the result of the lookup. That way, any future query can just look at the hint bit and avoid a lookup to the commit log. There are a total of four bits, one for each pair of success/failure and insertion/deletion.
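
Here is one way to picture the hint bit optimization. This is a toy model (it only covers the inserting transaction, and the real hint bits live in the tuple header), but it shows how the first lookup pays for all future ones:

```python
# Toy commit log plus a hint bit: the first visibility check pays for
# a commit-log lookup, then caches the answer on the row itself.
commit_log = set()  # txids that were written to the commit log
lookups = []        # record of commit-log lookups, just for illustration

class Row:
    def __init__(self, insert_txid):
        self.insert_txid = insert_txid
        self.insert_committed_hint = None  # unset until the first lookup

def inserter_committed(row):
    if row.insert_committed_hint is None:
        lookups.append(row.insert_txid)  # an actual commit-log lookup
        row.insert_committed_hint = row.insert_txid in commit_log
    return row.insert_committed_hint

commit_log.add(7)                 # transaction 7 committed
row = Row(insert_txid=7)
first = inserter_committed(row)   # consults the commit log
second = inserter_committed(row)  # served from the hint bit, no lookup
```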

Personally, I find the commit log to be an extremely simple, but clever idea. It not only allows Postgres to handle failed transactions, but allows it to handle transactions that failed as a result of a crash! As we’ll see, there are many more clever ideas throughout the rest of Postgres.

Postgres MVCC

MVCC, which stands for multiversion concurrency control, is one of the main techniques Postgres uses to implement transactions. MVCC lets Postgres run many queries that touch the same rows simultaneously, while keeping those queries isolated from each other.

Postgres handles transaction isolation by using MVCC to create a concept called “snapshots”. Whenever a query starts, it takes a snapshot of the current state of the database. Only the effects of transactions that committed before the snapshot was created are visible to the query. Rows inserted by transactions that didn’t commit before the query started are invisible to the query, and rows deleted by a transaction that didn’t commit before the query started are still visible.

MVCC works by assigning every transaction a serially incremented transaction id (commonly abbreviated txid), with earlier transactions having smaller txids. Whenever a query starts, it records the next txid to be issued, and the txid of every transaction currently running. Collectively this information allows Postgres to determine if a transaction had completed before the query started. A transaction occurred before the query started if the txid of the transaction is both smaller than the next txid to be issued when the query started and is not in the list of txids of transactions that were running when the query started.

As part of MVCC, every row records both the txid of the transaction that inserted it, and if the row was deleted, the txid of the transaction that deleted it. With this information and the ability to determine whether a transaction completed before a query started, Postgres has enough information to implement snapshots. A row should be included in the snapshot of the query if that row was inserted by a transaction that completed before the query started and not deleted by a transaction that completed before the query.
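
That visibility rule fits in a few lines of Python. This is a simplified model (it ignores details like the commit log, which handles failed transactions), but it mirrors the checks described above:

```python
# Simplified snapshot: a transaction counts as completed before the
# query if its txid was already issued when the snapshot was taken and
# it was not in the set of transactions still running at that point.
class Snapshot:
    def __init__(self, next_txid, running_txids):
        self.next_txid = next_txid         # next txid to be issued
        self.running = set(running_txids)  # txids running at snapshot time

    def completed_before(self, txid):
        if txid is None:  # e.g. a row that was never deleted
            return False
        return txid < self.next_txid and txid not in self.running

    def row_visible(self, insert_txid, delete_txid):
        # inserted by a transaction that completed before the snapshot,
        # and not deleted by one that completed before the snapshot
        return (self.completed_before(insert_txid)
                and not self.completed_before(delete_txid))

snap = Snapshot(next_txid=100, running_txids={95})
```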

Under MVCC, an insert is handled by inserting a new row with the txid of the current transaction as the insert id, deletes are handled by setting the delete txid of the row being deleted to the txid of the current transaction, and an update is just a delete followed by an insert of the new row. Since a row isn’t physically deleted when a query deletes it, the physical disk space will need to be freed sometime later. This is typically handled by a background job called the vacuum.

MVCC is ultimately great for two reasons. First, it offers a simple model for reasoning about the behavior of queries. All queries create a snapshot of the database when they start and only see the data in that snapshot. Second it offers decent performance. With MVCC, there is almost no need for locks. MVCC makes it impossible for read queries to block on write queries or for write queries to block on read queries.

Transactions in Postgres

Transactions are one of the main features that make Postgres a great database. When code is run in a transaction, Postgres provides a set of guarantees about the behavior of the code. Altogether, this set of guarantees is commonly referred to as ACID, with each letter in ACID standing for a different guarantee.

First of all, there are two main ways to run SQL code in a transaction. The first way is to use BEGIN and COMMIT. Any code between the statements BEGIN and COMMIT is run in a single transaction. For example, if we have a table representing bank accounts and want to run several statements in a single transaction, we can do the following:

BEGIN;
UPDATE accounts SET balance = balance + 100 WHERE id = 10;
UPDATE accounts SET balance = balance - 100 WHERE id = 11;
COMMIT;

This runs the two update statements in a single transaction. The other main way of running code in a transaction is to just run a single statement. Any single statement sent to Postgres implicitly has a BEGIN and COMMIT attached to it. Running the query:

SELECT COUNT(*) FROM accounts;

is equivalent to running:

BEGIN;
SELECT COUNT(*) FROM accounts;
COMMIT;

As for the ACID guarantees, let’s take a look at each one.


The first ACID guarantee is called “atomicity”. For a set of statements run in a transaction, atomicity guarantees that either every statement in the transaction succeeds, or the entire transaction fails and none of the statements in the transaction have any effect. Using the first snippet above as an example, after running it, either both updates will have completed or neither will have. This means either 100 dollars will be transferred from account #11 to account #10, or the money transfer will fail completely. It is impossible for the balance of one account to be changed and for the balance of the other to remain unchanged. For many applications, atomicity is absolutely critical. It is completely unacceptable for a bank to complete only one side of a money transfer.

Atomicity is guaranteed even if the power goes out while Postgres is in the middle of executing a transaction! When Postgres comes back up after a crash, it will act as if all of the transactions running at the time of the crash had never happened at all! You can also manually abort a currently running transaction by running the command ROLLBACK before running COMMIT. Running ROLLBACK will undo every command ran after BEGIN.
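
We can watch ROLLBACK undo both sides of the transfer with SQLite from Python (SQLite, like Postgres, supports BEGIN/COMMIT/ROLLBACK; the accounts data here is made up for the demonstration):

```python
import sqlite3

# isolation_level=None disables the sqlite3 module's implicit
# transaction handling, so we can issue BEGIN/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (10, 500), (11, 500)")

conn.execute("BEGIN")
conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 10")
conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 11")
conn.execute("ROLLBACK")  # undoes everything since BEGIN, atomically

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
# both updates were undone together: each balance is back to 500
```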


The second ACID guarantee is consistency. Consistency matters when database constraints are used. If a constraint holds true before the transaction starts, the constraint will also hold true when the transaction completes, if it completes. As an example, there may be a unique constraint on a column which guarantees all values of that column are distinct. Whenever a transaction completes, Postgres guarantees that the unique constraint still holds true. This means if two transactions try to insert the same value into the column at the same time, Postgres has to abort one of them, since if they both complete, the unique constraint will be violated. Consistency can be summarized as the database will be in a consistent state both before a transaction begins and after it completes.


The third guarantee is isolation. If transactions are fully isolated from each other, two transactions will have absolutely no effect on each other at all. Under the default settings, Postgres provides only partial transaction isolation. That means that transactions can still interact with each other in some cases, but they will be isolated from each other in most cases.

You can change the level of isolation Postgres uses to make transactions completely isolated from each other, although there are quite a few downsides to doing so. If it isn’t possible for Postgres to keep two transactions completely isolated, it will abort one of them, and that transaction will have to be retried later. Raising the isolation level makes it much more likely that any given transaction will fail. I’m not going to go into all of the details in this post, but if you’re curious, you can read about all of the different transaction isolation levels and all of the tradeoffs that come up in the Postgres manual. Generally speaking, the default isolation level is good enough for most use cases.


The last ACID guarantee is durability. Once Postgres says it has committed a transaction, it will not lose the data for that transaction. Even if there is a power outage immediately after the transaction committed, Postgres will not lose any data that was just committed. All durability means is Postgres won’t lose your data.

Overall, Postgres provides a set of properties that make it easy to reason about Postgres queries. Without all of the ACID guarantees, it would be much harder to use Postgres, and there would be many use cases where Postgres could not be used at all. How Postgres implements each part of ACID is fairly interesting in and of itself, and has far reaching consequences throughout all of Postgres. We will see how Postgres implements the different guarantees of ACID as we look at more parts of Postgres.