Postgres Window Functions

Postgres window functions are an extremely useful feature. They let you calculate a value for each row based on the values of other rows. For our examples, let’s say we have a table ints full of random numbers:

CREATE TABLE ints (n bigint);

INSERT INTO ints 
SELECT floor(random() * 1000000) 
FROM generate_series(1,1000);

The most basic window functions behave similarly to regular aggregations. The difference is that for window functions, the result of the aggregation is included in every row. Since the result is included in each row, you can also select regular columns alongside it. To change a regular aggregation function into a window function, you just add “OVER ()” after the aggregation function. For example, here is a query that includes, alongside each row’s value, the sum across all rows:

> SELECT n, sum(n) OVER () FROM ints;

   n    |    sum    
--------+-----------
 481023 | 498397678
 772520 | 498397678
 709081 | 498397678
 292436 | 498397678
...

This allows you to easily express certain calculations. For example, you can calculate what fraction of the total sum each row is:

> SELECT n, n / (sum(n) OVER ()) AS fraction
 FROM ints;

   n    |          fraction          
--------+----------------------------
 481023 |     0.00096513892667052113
 772520 |     0.00155000722134182977
 709081 |     0.00142272131532683425
 292436 |     0.00058675233234132363
...

Window functions also have an equivalent to GROUP BY. By adding “PARTITION BY <expression>” inside the parens after OVER, you can calculate the window function over different subsets of the data. As an example, here is the above query modified to tell you what fraction each value is of the total sum of all numbers with the same last digit:

> SELECT n, n / (sum(n) OVER (PARTITION BY n % 10)) AS fraction 
FROM ints;

   n    |          fraction
--------+----------------------------
 457940 |     0.00951268915957674204
 595290 |     0.01236583117832999688
 111670 |     0.00231969690013961389
 830300 |     0.01724764337947453579
...

From this query, we can tell that 595290 is ~1.2% of the total sum of all numbers in the table that end in 0.

So far we have seen examples that aren’t too difficult to replicate with ordinary SQL. Window functions become much harder to replicate with ordinary SQL when you introduce ORDER BY into the OVER clause. First of all, introducing ORDER BY completely changes the behavior of all regular aggregates: instead of including the total aggregation in each row, a rolling aggregation is included. For example, using ORDER BY with sum gives you a rolling sum:

> SELECT n, SUM(n) OVER (ORDER BY n ASC) FROM ints;

   n    |    sum    
--------+-----------
    689 |       689
   1197 |      1886
   1201 |      3087
   3405 |      6492
...

You can also combine ORDER BY and PARTITION BY to get a rolling sum within each group of numbers sharing a last digit:

> SELECT n, SUM(n) OVER (PARTITION BY n % 10 ORDER BY n ASC) FROM ints;

   n    |   sum    
--------+----------
  16900 |    16900
  23230 |    40130
  26540 |    66670
  42310 |   108980
...
   1201 |     1201
  18371 |    19572
  19221 |    38793
  36371 |    75164
...

I haven’t mentioned it yet, but there are also many functions designed specifically to be used as window functions; they aren’t available as regular aggregation functions. These functions return values based on each row’s position relative to the other rows. Two of these functions that deserve special attention are row_number and lag. The function row_number tells you each row’s ranking among the other rows based on the ordering used: for the first row according to the ordering, row_number returns 1, for the second it returns 2, and so on:

> SELECT n, row_number() OVER (ORDER BY n ASC) FROM ints;

   n    | row_number 
--------+------------
    689 |          1
   1197 |          2
   1201 |          3
   3405 |          4
...

The function lag, on the other hand, evaluates an expression against the nth previous row according to the ordering. A simple calculation you can perform with lag is the difference between each row and the previous row:

> SELECT n, n - lag(n, 1) OVER (ORDER BY n ASC) AS diff FROM ints;

   n    | diff 
--------+------
    689 |     
   1197 |  508
   1201 |    4
   3405 | 2204
...

Overall, window functions are a pretty awesome feature. They dramatically increase the querying power of SQL and make many different queries much easier to express. Ultimately, window functions are another tool in the toolbox that makes Postgres so great.

SELECT DISTINCT ON … ORDER BY

A common problem I’ve seen many people not know how to solve is obtaining, for each value of one column, the row with the maximum value of another column. For example, you may have a pets table with columns name, age, and owner_id, and you may want to obtain the name of the oldest pet for each owner. This seems like a problem that could be solved with a GROUP BY, but it actually cannot be using the built-in aggregation functions. If you try writing a query with a GROUP BY, you’ll end up with something like:

SELECT owner_id, <agg for the name of the oldest pet>
FROM pets
GROUP BY owner_id;

Believe it or not, there is no built-in aggregation that can be used to get the desired result. The easiest way to solve this problem is with SELECT DISTINCT ON … ORDER BY. The general format of such a query looks like:

SELECT DISTINCT ON (<group by columns>) <selected columns>
FROM <table>
ORDER BY <group by columns>, <order by columns>;

In this specific case, a query that solves the problem is:

SELECT DISTINCT ON (owner_id) owner_id, name
FROM pets
ORDER BY owner_id, age DESC;

Let’s break down this query and see how it works. The first part you need to understand is the behavior of SELECT DISTINCT ON. SELECT DISTINCT ON returns one row for every unique combination of the values in the parens. For example, the query:

SELECT DISTINCT ON (owner_id) owner_id, name FROM pets;

Will return the owner_id and name of one pet belonging to each distinct pet owner. One important detail of the behavior of SELECT DISTINCT ON is that it will always return the first row it encounters for each distinct pet owner. That means you can use SELECT DISTINCT ON with ORDER BY to get the row with the highest/lowest value of a given column for each combination of the values being distincted on. The query above uses this fact to get the pet with the oldest age for each owner.

One additional detail with the query is that you need to include the columns being distincted on in the ORDER BY. This is necessary because of how Postgres executes a query with DISTINCT ON internally. Postgres executes such a query by first sorting the rows by the fields being distincted on, which places all rows with the same values of those fields together. Postgres can then just iterate through the rows and add a row to the result of the query whenever it sees one whose values of those columns differ from the row before it. SELECT DISTINCT ON … ORDER BY takes advantage of this behavior by sorting first by the columns being distincted on and then by the columns being ordered by. That way, all rows with the same values of the fields being distincted on are clumped together, and the first row in each clump is the row desired in the output.

The Missing Postgres Scan: The Loose Index Scan

Postgres has many different types of scans built in, including sequential scans, index scans, and bitmap scans. One useful scan type Postgres does not have built in, but that other databases do, is the loose index scan. Both MySQL and Oracle support loose index scans. Fortunately for Postgres users, a loose index scan can be emulated with a recursive CTE.

At a high level, the idea behind the loose index scan is that the rows necessary for the query are at predictable locations in the index. All of the index scans Postgres currently supports require reading all of the values in the index between two values. Although Postgres does not do it, for certain queries it is possible to skip over large parts of the index, because it is possible to determine that the values in those parts of the index cannot affect the result of the query. For a specific example, let’s say we want to calculate the number of distinct elements in a table:

CREATE TABLE ints (n BIGINT);

INSERT INTO ints SELECT floor(random() * 10) 
FROM generate_series(1, 10000000);

CREATE INDEX ON ints (n);

EXPLAIN ANALYZE SELECT COUNT(DISTINCT n) FROM ints;

This example creates a table with 10,000,000 rows, all of which contain a random number from 0 to 9. Then it counts the number of distinct elements using a naive COUNT(DISTINCT n). The plan for the query is available here.

As you can see from the query plan, using a naive COUNT(DISTINCT n) takes six seconds. Of this time, 650 milliseconds is used to read the data from disk while the rest of the time is used to perform the aggregation and determine how many distinct elements there are. A sequential scan is only performed because the query requires reading all of the rows from the database and a sequential scan is the fastest way to do it.

Most of the work Postgres is doing for this query is unnecessary. For example, once Postgres sees a row with a zero, it knows all other rows that contain a zero will not affect the result. With this idea in mind, we can write the following query which performs a loose index scan:

> EXPLAIN ANALYZE
> WITH RECURSIVE temp (i) AS (
>     (SELECT n FROM ints ORDER BY n ASC LIMIT 1)
>   UNION ALL
>     (SELECT n FROM temp, 
>                    LATERAL (SELECT n
>                             FROM ints
>                             WHERE n > i
>                             ORDER BY n ASC
>                             LIMIT 1) sub)
> )
> SELECT COUNT(*) FROM temp;

The plan is available here.

This new query takes less than one millisecond! Let’s try dissecting this query. First of all, this query makes use of a recursive CTE. If you aren’t familiar with how to interpret recursive CTEs, I suggest you read my blog post on how to understand them. The first part of the recursive CTE is:

SELECT n FROM ints ORDER BY n ASC LIMIT 1

This simply gets the smallest number from the table. As for the recursive part of the recursive CTE:

SELECT n FROM temp, 
              LATERAL (SELECT n
                       FROM ints
                       WHERE n > i
                       ORDER BY n ASC
                       LIMIT 1) sub

This query takes the current element in the working table and uses a lateral join to fetch the smallest element greater than it, in other words, the next largest element. By first fetching the smallest value in the table and then repeatedly fetching the next largest, the recursive CTE selects exactly the distinct elements of the table! From there, we just count the rows produced by the CTE to get the number of distinct elements in the table.

By using an index to find each element, only a single index scan needs to be performed per distinct element. Since there are only 10 distinct elements in the table, only 10 index scans need to be performed. The loose index scan is so much faster than the regular aggregation because once it sees a given value, it skips over every other row with the same value.

Besides counting the number of distinct elements in a table, there is one other main use case where loose index scans are helpful: calculating the min of one column grouped by another column. If you have a specific value of the column you are grouping by, you can use a compound index on (grouping column, other column) to quickly find the smallest value of the other column for that one value. By iterating over the distinct values of the column you are grouping by, you can answer the query fairly quickly, as sketched below. Without a loose index scan, the query would have to read the entire table, which, as we already know, is pretty slow.
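
Here is a rough sketch of that approach, reusing the pets table from the SELECT DISTINCT ON section and assuming a hypothetical compound index on (owner_id, age). The recursive CTE walks the distinct owner_ids with a loose index scan, and the correlated subquery then grabs the minimum age for each one:

CREATE INDEX ON pets (owner_id, age);  -- the assumed compound index

WITH RECURSIVE owners (owner_id) AS (
    (SELECT owner_id FROM pets ORDER BY owner_id ASC LIMIT 1)
  UNION ALL
    (SELECT sub.owner_id
     FROM owners,
          LATERAL (SELECT owner_id
                   FROM pets
                   WHERE pets.owner_id > owners.owner_id
                   ORDER BY owner_id ASC
                   LIMIT 1) sub)
)
SELECT owners.owner_id,
       (SELECT min(age)
        FROM pets
        WHERE pets.owner_id = owners.owner_id) AS min_age
FROM owners;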

Postgres Lateral Joins

Personally, lateral joins are one of my favorite Postgres features. They are simple, while at the same time they let you write queries that would be nearly impossible to write otherwise. I find it surprising lateral joins were only introduced into Postgres four years ago given how useful they are. I also find it surprising there are no good explanations of lateral joins online given how simple they are.

A lateral join is simply a foreach loop in SQL. There are actually two different ways of writing a lateral join. The simplest form, which is the one you probably want, looks like the following:

SELECT <columns>
FROM <table reference>,
     LATERAL <inner subquery>;

(For context, a table reference is either a table or a subquery.)

The above query will iterate through each row in the table reference and evaluate the inner subquery for each row, exactly like a foreach loop would. The rows returned by the inner subquery are then added to the result of the lateral join. The primary reason for using a lateral join is that the inner subquery can refer to the fields of the row from the table reference and use that to determine what rows to return. To help illustrate what this means, here is Python code that imitates a lateral join:

result = []
# Outer loop: one iteration per row produced by the table reference.
for row1 in table_reference():
    # The inner subquery is re-evaluated for each outer row and can
    # reference that row's fields to decide what rows to return.
    for row2 in inner_subquery(row1):
        result.append((row1, row2))

As you should be able to see, the rows iterated over in the inner for loop are a function of the current row being iterated over in the outer for loop.

As an example of where a lateral join is useful, let’s say we have a table people with an id column and an age column, and a table pets with columns id, owner_id and age. For the query, let’s say for each person over 30, we want to obtain the id of the oldest pet owned by that person. Anytime you hear the words “for each”, you are probably looking for a lateral join. This specific example can be done with the following query:

SELECT people_sub.id, pets_sub.id
FROM (SELECT id FROM people WHERE age > 30) people_sub,
     LATERAL (SELECT pets.id
              FROM pets
              WHERE pets.owner_id = people_sub.id
              ORDER BY pets.age DESC
              LIMIT 1) pets_sub;

For each person over 30, this query will evaluate the inner subquery and add the results of the subquery to the results of the full query. The inner subquery, given a row for a person, will find the id of the oldest pet owned by that person, if they own a pet. That’s all there is to the lateral join. Believe it or not, most of the other ways of writing the above query are much trickier and usually are not as efficient.
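
As a point of comparison, here is a sketch of one of those alternatives, using the window functions covered earlier. It ranks each person's pets by age and keeps only the top-ranked one (like the lateral join version, it drops people who own no pets):

SELECT owner_id, pet_id
FROM (SELECT people.id AS owner_id,
             pets.id AS pet_id,
             row_number() OVER (PARTITION BY people.id
                                ORDER BY pets.age DESC) AS rn
      FROM people
      JOIN pets ON pets.owner_id = people.id
      WHERE people.age > 30) ranked
WHERE rn = 1;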

The other syntax for writing a lateral join is the following:

SELECT <columns>
FROM <table reference>
     JOIN LATERAL <inner subquery>
ON TRUE;

The advantage of this form is that you can use any join qualifier (e.g. LEFT JOIN LATERAL). In the example above, if we also wanted rows for people who don’t have a pet at all, we would need a lateral left join to get those people, as shown below. Most of the time though, the first lateral join form is good enough.
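
For example, here is a sketch of the pets query rewritten with LEFT JOIN LATERAL so that people without pets still show up, with a NULL pet id:

SELECT people_sub.id, pets_sub.id
FROM (SELECT id FROM people WHERE age > 30) people_sub
     LEFT JOIN LATERAL (SELECT pets.id
                        FROM pets
                        WHERE pets.owner_id = people_sub.id
                        ORDER BY pets.age DESC
                        LIMIT 1) pets_sub
ON TRUE;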

To recap, a lateral join is basically a foreach loop in SQL. Anytime you want to perform some calculation for each row in a subquery, you should try to see if a lateral join suits your use case. Once you understand a lateral join, you’ll realize it makes all sorts of queries much easier to write.

Understanding Postgres Recursive CTEs

Unlike ordinary CTEs, which can be thought of as syntactic sugar, recursive CTEs make SQL a more powerful language. As the name suggests, a recursive CTE makes it possible to express recursion in SQL. The ability to express recursion makes recursive CTEs useful for querying hierarchical data. Additionally, recursive CTEs even make SQL Turing complete!

In general, most queries that make use of a recursive CTE look similar to the following template:

WITH RECURSIVE <name> (<columns>) AS (
    <initial query>
  UNION ALL
    <recursive query>
)
<query>

As an example, here is a query that makes use of a recursive CTE to generate the integers from 1 to 100:

WITH RECURSIVE ints (n) AS (
    SELECT 1
  UNION ALL
    SELECT n+1 FROM ints WHERE n+1 <= 100
)
SELECT n FROM ints;

A recursive CTE executes according to the following steps:

  1. Evaluate the initial query and store the results in the “working table”. The working table is simply a temporary table that is used during the evaluation of the recursive CTE.
  2. Exit if the working table is empty.
  3. Add all of the rows in the working table to the result of the recursive CTE.
  4. Evaluate the recursive query on the working table and replace the working table with the results of the recursive query. In the example above, that means the working table is substituted in for ints in the recursive query.
  5. Goto step 2.

Once the recursive CTE has been evaluated, its results can be used in the rest of the query like those of an ordinary CTE.

Let’s walk through the example I gave above. First the recursive CTE evaluates the initial query which is just SELECT 1, and stores the results, a single row with the value 1, in the working table. Then since the working table is not empty, the recursive CTE proceeds to step 3 which adds the row with value 1 to the result of the recursive CTE. Next, the recursive query:

SELECT n+1 FROM ints WHERE n+1 <= 100

Is run on the working table. That means the recursive query is run with ints referring to the current working table. In this case, since the working table is only a single row with the value 1, the recursive query returns a single row with the value 2. This new row becomes the new working table. Steps 2-4 then repeat with the working table containing only a single row with the value 2. This process repeats with the values 3, 4, … until the working table has a single row with the value 100.

This time, the recursive query returns no rows due to the filter condition, and it’s at this point that the recursive CTE terminates. During the execution of the CTE, each of the values 1 to 100 was added to the result set, so the recursive CTE terminates with the values 1-100 as the result. I also wrote a Python snippet equivalent to the above recursive CTE, which you can find here. Looking at it may help with understanding how to think about recursive CTEs.

Let’s now look at a more complicated example, one that actually involves hierarchical data. Let’s say we have an employees table that models the management hierarchy at a business. The table has three columns: an id column, a name column, and a manager_id column. Let’s say we want to find everyone who works under Alice (someone who reports to someone who reports to someone … who reports to Alice), including Alice herself. That can be done with the following query:

WITH RECURSIVE temp (name, id) AS (
    SELECT name, id
    FROM employees 
    WHERE name = 'Alice'
  UNION ALL
    SELECT employees.name, employees.id 
    FROM temp, employees
    WHERE employees.manager_id = temp.id
)
SELECT name FROM temp;

Let’s look at the individual queries that make up the recursive CTE in the example. The initial query finds the row for Alice and adds it to the working table. The recursive query takes all of the people in the working table and uses a join to find everyone who reports to one of them. Going through the steps above for the execution of a recursive CTE, the query first finds the row for Alice and puts it in the working table. Then the recursive query finds all people who report to Alice, then all people who report to someone who reports to Alice, and so on. This process repeats until the query has found everyone who works under Alice.
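
As a small extension of the same idea, here is a sketch that also records how many levels below Alice each person sits, by carrying a depth column through the recursion:

WITH RECURSIVE temp (name, id, depth) AS (
    SELECT name, id, 0
    FROM employees
    WHERE name = 'Alice'
  UNION ALL
    SELECT employees.name, employees.id, temp.depth + 1
    FROM temp, employees
    WHERE employees.manager_id = temp.id
)
SELECT name, depth FROM temp;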

There are all sorts of surprising computations you can perform with recursive CTEs. One of my favorite examples is a SQL query that generates the Mandelbrot set. This is mainly possible because recursive CTEs make SQL Turing complete. With recursive CTEs, it is possible to perform any computation you want with SQL!

Postgres CTEs

CTEs, also known as Common Table Expressions, are a way of refactoring a query. They let you give the result of a subquery a name and reuse that result multiple times. The general syntax of a CTE looks like the following:

WITH <name1> AS (<subquery1>),
     <name2> AS (<subquery2>),
     ...
     <nameN> AS (<subqueryN>)
  <query>

This will first evaluate each of the subqueries and assign the results of each query to the corresponding name. Then within the final query, you can use each name to refer to the results of the corresponding query.

As an example, I frequently do experiments where I try different combinations of indexes. To test which configuration is better, I’ll run many different queries against both configurations. To analyze the results, I create a table runs which contains the result of each run of each query. The table has the columns query_id, which is the id of the query being run, config, which specifies which set of indexes was used, and duration, which is the duration of that run of the query. To get the ten queries which sped up the most going from one configuration to the other, I’ll use a query similar to the following:

WITH a_runs AS (SELECT * FROM runs WHERE config = 'a'),
     b_runs AS (SELECT * FROM runs WHERE config = 'b')
  SELECT a_runs.query_id, a_runs.duration, b_runs.duration
  FROM a_runs, b_runs
  WHERE a_runs.query_id = b_runs.query_id
  ORDER BY a_runs.duration - b_runs.duration DESC
  LIMIT 10;

This query uses a CTE to define a_runs and b_runs as the runs of queries under configuration a, and the runs under configuration b respectively. From then on the query uses a_runs and b_runs to join the results of each configuration together.

There are two main differences between a CTE and just substituting the subqueries into the final query. First of all, a CTE usually makes a query easier to read, which alone is a pretty good reason for using CTEs. Second, using a CTE is currently equivalent to creating new tables with the results of each subquery and then using the corresponding table every time you use the name corresponding to that subquery. This behavior has both pros and cons.

This can be advantageous because it guarantees that each subquery is evaluated only once, letting you reuse the results of each subquery without reevaluating it. The main downside of this behavior is that it prevents Postgres from making use of certain query plans: there are cases where Postgres could use information outside of the subquery to optimize the subquery and avoid doing some of the work, and a CTE prevents Postgres from doing any of this. If you want to read more about CTEs as an optimization fence, 2nd Quadrant has a blog post that goes into more detail.
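
For comparison, here is roughly what the same query looks like with the subqueries substituted inline; in this form, nothing stops Postgres from planning across the subquery boundaries:

SELECT a_runs.query_id, a_runs.duration, b_runs.duration
FROM (SELECT * FROM runs WHERE config = 'a') a_runs,
     (SELECT * FROM runs WHERE config = 'b') b_runs
WHERE a_runs.query_id = b_runs.query_id
ORDER BY a_runs.duration - b_runs.duration DESC
LIMIT 10;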

Postgres Heap Only Tuples

Heap only tuples, otherwise known as HOT, is an optimization Postgres uses to reduce the amount of I/O necessary for updates. Due to MVCC, an update in Postgres consists of finding the row being updated and inserting a new version of the row back into the database. The main downside of doing this is the need to re-add the new version of the row to every index over the table, which requires a lot more I/O. Re-adding the row to each index is necessary because the physical location of the new version of the row is different from the physical location of the old version.

To reduce the amount of I/O necessary for updates, the Postgres team added HOT to Postgres. The idea behind HOT is relatively simple: when updating a row, if it is possible, Postgres will place the new copy of the row right next to the old copy and mark the old copy so that Postgres knows the new copy is right next to it. By doing this, Postgres can avoid updating all of the indexes. During an index scan in which the new copy of the row satisfies the filter, Postgres will encounter the old copy of the row; since the old copy has been specially marked, Postgres knows the new copy is right next to it and can use that instead. This way, Postgres can pretend that all of the indexes point to the new copy of the row without ever needing to modify the indexes.
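
If you are curious how often HOT is kicking in for your tables, the statistics collector tracks it. This query against the standard pg_stat_user_tables view compares HOT updates to total updates for each table:

SELECT relname, n_tup_upd AS total_updates, n_tup_hot_upd AS hot_updates
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;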

Currently, HOT is only possible when none of the columns being updated are indexed. There are several issues that would come up if HOT were attempted when an updated column is indexed, specifically when an index scan is performed on the index over the updated column and the old copy of the row satisfies the predicate of the scan but the new copy does not. In that situation, Postgres would use the index to quickly find all rows that satisfy the predicate and, for a tuple updated with HOT, would get back the new copy of the row, which does not satisfy the predicate. By adding this restriction, Postgres can guarantee that when it uses an index to find rows that satisfy a predicate, if the predicate is satisfied by the old copy of a row, it is also satisfied by the new copy and vice versa.

There is currently an extension to HOT in development, called WARM, that works even when an index on an updated column exists. The idea behind WARM is to place the new row next to the old row and reinsert the row only into the indexes on the columns that changed. This dramatically complicates the situation mentioned above, as Postgres now needs some way to determine whether a row returned by the index actually satisfies the filter or not.

Postgres Vacuum Full

One detail to note about the Postgres vacuum is that it does not return the space it frees back to the OS. Instead, it only marks the space as free so it can later be used for new rows. That means if you have a 100GB table and delete half the rows from it, it will still take up 100GB, even if you run VACUUM on it. To work around this problem, Postgres has the command VACUUM FULL.

VACUUM FULL takes an existing table and creates a completely new copy of it, allowing the OS to reclaim all of the space previously used by dead rows. It does have a big downside, though: the current implementation of VACUUM FULL completely locks the table being vacuumed, so you won’t be able to run queries against the table or insert data into it until the VACUUM FULL completes. In general, if you want to reclaim all of the empty space in a table, instead of using VACUUM FULL you should look at pg_repack, a Postgres extension that provides a command equivalent to VACUUM FULL while still allowing you to run queries against the table being compacted.
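
As a quick illustration, here is a sketch that reuses the ints table from earlier along with the built-in pg_relation_size and pg_size_pretty functions to watch the table size before and after:

DELETE FROM ints WHERE n % 2 = 0;                 -- delete roughly half the rows
VACUUM ints;                                      -- space is marked reusable, but the file stays roughly the same size
SELECT pg_size_pretty(pg_relation_size('ints'));
VACUUM FULL ints;                                 -- rewrites the table into a new, compact file
SELECT pg_size_pretty(pg_relation_size('ints'));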

Transaction Id Wraparound in Postgres

Transaction id wraparound has been the cause of numerous outages. Both Sentry and Joyent have blog posts detailing day-long outages caused by transaction id wraparound. There are also many other companies that have run into transaction id wraparound, but haven’t said so publicly.

Txid wraparound is a problem due to MVCC. MVCC relies on being able to take the txids of two transactions and determine which of the transactions came first. One detail of how Postgres implements MVCC is that txids in Postgres are only 32-bit integers. That means there are only 2^32, or about four billion, possible txids. Four billion may sound like a lot, but high write volume workloads can reach four billion transactions within a matter of weeks.

After four billion transactions, if Postgres didn’t do anything about it, it would have to start using duplicate txids. To enable the reuse of txids, Postgres assigns txids sequentially from a cycle. The cycle wraps back around to zero, going something like: 0, 1, 2, …, 2^32-1, 0, 1, … To determine which of two txids is older, the 2^31 txids before a txid in the cycle are considered older than it, and the 2^31 txids after it are considered newer. Under this scheme, Postgres has to make sure that all txids currently in use are within a range of 2^31 of each other. That way, all of the txids in use form a consistent ordering.

Postgres makes sure only such a range of txids is in use at a time by periodically removing all of the old txids still in use. If the old txids aren’t cleared quickly enough, there will eventually be a new txid that is newer than the newest txids and yet simultaneously appears older than the oldest txids. This is txid wraparound, so named because the cycle of txids has completely wrapped around. If Postgres let txid wraparound happen, some transactions in the past would appear to be in the future, causing massive data corruption. Instead, if Postgres ever gets close to txid wraparound, it ceases normal operations to prevent data corruption. This is usually what people mean when they say txid wraparound caused an outage: Postgres got close enough to txid wraparound that it stopped accepting new transactions.

To clear out old txids, Postgres uses a special vacuum sometimes called the anti-wraparound vacuum. If a table contains a txid older than a certain age, Postgres will run an anti-wraparound vacuum on that table. All the anti-wraparound vacuum does is scan through the entire table and replace every reference to an old txid with a special txid that is automatically considered older than every other txid. This scheme works because Postgres never needs to determine which of two old txids is older; in MVCC, at least one of the two txids in any comparison will be relatively new.

As long as the anti-wraparound vacuum is able to keep up, you do not need to worry about txid wraparound. If, for any reason, the anti-wraparound vacuum is unable to complete its job quickly enough (the settings could be such that it cannot keep up with the rate of new transactions), the database can get close to txid wraparound, at which point Postgres will cease normal operations, likely causing an outage.
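
If you want to keep an eye on how close each database is to wraparound, the built-in age function applied to pg_database.datfrozenxid reports how many transactions old each database's oldest unfrozen txid is:

SELECT datname, age(datfrozenxid) AS oldest_txid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;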

Ultimately, you usually don’t have to worry about txid wraparound. As long as you have reasonable autovacuum settings and aren’t creating tons of transactions all of the time, the anti-wraparound vacuum should be able to prevent txid wraparound just fine. You especially shouldn’t need to worry about it if you are running a recent version of Postgres. It’s somewhat technical, but as of Postgres 9.6 the visibility map has a bit that allows the anti-wraparound vacuum to determine whether or not a page could possibly contain old txids. This makes it possible for the anti-wraparound vacuum to skip all pages that Postgres knows for certain don’t have any old txids, making it much faster and requiring dramatically less I/O, which in turn makes it much less likely to fall behind. Robert Haas, one of the Postgres core contributors, has a blog post detailing the change.

The Postgres Vacuum

As mentioned in my post on MVCC, when a row is deleted from Postgres, the physical row is not actually deleted. All that really happens is the physical row is marked as deleted. That way, queries that started before the row was deleted can pretend the row was not deleted, which provides a simpler model for understanding how queries behave when multiple queries are executing concurrently (if you want to understand the reasoning for this behavior, you should read my posts on transactions and MVCC). One additional step is still necessary: the physical removal of the row from disk. This is done by the Postgres vacuum.

The vacuum is a fairly simple idea. Once a certain number of rows have been updated or deleted in a table, the vacuum will automatically run on the table. The vacuum works by iterating through all of the rows of the table and freeing the space of rows that are no longer needed. The vacuum determines whether a row can be freed by checking that there are no queries that started before the row was deleted. Once every query that started before the row was deleted has finished, it is impossible for any future query to see the row, so Postgres is free to reclaim the space. Once the vacuum has freed the space for a row, Postgres is allowed to insert a new row into that space.
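
If you want to see how many dead rows a table has accumulated and when it was last vacuumed, the standard pg_stat_user_tables view exposes that information:

SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;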

In order to reduce the amount of work the vacuum needs to perform, Postgres uses something called the “visibility map”, which also happens to be used for other purposes such as index-only scans. The visibility map stores an array of bits, one for each page of each table. (For context, every row is stored on a page, with multiple rows being stored on each page.) The bit for a given page is set if Postgres knows, for certain, that every row on the page is visible to every currently running query, and will be to every new query. The vacuum is able to skip any pages that have their bit set to one in the visibility map, since that implies there are no rows on the page that can be freed.

Currently, the only way a bit in the visibility map is flipped on is by the vacuum, when the vacuum determines every row on the page is visible. The bit for a page is flipped off whenever a new row is put on the page or an existing row on the page is deleted, since in either of those cases it is possible there are some transactions for which the row is not visible.
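
If you want to inspect the visibility map yourself, the pg_visibility contrib extension (added in 9.6) exposes it. A minimal sketch, assuming the extension is available to install and reusing the ints table from earlier:

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- One row per page of the table, showing whether the page's
-- all-visible bit is set in the visibility map.
SELECT blkno, all_visible
FROM pg_visibility_map('ints');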