Transaction Id Wraparound in Postgres

Transaction id wraparound has been the cause of numerous outages. Both Sentry and Joyent have blog posts detailing day-long outages caused by transaction id wraparound, and many other companies have run into it without saying so publicly.

Txid wraparound is a problem due to MVCC. MVCC relies on being able to take the txids of two transactions and determine which of the transactions came first. One detail of how Postgres implements MVCC is that txids in Postgres are only 32-bit integers. That means there are only 2^32, or about four billion, possible txids. Four billion may sound like a lot, but a high write volume workload can reach four billion transactions within a matter of weeks.

After four billion transactions, if Postgres didn’t do anything about it, Postgres would have to start reusing duplicate txids. To enable the reuse of txids, Postgres assigns txids sequentially from a cycle. The cycle wraps back around to zero, going something like: 0, 1, 2, …, 2^32-1, 0, 1, … To determine which of two txids is older, the 2^31 txids before a txid in the cycle are considered older, and the 2^31 after it are considered newer. Under this scheme, Postgres has to make sure that all txids currently in use are within a range of 2^31 of each other. That way all of the txids in use form a consistent ordering.

Postgres makes sure only a range of txids is in use at a time by periodically removing all of the old txids still in use. If the old txids aren’t cleared quickly enough, there will eventually be a new txid that is both newer than the newest txids and simultaneously appears older than the oldest txids. This is txid wraparound. It gets its name because the cycle of txids has completely wrapped around. If Postgres let txid wraparound happen, some transactions in the past would appear to be in the future, causing massive data corruption. Instead, if Postgres ever gets close to txid wraparound, it ceases normal operations to prevent data corruption. This is usually what people mean when they say txid wraparound caused an outage: Postgres got close enough to txid wraparound that it stopped accepting new transactions.

To clear out old txids, Postgres uses a special vacuum sometimes called the anti-wraparound vacuum. If a table has a txid in it that is older than a certain age, Postgres will run an anti-wraparound vacuum on that table. All the anti-wraparound vacuum does is scan through the entire table and replace every reference to an old txid with a special txid. The special txid is automatically considered older than every other txid. This scheme works because Postgres never needs to determine which of two old txids is older: in MVCC, at least one of the two txids in any comparison will be relatively new.
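If you want to see how far your tables are from needing an anti-wraparound vacuum, you can ask Postgres directly. Here is a minimal sketch using the standard catalogs (the table name in the last statement is made up):

-- Age, in transactions, of the oldest unfrozen txid in each database.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

-- The same information per table. Anything approaching
-- autovacuum_freeze_max_age (200 million by default) will be picked up
-- by an anti-wraparound vacuum soon.
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;

-- You can also freeze a table manually instead of waiting for autovacuum.
VACUUM FREEZE my_table;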

As long as the anti-wraparound vacuum is able to keep up, you do not need to worry about txid wraparound. If, for any reason, the anti-wraparound vacuum is unable to complete its job quickly enough (the settings could be such that it cannot keep up with the rate of new transactions), the database can get close to txid wraparound. This will cause Postgres to cease normal operations, likely causing an outage.
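The thresholds that control when the anti-wraparound vacuum kicks in are ordinary settings you can inspect (and tune, carefully). A quick way to look at them:

-- autovacuum_freeze_max_age is the table age at which an
-- anti-wraparound vacuum is forced; the other two control how
-- aggressively ordinary vacuums freeze old txids.
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_freeze_max_age',
               'vacuum_freeze_min_age',
               'vacuum_freeze_table_age');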

Ultimately, you usually don’t have to worry about txid wraparound. As long as you have reasonable autovacuum settings and aren’t creating tons of transactions all of the time, the anti-wraparound vacuum should be able to prevent txid wraparound just fine. You especially shouldn’t need to worry about it if you are running a recent version of Postgres. It’s somewhat technical, but as of Postgres 9.6 the anti-wraparound vacuum has a bit in the visibility map which tells it whether a page possibly has old txids. This makes it possible for the anti-wraparound vacuum to skip all pages that Postgres knows for certain don’t have any old txids, making it much faster and requiring dramatically less I/O, which in turn makes it much less likely to fall behind. Robert Haas, one of the Postgres core contributors, has a blog post detailing the change.

The Postgres Vacuum

As mentioned in my post on MVCC, when a row is deleted from Postgres, the physical row is not actually deleted. All that really happens is that the physical row is marked as deleted. That way, queries that started before the row was deleted can pretend the row was not deleted, which provides a simpler model for understanding how queries behave when multiple queries are executing concurrently (if you want to understand the reasoning for this behavior, you should read my posts on transactions and MVCC). One additional step is still necessary: the physical removal of the row from disk. This is done by the Postgres vacuum.

The vacuum is a fairly simple idea. Once a certain number of rows have been updated or deleted in a table, the vacuum will automatically run on the table. The vacuum works by iterating through all of the rows of the table and freeing the space for rows that are no longer needed. The vacuum determines whether a row can be freed by checking that there are no queries that started before the row was deleted. Once every query that started before the row was deleted has finished, it is impossible for any future query to see the row, so Postgres is free to reclaim the space. Once the vacuum has freed the space for a row, Postgres is allowed to insert a new row into that space.
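The exact trigger is controlled by a pair of settings, and the bookkeeping behind it is visible in the statistics views. A rough sketch of how to inspect both:

-- Roughly, autovacuum vacuums a table once its number of dead rows
-- exceeds autovacuum_vacuum_threshold
--         + autovacuum_vacuum_scale_factor * number of live rows.
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_vacuum_threshold',
               'autovacuum_vacuum_scale_factor');

-- How many dead rows each table has accumulated and when it was last
-- vacuumed by autovacuum.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;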

In order to reduce the amount of work the vacuum needs to perform, Postgres uses something called the “visibility map”, which also happens to be used for other purposes such as index-only scans. The visibility map stores an array of bits, one for each page in each table. (For context, every row is stored on a page, with multiple rows being stored on each page.) The bit for a given page is set if Postgres knows, for certain, that every row on the page is visible to every currently running query and will be to every new query. The vacuum is able to skip any pages that have their bit set to one in the visibility map, since that implies there are no rows on the page that can be freed.
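If you’re curious, you can look at the visibility map yourself with the pg_visibility extension that ships with Postgres 9.6 and later. A sketch, assuming the people table used elsewhere in these posts:

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- One row per page of the table, with the all-visible bit from the
-- visibility map. Pages marked all_visible can be skipped by the
-- vacuum (and used by index-only scans without touching the heap).
SELECT blkno, all_visible
FROM pg_visibility_map('people')
LIMIT 10;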

Currently, the only way bits in the visibility map are flipped on is by the vacuum, when it determines that every row on a page is visible. The bit for a page is flipped off if a new row is put on the page or an existing row on it is deleted, since in either of these cases it is possible there are some transactions for which the row is not visible.

Postgres WAL

Write-ahead logging, or as it’s commonly referred to, WAL, is an optimization Postgres uses to minimize disk I/O while still preventing data loss. Intuitively, whenever a transaction completes, a record of every single change that transaction made must have been written out to persistent storage. The changes that need to be written out include every single row that was inserted, deleted, or updated. If a transaction were allowed to complete before all of its changes were written to disk, and the power went out before they were, Postgres would suffer data loss: when Postgres came back up, it would see that the transaction completed, but would have no record of some of the changes made by the transaction.

The naive way to make sure all changes are written to disk is to actually perform those changes on disk: every time a transaction modifies a row, it would perform a write to the location of that row on disk. The downside to this approach is that it requires tons of random writes to disk, one for each row changed by the transaction, and that many random writes wind up being inefficient. To reduce the amount of necessary I/O, Postgres instead uses write-ahead logging.

With WAL, a transaction has to do two things to modify a row. First, it has to modify a copy of the row in memory. Running queries know to check for a copy of the row in memory first, so they will always use the new copy instead of the old copy on disk. Second, a record describing the change (what the change was and where) is added to the “WAL log”. The WAL log is simply a record of all changes that have been made, and it is stored on disk.

WAL prevents data loss because the WAL log can be used to restore the proper state of the database. When Postgres comes back up from a crash, the first thing it does is go through all of the records in the WAL log that may not have been propagated to disk and perform those changes if they weren’t performed already. WAL also reduces disk I/O because writes to the WAL log can be batched together and flushed to disk in a single write. By writing multiple WAL records to disk at the same time, the total amount of disk I/O needed is dramatically reduced. A transaction just has to make sure to flush the WAL log to disk before completing. That way there is a record on disk of all changes made by that transaction.
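A couple of knobs related to this are worth knowing about. The sketch below assumes a 9.x-era server; the WAL position function was renamed in Postgres 10:

-- Whether a transaction waits for its WAL records to be flushed to
-- disk before reporting success. Turning this off trades a small
-- window of durability for lower commit latency.
SHOW synchronous_commit;

-- The current end of the WAL. (pg_current_xlog_location() on 9.x;
-- the equivalent on 10+ is pg_current_wal_lsn().)
SELECT pg_current_xlog_location();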

One last detail of WAL is how the data eventually gets written to disk. There is a background job called the “checkpointer”, which constantly goes through the changes that have been made in memory but not yet on disk and performs them on disk. Since the checkpointer is writing changes after the fact, it can batch up changes to rows that are near each other on disk and write them out together.
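The checkpointer’s pacing is configurable too. A quick way to see the relevant settings:

-- How often checkpoints happen, how much WAL is allowed to accumulate
-- between them, and how much the checkpointer spreads its writes out.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout',
               'max_wal_size',
               'checkpoint_completion_target');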

The Postgres Commit Log

In the last post, I discussed MVCC and how it allows Postgres to create snapshots of the current state of the database. One piece I didn’t mention is how Postgres handles failed transactions. The answer is through an object called the commit log (commonly abbreviated as “clog”). Use of the commit log also helps Postgres implement atomicity. In other words, through the commit log, Postgres is able to guarantee a transaction will either complete successfully or have no effect on the database at all.

As mentioned in the last post, MVCC is a technique Postgres uses to determine whether a row should be visible to a query. A prerequisite for MVCC is the capability to determine whether a transaction completed successfully before a given query started. MVCC as described in the last post only checks whether, for a given query, the transaction had started before the query and was not running when the query started. Based on that information alone, the transaction could have either finished successfully or failed. By adding the capability to check whether a finished transaction completed successfully or not, MVCC as previously described works. Postgres supports this capability through the commit log.

The commit log itself is fairly simple: it’s just a log of the txids (transaction ids) of all transactions that successfully completed. The last step a transaction must accomplish before successfully completing is writing its own txid to the commit log. Once a transaction has written its txid to the commit log, that transaction is considered to have successfully completed. If a transaction fails for any reason (manually rolled back, power outage, etc.), its txid is never written to the commit log. If a transaction is no longer running, Postgres can determine whether it succeeded or failed by looking for its txid in the commit log. If the txid of the transaction is in the commit log, the transaction succeeded; otherwise the transaction failed.

By adding a check against the commit log to the checks MVCC already performs, Postgres now handles failed transactions properly. The process for determining the visibility of a row with respect to a given query now looks like the following. First check that the transaction that inserted the row started before the query, was not running when the query started, and wrote its txid to the commit log. If all of these checks succeed, the row is visible unless all of the checks also hold true for the transaction that deleted the row. If the insert transaction started before the query and was not running when the query started, but did not write its txid to the commit log, Postgres knows the transaction failed and the row should not be visible. If the insert transaction succeeded but the deletion transaction failed, the row should still be visible.
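You can see the commit log doing its job from psql: a rolled-back insert simply never becomes visible to anyone. A small demonstration, reusing the hypothetical accounts table from the post on transactions:

BEGIN;
INSERT INTO accounts (id, balance) VALUES (42, 500);

-- Inside the transaction, the new row is visible to us.
SELECT balance FROM accounts WHERE id = 42;

ROLLBACK;

-- The transaction's txid was never written to the commit log, so the
-- physical row it inserted is treated as if it never existed.
SELECT balance FROM accounts WHERE id = 42;  -- returns zero rows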

One issue with this scheme is that it is inefficient: it’s hugely wasteful for every query to check the commit log multiple times for each row whose visibility it wants to determine. To get around this inefficiency, Postgres introduces the concept of “hint bits”. Every row has some additional bits for keeping track of the success or failure of the transaction that inserted the row and the transaction that deleted it. Whenever Postgres looks up the status of one of those transactions in the commit log, it sets a bit in the row corresponding to the result of the lookup. That way, any future query can just look at the hint bit and avoid a lookup to the commit log. There are a total of four bits, one for each pair of success/failure and insertion/deletion.

Personally, I find the commit log to be an extremely simple, but clever idea. It not only allows Postgres to handle failed transactions, but allows it to handle transactions that failed as a result of a crash! As we’ll see, there are many more clever ideas throughout the rest of Postgres.

Postgres MVCC

MVCC, which stands for multiversion concurrency control, is one of the main techniques Postgres uses to implement transactions. MVCC lets Postgres run many queries that touch the same rows simultaneously, while keeping those queries isolated from each other.

Postgres handles transaction isolation by using MVCC to create a concept called “snapshots”. Whenever a query starts, it takes a snapshot of the current state of the database. Only the effects of transactions that committed before the snapshot was created are visible to the query. Rows inserted by transactions that didn’t commit before the query started are invisible to the query, and rows deleted by a transaction that didn’t commit before the query started are still visible.

MVCC works by assigning every transaction a serially incremented transaction id (commonly abbreviated txid), with earlier transactions having smaller txids. Whenever a query starts, it records the next txid to be issued, and the txid of every transaction currently running. Collectively this information allows Postgres to determine if a transaction had completed before the query started. A transaction occurred before the query started if the txid of the transaction is both smaller than the next txid to be issued when the query started and is not in the list of txids of transactions that were running when the query started.
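You can peek at this information directly. A sketch using the txid helper functions available in the 9.x/10 era (newer releases add differently named equivalents):

-- The txid of the current transaction (one is assigned if the
-- transaction doesn't have one yet).
SELECT txid_current();

-- The current snapshot, formatted as xmin:xmax:xip_list -- the oldest
-- txid still running, the next txid to be issued, and the list of
-- txids in progress when the snapshot was taken.
SELECT txid_current_snapshot();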

As part of MVCC, every row records both the txid of the transaction that inserted it and, if the row was deleted, the txid of the transaction that deleted it. With this information and the ability to determine whether a transaction completed before a query started, Postgres has enough information to implement snapshots. A row should be included in the snapshot of a query if that row was inserted by a transaction that completed before the query started and was not deleted by a transaction that completed before the query started.
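These two txids are stored in hidden system columns named xmin and xmax, and you can select them explicitly. A sketch against the hypothetical accounts table from the transactions post:

-- xmin is the txid of the inserting transaction; xmax is the txid of
-- the deleting (or updating) transaction, or 0 if the row has not
-- been deleted.
SELECT xmin, xmax, * FROM accounts LIMIT 5;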

Under MVCC, an insert is handled by inserting a new row with the txid of the current transaction as the insert txid, a delete is handled by setting the delete txid of the row being deleted to the txid of the current transaction, and an update is just a delete followed by an insert of the new row. Since a row isn’t physically deleted when a query deletes it, the physical disk space will need to be freed sometime later. This is typically handled by a background job called the vacuum.

MVCC is ultimately great for two reasons. First, it offers a simple model for reasoning about the behavior of queries: all queries create a snapshot of the database when they start and only see the data in that snapshot. Second, it offers decent performance. With MVCC there is almost no need for locks, since MVCC makes it impossible for read queries to block on write queries or for write queries to block on read queries.

Transactions in Postgres

Transactions are one of the main features that make Postgres a great database. When code is run in a transaction, Postgres provides a set of guarantees about the behavior of the code. Altogether, this set of guarantees is commonly referred to as ACID, with each letter in ACID standing for a different guarantee.

First of all, there are two main ways to run SQL code in a transaction. The first way is to use BEGIN and COMMIT: any code between the statements BEGIN and COMMIT is run in a single transaction. For example, if we have a table representing bank accounts and want to run several statements in a single transaction, we can do the following:

BEGIN;
UPDATE accounts SET balance = balance + 100 WHERE id = 10;
UPDATE accounts SET balance = balance - 100 WHERE id = 11;
COMMIT;

This runs the two update statements in a single transaction. The other main way of running code in a transaction is to just run a single statement. Any single statement sent to Postgres implicitly has a BEGIN and COMMIT attached to it. Running the query:

SELECT COUNT(*) FROM accounts;

Is equivalent to running:

BEGIN;
SELECT COUNT(*) FROM accounts;
COMMIT;

As for the ACID guarantees, let’s take a look at each one.

Atomicity

The first ACID guarantee is called “atomicity”. For a set of statements run in a transaction, atomicity guarantees that either every statement in the transaction succeeds, or the entire transaction fails and none of the statements in the transaction have any effect. Using the first snippet above as an example, after running it, either both updates will have completed or neither will have. This means either 100 dollars will be transferred from account #11 to account #10, or the money transfer will fail completely. It is impossible for the balance of one account to be changed while the balance of the other remains unchanged. For many applications, atomicity is absolutely critical. It is completely unacceptable for a bank to complete only one side of a money transfer.

Atomicity is guaranteed even if the power goes out while Postgres is in the middle of executing a transaction! When Postgres comes back up after a crash, it will act as if all of the transactions running at the time of the crash had never happened at all. You can also manually abort a currently running transaction by running ROLLBACK instead of COMMIT. Running ROLLBACK will undo every command run after BEGIN.
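Reusing the accounts table from above, a manual rollback looks like this:

BEGIN;
UPDATE accounts SET balance = balance + 100 WHERE id = 10;
-- Something went wrong; undo everything since BEGIN.
ROLLBACK;
-- The balance of account #10 is unchanged.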

Consistency

The second ACID guarantee is consistency. Consistency matters when database constraints are used: if a constraint holds true before a transaction starts, the constraint will also hold true when the transaction completes, if it completes. As an example, there may be a unique constraint on a column, which guarantees all values of that column are distinct. Whenever a transaction completes, Postgres guarantees that the unique constraint still holds. This means that if two transactions try to insert the same value into the column at the same time, Postgres has to abort one of them, since if they both completed, the unique constraint would be violated. Consistency can be summarized as: the database will be in a consistent state both before a transaction begins and after it completes.
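As a concrete sketch, here is what that might look like on the accounts table (the owner_email column and constraint name are made up for the example):

-- Guarantee every account has a distinct owner_email.
ALTER TABLE accounts ADD CONSTRAINT accounts_owner_email_key UNIQUE (owner_email);

-- If two concurrent transactions both try to insert the same email,
-- Postgres will let one commit and abort the other, so the constraint
-- still holds afterwards.
INSERT INTO accounts (id, balance, owner_email)
VALUES (12, 0, 'alice@example.com');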

Isolation

The third guarantee is isolation. If transactions are fully isolated from each other, two transactions will have absolutely no effect on each other at all. Under the default settings, Postgres provides only partial transaction isolation. That means that transactions can still interact with each other in some cases, but they will be isolated from each other in most cases.

You can change the level of isolation Postgres uses to make transactions completely isolated from each other, although there are quite a few downsides to doing so. If it isn’t possible for Postgres to keep two transactions completely isolated, it will abort one of them, and that transaction will have to be retried later. Raising the isolation level makes it much more likely that any given transaction will fail. I’m not going to go into all of the details in this post, but if you’re curious, you can read about all of the different transaction isolation levels and all of the tradeoffs that come up in the Postgres manual. Generally speaking, the default isolation level is good enough for most use cases.
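For reference, raising the isolation level is just a matter of asking for it, either per transaction or as a session default:

-- Run a single transaction at the strictest isolation level. If
-- Postgres cannot keep it fully isolated from concurrent transactions,
-- it will abort it with a serialization failure and the transaction
-- must be retried.
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE accounts SET balance = balance - 100 WHERE id = 11;
COMMIT;

-- Or change the default for the current session.
SET default_transaction_isolation = 'serializable';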

Durability

The last ACID guarantee is durability. Once Postgres says it has committed a transaction, it will not lose the data for that transaction. Even if there is a power outage immediately after the transaction committed, Postgres will not lose any data that was just committed. All durability means is Postgres won’t lose your data.


Overall, Postgres provides a set of properties that make it easy to reason about Postgres queries. Without all of the ACID guarantees, it would be much harder to use Postgres, and there would be many use cases where Postgres could not be used at all. How Postgres implements each part of ACID is fairly interesting in and of itself, and has far-reaching consequences throughout all of Postgres. We will see how Postgres implements the different guarantees of ACID as we look at more parts of Postgres.

Speeding up Postgres Index Scans with Index-Only Scans

As mentioned previously, indexes are a data structure that allows for both quickly locating rows that satisfy some condition and obtaining rows sorted by some order. To execute a regular index scan, Postgres scans through the index to find the locations of the rows it is looking for and fetches those rows from disk. Sometimes enough information is stored in the index that Postgres can skip fetching the rows from disk. When this happens, Postgres may instead perform an index-only scan, which is a faster version of the ordinary index scan.

For example, let’s say there is a table people with an age column and an index on the age column. The following query can perform an index-only scan:

SELECT COUNT(*)
FROM people
WHERE 30 <= age AND age <= 40;

The query counts the number of rows with a value of age between 30 and 40. Since the query does not care about the actual contents of each row, only the total count, it does not need to fetch the full rows from the table; it can determine how many rows satisfy the condition straight from the index. By skipping the reads of the rows from disk, an index-only scan dramatically reduces the amount of data that needs to be retrieved from disk, and therefore greatly increases the speed of the query.

There are two main conditions for an index-only scan to be used. First, all of the fields retrieved by the query must be in the index. If the query needs a value that isn’t in the index, Postgres has to fetch the entire contents of the row from disk. It is somewhat common for an index to contain additional fields that are never filtered on, just so an index-only scan can be performed with that index. An index that does this is called a “covering index”.
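As a sketch, if queries also wanted each person’s name (a column assumed here for the example), you could add it to the index so those queries can be answered from the index alone:

-- age is the column being filtered on; name is only in the index so
-- that queries returning it never have to touch the table.
CREATE INDEX people_age_name_idx ON people (age, name);

SELECT name
FROM people
WHERE 30 <= age AND age <= 40;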

The second condition is that the visibility map must be reasonably up to date. It gets somewhat technical, but due to MVCC (how Postgres implements transactions), an index-only scan can only skip retrieving a row from disk if the visibility map says the page the row is on is all-visible. Otherwise, the index-only scan has to fetch the row from disk in order to determine whether or not the row is visible to the currently running transaction. Unfortunately, the only way the visibility map is updated is through vacuums. In short, this means index-only scans will have to fetch recently inserted or updated rows from disk. If most of your queries are over recently inserted data, you’re just out of luck.
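If index-only scans aren’t kicking in, one thing worth trying is vacuuming the table and checking the plan again; a sketch:

-- A vacuum updates the visibility map.
VACUUM people;

-- The plan should now show an "Index Only Scan" node. The
-- "Heap Fetches" line it reports is the number of rows that still had
-- to be checked against the table itself.
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM people
WHERE 30 <= age AND age <= 40;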

How Postgres Plans Queries

Query execution in Postgres proceeds in two steps. First Postgres plans the query. This is where Postgres decides how it’s going to execute the query. The result of planning is a query plan which details specifically what kind of scans Postgres will use to retrieve data from disk, what order Postgres will perform joins in, what algorithms Postgres will use to perform the joins, and a fair amount more. You can obtain the query plan for a query yourself by using the EXPLAIN command. Only once Postgres has determined the query plan will it carry out that plan and execute the query.

At a high level, Postgres plans queries by trying every possible query plan and estimating how long each plan would take to execute, using dynamic programming to avoid redundant computation. To score an individual plan, Postgres uses a fairly sophisticated cost model that takes into account many different factors about the potential algorithms as well as statistics about the raw data.

To estimate the cost of retrieving all rows that satisfy some filter from a table, Postgres will first estimate the number of rows that need to be fetched. In order to do this, Postgres makes use of internal statistics about the raw data. These statistics include a histogram of the values of each column, the most common values of each column, and much more. As an example, given a table of people with an age column, Postgres will keep a histogram of the ages of the people in the table. Then, if asked for all people between 30 and 40, Postgres will use that histogram to estimate how many people fall between the ages of 30 and 40.
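These statistics are exposed through the pg_stats view, so you can see exactly what the planner is working with. A sketch for the age column of the people table:

-- The most common ages (with their frequencies) and a histogram of the
-- remaining values. They are refreshed by ANALYZE, which autovacuum
-- also runs periodically.
SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'people' AND attname = 'age';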

With that approximation of how many rows will need to be fetched from the table, Postgres will look at the different scan methods and estimate the cost of each of them. A sequential scan will cost roughly the number of pages in the table times the cost of reading a page sequentially, while an index scan will cost approximately the number of pages Postgres thinks it will need to read times the cost of reading a page randomly.
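The per-page costs mentioned above are themselves settings; with the defaults, a random page read is treated as four times as expensive as a sequential one:

SELECT name, setting
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost');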

Evaluation of joins works similarly. Postgres estimates how many rows will be fed into the join based on the distribution of the input data and will choose the join algorithm based on that. The scoring works out such that a hash join is usually preferable when the size of the hash table for the hash join will fit in memory, otherwise a merge join is usually preferable.

To score the entire plan, Postgres just sums the cost of each individual join or scan in the plan. Once Postgres has scored all of the possible plans, it executes the plan with the lowest estimated cost. Surprisingly, this method of query planning works extremely well. Most of the time Postgres will just do the right thing.

Addendum: if you want to read more about this form of query planning, I highly recommend this paper on the System R query planner. System R was the first SQL database and the one that popularized this style of query planning, and the paper goes in depth into all of the different formulas used for cost calculation.

explain.depesz.com

The site explain.depesz.com is a wonderful tool for analyzing Postgres query plans. It takes the ordinary output of EXPLAIN or EXPLAIN ANALYZE and pulls out the relevant information from the query plan. Here’s what a query from the last post looks like in explain.depesz.com.

In addition to the numbers contained directly in the query plan, explain.depesz.com includes some additional derived numbers. The most important of these is the exclusive time, which is the time spent in a node excluding the time spent in its child nodes. Ordinary EXPLAIN ANALYZE only displays the inclusive time, which is the time spent in a node including the node’s children. To get the exclusive time from ordinary EXPLAIN ANALYZE, you have to take the inclusive time for a node and subtract the inclusive time of each of that node’s direct children. Why do that by hand when explain.depesz.com can tell you the exclusive time for each node?

Additionally, you can click on the button labeled “exclusive” in the top left of the query to have explain.depesz.com highlight the nodes with the highest exclusive time. Personally, whenever I come across an extremely large Postgres plan, I’ll typically copy and paste it into explain.depesz.com and have it highlight the nodes with the highest exclusive time. From there, I can visually pick out where all of the time is spent. I find that much better than trying to parse the inclusive time of every node and figure out where all of the time went.

Using Explain Analyze in Postgres

In the last post, we discussed EXPLAIN and how it can be used to obtain the query plan for a query. EXPLAIN ANALYZE is a variation of EXPLAIN that provides additional information about the query. In addition to displaying all of the output EXPLAIN does, EXPLAIN ANALYZE will also run the query, track some metrics about the execution of each part of the query, and display that information. Let’s look at what happens when we run the query from the last post to get the pet with the oldest owner with EXPLAIN ANALYZE:

> EXPLAIN ANALYZE
SELECT pets.name
FROM people, pets
WHERE people.id = pets.owner_id
ORDER BY people.age DESC
LIMIT 1;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.86..6.20 rows=1 width=36) (actual time=261.981..261.981 rows=1 loops=1)
   ->  Nested Loop  (cost=0.86..5343066.52 rows=1000000 width=36) (actual time=261.980..261.980 rows=1 loops=1)
         ->  Index Scan Backward using people_age_idx on people  (cost=0.43..514360.60 rows=10000054 width=8) (actual time=0.021..150.869 rows=100238 loops=1)
         ->  Index Scan using pets_owner_id_idx on pets  (cost=0.42..0.46 rows=2 width=36) (actual time=0.001..0.001 rows=0 loops=100238)
               Index Cond: (owner_id = people.id)
 Planning time: 0.296 ms
 Execution time: 262.009 ms
(7 rows)

(It may help to re-read the second to last paragraph of the last post to remind yourself how the query is executing.)

In the output, for each stage of the plan, there are two sets of numbers displayed, each surrounded by parentheses. The first set of numbers (the ones starting with ‘cost=’) contains estimates made by Postgres. As mentioned in the last post, these estimates are useless most of the time and you are better off ignoring them. The second set of numbers (the ones starting with ‘actual time=’) contains the actual run time statistics gathered by Postgres during query execution. Those are the numbers you will want to pay attention to. The format of those numbers is:

(actual time=STARTUP TIME..TOTAL TIME rows=ROW COUNT loops=LOOP COUNT)

Let’s start with an explanation of the loop count since the other numbers depend on it. The loop count, as the name suggests, is the number of times that node (by node I mean part of the plan) was executed. Most nodes will only be executed once, but there are cases where a node can be executed more than once. That mainly happens when a node is a child of a nested loop node. In the plan above, all of the nodes were executed only once except for the index scan on pets_owner_id_idx, which was executed 100,238 different times. This is because Postgres uses that index scan to look up the pets for each person as it iterates through the people from oldest to youngest.

The row count is fairly straightforward. It’s the average number of rows returned by the node each time it is executed. In other words, it’s the total number of rows returned by the node divided by the number of times the node was executed.

As for the total time, the name is somewhat misleading. The total time is the average amount of time spent executing the node per execution. For example, the index scan on pets_owner_id_idx has a total time of 0.001ms, meaning each execution took 0.001ms on average. Since the node was executed 100,238 times, a total of ~100ms was spent executing that node.

Last is the startup time. The startup time is the amount of time it takes for the node to return its first row. Like the other numbers, this is averaged across every execution of the node. Personally, I have never found the startup time useful, and you’ll probably want to ignore it.

Together these numbers make it possible to determine where all of the time is spent during query execution. Looking at the plan above, it looks like most of the time is spent in two different parts of the query, the two index scans. A total of ~150ms is spent in the first index scan (150ms total time * 1 loop) and a total of ~100ms (.001ms total time * 100,238 loops) is spent in the second. This makes up ~250ms of the 262ms total execution time.

If you recall, the query executes by iterating over people in order from oldest to youngest, and checks if that person has a pet. If that person has a pet, the name of that pet is returned as the result of the query. Based on the output of EXPLAIN ANALYZE, it appears Postgres has to iterate through 100,238 people before it finds one person that has a pet. Since most of the time is spent fetching the rows for those people from the people table, and then looking up the pets for those people in the pets table, if we could avoid going through so many people, we may be able to dramatically speed up the query.

One way to do this would be to include the age of the owner of each pet in the pets table. If we then indexed that field, we could quickly find the pet with the oldest owner and avoid going through all of the 100,237 people who don’t have pets. In theory this would allow Postgres to execute the query much more quickly than it currently can. Let’s see what happens when we add an owner_age field with the age of the owner, create an index on it, and rewrite the query to take advantage of it:

> EXPLAIN ANALYZE
SELECT pets.name
FROM pets
ORDER BY owner_age DESC
LIMIT 1;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.49 rows=1 width=36) (actual time=0.021..0.021 rows=1 loops=1)
   ->  Index Scan Backward using pets_owner_age_idx on pets  (cost=0.42..61468.43 rows=1000000 width=36) (actual time=0.020..0.020 rows=1 loops=1)
 Planning time: 0.142 ms
 Execution time: 0.054 ms
(4 rows)

We can see the query now executes nearly 5,000x faster. Now all Postgres needs to do is take the last entry in pets_owner_age_idx. Thanks to EXPLAIN ANALYZE, we were able to both determine where all of the time was being spent for the query and figure out how to optimize it.
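For reference, here’s a sketch of the schema change behind that plan (the column type is assumed, the index name matches the plan above, and keeping owner_age in sync when a person’s age changes is left out):

ALTER TABLE pets ADD COLUMN owner_age integer;

-- Backfill the new column from the people table.
UPDATE pets
SET owner_age = people.age
FROM people
WHERE people.id = pets.owner_id;

CREATE INDEX pets_owner_age_idx ON pets (owner_age);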