Transaction id wraparound has been the cause of numerous outages. Both Sentry and Joyent have blog posts detailing day long outages caused by transaction id wraparound. There are also many other companies that have ran into transaction id wraparound, but haven’t said so publicly.
Txid wraparound is a problem due to MVCC. MVCC relies on being able to take the txids of two transactions and determine which of the transactions came first. One detail of how Postgres implements MVCC is that txids in Postgres are only 32-bit integers. That means there are only 232, or about four billion, possible txids1. Four billion may sound like a lot, but high write volume workloads can reach four billion transactions within a matter of weeks.
After four billion transactions, if Postgres didn’t do anything about it, Postgres would start to have to use duplicate txids. To enable the reuse of txids, Postgres assigns txids sequentially from a cycle. The cycle wraps back around to zero, going something like: 0, 1, 2, …, 232-1, 0, 1, … To determine which of two txids is older, the 231 txids before a txid in the cycle are considered older, and the 231 after it are considered larger. Under this scheme, Postgres has to make sure that all txids currently in use are within a range of 231 of each other. That way all of the txids in use form a consistent ordering.
Postgres makes sure only a range of txids are in use at a time by periodically removing all of the old txids still in use. If the old txids aren’t cleared quickly enough, there will eventually be a new txid that is both newer than the newest txids and simultaneously appears older than the oldest txids. This is txid wraparound. Txid wraparound gets its name because the cycle of txids has completely wrapped around. If Postgres let txid wraparound happen, some transactions in the past would appear to be in the future, causing massive data corruption. Instead, if Postgres ever gets close to txid wraparound, it will cease normal operations to prevent data corruption. This is usually what people mean when they say txid wraparound caused an outage. That Postgres got close enough to txid wraparound that it stopped accepting new transactions.
To clear out old txids, Postgres uses a special vacuum sometimes called the anti-wraparound vacuum. If a table has a txid in it that is older than a certain age, Postgres will run an anti-wraparound vacuum on that table. All the anti-wraparound vacuum does is scan through the entire table and replace every reference to an old txid with a special txid2. The special txid is automatically considered older than every other txid. This scheme works because Postgres never needs to determine which of two old txids is older. In MVCC, at least one of the two txids in any comparison will be relatively new.
As long as the anti-wraparound vacuum is able to keep up, you do not need to worry about txid wraparound. If for any reason, the anti-wraparound vacuum is unable to complete its job quickly enough (the settings could be such that it cannot keep up with the rate of new transactions) the database can get close to txid wraparound. This will cause Postgres to cease normal operations, likely causing an outage.
Ultimately, you usually don’t have to worry about txid wraparound. As long as you have reasonable autovacuum settings, and aren’t creating tons of transactions all of the time, the anti-wraparound vacuum should be able to prevent txid wraparound just fine. You especially shouldn’t need to worry about it if you are running a recent version of Postgres. It’s somewhat technical, but in Postgres 9.6, the anti-wraparound vacuum now has a bit in the visibility map which allows it to determine whether or not a page possibly has old txids. This makes it possible for the anti-wraparound vacuum to skip all pages that Postgres knows for certain don’t have any old txids, making it much faster and require dramatically less I/O. This makes the anti-wraparound vacuum much less likely to fall behind. Robert Haas, one of the Postgres core contributors, has a blog post detailing the change.
- There has been some debate about increasing the size of the txids. This has some downsides such as increasing the size of each row and decreasing the effectiveness of various caches.
- Both when the anti-wraparound vacuum should run and whether or not a txid is considered to be old is based on various configuration parameters.