Postgres WAL

Write-ahead logging, or as it’s commonly referred to, WAL, is an optimization Postgres uses to minimize disk I/O while still preventing data loss. Intuitively, whenever a transaction completes, a record of every single change that transaction made must have been written out to persistent storage. The changes that need to be written out include every single row that was inserted, deleted, or updated. If transactions completed before writing all of the changes made to disk and the power ever went out before they were, Postgres would suffer data loss. When Postgres would come back up, it will see the transaction completed1, but Postgres will have no record of some of the changes made by the transaction.

The naive way to make sure all changes are written to disk is to actually perform those changes on disk. Every time a transaction modifies a row it would perform a write to the location of that row on disk. The downside to this approach is it requires performing tons of random writes to disk, one to each row that was changed by the transaction. That many random writes winds up being inefficient. To reduce the amount of necessary I/O, Postgres instead uses write-ahead logging.

For WAL a transaction has to do two things to modify a row. First it has to modify a copy of the row in memory2. Queries running know to first check if there is a copy of the row in memory first, so queries will always use the new copy instead of using the old copy on disk. Second, a record describing the change made (what the change was and where) is added to the “WAL log”. The WAL log is simply a record of all changes that have been made and is stored on disk.

WAL prevents data loss since the WAL log can be used to restore the proper state of the database. When Postgres is coming back up from a crash, the first thing it does is go through all of the records in the WAL log that may not have been propagated to disk and perform those changes if they weren’t performs already. WAL also performs less disk I/O since writes to the WAL log can be batched together and flushed to disk in a single write. By writing multiple WAL records to disk at the same time, the total amount of disk I/O needed is dramatically reduced. A transaction just has to make sure to flush the WAL log to disk before completing3. That way there is a record of all changes made by that transaction on disk.

One last detail of WAL is how the data eventually gets written to disk. There is a background job called the “checkpointer”. The checkpointer is constantly going through all of the changes made in memory and not on disk yet, and performing those changes on disk. Since the checkpointer is writing changes to disk after the fact, it can perform changes to multiple rows near each other on disk all at the same time.

  1. The commit log, described in the previous post, is stored on disk.
  2. It’s actually a copy of the page the row is on, which is stored in the buffer cache.
  3. By writing it’s txid to the commit log, as described in my post on MVCC.

Leave a Reply

Your email address will not be published. Required fields are marked *