Avoiding Torn Pages

Postgres reads and writes data on disk in 8 kB pages. Many operating systems and filesystems use a smaller block size, such as 4 kB. If Postgres is running on one of these systems, an interesting edge case can occur. Since Postgres writes to disk in units of 8 kB and the OS writes to disk in units of 4 kB, a power failure at just the wrong moment can leave only 4 kB of an 8 kB write on disk. This edge case is sometimes referred to as a “torn page”. Postgres does have a way of working around torn pages, but it increases the amount of I/O Postgres needs to perform.
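To make the failure mode concrete, here is a minimal Python sketch that simulates it. It is not how Postgres or any kernel actually performs I/O; the `write_page` helper and its `crash_after_blocks` parameter are made up for illustration, and the 4 kB block size is an assumption.

```python
PAGE_SIZE = 8192   # Postgres page size
OS_BLOCK = 4096    # assumed OS/filesystem block size

def write_page(disk, offset, page, crash_after_blocks=None):
    """Write one 8 kB page as a series of 4 kB block writes.

    crash_after_blocks (hypothetical) simulates a power loss after that
    many blocks have reached the disk. Returns True if the whole page
    was written, False if the write was cut short.
    """
    assert len(page) == PAGE_SIZE
    for i in range(PAGE_SIZE // OS_BLOCK):
        if crash_after_blocks is not None and i >= crash_after_blocks:
            return False  # power lost: remaining blocks never hit the disk
        start = i * OS_BLOCK
        disk[offset + start:offset + start + OS_BLOCK] = page[start:start + OS_BLOCK]
    return True

old_page = b"A" * PAGE_SIZE   # page contents before the update
new_page = b"B" * PAGE_SIZE   # page contents Postgres wanted to write

disk = bytearray(old_page)
completed = write_page(disk, 0, new_page, crash_after_blocks=1)
torn = bytes(disk)

# The first half is new data, the second half is old data.
print(completed, torn[:4], torn[-4:])
```

After the simulated crash, the page on disk is neither the old version nor the new one, which is exactly what makes it hard to recover.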

Under normal circumstances, Postgres uses a technique called write-ahead logging (WAL) to prevent data loss. At a high level, WAL works by recording every change a transaction makes in a log on disk before the changes themselves are persisted to the data files. Since appending to a single sequential log is much cheaper than performing random writes to disk, WAL reduces the amount of I/O Postgres needs to perform without the risk of data loss. If Postgres crashes, it can recover all of the changes that weren’t persisted to disk by replaying the WAL.
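The idea can be sketched in a few lines of Python. This is a toy model, not Postgres's implementation: the `TinyDB` class and its method names are invented for illustration, and the in-memory `wal` list stands in for a log that is assumed to survive a crash.

```python
class TinyDB:
    """Toy key-value store that logs changes before applying them."""

    def __init__(self):
        self.wal = []     # append-only log, assumed to be durable
        self.disk = {}    # data as persisted on disk
        self.cache = {}   # in-memory changes not yet written to disk

    def set(self, key, value):
        self.wal.append((key, value))   # 1. record the change in the log
        self.cache[key] = value         # 2. only then change memory

    def flush(self):
        """Persist cached changes to disk (the expensive random writes)."""
        self.disk.update(self.cache)
        self.cache.clear()

    def crash(self):
        """Simulate a crash: everything held only in memory is lost."""
        self.cache.clear()

    def recover(self):
        """Replay the log in order to rebuild any lost changes."""
        for key, value in self.wal:
            self.disk[key] = value


db = TinyDB()
db.set("a", 1)
db.flush()       # "a" reaches the disk
db.set("b", 2)   # "b" is only in the log and in memory
db.crash()       # "b" is gone from memory and was never on disk
db.recover()     # replaying the log brings "b" back
print(db.disk)
```

The sketch replays the whole log for simplicity; a real system only needs to replay from the most recent checkpoint.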

Although keeping track of all of the changes made does allow Postgres to recover from common crashes in which every write was either done completely or not at all, it does not let Postgres recover from a torn page. Due to the specifics of how WAL is implemented [1], in the case of a torn page, the changes alone do not provide Postgres with enough information to determine which changes should be applied to each half of the page.

To recover from torn pages, Postgres does something called “full-page writes”. The first time Postgres changes a page after a checkpoint, it writes a full copy of the page to the WAL. That way, when using the WAL to recover from a crash, Postgres does not need to pay attention to the contents of the page stored on disk. Postgres is able to recover the entire state of the page just from the WAL, sidestepping the problem of torn pages entirely! To avoid constantly writing full copies of every page to the WAL, Postgres checks whether a full copy of the page has already been written since the last checkpoint. If so, it logs only the change itself, since it will still be able to recover the complete page from the WAL.
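Here is a rough Python sketch of that bookkeeping. It is a simplified model under stated assumptions, not Postgres's actual WAL format: the `Wal` class, the `"full"`/`"delta"` record tags, and the `replay` function are all hypothetical names, and the checkpoint is assumed to have flushed every page so that older records can be dropped.

```python
PAGE_SIZE = 8192

class Wal:
    """Toy WAL that logs a full page image on the first change after a checkpoint."""

    def __init__(self):
        self.records = []
        self.imaged_since_checkpoint = set()

    def checkpoint(self):
        # Assumption: all pages were flushed safely, so older records are
        # no longer needed and every page must be imaged again.
        self.records.clear()
        self.imaged_since_checkpoint.clear()

    def log_change(self, page_no, page_after, offset, data):
        if page_no not in self.imaged_since_checkpoint:
            # First change since the checkpoint: log the whole page.
            self.records.append(("full", page_no, bytes(page_after)))
            self.imaged_since_checkpoint.add(page_no)
        else:
            # A full image is already in the log: log only the change.
            self.records.append(("delta", page_no, offset, bytes(data)))

def replay(wal, disk):
    """Rebuild pages from the log, ignoring whatever is on disk."""
    for record in wal.records:
        if record[0] == "full":
            _, page_no, image = record
            disk[page_no] = bytearray(image)   # overwrite, even if torn
        else:
            _, page_no, offset, data = record
            disk[page_no][offset:offset + len(data)] = data

wal = Wal()
page = bytearray(PAGE_SIZE)

page[0:5] = b"hello"
wal.log_change(0, page, 0, b"hello")     # logged as a full page image

page[5:11] = b" world"
wal.log_change(0, page, 5, b" world")    # logged as a small delta

# Pretend the on-disk copy was torn by a crash.
disk = {0: bytearray(b"\xff" * 4096 + b"\x00" * 4096)}
replay(wal, disk)
print(bytes(disk[0][:11]))
```

Because replay starts from a full image, the torn contents on disk never influence the result.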

There is actually a parameter, full_page_writes, that allows you to disable this behavior. If you care about preventing data corruption, there are very few cases in which you should disable it. The only real case is if the OS/filesystem has built-in protection against torn pages. For example, the ZFS filesystem provides its own transactional guarantees and prevents torn pages. This is largely due to the copy-on-write nature of ZFS.

  1. Writes to the WAL may not appear in the same order in which they were performed on the page itself. Postgres normally deals with this issue by storing a counter on each page and updating it with every change that is made to the page. That way it can tell in what order the changes were made to the page. This strategy does not work with torn pages.
