Speeding up Postgres Index Scans with Index-Only Scans

As mentioned previously, indexes are a datastructure that allow for both quickly locating rows that satisfy some condition and for obtaining rows sorted by some order. To execute a regular index scan, Postgres scans through the index to find the locations of the rows it is looking for and fetches those rows from disk. Sometimes enough information is stored in the index that Postgres can skip fetching the rows from disk. When this happens Postgres may instead perform an index-only scan which is a faster version of the ordinary index scan.

For example, let’s say there is a table people with an age column and an index on the age column. The following query can perform an index-only scan:

SELECT COUNT(*)
FROM people
WHERE 30 <= age AND age <= 40;

The query counts the number of rows with a value of age between 30 and 40. Since the query does not care about actual contents of each row, only the total count, it does not need to fetch the full row from the table. It can determine how many rows satisfy the condition straight from the index. By skipping reading rows from disk, an index-only scan dramatically reduces the amount of data needed to be retrieved from disk, and therefore greatly increases the speed of the query.

There are two main conditions for an index-only scan to be used. First, all of the fields retrieved by the query must be in the index. If it needs a value that isn’t in the index, it needs to fetch the entire contents of the row from disk. It is somewhat common for an index to contain additional fields that are never filtered on, just so an index-only scan can be performed with that index. An index that does this is called a “covering index”.

The second condition is the visibility map must be somewhat up to date. It gets somewhat technical, but due to MVCC (how Postgres implements transactions), an index-only scan can only skip retrieving a row from disk if the visibility map says the page the row is on is visible. Otherwise, the index-only scan will have to fetch the row from disk in order to determine whether or not the row is visible the currently running transaction. Unfortunately, the only way is the visibility map is updated is through vacuums. In short, this means index-only scans will have to fetch recently inserted or updated rows from disk. If most of your queries are over recently inserted data, you’re just out of luck.


By the way, if you are working on scaling Postgres, I'm currently working on Perfalytics. Perfalytics is a service designed to help teams scale out Postgres by giving them insight into why their queries are slow and how they can go about making their queries faster. If you're interested in learning more about Perfalytics shoot me an email at michaelmalis2@gmail.com.

Leave a Reply

Your email address will not be published. Required fields are marked *