How Postgres Plans Queries

Query execution in Postgres proceeds in two steps. First Postgres plans the query. This is where Postgres decides how it’s going to execute the query. The result of planning is a query plan which details specifically what kind of scans Postgres will use to retrieve data from disk, what order Postgres will perform joins in, what algorithms Postgres will use to perform the joins, and a fair amount more. You can obtain the query plan for a query yourself by using the EXPLAIN command. Only once Postgres has determined the query plan will it carry out that plan and execute the query.

At a high level, Postgres plans queries by trying every possible query plan and estimating how long each plan would take to execute. Postgres makes use of dynamic programming in order to avoid redundant computations. To score an individual plan Postgres makes use of a fairly sophisticated cost model that makes use of many different factors about the different potential algorithms and statistics about the raw data.

To estimate the cost of retrieving all rows that satisfy some filter from a table, Postgres will first estimate the number of rows that need to be fetched. In order to do this Postgres makes use of internal statistics about the raw data. These statistics include a histogram of the values of each column, the most common values of each column, and much more. As an example, given a table of people with an age column, Postgres will keep a histogram of the ages of each person. Then if asked for all people between 30 and 40, Postgres will use that histogram to estimate how many people fall in between the ages of 30 and 40.

With that approximation of how many rows will need to be fetched from the table, Postgres will look at the different scan methods and estimate the cost of each of them. A sequential scan will roughly cost the number of pages in the table times the cost of reading a page sequentially while an index scan will cost the approximately the number of pages Postgres thinks it will need to read in times the cost of reading a page randomly1.

Evaluation of joins works similarly. Postgres estimates how many rows will be fed into the join based on the distribution of the input data and will choose the join algorithm based on that. The scoring works out such that a hash join is usually preferable when the size of the hash table for the hash join will fit in memory, otherwise a merge join is usually preferable.

To score the entire plan, Postgres just sums the cost of each individual join or scan in the plan. Once Postgres has scored all of the possible plans, it chooses to execute the plan with the highest score. Surprisingly this method of query planning works extremely well. Most of the time Postgres will just do the right thing.

Addendum: if you want to read more about this form of query planning, I highly recommend this paper on the System R query planner. System R was the first SQL database and was the database that popularized this style of query planning and the paper goes in depth into all of the different formulas used for cost calculation. 

  1. Postgres has the configuration variables random_page_cost and seq_page_cost which represent the costs of reading a page randomly and reading a page sequentially. They are supposed to be configured based on the hardware Postgres is running on.

Leave a Reply

Your email address will not be published. Required fields are marked *