This is the first part of a three part series examining the Postgres join algorithms.
For those who need a reminder, a join between two tables results in every pair of rows where some condition is true. Given a query of the form:
SELECT * FROM table1, table2 WHERE <filter>;
Postgres will return every pair of rows from table1 and table2 where the join condition is true. Similar to how a sequential scan is the most basic way to retrieve rows from the table, nested loops are the most basic way for Postgres to perform a join. If Postgres were to execute the query with a nested loop, it could do so by iterating all of the entries in table1, iterating through all of the entries in table2, and then emitting a row whenever the pair of rows from table1 and table2 satisfy the filter condition. That’s literally all there is to a nested loop!
Now the truth is this way of executing a join can be extremely slow. A naive nested loop join is O(M*N) where M and N are the sizes of the inputs to the join. Although nested loops are relatively inefficient, they do have one major advantage. A nested loop is the only join algorithm Postgres has that can be used to process any join! No matter what the join condition is and no matter what indexes exist, Postgres always has the option of executing a nested loop (analogous to how Postgres always has the option of executing a sequential scan).
There is a slight variation of a nested loop, sometimes called an “index join”, that is much more efficient, but isn’t as general. Let’s say we have two tables, people (with fields id, and age) and pets (with fields id and owner_id), where the owner_id field in the pets table is the id of the owner of the pet. If we wanted to get all (people, pet) pairs where the owner of the pet is over 30 years old, we could use the following query:
SELECT * FROM people, pets WHERE people.id = pets.owner_id AND people.age > 30;
If there is an index on the owner_id field, Postgres can execute an index join. Postgres can first retrieve all rows from the people table for people over the age of 30 (using either a sequential scan, index scan, or bitmap index scan). Then Postgres can iterate through each person over 30 and use the index on owner_id to quickly lookup all pets that belong that person and then add that pair of rows to the result. This is only O(M*log(N)) and is usually much more efficient than the O(M*N) naive nested loop.