This is the last part of a three part series examining the Postgres join algorithms.
While a hash join is usually the fastest join algorithm, it is only so when it can be performed in memory. When Postgres thinks the hash table needed for a hash join will exceed memory, Postgres will typically use a merge join instead. A merge join winds up using disk much more effectively than a hash join can.
Similar to a hash join, a merge join only works on joins where the join condition is an equality filter. Let’s use the example we’ve been using with a people table (with fields id, name, and age) and a pets table (with fields id and owner_id), and the query:
SELECT * FROM people, pets WHERE people.id = pets.owner_id AND people.age > 30;
If Postgres decides to use a merge join, execution could proceed as follows. First find all rows for people over 30 and then sort those rows by people.id. Then fetch all rows from the pets table and sort them by owner_id. With the two input relations sorted, Postgres then performs a “merge”, similar to the merge step in merge sort.
The merge step works by comparing the first rows from each input (the person with the smallest id, and the pet with the smallest owner_id). If the id of the owner is equal to the owner_id of the pet, Postgres will add a row to the result of the join for that pair of rows and remove those two rows from the input1. If the fields are not equal and the person has an id smaller than the owner_id of the pet, Postgres will remove the person with the smallest id and continue the join with the person that has the next smallest id. The pet is removed instead if the owner_id of the pet is smaller than the id of the person. This process continues until one of the inputs runs out at which point Postgres returns all of the rows it generated as the result of the join.
Although a merge join is O(N*log(N) + M*log(M)), due to the need to sort the inputs, it can still be faster than a hash join even though it is asymptotically worse. If the sorting algorithm used is merge sort (it typically is), almost the entirety of the merge join is based on merging two inputs together. Postgres can perform a merge on disk very efficiently by keeping a small subset of the data in memory, performing a merge on that data, writing the result of that merge out to disk, and then reading more of the inputs into memory from disk. This reads and writes large amounts of data to/from disk at a time, which is relatively efficient when compared to an on-disk hash table which would be performing random I/O all over the place.
In addition to a merge join being faster when performed on disk, there is a case where a merge join is faster than a hash join, even when the hash join is performed in memory. If one or both of the inputs to the merge join are already sorted by the field being joined on, Postgres can skip the sort step and go straight to the merge step which, under some circumstances, makes the merge join faster than the hash join.