A common problem I’ve seen that many people are unfamiliar with how to solve is obtaining all of the rows that are the maximum of one value grouped by another value. For example, you may have a pets table with columns name, age, and owner_id, and you may want to obtain the name of the oldest pet for each owner. This seems like a problem that could be solved by a group by, but it actually cannot be using the built-in aggregation functions1. If you try writing a query with a group by, you’ll try to write something like:
SELECT owner_id, <agg for the name of the oldest pet> GROUP BY owner_id;
Believe it or not, there is no built-in aggregation that can be used to get the desired result. The easiest way to solve this problem is with SELECT DISTINCT ON … ORDER BY2. The general format of such a query looks like:
SELECT DISTINCT ON (<group by columns>) <selected columns> FROM <table> ORDER BY <group by columns>, <order by columns>;
In this specific case, a query that solves the problem is:
SELECT DISTINCT ON (owner_id) owner_id, name FROM pets ORDER BY owner_id, age DESC;
Let’s breakdown this query and see how it works. The first part you need to understand is the behavior of SELECT DISTINCT ON. SELECT DISTINCT ON returns one row for every unique combination of values in the parens. For example, the query:
SELECT DISTINCT ON (owner_id) owner_id, name FROM pets;
Will return the owner_id and name of one pet belonging to each distinct pet owner. One important detail of the behavior of SELECT DISTINCT ON is that it will always return the first row it encounters for each distinct pet owner. That means you can use SELECT DISTINCT ON with ORDER BY to get the row with the highest/lowest value of a given column for each combination of the values being distincted on. The query above uses this fact to get the pet with the oldest age for each owner.
One additional detail with the query is you need to include the columns being distincted on in the ORDER BY. This is necessary because of how Postgres executes a query with DISTINCT ON internally. Postgres executes such a query by first sorting by the fields being distincted on. Sorting the rows this way places all rows with the same values of the fields being distincted on together. Postgres can then just iterate through all of the rows and add a new row to the result of the query whenever it sees a row whose values of the columns differ from the row before it. SELECT DISTINCT ON … ORDER BY takes advantage of this behavior by first sorting by the columns being distincted on and then sorting by the columns being ordered by. That way all rows with the same values of the fields being distincted on are clumped together, and the first row in each clump is the row desired in the output.
- It is possible to define custom aggregation functions that could solve this problem.
- This problem can also be solved with window functions and a subselect, but that is slightly trickier and not as efficient.