I’ve been writing a lot these past two months. I decided I’m going to take a break for a little bit. I plan on starting to write continuously again within the next 2-4 weeks. That is all.
The parameter track_io_timing is relatively unknown, but it is extremely helpful when optimizing queries. As the name suggests, when the parameter is turned on, Postgres tracks how long I/O takes. Then, when you run a query with EXPLAIN (ANALYZE, BUFFERS), Postgres displays how much time was spent just performing I/O.
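You can check the current value of the parameter with SHOW. A quick sketch (note that, if I recall correctly, changing track_io_timing requires superuser privileges, so this assumes a superuser session):

```sql
-- check the current setting (off by default)
SHOW track_io_timing;

-- enable it for just the current session
-- (assumption: changing track_io_timing requires superuser privileges)
SET track_io_timing = on;
```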
You normally don’t want to leave track_io_timing on all the time, since it incurs a significant amount of overhead. To get around this, when you want to time how long a query spends performing I/O, you can wrap the query in a transaction with SET LOCAL track_io_timing = on;. This enables track_io_timing only for the duration of the transaction. As a specific example of track_io_timing, here’s a simple query over a table I have lying around:
> BEGIN;
> SET LOCAL track_io_timing = on;
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pets;
> COMMIT;
                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on pets  (cost=0.00..607.08 rows=40008 width=330) (actual time=8.318..38.126 rows=40009 loops=1)
   Buffers: shared read=207
   I/O Timings: read=30.927
 Planning time: 161.577 ms
 Execution time: 42.104 ms
The I/O Timings field shows us that of the 42ms spent executing the query, ~31ms was spent performing I/O. Now if we perform the query again when the data is cached:
> BEGIN;
> SET LOCAL track_io_timing = on;
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pets;
> COMMIT;
                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on pets  (cost=0.00..607.08 rows=40008 width=330) (actual time=0.004..7.504 rows=40009 loops=1)
   Buffers: shared hit=207
 Planning time: 0.367 ms
 Execution time: 11.478 ms
We can see the query is just about 31ms faster! This time the plan shows no I/O timing information at all, since the data was cached and no time was spent performing I/O.
When benchmarking queries, I also make sure to use track_io_timing so I can see whether the expensive part of the query is the I/O, or something else entirely.
Since version 9.5, Postgres has supported a useful feature called UPSERT. Curiously, the feature is referred to as UPSERT even though there is no UPSERT SQL command (the actual syntax is INSERT … ON CONFLICT). In addition to being a useful feature, UPSERT is fairly interesting from a “behind the scenes” perspective as well.
If you haven’t noticed yet, the word “upsert” is a portmanteau of the words “update” and “insert”. As a feature, UPSERT lets you insert new data if that data does not already exist, and specify an action to perform instead if it does. More specifically, when there is a unique constraint on a column (a constraint specifying that all values in the column are distinct from each other), UPSERT lets you say “insert this row if it does not violate the unique constraint; otherwise, perform this action to resolve the conflict”.
As an example, let’s say we have a counters table where each row represents a counter. The table has two columns, id and value, where the id specifies the counter we are referring to, and value is the number of times the counter has been incremented. It would be nice if we could increment a counter without needing to create the counter in advance. This is a perfect job for UPSERT. First let’s create the table:
CREATE TABLE counters (id bigint UNIQUE, value bigint);
It’s important that the id column is marked as unique. Without that, we would be unable to use UPSERT.
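To see why, here’s a quick sketch of what happens without the unique constraint. The table name here is made up, and the exact error text may vary between Postgres versions:

```sql
-- hypothetical table with no unique constraint on id
CREATE TABLE counters_no_unique (id bigint, value bigint);

-- the UPSERT is rejected, because ON CONFLICT (id) needs a
-- unique index or constraint on id to detect conflicts
INSERT INTO counters_no_unique
SELECT 0 AS id, 1 AS value
ON CONFLICT (id) DO UPDATE SET value = counters_no_unique.value + 1;
-- ERROR:  there is no unique or exclusion constraint matching the ON CONFLICT specification
```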
To write an UPSERT query, you first write a normal INSERT for the case when the constraint is not violated. In this case, when a counter with the given id does not already exist, we want to create a new counter with that id and the value 1. An INSERT that does this looks like:
INSERT INTO counters (id, value) SELECT <id> AS id, 1 AS value;
Then, to make it an UPSERT, you append ON CONFLICT (<unique column>) DO <action>. The action can either be NOTHING, in which case the conflicting row is simply not inserted, or UPDATE SET <column1> = <expr1>, <column2> = <expr2>, …, which modifies the existing row, setting the listed columns to the new values. In this case we want to use the UPDATE form to increment the value of the counter. The whole query winds up looking like:
INSERT INTO counters SELECT <id> AS id, 1 AS value ON CONFLICT (id) DO UPDATE SET value = counters.value + 1;
When you run the above command with a given id, it will create a new counter with the value 1 if a counter with the id does not already exist. Otherwise it will increment the value of the existing counter. Here’s some examples of its use:
> SELECT * FROM counters;
 id | value
----+-------
(0 rows)

> INSERT INTO counters SELECT 0 AS id, 1 AS value
> ON CONFLICT (id) DO UPDATE SET value = counters.value + 1;
> SELECT * FROM counters;
 id | value
----+-------
  0 |     1
(1 row)

> INSERT INTO counters SELECT 0 AS id, 1 AS value
> ON CONFLICT (id) DO UPDATE SET value = counters.value + 1;
> SELECT * FROM counters;
 id | value
----+-------
  0 |     2
(1 row)

> INSERT INTO counters SELECT 0 AS id, 1 AS value
> ON CONFLICT (id) DO UPDATE SET value = counters.value + 1;
> SELECT * FROM counters;
 id | value
----+-------
  0 |     3
(1 row)

> INSERT INTO counters SELECT 1 AS id, 1 AS value
> ON CONFLICT (id) DO UPDATE SET value = counters.value + 1;
> SELECT * FROM counters;
 id | value
----+-------
  0 |     3
  1 |     1
(2 rows)

> INSERT INTO counters SELECT 1 AS id, 1 AS value
> ON CONFLICT (id) DO UPDATE SET value = counters.value + 1;
> SELECT * FROM counters;
 id | value
----+-------
  0 |     3
  1 |     2
(2 rows)
One last bit about UPSERT: you can use the pseudo-table excluded to refer to the new row being inserted. This is useful if you either want to replace values of the old row with values of the new row, or make the values of the row a combination of the old and new values. As an example, let’s say we want to extend the counter example to increment by an arbitrary amount. That can be done with:
INSERT INTO counters SELECT <id> AS id, <amount> AS value ON CONFLICT (id) DO UPDATE SET value = counters.value + excluded.value;
This even works if you are incrementing multiple counters simultaneously all by different amounts.
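Since excluded refers to each proposed row in turn, a single multi-row INSERT can bump several counters at once, each by a different amount. A sketch using the VALUES form instead of the SELECT form above:

```sql
-- bump counter 0 by 5 and counter 1 by 2 in one statement;
-- for each conflicting row, excluded.value is that row's amount
INSERT INTO counters (id, value)
VALUES (0, 5), (1, 2)
ON CONFLICT (id) DO UPDATE SET value = counters.value + excluded.value;
```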
What makes UPSERT so interesting to me is that it works even in concurrent situations. UPSERT behaves correctly even when other INSERT and UPDATE queries are running simultaneously! Before the UPSERT feature existed, there was a fairly complex method of emulating it. That method involved using PL/pgSQL to alternate between running UPDATE and INSERT statements until one of them succeeded. The statements need to be run in a loop because a concurrent INSERT could insert the row just before our INSERT runs, and a concurrent DELETE could remove the row just before our UPDATE runs. The UPSERT feature takes care of all of this for you, while at the same time providing a single command for the common pattern of inserting data if it does not already exist and otherwise modifying the old data!
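For reference, the old emulation looked roughly like the following PL/pgSQL function. This is a sketch adapted from the well-known example in the Postgres documentation; the function name is made up:

```sql
CREATE FUNCTION increment_counter(k bigint) RETURNS void AS $$
BEGIN
    LOOP
        -- first try to update an existing counter
        UPDATE counters SET value = value + 1 WHERE id = k;
        IF found THEN
            RETURN;
        END IF;
        -- no existing counter, so try to insert one; if a concurrent
        -- transaction inserts the same id first, we get a unique_violation
        -- and loop around to retry the UPDATE
        BEGIN
            INSERT INTO counters (id, value) VALUES (k, 1);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- fall through to the next iteration of the loop
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```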
Postgres has many different types of scans built in. The list includes sequential scans, index scans, and bitmap scans. One useful scan type Postgres does not have built in, but that other databases do, is the loose index scan. Both MySQL and Oracle support loose index scans. Fortunately for Postgres users, loose index scans can be emulated through a recursive CTE.
At a high level, the idea behind the loose index scan is that the rows necessary for the query are at predictable locations in the index. All of the index scans Postgres currently supports require reading every value in the index between two endpoints. Although Postgres does not do it, for certain queries it is possible to skip over large parts of the index, because it is possible to determine that the values in those parts cannot affect the result of the query. For a specific example, let’s say we want to calculate the number of distinct elements in a table:
CREATE TABLE ints (n BIGINT);
INSERT INTO ints SELECT floor(random() * 10) FROM generate_series(1, 10000000);
CREATE INDEX ON ints (n);
EXPLAIN ANALYZE SELECT COUNT(DISTINCT n) FROM ints;
This example creates a table with 10,000,000 rows, each containing a random number from 0 to 9. Then it counts the number of distinct elements using a naive COUNT(DISTINCT n). The plan for the query is available here.
As you can see from the query plan, the naive COUNT(DISTINCT n) takes six seconds. Of that time, 650 milliseconds is spent reading the data from disk, while the rest is spent performing the aggregation to determine how many distinct elements there are. A sequential scan is performed only because the query requires reading every row from the table, and a sequential scan is the fastest way to do that.
Most of the work Postgres is doing for this query is unnecessary. For example, once Postgres sees a row with a zero, it knows all other rows that contain a zero will not affect the result. With this idea in mind, we can write the following query which performs a loose index scan:
> EXPLAIN ANALYZE
> WITH RECURSIVE temp (i) AS (
>     (SELECT n FROM ints ORDER BY n ASC LIMIT 1)
>   UNION ALL
>     (SELECT n FROM temp,
>      LATERAL (SELECT n
>               FROM ints
>               WHERE n > i
>               ORDER BY n ASC
>               LIMIT 1) sub)
> )
> SELECT COUNT(*) FROM temp;
The plan is available here.
This new query takes less than one millisecond! Let’s try dissecting this query. First of all, this query makes use of a recursive CTE. If you aren’t familiar with how to interpret recursive CTEs, I suggest you read my blog post on how to understand them. The first part of the recursive CTE is:
SELECT n FROM ints ORDER BY n ASC LIMIT 1
This simply gets the smallest number from the table. As for the recursive part of the recursive CTE:
SELECT n FROM temp, LATERAL (SELECT n FROM ints WHERE n > i ORDER BY n ASC LIMIT 1) sub
This query takes the current element in the working table and uses a lateral join to fetch the smallest element greater than it; in other words, the next largest element. By first fetching the smallest value in the table and then repeatedly fetching the next largest, the recursive CTE selects exactly the distinct elements of the table! From there, we just need to count the rows the CTE produces to get the number of distinct elements in the table.
By using an index to find each element, only a single index scan needs to be performed per distinct element. Since there are only 10 distinct elements in the table, only 10 index scans need to be performed. The loose index scan is so much faster than the regular aggregation because once it has seen a given value, it skips over all other rows with that value.
Besides counting the number of distinct elements in a table, there is one other main use case where loose index scans are helpful: calculating the min of one column grouped by another column. If you have a specific value of the grouping column, you can use a compound index on (grouping column, other column) to quickly find the smallest value of the other column for that one value. By iterating over the distinct values of the grouping column, you can answer the query fairly quickly. Without a loose index scan, the query would read the entire table which, as we already know, is pretty slow.
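A sketch of that second use case, with hypothetical table and column names: assuming a compound index on items (category, price), the following computes the minimum price per category by walking the distinct categories with a loose index scan, then probing the index once per category:

```sql
-- assumes: CREATE INDEX ON items (category, price);

WITH RECURSIVE categories (category) AS (
    -- start with the smallest category
    (SELECT category FROM items ORDER BY category ASC LIMIT 1)
  UNION ALL
    -- repeatedly fetch the next largest category, skipping duplicates
    (SELECT sub.category
     FROM categories,
          LATERAL (SELECT category
                   FROM items
                   WHERE category > categories.category
                   ORDER BY category ASC
                   LIMIT 1) sub)
)
SELECT c.category,
       -- one index probe per category finds the smallest price
       (SELECT min(price) FROM items WHERE items.category = c.category) AS min_price
FROM categories c;
```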