Postgres is great, but there are some very common problems that can easily lead to outages. These aren’t just theoretical issues: from talking to a lot of startups, these are the things that actually cause outages in production, especially on teams that don’t have a dedicated database person. These are the four main areas I want to improve on in pgrust, my reimplementation of Postgres.
VACUUM and Transaction ID Wraparound
Postgres’s VACUUM is probably the number one cause of Postgres-related outages. When you delete a row, Postgres doesn’t actually delete it; it keeps the row around and marks it as deleted. This is done for several reasons, including giving Postgres an easy way to roll back the transaction and letting concurrent transactions keep seeing the old row version (the heart of Postgres’s MVCC design). To actually delete the rows and free up the space, Postgres has a background job called the vacuum, which physically removes the rows that were marked as deleted.
This creates some problems. First, the vacuum itself: because it has to read all your data, the vacuum can consume a ton of I/O, and that I/O competes with the I/O your queries need, slowing them down. If you configure the vacuum to use less I/O, it may fall behind, and you can end up accruing dead rows faster than the vacuum can clean them up. You end up playing a messy balancing game: the vacuum needs to be aggressive enough to keep up with how quickly you’re deleting and updating rows, but not so aggressive that it starves your queries of I/O.
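To make the tradeoff concrete, here’s a sketch of what monitoring and tuning this looks like today. The views and settings are standard Postgres; the specific values are just illustrative:

```sql
-- How far behind is autovacuum? Dead tuples pile up when it can't keep pace.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- The tuning knobs: a higher cost limit / lower delay means a faster vacuum
-- that steals more I/O from your queries. Values here are illustrative.
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 1000;  -- default -1 (falls back to vacuum_cost_limit, 200)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms'; -- default 2ms
SELECT pg_reload_conf();
```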
The second issue is transaction ID wraparound. In order to keep track of which rows are live and which are dead, Postgres labels rows with transaction IDs, and it uses a 32-bit integer to represent them. That gives you about four billion IDs, and because of how Postgres compares them, only about two billion can be “in the past” at any time. One job of the vacuum is to reclaim old transaction IDs so they can be reused. If the vacuum falls too far behind on reclaiming them, Postgres will refuse to start new transactions, effectively taking your database down, to prevent you from reusing live transaction IDs. This may sound silly, but it’s necessary because otherwise your data can become corrupted. I would bet this has caused on the order of thousands or tens of thousands of Postgres outages.
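You can watch this clock ticking yourself. This query is standard Postgres, nothing hypothetical:

```sql
-- age() counts transactions since the oldest unfrozen ID in each database.
-- Postgres emits warnings as this approaches the ~2 billion limit and
-- eventually refuses new transactions to protect your data.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```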
What pgrust is doing differently:
I’m looking at two different ways to solve this:
- There’s been a patch out there to add 64-bit transaction IDs to Postgres. There’s been reluctance to merge it for a number of reasons: it breaks compatibility with previous versions of Postgres and adds 32 bits of overhead to every row. Given the number of outages wraparound has caused, I personally think that tradeoff is totally worth it.
- I’m looking at alternative architectures that don’t require a VACUUM at all. Oracle takes a completely different approach to marking rows as deleted, called an undo log, that doesn’t require a VACUUM. There have been projects exploring adding this kind of storage to Postgres (zheap being the best known).
Connection Limits and Query Parallelism
When you spin up a Postgres instance, there’s a configuration variable for the maximum number of connections Postgres will accept. If more clients try to connect than the limit allows, the extra connections are refused, and as far as your application is concerned, the database is down. Changing the limit requires restarting your DB, so you need to be very thoughtful about this setting when first creating your database. If you don’t know about it ahead of time, you’ll likely end up with a connection limit that’s too low. Many people run software like PgBouncer in front of Postgres, which partially solves the problem, but it’s still a big pain.
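If you want to see where you stand, the limit and the current usage are both easy to inspect. These are standard Postgres commands; only the new value is illustrative:

```sql
SHOW max_connections;                    -- the hard limit, default 100
SELECT count(*) FROM pg_stat_activity;   -- connections in use right now

-- Raising the limit only takes effect after a full restart:
ALTER SYSTEM SET max_connections = 500;  -- illustrative value
```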
You may ask: why does Postgres even have a connection limit in the first place? The answer largely comes down to how Postgres parallelizes queries. Every time you open a new connection, Postgres forks a new operating system process. Compared to other approaches to parallelism, processes are expensive, both in the memory and CPU they consume and in the time it takes to spin one up.
Processes don’t just limit Postgres’s ability to parallelize across queries; they also limit its ability to parallelize within a query. When Postgres wants to execute a single query in parallel, it spins up multiple worker processes and splits the work among them. Because spinning up processes is expensive, Postgres will only do this for queries it expects to run long enough to amortize that cost.
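You can see both the gating and the result with standard settings and EXPLAIN. The table name below is hypothetical, and the plan is just illustrative of the shape to look for:

```sql
SHOW max_parallel_workers_per_gather;  -- workers allowed per query, default 2
SHOW min_parallel_table_scan_size;     -- default 8MB; smaller scans stay serial

EXPLAIN SELECT count(*) FROM big_table;
-- A parallel plan shows up as something like:
--   Finalize Aggregate
--     ->  Gather (Workers Planned: 2)
--          ->  Partial Aggregate
--               ->  Parallel Seq Scan on big_table
```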
People have been talking for years about switching Postgres from a process model to a threading model, but nothing concrete has come out of it. One of the big challenges is that it’s such a massive refactor that it’s hard to do at all, let alone safely.
What pgrust does:
pgrust was architected from day one with a thread-based model. One of the big reasons Postgres uses a process model is safety: processes are more isolated from each other than threads, so if something goes wrong in one process, it won’t take down the others. Because pgrust is written in Rust, which provides memory-safety guarantees at compile time, this isn’t a risk for pgrust.
Bad Query Plans
Postgres’s query planner, although very smart, can be challenging to work with. If you aren’t familiar: Postgres has dozens of different algorithms it can use to execute a query. When you run one, Postgres inspects the query and the statistics it keeps about the data in your tables, and tries to figure out which algorithms will execute the query the fastest.
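EXPLAIN is the window into those choices. A hypothetical example (the table, index, and cost numbers are all illustrative):

```sql
EXPLAIN SELECT * FROM users WHERE email = 'a@example.com';
-- The planner picked an index scan because statistics say the filter is selective:
--   Index Scan using users_email_idx on users  (cost=0.42..8.44 rows=1 width=72)
```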
In many cases, Postgres gets things right. The problem is that when it gets things wrong, it can get them very wrong. The difference between a good query plan and a bad one can be the difference between a query taking 10ms and taking 10 minutes. Even worse, Postgres doesn’t give you meaningful control over the query planner, so when you do get a bad plan, you’re pretty much SOL.
If you get a bad query plan, the only real option is to turn some of Postgres’s algorithms off entirely. Other databases like MySQL allow “planner hints,” which let you influence how specific queries are executed. Postgres doesn’t have anything similar.
At a previous company I worked at, we had to globally turn off nested loop joins because Postgres would try to use them all the time, and it made all our queries incredibly slow. This was in large part due to our use of json, which made it so Postgres couldn’t understand our data, but I’ll get back to that in a bit.
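For the record, “turning off an algorithm” really is this blunt. The toggle below is a real Postgres setting, and it applies to everything, not just the one query that needed it:

```sql
-- Disable nested loop joins for the whole session (or, via ALTER SYSTEM, globally).
-- The planner can still fall back to them when no other join strategy exists,
-- but it will avoid them whenever it can.
SET enable_nestloop = off;
```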
What pgrust is building toward:
This is one of the more theoretical parts of pgrust, but I want to build in an adaptive query planner. The idea: if a query that normally takes 10ms suddenly takes 10 seconds, pgrust would detect the regression, check if the plan changed, check if the underlying statistics drifted, and make a correction, instead of waiting for a DBA to notice and manually intervene.
This hasn’t been built into pgrust yet, but I don’t see a reason why this can’t exist.
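To give a feel for the target: today, catching a regression like this is a manual job with pg_stat_statements (a standard extension; the 100x threshold below is an arbitrary illustration). The adaptive planner would fold this detect-and-react loop into the database itself:

```sql
-- Find queries whose worst run is wildly slower than their typical run,
-- a telltale sign of a plan flip. (Columns are from pg_stat_statements, PG13+.)
SELECT query, calls, mean_exec_time, max_exec_time
FROM pg_stat_statements
WHERE max_exec_time > 100 * mean_exec_time  -- arbitrary 100x threshold
ORDER BY max_exec_time DESC;
```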
JSON
Postgres has a love-hate relationship with json. When you’re working with unstructured data, Postgres json makes it really easy to store and query that data. Unfortunately, it’s also really easy for things to go wrong.
At Heap, we were heavy users of json, and it was a massive headache. My manager even wrote a whole blog post about it. The biggest issue is that Postgres doesn’t capture statistics for json. For most types, Postgres collects a ton of statistics: a histogram, the most common values, and more. For json, it captures nothing meaningful. Instead, when you filter on a json column, Postgres estimates the filter will match 0.1% of rows. That 0.1% is a magic number, not something calculated from your data; the right number could be 80% or it could be 0.0001%.
All this together means that as soon as you start filtering on json, you’re pretty much guaranteed to start getting bad query plans.
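Here’s what that looks like in practice. The table and column are hypothetical, but the hardcoded guess is the real behavior described above:

```sql
-- On a 1,000,000-row table, the planner guesses 0.1% no matter what:
EXPLAIN SELECT * FROM events
WHERE properties @> '{"type": "click"}';
--   Seq Scan on events  (cost=... rows=1000 ...)
-- rows=1000 whether the real match rate is 80% or 0.0001%.
```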
That’s to say nothing of the other big problem with json: Postgres doesn’t do any cross-row compression of your json. Large individual values do get compressed via TOAST, but if you reuse the same keys and values across rows, Postgres stores those full keys and values from scratch every time. There’s a real opportunity to do something like dictionary compression to reduce the amount of space json takes up.
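A quick way to see the per-row cost (pg_column_size is a built-in; the payload is illustrative):

```sql
-- Every row pays full price for its own copy of the key strings:
SELECT pg_column_size('{"user_id": 1, "event_type": "click", "page": "/home"}'::jsonb);
-- Across a billion rows, "user_id"/"event_type"/"page" are stored a billion times;
-- dictionary compression would store each key string once.
```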
What pgrust is going to do about it
The first thing I want to do is build real statistics for json. This alone would solve 80% of the json pain points. There have been a couple patches floating around for adding json statistics to Postgres, and I want to see one through to completion. It will require some experimentation, but I’ve got a couple ideas in mind.
I also want to introduce some form of compression to shrink the amount of space json takes up. That could mean cross-row compression or dictionary compression of repeated keys and values.
What’s on the roadmap for pgrust
The current priorities:
- Continue pushing toward 100% Postgres compatibility – we’re passing 96% of the Postgres regression suite today with only a couple issues remaining.
- Stability and bug bashing – getting pgrust to a state where people can start trusting it
- Start working on the architectural improvements discussed above – starting with 64-bit transaction IDs
Links for pgrust
- Try it: pgrust.com (WASM demo, runs in your browser)
- Source: GitHub
- Community: Discord
- Updates: mailing list