This is my first post under the Getting Dramatically Better as a Programmer series. Today we are going to look at the paper “A Critique of the CAP Theorem“. The paper was written by Martin Klepmann. Klepmann also wrote the book Designing Data Intensive Applications. I’m going to go through the paper, explain the main points made by the paper, and summarize the takeaways I got from the paper. This is partially for forcing myself to have to explain the paper to others and also for others to learn about the paper.
The CAP theorem is a well known theorem about distributed systems. As the name suggests, “A Critique of the CAP Theorem” points out several issues with the CAP theorem. In particular, Klepmann looks at two different issues with the CAP theorem:
- The terms used in the theorem have unclear definitions and can be interpreted in many different ways.
- Some assumptions made in the proof the CAP theorem do not reflect reality.
Klepmann then proposes a new way of modeling distributed systems he calls “delay-sensitivity”.Klepmann designed delay-sensitivity to address the problems of the CAP theorem.
Background: The CAP Theorem
Before we dig into the paper, let’s first look at the CAP theorem and the history surrounding it. The CAP theorem was originally an informal claim about distributed systems made by Eric Brewer. Brewer’s claim is now known as “Brewer’s Conjecture”. Brewer’s Conjecture stated that a distributed system can have at most two of: consistency, availability, and partition-tolerance (CAP). When making his claim, Brewer gave only rough definitions for each of these terms.
A more intuitive way to interpret Brewer’s Conjecture is the following: in a distributed system, if some servers cannot access each other, the distributed system will either be unable to process some requests (lack of availability) or the distributed system will not behave like a single server (lack of consistency).
To give a specific example, let’s say you want to build a website like Google Docs. You want multiple people to be able to collaborate on a document together. They should all be able to make edits to the document and everyone should see each other’s edits in real-time. You can consider the computers of the users of the website and the servers of the website itself to be a distributed system.
Brewer’s Conjecture states that when a user loses connection to the website (a network partition occurs) you have to make a choice. One option is you don’t allow the user to edit the document until they reconnect. This is choosing consistency over availability. By doing this, you guarantee that when a user views the document, they see the same document everyone else is seeing (consistency). At the same time, they won’t be able to view the document when they are unable to access the website (lack of availability).
The other option you have is to provide some sort of offline mode. If you add offline functionality, you are choosing availability over consistency. Users will be able to continue editing the document locally even if they can’t connect to the website (availability), but every user will not be able to see the most recent edits made by the other users (lack of consistency).
Brewer’s Conjecture states that it’s impossible to provide both availability and consistency when a user cannot connect to the website.
Brewer’s Conjecture is important because many systems want to provide both availability and consistency when there is a network partition. In practice, any system can only provide one of them during a partition. Ultimately whether you choose availability or consistency is a design decision.
Soon after Brewer made his conjecture, Seth Gilbert and Nancy Lynch formalized it in “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services“. Gilbert and Lynch also gave a proof of the formal version of the conjecture. Since Gilbert and Lynch proved the formal version of Brewer’s Conjecture, the formal version became a theorem. That theorem is now known as “The CAP Theorem”.
Now that we’ve covered the history of the CAP theorem, let’s dig into Klepmann’s critique of it.
Ambiguity of the Terms
One of the main issues raised by Klepmann is that the definitions of the terms “availability” and “consistency” are ambiguous in Brewer’s Conjecture.
There are many different ways you can interpret “availability”. One way to interpret it is about how often a production system returns a successful response. For example, you can measure this kind of availability by looking at what percentage of requests are successful. An alternative way to define availability is as a property of the algorithm a system is running. If the algorithm guarantees successful responses in all cases, the algorithm can be referred to as “available”.
There is a related question of how to deal with slow responses. One option is to define a system as available if it returns a response eventually. Another way is to define a system as available if it returns a response in under a given time limit.
The term “consistency” is also overloaded with many different meanings. There’s consistency in the ACID sense which is completely different from the consistency Brewer was referring to.1 In the context of distributed systems, people often refer to “consistency models”. A specific consistency model provides guarantees about the behavior of the distributed system. Some common consistency models are sequential consistency, causal consistency, and linearizability. Each of these models provides a different set of guarantees.
To give you a rough idea of the different consistency models, sequential consistency requires that executing several operations in a distributed system is the same as executing those operations in some order, one at a time. Linearizability is the same as sequential consistency except it also requires that if operation A started after operation B completed, the effects of operation A must come after the effects of operation B.
The term “strong consistency” also comes up often. Even though it does come up often, it has no specific formal definition. “Strong consistency” can refer to any of the consistency models above.
Differences Between Theory and Practice
After looking at some issues around the terms used in the CAP theorem, Klepmann goes on to look at the formal version. This is the CAP theorem as proved by Gilbert and Lynch. In defining the CAP theorem, Gilbert and Lynch define the words “availability”, “consistency”, and “partition-tolerance”. In particular they define them as follows:
- Availability – Every request to the service must terminate.
- Consistency – The system provides linearizability.
- Partition-tolerance – The system still needs to work even when some nodes in the system are unable to communicate with other nodes in the system.
Klepmann points out that these definitions and some assumptions made by Gilbert and Lynch do not model reality.
There’s a big difference in the use of “availability” in the CAP theorem and the use of “availability” in practice. In the CAP theorem, availability means the system only needs to return a response eventually. That means the system could return a response in a few seconds, several minutes, or even several days. As long as the system eventually returns a response, it is considered to be available by the CAP theorem. In practice, you usually set a timeout so that if a system takes too long to respond, the system is considered to be unavailable. If a website takes more than a few seconds to response, it doesn’t matter that the system is available in the CAP sense. For all practical purposes, if no one is able to access the website, the website is unavailable.
Another big difference is the CAP theorem only considers one kind of failure, a partition. In practice there are many other failures that can occur. Individual servers can crash. Individual network links can go down. Data on servers can even become corrupted. This makes the CAP theorem useful if the only kind of failure you are worried about are partitions. If you want to reason about other kinds of failures in your distributed system, it’s difficult to make use of the CAP theorem.
To deal with the issues of the CAP theorem, Klepmann gives a rough draft of a system he calls “delay-sensitivity”. In delay-sensitivity, you measure every operation by whether the operation requires time proportional to the network latency. The advantage of delay-sensitivity is it gives you an idea of how long each operation will take. This is as opposed to the CAP theorem which creates a dichotomy between available and unavailable.
For a linearizable system, it’s possible to prove that all operations will take time proportional to the network latency. When there is a network partition, the network latency is unbounded, That means when a partition occurs in a linearizable system, operations will take an unbounded amount of time and the system will become unavailable. Such a system is consistent, but not available when there is a network partition.
On the other hand, it’s possible to provide a non-linearizable system in which all operations complete independently of the network latency. This means even when there is a partition and the network latency is unbounded, operations will still complete. In other words, the system is available even when there is a network partition. When there is a network partition, this system would be available, but it lacks consistency since it does not provide linearizability.
- Many of the terms used by the CAP theorem are ambiguous. “Availability”, “Consistency”, and “Partition-tolerance” can all be interpreted in many different ways.
- The CAP theorem makes many assumptions that do not reflect reality. The CAP theorem only looks at partitions as the only possible fault that can come up. It also considers a service to be available as long as the service is able to eventually return a response. It doesn’t matter how long the response will take.
While the paper was an interesting read, I do have some issues with it. The paper is titled “A Critique of the CAP Theorem”. While part of the paper discusses how the CAP theorem differs from reality, a good amount of the paper focuses on the ambiguity of the terms of the CAP theorem. The ambiguity isn’t so much a problem with The CAP theorem itself. When they define the CAP theorem, Gilbert and Lynch give formal definitions for all the terms they use. The issues raised are more an issue with the colloquial version of the CAP theorem.
The issues raised about the actual CAP theorem are a bit more interesting. Especially the part about how the CAP theorem doesn’t have any regard for how long the operations actually take. I do like this aspect of delay-sensitivity since delay-sensitivity roughly measures how long each operation takes, even if there isn’t a network partition. On the other hand, I do find it difficult to reason about the exact implications of an operation taking time proportional to the network latency. I find it much easier to reason about the CAP theorem because it makes a clear distinction between a system being available or unavailable.