My goal with Perfalytics is to hit $100,000 in annual recurring revenue by the end of 2019. Assuming each user pays $99/month, that means I need to get to 85 paying users by the end of the year. Getting to 85 customers won’t be easy. It’s going to take a lot of work and careful planning. To make sure I stay on track, I worked backwards to come up with a rough plan and the milestones I’ll need to hit along the way.

The high level plan I made has two distinct phases. At the beginning of the year, I’m going to focus on iterating and improving the product. During this time I’m going to make sure I’m making something that people want and are willing to pay for. If I don’t make something people want, Perfalytics will fail no matter what else I do.

Once I’ve got a small group of people who are getting value out of Perfalytics, I’m going to shift my focus to acquiring new customers. Initially, I’ll focus on acquiring customers one at a time. As time goes on, I’ll work on building out a scalable and repeatable process for acquiring new users.

Based on this rough plan, I came up with the following four quarterly milestones:

By the end of Q1 2019, at least one person will be paying for Perfalytics.

During Q1, I’m going to be talking with potential customers and have people try out the product. I want to understand their needs and how a tool like Perfalytics can help them. Once I’ve got a good sense of what problems are the most important for me to work on, I’ll make sure Perfalytics solves those problems.

If I can get just one person to pay for Perfalytics, that means it meets a need that someone besides me has.

By the end of Q2, at least five people will be paying for Perfalytics.

Q2 is going to look a lot like Q1. I’m still going to be focusing on how to improve the product. The difference is I’m going to have more people trying out Perfalytics and getting more feedback on the product itself. Each person is going to have different needs and wants, so the product will need to change a bit as I sell to each of the first few customers.

Once I get to five paying customers, I’ll have strong evidence that Perfalytics is solving a need that many different people have.

By the end of Q3, at least 20 people will be paying for Perfalytics.

At this point I’ll have built something that multiple people are getting value out of. Now I’ll shift my focus to acquiring more and more customers. My main focus during Q3 will be on finding people who are a good fit for Perfalytics and building out a repeatable sales process. I’m going to look at a few different approaches to acquiring customers and see which ones are the most effective. The main approaches I’m considering right now are email and Google ads.

By the end of Q4, at least 85 people will be paying for Perfalytics.

During Q4, I’ll take the sales process I built out in Q3 and scale it. I’ll need to find people who are good fits for Perfalytics much faster. As a comparison, I need to acquire a net 15 new customers in Q3; in Q4 I need to find a net 65 new customers. This means I’ll have to make the process far more efficient to acquire all the new users in Q4.

This plan and these milestones are tentative, but they give me a starting point. Now that I’ve got a sense of the milestones I need to hit along the way, I only need to make sure I hit each one and I’ll achieve my goal!

My Starting Point

Running a bootstrapped business isn’t going to be easy. To make Perfalytics succeed, I’m going to need to lean on my strengths and, at the same time, compensate for my weaknesses.

I have some weaknesses which are going to make Perfalytics a real struggle:

I procrastinate a bit too much. It’s all too common that I have some tasks I want to get done at the start of the day, but wind up pushing them off to the next day. To avoid this problem with Perfalytics, I’m going to set aside one dedicated hour each day, right after dinner, to work on it. With that time blocked off, it’s much harder for me to procrastinate.

I have a poor sense of design. Web design is a muscle I’ve never really exercised. Whenever I try to make a UI pretty, it always seems “off”. Although I can recognize something is wrong, I’m never able to put my finger on how to make it better. While building Perfalytics, I’m going to continuously work on improving my design skills. I’m going to produce a ton of designs, seek feedback on them from my coworkers, and reflect on how I can do better.

I have a hard time talking to strangers. A large part of getting a business off the ground is talking to people. You need to pitch your product to people you barely know. You also need to get feedback from your users to understand how you can make your product better.

Like web design, talking to strangers is not something I’m used to doing. In order for Perfalytics to succeed, I’m going to have to face my fear of talking to strangers. I plan on fighting my fear by being rigorous in my approach. As a concrete example, I’m going to set specific daily targets around the number of people I talk to. By setting small and specific goals I’m sure I’ll be able to conquer my fear.

At the same time I have a number of skills that are going to be helpful. I want to take as much advantage of these skills as I can. I also plan on using these skills to help cover for my weaknesses.

I aggressively seek to improve. I really like improving my skills, and I always seek feedback on my work from those around me. In my opinion, constantly focusing on improvement is a “cheat code” for life. This mindset is definitely going to be useful for Perfalytics: the first few months are going to be spent almost entirely on iterating and improving the business.

I am methodical. To keep my life organized, I follow a specific list of habits at the start and end of each day. For example, I’ve meditated for 45 minutes a day, every day, for the last six months. I plan to use this methodical approach to plan out Perfalytics and ensure I’m making progress towards my goals.

I have deep Postgres knowledge. My day job for the past three years has been around optimizing a 1PB Postgres cluster. Along the way, I’ve learned a ton about Postgres. To give you an idea, I’ve written over 30 blog posts about Postgres. I’m going to leverage my Postgres knowledge to make sure that Perfalytics, a tool for optimizing Postgres performance, is great at what it does. Having a solid understanding of Postgres is what enabled me to see the possibility of a tool like Perfalytics in the first place.

Attempting a Bootstrapped Business

It’s a few days early for a New Year’s resolution. Even so, my New Year’s resolution for 2019 is to seriously attempt to start a bootstrapped business. A few weeks ago, I announced Perfalytics to the world. Perfalytics is a neat idea I’ve been throwing around for a while.

The core idea behind Perfalytics is to make it easy for a database administrator to optimize their database. I based Perfalytics on a tool I built during my day job. I found the tool useful and I figure there is a good chance others will too.

I have a soft goal for Perfalytics. By the end of 2019, I aim to be making $100,000 in annual recurring revenue.

When I think about that goal, I get simultaneous feelings of dread and excitement. $100,000 sounds like way too much. How could I ever get there? At the same time, it also seems oddly within reach. Doing the math, if I were to charge users of Perfalytics $99/month, I only need 100,000/99/12 ≈ 85 users by the end of the year. To phrase it another way, if I’m able to add seven users to Perfalytics each month, I’ll hit my goal.
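
The arithmetic is simple enough to check in a few lines of code (a throwaway sketch; `usersForArr` is just a name I made up):

```java
public class RevenueMath {
    // Price per user per month, in dollars.
    static final double MONTHLY_PRICE = 99.0;

    // Paying users needed to reach a given annual recurring revenue.
    static int usersForArr(double arr) {
        return (int) Math.ceil(arr / (MONTHLY_PRICE * 12));
    }

    public static void main(String[] args) {
        System.out.println(usersForArr(100_000)); // prints 85
    }
}
```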

The first step to achieving that goal is to get people actually using Perfalytics. Right now I’m close to completing the MVP and should finish it within the next few days. Once I finish the MVP, I’m going to reach out to a few people that signed up for the free beta and get them started using it. I’m excited to see what the initial reaction to Perfalytics is going to be.

Announcing Perfalytics: Never Run EXPLAIN ANALYZE Again

Today I’m announcing the beta for Perfalytics. Perfalytics is a tool I’ve been working on designed to make it easy to analyze query performance for Postgres. As the lead of the Database team at Heap, where I optimize the performance of a Postgres cluster with more than 1PB of data, I’ve done my fair share of Postgres performance analysis. I believe the standard way of debugging a slow Postgres query is tedious and cumbersome.

Debugging a slow Postgres query usually goes something like this. Given the slow query, you run EXPLAIN ANALYZE on it. This reruns the query with performance instrumentation enabled, giving you the query plan and how much time was spent processing each part of that plan.

If you’re lucky, the slow query is reproducible and you will see why the query was slow. In my experience, this is frequently not the case. All too often, by the time you run the query with EXPLAIN ANALYZE, something has changed and the query is now fast. This can happen for a number of reasons, including caching, the makeup of the table changing, decreased server load, and many other factors.

The idea behind Perfalytics is to turn this way of doing things on its head. Instead of running EXPLAIN ANALYZE after the fact to try to reproduce the slow query, why not gather the exact same information at query time? That way, if a query is slow, you can see exactly why that particular run was slow.

That’s right. Perfalytics records the query plan and various query statistics for every single query at query time. This means you have all the information you need to debug why a particular query was slow. There’s no need to try to reproduce the problem after the fact.

In addition to making it easier to debug slow queries, collecting this information has a number of other advantages. Because you are collecting it for every single query, you can look at overall trends. For example, for each slow query, you can programmatically classify exactly why it was slow. This lets you see how often different kinds of slow queries come up and tells you exactly where to focus your efforts.
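
As an illustration of what that classification might look like (purely hypothetical; the field names and thresholds here are made up, not an actual Perfalytics feature):

```java
public class SlowQueryClassifier {
    // Given per-query stats (all hypothetical fields), guess the
    // dominant reason a slow query was slow.
    static String classify(long totalMs, long ioMs, long lockWaitMs, long rowsScanned) {
        if (lockWaitMs > totalMs / 2) return "lock contention";
        if (ioMs > totalMs / 2) return "disk I/O";
        if (rowsScanned > 1_000_000) return "large scan";
        return "unknown";
    }
}
```

Run a rule like this over every recorded slow query and you get a histogram of root causes for free.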

Once you try using something like Perfalytics, you’ll never want to go back to the old way of doing things.

If you want to try out Perfalytics, sign up for a beta invite.

If you have any questions, feel free to contact me at

Recording Footage of Myself Investigating a Performance Issue

Of the exercises I do as part of Getting Dramatically Better as a Programmer, my favorite is this one: every week I record my screen and later review the footage. I’ve found this to be a great way to understand where I’m wasting time and how I could solve the same problem more efficiently.

In this write up I’m going to go over one segment of footage of myself. In particular I’m going to share:

  • The problem I was working on.
  • A breakdown of where I spent my time when working on the problem.
  • General techniques I’ve come up with that would have reduced the amount of time it took to solve the problem.

Since I reviewed myself solving a problem at work, I’m unable to share the raw footage. I am able to share my notes on what I was doing at each point in time, which you can see here.

This write up serves two purposes: it forces me to explain what I learned, and it lets others see the lessons I took away.

The Problem

The footage I reviewed was of myself doing an investigation at work. For several minutes, our p95 query time went up by an order of magnitude. I’m usually the person responsible for investigating problems like this and coming up with a solution to prevent them in the future.

To give you some background, I work for Heap. Heap is a tool that people can use to run analytics on their website. We give customers a piece of JavaScript to put on their website, which allows us to collect data from it. Our customers can then run queries over their data. They can ask questions such as “how many people visited my website last week” or “of the people who signed up, how many went through the tutorial”.

It’s these kinds of queries that were slow for a brief period of time.

Since performance is so important to Heap, we’ve built a plethora of tools for analyzing performance. One of the most powerful is a tool called “Perf Trace”. Perf Trace gathers information about every query run in our database. It records the query plan used to execute the query, as well as how much time was spent in each stage of the query. It then writes this data to a SQL database so it’s possible to query the data with SQL. If you want to read more about Perf Trace, I wrote a blog post for Heap that covers it in depth.

One of the most common uses of Perf Trace is debugging issues just like this one.

Where the Time Went

In total, it took 25 minutes to determine the root cause and write it up. The root cause was that a group of computationally expensive queries ran at the same time.

To be more specific, there is a certain type of query Heap supports that consumes a lot of resources on our master database. Every query we run goes through our master database. If our master database becomes overloaded with queries, the amount of time the master DB spends processing queries will go up.

I determined this was the root cause through two symptoms. According to Perf Trace, the increase in query time came from queries spending more time in our master database. I also looked at the CPU usage of our master database and noticed a large spike in CPU usage around the time of the incident.

In finding the root cause, most of my time fell into one of the following three categories:

  • 9 minutes – Looking at various pieces of data. This includes looking at both Perf Trace data and looking at various dashboards.
  • 8 minutes – Writing or debugging SQL queries. The main SQL query I wrote was for looking at the Perf Trace data for queries around the time of the incident.
  • 5 minutes – Writing up the results in a document.


Based on the footage, I have two main takeaways:

  • When looking at data, I should have a clear sense of what I’m looking for.
  • I should have a repository of common SQL snippets I use.

For the first point, I noticed I would spend a lot of time scrolling through the data without looking for anything in particular. That mindless scrolling wasted a lot of time and didn’t turn up anything interesting. If I had gone in with a clear sense of what I was looking for, I could have cut down the time I spent digging through the data.

As for the second point, I spent a lot of time getting the SQL query I wrote to work. The query I wrote for analyzing the Perf Trace data is similar to many queries I’ve written in the past, yet I still spent a fair amount of time writing and debugging it. If I kept a repository of common SQL snippets, I could have started from a snippet that was already written and already debugged.

Tool Write Up – Hadoop

One exercise I do every week as part of Getting Dramatically Better as a Programmer is learning a new tool. This week, I took a look at Hadoop. I’m going to walk through what I learned and mention a few interesting items I got out of it. This both forces me to explain what I learned and lets others see what’s so cool about Hadoop. Most of what I learned came from the Hadoop docs.


Hadoop is a framework designed for dealing with large amounts of data. There are two main pieces to Hadoop: HDFS, the Hadoop distributed filesystem, and Hadoop MapReduce.

HDFS is a distributed file system based on the Google File System. It looks like a regular old file system, but stores the data across many different servers. HDFS handles problems like coordinating what data is stored where and replicating the files. HDFS is also able to handle files much larger than an ordinary filesystem can deal with. HDFS is the part of Hadoop that handles the raw data storage.

The other part of Hadoop is Hadoop MapReduce. Like HDFS, Hadoop MapReduce is based on a tool created by Google. As the name suggests, it is based on the MapReduce framework. Hadoop MapReduce is the part of Hadoop that handles processing data. It takes care of moving data between many different servers and performing aggregations over the data.

Together, the two parts of Hadoop solve one of the hardest problems that come up when processing large amounts of data: running code across many different servers. Hadoop takes care of this by:

  • Breaking down the input data into many different pieces spread across many different servers.
  • Handling the scheduling of each part of the data processing job across many different servers.
  • Rescheduling the necessary parts of the job if a machine fails.
  • Aggregating the data from each part of the job together.

The rest of this post will mainly discuss the data processing portion of Hadoop, Hadoop MapReduce.

Outline of Hadoop MapReduce

By design, writing a job to process large amounts of data using Hadoop MapReduce is straightforward. The framework takes care of all the hard parts. As the programmer, you only have to specify two things: a “map” function and a “reduce” function. That’s all there is to it!

The core idea of MapReduce is that the map function takes a bit of raw input and processes it. The reduce function then takes the output from multiple map calls and combines them together. Any calculation that can be specified in terms of these two functions is trivial to turn into a distributed and fault tolerant job with Hadoop.

To be more specific, a Hadoop job takes an HDFS file as input. Hadoop splits the HDFS file into many different chunks (this is often already done by HDFS). Hadoop then calls the map function on each chunk. The map function processes each chunk and returns an intermediate result, in the form of a list of key-value pairs.

For each unique key, Hadoop gathers the list of all pairs with that key together. Hadoop then passes the list of values associated with that key to the reduce function. The reduce function combines the list of values into a single value. Hadoop will then write the result of the reduce function for each key to an output file.
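
The flow described above can be sketched as a toy, in-memory word count (no Hadoop involved; the structure and names here are mine, just to show the shape of the computation):

```java
import java.util.*;

public class MiniMapReduce {
    // Map phase: emit a (word, 1) pair for every word in a chunk.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.split("\\s+"))
            pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle: group all values with the same key together.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapped) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (List<Map.Entry<String, Integer>> chunkOutput : mapped)
            for (Map.Entry<String, Integer> pair : chunkOutput)
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        return grouped;
    }

    // Reduce phase: collapse each key's list of values into one sum.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) sum += value;
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> chunks) {
        List<List<Map.Entry<String, Integer>>> mapped = new ArrayList<>();
        for (String chunk : chunks)
            mapped.add(map(chunk));  // each chunk is mapped independently
        return reduce(shuffle(mapped));
    }
}
```

Each call to `map` could run on a different machine; the shuffle is the only step that needs to move data between machines, and that is exactly the part Hadoop automates for you.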

Many different types of calculations can be expressed in terms of map and reduce and that’s exactly what makes Hadoop MapReduce so powerful.

Before we dig into what’s going on under the hood, let’s take a look at an example MapReduce application.

Example: Word Count

The code for this section is taken from this tutorial from the Hadoop docs.

The classic example of a MapReduce job is the word count problem. We want to count the number of times each word occurs in a file. Doing this when the file is on the order of gigabytes is no problem. When the file gets on the order of terabytes or even petabytes, you are going to have to bring out the more powerful tools like Hadoop.

To write the Hadoop job, we need to write a map function and a reduce function. As mentioned above, the map function takes a chunk of raw data and produces an intermediate result for that chunk. For the word count problem, we can write a map function that takes the raw text and produces a list of key-value pairs of the form (<word>, 1) for each word in the text.

To write the function, we need to write a class that extends the Mapper class. Then we override the map function with the definition we want. This looks like the following:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

The map function takes the chunk of input, splits the text into individual words, and outputs the key-value pair (<word>, 1) for each word. There is some MapReduce-specific code, but the example should still be easy to understand.

Writing the reduce function is just as easy. We want the number of times each word appears in the file. Since the key of each pair produced by the map function is the word itself, the reduce function is passed a list of ones for each unique word, with one entry in the list for each time the word appears.

The reduce function simply needs to take the list of values and count how many values there are. In order to allow for some optimizations by Hadoop, we will instead sum the list of values. Since each value in the list is a one, this winds up giving the same result as counting the number of values. We’ll get into what these optimizations are when we look at what’s going on under the hood. Here’s the code:

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

It simply iterates through the list of values and sums them.

Now that we have the map and reduce functions, we only need to write a small bit of code that initializes a job that uses them:

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Most of these lines are boilerplate for initializing and configuring the job; the important part is the last line, which kicks off the job and waits for it to complete.

The above job can be run across as many machines as you want. That makes it easy to scale the job up to much larger amounts of data than you could ordinarily process without Hadoop.

Hadoop Under the Hood

In order to run a Hadoop job, Hadoop uses two kinds of processes. It has mappers, which are responsible for running the map function, and it has reducers, which are responsible for running the reduce function. Hadoop runs these processes across the different servers in the Hadoop cluster.

When running a Hadoop job, Hadoop first starts many different mapper processes. Hadoop sends the raw chunks of the file being processed to the mapper processes. The mapper processes then run the map function you specified. In order to minimize the amount of data sent over the network to the mappers, Hadoop will try to run the mapper functions on the same machines as where HDFS stores the file.

Once a mapper process finishes generating its output, it sends the output to the reducers. The mapper spreads its output across all the different reducers. To make sure all values with the same key are sent to the same reducer, Hadoop decides which reducer to send a particular key-value pair to based on the key. The process of mappers sending different chunks of their data to the reducers is known as the shuffle step.
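
The key-to-reducer assignment is typically just a hash of the key modulo the number of reducers. This sketch mirrors what Hadoop’s default hash partitioner does (simplified; not the actual Hadoop class):

```java
public class PartitionSketch {
    // Pick a reducer for a key. Every pair with the same key maps to
    // the same reducer, which is what makes the reduce phase correct.
    static int reducerFor(Object key, int numReducers) {
        // Mask off the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because the choice depends only on the key, two mappers on different machines will independently send pairs with the same key to the same reducer.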

As an optimization, the mapper will use the “combiner” function to combine values with the same key before they are sent to the reducer. The combiner is an optional function you can provide for exactly this purpose. Before sending its output to the reducers, the mapper passes all values with the same key to the combiner function and sends the combiner’s output instead. This reduces the total amount of data sent over the network. Often the combiner function can be the same as the reduce function.

This is why the reducer we defined in the word count problem takes the sum of the values instead of the count. The reducer is really summing the values from each mapper. Each mapper has already passed its key-value pairs through the local combiner, which counted how many times each word occurred and handed that count off to the reducer.
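
A sketch of what the combiner buys you for word count (my own illustration, not Hadoop code): each mapper collapses its local (word, 1) pairs into (word, count) pairs, so the number of pairs crossing the network drops from one per word occurrence to one per distinct word.

```java
import java.util.*;

public class CombinerSketch {
    // Locally sum a mapper's (word, 1) pairs before the shuffle.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> local = new HashMap<>();
        for (String word : words)
            local.merge(word, 1, Integer::sum);  // (word, 1) pairs become (word, count)
        return local;
    }
}
```

Because addition is associative, summing these pre-combined counts in the reducer gives the same answer as counting raw ones. That is exactly why the reducer sums rather than counts.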

As each reducer completes, they write their data to the output HDFS file. And that’s how a Hadoop MapReduce job works.


To recap:

  • HDFS is a distributed file system that makes it easy to store lots of data.
  • Hadoop MapReduce is a framework that makes it easy to write distributed and fault tolerant jobs.

I find MapReduce pretty interesting. I do find the boilerplate a bit off-putting, but focusing on that would be missing the forest for the trees: MapReduce saves way more code than the boilerplate costs. I find it impressive how great of an abstraction MapReduce is. By dramatically limiting the programming model, it makes it trivial to write scalable software.

Book Write Up – Debugging Chapters 5 and 6

As part of my Getting Dramatically Better as a Programmer series, I’m going to write up a few chapters of the book I’m currently reading, along with a few takeaways. I’m currently reading Debugging by David Agans. The book covers Agans’ “nine golden rules of debugging”. Each rule is a general guideline that makes debugging easier, and each chapter in the book covers one of the nine rules.

Today, I’m going to go through chapters 5 and 6. Chapter 5 covers the rule “Quit Thinking and Look” and chapter 6 covers the rule “Divide and Conquer”.

Chapter 5: Quit Thinking and Look

There are two main points Agans makes in this chapter:

  • If you try to guess where the bug is, you will likely guess wrong. Instead of guessing where the bug is, you should look until you can “see” the bug.
  • You should add lots of instrumentation throughout your code to make it easier to see the bug.

Throughout my time as a programmer, I’ve been amazed at how many times people, including myself, incorrectly guess where a bug is. One of my favorite examples is when one of my friends was making a simple 2D game. The game wound up being incredibly slow. My friend thought the slow part must have been the collision detection. He was beginning to think about how to optimize the collision detection when I suggested that he hook up a profiler and actually see where the problem was.

My friend reluctantly hooked up a profiler to his program and found, lo and behold, that collision detection was not the problem. It turned out the program was reloading the image files for the sprites on every frame. After he changed the program so it only loaded the sprites once, the game ran a lot more smoothly.

My friend had no evidence that the problem was what he thought it was. He was relying only on his intuition. Once he hooked up a profiler, he was able to see exactly where the real problem was.

If my friend had gone through and optimized the collision detection code, he would have wasted that time only to discover the issue was somewhere else. This is why it’s important to gather evidence that the bug is where you think it is before you attempt to fix it. The phrase “measure twice, cut once” comes to mind. You should be able to “see” the bug before you start to fix it.

This is one reason why profilers are so great. They let you see exactly where the slow part of your program is. There’s no need to guess at which part of your program is slow when you have access to a profiler.

This is also why you should add instrumentation throughout your program. The more data you collect about your program, the easier it is to see the bugs inside it.

One of the most useful pieces of code I’ve ever written was a tool I wrote for work called “perf trace”. For my day job, I’m responsible for optimizing the performance of a 1PB distributed database. For every query that runs in our database, perf trace gathers information about the different stages of the query and how long each stage took. This data gives us a clear sense of which queries are slow and for what reasons. Perf trace makes it easy to debug slow queries because we can see exactly how much time each query spent in each stage of query execution.

We also use perf trace to understand the common causes of slow queries. This lets us determine which optimizations will improve query performance the most. If you want to read more about perf trace, I wrote a blog post about it that’s on my company’s blog.

Chapter 6: Divide and Conquer

The main point of chapter 6 is that “divide and conquer” is one of the most efficient ways to locate a bug. This rule goes hand in hand with “Quit Thinking and Look”: quit thinking and look tells you to keep looking until you can see the bug, while divide and conquer is the method you should use to locate it.

The idea behind divide and conquer is simple. Try to find the general region where the bug is. Then repeatedly narrow down the location of the bug into smaller and smaller regions until you can see it. This is the same idea behind binary search, only applied to debugging.
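
The binary search analogy can be made literal. Suppose every version of a system before some point is good and every version after is bad (the `firstBad` framing here is my own illustration; it’s the same idea as git bisect):

```java
import java.util.function.IntPredicate;

public class Bisect {
    // Find the first "bad" index in [lo, hi), assuming all good
    // indices come before all bad ones.
    static int firstBad(int lo, int hi, IntPredicate isBad) {
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (isBad.test(mid)) hi = mid;  // bug is at mid or earlier
            else lo = mid + 1;              // bug is strictly after mid
        }
        return lo;
    }
}
```

Each check halves the remaining range, so narrowing down 1,000 versions takes about 10 checks instead of up to 1,000.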
At my job, a common issue that comes up is that query times go up for some period of time. When that happens, I have to investigate the issue and figure out exactly why. I usually use the perf trace tool I mentioned above to aid in debugging the problem.

When I look into why queries were slow, the first thing I do is try to figure out which area of the stack the increase in query times came from. Did it come from the web server? Did it come from the master database? Did it come from one of the worker databases?

Once I narrow down the query time increase to a specific area of our stack, I try to figure out the high level reason why that part of the stack got slower. Did we max out CPU on a machine for some reason? Did we exceed the maximum number of queries we can run on a single server at a time?

Then, once I determine the high level reason query times went up, I can start looking for the root cause. One issue that came up before was a job that didn’t have a limit on the number of queries it ran in parallel. When the job ran, it issued many more queries than we normally process. Processing so many queries in parallel caused a spike in CPU usage on our master database, which in turn made every query slower during the master database portion of execution.

Divide and conquer is a general and efficient way to find the problem when dealing with issues like this. By applying divide and conquer during investigations, I’m able to make consistent forward progress until I find the issue.

As an aside, one of the least effective ways to debug is to look through the code line by line until you find the bug. I’ve done this myself in the past and I’ve seen a lot of other people do it too. This approach is bad for several reasons. For one, it’s inefficient: it’s a linear search rather than a binary search, because you have to read a large chunk of code before you have any chance of finding the bug. It’s also likely that the bug is non-obvious. Whoever wrote the bug probably didn’t see it, so it’s unlikely you’ll spot it just by reading the code. By reproducing the bug and following divide and conquer, you can find it much more quickly than by reading all the relevant code line by line.

Of course, it’s important to have a good understanding of the code base and how it works. That understanding will help when debugging the system. Even so, walking through the code line by line is an inefficient way to debug.


To recap, the two rules covered are:

  • Quit Thinking and Look – Instead of guessing where the bug is, you should produce evidence that the bug is where you think it is
  • Divide and Conquer – Try to find the general region in your code where the bug is. Then narrow down the region it can be in until you are looking at the bug.
I find both of these to be solid pieces of advice. I’m definitely going to keep them in mind the next time I’m debugging a problem.

Paper Write Up – A Critique of the CAP Theorem

This is my first post under the Getting Dramatically Better as a Programmer series. Today we are going to look at the paper “A Critique of the CAP Theorem“. The paper was written by Martin Kleppmann, who also wrote the book Designing Data-Intensive Applications. I’m going to go through the paper, explain its main points, and summarize my takeaways. Writing this up forces me to explain the paper to others, and it should also help others learn about the paper.


The CAP theorem is a well-known theorem about distributed systems. As the name suggests, “A Critique of the CAP Theorem” points out several issues with the CAP theorem. In particular, Kleppmann looks at two different issues:

  • The terms used in the theorem have unclear definitions and can be interpreted in many different ways.
  • Some assumptions made in the proof of the CAP theorem do not reflect reality.

Kleppmann then proposes a new way of modeling distributed systems that he calls “delay-sensitivity”, which he designed to address these problems with the CAP theorem.

Background: The CAP Theorem

Before we dig into the paper, let’s first look at the CAP theorem and the history surrounding it. The CAP theorem was originally an informal claim about distributed systems made by Eric Brewer, now known as “Brewer’s Conjecture”. Brewer’s Conjecture stated that a distributed system can have at most two of: consistency, availability, and partition-tolerance (CAP). When making his claim, Brewer gave only rough definitions for each of these terms.

A more intuitive way to interpret Brewer’s Conjecture is the following: in a distributed system, if some servers cannot access each other, the distributed system will either be unable to process some requests (lack of availability) or the distributed system will not behave like a single server (lack of consistency).

To give a specific example, let’s say you want to build a website like Google Docs. You want multiple people to be able to collaborate on a document together. They should all be able to make edits to the document and everyone should see each other’s edits in real-time. You can consider the computers of the users of the website and the servers of the website itself to be a distributed system.

Brewer’s Conjecture states that when a user loses connection to the website (a network partition occurs) you have to make a choice. One option is you don’t allow the user to edit the document until they reconnect. This is choosing consistency over availability. By doing this, you guarantee that when a user views the document, they see the same document everyone else is seeing (consistency). At the same time, they won’t be able to view the document when they are unable to access the website (lack of availability).

The other option you have is to provide some sort of offline mode. If you add offline functionality, you are choosing availability over consistency. Users will be able to continue editing the document locally even if they can’t connect to the website (availability), but they won’t necessarily see the most recent edits made by other users (lack of consistency).

Brewer’s Conjecture states that it’s impossible to provide both availability and consistency when a user cannot connect to the website.

Brewer’s Conjecture is important because many systems want to provide both availability and consistency when there is a network partition. In practice, any system can only provide one of them during a partition. Ultimately whether you choose availability or consistency is a design decision.

Soon after Brewer made his conjecture, Seth Gilbert and Nancy Lynch formalized it in “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services“. Gilbert and Lynch also gave a proof of the formal version of the conjecture. Since Gilbert and Lynch proved the formal version of Brewer’s Conjecture, the formal version became a theorem. That theorem is now known as “The CAP Theorem”.

Now that we’ve covered the history of the CAP theorem, let’s dig into Kleppmann’s critique of it.

Ambiguity of the Terms

One of the main issues raised by Kleppmann is that the definitions of the terms “availability” and “consistency” are ambiguous in Brewer’s Conjecture.

There are many different ways you can interpret “availability”. One way to interpret it is about how often a production system returns a successful response. For example, you can measure this kind of availability by looking at what percentage of requests are successful. An alternative way to define availability is as a property of the algorithm a system is running. If the algorithm guarantees successful responses in all cases, the algorithm can be referred to as “available”.

There is a related question of how to deal with slow responses. One option is to define a system as available if it returns a response eventually. Another way is to define a system as available if it returns a response in under a given time limit.

The term “consistency” is also overloaded with many different meanings. There’s consistency in the ACID sense, which is completely different from the consistency Brewer was referring to. In the context of distributed systems, people often refer to “consistency models”. A specific consistency model provides guarantees about the behavior of the distributed system. Some common consistency models are sequential consistency, causal consistency, and linearizability. Each of these models provides a different set of guarantees.

To give you a rough idea of the different consistency models, sequential consistency requires that executing several operations in a distributed system is the same as executing those operations in some order, one at a time. Linearizability is the same as sequential consistency except it also requires that if operation A started after operation B completed, the effects of operation A must come after the effects of operation B.

The term “strong consistency” also comes up often. Even though it does come up often, it has no specific formal definition. “Strong consistency” can refer to any of the consistency models above.

Differences Between Theory and Practice

After looking at the issues around the terms used in the CAP theorem, Kleppmann goes on to look at the formal version: the CAP theorem as proved by Gilbert and Lynch. In their formalization, Gilbert and Lynch define “availability”, “consistency”, and “partition-tolerance” as follows:

  • Availability – Every request to the service must terminate.
  • Consistency – The system provides linearizability.
  • Partition-tolerance – The system still needs to work even when some nodes in the system are unable to communicate with other nodes in the system.

Kleppmann points out that these definitions and some assumptions made by Gilbert and Lynch do not model reality.

There’s a big difference in the use of “availability” in the CAP theorem and the use of “availability” in practice. In the CAP theorem, availability means the system only needs to return a response eventually. That means the system could return a response in a few seconds, several minutes, or even several days. As long as the system eventually returns a response, it is considered to be available by the CAP theorem. In practice, you usually set a timeout so that if a system takes too long to respond, the system is considered to be unavailable. If a website takes more than a few seconds to respond, it doesn’t matter that the system is available in the CAP sense. For all practical purposes, if no one can get a response from the website in a reasonable amount of time, the website is unavailable.
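To make the distinction concrete, here’s a small shell sketch of the practical definition: pick a deadline and treat anything slower as unavailable. The one-second deadline is arbitrary, and the sleep 3 stands in for a slow backend.

```shell
# A service that answers after an hour is still "available" in the CAP
# sense. In practice we impose a deadline: `timeout` kills the command
# if it runs longer than 1 second.
if timeout 1 sleep 3; then
  status="available"
else
  status="unavailable"   # took longer than the 1s deadline
fi
echo "$status"   # prints: unavailable
```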

Another big difference is the CAP theorem only considers one kind of failure: a partition. In practice there are many other failures that can occur. Individual servers can crash. Individual network links can go down. Data on servers can even become corrupted. This makes the CAP theorem useful only if the sole kind of failure you are worried about is a partition. If you want to reason about other kinds of failures in your distributed system, it’s difficult to make use of the CAP theorem.


Delay-Sensitivity

To deal with the issues of the CAP theorem, Kleppmann gives a rough draft of a model he calls “delay-sensitivity”. In delay-sensitivity, you classify every operation by whether it requires time proportional to the network latency. The advantage of delay-sensitivity is it gives you an idea of how long each operation will take, as opposed to the CAP theorem, which creates a strict dichotomy between available and unavailable.

For a linearizable system, it’s possible to prove that all operations will take time proportional to the network latency. When there is a network partition, the network latency is unbounded. That means when a partition occurs in a linearizable system, operations will take an unbounded amount of time and the system will become unavailable. Such a system is consistent, but not available, when there is a network partition.

On the other hand, it’s possible to build a non-linearizable system in which all operations complete independently of the network latency. This means even when there is a partition and the network latency is unbounded, operations will still complete. Such a system is available during a network partition, but it lacks consistency since it does not provide linearizability.


  • Many of the terms used by the CAP theorem are ambiguous. “Availability”, “Consistency”, and “Partition-tolerance” can all be interpreted in many different ways.
  • The CAP theorem makes many assumptions that do not reflect reality. The CAP theorem only looks at partitions as the only possible fault that can come up. It also  considers a service to be available as long as the service is able to eventually return a response. It doesn’t matter how long the response will take.

My Opinion

While the paper was an interesting read, I do have some issues with it. The paper is titled “A Critique of the CAP Theorem”, yet while part of it discusses how the CAP theorem differs from reality, a good amount of it focuses on the ambiguity of the theorem’s terms. The ambiguity isn’t so much a problem with the CAP theorem itself: when they define the CAP theorem, Gilbert and Lynch give formal definitions for all the terms they use. The issues raised are more an issue with the colloquial version of the CAP theorem.

The issues raised about the actual CAP theorem are a bit more interesting, especially the point that the CAP theorem has no regard for how long operations actually take. I do like this aspect of delay-sensitivity, since delay-sensitivity roughly measures how long each operation takes even when there isn’t a network partition. On the other hand, I find it difficult to reason about the exact implications of an operation taking time proportional to the network latency. I find it much easier to reason about the CAP theorem because it makes a clear distinction between a system being available or unavailable.

My Approach to Getting Dramatically Better as a Programmer

There was a recent discussion among my social group about what “getting dramatically better as a programmer” means. Based on that discussion, I’ve decided to share my own approach to becoming a “dramatically better programmer”. I want others to understand what practices I’ve found useful, so they can incorporate them into their own life.

My approach to getting dramatically better is built around a training regime. There are a specific set of “exercises” I do every week. I designed the training regime with two explicit goals in mind:

  • Learning how to solve problems I didn’t know how to solve before.
  • Learning how to write correct programs faster.

My training routine consists of a total of four different exercises. Each one helps me achieve the two objectives above. The four different exercises are:

  • Reading a paper.
  • Learning a new tool.
  • Reading several chapters of a book.
  • Recording my screen as I write a program. Then reviewing the footage and seeing how I could have written the program faster.

Let me explain each of the exercises in a bit more depth. I want to share some details on how each exercise works and the main benefits I’ve had from doing each exercise.

Reading a Paper

This exercise is designed to expand my knowledge of Computer Science. I’ve found two direct benefits to reading papers. The first benefit is that some papers have changed my model of certain problems. A good example of this is The Tail at Scale. That paper examines the counterintuitive nature of long tail latency.

One interesting lesson I learned from the paper was how fanning a request out to multiple machines affects latency. The authors looked at empirical data from a Google service. The service processes a request by distributing parts of the request across many different servers. They used the data to estimate what would happen if you distributed requests across 100 different servers. The authors found that if you measure the time it takes to get a response from all 100 servers, half of that time will be spent waiting for only the last five! This is because the slowest 5% of requests are that much slower than all the other requests. The paper also gives several approaches for reducing tail latency. I have found those approaches useful in my own work.

The other benefit I’ve found from reading papers is that they give me the knowledge to understand different systems as a whole. As an example, take Spanner, Google’s distributed database. Spanner uses many different techniques such as Paxos, two phase commit, MVCC, and predicate locks. I’ve been able to build up an understanding of these different techniques through reading papers. In turn, this enables me to reason about Spanner as a whole and about the trade-offs Spanner makes compared to other systems.

I find most papers I read by either following the references of papers I have read or following up on a paper covered in The Morning Paper. The book Designing Data-Intensive Applications also has a lot of references to papers that are worth reading.

Learning a New Tool

One of the easiest ways to solve problems is to use an existing tool that already solves that problem. For this exercise, I pick a tool and learn a bit about it. Usually I set up the tool locally, go through a few tutorials, and read a bit of the manual. Tools I’ve learned in the past range from bash utilities such as jq and sed to distributed systems such as Kafka and Zookeeper.

Learning the bash utilities helps me solve many common tasks more quickly than I could have otherwise. Simple text processing is often easier with sed than it is with a programming language. Likewise, learning about different distributed systems comes in handy for understanding which tools work well for which problems. That way I know what tool to reach for when faced with a given problem.
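As a sketch of the kind of one-off text processing sed makes quick, suppose we have a log of query timings and want just the slow ones with the unit stripped. The file timings.log and its format are made up for the example.

```shell
# Create a toy log of query names and their timings.
printf 'select_users 12ms\nupdate_orders 840ms\nselect_items 9ms\n' > timings.log

# Print only the queries that took 100ms or more, stripping the "ms" suffix:
# -n suppresses default output, the address matches 3+ digit timings, and
# s/ms$//p deletes the unit and prints the result.
sed -n '/[0-9][0-9][0-9]ms/s/ms$//p' timings.log   # prints: update_orders 840
```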

Reading Several Chapters of a Book

I use books to supplement the knowledge I don’t get from reading a paper or learning a tool. The books I read cover a wide range of topics. Books I’ve read recently include:

  • Refactoring – I found this to be a great way to understand what good code should look like and how to turn bad code into good code.
  • Getting Things Done – I found this book to be helpful for prioritizing and keeping track of things. It helped me build a system to make sure I get done what’s important for me to get done.
  • The First-Time Manager – I recently became the team coordinator for my team at work. As team coordinator, my main responsibility is communicating with other teams when necessary. I also lead my team’s meetings. I found this book to be great for getting an understanding of basic management principles.

Recording My Screen

This exercise is my favorite. It’s the exercise that’s changed how I approach problems the most. It’s common for athletes to review footage of themselves to understand how they could have done better. I decided to apply the same approach to programming. Some of the lessons I’ve learned by recording my screen include:

  • It helps to test code as it’s written. Doing this reduces the amount of time it takes to debug code by reducing the time you spend locating where the bug is. If all your existing code is bug free, the bug has to be in the new code you just wrote.
  • When debugging a problem, adding functionality only for the purpose of debugging is often well worth the cost. As an example, one toy problem I worked on was writing an LRU cache. I had a bug where it wasn’t evicting the right elements. I was able to quickly determine what was wrong by adding a function to print the state of the cache. I could then see where the expected behavior of the cache differed from the actual behavior, which allowed me to quickly locate the bug.
  • Spending five minutes deciding on an approach up front before writing any code is worth it. I’ve found two benefits to doing this. It helps me make sure my approach is correct. More importantly, it forces me to decide on a single approach. By watching a recording of myself, I found I wasted a lot of time switching my code between two different approaches. In reality, either approach would have worked fine.

All these lessons are obvious in retrospect, but I had no clue that any of these were issues until I recorded my screen and saw where I was actually spending time.

The steps I take for this exercise are:

  1. Record myself solving some problem. This can either be a problem I worked on at work or a problem from a programming challenge website such as Leetcode.
  2. Go through the recording at 10x speed and annotate what I was doing at each moment.
  3. Total up how much time I spent in high-level categories. How much time did I spend debugging some bug? How much time did I spend building some feature?
  4. Look at the categories I spent the most time in. Then dig into what actually took up that time.
  5. Come up with approaches that would have allowed me to save time. Often there are ways I could have structured my code up front that would have allowed me to write less code or find bugs earlier.

I highly recommend recording your screen. It’s one of the easiest ways to find small changes you can make to make yourself a lot more productive.

I’ve been doing this training regime over the past year. I’ve definitely noticed a huge difference. There’s a large amount of knowledge about systems and tools that I wouldn’t have developed otherwise. I’m also able to solve problems more quickly than I was able to before. I hope you reflect on these exercises and implement a few of them yourself.

Going forward, I’m going to start sharing what I find through my training regime. I’m going to start off by writing a blog post every time I do one of the exercises. I’ll share what I learned and found during the exercise. I think it would be beneficial for me to explain what I learned, as well as make a great learning resource for others.

Postgres Backups with Continuous Archiving

It’s extremely important that you have backups. Without backups, you are at serious risk of losing data: all it takes is one machine failure or one fat-fingered command.

There are several different ways to back up a Postgres database. In this post, we are going to look at backups done through an approach called “continuous archiving”. We’ll take a look at how continuous archiving works, how to set it up, and how to restore a backup from it.

One neat aspect of backups created through continuous archiving is they allow point in time recovery. Point in time recovery means you can use the backup to restore your database to any point in time you want. You are able to specify the exact moment to restore the backup to. Let’s say you accidentally ran DROP TABLE on the wrong table yesterday at 4:05pm. With point in time recovery, you can recover your database to 4:04pm, right before you dropped the table.

Let’s now take a look at how continuous archiving works. There are two pieces to continuous archiving backups: a base backup and a WAL archive. A base backup is a copy of the entire database at a specific moment in time. The WAL archive, on the other hand, is a list of all the diffs made to the database after the base backup was created. The WAL archive is made up of a series of WAL files, each of which contains the diffs made during a small period of time. Altogether, the WAL files give you every diff since the base backup was created. Not only is the WAL used here as part of continuous archiving, it is also used for recovery in case the database crashes. I’ve talked a bit about how the WAL helps when recovering from crashes here.

With a base backup and WAL archive, it’s straightforward to restore the database to any point in time after the base backup was created. You first restore the base backup. This gives you the database at the time the base backup was created. From there, you apply all of the WAL before the time you want to restore the database to. Since the WAL is a list of all the diffs made to the database, this will bring the database up to the state it was in at that specific time.

Now that we have an idea of how continuous archiving works, let’s take a look at how to set it up. Here we’ll set it up using only the core functionality Postgres provides. If you are setting up backups for your own database, you should instead look at a tool like WAL-E or WAL-G. WAL-E and WAL-G wrap the core Postgres functionality and take care of a lot of the work for you. We’ll take a look at WAL-E in my next post.

Creating a base backup is straightforward. Postgres provides a command line tool, pg_basebackup. The pg_basebackup command creates a base backup of the specified Postgres instance and stores it in the location you specify.
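As a sketch, the invocation looks something like this (the connection details and the output directory are hypothetical):

```shell
# Connect to the local server and copy the entire data directory
# into /var/backups/base, showing progress as it goes.
pg_basebackup -h localhost -U postgres -D /var/backups/base --progress
```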

Setting up the WAL archive is a bit more involved. Postgres has a configuration parameter archive_command. Every time Postgres generates a certain amount of WAL, it creates a WAL file. Postgres then runs the command in archive_command, passing the location of the WAL file as an argument. The archive_command needs to copy the WAL file and store it somewhere it can be retrieved later: in other words, in the WAL archive. The Postgres docs give the following example archive_command:

 test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f

When Postgres runs the command, it will replace %f with the name of the WAL file and %p with the full path to the WAL file. The command above first uses the test command to make sure it hasn’t already copied the specified WAL file. It then copies the WAL file from its current location to /mnt/server/archivedir, the location of the WAL archive. Every time Postgres generates a new WAL file, it will run archive_command and copy the file to /mnt/server/archivedir.
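Note that archive_command only takes effect once archiving is switched on. As a rough sketch, the relevant postgresql.conf settings look like this (the archive directory is the same hypothetical path as above):

```
# postgresql.conf
wal_level = replica      # generate enough WAL for archiving
archive_mode = on        # enable the archiver process
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
```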

Restoring a Postgres backup is pretty much the opposite of how you create it. First you need to restore the base backup. To do so, you clear out the Postgres data directory to wipe the existing Postgres database. Then you copy the files generated by pg_basebackup to the data directory. This will restore the base backup.

Then, to restore the WAL, you write a recovery.conf file. A recovery.conf file specifies a few parameters, the two most important of which are recovery_target_time and restore_command. recovery_target_time is the time you want to restore the backup to. The other parameter, restore_command, is the opposite of archive_command. Instead of copying the WAL file from Postgres to the WAL archive, restore_command copies the WAL file from the archive to the specified Postgres directory. Whenever Postgres needs a specific WAL file, it will run restore_command with the name of the file it needs. Here’s an example restore_command from the Postgres docs that is the inverse of the above archive_command:

 restore_command = 'cp /mnt/server/archivedir/%f %p'

This command copies the WAL file from /mnt/server/archivedir to the location where Postgres keeps WAL files.
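Putting the two parameters together, a minimal recovery.conf might look like this (the timestamp and archive path are placeholders):

```
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_time = '2019-01-15 16:04:00'
```

With this file in place, Postgres will replay WAL from the archive until it reaches the given timestamp.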

Once you have copied the base backup and set up the recovery.conf file, all you need to do is start Postgres. Postgres will detect that you want to restore a backup and automatically run restore_command until it has restored up to the point you specified! This will give you a copy of your database at exactly the time you requested. If you want a more detailed walkthrough of how to set up a Postgres backup, the docs have a pretty good one.

I honestly find it incredible that you can restore a Postgres backup to an arbitrary point in time. The approach discussed above is only one way to set up continuous archiving. Because it uses only the bare Postgres functionality, you still need to do a lot of work on your end to configure it how you want. In my next post, we’ll take a look at WAL-E. WAL-E wraps the Postgres functionality and makes it much easier to set up continuous archiving. It even has functionality for storing backups in the cloud storage system of your choice!