Recording Footage of Myself Investigating a Performance Issue

Of the exercises I do as part of Getting Dramatically Better as a Programmer, my favorite is this one: every week I record my screen and later review the footage. I’ve found this to be a great way to understand where I’m wasting time and how I could solve the same problem more efficiently.

In this write up I’m going to go over one segment of footage of myself. In particular I’m going to share:

  • The problem I was working on.
  • A breakdown of where I spent my time when working on the problem.
  • General techniques I’ve come up with that would have reduced the amount of time it took to solve the problem.

Since I reviewed myself working on a problem at work, I’m unable to share the raw footage. I am able to share my notes on what I was doing at each point in time, which you can see here.

This write up serves two purposes: it forces me to explain what I learned, and it lets others see the lessons I took away.

The Problem

The footage I reviewed was of myself doing an investigation at work. For several minutes, our p95 query latency went up by an order of magnitude. I’m usually the person responsible for investigating issues like this and coming up with a way to prevent them in the future.

To give you some background, I work for Heap. Heap is a tool people use to run analytics on their websites. We give customers a piece of JavaScript to put on their site, which lets us collect data about how visitors use it. Our customers can then run queries over that data. They can ask questions such as “how many people visited my website last week” or “of the people who signed up, how many went through the tutorial”.

It’s these kinds of queries that were slow for a brief period of time.

Since performance is so important to Heap, we’ve built a plethora of tools for analyzing performance. One of the most powerful is a tool called “Perf Trace”. Perf Trace gathers information about every query run in our database. It records the query plan used to execute the query, as well as how much time was spent in each stage of the query. It then writes this data to a SQL database so the data can itself be queried with SQL. If you want to read more about Perf Trace, I wrote a blog post for Heap that covers it in depth.

One of the most common uses of Perf Trace is debugging issues just like this one.

Where the Time Went

In total, it took 25 minutes to determine the root cause and write it up. The root cause was that a group of computationally expensive queries all ran at the same time.

To be more specific, there is a particular type of query Heap supports that consumes a lot of resources on our master database. Every query we run goes through the master database, so if it becomes overloaded with queries, the amount of time it spends processing each query goes up.

I determined this was the root cause through two symptoms. According to Perf Trace, the increase in query time came from queries spending more time in our master database. I also looked at the CPU usage of the master database and noticed a large spike around the time of the incident.

In finding the root cause, most of my time was spent in one of the following three categories:

  • 9 minutes – Looking at various pieces of data, both Perf Trace data and various dashboards.
  • 8 minutes – Writing or debugging SQL queries. The main SQL query I wrote was for looking at the Perf Trace data for queries around the time of the incident.
  • 5 minutes – Writing up the results in a document.

Takeaways

Based on the footage, I have two main takeaways:

  • When looking at data, I should have a clear sense of what I’m looking for.
  • I should have a repository of common SQL snippets I use.

For the first point, I noticed that I would spend a lot of time scrolling through the data without looking for anything in particular. That mindless scrolling wasted a lot of time, and I didn’t find anything particularly interesting from it. If I had gone in with a clear sense of what I was looking for, I could have cut down the time I spent digging through the data.

As for the second point, I spent a lot of time getting the SQL query I wrote for analyzing the Perf Trace data to work, even though it is similar to many queries I’ve written in the past. If I kept a repository of common SQL snippets, I wouldn’t have needed to spend as much time writing and debugging the query, because the repository would already contain a snippet that worked.
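
To make the idea concrete, here’s a minimal sketch of what such a repository could look like: a small Java class of named, parameterized query templates. The class name, table name, and column names are all hypothetical; the point is only that a debugged, parameterized template is a faster starting point than a blank editor.

public final class PerfTraceSnippets {
  // Hypothetical snippet: per-stage timings for all queries that ran inside a
  // time window, slowest stages first. Table and column names are made up.
  public static final String STAGE_TIMES_IN_WINDOW =
      "SELECT query_id, stage, duration_ms " +
      "FROM perf_trace " +
      "WHERE started_at BETWEEN ? AND ? " +
      "ORDER BY duration_ms DESC";

  private PerfTraceSnippets() {}
}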

Tool Write Up – Hadoop

One exercise I do every week as part of Getting Dramatically Better as a Programmer is learning a new tool. This week, I took a look at Hadoop. I’m going to walk through what I learned and mention a few interesting items I got out of it. This is both to force me to explain what I learned and to let others see what’s so cool about Hadoop. Most of what I learned came from the Hadoop docs.

Overview

Hadoop is a framework designed for dealing with large amounts of data. There are two main pieces to Hadoop: HDFS, the Hadoop filesystem, and Hadoop MapReduce.

HDFS is a distributed file system based on the Google File System. It looks like a regular old file system, but stores the data across many different servers. HDFS handles problems like coordinating what data is stored where and replicating the files. HDFS is also able to handle files much larger than an ordinary filesystem can deal with. HDFS is the part of Hadoop that handles the raw data storage.

The other part of Hadoop is Hadoop MapReduce. Like HDFS, Hadoop MapReduce is based on a tool created by Google. As the name suggests, it is based on the MapReduce framework. Hadoop MapReduce is the part of Hadoop that handles processing data. It takes care of moving data between many different servers and performing aggregations over the data.

Together, the two parts of Hadoop solve one of the hardest problems that come up when processing large amounts of data: running code across many different servers. Hadoop takes care of this by:

  • Breaking down the input data into many different pieces spread across many different servers.
  • Handling the scheduling of each part of the data processing job across many different servers.
  • Rescheduling the necessary parts of the job if a machine fails.
  • Aggregating the data from each part of the job together.

The rest of this post will mainly discuss the data processing portion of Hadoop, Hadoop MapReduce.

Outline of Hadoop MapReduce

By design, writing a job to process large amounts of data using Hadoop MapReduce is straightforward. The framework takes care of all the hard parts. As the programmer, you only have to do two things: specify a “map” function and a “reduce” function. That’s all there is to it!

The core idea of MapReduce is that the map function takes a bit of raw input and processes it. The reduce function then takes the output from multiple map calls and combines them together. Any calculation that can be specified in terms of these two functions is trivial to turn into a distributed and fault tolerant job with Hadoop.

To be more specific, a Hadoop job takes an HDFS file as input. Hadoop splits the file up into many different chunks (this is often already done by HDFS). Hadoop then calls the map function on each chunk. The map function processes the chunk and returns an intermediate result in the form of a list of key-value pairs.

For each unique key, Hadoop gathers the list of all pairs with that key together. Hadoop then passes the list of values associated with that key to the reduce function. The reduce function combines the list of values into a single value. Hadoop will then write the result of the reduce function for each key to an output file.

Many different types of calculations can be expressed in terms of map and reduce, and that’s exactly what makes Hadoop MapReduce so powerful.

Before we dig into what’s going on under the hood, let’s take a look at an example MapReduce application.

Example: Word Count

The code for this section is taken from this tutorial from the Hadoop docs.

The classic example of a MapReduce job is the word count problem. We want to count the number of times each word occurs in a file. Doing this when the file is on the order of gigabytes is no problem. When the file gets to the order of terabytes or even petabytes, you need to bring out more powerful tools like Hadoop.

To write the Hadoop job, we need to write a map function and a reduce function. As mentioned above, the map function takes a chunk of raw data and produces an intermediate result for that chunk. For the word count problem, we can write a map function that takes the raw text and produces a list of key-value pairs of the form (<word>, 1) for each word in the text.

To write the function, we need to write a class that extends the Mapper class. Then we override the map function with the definition we want. This looks like the following:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

The map function takes the chunk of input, splits the text into individual words, and outputs the key-value pair (<word>, 1) for each word. There is some MapReduce-specific code, but the example should still be easy to understand.

Writing the reduce function is just as easy. We want the number of times each word appears in the file. Since the key of each pair produced by the map function is the word itself, the reduce function is passed a list of ones for each unique word, with one entry in the list for each time the word appears.

The reduce function simply needs to take the list of values and count how many values there are. In order to allow for some optimizations by Hadoop, we will instead sum the list of values. Since each value in the list is a one, this winds up giving the same result as counting the number of values. We’ll get into what these optimizations are when we look at what’s going on under the hood. Here’s the code:

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

It simply iterates through the list of values and sums them.

Now that we have the map and reduce functions, we only need to write a small bit of code that initializes a job that uses them:

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

The important lines are the calls to setMapperClass and setReducerClass, which specify the map and reduce functions to use, and the final line, which kicks off the job and waits for it to complete. Most of the other lines are just boilerplate for initializing the job.

The above job can be run across as many machines as you want. That makes it easy to scale the job to far larger amounts of data than you could ordinarily process without Hadoop.

Hadoop Under the Hood

In order to run a Hadoop job, Hadoop uses two kinds of processes. It has mappers, which are responsible for running the map function, and it has reducers, which are responsible for running the reduce function. Hadoop runs these processes across the different servers in the Hadoop cluster.

When running a Hadoop job, Hadoop first starts many different mapper processes and sends them the raw chunks of the file being processed. The mapper processes then run the map function you specified. To minimize the amount of data sent over the network, Hadoop tries to run each mapper on the same machine where HDFS stores the chunk it will process.

Once a mapper process finishes generating its output, it sends the output to the reducers. The mapper spreads its output across all the different reducers. To make sure all values with the same key are sent to the same reducer, Hadoop decides which reducer to send a particular key-value pair to based on the key. The process of mappers sending different chunks of their data to the reducers is known as the shuffle step.

As an optimization, the mapper can use a “combiner” function to combine values locally before they are sent to the reducers. The combiner is an optional function you can specify specifically for this purpose. Before sending its output to the reducers, the mapper passes all values with the same key to the combiner and sends the combiner’s output instead. This reduces the total amount of data sent over the network. Often the combiner can be the same as the reduce function.

This is why the reducer we defined in the word count problem takes the sum of the values instead of counting them. The reducer is really summing the values from each mapper. Each mapper has already passed its key-value pairs to the local combiner, which counted how many times each word occurred and handed those counts off to the reducer.
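
To make the data flow concrete, here’s a small standalone sketch of what the combiner accomplishes. It doesn’t use the Hadoop API at all; it just simulates how a single mapper’s raw (word, 1) pairs get collapsed into per-word counts before the shuffle.

import java.util.LinkedHashMap;
import java.util.Map;

// Standalone illustration (not the Hadoop API): collapse one mapper's raw
// (word, 1) output into pre-summed pairs, the way a combiner would before
// the shuffle sends data to the reducers.
public class CombinerSketch {
  public static void main(String[] args) {
    // The raw keys a mapper would emit for the input "to be or not to be".
    String[] mapperOutput = {"to", "be", "or", "not", "to", "be"};

    // Group the mapper's local pairs by key and sum their counts.
    Map<String, Integer> combined = new LinkedHashMap<>();
    for (String word : mapperOutput) {
      combined.merge(word, 1, Integer::sum);
    }

    // Only these pre-aggregated pairs cross the network: to=2, be=2, or=1, not=1
    combined.forEach((word, count) -> System.out.println(word + "=" + count));
  }
}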

As each reducer completes, it writes its data to the output HDFS file. And that’s how a Hadoop MapReduce job works.

Takeaways

  • HDFS is a distributed file system that makes it easy to store lots of data.
  • Hadoop MapReduce is a framework that makes it easy to write distributed and fault tolerant jobs.

I find MapReduce pretty interesting. The boilerplate is a bit off-putting, but focusing on that would be missing the forest for the trees: MapReduce saves far more code than the boilerplate costs. I find it impressive how good an abstraction MapReduce is. By dramatically limiting the programming model, it makes it trivial to write scalable software.

Book Write Up – Debugging Chapters 5 and 6

As part of my Getting Dramatically Better as a Programmer Series, I’m going to write up a few chapters of the book I’m currently reading and a few takeaways I got. I’m currently reading the book Debugging by David Agans. The book covers Agans’ “nine golden rules of debugging”. Each rule is a general guideline to follow that makes debugging easier. Each chapter in the book covers one of the nine rules.
 
Today, I’m going to go through chapters 5 and 6. Chapter 5 covers the rule “Quit Thinking and Look” and chapter 6 covers the rule “Divide and Conquer”.

Chapter 5: Quit Thinking and Look

There are two main points Agans makes in this chapter:
 
  • If you try to guess where the bug is, you will likely guess wrong. Instead of guessing where the bug is, you should look until you can “see” the bug.
  • You should add lots of instrumentation throughout your code to make it easier to see the bug.

Throughout my time as a programmer, I’ve been amazed at how many times people, including myself, incorrectly guess where a bug is. One of my favorite examples is when a friend of mine was making a simple 2D game. The game wound up being incredibly slow. My friend was convinced the slow part must be the collision detection, and he was starting to think about how to optimize it when I suggested he hook up a profiler and actually see where the problem was.
 
My friend reluctantly hooked up a profiler to his program and found, lo and behold, that collision detection was not the problem. It turned out the program was reloading the image files for the sprites on every frame. After he changed the program so it only loaded the sprites once, the game moved a lot more smoothly.
 
My friend had no evidence that the problem was what he thought it was. He was only relying on his intuition. Once he hooked up a profiler, he was able to see exactly where the real problem was.
 
If my friend had gone through and optimized the code for collision detection, he would have wasted time optimizing the code only to discover the issue was somewhere else. This is why it’s important to gather evidence that the bug is where you think it is before you attempt to fix the problem. The phrase “measure twice, cut once” comes to mind. You should be able to “see” the bug before you start to fix it.
 
This is one reason why profilers are so great. They let you see exactly where the slow part of your program is. There’s no need to guess at which part of your program is slow when you have access to a profiler.
 
This is also why you should add instrumentation throughout your program. The more data you collect about your program, the easier it is to see the bugs inside it.
 
One of the most useful pieces of code I’ve ever written was a tool I wrote for work called “perf trace”. For my day job, I’m responsible for optimizing the performance of a 1 PB distributed database. For every query that runs in our database, perf trace gathers information about the different stages of the query and how long each stage took. This data gives us a clear sense of which queries are slow and for what reasons. Perf trace makes it easy to debug slow queries because we can see exactly how much time each query spent in each stage of query execution.
 
We also use perf trace for understanding what the common causes of slow queries are. This allows us to determine what optimizations will improve query performance the most. If you want to read more about perf trace, I wrote a blog post about it that’s on my company’s blog.

Chapter 6: Divide and Conquer

The main point of chapter 6 is that “Divide and Conquer” is one of the most efficient ways to locate a bug. This rule goes hand in hand with “Quit Thinking and Look”. Quit Thinking and Look says to keep looking until you are able to see the bug; Divide and Conquer is the method you should use to narrow in on it.
 
The idea behind divide and conquer is simple. Try to find the general region where the bug is. Then repeatedly narrow down the location of the bug into smaller and smaller regions until you can see the bug. This is the same idea behind binary search, only applied to debugging.
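
As one concrete, hypothetical illustration of the binary-search flavor of this idea: suppose you know some change introduced the bug, you have an ordered list of builds, and you have a way to check whether a given build reproduces the bug. Then you can bisect to the first bad build. This isn’t a technique from the book, just a minimal sketch of divide and conquer in code.

import java.util.List;
import java.util.function.Predicate;

// Minimal sketch: binary search over an ordered list of builds to find the
// first one that reproduces a bug. The isBroken predicate stands in for
// whatever manual or automated check you would actually run.
public class FirstBadBuild {
  static <T> int findFirstBad(List<T> builds, Predicate<T> isBroken) {
    int lo = 0, hi = builds.size() - 1, firstBad = -1;
    while (lo <= hi) {
      int mid = (lo + hi) / 2;
      if (isBroken.test(builds.get(mid))) {
        firstBad = mid; // the bug is here or in an earlier build
        hi = mid - 1;
      } else {
        lo = mid + 1;   // the bug was introduced later
      }
    }
    return firstBad;    // index of the first broken build, or -1 if none
  }

  public static void main(String[] args) {
    List<Integer> builds = List.of(101, 102, 103, 104, 105, 106);
    // Pretend every build from 104 onward contains the bug.
    int firstBad = findFirstBad(builds, build -> build >= 104);
    System.out.println("First bad build: " + builds.get(firstBad));
  }
}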
 
At my job, a common issue is that query times go up for some period of time. When that happens, I have to investigate the issue and figure out exactly why query times went up. I usually use the perf trace tool I mentioned above to aid in debugging the problem.
 
When I have to look into why queries were slow, the first thing I do is try to figure out what area of the stack the increase in query times came from. Did it come from the web server? Did it come from the master database? Did it come from one of the worker databases?
 
Once I narrow down the query time increase to a specific area of our stack, I try to figure out the high level reason for why that part of the stack got slower. Did we max out CPU on a machine for some reason? Did we exceed the maximum number of queries we can run on a single server at a time?
 
Then, once I determine the high level reason query times went up, I can start looking for the root cause. One issue that came up before was a job that didn’t have a limit on the number of queries it ran in parallel. When the job ran, it ran far more queries against our database than we normally process. Processing so many queries in parallel caused a spike in CPU usage on our master database, which in turn made every query slower during the master database stage.
 
Divide and conquer is a general and efficient way to find the problem when dealing with issues like this. By applying divide and conquer during investigations, I’m able to make consistent forward progress until I find the issue.
 
As an aside, one of the least effective ways to debug problems is to look through the code line by line until you find the bug. I’ve done this myself in the past and I’ve seen a lot of other people do it too. This approach to debugging is bad for several reasons. For one, it’s inefficient: it’s a linear search rather than a binary search, because you have to read a large chunk of code before you have any chance of finding the bug. It’s also likely that the bug is non-obvious. Whoever wrote the bug probably didn’t see it, so it’s unlikely that you’ll spot it just by looking at the code. By reproducing the bug and following divide and conquer, you can find the bug much more quickly than by reading all the relevant code line by line.
 
Of course, it’s important to have a good understanding of the code base and how it works. Understanding the system in advance is helpful when debugging it. Still, walking through the code line by line is an inefficient way to debug.

Takeaways

To recap, the two rules covered are:

  • Quit Thinking and Look – Instead of guessing where the bug is, produce evidence that the bug is where you think it is.
  • Divide and Conquer – Find the general region in your code where the bug is, then narrow down the region it can be in until you are looking at the bug.

I find both of these to be solid pieces of advice. I’m definitely going to keep them in mind the next time I’m debugging a problem.

Paper Write Up – A Critique of the CAP Theorem

This is my first post in the Getting Dramatically Better as a Programmer series. Today we are going to look at the paper “A Critique of the CAP Theorem”. The paper was written by Martin Kleppmann, who also wrote the book Designing Data-Intensive Applications. I’m going to go through the paper, explain its main points, and summarize my takeaways. This partially forces me to explain the paper to others, and it also lets others learn about the paper.

Overview

The CAP theorem is a well known theorem about distributed systems. As the name suggests, “A Critique of the CAP Theorem” points out several issues with the CAP theorem. In particular, Kleppmann looks at two different issues:

  • The terms used in the theorem have unclear definitions and can be interpreted in many different ways.
  • Some assumptions made in the proof of the CAP theorem do not reflect reality.

Kleppmann then proposes a new way of modeling distributed systems he calls “delay-sensitivity”. Kleppmann designed delay-sensitivity to address the problems with the CAP theorem.

Background: The CAP Theorem

Before we dig into the paper, let’s first look at the CAP theorem and the history surrounding it. The CAP theorem was originally an informal claim about distributed systems made by Eric Brewer, now known as “Brewer’s Conjecture”. Brewer’s Conjecture stated that a distributed system can have at most two of: consistency, availability, and partition-tolerance (CAP). When making his claim, Brewer gave only rough definitions for each of these terms.

A more intuitive way to interpret Brewer’s Conjecture is the following: in a distributed system, if some servers cannot access each other, the distributed system will either be unable to process some requests (lack of availability) or the distributed system will not behave like a single server (lack of consistency).

To give a specific example, let’s say you want to build a website like Google Docs. You want multiple people to be able to collaborate on a document together. They should all be able to make edits to the document and everyone should see each other’s edits in real-time. You can consider the computers of the users of the website and the servers of the website itself to be a distributed system.

Brewer’s Conjecture states that when a user loses connection to the website (a network partition occurs) you have to make a choice. One option is you don’t allow the user to edit the document until they reconnect. This is choosing consistency over availability. By doing this, you guarantee that when a user views the document, they see the same document everyone else is seeing (consistency). At the same time, they won’t be able to view the document when they are unable to access the website (lack of availability).

The other option is to provide some sort of offline mode. If you add offline functionality, you are choosing availability over consistency. Users will be able to continue editing the document locally even when they can’t connect to the website (availability), but they won’t necessarily see the most recent edits made by other users (lack of consistency).

Brewer’s Conjecture states that it’s impossible to provide both availability and consistency when a user cannot connect to the website.

Brewer’s Conjecture is important because many systems want to provide both availability and consistency when there is a network partition. In practice, any system can only provide one of them during a partition. Ultimately whether you choose availability or consistency is a design decision.

Soon after Brewer made his conjecture, Seth Gilbert and Nancy Lynch formalized it in “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”. Gilbert and Lynch also gave a proof of the formal version of the conjecture. Since Gilbert and Lynch proved the formal version of Brewer’s Conjecture, the formal version became a theorem. That theorem is now known as “The CAP Theorem”.

Now that we’ve covered the history of the CAP theorem, let’s dig into Kleppmann’s critique of it.

Ambiguity of the Terms

One of the main issues raised by Kleppmann is that the definitions of the terms “availability” and “consistency” in Brewer’s Conjecture are ambiguous.

There are many different ways you can interpret “availability”. One way is as a measure of how often a production system returns a successful response; for example, you can measure this kind of availability by looking at what percentage of requests succeed. An alternative is to define availability as a property of the algorithm a system is running: if the algorithm guarantees a successful response in all cases, the algorithm can be called “available”.

There is a related question of how to deal with slow responses. One option is to define a system as available if it returns a response eventually. Another way is to define a system as available if it returns a response in under a given time limit.

The term “consistency” is also overloaded with many different meanings. There’s consistency in the ACID sense which is completely different from the consistency Brewer was referring to.1 In the context of distributed systems, people often refer to “consistency models”. A specific consistency model provides guarantees about the behavior of the distributed system. Some common consistency models are sequential consistency, causal consistency, and linearizability. Each of these models provides a different set of guarantees.

To give you a rough idea of the different consistency models, sequential consistency requires that executing several operations in a distributed system is the same as executing those operations in some order, one at a time. Linearizability is the same as sequential consistency except it also requires that if operation A started after operation B completed, the effects of operation A must come after the effects of operation B.

The term “strong consistency” also comes up often. Even though it does come up often, it has no specific formal definition. “Strong consistency” can refer to any of the consistency models above.

Differences Between Theory and Practice

After looking at issues with the terms used in the CAP theorem, Kleppmann goes on to look at the formal version, the CAP theorem as proved by Gilbert and Lynch. In defining the CAP theorem, Gilbert and Lynch define the words “availability”, “consistency”, and “partition-tolerance” as follows:

  • Availability – Every request to the service must terminate.
  • Consistency – The system provides linearizability.
  • Partition-tolerance – The system still needs to work even when some nodes in the system are unable to communicate with other nodes in the system.

Kleppmann points out that these definitions, and some assumptions made by Gilbert and Lynch, do not model reality.

There’s a big difference between the use of “availability” in the CAP theorem and its use in practice. In the CAP theorem, availability only means the system needs to return a response eventually. The system could return a response in a few seconds, several minutes, or even several days; as long as it eventually returns a response, the CAP theorem considers it available. In practice, you usually set a timeout so that if a system takes too long to respond, it is considered unavailable. If a website takes more than a few seconds to respond, it doesn’t matter that the system is available in the CAP sense. For all practical purposes, if no one is able to use the website, the website is unavailable.

Another big difference is that the CAP theorem considers only one kind of failure, a partition. In practice there are many other failures that can occur: individual servers can crash, individual network links can go down, and data on servers can even become corrupted. This makes the CAP theorem useful only if the failures you are worried about are partitions. If you want to reason about other kinds of failures in your distributed system, it’s difficult to make use of the CAP theorem.

Delay-Sensitivity

To deal with the issues of the CAP theorem, Kleppmann gives a rough draft of a framework he calls “delay-sensitivity”. In delay-sensitivity, you classify every operation by whether it requires time proportional to the network latency. The advantage of delay-sensitivity is that it gives you an idea of how long each operation will take, as opposed to the CAP theorem, which creates a dichotomy between available and unavailable.

For a linearizable system, it’s possible to prove that all operations take time proportional to the network latency. When there is a network partition, the network latency is unbounded. That means when a partition occurs in a linearizable system, operations take an unbounded amount of time and the system becomes unavailable. Such a system is consistent, but not available, when there is a network partition.

On the other hand, it’s possible to build a non-linearizable system in which all operations complete in time independent of the network latency. This means that even when there is a partition and the network latency is unbounded, operations still complete. In other words, such a system is available during a network partition, but it lacks consistency since it does not provide linearizability.

Takeaways

  • Many of the terms used by the CAP theorem are ambiguous. “Availability”, “Consistency”, and “Partition-tolerance” can all be interpreted in many different ways.
  • The CAP theorem makes assumptions that do not reflect reality. It treats partitions as the only possible fault, and it considers a service available as long as the service eventually returns a response, no matter how long that response takes.

My Opinion

While the paper was an interesting read, I do have some issues with it. The paper is titled “A Critique of the CAP Theorem”, but while part of it discusses how the CAP theorem differs from reality, a good amount of it focuses on the ambiguity of the terms. The ambiguity isn’t so much a problem with the CAP theorem itself: when they define the CAP theorem, Gilbert and Lynch give formal definitions for all the terms they use. The issues raised are more an issue with the colloquial version of the CAP theorem.

The issues raised about the actual CAP theorem are more interesting, especially the point that the CAP theorem has no regard for how long operations actually take. I like this aspect of delay-sensitivity, since it roughly measures how long each operation takes even when there isn’t a network partition. On the other hand, I find it difficult to reason about the exact implications of an operation taking time proportional to the network latency. I find the CAP theorem much easier to reason about because it makes a clear distinction between a system being available or unavailable.

My Approach to Getting Dramatically Better as a Programmer

There was a recent discussion among my social group about what “getting dramatically better as a programmer” means. Based on that discussion, I’ve decided to share my own approach to becoming a “dramatically better programmer”. I want others to understand what practices I’ve found useful, so they can incorporate them into their own life.

My approach to getting dramatically better is built around a training regime: a specific set of “exercises” I do every week. I designed the regime with two explicit goals in mind:

  • Learning how to solve problems I didn’t know how to solve before.
  • Learning how to write correct programs faster.

My training routine consists of four exercises, each of which helps me achieve the two objectives above. The four exercises are:

  • Reading a paper.
  • Learning a new tool.
  • Reading several chapters of a book.
  • Recording my screen as I write a program. Then reviewing the footage and seeing how I could have written the program faster.

Let me explain each of the exercises in a bit more depth, covering how it works and the main benefits I’ve gotten from doing it.

Reading a Paper

This exercise is designed to expand my knowledge of Computer Science. I’ve found two direct benefits to reading papers. The first is that some papers have changed my mental model of certain problems. A good example of this is The Tail at Scale, which examines the counterintuitive nature of long tail latency.

One interesting lesson I learned from the paper was how running a request on multiple machines affects latency. The authors looked at empirical data from a Google service that processes a request by distributing parts of it across many different servers. They used the data to estimate what would happen if you distributed requests across 100 different servers. They found that if you measure the time it takes to get a response from all 100 servers, half of that time will be spent waiting for only the last five! This is because the slowest 5% of requests are that much slower than all the other requests. The paper also gives several approaches for reducing tail latency, which I have found useful in my own work.
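
To see why fan-out makes the tail matter so much, here’s a small back-of-the-envelope sketch. It’s my own illustration, not a calculation from the paper: it assumes each server independently has a 5% chance of responding slowly and computes the chance that at least one response in the fan-out is slow.

// Back-of-the-envelope illustration (not data from the paper): if each of n
// servers independently has a 5% chance of a slow response, the probability
// that at least one of the n responses is slow grows quickly with n, so the
// overall request is very likely to be stuck waiting on a straggler.
public class TailAmplification {
  public static void main(String[] args) {
    double pSlow = 0.05; // assumed probability a single server responds slowly
    for (int n : new int[] {1, 10, 100}) {
      double pAnySlow = 1 - Math.pow(1 - pSlow, n);
      System.out.printf("fan-out %3d -> P(at least one slow response) = %.3f%n", n, pAnySlow);
    }
  }
}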

The other benefit I’ve found from reading papers is that they give me the knowledge to understand larger systems as a whole. As an example, take Spanner, Google’s distributed database. Spanner uses many different techniques such as Paxos, two-phase commit, MVCC, and predicate locks. I’ve been able to build up an understanding of these techniques by reading papers. In turn, this lets me reason about Spanner as a whole and about the trade-offs Spanner makes compared to other systems.

I find most papers I read either by following the references of papers I have already read or by following up on a paper covered in the Morning Paper. The book Designing Data-Intensive Applications also has a lot of references to papers worth reading.

Learning a New Tool

One of the easiest ways to solve a problem is to use an existing tool that already solves it. For this exercise, I pick a tool and learn a bit about it. Usually I set up the tool locally, go through a few tutorials, and read a bit of the manual. Tools I’ve learned in the past range from bash utilities such as jq or sed to distributed systems such as Kafka or Zookeeper.

Learning the bash utilities helps me solve many common tasks quicker than I could have otherwise. Simple text processing is often easier with sed than with a general-purpose programming language. Likewise, learning about different distributed systems comes in handy for understanding which tools work well for which problems, so I know what to reach for when faced with a given problem.

Reading Several Chapters of a Book

I use books to supplement the knowledge I don’t get from reading a paper or learning a tool. The books I read cover a wide range of topics. Books I’ve read recently include:

  • Refactoring – I found this to be a great way to understand what good code should look like and how to turn bad code into good code.
  • Getting Things Done – I found this book to be helpful for prioritizing and keeping track of things. It helped me build a system to make sure I get done what’s important for me to get done.
  • The First Time Manager – I recently became the team coordinator for my team at work. As team coordinator, my main responsibility is communicating with other teams when necessary. I also lead my team’s meetings. I found this book to be great for getting an understanding of basic management principles.

Recording My Screen

This exercise is my favorite. It’s the exercise that’s changed how I approach problems the most. It’s common for athletes to review footage of themselves to understand how they could have done better. I decided to apply the same approach to programming. Some of the lessons I’ve learned by recording my screen include:

  • It helps to test code as it’s written. Doing this reduces the amount of time it takes to debug code by reducing the time you spend locating where the bug is. If all your existing code is bug free, the bug has to be in the new code you just wrote.
  • When debugging a problem, adding functionality purely for the purpose of debugging is often well worth the cost. As an example, one toy problem I worked on was writing an LRU cache. I had a bug where it wasn’t evicting the right elements. I was able to quickly determine what was wrong by adding a function to print the state of the cache. I could then see where the expected behavior of the cache differed from the actual behavior, which let me quickly locate the bug. (There’s a small sketch of this idea after this list.)
  • Spending five minutes deciding on an approach up front before writing any code is worth it. I’ve found two benefits to doing this. It helps me make sure my approach is correct. More importantly, it forces me to decide on a single approach. By watching a recording of myself, I found I wasted a lot of time switching my code between two different approaches. In reality, either approach would have worked fine.
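
Here’s a minimal sketch of that second lesson, assuming nothing about the original exercise’s code: an LRU cache with a debug-only method for dumping its state, so you can see exactly where the expected eviction order and the actual order diverge.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch (not the code from the exercise): an LRU cache built on
// LinkedHashMap, with an extra debugState() method added purely for debugging.
// Printing the state after each operation makes it easy to spot the exact
// operation where the cache's behavior stops matching your expectations.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int capacity;

  public LruCache(int capacity) {
    super(capacity, 0.75f, true); // accessOrder=true gives LRU iteration order
    this.capacity = capacity;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > capacity; // evict once we exceed capacity
  }

  // Debug-only helper: dump the keys from least to most recently used.
  public String debugState() {
    return "LRU->MRU " + keySet();
  }

  public static void main(String[] args) {
    LruCache<String, Integer> cache = new LruCache<>(2);
    cache.put("a", 1);
    cache.put("b", 2);
    cache.get("a");     // touch "a" so "b" becomes the least recently used
    cache.put("c", 3);  // should evict "b"
    System.out.println(cache.debugState()); // prints: LRU->MRU [a, c]
  }
}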

All these lessons are obvious in retrospect, but I had no clue any of them were issues until I recorded my screen and saw where I was actually spending time.

The steps I take for this exercise are:

  1. Record myself working on some problem. This can either be a problem I worked on at work or a problem from a programming challenge website such as Leetcode.
  2. Go through the recording at 10x speed and annotate what I was doing at each moment.
  3. Total up how much time I spent in high level categories. How much time did I spend debugging a particular bug? How much time did I spend building a particular feature?
  4. Look at the categories I spent the most time in. Then dig into what actually took up that time.
  5. Come up with approaches that would have allowed me to save time. Often there are ways I could have structured my code up front that would have allowed me to write less code or find bugs earlier.

I highly recommend recording your screen. It’s one of the easiest ways to find small changes you can make to make yourself a lot more productive.


I’ve been doing this training regime over the past year. I’ve definitely noticed a huge difference. There’s a large amount of knowledge about systems and tools that I wouldn’t have developed otherwise. I’m also able to solve problems more quickly than I was able to before. I hope you reflect on these exercises and implement a few of them yourself.

Going forward, I’m going to start sharing what I find through my training regime. I’m going to write a blog post every time I do one of the exercises, sharing what I learned and found during the exercise. Explaining what I learned will benefit me, and the posts can also serve as a learning resource for others.