Announcing Perfalytics: Never Run EXPLAIN ANALYZE Again

Today I’m announcing the beta for Perfalytics. Perfalytics is a tool I’ve been working on designed to make it easy to analyze query performance for Postgres. As the lead of the Database team at Heap, where I optimize the performance of a Postgres cluster with more than 1PB of data, I’ve done my fair share of Postgres performance analysis. I believe the standard way of debugging a slow Postgres query is tedious and cumbersome.

Debugging a slow Postgres query usually goes something like this. Given the slow query, you run EXPLAIN ANALYZE on it. EXPLAIN ANALYZE reruns the query with performance instrumentation, giving you the query plan and a breakdown of how much time was spent processing each part of that plan.
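
For example, here's roughly what that looks like for a hypothetical events table (the table and query are made up, and the plan output is abbreviated):

 EXPLAIN ANALYZE SELECT count(*) FROM events WHERE user_id = 42;
                              QUERY PLAN
 ------------------------------------------------------------------
  Aggregate (actual time=532.19..532.20 rows=1 loops=1)
    ->  Seq Scan on events (actual time=0.02..489.01 rows=1834 loops=1)
          Filter: (user_id = 42)
  Execution Time: 532.25 ms

The plan shows that this particular run spent nearly all of its time in a sequential scan over events, which is exactly the kind of detail you need to explain why a query was slow.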

If you’re lucky, the slow query is reproducible and you will see why the query was slow. In my experience, this is frequently not the case. All too often, by the time you run the query with EXPLAIN ANALYZE, something has changed and the query is now fast. This can happen for a number of reasons, including caching, changes in the makeup of the table, decreased server load, and many other factors.

The idea behind Perfalytics is to turn this way of doing things on its head. Instead of running EXPLAIN ANALYZE after the fact to try to reproduce the slow query, why not gather the exact same information at query time? That way, if the query is slow, you can see exactly why that particular run of the query was slow.

That’s right. Perfalytics records the query plan and various query statistics for every single query at query time. This means you have all the information you need to debug why a particular query was slow. There’s no need to try to reproduce the problem after the fact.
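
If you want a rough taste of this approach with stock Postgres today, the auto_explain module can log the plan of any slow query at execution time. A minimal sketch (the threshold would need tuning for your workload):

 # postgresql.conf
 shared_preload_libraries = 'auto_explain'
 auto_explain.log_min_duration = '1s'  # log the plan of any query slower than this
 auto_explain.log_analyze = on         # include actual timings and row counts

Perfalytics goes further by recording this information for every query, but the underlying idea of capturing the plan at query time is the same.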

In addition to making it easier to debug slow queries, collecting this information has a number of other advantages. Because you are collecting this information for every single query, you can look at overall trends. For example, you can, for each slow query, programmatically classify why exactly that query was slow. This lets you see how often different kinds of slow queries come up and tells you exactly where you should focus your efforts.

Once you try using something like Perfalytics, you’ll never want to go back to the old way of doing things.

If you want to try out Perfalytics, sign up for a beta invite.

If you have any questions, feel free to contact me at michaelmalis2@gmail.com.

Recording Footage of Myself Investigating a Performance Issue

Of the exercises I do as part of Getting Dramatically Better as a Programmer, my favorite is this one. Every week I record my screen and later review the footage. I’ve found this to be a great way to understand where I’m wasting time and how I could solve the same problem more efficiently.

In this write up I’m going to go over one segment of footage of myself. In particular I’m going to share:

  • The problem I was working on.
  • A breakdown of where I spent my time when working on the problem.
  • General techniques I’ve come up with that would have reduced the amount of time it took to solve the problem.

Since I reviewed myself doing a problem at work, I’m unable to share the raw footage. I am able to share my notes on what I was doing at each point in time, which you can see here.

Doing this write up both forces me to explain to others what I learned and lets others see the lessons I took away.

The Problem

The footage I reviewed was of myself doing an investigation at work. For several minutes, our p95 query latency went up by an order of magnitude. I’m usually the person responsible for investigating problems like this and coming up with a solution to prevent them in the future.

To give you some background, I work for Heap. Heap is a tool people can use to run analytics on their websites. We give customers a piece of JavaScript to put on their website, which allows us to collect data from it. Our customers can then run queries over their data. They can ask questions such as “how many people visited my website last week” or “of the people who signed up, how many of them went through the tutorial”.

It’s these kinds of queries that were slow for a brief period of time.

Since performance is so important to Heap, we’ve built a plethora of tools for analyzing performance. One of the most powerful is a tool called “Perf Trace”. Perf Trace gathers information about every query run in our database. It records the query plan used to execute the query, as well as how much time was spent in each stage of the query. It then writes this data to a SQL database so it’s possible to query the data with SQL. If you want to read more about Perf Trace, I wrote a blog post for Heap that covers it in depth.

One of the most common uses of Perf Trace is debugging issues just like this one.

Where the Time Went

In total, it took 25 minutes to determine the root cause and write it up. The root cause was that a group of computationally expensive queries were run together.

To be more specific, there is a type of query Heap supports that consumes a lot of resources on our master database. Every query we run goes through our master database. If our master database becomes overloaded with queries, the amount of time the master DB spends processing queries will go up.

I determined this was the root cause through two symptoms. According to Perf Trace, the increase in query time came from queries spending more time in our master database. I also looked at the CPU usage of our master database and noticed a large spike in CPU usage around the time of the incident.

In finding the root cause, most of my time was spent in one of the following three categories:

  • 9 minutes – Looking at various pieces of data. This includes looking at both Perf Trace data and looking at various dashboards.
  • 8 minutes – Writing or debugging SQL queries. The main SQL query I wrote was for looking at the Perf Trace data for queries around the time of the incident.
  • 5 minutes – Writing up the results in a document.

Takeaways

Based on the footage, there are two main takeaways I have:

  • When looking at data, I should have a clear sense of what I’m looking for.
  • I should have a repository of common SQL snippets I use.

For the first point, I noticed that I would spend a lot of time scrolling through the data without looking for anything in particular. Scrolling mindlessly wasted a lot of time and didn’t turn up anything particularly interesting. If I had gone in with a clear sense of what I was looking for, I could have spent much less time digging through the data.

As for the second point, I spent a lot of time getting the SQL query I wrote to work. The query I wrote for analyzing the Perf Trace data is similar to many queries I’ve written in the past, yet it still took a fair amount of time to write and debug. If I kept a repository of the common SQL snippets I use, I wouldn’t have needed to spend as much time writing the SQL; the repository would already have a snippet that worked and had already been debugged.
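
For illustration, here's the kind of snippet I'd keep in such a repository. The perf_trace table and its columns are hypothetical stand-ins for our actual schema:

 -- Slowest queries around the time of an incident, broken down by stage
 SELECT query_id, stage, duration_ms
 FROM perf_trace
 WHERE started_at BETWEEN '2019-01-15 13:00' AND '2019-01-15 13:30'
 ORDER BY duration_ms DESC
 LIMIT 50;

With a debugged snippet like this on hand, starting an investigation is a matter of changing the time window rather than rewriting the query from scratch.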

Tool Write Up – Hadoop

One exercise I do every week as part of Getting Dramatically Better as a Programmer is learning a new tool. This week, I took a look at Hadoop. I’m going to walk through what I learned and mention a few interesting items I got out of it. Writing it up both forces me to explain what I learned and lets others see what’s so cool about Hadoop. Most of what I learned came from the Hadoop docs.

Overview

Hadoop is a framework designed for dealing with large amounts of data. There are two main pieces to Hadoop: HDFS, the Hadoop filesystem, and Hadoop MapReduce.

HDFS is a distributed file system based on the Google File System. It looks like a regular old file system, but stores the data across many different servers. HDFS handles problems like coordinating what data is stored where and replicating the files. HDFS is also able to handle files much larger than an ordinary filesystem can deal with. HDFS is the part of Hadoop that handles the raw data storage.

The other part of Hadoop is Hadoop MapReduce. Like HDFS, Hadoop MapReduce is based on a tool created by Google. As the name suggests, it is based on the MapReduce framework. Hadoop MapReduce is the part of Hadoop that handles processing data. It takes care of moving data between many different servers and performing aggregations over the data.

Together, the two parts of Hadoop solve one of the hardest problems that come up when processing large amounts of data: running code across many different servers. Hadoop takes care of this by:

  • Breaking down the input data into many different pieces spread across many different servers.
  • Handling the scheduling of each part of the data processing job across many different servers.
  • Rescheduling the necessary parts of the job if a machine fails.
  • Aggregating the data from each part of the job together.

The rest of this post will mainly discuss the data processing portion of Hadoop, Hadoop MapReduce.

Outline of Hadoop MapReduce

By design, writing a job to process large amounts of data using Hadoop MapReduce is straightforward. The framework takes care of all the hard parts. As the programmer, you only have to do two things: specify a “map” function and a “reduce” function. That’s all there is to it!

The core idea of MapReduce is that the map function takes a bit of raw input and processes it. The reduce function then takes the output from multiple map calls and combines them together. Any calculation that can be specified in terms of these two functions is trivial to turn into a distributed and fault tolerant job with Hadoop.

To be more specific, a Hadoop job takes an HDFS file as input. Hadoop splits the HDFS file into many different chunks (this is often already done by HDFS). Hadoop then calls the map function on each chunk. The map function processes its chunk and returns an intermediate result in the form of a list of key-value pairs.

For each unique key, Hadoop gathers the list of all pairs with that key together. Hadoop then passes the list of values associated with that key to the reduce function. The reduce function combines the list of values into a single value. Hadoop will then write the result of the reduce function for each key to an output file.

Many different types of calculations can be expressed in terms of map and reduce and that’s exactly what makes Hadoop MapReduce so powerful.

Before we dig into what’s going on under the hood, let’s take a look at an example MapReduce application.

Example: Word Count

The code for this section is taken from this tutorial from the Hadoop docs.

The classic example of a MapReduce job is the word count problem. We want to count the number of times each word occurs in a file. Doing this when the file is on the order of gigabytes is no problem. When the file gets on the order of terabytes or even petabytes, you are going to have to bring out the more powerful tools like Hadoop.

To write the Hadoop job, we need to write a map function and a reduce function. As mentioned above, the map function takes a chunk of raw data and produces an intermediate result for that chunk. For the word count problem, we can write a map function that takes the raw text and produces a list of key-value pairs of the form (<word>, 1) for each word in the text.

To write the function, we need to write a class that extends the Mapper class. Then we override the map function with the definition we want. This looks like the following:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

The map function takes the chunk of input, splits up the text into individual words, and outputs the key-value pair (<word>, 1) for each word. There is some MapReduce specific code, but the example should still be easy to understand.

Writing the reduce function is just as easy. We want to get the number of times each word appears in the file. Since the key of each tuple produced by the map function is the word seen, the reduce function is passed a list of ones for each unique word. There is one entry in the list for each time the word appears.

The reduce function simply needs to take the list of values and count how many values there are. In order to allow for some optimizations by Hadoop, we will instead sum the list of values. Since each value in the list is a one, this winds up giving the same result as counting the number of values. We’ll get into what these optimizations are when we look at what’s going on under the hood. Here’s the code:

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

It simply iterates through the list of values and sums them.

Now that we have the map and reduce functions, we only need to write a small bit of code that initializes a job that uses them:

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

The important bits are the calls to setMapperClass and setReducerClass, which specify the map function and the reduce function to use, and the final line, which kicks off the job and waits for it to complete. Most of the other lines are just boilerplate for initializing the job.

The above job can be run across as many machines as you want. That makes it easy to scale the job up to much larger amounts of data than you would ordinarily be able to process without Hadoop.

Hadoop Under the Hood

In order to run a Hadoop job, Hadoop uses two kinds of processes. It has mappers, which are responsible for running the map function, and it has reducers, which are responsible for running the reduce function. Hadoop runs these processes across the different servers in the Hadoop cluster.

When running a Hadoop job, Hadoop first starts many different mapper processes. Hadoop sends the raw chunks of the file being processed to the mapper processes. The mapper processes then run the map function you specified. In order to minimize the amount of data sent over the network to the mappers, Hadoop will try to run the mapper functions on the same machines as where HDFS stores the file.

Once a mapper process finishes generating its output, it sends the output to the reducers. The mapper spreads its output across all the different reducers. To make sure all values with the same key are sent to the same reducer, Hadoop decides which reducer to send a particular key-value pair to based on the key. The process of mappers sending different chunks of their data to the reducers is known as the shuffle step.

As an optimization, the mapper will use the “combiner” function to combine values from the same mapper before they are sent to the reducer. The combiner function is an optional function you can specify specifically for this purpose. Before sending its output to the reducer, the map function will pass all values with the same key to the combiner function and then send the output of the combiner function to the reducer. This reduces the total amount of data sent over the network. Often the combiner function can be the same as the reduce function.

This is why the reducer we defined in the word count problem takes the sum of the values instead of the count of them. The reducer is really taking the sum of the partial counts from each mapper. Each mapper has already passed its key-value pairs to the local combiner, which counted how many times each word occurred and handed that information off to the reducer.

As each reducer completes, they write their data to the output HDFS file. And that’s how a Hadoop MapReduce job works.

Takeaways

  • HDFS is a distributed file system that makes it easy to store lots of data.
  • Hadoop MapReduce is a framework that makes it easy to write distributed and fault tolerant jobs.

I find MapReduce pretty interesting. I do find the boilerplate a bit off-putting, but focusing on that would be missing the forest for the trees. MapReduce saves way more code than the boilerplate adds. I find it impressive how great an abstraction MapReduce is. By dramatically limiting the programming model, it makes it trivial to write scalable software.

Book Write Up – Debugging Chapters 5 and 6

As part of my Getting Dramatically Better as a Programmer Series, I’m going to write up a few chapters of the book I’m currently reading and a few takeaways I got. I’m currently reading the book Debugging by David Agans. The book covers Agans’ “nine golden rules of debugging”. Each rule is a general guideline to follow that makes debugging easier. Each chapter in the book covers one of the nine rules.
 
Today, I’m going to go through chapters 5 and 6. Chapter 5 covers the rule “Quit Thinking and Look” and chapter 6 covers the rule “Divide and Conquer”.

Chapter 5: Quit Thinking and Look

There are two main points Agans makes in this chapter:
 
  • If you try to guess where the bug is, you will likely guess wrong. Instead of guessing where the bug is, you should look until you can “see” the bug.
  • You should add lots of instrumentation throughout your code to make it easier to see the bug.

Throughout my time as a programmer, I’ve been amazed at how many times people, including myself, incorrectly guess where a bug is. One of my favorite examples is when one of my friends was making a simple 2D game. The game wound up being incredibly slow. My friend thought the slow part must have been the collision detection. He was beginning to think about how to optimize the collision detection when I suggested he hook up a profiler and actually see where the problem was.
 
My friend reluctantly hooked up a profiler to his program and found, lo and behold, that collision detection was not the problem. It turned out the program was reloading the image files for the sprites on every frame. After he changed the program so it only loaded the sprites once, the game moved a lot more smoothly.
 
My friend had no evidence that the problem was what he thought it was. He was only relying on his intuition. Once he hooked up a profiler, he was able to see exactly where the real problem was.
 
If my friend had gone through and optimized the code for collision detection, he would have wasted time optimizing the code only to discover the issue was somewhere else. This is why it’s important to gather evidence that the bug is where you think it is before you attempt to fix the problem. The phrase “measure twice, cut once” comes to mind. You should be able to “see” the bug before you start to fix it.
 
This is one reason why profilers are so great. They let you see exactly where the slow part of your program is. There’s no need to guess at which part of your program is slow when you have access to a profiler.
 
This is also why you should add instrumentation throughout your program. The more data you collect about your program, the easier it is to see the bugs inside it.
 
One of the most useful pieces of code I’ve ever written was a tool I wrote for work called “perf trace”. For my day job, I’m responsible for optimizing the performance of a 1PB distributed database. For every query that runs in our database, perf trace gathers information about the different stages of the query and how long each stage took. This data gives us a clear sense of which queries are slow for which reasons. Perf trace makes it easy to debug slow queries because we are able to see exactly how much time each query spent at each stage of query execution.
 
We also use perf trace for understanding what the common causes of slow queries are. This allows us to determine what optimizations will improve query performance the most. If you want to read more about perf trace, I wrote a blog post about it that’s on my company’s blog.

Chapter 6: Divide and Conquer

The main point of chapter 6 is that “Divide and Conquer” is one of the most efficient ways to locate a bug. This rule goes hand in hand with “Quit Thinking and Look”. Quit Thinking and Look says to keep looking until you are able to see the bug; Divide and Conquer is the method you should use to locate it.
 
The idea behind divide and conquer is simple. Try to find the general region where the bug is. Then repeatedly narrow down the location of the bug into smaller and smaller regions until you can see the bug. This is the same idea behind binary search, only applied to debugging.
 
At my job, a common issue that comes up is that query times go up for some period of time. When something like that happens, I have to investigate the issue and figure out why exactly query times went up. I usually use the perf trace tool I mentioned above to aid in debugging the problem.
 
When I have to look into why queries were slow, the first thing I do is try to figure out what area of the stack the increase in query times came from. Did it come from the web server? Did it come from the master database? Did it come from one of the worker databases?
 
Once I narrow down the query time increase to a specific area of our stack, I try to figure out the high level reason for why that part of the stack got slower. Did we max out CPU on a machine for some reason? Did we exceed the maximum number of queries we can run on a single server at a time?
 
Then, once I determine the high level reason query times went up, I can start looking for the root cause. An issue that came up before was that a job didn’t have a limit on the number of queries it was running in parallel. When the job ran, it ran many more queries in our database than we normally process. Processing so many queries in parallel caused a spike in CPU usage on our master database. That in turn caused all queries to become slower during the master database part of the query.
 
Divide and conquer is a general and efficient way to find the problem when dealing with issues like this. By applying divide and conquer when doing investigations, I’m able to make consistent forward progress until I find the issue.
 
As an aside, one of the least effective ways to debug problems is to look through the code line by line until you find the bug. I’ve done this myself in the past and I’ve seen a lot of other people do it too. This approach to debugging is bad for several reasons. For one, it’s inefficient. It’s more of a linear search than a binary search because you have to read a large chunk of code before you have any chance of finding the bug. It’s also likely that the bug is non-obvious. Whoever wrote the bug probably didn’t see it so it’s unlikely that you’ll be able to spot it by just looking at the code. By reproducing the bug and following divide and conquer, you can find the bug much more quickly than if you were to read all the relevant code line by line.
 
Of course, it’s important to have a good understanding of the code base and how it works. If you understand the system in advance, that will be helpful when debugging it. Still, walking through the code for the system line by line is an inefficient way to debug.

Takeaways

To recap, the two rules covered are:

  • Quit Thinking and Look – Instead of guessing where the bug is, you should produce evidence that the bug is where you think it is.
  • Divide and Conquer – Try to find the general region in your code where the bug is. Then narrow down the region it can be in until you are looking at the bug.

I find both of these to be solid pieces of advice. I’m definitely going to keep them in mind the next time I’m debugging a problem.

Paper Write Up – A Critique of the CAP Theorem

This is my first post under the Getting Dramatically Better as a Programmer series. Today we are going to look at the paper “A Critique of the CAP Theorem”. The paper was written by Martin Kleppmann, who also wrote the book Designing Data-Intensive Applications. I’m going to go through the paper, explain the main points it makes, and summarize the takeaways I got from it. Writing this up both forces me to explain the paper to others and gives others a way to learn about it.

Overview

The CAP theorem is a well known theorem about distributed systems. As the name suggests, “A Critique of the CAP Theorem” points out several issues with the CAP theorem. In particular, Kleppmann looks at two different issues:

  • The terms used in the theorem have unclear definitions and can be interpreted in many different ways.
  • Some assumptions made in the proof of the CAP theorem do not reflect reality.

Kleppmann then proposes a new way of modeling distributed systems he calls “delay-sensitivity”. Kleppmann designed delay-sensitivity to address the problems of the CAP theorem.

Background: The CAP Theorem

Before we dig into the paper, let’s first look at the CAP theorem and the history surrounding it. The CAP theorem was originally an informal claim about distributed systems made by Eric Brewer, now known as “Brewer’s Conjecture”. Brewer’s Conjecture stated that a distributed system can have at most two of: consistency, availability, and partition-tolerance (CAP). When making his claim, Brewer gave only rough definitions for each of these terms.

A more intuitive way to interpret Brewer’s Conjecture is the following: in a distributed system, if some servers cannot access each other, the distributed system will either be unable to process some requests (lack of availability) or the distributed system will not behave like a single server (lack of consistency).

To give a specific example, let’s say you want to build a website like Google Docs. You want multiple people to be able to collaborate on a document together. They should all be able to make edits to the document and everyone should see each other’s edits in real-time. You can consider the computers of the users of the website and the servers of the website itself to be a distributed system.

Brewer’s Conjecture states that when a user loses connection to the website (a network partition occurs) you have to make a choice. One option is you don’t allow the user to edit the document until they reconnect. This is choosing consistency over availability. By doing this, you guarantee that when a user views the document, they see the same document everyone else is seeing (consistency). At the same time, they won’t be able to view the document when they are unable to access the website (lack of availability).

The other option you have is to provide some sort of offline mode. If you add offline functionality, you are choosing availability over consistency. Users will be able to continue editing the document locally even if they can’t connect to the website (availability), but they won’t necessarily see the most recent edits made by other users (lack of consistency).

Brewer’s Conjecture states that it’s impossible to provide both availability and consistency when a user cannot connect to the website.

Brewer’s Conjecture is important because many systems want to provide both availability and consistency when there is a network partition. In practice, any system can only provide one of them during a partition. Ultimately whether you choose availability or consistency is a design decision.

Soon after Brewer made his conjecture, Seth Gilbert and Nancy Lynch formalized it in “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”. Gilbert and Lynch also gave a proof of the formal version of the conjecture. Since Gilbert and Lynch proved the formal version of Brewer’s Conjecture, the formal version became a theorem. That theorem is now known as “The CAP Theorem”.

Now that we’ve covered the history of the CAP theorem, let’s dig into Kleppmann’s critique of it.

Ambiguity of the Terms

One of the main issues raised by Kleppmann is that the definitions of the terms “availability” and “consistency” are ambiguous in Brewer’s Conjecture.

There are many different ways you can interpret “availability”. One way to interpret it is about how often a production system returns a successful response. For example, you can measure this kind of availability by looking at what percentage of requests are successful. An alternative way to define availability is as a property of the algorithm a system is running. If the algorithm guarantees successful responses in all cases, the algorithm can be referred to as “available”.

There is a related question of how to deal with slow responses. One option is to define a system as available if it returns a response eventually. Another way is to define a system as available if it returns a response in under a given time limit.

The term “consistency” is also overloaded with many different meanings. There’s consistency in the ACID sense, which is completely different from the consistency Brewer was referring to. In the context of distributed systems, people often refer to “consistency models”. A specific consistency model provides guarantees about the behavior of the distributed system. Some common consistency models are sequential consistency, causal consistency, and linearizability. Each of these models provides a different set of guarantees.

To give you a rough idea of the different consistency models, sequential consistency requires that executing several operations in a distributed system is the same as executing those operations in some order, one at a time. Linearizability is the same as sequential consistency except it also requires that if operation A started after operation B completed, the effects of operation A must come after the effects of operation B.

The term “strong consistency” also comes up often. Even though it does come up often, it has no specific formal definition. “Strong consistency” can refer to any of the consistency models above.

Differences Between Theory and Practice

After looking at some issues around the terms used in the CAP theorem, Kleppmann goes on to look at the formal version: the CAP theorem as proved by Gilbert and Lynch. In defining the CAP theorem, Gilbert and Lynch define the words “availability”, “consistency”, and “partition-tolerance” as follows:

  • Availability – Every request to the service must terminate.
  • Consistency – The system provides linearizability.
  • Partition-tolerance – The system still needs to work even when some nodes in the system are unable to communicate with other nodes in the system.

Kleppmann points out that these definitions, and some assumptions made by Gilbert and Lynch, do not model reality.

There’s a big difference between the use of “availability” in the CAP theorem and its use in practice. In the CAP theorem, availability means the system only needs to return a response eventually. That means the system could return a response in a few seconds, several minutes, or even several days. As long as the system eventually returns a response, it is considered available by the CAP theorem. In practice, you usually set a timeout so that if a system takes too long to respond, the system is considered unavailable. If a website takes more than a few seconds to respond, it doesn’t matter that the system is available in the CAP sense. For all practical purposes, if no one is able to access the website, the website is unavailable.

Another big difference is that the CAP theorem only considers one kind of failure: a partition. In practice there are many other failures that can occur. Individual servers can crash. Individual network links can go down. Data on servers can even become corrupted. The CAP theorem is useful if the only kind of failure you are worried about is a partition, but if you want to reason about other kinds of failures in your distributed system, it’s difficult to make use of it.

Delay-Sensitivity

To deal with the issues of the CAP theorem, Kleppmann gives a rough draft of a framework he calls “delay-sensitivity”. In delay-sensitivity, you measure every operation by whether the operation requires time proportional to the network latency. The advantage of delay-sensitivity is that it gives you an idea of how long each operation will take, as opposed to the CAP theorem, which creates a dichotomy between available and unavailable.

For a linearizable system, it’s possible to prove that all operations will take time proportional to the network latency. When there is a network partition, the network latency is unbounded. That means when a partition occurs in a linearizable system, operations will take an unbounded amount of time and the system will become unavailable. Such a system is consistent, but not available when there is a network partition.

On the other hand, it’s possible to provide a non-linearizable system in which all operations complete independently of the network latency. This means even when there is a partition and the network latency is unbounded, operations will still complete. In other words, the system is available even when there is a network partition. When there is a network partition, this system would be available, but it lacks consistency since it does not provide linearizability.

Takeaways

  • Many of the terms used by the CAP theorem are ambiguous. “Availability”, “Consistency”, and “Partition-tolerance” can all be interpreted in many different ways.
  • The CAP theorem makes assumptions that do not reflect reality. It treats partitions as the only possible fault that can come up, and it considers a service to be available as long as the service eventually returns a response, no matter how long that response takes.

My Opinion

While the paper was an interesting read, I do have some issues with it. The paper is titled “A Critique of the CAP Theorem”, yet while part of it discusses how the CAP theorem differs from reality, a good amount of it focuses on the ambiguity of the theorem’s terms. The ambiguity isn’t so much a problem with the CAP theorem itself. When they define the CAP theorem, Gilbert and Lynch give formal definitions for all the terms they use. The issues raised are more an issue with the colloquial version of the CAP theorem.

The issues raised about the actual CAP theorem are a bit more interesting. Especially the part about how the CAP theorem doesn’t have any regard for how long the operations actually take. I do like this aspect of delay-sensitivity since delay-sensitivity roughly measures how long each operation takes, even if there isn’t a network partition. On the other hand, I do find it difficult to reason about the exact implications of an operation taking time proportional to the network latency. I find it much easier to reason about the CAP theorem because it makes a clear distinction between a system being available or unavailable.

My Approach to Getting Dramatically Better as a Programmer

There was a recent discussion among my social group about what “getting dramatically better as a programmer” means. Based on that discussion, I’ve decided to share my own approach to becoming a “dramatically better programmer”. I want others to understand what practices I’ve found useful, so they can incorporate them into their own life.

My approach to getting dramatically better is built around a training regime. There are a specific set of “exercises” I do every week. I designed the training regime with two explicit goals in mind:

  • Learning how to solve problems I didn’t know how to solve before.
  • Learning how to write correct programs faster.

My training routine consists of a total of four different exercises. Each one helps me achieve the two objectives above. The four different exercises are:

  • Reading a paper.
  • Learning a new tool.
  • Reading several chapters of a book.
  • Recording my screen as I write a program. Then reviewing the footage and seeing how I could have written the program faster.

Let me explain each of the exercises in a bit more depth. I want to share some details on how each exercise works and the main benefits I’ve had from doing each exercise.

Reading a Paper

This exercise is designed to expand my knowledge of Computer Science. I’ve found two direct benefits to reading papers. The first benefit is that some papers have changed my model of certain problems. A good example of this is The Tail at Scale. That paper examines the counterintuitive nature of long tail latency.

One interesting lesson I learned from the paper was how running a request on multiple machines affects latency. The authors looked at empirical data from a Google service. The service processes a request by distributing parts of the request across many different servers. They used the data to estimate what would happen if you distributed requests across 100 different servers. The authors found that if you measure the time it takes to get a response from all 100 servers, half of that time will be spent waiting for only the last five! This is because the slowest 5% of requests are that much slower than all the other requests. Intuitively, even if each individual server responds quickly 95% of the time, the chance that all 100 respond quickly at once is only 0.95^100, which is well under 1%. The paper also gives several approaches for reducing tail latency. I have found those approaches useful in my own work.

The other benefit I’ve found from reading papers is that they give me the knowledge to understand different systems as a whole. As an example, take Spanner, Google’s distributed database. Spanner uses many different techniques such as Paxos, two-phase commit, MVCC, and predicate locks. I’ve been able to build up an understanding of these different techniques through reading papers. In turn, this enables me to reason about Spanner as a whole and about the trade-offs Spanner makes compared to other systems.

I find most papers I read by either following the references of papers I have read or following up on a paper covered in the Morning Paper. The book Designing Data-Intensive Applications also has a lot of references to papers that are worth reading.

Learning a New Tool

One of the easiest ways to solve a problem is to use an existing tool that already solves it. For this exercise, I pick a tool and learn a bit about it. Usually I set up the tool locally, go through a few tutorials, and read a bit of the manual. Tools I’ve learned in the past range from command line utilities such as jq or sed to distributed systems such as Kafka or Zookeeper.

Learning the command line utilities helps me solve many common tasks more quickly than I otherwise could. Simple text processing is often easier with sed than it is with a programming language. Likewise, learning about different distributed systems comes in handy for understanding what tools work well for solving different problems. That way I know what tool I should reach for when faced with a given problem.
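
To give a flavor of the kind of one-liner that pays off, here's a sed command that extracts just the error messages from a log file (the file name and log format are made up for illustration):

 sed -n 's/^.*ERROR: //p' app.log

The -n flag suppresses sed's normal output and the trailing p prints only the lines where the substitution matched, so the command prints each error message with everything before it stripped off.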

Reading Several Chapters of a Book

I use books to supplement the knowledge I don’t get from reading a paper or learning a tool. The books I read cover a wide range of topics. Books I’ve read recently include:

  • Refactoring – I found this to be a great way to understand what good code should look like and how to turn bad code into good code.
  • Getting Things Done – I found this book to be helpful for prioritizing and keeping track of things. It helped me build a system to make sure I get done what’s important for me to get done.
  • The First-Time Manager – I recently became the team coordinator for my team at work. As team coordinator, my main responsibility is communicating with other teams when necessary. I also lead my team’s meetings. I found this book to be great for getting an understanding of basic management principles.

Recording My Screen

This exercise is my favorite. It’s the exercise that’s changed how I approach problems the most. It’s common for athletes to review footage of themselves to understand how they could have done better. I decided to apply the same approach to programming. Some of the lessons I’ve learned by recording my screen include:

  • It helps to test code as it’s written. Doing this reduces the amount of time it takes to debug code by reducing the time you spend locating where the bug is. If all your existing code is bug free, the bug has to be in the new code you just wrote.
  • When debugging a problem, adding functionality only for the purpose of debugging is often well worth the cost. As an example, one toy problem I worked on was writing an LRU cache. I had a bug where it wasn’t evicting the right elements. I was able to quickly determine what was wrong by adding a function to print the state of the cache. I could then see where the expected behavior of the cache differed from the actual behavior. This allowed me to quickly locate the bug.
  • Spending five minutes deciding on an approach up front before writing any code is worth it. I’ve found two benefits to doing this. It helps me make sure my approach is correct. More importantly, it forces me to decide on a single approach. By watching a recording of myself, I found I wasted a lot of time switching my code between two different approaches. In reality, either approach would have worked fine.

All these lessons are obvious in retrospect, but I had no clue that any of these were issues until I recorded my screen and saw where I was actually spending time.

The steps I take for this exercise are:

  1. Record myself working on some problem. This can either be a problem I worked on at work or a problem from a programming challenge website such as Leetcode.
  2. Go through the recording at 10x speed and annotate what I was doing at each moment.
  3. Tally how much time I spent in high level categories. How much time did I spend debugging some bug? How much time did I spend building some feature?
  4. Look at the categories I spent the most time in. Then dig into what actually took up that time.
  5. Come up with approaches that would have allowed me to save time. Often there are ways I could have structured my code up front that would have allowed me to write less code or find bugs earlier.

I highly recommend recording your screen. It’s one of the easiest ways to find small changes you can make to make yourself a lot more productive.


I’ve been doing this training regime over the past year. I’ve definitely noticed a huge difference. There’s a large amount of knowledge about systems and tools that I wouldn’t have developed otherwise. I’m also able to solve problems more quickly than I was able to before. I hope you reflect on these exercises and implement a few of them yourself.

Going forward, I’m going to start sharing what I find through my training regime. I’m going to start off by writing a blog post every time I do one of the exercises. I’ll share what I learned and found during the exercise. I think it would be beneficial for me to explain what I learned, as well as make a great learning resource for others.

Postgres Backups with Continuous Archiving

It’s extremely important that you have backups. Without backups, you are at serious risk of losing data. If you don’t have backups, all it takes is one machine failure or fat-fingered command to lose data.

There are several different ways to back up a Postgres database. In this post, we are going to look at backups done through an approach called “continuous archiving”. We’ll take a look at how continuous archiving works, how to set it up, and how to restore a backup created through continuous archiving.

One neat aspect of backups created through continuous archiving is they allow point in time recovery. Point in time recovery means you can use the backup to restore your database to any point in time you want. You are able to specify the exact moment to restore the backup to. Let’s say you accidentally ran DROP TABLE on the wrong table yesterday at 4:05pm. With point in time recovery, you can recover your database to 4:04pm, right before you dropped the table.

Let’s now take a look at how continuous archiving works. There are two pieces to continuous archiving backups: a base backup and a WAL archive. A base backup is a copy of the entire database at a specific moment in time. The WAL archive, on the other hand, is a list of all the diffs made to the database after the base backup was created. The WAL archive is made up of a series of WAL files. Each WAL file contains the diffs made during a small period of time. Altogether, the WAL files give you the diffs since the base backup was created. Not only is the WAL used here as part of continuous archiving, it is also used for recovery in case the database crashes. I’ve talked a bit about how the WAL helps when recovering from crashes here.

With a base backup and a WAL archive, it’s straightforward to restore the database to any point in time after the base backup was created. You first restore the base backup. This gives you the database as it was when the base backup was created. From there, you apply all of the WAL generated before the time you want to restore the database to. Since the WAL is a list of all the diffs made to the database, this brings the database up to the state it was in at that specific time.

Now that we have an idea of how continuous archiving works, let’s take a look at how to set it up. We’ll set it up using only the core functionality Postgres provides. If you are setting up backups for your own database, you should instead look at a tool like WAL-E or WAL-G. WAL-E and WAL-G wrap the core Postgres functionality and take care of a lot of the work for you. We’ll take a look at WAL-E in my next post.

Creating a base backup is straightforward. Postgres provides a command line tool, pg_basebackup. The pg_basebackup command creates a base backup of the specified Postgres instance and stores it in the location you specify.
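
For example, an invocation might look something like this (the host, user, and target directory here are placeholders):

 pg_basebackup -h db.example.com -U replication_user -D /mnt/server/basebackup -X stream -P

This connects to the given server, writes a copy of its data directory to /mnt/server/basebackup, streams the WAL generated while the backup runs (-X stream), and prints progress as it goes (-P).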

Setting up the WAL archive is a bit more involved. Postgres has a configuration parameter, archive_command. Every time Postgres generates a certain amount of WAL, it will create a WAL file. Postgres will then run the command in archive_command, passing the location of the WAL file as an argument. The archive_command needs to copy the WAL file and store it somewhere it can be retrieved later: in other words, in the WAL archive. The Postgres docs give the following example archive_command:

 test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f

When Postgres runs the command, it will replace %f with the name of the WAL file and %p with the full path to the existing WAL file. The command above first uses the test command to make sure it hasn’t already copied the specified WAL file. It then copies the WAL file from its current location to /mnt/server/archivedir, the location of the WAL archive. Every time Postgres generates a new WAL file, it will run archive_command and copy the file to /mnt/server/archivedir.
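
Wired into the configuration, the relevant postgresql.conf settings would look roughly like this (the wal_level value assumes Postgres 9.6 or later; older versions use 'archive'):

 wal_level = replica
 archive_mode = on
 archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'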

Restoring a Postgres backup is pretty much the opposite of how you create it. First you need to restore the base backup. To do so, you clear out the Postgres data directory to wipe the existing Postgres database. Then you copy the files generated by pg_basebackup to the data directory. This will restore the base backup.

Then, to restore the WAL, you write a recovery.conf file. A recovery.conf file specifies a few parameters, the two most important of which are recovery_target_time and restore_command. recovery_target_time is the time you want to restore the backup to. The other parameter, restore_command, is the opposite of archive_command. Instead of copying the WAL file from Postgres to the WAL archive, restore_command copies the WAL file from the archive to the specified Postgres directory. Whenever Postgres needs a specific WAL file, it will run restore_command with the name of the file it needs. Here’s an example restore_command from the Postgres docs that is the inverse of the archive_command above:

 restore_command = 'cp /mnt/server/archivedir/%f %p'

This command copies the WAL file from /mnt/server/archivedir to the location where Postgres keeps WAL files.
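
Putting the pieces together, a minimal recovery.conf might look like this (the timestamp is just an example, chosen to land right before the accidental DROP TABLE mentioned earlier):

 restore_command = 'cp /mnt/server/archivedir/%f %p'
 recovery_target_time = '2018-06-04 16:04:00'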

Once you have copied the base backup and set up the recovery.conf file, all you need to do is start Postgres. Postgres will detect that you want to restore a backup and automatically run restore_command until it has restored up to the point you specified! This will give you a copy of your database at exactly the time you requested. If you want a more detailed walk through of how to set up a Postgres backup, the docs have a pretty good one.

I honestly find it incredible that you can restore a Postgres backup to an arbitrary point in time. The approach I’ve discussed above is only one way to set up continuous archiving. Because it uses only the bare Postgres functionality, you still need to do a lot of work on your end to configure it how you want. In my next post, we’ll take a look at WAL-E. WAL-E wraps the Postgres functionality and makes it much easier to set up continuous archiving. It even has functionality for storing backups in the cloud storage system of your choice!

There is More to Programming Than Programming Languages

I hear the question all the time: “What programming languages should I be learning?”. I’ve asked this question many times myself in the past. When you are first getting started programming, it seems like becoming a good programmer is all about the number of programming languages you know. Several years later, I now realize this is not the case. If you really want to become a better programmer, instead of focusing on learning more programming languages, you should focus on the other aspects of programming.

What it Means to Learn a Programming Language

There are many different pieces to a programming language you have to learn before you really know the language. For the purposes of this post, I will separate them out into two different categories:

Language Syntax: Every language has its own way of writing if statements, for loops, function calls, etc. Obviously you need to be familiar with how to write these constructs before you can be proficient in a programming language. Although learning the syntax of a language largely means ingraining the syntax into muscle memory, knowing the syntax for a language is still a necessary step in learning it.

Language Concepts: Beyond the basic syntax of the language, every programming language has its own features and concepts that separate it from other programming languages. Python has constructs such as decorators and context managers. Lisp based languages provide code generation through macros as a first class feature. As you learn a programming language, you become aware of these features, what each of them does, and how to use them. The cool thing about learning more concepts is that it teaches you new ways to approach problems, but more on that later.

The Benefits of Learning More Programming Languages

As I see it, there are two main advantages to learning more programming languages. First, you have the advantage of being able to quickly ramp up on any project that uses a language you know. If you know Python, you can start contributing to a codebase in Python almost immediately. While this is an advantage, it is only a slight one. It usually takes just a few weeks for a programmer to become proficient in a new programming language. If this is your reason for wanting to learn many different languages, it makes more sense to wait until a project you want to work on actually forces you to learn a new language, since it won’t take much time to ramp up anyway.

The other, more interesting advantage of learning more programming languages is that each new language introduces you to new ways of approaching problems. This is mainly because as you learn a language, you learn the specific features the language provides and how to use them to approach problems. In a way, learning a new language “stretches your mind” and allows you to think about problems in new ways. As an example, if you learn a Lisp derived language, you will learn how to use code generation to approach problems. Being familiar with code generation then allows you to recognize when it is the best approach to a problem. As you learn more programming languages, you learn more approaches to solving problems. This enables you to choose the best approach for solving a problem from a larger repertoire of options.

What Really Matters

Even though learning more programming languages does give you more ways to approach problems, ultimately the choice of approach for solving a particular problem doesn’t matter much. What is vastly more important than the number of approaches to problems you know is the space of problems you know how to solve. A good programmer isn’t valuable because they can solve the same set of problems every other programmer can solve, just in different ways. A good programmer is valuable because they can solve problems that other programmers cannot.

Most programming languages and styles are designed with a similar purpose in mind: making it easy to express algorithms and procedures in a way a computer can execute them. While some programming styles are better than others at expressing certain procedures, they only make solving the particular problem at hand slightly easier. For the most part, any problem that can be solved in an object oriented style can also be solved in a functional style and vice versa. Knowing just one programming language and one programming style enables you to solve the vast majority of problems solvable in any programming language or style. In the larger picture, the particular choice of programming language and approach is mostly an implementation detail.

If you really want to become a better programmer and broaden the space of problems you can solve, instead of focusing on programming languages, you should focus on the parts of programming that actually enable you to solve more problems. Here are a few examples of areas outside of programming languages that having knowledge of broadens the space of problems you can solve:

  • Operating Systems
  • Web Development
  • Distributed Systems
  • Networking
  • Algorithms
  • Security

As an example, you should definitely learn how to set up and use a database. Nearly every non-trivial program uses some kind of database to keep track of application data. Why? Because databases solve problems around handling data that are extremely difficult to solve in pretty much any programming language. To give a few examples, many databases:

  • Can work with more data than you can work with in your typical programming language.
  • Guarantee data won’t be lost, even if the power goes out.
  • Make it impossible for your data to get into an inconsistent state.

If you are familiar with how to work with a database, you can solve all of these problems with little to no effort by just setting up a database. No matter what programming language or programming paradigm you are using, you would much rather have the database handle these problems for you. Ultimately, learning how to use a database enables you to solve way more problems than learning another programming language does.

This applies just as much to the other areas listed. Each one of them enables you to solve more problems than you would be able to otherwise. How are you supposed to build a website without knowing about web development? How are you supposed to write an application that can handle a machine crashing without knowledge of distributed systems? For each of these areas you learn, you become able to solve problems you wouldn’t be able to before.

This is completely unlike learning more programming languages or paradigms, which for the most part are interchangeable. There just isn’t enough distinction between most programming languages and styles for learning a new one to enable you to solve many more problems than you could before. At this point, which would you rather learn: how to solve the same problems in more ways, or how to solve more problems?

How to Get a Programming Job Straight Out of High School

While I was in high school, I looked at the possible options ahead of me. The obvious option was to go on to college, but I was also considering another: getting a job as a programmer and skipping college altogether. Given the number of people who have skipped college and made their way directly into the tech industry, I’m surprised at how little is written about what it’s like to go through the process. In this post, I explain, for people thinking about getting a job straight out of high school, what the process is like and what obstacles you’ll run into along the way.

The first thing I will say is that it’s definitely possible to get a programming job without a college degree. I did it myself and wound up with offers from numerous companies including Google, Uber, and Dropbox, as well as several lesser-known, smaller companies. Second, if you do decide to look for a job out of high school, I highly recommend applying to college and deferring for a year. Deferring will guarantee you have the option to go to school in case your job search doesn’t pan out.

As for actually getting a programming job straight out of high school, there are two main hurdles you will run into. First, you have to get companies to interview you; this stage will be especially tough for you. Second, once a company agrees to interview you, you will be put into its interview process. For most companies, this consists of several phone calls and an in-person interview at the end. In each round you will be asked to solve one or more programming-related problems. In some interviews you will only be asked to describe a solution; in many others you will also be asked to write code for a working solution. Once you make it through a company’s interview process, you will find yourself with a job offer.

It’s important to emphasize that, for the most part, these two stages are completely separate. Once you get your foot in the door, how good your application looks no longer matters. You could have the best application in the world, but if you fail a company’s interview, they will still reject you.

Hurdle #1: Getting Companies to Interview You

There are many different approaches you can take to get companies to interview you and there is a lot of surprisingly bad advice out there. Here’s a list of several different approaches you can try and whether or not they work informed by my personal experience. Let’s start with the approaches that I found not to work:

What Does Not Work

Applying to a Company Directly

If you submit an application directly to a company, in most cases your resume will be sent to a recruiter. The recruiter will look at your resume and decide whether it’s worth the company’s time to interview you. Usually when a recruiter looks at your application, they will first look for one of two things: a college degree from a well known school or prior experience at a well known tech company. Since you have neither of those things, you are almost guaranteed to be rejected before you even start interviewing. You may hope that some company will take the risk and decide to interview you even though you are straight out of high school, but as far as I know, no company actually does this.

When I tried applying directly to several companies, I did not make it into the interview process at any of them. That was in spite of the fact that I had a Knuth check, several interesting side projects, and a win in the Illinois statewide WYSE competition in Computer Science.

Contributing to Open Source

For some reason a lot of advice suggests that if you want to get a programming job, you should contribute to some open source software projects. As I see it, there are two things you can hope to gain out of working on open source:

  1. It looks good on your resume and you think it will help you pass the application process.
  2. You hope companies will see your open source contributions and reach out to you directly.

Regarding 1, as mentioned above, most recruiters first look for either a degree from a well known school or prior experience at a well known company. Without either of those, any open source contributions you have won’t matter. Most companies don’t even take a look at what open source projects you’ve worked on.

As for 2, in general companies only reach out to open source contributors who are core developers of a well known open source project. While it is possible to get interviews this way, the amount of effort you would have to put in is far more than it’s worth. The approaches I recommend below will get you interviews with a lot less effort.


As for approaches that do work, here are two I’ve found effective. Both try to bypass the traditional resume screen and get you straight into companies’ interview processes.

What Does Work

Applying Through a Company that does Recruiting

There are a number of companies that will take care of the application process for you. Most of the job offers I got out of high school came through the Recurse Center. The Recurse Center is a three month long program in New York where programmers work together on whatever projects they want. Although helping attendees find jobs isn’t the purpose of the Recurse Center, they do help all attendees looking for a job find one. After my time at the Recurse Center ended, I started looking for a job. The Recurse Center reached out to companies on my behalf and was able to get me straight into some companies’ interview pipelines. Even with the Recurse Center reaching out on my behalf, only about one in three companies decided to interview me, but that was enough for me to get several job offers.

Triplebyte is another option here. Triplebyte is a recruiting company with its own interview process; if you pass it, Triplebyte will send you straight to the last round of interviews at other companies. If you are really good at technical interviews (described below), you should be able to pass Triplebyte’s process and get interviews with other companies from there.

Networking

Another approach I’ve found successful is networking. If you have a friend at a company you want to work for and you can get them to refer you, they should be able to get you past the resume screen. Unfortunately, since you are straight out of high school, you probably don’t have much of a network, so this option probably isn’t viable for you. If you don’t already have a network, it’s probably not worth building one just to get a job. In that case, you should apply through a company that handles the application process for you, as described above, because it will take a lot less effort on your part.

Hurdle #2: Passing the Interview

Once you make it into a company’s interview process, all you need to do to get a job offer is perform well in their interviews. Programming interviews today are very similar across most companies. Usually a company will ask you several algorithmic problems, and solving them is what it takes to succeed in their interview. You may also be asked to write code that implements your solution.

An example of a question you may run into is: “Write a function that takes as input a number N and returns the number of different ways to make change for N cents.”
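As a sketch of what an answer might look like, here is the classic dynamic programming solution to that question (I’m assuming standard US coin denominations here; an interviewer would specify theirs):

```python
def count_change(n, coins=(1, 5, 10, 25)):
    """Count the ways to make n cents from the given denominations.

    ways[i] is the number of ways to make i cents using only the
    coins considered so far. Runs in O(n * len(coins)) time.
    """
    ways = [0] * (n + 1)
    ways[0] = 1  # one way to make 0 cents: use no coins
    for coin in coins:
        for amount in range(coin, n + 1):
            ways[amount] += ways[amount - coin]
    return ways[n]

print(count_change(25))  # 13 ways with US coins
```

Being able to write something like this, and explain why iterating over coins in the outer loop counts combinations rather than orderings, is roughly the level of fluency these interviews are checking for.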

If you aren’t familiar with algorithm problems and how to solve them, there are tons of resources available. The book Cracking the Coding Interview walks you through lots of different interview problems and their solutions. If you want a comprehensive resource that covers everything you need to know, I really like the book Introduction to Algorithms. At over 1,300 pages it is a really long book, but it covers everything you will need to solve algorithm-related interview problems.

If you want to practice solving interview problems, the website LeetCode has many different algorithmic coding problems. The easy problems in the array section are about the difficulty you should expect in a programming interview. For any of those problems, you should be able to implement a working solution and explain its runtime (in terms of big O notation) in under 45 minutes.

In general, you should be familiar with the following:

  • What big O notation is and how to apply it. Given a piece of code or an algorithm, you should be able to easily determine its runtime complexity and explain why it has that complexity (see the short sketch after this list).
  • All of the basic data structures (arrays, linked lists, hash tables, heaps, binary trees). For each data structure, you should have all of the operations and the runtime of each operation memorized.
  • Basic algorithms (breadth-first search, depth-first search, quicksort, mergesort, binary search) and their runtimes.
  • Dynamic programming. Dynamic programming is an algorithmic technique for solving various algorithm problems, like the change-counting question above. I’m not sure why, but tons of companies ask problems that can be solved with dynamic programming.
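To make the first point concrete, here is the kind of toy snippet you might be handed, along with the analysis you’d be expected to give (an example of my own, not from any particular interview):

```python
def has_duplicate_slow(items):
    # Nested loop over all pairs: O(n^2) time, O(1) extra space.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_fast(items):
    # One pass with a hash set: O(n) average time, O(n) extra space.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

An interviewer will often show you (or have you write) the slow version and then ask how to do better, so practice spotting the time/space trade-off, not just naming the complexity.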

There are a few companies that have started to move away from algorithm related interview problems, but the vast majority still ask them. You should mainly focus on learning how to solve algorithm problems and that should give you enough to pass the interview process at many different companies.


That’s really all there is to getting a job straight out of high school. It boils down to two things: getting your foot in the door and getting really good at algorithm problems. If you can do both, you will start getting job offers fairly quickly. It isn’t impossible to get a programming job straight out of high school; you just need to work for it.

How to Improve Your Productivity as a Working Programmer

For the past two months, I’ve been obsessed with improving my productivity. During this time, I’ve continuously monitored the amount of work I’ve been getting done and experimented with changes to make myself more productive. After only two months, I can now get significantly more work done than I previously could in the same amount of time.

If you had asked me my opinion on programmer productivity before I started this process, I wouldn’t have had much to say. After looking back and seeing how much more I can get done, I now think that understanding how to be more productive is one of the most important skills a programmer can have. Here are a few changes I’ve made over that time that have had a noticeable impact on my productivity:

Eliminating Distractions

One of the first and easiest changes I made was eliminating as many distractions as possible. Previously, I would spend a nontrivial portion of my day reading through Slack/email/Hacker News. Nearly all of that time could have been spent far more effectively on actually getting my work done.

To eliminate as many distractions as possible, I first broke my habit of pulling out my phone whenever I got a marginal amount of work done. Now, as soon as I take my phone out of my pocket, I immediately put it back in. To make Slack less of a distraction, I left every Slack room that I did not derive immediate value from. Currently I’m only in a few rooms that are directly relevant to my team and the work I do. In addition, I only allow myself to check Slack at specific times throughout the day: before meetings, before lunch, and at the end of the day. I specifically do not check Slack when I first get into the office; instead I immediately get started working.

Getting into the Habit of Getting into Flow

Flow is that state of mind where all of your attention is focused solely on the task at hand, sometimes referred to as “the zone”. I’ve worked on setting up my environment to maximize the amount of time I’m in flow. I moved my desk over to the quiet side of the office and try to set up long periods of time where I won’t be interrupted. When I want to get into flow, I’ll put on earmuffs, close all of my open tabs, and focus all of my energy on the task in front of me.

Scheduling My Day Around When I’m Most Productive

When I schedule my day, there are now two goals I have in mind. The first is to group all of my meetings together, to maximize the amount of time I can spend in flow. The worst possible schedule I’ve encountered is several meetings, each 30 minutes apart: 30 minutes isn’t enough time for me to get any significant work done before being interrupted by the next meeting. By scheduling my meetings back to back instead, I go straight from one to the next. This leaves me fewer, larger blocks of time in which I can get into flow and stay there.

The second goal I aim for is to arrange my schedule so I am working at the times of the day when I am most productive. I usually find myself most productive in the mornings. By the time 4pm rolls around, I am typically exhausted and have barely enough energy to get any work done at all. To reduce the effect this had on my productivity, I now schedule meetings specifically at the times of the day when I’m least productive. It doesn’t take a ton of energy to sit through a meeting, and scheduling my day this way allows me to work when I’m most productive. Think of it this way: if I can move a single 30-minute meeting from the time of day when I’m most productive to the time when I’m least productive, I’ve just added 30 minutes of productive time to my day.

Watching Myself Code

One incredibly useful exercise I’ve found is to watch myself program. Throughout the week, I have a program running in the background that records my screen. At the end of the week, I’ll watch a few segments from the previous week, usually the ones where completing some task felt like it took a lot longer than it should have. While watching them, I’ll pay attention to exactly where the time went and figure out what I could have done better. When I first did this, I was really surprised at where all of my time was going.

For example, previously when writing code, I would write all of the code for a new feature up front and then test it all collectively. Testing this way meant that for each bug, I first had to isolate which function it was in and then debug that individual function. After watching a recording of myself writing code, I realized I was spending about a quarter of the total time implementing the feature just tracking down which functions the bugs were in! This was completely non-obvious to me, and I wouldn’t have discovered it without recording myself. Now that I know how much time I was spending isolating bugs, I test each function as I write it to make sure it works. This allows me to write code a lot faster, as it dramatically reduces the amount of time it takes to debug my code.
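As a minimal sketch of what “test each function as I write it” can look like, here’s a made-up function with the kind of quick checks I mean (this is illustrative, not code from work):

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Check the function immediately after writing it, before moving on,
# so any bug is caught while the code is still fresh in my head.
assert median([3, 1, 2]) == 2
assert median([4, 1, 2, 3]) == 2.5
```

The checks take seconds to write, and if one fails, I already know which function the bug is in.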

Tracking My Progress and Implementing Changes

At the end of every day, I spend 15 minutes thinking about my day. I think about what went right, as well as what went wrong and how I could have done better. At the end of the 15 minutes, I’ll write up my thoughts. Every Saturday, I’ll reread what I wrote for the week and implement changes based on any patterns I noticed.

As an example of a simple change that came out of this, previously on weekends I would spend an hour or two every morning on my phone before getting out of bed. That was time that would have been better used doing pretty much anything else. To eliminate that problem, I put my phone far away from my bed at night. Then when I wake up, I force myself to get straight into the shower without checking my phone. This makes it extremely difficult for me to waste my morning in bed on my phone, saving me several hours every week.

Being Patient

I didn’t make all of these changes at once; I only introduced one or two of them at a time. If I had tried to implement all of these changes at once, I would have quickly burned out and given up. By introducing each change slowly, I was able to make far more changes stick. It only takes one or two changes each week for things to quickly snowball. After only a few weeks, I’m significantly more productive than I was previously. Making any amount of change at all is a lot better than no change. I think Stanford professor John Ousterhout’s quote describes this aptly. In his words, “a little bit of slope makes up for a lot of y-intercept”.