Of the exercises I do as part of Getting Dramatically Better as a Programmer, my favorite is this one: every week I record my screen and later review the footage. I’ve found this to be a great way to understand where I’m wasting time and how I could solve the same problem more efficiently.
In this write-up I’m going to go over one segment of footage of myself. In particular, I’m going to share:
- The problem I was working on.
- A breakdown of where I spent my time when working on the problem.
- General techniques I’ve come up with that would have reduced the amount of time it took to solve the problem.
Since I reviewed myself doing a problem at work, I’m unable to share the raw footage. I am able to share my notes on what I was doing at each point in time which you can see here.
Doing this write-up serves two purposes: it forces me to explain what I learned, and it lets others see those lessons for themselves.
The footage I reviewed was of myself doing an investigation at work. For several minutes, our p95 query latency went up by an order of magnitude. I’m usually the person responsible for investigating these kinds of problems and coming up with a solution to prevent them in the future.
These are the kinds of queries that were slow for a brief period of time.
Since performance is so important to Heap, we’ve built a plethora of tools for analyzing performance. One of the most powerful is a tool called “Perf Trace”. Perf Trace gathers information about every query run against our database. It records the query plan used to execute the query, as well as how much time was spent in each stage of the query. It then writes this data to a SQL database, so the data itself can be queried with SQL. If you want to read more about Perf Trace, I wrote a blog post for Heap that covers it in depth.
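Heap’s actual Perf Trace schema isn’t public, so here’s a minimal sketch of the kind of analysis this enables, assuming a hypothetical `perf_trace` table with one row per query stage (SQLite standing in for the real database):

```python
import sqlite3

# Hypothetical schema: one row per stage of each executed query.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE perf_trace (
        query_id    TEXT,
        stage       TEXT,   -- e.g. 'master', 'worker', 'merge'
        duration_ms REAL,
        started_at  TEXT    -- ISO-8601 timestamp
    )
""")
conn.executemany(
    "INSERT INTO perf_trace VALUES (?, ?, ?, ?)",
    [
        ("q1", "master", 40.0,  "2020-01-01T12:00:05"),
        ("q1", "worker", 200.0, "2020-01-01T12:00:05"),
        ("q2", "master", 900.0, "2020-01-01T12:01:10"),  # slow master stage
        ("q2", "worker", 210.0, "2020-01-01T12:01:10"),
    ],
)

# Total time per stage during the incident window: a spike in the
# 'master' stage points at the master database as the bottleneck.
rows = conn.execute("""
    SELECT stage, SUM(duration_ms) AS total_ms
    FROM perf_trace
    WHERE started_at BETWEEN '2020-01-01T12:00:00' AND '2020-01-01T12:05:00'
    GROUP BY stage
    ORDER BY total_ms DESC
""").fetchall()
print(rows)  # [('master', 940.0), ('worker', 410.0)]
```

With the data in SQL, “where did the extra time go?” becomes a single `GROUP BY` rather than a manual trawl through logs.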
One of the most common uses of Perf Trace is debugging issues just like this one.
Where the Time Went
In total, it took 25 minutes to determine the root cause and write it up. The root cause was a group of computationally expensive queries being run together.
To be more specific, there is a specific type of query Heap supports that consumes a lot of resources on our master database. Every query we run goes through our master database, so if it becomes overloaded, the amount of time it spends processing each query goes up, and every query slows down.
I determined this was the root cause from two symptoms. According to Perf Trace, the increase in query time came from queries spending more time in our master database. I also looked at the CPU usage of our master database and noticed a large spike around the time of the incident.
In finding the root cause, most of my time was spent in one of the following three categories:
- 9 minutes – Looking at various pieces of data. This includes both Perf Trace data and various dashboards.
- 8 minutes – Writing or debugging SQL queries. The main SQL query I wrote was for looking at the Perf Trace data for queries around the time of the incident.
- 5 minutes – Writing up the results in a document.
Based on the footage, there are two main takeaways I have:
- When looking at data, I should have a clear sense of what I’m looking for.
- I should have a repository of common SQL snippets I use.
For the first point, I noticed I spent a lot of time scrolling through the data without looking for anything in particular. That mindless scrolling wasted a lot of time and turned up nothing particularly interesting. If I had started with a clear sense of what I was looking for, I could have spent far less time digging through the data.
As for the second point, the SQL query I wrote for analyzing the Perf Trace data is similar to many queries I’ve written in the past, yet I still spent a fair amount of time writing it and debugging issues with it. If I kept a repository of common SQL snippets, I wouldn’t need to rewrite these queries from scratch: the repository would already contain a version that worked and had been debugged.
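A snippet repository doesn’t need to be fancy. Here’s a minimal sketch: debugged SQL templates keyed by name, with named placeholders filled in per investigation. The schema and snippet names are illustrative, not Heap’s actual queries.

```python
import sqlite3

# Reusable, already-debugged SQL templates keyed by name.
SNIPPETS = {
    # Total time per stage within a time window.
    "stage_totals": """
        SELECT stage, SUM(duration_ms) AS total_ms
        FROM perf_trace
        WHERE started_at BETWEEN :start AND :end
        GROUP BY stage
        ORDER BY total_ms DESC
    """,
    # The slowest individual queries within a time window.
    "slowest_queries": """
        SELECT query_id, SUM(duration_ms) AS total_ms
        FROM perf_trace
        WHERE started_at BETWEEN :start AND :end
        GROUP BY query_id
        ORDER BY total_ms DESC
        LIMIT 10
    """,
}

def run_snippet(conn, name, **params):
    """Run a stored snippet, passing keyword args as named SQL parameters."""
    return conn.execute(SNIPPETS[name], params).fetchall()

# Demo against a throwaway in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perf_trace "
             "(query_id TEXT, stage TEXT, duration_ms REAL, started_at TEXT)")
conn.execute("INSERT INTO perf_trace VALUES "
             "('q1', 'master', 900.0, '2020-01-01T12:01:00')")
top = run_snippet(conn, "stage_totals",
                  start="2020-01-01T12:00:00", end="2020-01-01T12:05:00")
print(top)  # [('master', 900.0)]
```

Even a plain text file of pasted queries would capture most of the benefit; the point is that each snippet only has to be debugged once.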