As part of my Getting Dramatically Better as a Programmer Series, I’m writing up a few chapters of the book I’m currently reading, along with my takeaways. The book is Debugging by David Agans. It covers Agans’ “nine golden rules of debugging”. Each rule is a general guideline that makes debugging easier, and each chapter covers one of the nine rules.
Today, I’m going to go through chapters 5 and 6. Chapter 5 covers the rule “Quit Thinking and Look” and chapter 6 covers the rule “Divide and Conquer”.
Chapter 5: Quit Thinking and Look
There are two main points Agans makes in this chapter:
- If you try to guess where the bug is, you will likely guess wrong. Instead of guessing where the bug is, you should look until you can “see” the bug.
- You should add lots of instrumentation throughout your code to make it easier to see the bug.
Throughout my time as a programmer, I’ve been amazed at how many times people, including myself, incorrectly guess where a bug is. One of my favorite examples is when one of my friends was making a simple 2D game. The game wound up being incredibly slow. My friend thought the slow part must have been the collision detection. He was beginning to think about how to optimize the collision detection when I suggested that he hook up a profiler and actually see where the problem was.
My friend reluctantly hooked up a profiler to his program and found, lo and behold, that collision detection was not the problem. It turned out the program was reloading the image files for the sprites on every frame. After he changed the program so it only loaded the sprites once, the game moved a lot more smoothly.
My friend had no evidence that the problem was what he thought it was. He was only relying on his intuition. Once he hooked up a profiler, he was able to see exactly where the real problem was.
If my friend had gone through and optimized the code for collision detection, he would have wasted time optimizing the code only to discover the issue was somewhere else. This is why it’s important to gather evidence that the bug is where you think it is before you attempt to fix the problem. The phrase “measure twice, cut once” comes to mind. You should be able to “see” the bug before you start to fix it.
This is one reason why profilers are so great. They let you see exactly where the slow part of your program is. There’s no need to guess at which part of your program is slow when you have access to a profiler.
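To make this concrete, here is a minimal sketch of that kind of measurement using Python’s built-in cProfile module. The game code is entirely made up for illustration: `load_sprite`, `detect_collisions`, and `render_frame` are hypothetical stand-ins, not my friend’s actual code. The point is only that the profiler report shows where the time goes, so there is nothing to guess.

```python
import cProfile
import io
import pstats


def load_sprite(name):
    # Hypothetical stand-in for reading and decoding an image file.
    return [i * i for i in range(20000)]


def detect_collisions(objects):
    # Hypothetical stand-in for collision detection; cheap in this sketch.
    return [(a, b) for a in objects for b in objects if a < b]


def render_frame():
    # The bug: sprites are reloaded on every frame instead of once at startup.
    sprites = [load_sprite(n) for n in ("player", "enemy", "tile")]
    detect_collisions(range(10))
    return len(sprites)


def run_game(frames=30):
    for _ in range(frames):
        render_frame()


def profile_game():
    """Run the game loop under the profiler and return the text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_game()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(8)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_game())
```

In a report like this, the sprite loading function shows up near the top of the cumulative time column, which is the kind of evidence you want before changing any code.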
This is also why you should add instrumentation throughout your program. The more data you collect about your program, the easier it is to see the bugs inside it.
One of the most useful pieces of code I’ve ever written was a tool I wrote for work called “perf trace”. For my day job, I’m responsible for optimizing the performance of a 1 PB distributed database. For every query that runs in our database, perf trace gathers information about the different stages of the query and how long each stage took. This data gives us a clear sense of which queries are slow and why. Perf trace makes it easy to debug slow queries because we can see exactly how much time each query spent at each stage of query execution.
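The general idea of recording per-stage timings can be sketched in a few lines. This is not the real perf trace, just a rough illustration of the technique: the `StageTrace` class, the stage names, and the sleeps standing in for real work are all assumptions I made up for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTrace:
    """Records how long each named stage of a single query took."""

    def __init__(self, query_id):
        self.query_id = query_id
        self.durations = defaultdict(float)  # stage name -> seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised an exception.
            self.durations[name] += time.perf_counter() - start

    def slowest_stage(self):
        return max(self.durations, key=self.durations.get)


def run_query(trace):
    # Hypothetical stages; the sleeps stand in for real work.
    with trace.stage("parse"):
        time.sleep(0.001)
    with trace.stage("execute"):
        time.sleep(0.05)
    with trace.stage("serialize"):
        time.sleep(0.001)


trace = StageTrace("q1")
run_query(trace)
for name, seconds in trace.durations.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
print("slowest stage:", trace.slowest_stage())
```

Wrapping each stage in a context manager keeps the timing code out of the way of the actual logic, and the recorded durations can then be logged or aggregated across many queries.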
We also use perf trace for understanding what the common causes of slow queries are. This allows us to determine what optimizations will improve query performance the most. If you want to read more about perf trace, I wrote a blog post about it that’s on my company’s blog.
Chapter 6: Divide and Conquer
The main point of chapter 6 is that “Divide and Conquer” is one of the most efficient ways to locate a bug. This rule goes hand in hand with “Quit Thinking and Look”. Quit Thinking and Look tells you to keep looking for the bug until you can see it. Divide and Conquer is the method you should use to locate it.
The idea behind divide and conquer is simple. Try to find the general region where the bug is. Then repeatedly narrow down the location of the bug into smaller and smaller regions until you can see the bug. This is the same idea behind binary search, only applied to debugging.
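Here is a small sketch of the binary search idea applied to debugging. It assumes you have an ordered list of candidates (for example, revisions) where everything before some point is good and everything from that point on is bad, and that the last candidate is known to be bad. The `first_bad` function and the `is_bad` check are hypothetical names for this example; in practice `is_bad` would be something like running a test that reproduces the bug.

```python
def first_bad(candidates, is_bad):
    """Return (index of the first bad candidate, number of checks made).

    Assumes candidates are ordered so that all good ones come before all
    bad ones, and that the last candidate is known to be bad.
    """
    lo, hi = 0, len(candidates) - 1  # invariant: candidates[hi] is bad
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_bad(candidates[mid]):
            hi = mid  # the first bad candidate is at mid or earlier
        else:
            lo = mid + 1  # the first bad candidate is after mid
    return lo, checks


# Made-up example: 1000 revisions, with the bug introduced in revision 613.
revisions = list(range(1000))
index, checks = first_bad(revisions, lambda rev: rev >= 613)
print(index, checks)
```

With 1000 candidates, this takes at most 10 checks to find the first bad one, because each check cuts the remaining region in half.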
At my job, a common issue is that query times go up for some period of time. When that happens, I have to investigate and figure out exactly why query times went up. I usually use the perf trace tool I mentioned above to help debug the problem.
When I have to look into why queries were slow, the first thing I do is try to figure out what area of the stack the increase in query times came from. Did it come from the web server? Did it come from the master database? Did it come from one of the worker databases?
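As a rough sketch of this first step, you can compare the average time spent in each part of the stack during a normal period against the slow period, and see which part grew the most. The numbers below are invented purely for illustration, and `biggest_regression` is a hypothetical helper, not part of any real tool.

```python
def biggest_regression(baseline, incident):
    """Return (component, increase) for the component whose time grew the most."""
    increases = {
        name: incident[name] - baseline.get(name, 0.0) for name in incident
    }
    worst = max(increases, key=increases.get)
    return worst, increases[worst]


# Invented average times in milliseconds per component, for illustration only.
baseline_ms = {"web server": 12.0, "master database": 30.0, "worker databases": 85.0}
incident_ms = {"web server": 13.0, "master database": 140.0, "worker databases": 90.0}

component, increase = biggest_regression(baseline_ms, incident_ms)
print(component, increase)  # prints: master database 110.0
```

Once one component stands out like this, the other parts of the stack can be set aside and the search continues inside that component.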
Once I narrow down the query time increase to a specific area of our stack, I try to figure out the high-level reason that part of the stack got slower. Did we max out CPU on a machine for some reason? Did we exceed the maximum number of queries we can run on a single server at a time?
Then, once I determine the high-level reason query times went up, I can start looking for the root cause. An issue that came up before was that a job didn’t have a limit on the number of queries it ran in parallel. When the job ran, it sent many more queries to our database than we normally process. Processing so many queries in parallel caused a spike in CPU usage on our master database. That in turn made all queries slower during the master database part of the query.
Divide and conquer is a general and efficient way to find the problem when dealing with issues like this. By applying divide and conquer when doing investigations, I’m able to make consistent forward progress until I can find the issue.
As an aside, one of the least effective ways to debug problems is to look through the code line by line until you find the bug. I’ve done this myself in the past and I’ve seen a lot of other people do it too. This approach to debugging is bad for several reasons. For one, it’s inefficient. It’s more of a linear search than a binary search because you have to read a large chunk of code before you have any chance of finding the bug. It’s also likely that the bug is non-obvious. Whoever wrote the bug probably didn’t see it, so it’s unlikely that you’ll be able to spot it by just looking at the code. By reproducing the bug and following divide and conquer, you can find the bug much more quickly than if you were to read all the relevant code line by line.
Of course, it’s important to have a good understanding of the code base and how it works. If you understand the system in advance, that will help you debug it. Even so, walking through the system’s code line by line is an inefficient way to debug.
To recap, the two rules covered are:
- Quit Thinking and Look – Instead of guessing where the bug is, you should produce evidence that the bug is where you think it is.
- Divide and Conquer – Try to find the general region in your code where the bug is. Then narrow down the region it can be in until you are looking at the bug.
I find both of these to be solid pieces of advice. I’m definitely going to keep them in mind the next time I’m debugging a problem.