Debugging your horrible production issue – a short motivational guide.
Author: Andy MacDonald
You’ve been tasked with finding a bug that has reared its ugly head in a production system.
You’ve been staring at the keyboard for hours now with growing doubts in your mind about your suitability for a software engineering career; you have drawn a complete blank.
You feel like a fraud, an impostor.
Relax. You aren’t the impostor here—the bug is.
For years, it’s been happily impersonating a functioning part of the system, but now you’re on the trail of this fake, this fraud, this impostor, and you won’t give in.
So where do you start?
Reproduce the Problem
At an absolute minimum, you need to be able to take the problem and confidently reproduce it. Again and again and again.
Without this very simple step, you can’t test whether a change you’ve introduced has had any impact, you can’t expose vital clues from the system, and you can’t prove to anyone else that the problem has been resolved.
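One way to make “again and again” concrete is to wrap your reproduction steps in a small script and count how often the bug appears. The sketch below is only an illustration: `parse_quantity` is a hypothetical buggy function standing in for your real system, and `reproduction_rate` and `bug_reproduced` are made-up helper names.

```python
from typing import Callable


def reproduction_rate(attempt: Callable[[], bool], runs: int = 20) -> float:
    """Run a reproduction attempt repeatedly; return the fraction of runs that showed the bug."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    hits = sum(1 for _ in range(runs) if attempt())
    return hits / runs


def parse_quantity(raw: str) -> int:
    # Hypothetical buggy code under investigation: it rejects
    # otherwise valid input that has surrounding whitespace.
    if not raw.isdigit():
        raise ValueError(f"not a quantity: {raw!r}")
    return int(raw)


def bug_reproduced() -> bool:
    # One reproduction attempt: True means the bug was observed.
    try:
        parse_quantity("42 ")
    except ValueError:
        return True
    return False
```

Calling `reproduction_rate(bug_reproduced, runs=10)` gives you a number you can compare before and after each change. A rate below 1.0 is a clue in itself: it suggests the bug depends on timing, data, or state that your steps don’t yet control.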
Reproduce the Problem in an Environment You Can Control
So you can reproduce the problem? That’s a good start. But if you can only reproduce the problem in production, you’re going to be a bit constrained in the approaches you can use to narrow down the source.
Hopefully, you’re in a situation where your organisation has a staging, UAT, pre-production, or test environment that somewhat resembles production.
Use it to your advantage and try to reproduce your problem there. This opens up approaches that give you more visibility and control than you have in production, or that you could only get in production slowly.
Here are a few examples:
- Increase the logging levels of applications.
- Attach a remote debugger to applications.
- Ability to release modifications of application code outside of a regular quality gate to trial changes quickly. (I think this should only be exercised in times of urgent need, and you should always clean up any mess.)
- Suspend or terminate adjacent processes to determine the flow of events.
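As an illustration of the first item, here is a minimal sketch of an application that reads its logging level from the environment, so you can turn up verbosity in a test environment without touching the code. The variable name `APP_LOG_LEVEL` and the function `configure_logging` are assumptions made for this example, not part of any particular framework.

```python
import logging
import os


def configure_logging(default_level: str = "WARNING") -> int:
    """Set the root logger's level from the (hypothetical) APP_LOG_LEVEL variable."""
    level_name = os.environ.get("APP_LOG_LEVEL", default_level).upper()
    # getLevelName maps a known level name such as "DEBUG" to its number;
    # for an unknown name it returns a string, so fall back to WARNING.
    level = logging.getLevelName(level_name)
    if not isinstance(level, int):
        level = logging.WARNING
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier (Python 3.8+)
    )
    return level
```

With this in place, starting the process with `APP_LOG_LEVEL=DEBUG` set is enough to see debug output, and removing the variable restores the quieter default.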
Understand the System
Whenever you find a problem you can’t immediately solve in a system you think you know, you need to face facts: you don’t know the system as well as you thought you did.
Take a step back and try to draw out the system at a high level. Gradually and continuously focus inwards on the parts that are most likely to be the culprit in the particular scenario you’re working on.
Simplify the Problem
Instead of tackling the problem head-on, try to reduce its complexity and the number of moving parts you have to keep track of in your mind.
We are fragile, meaty beings with only a limited ability to store and process information. Make the problem more digestible.
In a production system, use simple, irrefutable tests to eliminate the components and behaviour that can’t possibly be contributing to the dysfunction.
Eventually, you’ll be left with a much smaller and simpler problem to work on and comprehend. If you continue doing this, you will find the solution eventually.
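When the moving parts can be listed (input records, config entries, recent changes), one way to eliminate them systematically is to halve the list and keep whichever half still shows the problem. The sketch below assumes exactly one item is responsible and that your test gives the same answer every time; `find_culprit`, `still_fails` and the sample data are hypothetical names invented for this example.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def find_culprit(items: Sequence[T], still_fails: Callable[[Sequence[T]], bool]) -> T:
    """Narrow a failing set of items down to the single item that triggers the failure."""
    if not items or not still_fails(items):
        raise ValueError("the full set does not reproduce the problem")
    candidates = list(items)
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        if still_fails(first_half):
            candidates = first_half
        else:
            # Under the single-culprit assumption, it must be in the other half.
            candidates = candidates[len(first_half):]
    return candidates[0]


# Hypothetical scenario: a batch job crashes on one malformed record.
records = ["10", "7", "3", "x9", "12"]


def batch_crashes(batch: Sequence[str]) -> bool:
    return any(not record.isdigit() for record in batch)
```

Here `find_culprit(records, batch_crashes)` returns `"x9"` after three checks of a partial batch rather than one check per record. If several items interact to cause the failure, this simple halving is not enough, and you are back to eliminating parts by hand.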
Bugs Do Not Go Away on Their Own!
Oh, and please don’t fall foul of this common logical fallacy: “I could reproduce it 20 minutes ago, and I haven’t changed anything, but it seems to have gone. Maybe it’s gone forever!”
If someone has spotted a bug and you haven’t changed anything, the fact that you can no longer reproduce it doesn’t mean it’s fixed. It will happen again.
If you make a change to validate a theory or to introduce a potential fix, and the change has no effect, roll it back immediately.
Don’t add one change on top of an ineffective change. Something you introduce may have no effect at first, but if you add more changes, you end up with a more complicated, messier problem to solve.
Worse still, you could wrongly attribute a change in behaviour to an unrelated change you’ve introduced. Don’t do it! Keep your changes isolated and test every change.
Rubber Duck Debugging
Though it can sometimes feel like it, there is absolutely no shame in asking for help on a problem that is eluding you.
Practise Rubber Duck Debugging: the act of explaining your current problem to another person, or even an inanimate object, to help you gather your thoughts and regain focus.
Sometimes the reason you haven’t solved the problem yet is precisely because you’ve been focusing on the problem too long, and you’ve fallen down a rabbit hole.
Document Everything You’ve Already Tried
In a complex scenario with high stakes, it’s very easy to lose track of what you’ve already done to solve a problem. You could end up going back and treading old ground without even knowing it. Document everything you’ve done so far, and keep documenting as you go along.
You may think, “I’ve tried everything and I’m still stuck!”
You really haven’t. There is always something that can be done to highlight the cause of your issue.
You might be too close to the problem. You might never be able to think of the solution yourself. But you must accept that something can be done to expose the problem; you haven’t found it yet, and your job is to find it.
Don’t give up.
Learn New Debugging Techniques
Debugging isn’t just setting breakpoints and hoping for the best. There are lots of techniques to learn. One such technique worthy of attention (though there are many others) is the use of Conditional Breakpoints:
Like an ordinary breakpoint, a conditional breakpoint halts the execution of a program, but only when a condition you specify is met.
This can save a lot of time in tracking down your elusive bug, because when the program pauses, you know it has reached a state you’ve decided is important.
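Most debuggers and IDEs let you attach a condition to a breakpoint. As a rough sketch in Python, you can get the same effect in code by guarding the built-in `breakpoint()` call with the condition you care about. The function `order_totals` and its data layout are hypothetical, invented for this example.

```python
from typing import Dict, List


def order_totals(orders: List[Dict[str, int]]) -> List[int]:
    """Compute price * quantity for each order (hypothetical code under investigation)."""
    totals = []
    for order in orders:
        total = order["price"] * order["quantity"]
        # Conditional breakpoint: pause only in the suspicious case,
        # not on every one of the thousands of healthy orders.
        if total < 0:
            breakpoint()  # opens pdb by default, with `order` and `total` in scope
        totals.append(total)
    return totals
```

If you would rather not edit the code, pdb accepts the condition directly when you set the breakpoint, in the form `break filename:lineno, total < 0`, where the file name and line number are whatever applies in your own program. Remember to remove the guard once you’re done; setting the environment variable `PYTHONBREAKPOINT=0` also turns `breakpoint()` calls into no-ops.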
There Isn’t Always a Clear Way to Solve the Problem
Sometimes you have to get creative.
Above All Else, Take Regular Breaks
Solving a tricky production bug can be tough, even emotionally draining.
Go for a walk, grab a drink, and get away from your desk. Do it now.
Sometimes it can make the world of difference to have your subconscious mind take the wheel and do some offline processing, while you distract yourself with something else.
And if you’re lucky, you might just get to that magical moment when …
You’ve Solved It! But it isn’t over.
Are you sure you’ve solved it?
Test and re-test your fix. Revert to the original configuration and run your reproduction steps. Confirm the bug exists. Apply your changes and run the reproduction steps again. Confirm the bug has gone.
Do it again.
… Are you really really sure now?
After all your pain and suffering, it’s easy to forget very simple confirmatory steps. Don’t cause yourself more heartache; be confident in your solution before proclaiming the issue is resolved.
… Oh, and you should definitely write some tests now
A bug being found in production can mean only one thing about your application’s test suite: it isn’t good enough, because it didn’t catch the bug before release.
Use the opportunity to write tests that exercise that area of code, so no one else has to go through the suffering you’ve just experienced.
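A regression test for a fixed bug can be very small. The sketch below uses Python’s standard `unittest` module and a hypothetical `parse_quantity` function whose original bug, assumed here for illustration, was rejecting input with surrounding whitespace. The first test pins the exact production failure; the second checks that the fix didn’t loosen validation.

```python
import unittest


def parse_quantity(raw: str) -> int:
    """Hypothetical fixed function: tolerates surrounding whitespace."""
    cleaned = raw.strip()
    if not cleaned.isdigit():
        raise ValueError(f"not a quantity: {raw!r}")
    return int(cleaned)


class ParseQuantityRegressionTest(unittest.TestCase):
    def test_trailing_whitespace_is_accepted(self):
        # The exact input that failed in production (assumed for this example).
        self.assertEqual(parse_quantity("42 "), 42)

    def test_non_numeric_input_is_still_rejected(self):
        # The fix must not make the function accept garbage.
        with self.assertRaises(ValueError):
            parse_quantity("forty-two")
```

Saved in a test module, this runs with `python -m unittest`. It’s worth running the new test against the unfixed code once to confirm that it fails there, for the same reason you confirmed the bug before applying your fix.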
… at least until a different bug comes up, anyway!
Hope you’ve enjoyed this and it’s been useful.
Thanks for reading 😁!
Who contributed to this article
Andy MacDonald, Senior Software & DevOps Engineer
Andy MacDonald is a senior software and DevOps Engineer at BlackCat. He is passionate about all things technology and loves learning about new technologies and their application. Andy has extensive technical skills across product development, application architecture and agile/DevOps process improvement. In his spare time, he’s a volunteer mentor and coding coach and an active member of the Birmingham tech scene. He’s also a regular guest writer for a number of online technical journals as well as a regular contributor to BlackCat’s own technical blog.