Because Production debugging is inevitable!
I recently moved on from Airtel. There, I worked at a scale most engineers can only imagine! That scale taught me something very important:
Production debugging is inevitable!
Well, who doesn't connect with this famous meme?
Jokes apart, why not take the time to write test cases? Why not use the ELK stack, Datadog, Logrocket, or other logging tools? Why not plan properly?
Let me go one by one.
Take time and write test cases - devs can only cover the cases they already know about.
Using the ELK stack & other logging tools - devs have to guess what to log. I often found devs adding logs after a bug was found, and repeating this until the bug was fixed.
The reality is that when we deploy to production, our code acts as a black box. We are testing not only our code but also an ever-evolving constellation of:
- Specific user data & several concurrent users
- Production grade downstream services
- Schedulers, Queues & their quirks
- Third-party APIs
- Race conditions
- Storage IOPS
- Environment Settings
- Specific time of the day/month/year
We simply can't plan tests for things we don't know about.
I want to share an example where we relied on a downstream API that intermittently gave us 500 internal server errors. The flow was: we receive a customer email, extract a few details from it, and ask the downstream API for further information about that email.
The downstream API returned a 500 for about 6% of requests. We made about 100k requests/day, so 6% meant roughly 6,000 failed requests every day. That was a huge number!
After adding more logs and debugging for 2 days, we realized that we were sending the wrong parameter value to this API, but only in some cases. Our regex was failing to extract the correct values from a few emails. We quickly fixed it. The 500s went down to 1%. We wanted to debug the remaining 1%, but that became increasingly difficult with the existing logging tools.
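To make the regex failure above concrete, here is a minimal sketch of the kind of guard that would have surfaced it earlier. Everything in it is an assumption for illustration: the `OrderIdExtractor` class, the "Order ID" field, and the 8-digit format are hypothetical, not the actual parameter or pattern we used. The idea is simply to return an empty result when extraction fails, so a bad value never reaches the downstream API.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor: assumes the email carries a field like "Order ID: 12345678".
class OrderIdExtractor {

    // Accept exactly 8 digits after the label; anything else counts as a failed extraction.
    private static final Pattern ORDER_ID = Pattern.compile(
            "Order\\s*ID\\s*[:#-]?\\s*(\\d{8})\\b",
            Pattern.CASE_INSENSITIVE);

    // Returns the extracted value, or Optional.empty() when the email does not match.
    static Optional<String> extract(String emailBody) {
        if (emailBody == null) {
            return Optional.empty();
        }
        Matcher m = ORDER_ID.matcher(emailBody);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Hi, my Order ID: 12345678 is delayed.",
            "Hi, my order id is 1234, please check."
        };
        for (String email : samples) {
            Optional<String> id = extract(email);
            if (id.isPresent()) {
                System.out.println("would call downstream API with " + id.get());
            } else {
                // Skip the downstream call and record the raw input for later debugging.
                System.out.println("extraction failed, skipping call for: " + email);
            }
        }
    }
}
```

With this shape, the caller can skip the downstream call and log the raw email whenever `extract` returns empty, instead of finding out through a 500.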
I can share at least 3 more examples where third-party API integrations gave unexpected responses, storage didn't provide enough IOPS, and major design flaws caused hard-to-debug race conditions.
Eventually, I realized that my production issues were calmly telling me:
I wished I could see how the code executed on a remote machine, for every thread, line by line. It would make our lives so much easier.
I called up some 50 developers in my network and asked whether they went through the same loop every time. All of them shared the sentiment.
I put together a team & started working on Videobug - it runs the code with a Java agent that records code execution and plays it line by line, right in the IDE. Developers could now see any historical code execution, and trace it back line by line.
Videobug works everywhere - localhost, staging, or production.
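For readers who have not used a Java agent before, the sketch below shows the standard `java.lang.instrument` hook that any agent starts from. It is not Videobug's implementation: the `LoadedClassRecorder` class is hypothetical and only notes the names of classes as they load, whereas a real recorder would rewrite bytecode to capture execution.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical agent class. In a real agent JAR it would be declared public,
// live in its own file, and be named in the manifest's Premain-Class entry.
class LoadedClassRecorder implements ClassFileTransformer {

    // Internal names (e.g. "com/example/Foo") of every class passed to the transformer.
    static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    // The JVM calls premain before the application's main()
    // when it is started with -javaagent:<path-to-agent-jar>.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LoadedClassRecorder());
    }

    @Override
    public byte[] transform(ClassLoader loader,
                            String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        if (className != null) {
            SEEN.add(className);
        }
        // Returning null tells the JVM to keep the original bytecode unchanged.
        return null;
    }
}
```

Because the agent is attached with a JVM flag rather than a code change, the same hook applies wherever the JVM runs, which is what makes the localhost, staging, and production story possible.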
Take a look at our demo:
How can devs search for code executions?
- Search by Exceptions - devs can even add their custom exceptions.
- Search with a string value - devs can search code executions with specific string variable values. E.g. show the code executions where the value of the email field was firstname.lastname@example.org or show the code executions where the mobile number was 1234567890 and so on!
We plan to add several such filters in the future.
We currently support Java and have thoroughly tested this on Spring Boot. If you wish to test this on your codebase and help us improve, do join our Discord channel.
We will be supporting Scala & Kotlin very soon.
Production Issues, here we come!!
Special thanks to Sumit Deshmukh at Overjet for being the testbed for Videobug.
P.S. I will write about the performance impact and storage needed for Videobug in my next post. Do join our Discord Channel in case you want to learn more.