Finding Stubborn Bugs with Meaningful Debug Info

When a user reports a bug you can't duplicate, let your application help you find the problem. Add logging now and be a debugging master after the software gets deployed.

Preventing Bugs

All the techniques I've described in this article are useful, but they shouldn't be deployed in a vacuum. It's also important to adopt practices that help reduce the likelihood of bugs occurring. Here are some to consider:

  • Adopt unit testing. Java, Python, OCaml, Perl and C all have unit testing frameworks available. Use them and exercise as many code paths as possible. This is especially important for a language such as Python, where a module is not compiled until it is imported and many mistakes, such as a misspelled name, surface only when the offending line actually runs. It also can be important for Java; for instance, runtime exceptions can occur due to improper casting to or from Object.

  • Avoid globals. Avoiding global (or class-global) variables helps isolate problems and helps prevent synchronization issues in multithreaded programs. Global variables can be the source of unexpected side effects in function calls, which can be hard to track down.

  • Use the right tool for the job. Languages each have their own strengths and weaknesses, and no single language is the best tool for every task. For instance, although Perl makes it easy to parse delimited text files with regular expressions, OCaml provides tools specifically designed for writing a compiler. Problems that are expressed easily in one language may become much more difficult to express in another.

  • But don't use too many different tools. Most projects benefit from a standardized toolset. Pick the language and libraries that are most useful for the project at hand, and don't introduce new ones unless there's a compelling reason to do so.

  • Use string and memory management tools. Many languages, including Java, Python, OCaml, Perl and Ruby, provide transparent memory management. You do not need to allocate and deallocate memory. You also do not need to concern yourself explicitly with end-of-string markers and string size limitations. Both of these are common problems with C programs that lead to runtime bugs or security holes. If you must use C, consider a garbage collection or memory pool library.

  • Make it work first, then optimize. In many cases, it's better to develop working code first, then optimize it later. Many people optimize first, which does work in some cases. However, simple, bug-free code is usually more important than code that is as fast as it can possibly be.

  • Write clean code. Split out code into functions. Write comments. Document what each function does and its effect on the environment.
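
To make the unit testing advice above concrete, here is a minimal sketch using Python's standard unittest module. The describe_folder function is a hypothetical helper invented for this illustration; the point is only that the second test forces a rarely taken branch to run, so a mistake there shows up in testing rather than in a user's report.

```python
import unittest


def describe_folder(name, message_count):
    """Hypothetical helper: summarize a mail folder for a status line."""
    if message_count == 0:
        # Rarely taken branch: a typo here would go unnoticed until a
        # user happened to hit an empty folder.
        return name + ": empty"
    return "%s: %d messages" % (name, message_count)


class DescribeFolderTest(unittest.TestCase):
    def test_nonempty_folder(self):
        self.assertEqual(describe_folder("INBOX", 3), "INBOX: 3 messages")

    def test_empty_folder(self):
        # Exercises the branch that ordinary runs seldom reach.
        self.assertEqual(describe_folder("Drafts", 0), "Drafts: empty")
```

Saved in a file, tests like these can be run with python -m unittest followed by the file name.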

Case Study: a Bug in OfflineIMAP

OfflineIMAP is a program that talks to IMAP servers and synchronizes an IMAP folder tree with a local tree. Many IMAP servers exist, and they don't all work exactly the same. Through its two-year history, OfflineIMAP has gained more and more of the debugging techniques discussed in this article. Problems that users encounter often are unreproducible with my particular setup, so detailed logging is a must. Some IMAP servers are buggy themselves, so the first question that has to be resolved with many reports is: is this even a bug in OfflineIMAP? In a surprising number of cases, the answer is no. OfflineIMAP uses certain IMAP features that most other IMAP clients do not, and those features tend to be poorly tested in some servers.

I'd like to walk you through one particularly stubborn OfflineIMAP bug I've been working on. About a year ago, someone reported a bug in OfflineIMAP using the Debian bug-tracking system. Unfortunately, I couldn't duplicate the problem, and the original submitter didn't have logging turned on when the problem happened. He also wasn't able to obtain debugging information. From what he did have, which included an error message, I was able to learn a little by following the steps outlined earlier in this article. I didn't have information on the input and output, but the program flow and stack both looked normal. In the end, I was able to determine where the program crashed but not why, so the bug sat there for a while. Things were made more difficult because the bug was intermittent: sometimes the program would work fine, and occasionally it would crash.

Later, a second person experienced the same problem. He noticed the existing bug report on Debian and sent in his information. OfflineIMAP automatically tries to print out parts of a debug log if a fatal error occurs, and he was able to capture this output. This OfflineIMAP feature has proven valuable in the past, because it is not always possible to reproduce the situation leading to a problem.
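
One way to get this kind of behavior is sketched below. It is not OfflineIMAP's actual code; it only assumes Python's standard logging module, and RecentRecordsHandler and run_with_debug_tail are names made up for the example. The handler keeps the most recent log lines in memory, and the wrapper hands them back when the task dies with an exception.

```python
import collections
import logging


class RecentRecordsHandler(logging.Handler):
    """Keep the last `capacity` formatted log records in memory."""

    def __init__(self, capacity=50):
        super().__init__(level=logging.DEBUG)
        # A bounded deque silently discards the oldest entry when full.
        self.records = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.records.append(self.format(record))

    def dump(self):
        """Return the buffered lines, oldest first."""
        return list(self.records)


def run_with_debug_tail(task, logger, handler):
    """Run task(); on a fatal error, return the recent debug lines."""
    try:
        task()
    except Exception:
        # logger.exception() records the message plus the stack trace.
        logger.exception("fatal error")
        return handler.dump()
    return []
```

A real program would print or save the returned lines before exiting, so the user has something to attach to a bug report even if logging to a file was never enabled.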

In this case, the information helped. I was now able to see what OfflineIMAP was doing immediately prior to the bug occurring. But it still was not enough information to discover the exact problem; everything still looked normal. And because the bug was intermittent, he couldn't capture any additional information.

Eventually, a third person experienced the same problem. Again, he had some information but not quite enough to figure it out. Something else needed to happen, so I made the logging in the particular section of code more detailed. Hopefully, with the additional logging, the next time the problem is encountered, I'll have enough information to track it down.
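
Raising the level of detail for just one section of code might look something like the following sketch. It assumes Python's standard logging module, and the logger names mailsync and mailsync.folder are hypothetical stand-ins rather than anything taken from OfflineIMAP. The program as a whole logs only warnings, while the suspect section logs everything.

```python
import logging

# Hypothetical logger names standing in for sections of a program.
app_log = logging.getLogger("mailsync")
folder_log = logging.getLogger("mailsync.folder")  # the suspect section


def configure_logging(stream=None):
    """Log warnings everywhere, but full debug detail in one section."""
    handler = logging.StreamHandler(stream)  # defaults to sys.stderr
    handler.setFormatter(
        logging.Formatter("%(name)s %(levelname)s %(message)s"))
    app_log.addHandler(handler)
    app_log.setLevel(logging.WARNING)
    # A level set on a child logger overrides the one inherited
    # from its parent, so only this section becomes verbose.
    folder_log.setLevel(logging.DEBUG)
    return handler
```

Because the extra detail is confined to one logger, the rest of the output stays short enough for users to send along with a report.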

Several things played an important role in this process. First, OfflineIMAP always generates a usable stack trace when a fatal error occurs. Even the least-detailed report showed exactly where the program was when it crashed. Second, error logs are helpful only if they exist when the bug strikes, and users can't easily reproduce an intermittent bug with logging turned on. Printing out debugging information automatically when a program crashes or malfunctions is a useful way to combat that problem.
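
As a rough illustration of the first point, a top-level wrapper along these lines guarantees that a stack trace is captured for any unhandled error. It is a sketch under the assumption of a Python program using the standard traceback and logging modules; run_and_report is a hypothetical name, not OfflineIMAP's own function.

```python
import logging
import traceback

log = logging.getLogger("crash-demo")


def run_and_report(task):
    """Run task(); on an unhandled exception, return the stack trace text."""
    try:
        task()
    except Exception:
        # format_exc() renders the exception currently being handled,
        # including the file, line and function of every frame.
        report = traceback.format_exc()
        log.error("fatal error:\n%s", report)
        return report
    return None
```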

Also, the bug-tracking system played an important role in tracking down the problem. Because Debian bug reports are public, the three submitters involved were able to identify an outstanding bug report and add their information to it. This helped everyone to manage the information related to the particular issue and also provided a place to start for the people who encountered the problem for the first time.
