Finding Stubborn Bugs with Meaningful Debug Info

SysAdmin

by John Goerzen

on December 1, 2004

Bug tracking is often one of the most difficult processes in software development. Users may have situations different from developers, and bugs that are a big problem for users may not even be visible on developers' machines. Sometimes bugs can come and go, or networked programs may encounter bugs only when talking to specific servers or clients. In this article, I discuss techniques software developers can employ to help track down bugs more easily.

First, I discuss two ways to make it easier to receive and manage bugs, and then I show how to make your programs generate more useful debugging output. Then, I talk about tracking down troublesome bugs. Finally, I cover some practices that can help prevent bugs in the first place. Many of the techniques described in this article are employed in OfflineIMAP (see “Fast Convenient Mail for Travel: OfflineIMAP”, LJ, March 2004).

Tracking Bugs

Before examining how to make better bug reports possible, a critical first step is making sure you can deal with the bug reports you receive. For some small projects, simply publishing an e-mail address is sufficient. However, most projects need something more. Developers often get busy and forget about things. Bugs may be complicated to solve, requiring input from several people, or there simply may be a lot of bug reports.

A bug-tracking system (BTS) is a great way to help ensure that bugs are not forgotten. Most BTS implementations provide a way to track correspondence, handle attached files and delegate responsibility to particular people. Some also support categorization based on things such as severity, user environment and specific components.

If your project is hosted at a project hosting site such as SourceForge or Savannah, you already have a BTS available for your use. You should use it and encourage your users to submit bugs through that interface rather than to a mailing list.

If you need more flexibility, you can find BTS programs for Linux. Some of the most popular free software BTS programs are:

Bugzilla, the BTS used by the Mozilla Project, is a flexible system primarily used through its Web interface.
Request Tracker can be used as both a bug-tracking system and a support-tracking system. It features both Web and e-mail interfaces, though some administrative functions can occur only through the Web interface.
Jitterbug is the BTS used by the Samba Project. It is similar in concept to Bugzilla but is more lightweight.
Debbugs is the BTS used by the Debian Project. Debbugs has a Web interface, though it is read-only; all manipulations occur by e-mail. Debbugs is best suited for large projects with clearly identifiable components and responsibility for those components.

I personally prefer Request Tracker, because it seems to have a nice blend of features for a BTS. Your own requirements may differ.

Make It Easy to Submit Bugs

Sometimes I find a nasty bug in a program and want to report it. But to do so, I have to fill out a detailed questionnaire and perhaps divulge information I'd rather not. It should be easy for people to submit bugs and the information needed to track them down. If you take submissions on the Web, make the process simple. Don't require too much information, and accept submissions even if people don't know some information. Don't expect users to know anything about the different components of the project or which developers are responsible for a given problem.

Logging

When tracking down problems, you often want to know what state the program is in. Other times, you may want to know what actions were carried out prior to triggering the bug. Because users of programs don't necessarily have expertise with your code and a debugger, logging often is called for. Logging simply means writing out a record of the actions carried out. Simple programs might merely print out information, but usually you'll want something a little more capable.

Non-interactive programs, such as network servers, do not have a screen on which to display information. These programs often maintain a log file or use the syslog facility built in to Linux and UNIX systems.

Interactive programs may display information on-screen or also may generate a file. Having a log file available can make bug reporting easier, because the user simply can attach it to a bug report.

Sometimes, you might need quite a bit of data to figure out what's going on with a specific problem. However, all this data may be overkill for a normal session—it could flow right off a user's screen or fill up a hard disk. Therefore, many programs have a notion of a log level. The user can set, at runtime, how much information should be logged. Some programs even may have log categories, where users can configure which types of information are logged. OfflineIMAP uses this approach. For troublesome problems, users can turn on a communications log, which logs all data sent to or received from the IMAP server.

Python 2.3 introduced a useful module called logging. The logging module provides a uniform interface to several different ways of logging messages. Its supported logging methods include writing messages to files, network services, syslog, e-mailing messages and several others. The following is a simple example that illustrates use of the logging module:

#!/usr/bin/env python
import logging, sys

# Create the logger object
l = logging.getLogger('testlog')

# Create a handler and assign it to the object
handler = logging.StreamHandler(sys.stderr)
l.addHandler(handler)

# Levels are DEBUG, INFO, WARNING, ERROR, CRITICAL.
# Set the default level here. Any log messages
# beneath that level are dropped.
l.setLevel(logging.INFO)

# Try it out.

l.debug("Debug message -- system initialized.")
l.info("Here's some info.  I've just debugged.")
l.warning("I don't have many messages left.")
l.error("Only one more message to go.")
l.critical("Nothing else to do!")

This program begins by initializing the logger. It uses the StreamHandler to write logged text to standard error. It also sets the log level to INFO. Then it logs five messages. When you run this program, you see only the last four. The debug message was filtered out by setting the level to INFO. Many programs have a configuration or command-line option to set the level at runtime. You can use different logging methods simply by adding a different handler to your Logger object. The Python documentation has a reference for all the available handlers.

Check Input

Make sure the input you are receiving is valid. For instance, if you are expecting something on the command line, check to make sure you have the appropriate number of arguments before trying to use them (or trap the resulting exception). This gives users a better error message. Here's a sample Python program that demonstrates this:

#!/usr/bin/env python
import sys

try:
    print "You supplied: %s" % sys.argv[1]
except IndexError:
    print "You forgot an argument."

Handle Exceptions

Several programming languages, such as Java, Python and OCaml, include support for exceptions. With exceptions, you can catch errors at the place you choose, rather than having to check and handle errors with each call that may produce a problem. Sometimes, it might be correct to let exceptions go unhandled, but usually that is not the case. Exceptions should be caught and handled. Although it may be appropriate to terminate the program if you can't open the file a user asks for, it is still better to do so with an error message giving the filename and problem rather than let the user receive an ugly exception message.

Capture Exceptions

For exceptions that really are fatal to your program, you still may want to capture them. This would allow you, for instance, to log them to a file or display the exception in a pop-up box in the GUI application. This makes it easier for users to send the stack trace back to you. You also can use a generic exception catcher to perform other activities, perhaps output contents of various buffers to help you figure out what was going on at the time.

The following is an example that logs any exceptions along with some information about the program currently running. It then re-raises the exception and exits:

#!/usr/bin/env python
import logging, sys, StringIO, traceback, os

l = logging.getLogger('testlog')

handler = logging.StreamHandler(sys.stderr)
l.addHandler(handler)
formatter = logging.Formatter("LOG: %(message)s")
handler.setFormatter(formatter)

l.setLevel(logging.INFO)

def logexception():
    sbuf = StringIO.StringIO()
    traceback.print_exc(file = sbuf)
    excval = sbuf.getvalue()
    l.critical(" *** Exception Detected ***")
    l.critical("Current PID: %d" % os.getpid())
    l.critical("Program name: %s" % sys.argv[0])
    l.critical("Command line: %s" % \
               str(sys.argv[1:]))
    for line in excval.split("\n"):
        l.critical(line)

def main():
    print "Hello, I'm running."
    raise RuntimeError("Oops! I've had a problem!")

try:
    main()
except:
    logexception()
    raise

When you run this program, you should see something like this on your screen:

Hello, I'm running.
LOG:  *** Exception Detected ***
LOG: Current PID: 28441
LOG: Program name: /tmp/logerror.py
LOG: Command line: []
LOG: Traceback (most recent call last):
LOG:   File "/tmp/logerror.py", line 30, in ?
LOG:     main()
LOG:   File "/tmp/logerror.py", line 27, in main
LOG:     raise RuntimeError("Oops! I've had a problem!")
LOG: RuntimeError: Oops! I've had a problem!
LOG:

Here, the exception handler found the exception, grabbed the information about it and was able to log it. You also can see the traceback a second time. The raise statement at the end of the program causes the exception to be raised and handled in the normal fashion also. This means it aborts your program with a traceback. Depending on your requirements, you may opt to use sys.exit() to terminate instead.

Finding Reported Bugs

Now that you have some ways to help users submit good bug reports, let's look at ways to use those bug reports to track down problems. Armed with a log and perhaps traceback information, here are some questions to ask yourself:

Can I duplicate the bug in my environment? If you can duplicate the problem on your own machine, you're a long way toward being able to resolve it easily. Use a debugger or other tool to track it down now that you can trigger it at will.
Was the input and output what I expected? Perhaps the user supplied a value you didn't contemplate when you wrote the program. Or, perhaps a network client or server treats a protocol slightly differently from what you expected. Maybe the input or output is itself malformed, and the bug isn't even in your program. A debug log showing all I/O can be very helpful here.
Was the program flow as expected? If your log calls to various functions or methods, you should be able to trace the flow of execution in a program. Perhaps certain conditions cause vital code to be skipped, leading to trouble later on.
Where was the last point of correct execution? This may have been right before the error, or perhaps incorrect data was passed around for some time prior to a crash. Pinpointing the most recent time in the program's history where it was functioning normally can help track down the precise place where things went awry.
If a traceback is on-hand, does the stack look normal? Check to make sure the function calls are as expected and that the data passed to them looks legitimate.

Preventing Bugs

All the techniques I've described in this article are useful, but they shouldn't be deployed in a vacuum. It's also important to adopt practices that help reduce the likelihood of bugs occurring. Here are some to consider:

Adopt unit testing. Java, Python, OCaml, Perl and C all have unit testing frameworks available. Use them and exercise as many code paths as possible. This is especially important for a language such as Python where certain executions of a program may not even parse all of your code. It also can be important for Java; for instance, runtime exceptions can occur due to improper casting to or from Object.
Avoid globals. Avoiding global (or class-global) variables helps isolate problems and helps prevent synchronization issues in multithreaded programs. Global variables can be the source of unexpected side effects in function calls, which can be hard to track down.
Use the right tool for the job. Languages each have their own strengths and weaknesses, and no single language is the best tool for every task. For instance, although Perl makes it easy to parse delimited text files with regular expressions, OCaml provides tools specifically designed for writing a compiler. Problems that are expressed easily in one language may become much more difficult to express in another.
But, don't use too many different tools. Most projects benefit from a standardized toolset. Pick a language and libraries that are most useful for the project at hand, and don't introduce new ones unless there's a compelling reason to do so.
Use string and memory management tools. Many languages, including Java, Python, OCaml, Perl and Ruby, provide transparent memory management. You do not need to allocate and deallocate memory. You also do not need to concern yourself explicitly with end-of-string markers and string size limitations. Both of these are common problems with C programs that lead to runtime bugs or security holes. If you must use C, consider a garbage collection or memory pool library.
Make it work first, then optimize. In many cases, it's better to develop working code first, then optimize it later. Many people optimize first, which does work in some cases. However, simple, bug-free code is usually more important than code that is as fast as it can possibly be.
Write clean code. Split out code into functions. Write comments. Document what each function does and its effect on the environment.

Case Study: a Bug in OfflineIMAP

OfflineIMAP is a program that talks to IMAP servers and synchronizes an IMAP folder tree with a local tree. Many IMAP servers exist, and they don't all work exactly the same. Through its two-year history, OfflineIMAP has gained more and more of the debugging techniques discussed in this article. Problems that users encounter often are unreproducible with my particular setup, so detailed logging is a must. Some IMAP servers are buggy themselves, so the first question that has to be resolved with many reports is: is this even a bug in OfflineIMAP? In a surprising number of cases, the answer is no. OfflineIMAP uses certain IMAP features that most other IMAP clients do not, and those features tend to be poorly tested in some servers.

I'd like to walk you through one particularly stubborn OfflineIMAP bug I've been working on. About a year ago, someone reported a bug in OfflineIMAP using the Debian bug-tracking system. Unfortunately, I couldn't duplicate the problem, and the original submitter didn't have logging turned on when the problem happened. He also wasn't able to obtain debugging information. Given the information he did have, which included an error message, I was able to gather some information following the steps outlined earlier in this article. I didn't have information on the input and output, but the program flow and stack both looked normal. In the end, I was able to determine where the program crashed but not why, so the bug sat there for a while. Things were made more difficult because the bug was intermittent—sometimes the program would work fine, and occasionally it would crash.

Later, a second person experienced the same problem. He noticed the existing bug report on Debian and sent in his information. OfflineIMAP automatically tries to print out parts of a debug log if a fatal error occurs, and he was able to capture this output. This OfflineIMAP feature has proven valuable in the past, because it is not always possible to reproduce the situation leading to a problem.

In this case, the information helped. I was now able to see what OfflineIMAP was doing immediately prior to the bug occurring. But, it still was not enough information to discover the exact problem—everything still looked normal. However, the bug was intermittent, and he couldn't capture any additional information.

Eventually, a third person experienced the same problem. Again, he had some information but not quite enough to figure it out. Something else needed to happen, so I made the logging in the particular section of code more detailed. Hopefully, with the additional logging, the next time the problem is encountered, I'll have enough information to track it down.

Several things played an important role in this process. First, OfflineIMAP always generates a usable stack trace when a fatal error occurs. Even the least-detailed report showed exactly where the program was when it crashed. Secondly, error logs are helpful, but less so if people can't reproduce a particular bug easily. Printing out debugging information when a program crashes or malfunctions can be a useful way to help combat that problem.

Also, the bug-tracking system played an important role in tracking down the problem. Because Debian bug reports are public, the three submitters involved were able to identify an outstanding bug report and add their information to it. This helped everyone to manage the information related to the particular issue and also provided a place to start for the people who encountered the problem for the first time.

Conclusion

There are many ways to help your users report bugs in your program and track them down, but they should not be employed in a vacuum. Don't forget to make it easy to report and track bugs, and to write clean code in the first place. Finally, remember that none of these steps are a magic bullet. Taken together, they can simplify your bug-tracking process and help find many problems, but they won't necessarily solve everything.

Resources for this article: /article/7747.

John Goerzen is a longtime Linux programmer and the author of Foundations of Python Network Programming. He also serves as President of Software in the Public Interest, Inc. John welcomes your comments at jgoerzen@complete.org.

Load Disqus comments