diff -u: Complexifying printk()

What's new in kernel development: complexifying printk().

It's so simple! The kernel decides to output a log message, so it calls printk() to send the message to a serial console—except it's not simple at all. What if the kernel is in the middle of crashing, and the log message is the crucial clue needed to diagnose the problem? How do you output a log message when you don't know what parts of the system you even can rely on? What if the system's out of memory or trapped in an atomic context, unable to switch from whatever's breaking to the code to execute the printk()?

There are all sorts of corner cases that safely can be ignored by user code producing output, but that are essential to get right when the kernel is the one producing output.

To make matters worse, these corner cases tend to occur in ways that are difficult to reproduce, creating potential controversy over whether a bug exists at all. How do you reproduce a bug that causes the very logging system to fail to tell you what happened?

A multi-year debate recently hit the kernel mailing list again, as Sergey Senozhatsky posted a patch to fix a system crash he and others had been seeing in their companies' data centers. Unfortunately, his solution added a lot of complexity to the already complicated printk() code, and the bug it fixed could not be reproduced on demand.

Steven Rostedt had a separate patch for printk(), also very complicated, but much simpler than Sergey's patch. The only problem was, it didn't address the issue Sergey was trying to fix. So, much of the email thread consisted of Sergey and others trying to convince Steven and others that the bug was real. They posted process traces and logic paths, but Steven was never convinced. In particular, in the cases where Sergey's side was able to demonstrate an actual problem, Steven objected that the use-case was completely unrealistic—for example, when a single CPU was doing all the work while all the other CPUs on the system remained idle.

Meanwhile, a significant number of people simply wanted to push Steven's code into the mainline kernel source and see what happened. Maybe Sergey's system crashes would stop? But, just throwing something into the wild like that is never an appealing option, and it seemed to grow out of the fact that it's so difficult to know what printk() is actually doing when things go wrong.

There was absolutely no conclusion during the discussion. The multi-year debate remains a multi-year debate. However, there may have been a slight nudge in one direction or another, because Steven has started to feel that the system crashes Sergey is trying to fix may in fact be real, but caused by something else, and he's begun working with Sergey to try to diagnose that.

It's all very interesting, because even though the developers experience a lot of frustration with each other in a situation like this, they still keep working together, trying to find a way through. There's no way to know for sure how it will end up. Maybe someone will find a way to explode the entire printk() code into a bunch of smaller yet simpler elements, so that bugs can be reproduced and analyzed, instead of remaining intractable.

Load Disqus comments