Dealing with printk()
It's odd that printk() would pose so many problems for kernel development, given that it's essentially just a replacement for printf() that doesn't require linking the standard C library into the kernel.
And yet, it's famously a mess, full of edge cases, deadlocks, race conditions and a variety of other tough-to-solve problems. The reason is that, unlike printf(), printk() is a kernel function that has to produce reasonable behavior even when the entire system is in the midst of crashing. That's really the whole point: printk() needs to report errors and warnings that can be used to debug whatever strange and unexpected catastrophe has just hit a running system.
Trying to fix all the deadlocks and other problems at the same time would be too large a task for anyone, especially since each one is a special case defined by the particular context in which the printk() call appears. But sometimes a group of instances in a particular region of code can be addressed together.
Sergey Senozhatsky recently tried to address some printk() deadlocks, although he acknowledged he wouldn't address any instances that were caused by the printk() code itself triggering a separate recursive printk() call. He wanted to concern himself with non-recursion-based deadlocks only.
Sergey focused on the console code, which is where printk() generally sends its output, and which is one place where printk() can deadlock. He added a very small safeguard to the code, but the result seemed to be that drivers throughout the kernel would have to be updated to use the new safeguard.
His code was not met with universal acclaim. Alan Cox noticed that Sergey's safeguard added code to the "fast path", a region of code that needs to be as fast and efficient as possible because it runs all the time, many times per second. Slowing down the fast path would slow down the whole system. Alan suggested that, instead, it would be better for the kernel simply not to call printk() wherever the console code would be in a position to deadlock.
Sergey was not in any way satisfied, however. He pointed out that his patch solved real-world problems that users had reported experiencing directly. He didn't see how it would help anything simply to pull out the printk() instances that triggered the problem, especially if those instances were doing important work like reporting on the real reason the system was crashing and so on.
Sergey wanted to keep the printk() instances and implement the safeguards to protect them. However, at this point Linus Torvalds joined the discussion, saying:
The rule is simple: DO NOT DO THAT THEN.
Don't make recursive locks. Don't make random complexity. Just stop doing the thing that hurts.
There is no valid reason why an UART driver should do a printk() of any sort inside the critical region where the console is locked.
Just remove those printks, don't add new crazy locking.
If you had a spinlock that deadlocked because it was inside an already spinlocked region, you'd say "that's buggy".
This is the exact same issue. We don't work around buggy garbage. We fix the bug—by removing the problematic printk.
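The pattern described in the message quoted above can be sketched in user space. This is only an illustration, not kernel code: console_lock, mock_printk(), driver_op_buggy() and driver_op_fixed() are hypothetical names, and a pthread mutex stands in for the console lock. The mock uses a try-lock so that the buggy path reports the would-be deadlock instead of hanging. The "fixed" variant moves the message to after the unlock, which is one possible reading of "remove the problematic printk", not the change anyone in the thread actually proposed.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for the lock that protects the console. */
static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counts calls that would have hung if the lock attempt had blocked. */
static int would_deadlock;

/*
 * Hypothetical stand-in for printk(): it needs the console lock to emit
 * text. Returns 0 on success, -1 if the lock is already held.
 */
static int mock_printk(const char *msg)
{
    if (pthread_mutex_trylock(&console_lock) != 0) {
        would_deadlock++;   /* a blocking lock would never return here */
        return -1;
    }
    fputs(msg, stderr);
    pthread_mutex_unlock(&console_lock);
    return 0;
}

/* Buggy pattern: the driver logs while it already holds the console lock. */
static int driver_op_buggy(void)
{
    int ret;

    pthread_mutex_lock(&console_lock);
    ret = mock_printk("uart: error inside critical region\n");
    pthread_mutex_unlock(&console_lock);
    return ret;
}

/* Assumed fix: note the error inside the region, report it after unlocking. */
static int driver_op_fixed(void)
{
    int saw_error;

    pthread_mutex_lock(&console_lock);
    saw_error = 1;          /* pretend the hardware reported a problem */
    pthread_mutex_unlock(&console_lock);

    return saw_error ? mock_printk("uart: error reported after unlock\n") : 0;
}
```

Calling driver_op_buggy() from a main() function returns -1 and increments would_deadlock, while driver_op_fixed() prints its message and returns 0. In the kernel, the first case would simply hang, at exactly the moment the message was most needed.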
Sergey pointed out that the printk() instances were called from all those drivers he wanted to change. It wasn't a case of some simple part of the kernel having an extra printk(). The drivers all needed to be updated with the safeguard, or they would continue to report the wrong thing.
The conversation ended with no conclusion. It's difficult to know when something should be fixed versus removed. There are all sorts of technical questions that come up, including wondering if the fix is worth all the fuss.