diff -u: What's New in Kernel Development


Containers are very tricky to implement. Trying to isolate sets of resources from each other completely, so that they resemble a discrete system, and doing it in a secure way, has to be addressed on a feature-by-feature basis, with many caveats and uncertainties. Over time, this makes the core kernel code more secure and robust, but each individual feature may have surprising issues.

The whole namespace idea—corralling subsets of system resources like user IDs and group IDs, and performing on-the-fly translations between the resource names within the container and the corresponding names in the outer system—is tough to manage.

Recently, Marian Marinov noticed that process counters in the outer system counted processes as being owned by the same user if his or her UIDs (user IDs) were the same inside two separate containers. The same was true for GIDs (group IDs). He didn't like this, because the two containers represented two logically isolated systems, and in that context, the same UIDs could refer to different users entirely. They shouldn't be counted together.

He wanted to patch the kernel to isolate these values, so that the process counters wouldn't get them mixed up with each other. But, Eric W. Biederman didn't like Marian's idea of creating namespace-specific data structures within the kernel. He offered some workarounds, but Marian didn't like any of them. He said they were less efficient than his idea, and they would require him to put a lot of effort into redesigning his particular use case, which ran a batch of identical containers, each built out of a single master template.

Ultimately, Eric suggested implementing something akin to Marian's idea, but only for certain filesystems. The problem, he said, was that XFS exposed too much of its inner workings to userspace, making it hard to perform the namespace translations correctly. If they limited support only to "normal" filesystems, Eric said, it would be much easier to give Marian what he wanted.

But, James Bottomley pointed out that Linux distributions wouldn't sacrifice XFS for anything. It had already been tried with the USER_NS feature. Distributions wouldn't accept USER_NS unless it supported XFS. James argued that the same would be true here.

Eric replied that the two cases were different. His solution would not preclude using XFS in a Linux distribution; it would only preclude using a particular use case that didn't currently exist anyway. And, Eric also argued that XFS already had serious issues that made it less container-friendly than other filesystems. It was hard to migrate an XFS filesystem to a system with different endianness and word size. This meant one of the common container uses—migrating processes and containers between machines—already was partially off the table for XFS, at least for the moment. That being the case, Eric said, it didn't make sense to go to great lengths to support it in a feature it couldn't use until so many other XFS characteristics had been fixed.

The debate undoubtedly will continue. Ultimately, the question involves identifying where to draw a line between seemingly integrated features of the kernel. What parts of the system can be containerized safely? What parts have to wait until other issues are addressed? In some cases, the end result will be much cleaner kernel code; in other cases, in the short term, much messier.

Some features get so big and complicated that they can't be changed easily anymore. In particular, it becomes harder to fix design flaws, because each fix has to account for all the existing special wonkiness. The printk() function is one example. Its code apparently has become such a nightmare that kernel developers must choose worse solutions to their problems, just in order to avoid a redesign process that is made so difficult by printk()'s current insane implementation.

Recently, Petr Mladek submitted some code to allow calling printk() from within an NMI (non-maskable interrupt). This is useful when a system is in the midst of crashing and needs to output logging data to help the user identify what went wrong. The problem was that printk() needed to take a lock that might be held by another process. That's a big no-no in NMIs, because the whole point of NMIs is that they never can be interrupted by other processes. The printk() would loop forever, waiting for a process to release a lock, when that process would never get the CPU cycles it needed to release that lock. Presto, deadlock.

Petr's code solved this by taking the lock only if available and failing over to an alternate solution if necessary. Overall, Petr's code improved the situation, because users actually were seeing lockups that could be better-diagnosed with printk()s in NMIs. Specifically, Jiri Kosina said, "we've actually seen the lockups triggered by the RCU stall detector trying to dump stacks on all CPUs, and hard-locking the machine up while doing so."

But, as Frédéric Weisbecker put it, the printk() code base was an "ancient design" with "fundamental flaws". Its poor design forced Petr's patch to be 1,000 lines long when such a fix ordinarily might be much smaller (Linus Torvalds later estimated 15 lines as a good size for Petr's features). Frédéric suggested, "shouldn't we rather redesign it to use a lockless ring buffer like ftrace or perf ones?"

Jiri agreed that the printk() code base was "a stinking pile of you-know-what", and that a redesign would be better than Petr's stop-gap patch. But in fact, he said, the correct design was not yet known, but regardless certainly would take a long time to implement and would delay Petr's important fix that addressed real-world crashes. And as Frédéric added, there also was the danger that "if we push back this bugfix, nobody will actually do that desired rewrite."

At some point, Frédéric asked for Linus' opinion, and Linus essentially torpedoed Petr's whole approach. He said:

Printing from NMI context isn't really supposed to work, and we all know it's not supposed to work.

I'd much rather disallow it, and if there is one or two places that really want to print a warning and know that they are in NMI context, have a special workaround just for them, with something that does not try to make printk in general work any better.

Dammit, NMI context is special. I absolutely refuse to buy into the broken concept that we should make more stuff work in NMI context. Hell no, we should not try to make more crap work in NMI. NMI people should be careful.

Make a trivial "printk_nmi()" wrapper that tries to do a trylock on logbuf_lock, and maybe the existing sequence of:

if (console_trylock_for_printk())

then works for actually triggering the printout. But the wrapper should be 15 lines of code for "if possible, try to print things", and not a thousand lines of changes.

Which, Petr said, was exactly what his patch did, but he just needed 1,000 lines of code instead of 15 because of how broken printk() was already. And Jiri said, "I find it rather outrageous that fixing real bugs (leading to hangs) becomes impossible due to printk() being too complex. It's very unfortunate that the same level of pushback didn't happen when new features (that actually made it so complicated) have been pushed; that would be much more valuable and appropriate."

At this point, Paul McKenney offered a compromise. Since Petr's patch was inspired by the RCU (read-copy-update) stall detector using NMIs to dump the stack, and thus needing printk(), Paul could rewrite the RCU code to avoid using NMIs for the stack dump. This way, regular printk() would work, without requiring Petr's patch.

The problem with this was that RCU wouldn't do quite as good a job of dumping the stack data. As Jiri put it, "this is prone to producing not really consistent stacktraces though, right? As the target task is still running at the time the stack is being walked, it might produce stacktraces that are potentially nonsensical."

But, Linus was insistent. He said, "We should stop using nmi as if it was something 'normal'. It isn't. Code running in nmi context should be special, and should be very very aware that it is special. That goes way beyond 'don't use printk'. We seem to have gone way way too far in using nmi context. So we should get rid of code in nmi context rather than then complain about printk being buggy."

So, Paul's solution, even being known to provide worse stack dumps than Petr's, would be adopted, simply because it could avoid making further changes to printk(). Jiri said, "I feel bad about the fact that we are now hostages of our printk() implementation, which doesn't allow for any fixes/improvements. Having the possibility to printk() from NMI would be nice and more robust...otherwise, we'll be getting people trying to do it in the future over and over again, even if we now get rid of it at once."