diff -u: What's New in Kernel Development


Hardware errors are tough to code for. In some cases, they're impossible to code for. A particular brand of hardware error is the Machine-Check Exception (MCE), which means the CPU has detected a hardware problem. On Windows systems, it's one of the causes of the Blue Screen of Death.

Everyone wants to handle hardware errors well, because good handling can mean the difference between getting some indication of what actually went wrong and getting no information at all.

Andy Lutomirski recently suggested some code to clean up non-maskable interrupts (NMIs), which also typically indicate some sort of hardware failure. But over the course of discussion, folks raised questions about how to handle various cases, such as an MCE arriving immediately after an NMI. Typically NMIs are not interruptible by any other code, but should an exception be made for MCEs? If the OS detects a CPU error while already processing another hardware error, should it defer to the more pressing CPU issue or not?

There was a bit of debate, but ultimately Linus Torvalds said that an MCE meant that the system was dead. Any attempt to handle that in software, he said, was just in order to crash as gracefully as possible. But he felt that the kernel should not make any complicated effort in that case, since the end result would just be the same crash. Deadlocks, race conditions and other issues that normally would be important simply didn't matter in this case. Make a best effort to log the event, he said, and forget the rest.

Elsewhere, he elaborated more vociferously, saying, "MCE is frankly misdesigned. It's a piece of shit, and any of the hardware designers that claim that what they do is for system stability are out to lunch. This is a prime example of what not to do, and how you can actually spread what was potentially a localized and recoverable error, and make it global and unrecoverable." And he added:

Synchronous MCEs are fine for synchronous errors, but then trying to turn them "synchronous" for other CPUs (where they weren't synchronous errors) is a major mistake. External errors punching through irq context is wrong, punching through NMI is just inexcusable.

If the OS then decides to take down the whole machine, the OS—not the hardware—can choose to do something that will punch through other CPUs' NMI blocking (notably, init/reset), but the hardware doing this on its own is just broken if true.

Tony Luck pointed out that Intel actually was planning to fix this in future machines, although he acknowledged that turn-around time for chips was likely to be very long. However, as Borislav Petkov pointed out, even after the fix went in, Linux still would need to support the bad hardware.

Container security is a tightrope walk, and not everyone agrees on how to walk it. One group believes that containers should be able to do whatever an independent system could do. Another group believes that certain abilities render the container inherently insecure. The first group says that without these features, the container isn't truly offering a complete environment. The second group says that's how the cookie crumbles.

Seth Forshee recently posted some patches to allow containerized systems to see hot-plugged devices, just the way a non-containerized system could. But this, apparently, was a bridge too far. Greg Kroah-Hartman said he had long since expressed a clear policy against adding namespaces to devices, and that was exactly how Seth's code made the hot-plugged devices visible to the containerized system.

It turns out that there are valid use cases for wanting a containerized system to be able to see hot-plugged devices. Michael H. Warfield described one, and Seth described his own: he needed hot-plug support in order to implement loopback devices within the container.

Greg said loopback support in a container was a very bad idea, since it provided all sorts of opportunities to leak data out of the container and into the host system—a security violation.

He said this was not a "normal" use case for containers. Serge Hallyn replied that any feature used by a non-containerized system was a "normal" use case for containerized systems.

Serge argued that these features inevitably would go into containers. There was no way to keep them out. As long as containers excluded features that were included in non-containerized systems, there would be people with an incentive to bridge the gap. Why not bridge it now and fix the bugs as they showed up?

But Richard said, "There are so many things that can hurt you badly. With user namespaces, we expose a really big attack surface to regular users. [...] I agree that user namespaces are the way to go, all the papering with LSM over security issues is much worse. But we have to make sure that we don't add too many features too fast."

Greg added that Seth's code was too hacky, implementing just what Seth needed rather than addressing the overarching issue of how to handle namespaces properly within a container.

Greg also said he supported loopback devices within containers, but he and James Bottomley said that the security issues were real, and the implementation had to take account of them. It wasn't enough simply to implement the feature and then fix bugs. The feature needed a proper design that addressed these concerns.