diff -u: What's New in Kernel Development
Sometimes a new piece of code turns out to be more useful than its author suspected. Alejandra Morales recently came out with the Cryogenic Project as part of his Master's thesis, supervised by Christian Grothoff. The idea was to reduce energy consumption by scheduling input/output operations in batches.
This idea turned out to be so good that H. Peter Anvin didn't want Cryogenic to be a regular driver, he wanted it to be part of the core Linux input/output system. On the other hand, he also felt that the programmer interface needed to be cleaned up and made a bit sleeker.
Pavel Machek also was highly impressed and remarked that this could save power on devices like phones and tablets that were always running low. And, Christian confirmed that this was one of the main goals of the code.
Christian added that power savings seemed to be on the order of 10%, though that number could be tweaked up or down by increasing or decreasing the amount of delay that was tolerable for each data packet.
David Lang also liked Cryogenic and agreed that it should go into the core input/output system. He added that a lot of other folks had attempted to accomplish similar things. It was a highly desired feature in the kernel. David also pointed out that in order to get into the core input/output system, the Cryogenic code would have to demonstrate that it had no performance impact on code that did not use its features, or that the impact would be minimal.
Luis R. Rodriguez recently pointed out that a lot of drivers were routinely backported to a large array of older kernels, all the way down to version 2.6.24. And although he acknowledged that this was currently manageable, he expected the number of drivers and other backportable features to continue to increase, making the situation progressively more difficult to sustain.
Luis said the kernel folks should do more to educate users about the need to upgrade. But, he also wrote up a recommendation that the kernel folks use Coccinelle to automate the backporting process (http://www.do-not-panic.com/2014/04/automatic-linux-kernel-backporting-with-coccinelle.html).
Coccinelle is a tool used to transform source code programmatically. It can be used to generate changes to earlier kernel code to match the functionality provided by newer patches. That's so crazy, it just might work!
But to get started, Luis wanted to draw a line between kernels that would receive backports and kernels that would not. Hopefully, that line would only move forward. So he asked the linux-kernel mailing list members in general, which were the earliest kernels they really needed to keep using.
As it turned out, Arend van Spriel knew of Broadcom WLAN testers that still relied on Fedora 15, running the 2.6.38 kernel. He said he was working with them to upgrade to Fedora 19 and the 3.13 kernel, but that this hadn't happened yet.
So it appears that a certain amount of backporting will become automated, but of course, the Coccinelle transformations still would need to be written and maintained by someone, which is why Luis wanted to limit the number of target kernels.
It turns out Windows does certain things better than Linux—for example, in the area of rebooting. Apparently, there are several techniques that can be done in software to cause a system to reboot. But in some cases, the Linux system will go down successfully, and then not come up again. This is a problem, for example, in server farms with minimal human staff. If 20 systems are acting up and you want to reboot them all, it's easier to give a single command from a remote terminal, than to send a human out into the noise and the cold to press each reset button by hand.
One rebooting technique involves sending certain values to the 0xCF9 port on the system. Another is to use the EFI (Extensible Firmware Interface) BIOS replacement from Intel. Depending on the circumstances, one or the other rebooting technique is preferred, but the logic behind that selection can be tricky. In particular, changing the state of various pieces of hardware can change the appropriate reboot technique. So, if you run through a series of reboot attempts, and somehow change hardware state along the way, you can find that none of the attempts can succeed.
The cool thing about this particular bug is the straightforward way Linus Torvalds said that Windows must be doing something right, and the Linux people needed to figure out what that was so Linux could do it right too.
Steven Rostedt pointed out the boot failure in one of his systems, and this triggered the bug hunt. Part of the problem is that it's very difficult to understand exactly what's going on with a system when it boots up. Strange magical forces are apparently invoked.
During the course of a somewhat heated debate, Matthew Garrett summed up what he felt was the underlying issue, and why the problem was so difficult to solve. In response to any of the various bootups attempted, he said, "for all we know the firmware is running huge quantities of code in response to any of those register accesses. We don't know what other hardware that code touches. We don't know what expectations it has. We don't know whether it was written by humans or written by some sort of simulated annealing mechanism that finally collapsed into a state where Windows rebooted."
Matthew was in favor of ditching the 0xCF9 bootup technique entirely. He argued, "We know that CF9 fixes some machines. We know that it breaks some machines. We don't know how many machines it fixes or how many machines it breaks. We don't know how many machines are flipped from a working state to a broken state whenever we fiddle with the order or introduce new heuristics. We don't know how many go from broken to working. The only way we can be reasonably certain that hardware will work is to duplicate precisely what Windows does, because that's all that most vendors will ever have tested."
But, Linus Torvalds felt that ditching CF9 was equivalent to flailing at the problem. In the course of discussion he said, "It would be interesting if somebody can figure out exactly what Windows does, because the fact that a lot of Dell machines need quirks almost certainly means that it's us doing something wrong. Dell doesn't generally do lots of fancy odd things. I pretty much guarantee it's because we've done something odd that Windows doesn't do."
The discussion had no resolution—probably because it's a really tough problem that hits only a relatively small number of systems. Apparently the bug hunt—and the debate—will continue.