diff -u: Speeding Up the Un-Speed-Up-able
Sometimes kernel developers can work in parallel for years without realizing it. It's one of the inefficiencies of a distributed system that tends to work out as a benefit when you have enough contributors to be able to spare the extra labor—it's sort of a "with enough eyeballs, all bugs are shallow" kind of thing.
This time Kirill A. Shutemov wanted to scratch a particular itch in the memory encryption code. Memory encryption sits on such a hot path in the kernel that a difference of just a couple of opcodes can make a noticeable speed difference to a running system. Anything short of introducing security holes would be worth considering in order to shave off those few assembly instructions.
But Kirill had a problem: his code made things worse. He wanted to convert the __PHYSICAL_MASK variable into what he called a "patchable constant", a value that had the simplicity of a constant at runtime, but that could be set dynamically by configuration options or on the bootloader command line before booting the system.
So far, he had come up with a patch that did this, and that would be applicable to other variables that might be faster to implement as constants. Unfortunately, although the patch achieved the feature he wanted, it caused GCC to produce less efficient code than before. Instead of speeding things up, it resulted in a 0.2% slowdown on his test system.
He posted his patch, along with a request for ideas, to the linux-kernel mailing list.
Linus Torvalds replied with an analysis of the opcodes produced by GCC, showing why they were slower and why Kirill's general approach would always suffer from the slowdown issue. In particular, Linus pointed out that Kirill's patch moved constant values into registers, which would never be optimal, and his code also included a movabsq instruction, which he felt was rarely needed.
Linus said he actually really wanted this feature in spite of the complex overhead required to accomplish it; in fact, he preferred an even more complex approach, along the lines of something H. Peter Anvin had attempted about 18 months earlier. Of that approach, he said:
It's rather more complex, but it actually gives a much bigger win. The code itself will be much better, and smaller. The *infrastructure* for the code gets pretty hairy, though.
Peter replied that he'd actually been working on this very project for some time from a secret, undisclosed location at the bottom of a well in a town with no name. In fact, his current approach was yet more ambitious (and complex) than what he'd tried 18 months earlier. On the other hand, as he put it, he was now about 85% done with the whole thing, and the code "mostly needs cleanup, testing, and breaking into reasonable chunks." He added:
The main reason I haven't submitted it yet is that I got a bit overly ambitious and wanted to implement a whole bunch of more complex subcases, such as 64-bit shifts on a 32-bit kernel. The win in that case is actually quite huge, but it requires data-dependent code patching and not just immediate patching, which requires augmentation of the alternatives framework.
So it looks like Kirill's itch is about to be scratched...by Peter. A piece of the kernel so deep and tangled that it seemed destined never to be touched is being pulled apart beyond its apparent natural limits. And all of this is achieved by the mere addition of a big honking pile of insanely complex code that even seasoned developers balked at touching.
Note: If you're mentioned above and want to post a response above the comment section, send a message with your response text to email@example.com.