diff -u: What's New in Kernel Development

Over time, memory can become more and more fragmented on a system, making it difficult to find contiguous blocks of RAM to satisfy ongoing allocation requests. At certain times the running system may compact regions of memory together to free up larger blocks, but Vlastimil Babka recently pointed out that this wasn't done regularly enough to avoid latency problems for code that made larger memory requests.

Vlastimil wanted to create a new per-CPU dæmon, called kcompactd, that would do memory compaction as an ongoing system activity.

The basic objection, voiced by David Rientjes, was that creating a whole new thread on all CPUs carried its own overhead issues. He suggested having one of the other per-CPU threads simply take on the additional memory compaction responsibilities. He identified the khugepaged dæmon as the best candidate.

Vlastimil actually had identified khugepaged as a candidate and rejected it, on the grounds that khugepage dealt only with THP (Transparent HugePages) memory use cases. These are an abstraction layer above regular memory allocation, so it wouldn't cover all possible cases, only user code that dealt with THPs.

David argued that THP allocations were where most compaction problems occurred, and that other allocation systems, like the SLUB allocator (used for highly efficient kernel allocations), were not part of the problem.

Eventually, it came out that David actually envisioned two forms of memory compaction. The first would be a periodic compaction effort that would happen regardless of the state of RAM. The second would be a compaction effort that would be triggered when particular regions of RAM were detected as being overly fragmented. By splitting these two forms of memory compaction from each other, David felt it would be possible to piggy-back various pieces of functionality onto different existing threads and avoid having to create a new per-CPU thread in the kernel.

A final design did not get hashed out during the discussion, but no one seemed to be saying that memory compaction itself was a bad goal. The question always was how to implement it. Mel Gorman even suggested that a fair bit of the work could be done from user space, via the SysFS interface. But, that idea wasn't explored during the discussion, so it seems that only technical obstacles could get in the way of some background memory compaction going into the kernel.

One problem with enabling the CONFIG_TRACING option in the kernel, as Tal Shorer recently pointed out, is that it would enable absolutely every tracepoint, causing a significant performance penalty. It made more sense, he felt, to allow users to enable tracepoints on just the subsystems they were interested in testing.

He posted a patch to do this. Or rather, he posted a patch to ditch the old system and allow users to enable tracepoints on only the GPIO subsystem. He picked GPIO, he said, as a test case. If it met with approval, he offered to submit patches for all the rest of the subsystems.

Because of the overall kernel development cycle, it took a couple weeks for his patches to make the rounds. The new merge window had opened, so folks like Steven Rostedt, who ordinarily would be the one looking over Tal's submission, were too busy for any new code, until after the merge window had closed again.

Once that was ironed out though, he seemed to have mostly positive feedback for Tal. It looks as though tracepoints soon will be per subsystem, rather than kernel-wide.

In order to allow non-root users to write good system monitoring software, Prarit Bhargava wanted to expose the MSR values to non-root users, on a read-only basis. MSR registers are an Intel-specific set of registers that Intel originally intended for its own debugging purposes and made no guarantees that future chipsets would provide the same values. But over time, those registers seem to have coalesced around debugging and monitoring features, and the Linux kernel does expose them to the root user.

Prarit felt that allowing read-only access would avoid any potential security issues, because users would be unable to mess around with the actual values of those registers. But as other folks pointed out, the dangers of the MSR registers also existed in the kernel objects and regions of memory they exposed. Even read-only access could be sufficient for a hostile user to gain enough information to attack a running system successfully.

So, working with Andy Lutomirski and Pavel Machek, Prarit wrote a PMU (Performance Monitoring Unit) driver that would expose only a specifically whitelisted set of MSR data to users. This way, they could write their system monitoring software without risking a new attack, and if Intel changed the MSR registers in future chips, the changes would need to be vetted and the whitelist would need to be updated before any of that altered data would be exposed to regular users.

As an example of the importance of this particular issue, Len Brown mentioned during the discussion that Lawrence Livermore National Laboratory was keenly interested in the design and outcome of Prarit's efforts. The folks there wanted a secure and efficient way to access those MSR registers, and this work would provide it.