diff -u: What's New in Kernel Development


Sometimes it's necessary to change function semantics inside the kernel, and then find and update all users of that function to match the new semantics. Such changes can result in huge patches going into the source tree, affecting hundreds of files.

Al Viro wanted to make a change like that to a bunch of memory-handling routines. He'd noticed that the existing memory allocation tools all returned plain numbers, which users then had to convert to pointers in the vast majority of cases. Al posted a mega-whopper patch, making those functions all return pointers instead of plain numbers.

Linus Torvalds didn't like it though. One of the problems with those immense semantic-changing patches, he said, was that backporting other unrelated patches became more difficult. Each patch that needed to be backported (security fixes, new drivers and so on) would have to be reworked significantly just to get across the barrier of Al's changes. That would be time-consuming for the developer and would increase the likelihood of new bugs, and Al's change didn't seem to carry enough value to justify the cost.

The way to go about it, Linus said, was to create all new functions, with new names, with the new semantics, and let the various parts of the kernel switch over to the new calls as they pleased. But, even that seemed hard to justify to him.
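Linus's suggestion can be sketched in ordinary userland C. This is only an illustration, not kernel code: legacy_get_page(), get_page_ptr() and put_page_ptr() are hypothetical names, the page size is assumed, and aligned_alloc() stands in for a real page allocator. The old number-returning call keeps its semantics, while a new call under a new name returns a pointer, so callers can switch over one at a time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Old-style call (hypothetical): hands back the address as a plain number,
 * or 0 on failure, so most callers must cast it before use. */
uintptr_t legacy_get_page(void)
{
    return (uintptr_t)aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
}

/* New call under a new name (hypothetical): same memory, but typed as a
 * pointer. Existing users of legacy_get_page() are left untouched. */
void *get_page_ptr(void)
{
    return (void *)legacy_get_page();
}

/* Release a page obtained from either call. */
void put_page_ptr(void *page)
{
    assert((uintptr_t)page % SKETCH_PAGE_SIZE == 0);
    free(page);
}
```

Because the old function never changes, a patch written against it still applies cleanly to older trees, which is the backporting property Linus was worried about.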

Ultimately, Al dropped his big patch and posted a new set of guidelines for memory allocation that would help users resolve questions of which functions to use in which circumstance:

1) Most of the time kmalloc() is the right thing to use. Limitations: alignment is no better than word, not available very early in bootstrap, allocated memory is physically contiguous, so large allocations are best avoided.

2) kmem_cache_alloc() allows to specify the alignment at cache creation time. Otherwise it's similar to kmalloc(). Normally it's used for situations where we have a lot of instances of some type and want dynamic allocation of those.

3) vmalloc() is for large allocations. They will be page-aligned, but *not* physically contiguous. OTOH, large physically contiguous allocations are generally a bad idea. Unlike other allocators, there's no variant that could be used in interrupt; freeing is possible there, but allocation is not. Note that non-blocking variant *does* exist - __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL) can be used in atomic contexts; it's the interrupt ones that are no-go.

4) If it's very early in bootstrap, alloc_bootmem() and friends may be the only option. Rule of the thumb: if it's already printed 'Memory: ...../..... available.....' you shouldn't be using that one. Allocations are physically contiguous and at that point large physically contiguous allocations are still OK.

5) if you need to allocate memory for DMA, use dma_alloc_coherent() and friends. They'll give you both the virtual address for your use and DMA address referring to the same memory for use by device; do *NOT* try to derive the latter from the former; use of virt_to_bus() et.al. is a Bloody Bad Idea(tm).

6) If you need a reference to struct page, use alloc_page/alloc_pages.

7) In some cases (page tables, for the most obvious example), __get_free_page() and friends might be the right answer. In principle, it's case (6), but it returns page_address(page) instead of the page itself. Historically that was the first API introduced, so a _lot_ of places that should've been using something else ended up using that. Do not assume that being lower level makes it faster than e.g. kmalloc() - this is simply not true.
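One way to read the seven guidelines is as a decision procedure. The sketch below is a userland toy, not kernel code: struct alloc_need, enum allocator and pick_allocator() are hypothetical names invented here, and the order in which the rules are checked is an assumption, since Al's list does not rank them. It also leaves out context restrictions such as vmalloc() not being usable for allocation in interrupt context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of what a caller needs; not a kernel type. */
struct alloc_need {
    bool early_boot;        /* before the "Memory: ... available" message */
    bool for_dma;           /* a device will access the memory */
    bool want_struct_page;  /* caller needs a struct page reference */
    bool want_page_address; /* e.g. page tables */
    bool large;             /* too big to ask for physically contiguous */
    bool many_same_type;    /* many objects of one type, or custom alignment */
};

enum allocator {
    USE_KMALLOC,
    USE_KMEM_CACHE_ALLOC,
    USE_VMALLOC,
    USE_ALLOC_BOOTMEM,
    USE_DMA_ALLOC_COHERENT,
    USE_ALLOC_PAGES,
    USE_GET_FREE_PAGE
};

/* Map a need onto the allocator the guidelines point to.
 * The checking order is an assumption made for this sketch. */
enum allocator pick_allocator(const struct alloc_need *need)
{
    assert(need != NULL);
    if (need->early_boot)        return USE_ALLOC_BOOTMEM;      /* rule 4 */
    if (need->for_dma)           return USE_DMA_ALLOC_COHERENT; /* rule 5 */
    if (need->want_struct_page)  return USE_ALLOC_PAGES;        /* rule 6 */
    if (need->want_page_address) return USE_GET_FREE_PAGE;      /* rule 7 */
    if (need->large)             return USE_VMALLOC;            /* rule 3 */
    if (need->many_same_type)    return USE_KMEM_CACHE_ALLOC;   /* rule 2 */
    return USE_KMALLOC;                                         /* rule 1 */
}
```

The fall-through to kmalloc() mirrors the first guideline: most of the time it is the right thing to use, and the other allocators are for callers with a specific reason.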

System calls notoriously have insufficient error reporting. Some take lots of inputs, and if any of them are wrong in any way, or fail some obscure bounds check, the call returns "EINVAL" for invalid data, but doesn't give any other clue about which piece of data had the problem, or what the value was, or where in the code the problem occurred.

Alexander Shishkin recently tried to implement a solution to this. The real issue though is that the kernel can't simply change the way system calls handle return values. There's code all through the kernel and in userland that depends upon the current behavior. Any solution, therefore, would somehow have to provide additional reporting information, without changing the way existing calling routines received syscall return values.

Alexander's technique took advantage of the fact that system calls generally were processed through a set of macros before sending their return values back to the calling routines. By designing an entirely new set of return values for the actual system calls, Alexander's code could reference an error message holding tank that the macros would be able to process while still returning the originally intended error code to the calling routine.

The macros would place a pointer to the detailed error reports, in JSON format, into the task_struct data structure, where it could be retrieved by the calling routines, using a prctl() call.
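The general shape of that mechanism can be mocked up in userland C. Everything named here is hypothetical and is not Alexander's actual code: ERR_DETAIL(), record_err_detail(), mock_sys_configure() and mock_fetch_err_detail() are invented for illustration, a single static pointer stands in for the slot in task_struct, and the fetch function stands in for the prctl() query. Clearing the slot when it is read is a further assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the slot in task_struct; a real kernel would keep one
 * per task rather than one per program. */
static const char *last_err_detail;

/* Remember the detailed report, then hand back the usual negative
 * error code so existing callers see exactly what they saw before. */
static long record_err_detail(int code, const char *json)
{
    assert(code > 0 && json != NULL);
    last_err_detail = json;
    return -code;
}

/* Hypothetical macro wrapping the error return path. */
#define ERR_DETAIL(code, json) record_err_detail((code), (json))

/* Mock "system call" with two bounds checks that both yield EINVAL. */
long mock_sys_configure(int width, int depth)
{
    if (width <= 0)
        return ERR_DETAIL(EINVAL,
            "{\"arg\":\"width\",\"reason\":\"must be positive\"}");
    if (depth > 32)
        return ERR_DETAIL(EINVAL,
            "{\"arg\":\"depth\",\"reason\":\"exceeds 32\"}");
    return 0;
}

/* Stand-in for the prctl() query: return the stored report, if any,
 * and clear the slot (clear-on-read is an assumption of this sketch). */
const char *mock_fetch_err_detail(void)
{
    const char *detail = last_err_detail;
    last_err_detail = NULL;
    return detail;
}
```

A caller that knows nothing about the extra channel still just sees -EINVAL. A caller that does know can follow a failed call with mock_fetch_err_detail() to learn which argument was rejected.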

Jonathan Corbet, however, had strong doubts about this approach. For one thing, if the calling routine didn't actively query and reset the new debugging data, that data would just sit in the task_struct, getting stale. Yet clearing out the debugging data automatically would defeat the purpose of placing it there in the first place.

Johannes Berg also pointed out that with Alexander's changes in effect, applications that expected the new debugging data to be available could break if they had to run on older kernels.

Ultimately, Alexander's approach was not adopted, although no better idea emerged. It's a thorny and persistent problem. It's not clear that any solution will be able to answer all objections, but maybe something will be able to answer more objections than the status quo.

There's a Y2038 bug in Linux. On January 19, 2038, the signed 32-bit UNIX timestamp overflows and wraps around to a negative value. Since Linux basically runs the known universe these days, the bug has to be dealt with, probably by updating the timestamp to hold a 64-bit value. Deepa Dinamani posted some patches to do that, but the problem didn't end there.
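The rollover itself is easy to reproduce in userland C. The sketch below assumes a signed 32-bit seconds counter, as time_t traditionally was on 32-bit Linux, and a host whose own time_t is 64 bits wide so that gmtime() can show dates on both sides of the limit. The helper names tick32(), tick64() and utc_from_secs() are hypothetical, and none of this is taken from Deepa's patches.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Advance a signed 32-bit seconds counter by one. The wrap from the
 * largest value to the smallest is written out explicitly, because
 * signed overflow is undefined behavior in C. */
int32_t tick32(int32_t now)
{
    return now == INT32_MAX ? INT32_MIN : now + 1;
}

/* The same tick on a 64-bit counter simply keeps counting. */
int64_t tick64(int64_t now)
{
    return now + 1;
}

/* Break a seconds-since-epoch value into UTC calendar fields.
 * Assumes the host's time_t can hold the value passed in. */
struct tm utc_from_secs(int64_t secs)
{
    time_t t = (time_t)secs;
    struct tm *utc = gmtime(&t);
    assert(utc != NULL);
    return *utc;
}
```

Feeding INT32_MAX to utc_from_secs() gives 2038-01-19 03:14:07 UTC, the last second a signed 32-bit counter can represent. One tick32() later the counter reads as 1901-12-13 20:45:52 UTC, while tick64() moves on to 03:14:08 as expected.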

The solution had to account for a wide range of possibilities. For example, each different filesystem (NFS, ext4, FUSE and so on) needed its own hand-crafted Y2038 bugfix. It wasn't the kernel alone that needed the fix. Also, after the year 2038, even if all the filesystems had their own fixes, how would a user be able to mount an older filesystem instance that did not have the fix in place? That needed to be solved as well. Additionally, there were corporate interests to consider. Certain service contracts would require a Y2038 fix to be in place, perhaps decades before the bug actually would hit.

Overall, it's going to be a lot of work. Arnd Bergmann, Dave Chinner and Deepa had a long technical conversation about the ins and outs, but the clearest sense of direction to emerge from the discussion was that they should ignore everything that wasn't directly relevant, and they should hew off as many smaller chunks to solve as they possibly could in the hopes that the main chunk might get easier and more manageable.