Obsolete Microkernel Dooms Mac OS X to Lag Linux in Performance

by Miles Nordin

Disagreements exist about whether or not microkernels are good. It's easy to get the impression they're good because they were proposed as a refinement after monolithic kernels. Microkernels are mostly discredited now, however, because they have performance problems, and the benefits originally promised are a fantasy.

The microkernel zealot believes that several cooperating system processes should take over the monolithic kernel's traditional jobs. These several system processes are isolated from each other with memory protection, and this is the supposed benefit.

Monolithic kernels circumscribe the kernel's definition and implementation as "the part of the system that would not benefit from memory protection".

When I state the monolithic design's motivation this way, it's obvious who I believe is right. I think microkernel zealots are victims of an overgeneralization: they come to UNIX from legacy systems such as Windows 3.1 and Mac OS 6, which deludes them into the impression that memory protection everywhere is an abstract, unquestionable Good. It's sort of like the common mistake of believing in rituals that supposedly deliver more security, as if security were a one-dimensional concept.

Memory protection is a tool, and it has three common motivations:

  1. to help debug programs under development with less performance cost than instrumentation. (Instrumentation is what Java or Purify uses.) The memory protection hopefully makes the program crash nearer to the bug than without it, while instrumentation is supposed to make the program crash right at the bug.

  2. to minimize the inconvenience of program crashes.

  3. to keep security promises even when programs crash.

''Because MS-DOS doesn't have it and MS-DOS sucks'' is not a motivation for memory protection.

Motivation #1 is a somewhat legitimate argument for the additional memory protection in microkernel systems. For example, QNX developers can debug device drivers and regular programs with the same debugger, making QNX drivers easier to write. QNX programmers are neat because drivers are so easy for them to write that they don't seem to share our idea of what a driver is; they think everything that does any abstraction of hardware is a driver. I think the good debugging tools for device drivers maintain QNX as a commercially viable Canadian microkernel. Their claims about stability of the finished product become suspicious to any developer who actually starts working with QNX; the microkernel benefits are all about ease of development and debugging.

Motivation #2 is silly. A real microkernel in the field will not recover itself when the SCSI driver process or the filesystem process crashes. Granted, if there's a developer at the helm who can give it a shove with some special debugging tool, it might, but that advantage is really more like that stated in motivation #1 than #2.

Since microkernel processes cooperate to implement security promises, the promises are not necessarily kept when one of the processes crashes. Therefore motivation #3 is also silly.

These three factors together show that memory protection is not very useful inside the kernel, except perhaps for kernel developers. That's why I claim the microkernel's promised benefits are a fantasy.

Before we move on, I should point out that the two microkernel systems, Mach and QNX, have different ideas about what is micro enough to go into the microkernel. In QNX, only message passing, context switching and a few process scheduling hooks go into the microkernel. QNX drivers for the disk, the console, the network card and all the hardware devices are ordinary processes that show up next to the user's programs in sin or ps. They obey kill, so if you want, you can kill them and crash the system.

Mach, which Apple has adopted for Mac OS X, puts anything that accesses hardware into the microkernel. Under Mach's philosophy, XFree86 still shouldn't be a user process. In the single server abuses of microkernels, like mkLinux, the Linux process made a system call (not message passing) into Mach whenever it needed to access any Apple hardware, so the filesystem's implementation would be inside the Linux process, but the disk drivers are inside the Mach microkernel. This arrangement is a good business argument for Apple funding mkLinux: all the drivers for their proprietary hardware, thus much of the code they funded, stays inside Mach, where it's covered by a more favorable (to them) license.

However, putting Mach device drivers inside the microkernel substantially kills QNX's motivation #1 because Mach device drivers are now as hard to debug as a monolithic kernel's device drivers. I'm not sure how Darwin's drivers work, but it's important to acknowledge this dispute about the organization of real microkernel systems.

What about the performance problem? In short, modern CPUs optimize for the monolithic kernel. The monolithic kernel maps itself into every user process's virtual memory space, but these kernel pages are marked somehow so that they're only accessible when the CPU's supervisor bit is set. When a process makes a system call, the CPU implicitly sets and unsets the supervisor bit when the call enters and returns, so the kernel pages are appropriately lit up and walled off by flipping a single bit. Since the virtual memory map doesn't change across the system call, the processor can retain all the map fragments that it has cached in its TLB.

With a microkernel, almost everything that used to be a system call now falls under the heading "passing a message to another process". In this case, flipping a supervisor bit is no longer enough to implement the memory protection, as a single user process's system calls involve separate memory maps for 1 user process + 1 microkernel + n system processes, but a single bit has enough states for only two maps. Instead of using the supervisor bit trick, the microkernel must switch the virtual memory map at least twice for every system-call-equivalent; once from the user process to the system process, and once again from the system process back to the user process. This requires more overhead than flipping a supervisor bit; there's more overhead to juggle the maps, and there are also two TLB flushes.

A practical example might involve even more overhead since two processes is only the minimum involved in a single system-call-equivalent. For example, reading from a file on QNX involves a user process, a filesystem process and a disk driver process.

What is the TLB flush overhead? The TLB stores small pieces of the virtual-to-physical map so that most memory access ends up consulting the TLB instead of the definitive map stored in physical memory. Since the TLB is inside the CPU, the CPU's designers arrange that TLB consultations shall be free.

All the information in the TLB is a derivative of the real virtual-to-physical map stored in physical memory. The TLB can represent one virtual-to-physical mapping, but the whole point of memory protection is to give each process a different virtual-to-physical mapping, thus reserving certain blocks of physical memory for each process. The virtual-to-physical map stored in physical memory can represent this multiplicity of maps, but the map-fragment represented in the high-speed hardware TLB can represent only one mapping. That's why switching processes involves TLB flushing.

Once the TLB is flushed, it becomes gradually reloaded from the definitive map in physical memory as the new process executes. The TLB's gradual reloading, amortized over the execution of each newly-awakened process, is overhead. It therefore makes sense to switch between processes as seldom as possible and make maximal use of the supervisor bit trick.

Microkernels also harm performance by complicating the current trend toward zero copy design. The zero copy aesthetic suggests that systems should copy around blocks of memory as little as possible. Suppose an application wants to read a file into memory. An aesthetically perfect zero copy system might have the application mmap(..) the file rather than using read(..). The disk controller's DMA engine would write the file's contents directly into the same physical memory that is mapped into the application's virtual address space. Obviously it takes some cleverness to arrange this, but memory protection is one of the main obstacles. The kernel is littered conspicuously with comments about how something has to be copied out to userspace. Microkernels make eliminating block copies more difficult because there are more memory protection barriers to copy across and because data has to be copied in and out of the formatted messages that microkernel systems pass around.

Existing zero copy projects in monolithic kernels pay off. NetBSD's UVM is Chuck Cranor's rewrite of virtual memory under the zero copy aesthetic. UVM invents page loanout and page transfer functions that NetBSD's earlier VM lacked. These functions embody the zero copy aesthetic because they sometimes eliminate the kernel's need to copy out to userspace, but only when the block that would have been copied is big enough to span an entire VM page. Some of his speed improvement no doubt comes from cleaner code, but the most compelling part of his PhD thesis discusses saving processor cycles by doing fewer bulk copies.

VxWorks is among the kernels that boasted zero copy design earliest, with its TCP stack. They're probably motivated by reduced memory footprint, but their zero copy stack should also be faster than a traditional TCP stack. Applications must use the zbuf API to experience the benefit, not the usual Berkeley sockets API. For comparison, VxWorks has no memory protection, not even between the kernel and the user's application.

BeOS implements its TCP stack in a user process, microkernel-style, and so does QNX. Both have notoriously slow TCP stacks.

Zero copy is an aesthetic, not a check-box press release feature, so it's not as simple as something a system can possess or lack. I suspect the difference between VxWorks's and QNX's TCP stack is one of zero copy vs. excessive copies.

The birth and death of microkernels didn't happen overnight, and it's important to understand that these performance obstacles were probably obvious even when microkernels were first proposed. Discrediting microkernels required actually implementing them, optimizing message-passing primitives, and so on.

It's also important not to laugh too hard at QNX. It's somewhat amazing that one can write QNX drivers at all, much less do it with unusual ease, given that their entire environment is rigidly closed-source.

However, I think we've come to a point where the record speaks for itself, and the microkernel project has failed. Yet this still doesn't cleanly vindicate Linux merely because it has a monolithic kernel. Sure, Linux need no longer envy Darwin's microkernel, but the microkernel experiment serves more generally to illustrate the cost of memory protection and of certain kinds of IPC.

If excessive switching between memory-protected user and system processes is wasteful, then might not also excessive switching between two user processes be wasteful? In fact, this issue explains why proprietary UNIX systems use two-level thread architectures that schedule many user threads inside each kernel thread. Linux stubbornly retains one-level kernel-scheduled threads, like Windows NT. Linux could perform better by adopting proprietary UNIX's scheduler activations or Masuda and Inohara's unstable threads. This performance issue is intertwined with the dispute between the IBM JDK's native threads and the Blackdown JDK's optional green threads.

Given how the microkernel experiment has worked out, I'm surprised by Apple's quaint choice to use a microkernel in a new design. At the very least, it creates an opportunity for Linux to establish and maintain performance leadership on the macppc platform. However, I think the most interesting implications of the failed microkernel experiment are the observations it made about how data flows through a complete system, rather than just answering the obvious question about how big the kernel should be.

Miles Nordin is a grizzled FidoNet veteran and an activist with Boulder 2600 (the 720) currently residing in exile near the infamous Waynesboro Country Club in sprawling Eastern Pennsylvania.

Load Disqus comments