Kernel Korner - Kernel Mode Linux for AMD64

When user code runs inside the kernel, system calls become function calls, 50 times faster. How does that affect the performance of a real application, MySQL?
How to Speed Up System Call Invocations

In KML for IA-32, system call invocations are translated automatically into fast, direct function calls without modifying user programs. This is possible because the recent GNU C Library for IA-32 has a mechanism to choose one of several methods that the kernel provides for system call invocation, and the KML provides direct function calls as one way of invoking system calls.

However, the GNU C Library for AMD64 doesn't have such a mechanism for choosing among methods of system call invocations. Therefore, I created a patch for the GNU C Library. With the patch, kernel-mode user processes can invoke system calls rapidly, because the invocations automatically are translated to function calls. The patch is available from the KML site (see Resources).

What Kernel-Mode User Processes Can Do

One of advantages of KML is the kernel-mode user processes are almost the same as usual user processes except for their privilege level. That is, kernel-mode user processes can do almost anything that ordinary user processes can do. For example, kernel-mode user processes can invoke all system calls. This means they can use filesystems. They also can call open, read, write and other functions, including network systems, with socket, connect and bind. They even can create processes and threads with fork, clone and execve. In addition, they have their own memory address space that they can access freely. Even if a kernel-mode user process uses tons of memory, the kernel pages out the memory.

Moreover, the scheduling mechanism and the signal mechanism of the original Linux kernel work for the kernel-mode user processes. You can check this by executing the following commands:

% cp /usr/bin/yes /trusted/bin
% /trusted/bin/yes

You should notice that your system does not hang. This is true, because the kernel's scheduler preempts the kernel-mode yes and gives CPU time to other processes. You can stop the kernel-mode yes by sending Ctrl-C. This means the kernel can interrupt the kernel-mode yes and send a signal to kill it.

What Kernel-Mode User Processes Cannot Do

As described in the previous section, kernel-mode user processes are ordinary user processes and can perform almost every task that user processes can perform. However, there are a few exceptions:

  1. Kernel-mode user processes cannot modify their GS segment register, because KML uses the GS segment register internally to eliminate the overhead of SWAPGS instruction.

  2. 32-bit binaries cannot be executed in kernel mode on AMD64. KML for AMD64, like other typical OS kernels for AMD64, runs in 64-bit mode and there is no efficient way to let 32-bit programs directly call 64-bit functions.

Please notice that, as in the case of KML for IA-32, these limitations are present only in kernel-mode user processes. Ordinary user processes can alter their GS selector, and IA-32 binaries can be executed if an IA-32 emulation environment is set up.

How KML Executes User Processes in Kernel Mode

The way to execute user processes in kernel mode in AMD64 is almost the same as it is in IA-32. To execute user processes in kernel mode, the only thing KML does is launch user processes with the CS segment register, which points to the kernel code segment instead of user code segment.

In AMD64 CPUs, the privilege level of running programs is determined by the privilege level of their code segment. This is almost the same as in IA-32 CPUs; the only difference is the segmentation memory system is degenerated in AMD64. Although segment registers still are used in 64-bit mode of AMD64, the only segment that the segment registers can use is the 16 EB flat segment. Thus, the role of the segment descriptors is simply to specify privilege levels. Therefore, only four segments—kernel code segment, kernel data segment, user code segment—exist in 64-bit mode.

The Stack Starvation Problem and Its Solution

Although it is fairly easy to execute user processes in kernel mode, as shown in the previous section, there is a big problem—the stack starvation problem. The problem itself is almost the same as that of KML for IA-32, so I describe it briefly here. Further details are available in my previous article.

The original Linux kernel for AMD64 handles interrupts and exceptions by using the legacy interrupt gates mechanism. For each interrupt/exception, the kernel specifies an interrupt handler by using the interrupt gates in advance, typically at boot time. If an interrupt occurs, the AMD64 CPU suspends the running program, saves the execution context of the program and executes the interrupt handler specified in the corresponding interrupt gate.

The important point is the AMD64 CPU may or may not switch stacks before saving the execution context, depending on the privilege level of the suspended program. If the program is running in user mode, the CPU automatically switches from the stack of the running program to the kernel stack, whereas the CPU does not switch stacks if the program is running in kernel mode. The CPU then saves the execution context—RIP, CS, RFLAGS, RSP and SS register—to the stack.

Now, let us assume that a kernel-mode user process accesses its memory stack, which is not mapped by the page tables of the CPU. First, the CPU raises a page fault exception, suspends the process and tries to save the execution context. This cannot be done, however, because the CPU does not switch stacks, and the stack where the CPU is ready to save the context is nonexistent. To signal this serious situation, the CPU tries to raise a special exception, a double fault exception. Again, the CPU tries to access the nonexistent stack to save the context. Finally, the CPU gives up and resets itself. This process is known as the stack starvation problem.

To solve the stack starvation problem, KML for IA-32 uses the task management mechanism of IA-32 CPUs. The mechanism can be used to switch CPU contexts including all registers and all segment registers, when interrupts or exceptions are raised. KML for IA-32 switches stacks using the mechanism when double faults are raised. However, in 64-bit mode on AMD64, the task management mechanism cannot be used because it simply does not exist.

Instead, KML for AMD64 uses the Interrupt Stack Table (IST) mechanism, which is a newly introduced mechanism of the AMD64 architecture. In AMD64, the task state segment (TSS) has fields for seven pointers to interrupt stacks. In addition, each interrupt gate descriptor has a field for specifying whether the CPU should use the IST mechanism instead of the legacy stack switching, and if so, which interrupt stack should be used. If an interrupt occurs that is specified to use the IST mechanism, the CPU unconditionally switches from a user stack to the interrupt stack specified in the interrupt gate descriptor.

In KML for AMD64, all interruptions and exceptions are handled with the IST mechanism. Therefore, even if an interrupt or exception occurs while a kernel-mode user process is running with its %rsp pointing to an invalid memory, the kernel can keep running without any problem, because the CPU switches stacks automatically.

There are two reasons why KML for AMD64 handles not only double faults but also other interrupts and exceptions with the IST mechanism. One reason is that the overhead incurred by the IST mechanism is negligibly small. Therefore, I think it is better to keep it simple. Handling only double faults with the IST mechanism requires complex modifications to the original kernel, as in KML for IA-32. Second, the red zone of the stack is required by System V Application Binary Interface for AMD64 architecture. The red zone is a 128-byte memory range located just below the stack, that is, from %rsp - 8 to %rsp - 128. System V ABI for AMD64 specifies that user programs can use the red zone for temporary data storage and signal handlers, and interrupt handlers should never touch the zone. If KML handles an interrupt with the usual interrupt handling mechanism, this red zone is corrupted, because a stack is not switched. In this case, some CPU contexts are overwritten to the red zone if a kernel-mode user process is running. Therefore, KML for AMD64 handles all interrupts/exceptions with the IST mechanism in order to provide System V ABI to user programs correctly.

There also is a limitation in KML for IA-32: kernel-mode user processes cannot change their CS segment registers. This is not possible because KML for IA-32 requires at least one scratch register to switch from a user stack to a kernel stack manually when exceptions or interrupts are raised. It prepares the register by using the memory where the CS register is saved. This limitation is not applicable to KML for AMD64, because stacks are switched by the IST mechanism. It is not so important, however, to change the CS segment register in 64-bit mode of AMD64 because there can be only two code segments.

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix