Kernel Korner - Kernel Mode Linux for AMD64
In KML for IA-32, system call invocations are translated automatically into fast, direct function calls without modifying user programs. This is possible because the recent GNU C Library for IA-32 has a mechanism to choose one of several methods that the kernel provides for system call invocation, and the KML provides direct function calls as one way of invoking system calls.
However, the GNU C Library for AMD64 doesn't have such a mechanism for choosing among methods of system call invocations. Therefore, I created a patch for the GNU C Library. With the patch, kernel-mode user processes can invoke system calls rapidly, because the invocations automatically are translated to function calls. The patch is available from the KML site (see Resources).
One of advantages of KML is the kernel-mode user processes are almost the same as usual user processes except for their privilege level. That is, kernel-mode user processes can do almost anything that ordinary user processes can do. For example, kernel-mode user processes can invoke all system calls. This means they can use filesystems. They also can call open, read, write and other functions, including network systems, with socket, connect and bind. They even can create processes and threads with fork, clone and execve. In addition, they have their own memory address space that they can access freely. Even if a kernel-mode user process uses tons of memory, the kernel pages out the memory.
Moreover, the scheduling mechanism and the signal mechanism of the original Linux kernel work for the kernel-mode user processes. You can check this by executing the following commands:
% cp /usr/bin/yes /trusted/bin % /trusted/bin/yes
You should notice that your system does not hang. This is true, because the kernel's scheduler preempts the kernel-mode yes and gives CPU time to other processes. You can stop the kernel-mode yes by sending Ctrl-C. This means the kernel can interrupt the kernel-mode yes and send a signal to kill it.
As described in the previous section, kernel-mode user processes are ordinary user processes and can perform almost every task that user processes can perform. However, there are a few exceptions:
Kernel-mode user processes cannot modify their GS segment register, because KML uses the GS segment register internally to eliminate the overhead of SWAPGS instruction.
32-bit binaries cannot be executed in kernel mode on AMD64. KML for AMD64, like other typical OS kernels for AMD64, runs in 64-bit mode and there is no efficient way to let 32-bit programs directly call 64-bit functions.
Please notice that, as in the case of KML for IA-32, these limitations are present only in kernel-mode user processes. Ordinary user processes can alter their GS selector, and IA-32 binaries can be executed if an IA-32 emulation environment is set up.
The way to execute user processes in kernel mode in AMD64 is almost the same as it is in IA-32. To execute user processes in kernel mode, the only thing KML does is launch user processes with the CS segment register, which points to the kernel code segment instead of user code segment.
In AMD64 CPUs, the privilege level of running programs is determined by the privilege level of their code segment. This is almost the same as in IA-32 CPUs; the only difference is the segmentation memory system is degenerated in AMD64. Although segment registers still are used in 64-bit mode of AMD64, the only segment that the segment registers can use is the 16 EB flat segment. Thus, the role of the segment descriptors is simply to specify privilege levels. Therefore, only four segments—kernel code segment, kernel data segment, user code segment—exist in 64-bit mode.
Although it is fairly easy to execute user processes in kernel mode, as shown in the previous section, there is a big problem—the stack starvation problem. The problem itself is almost the same as that of KML for IA-32, so I describe it briefly here. Further details are available in my previous article.
The original Linux kernel for AMD64 handles interrupts and exceptions by using the legacy interrupt gates mechanism. For each interrupt/exception, the kernel specifies an interrupt handler by using the interrupt gates in advance, typically at boot time. If an interrupt occurs, the AMD64 CPU suspends the running program, saves the execution context of the program and executes the interrupt handler specified in the corresponding interrupt gate.
The important point is the AMD64 CPU may or may not switch stacks before saving the execution context, depending on the privilege level of the suspended program. If the program is running in user mode, the CPU automatically switches from the stack of the running program to the kernel stack, whereas the CPU does not switch stacks if the program is running in kernel mode. The CPU then saves the execution context—RIP, CS, RFLAGS, RSP and SS register—to the stack.
Now, let us assume that a kernel-mode user process accesses its memory stack, which is not mapped by the page tables of the CPU. First, the CPU raises a page fault exception, suspends the process and tries to save the execution context. This cannot be done, however, because the CPU does not switch stacks, and the stack where the CPU is ready to save the context is nonexistent. To signal this serious situation, the CPU tries to raise a special exception, a double fault exception. Again, the CPU tries to access the nonexistent stack to save the context. Finally, the CPU gives up and resets itself. This process is known as the stack starvation problem.
To solve the stack starvation problem, KML for IA-32 uses the task management mechanism of IA-32 CPUs. The mechanism can be used to switch CPU contexts including all registers and all segment registers, when interrupts or exceptions are raised. KML for IA-32 switches stacks using the mechanism when double faults are raised. However, in 64-bit mode on AMD64, the task management mechanism cannot be used because it simply does not exist.
Instead, KML for AMD64 uses the Interrupt Stack Table (IST) mechanism, which is a newly introduced mechanism of the AMD64 architecture. In AMD64, the task state segment (TSS) has fields for seven pointers to interrupt stacks. In addition, each interrupt gate descriptor has a field for specifying whether the CPU should use the IST mechanism instead of the legacy stack switching, and if so, which interrupt stack should be used. If an interrupt occurs that is specified to use the IST mechanism, the CPU unconditionally switches from a user stack to the interrupt stack specified in the interrupt gate descriptor.
In KML for AMD64, all interruptions and exceptions are handled with the IST mechanism. Therefore, even if an interrupt or exception occurs while a kernel-mode user process is running with its %rsp pointing to an invalid memory, the kernel can keep running without any problem, because the CPU switches stacks automatically.
There are two reasons why KML for AMD64 handles not only double faults but also other interrupts and exceptions with the IST mechanism. One reason is that the overhead incurred by the IST mechanism is negligibly small. Therefore, I think it is better to keep it simple. Handling only double faults with the IST mechanism requires complex modifications to the original kernel, as in KML for IA-32. Second, the red zone of the stack is required by System V Application Binary Interface for AMD64 architecture. The red zone is a 128-byte memory range located just below the stack, that is, from %rsp - 8 to %rsp - 128. System V ABI for AMD64 specifies that user programs can use the red zone for temporary data storage and signal handlers, and interrupt handlers should never touch the zone. If KML handles an interrupt with the usual interrupt handling mechanism, this red zone is corrupted, because a stack is not switched. In this case, some CPU contexts are overwritten to the red zone if a kernel-mode user process is running. Therefore, KML for AMD64 handles all interrupts/exceptions with the IST mechanism in order to provide System V ABI to user programs correctly.
There also is a limitation in KML for IA-32: kernel-mode user processes cannot change their CS segment registers. This is not possible because KML for IA-32 requires at least one scratch register to switch from a user stack to a kernel stack manually when exceptions or interrupts are raised. It prepares the register by using the memory where the CS register is saved. This limitation is not applicable to KML for AMD64, because stacks are switched by the IST mechanism. It is not so important, however, to change the CS segment register in 64-bit mode of AMD64 because there can be only two code segments.
|Non-Linux FOSS: libnotify, OS X Style||Jun 18, 2013|
|Containers—Not Virtual Machines—Are the Future Cloud||Jun 17, 2013|
|Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer||Jun 12, 2013|
|Weechat, Irssi's Little Brother||Jun 11, 2013|
|One Tail Just Isn't Enough||Jun 07, 2013|
|Introduction to MapReduce with Hadoop on Linux||Jun 05, 2013|
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Linux Systems Administrator
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Validate an E-Mail Address with PHP, the Right Way
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- RSS Feeds
- Introduction to MapReduce with Hadoop on Linux
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?