Kernel Korner - Kernel Mode Linux for AMD64
In KML for IA-32, system call invocations are translated automatically into fast, direct function calls without modifying user programs. This is possible because recent versions of the GNU C Library for IA-32 have a mechanism to choose one of several methods that the kernel provides for system call invocation, and KML provides direct function calls as one of those methods.
However, the GNU C Library for AMD64 doesn't have such a mechanism for choosing among methods of system call invocations. Therefore, I created a patch for the GNU C Library. With the patch, kernel-mode user processes can invoke system calls rapidly, because the invocations automatically are translated to function calls. The patch is available from the KML site (see Resources).
One of the advantages of KML is that kernel-mode user processes are almost the same as ordinary user processes except for their privilege level. That is, kernel-mode user processes can do almost anything that ordinary user processes can do. For example, kernel-mode user processes can invoke all system calls. This means they can use filesystems with open, read and write, and they can use the network with socket, connect and bind. They even can create processes and threads with fork, clone and execve. In addition, they have their own memory address space that they can access freely. Even if a kernel-mode user process uses a large amount of memory, the kernel can page that memory out.
Moreover, the scheduling mechanism and the signal mechanism of the original Linux kernel work for the kernel-mode user processes. You can check this by executing the following commands:
% cp /usr/bin/yes /trusted/bin
% /trusted/bin/yes
You should notice that your system does not hang, because the kernel's scheduler preempts the kernel-mode yes and gives CPU time to other processes. You can stop the kernel-mode yes by pressing Ctrl-C. This means the kernel can interrupt the kernel-mode yes and deliver a signal that kills it.
As described in the previous section, kernel-mode user processes are ordinary user processes and can perform almost every task that user processes can perform. However, there are a few exceptions:
Kernel-mode user processes cannot modify their GS segment register, because KML uses the GS segment register internally to eliminate the overhead of the SWAPGS instruction.
32-bit binaries cannot be executed in kernel mode on AMD64. KML for AMD64, like other typical OS kernels for AMD64, runs in 64-bit mode and there is no efficient way to let 32-bit programs directly call 64-bit functions.
Note that, as in the case of KML for IA-32, these limitations apply only to kernel-mode user processes. Ordinary user processes can alter their GS selector, and IA-32 binaries can be executed if an IA-32 emulation environment is set up.
The way to execute user processes in kernel mode on AMD64 is almost the same as it is on IA-32. To execute a user process in kernel mode, the only thing KML does is launch the process with its CS segment register pointing to the kernel code segment instead of the user code segment.
In AMD64 CPUs, the privilege level of a running program is determined by the privilege level of its code segment. This is almost the same as in IA-32 CPUs; the only difference is that the segmented memory system is degenerate in AMD64. Although segment registers still are used in the 64-bit mode of AMD64, the only segment they can refer to is the 16 EB flat segment. Thus, the role of the segment descriptors is simply to specify privilege levels. Therefore, only four segments exist in 64-bit mode: the kernel code segment, the kernel data segment, the user code segment and the user data segment.
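The link between the CS selector and the privilege level can be sketched in a few lines of Python. The selector values used here (0x10 for the kernel code segment and 0x33 for the user code segment) are the ones mainline Linux uses on x86-64; they are an assumption for illustration, because the article does not list them and the kernel that KML patches may lay out its descriptor table differently.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # entry number in the descriptor table
    table: str  # "GDT" or "LDT"
    rpl: int    # low two bits; for CS this is the current privilege level


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its architectural fields."""
    return Selector(
        index=value >> 3,
        table="LDT" if value & 0x4 else "GDT",
        rpl=value & 0x3,
    )


# Assumed values (mainline Linux on x86-64), not taken from the article.
KERNEL_CS = 0x10
USER_CS = 0x33


def runs_in_kernel_mode(cs: int) -> bool:
    """A program runs in kernel mode when its CS selects a ring 0 code segment."""
    return decode_selector(cs).rpl == 0


for name, cs in (("user CS", USER_CS), ("kernel CS", KERNEL_CS)):
    print(name, hex(cs), decode_selector(cs), "kernel mode:", runs_in_kernel_mode(cs))
```

Under these assumptions, launching a process with the kernel selector instead of the user selector is the whole difference between an ordinary process and a kernel-mode one.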
Although it is fairly easy to execute user processes in kernel mode, as shown in the previous section, there is a big problem—the stack starvation problem. The problem itself is almost the same as that of KML for IA-32, so I describe it briefly here. Further details are available in my previous article.
The original Linux kernel for AMD64 handles interrupts and exceptions by using the legacy interrupt gates mechanism. For each interrupt/exception, the kernel specifies an interrupt handler by using the interrupt gates in advance, typically at boot time. If an interrupt occurs, the AMD64 CPU suspends the running program, saves the execution context of the program and executes the interrupt handler specified in the corresponding interrupt gate.
The important point is that the AMD64 CPU may or may not switch stacks before saving the execution context, depending on the privilege level of the suspended program. If the program is running in user mode, the CPU automatically switches from the stack of the running program to the kernel stack, whereas the CPU does not switch stacks if the program is running in kernel mode. The CPU then saves the execution context (the RIP, CS, RFLAGS, RSP and SS registers) to the stack.
Now, let us assume that a kernel-mode user process accesses stack memory that is not mapped by the page tables of the CPU. First, the CPU raises a page fault exception, suspends the process and tries to save the execution context. This cannot be done, however, because the CPU does not switch stacks, and the stack where the CPU is about to save the context does not exist. To signal this serious situation, the CPU tries to raise a special exception, a double fault exception. Again, the CPU tries to access the nonexistent stack to save the context. Finally, the CPU gives up and resets itself. This situation is known as the stack starvation problem.
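The sequence just described can be condensed into a toy model. This is only a sketch of the decision the CPU makes for a legacy interrupt gate, written in Python for readability; the function name and the returned strings are invented for illustration and do not correspond to anything in the kernel or in KML.

```python
def deliver_legacy(cpl: int, stack_mapped: bool) -> str:
    """Toy model of exception delivery through a legacy interrupt gate.

    cpl          -- privilege level of the interrupted program (0 = kernel, 3 = user)
    stack_mapped -- whether the interrupted program's stack pointer refers to mapped memory
    """
    if cpl == 3:
        # Privilege change: the CPU switches to the kernel stack before saving anything.
        return "context saved on kernel stack"
    if stack_mapped:
        # Same privilege level: no switch, the context goes onto the current stack.
        return "context saved on current stack"
    # No switch and no usable stack: saving the context faults, the double fault
    # that follows needs the same stack and fails too, so the CPU resets.
    return "CPU reset"


print(deliver_legacy(cpl=3, stack_mapped=False))  # ordinary user process
print(deliver_legacy(cpl=0, stack_mapped=True))   # kernel code with a valid stack
print(deliver_legacy(cpl=0, stack_mapped=False))  # kernel-mode user process, bad stack
```

The third case is the one that an ordinary kernel never has to consider, because kernel code is expected to keep its stack valid, whereas a kernel-mode user process controls its own stack pointer.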
To solve the stack starvation problem, KML for IA-32 uses the task management mechanism of IA-32 CPUs. The mechanism can be used to switch CPU contexts including all registers and all segment registers, when interrupts or exceptions are raised. KML for IA-32 switches stacks using the mechanism when double faults are raised. However, in 64-bit mode on AMD64, the task management mechanism cannot be used because it simply does not exist.
Instead, KML for AMD64 uses the Interrupt Stack Table (IST) mechanism, which is a newly introduced mechanism of the AMD64 architecture. In AMD64, the task state segment (TSS) has fields for seven pointers to interrupt stacks. In addition, each interrupt gate descriptor has a field for specifying whether the CPU should use the IST mechanism instead of the legacy stack switching, and if so, which interrupt stack should be used. If an interrupt occurs that is specified to use the IST mechanism, the CPU unconditionally switches from a user stack to the interrupt stack specified in the interrupt gate descriptor.
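To make the descriptor field concrete, here is a sketch in Python that packs and unpacks a 16-byte 64-bit interrupt gate following the layout in the AMD64 architecture manuals, where the IST index occupies the low three bits of byte 4 and an index of zero selects the legacy stack switching. The handler address, the selector value and the helper names are hypothetical; this is not code from the KML patch.

```python
import struct

# offset 15..0, selector, IST byte, type/attributes, offset 31..16, offset 63..32, reserved
GATE_FORMAT = "<HHBBHII"
INTERRUPT_GATE = 0xE  # descriptor type of a 64-bit interrupt gate
PRESENT = 0x80


def pack_gate(handler: int, selector: int, ist: int, dpl: int = 0) -> bytes:
    """Build a gate descriptor; ist=0 means legacy switching, 1..7 picks a TSS entry."""
    if not 0 <= ist <= 7:
        raise ValueError("IST index must be between 0 and 7")
    type_attr = PRESENT | (dpl << 5) | INTERRUPT_GATE
    return struct.pack(
        GATE_FORMAT,
        handler & 0xFFFF,
        selector,
        ist,
        type_attr,
        (handler >> 16) & 0xFFFF,
        (handler >> 32) & 0xFFFFFFFF,
        0,
    )


def unpack_gate(raw: bytes) -> dict:
    """Recover the handler address and the IST index from a packed descriptor."""
    low, selector, ist, type_attr, mid, high, _reserved = struct.unpack(GATE_FORMAT, raw)
    return {
        "handler": low | (mid << 16) | (high << 32),
        "selector": selector,
        "ist": ist & 0x7,
        "uses_ist": (ist & 0x7) != 0,
        "dpl": (type_attr >> 5) & 0x3,
    }


# Hypothetical handler address and kernel code selector, for illustration only.
gate = pack_gate(handler=0xFFFFFFFF81234567, selector=0x10, ist=1)
print(len(gate), unpack_gate(gate))
```

A gate whose index is non-zero makes the CPU load the stack pointer from the matching TSS entry regardless of the privilege level of the interrupted program.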
In KML for AMD64, all interrupts and exceptions are handled with the IST mechanism. Therefore, even if an interrupt or exception occurs while a kernel-mode user process is running with its %rsp pointing to invalid memory, the kernel can keep running without any problem, because the CPU switches stacks automatically.
There are two reasons why KML for AMD64 handles not only double faults but also all other interrupts and exceptions with the IST mechanism. The first reason is that the overhead incurred by the IST mechanism is negligibly small, so I think it is better to keep the implementation simple. Handling only double faults with the IST mechanism would require complex modifications to the original kernel, as in KML for IA-32. The second reason is the red zone of the stack, which is required by the System V Application Binary Interface for the AMD64 architecture. The red zone is the 128-byte memory range located just below the stack pointer, that is, from %rsp - 128 to %rsp - 1. The System V ABI for AMD64 specifies that user programs can use the red zone for temporary data storage, and that signal handlers and interrupt handlers must never touch it. If KML handled an interrupt with the usual interrupt handling mechanism while a kernel-mode user process was running, the stack would not be switched and the CPU would overwrite the red zone with the saved context. Therefore, KML for AMD64 handles all interrupts and exceptions with the IST mechanism in order to provide the System V ABI to user programs correctly.
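The red zone argument comes down to address arithmetic, sketched below in Python. The sketch assumes the basic five-register context (SS, RSP, RFLAGS, CS and RIP, 8 bytes each) and the 16-byte alignment that the CPU applies to the stack pointer in 64-bit mode before saving it; the addresses and helper names are made up for illustration.

```python
from typing import Optional

RED_ZONE_SIZE = 128
FRAME_SIZE = 5 * 8  # SS, RSP, RFLAGS, CS, RIP; an error code, when present, adds 8 more


def red_zone(rsp: int) -> range:
    """Addresses the ABI reserves for the interrupted program, just below %rsp."""
    return range(rsp - RED_ZONE_SIZE, rsp)


def frame_range(stack_top: int) -> range:
    """Addresses the CPU writes when it saves the context starting at stack_top."""
    aligned = stack_top & ~0xF  # the CPU aligns the stack pointer to 16 bytes first
    return range(aligned - FRAME_SIZE, aligned)


def overlaps(a: range, b: range) -> bool:
    return a.start < b.stop and b.start < a.stop


def frame_clobbers_red_zone(rsp: int, ist_stack_top: Optional[int] = None) -> bool:
    """True if saving the context would overwrite the interrupted program's red zone."""
    top = rsp if ist_stack_top is None else ist_stack_top
    return overlaps(frame_range(top), red_zone(rsp))


user_rsp = 0x7FFFFFFFE018    # hypothetical stack pointer of a kernel-mode user process
ist_top = 0xFFFF880000010000  # hypothetical top of an interrupt stack

print(frame_clobbers_red_zone(user_rsp))           # no stack switch
print(frame_clobbers_red_zone(user_rsp, ist_top))  # switch to the interrupt stack
```

Without a stack switch, the saved context lands inside the 128 bytes that the program is entitled to use; with a switch to a separate interrupt stack, it does not.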
There also is a limitation in KML for IA-32: kernel-mode user processes cannot change their CS segment registers. This is not possible because KML for IA-32 requires at least one scratch register to switch from a user stack to a kernel stack manually when exceptions or interrupts are raised. It prepares the register by using the memory where the CS register is saved. This limitation is not applicable to KML for AMD64, because stacks are switched by the IST mechanism. It is not so important, however, to change the CS segment register in 64-bit mode of AMD64 because there can be only two code segments.