Virtualization in Xen 3.0
Editor's Note: This article has been updated since its original posting.
Virtualization has existed for over 40 years. Back in the 1960s, IBM developed virtualization support on a mainframe. Since then, many virtualization projects have become available for UNIX/Linux and other operating systems, including VMware, FreeBSD Jail, coLinux, Microsoft's Virtual PC and Solaris's Containers and Zones.
The problem with these virtualization solutions is low performance. The Xen Project, however, offers impressive performance results--close to native--and this is one of its key advantages. Another impressive feature is live migration, which I discussed in a previous article. After much anticipation, Version 3.0 of Xen recently was released, and it is the focus of this article.
The main goal of Xen is achieving better utilization of computer resources and server consolidation by way of paravirtualization and virtual devices. Here, we discuss how Xen 3.0 implements these ideas. We also investigate the new VT-x/VT-i processors from Intel, which have built-in support for virtualization, and their integration into Xen.
The idea behind Xen is to run guest operating systems not in ring 0, but in a higher and less privileged ring. Running guest OSes in a ring higher than 0 is called "ring deprivileging". The default Xen installation on x86 runs guest OSes in ring 1, termed Current Privilege Level 1 (or CPL 1) of the processor. It runs a virtual machine monitor (VMM), the "hypervisor", in CPL 0. The applications run in ring 3 without any modification.
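To make the ring numbers concrete: on x86, the privilege level of running code is held in the two low-order bits of the CS segment selector. The helper below is a hypothetical illustration, not code taken from Xen, and its name is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Xen: extract the privilege level
 * (ring number) from an x86 segment selector. Applied to the value
 * currently loaded in CS, the result is the CPL. */
static unsigned int selector_ring(uint16_t selector)
{
    return selector & 0x3u;   /* bits 0-1 hold the privilege level */
}
```

A selector whose two low bits are 01 therefore belongs to ring 1, the ring in which a default Xen installation on x86 runs guest kernels.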
About 250 instructions are contained in the IA-32 instruction set, of which 17 are problematic in terms of running them in ring 1. These instructions can be problematic in two senses. First, running the instruction in ring 1 can cause a general protection exception (GPE), which also may be called a general protection fault (GPF). For example, running HLT immediately causes a GPF. Some instructions, such as CLI and STI, can cause a GPF if a certain condition is met. That is, a GPF occurs if the CPL of the current program or procedure is greater than the IOPL, meaning the code has less privilege than the IOPL requires.
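The CLI/STI condition can be written down as a small check. This is a minimal sketch assuming plain protected mode (it ignores virtual-8086 mode and the PVI extension); the function and macro names are made up for illustration. The IOPL field occupies bits 12-13 of EFLAGS.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_IOPL_SHIFT 12                       /* IOPL is EFLAGS bits 12-13 */
#define EFLAGS_IOPL_MASK  (0x3u << EFLAGS_IOPL_SHIFT)

/* Extract the I/O privilege level from an EFLAGS value. */
static unsigned int eflags_iopl(uint32_t eflags)
{
    return (eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT;
}

/* Hypothetical helper: true if CLI or STI executed at privilege level
 * `cpl` would raise a general protection fault (plain protected mode
 * only, as assumed above). */
static bool cli_sti_faults(unsigned int cpl, uint32_t eflags)
{
    return cpl > eflags_iopl(eflags);
}
```

With an IOPL of 0, a guest kernel in ring 1 faults on CLI; with an IOPL of 1, the same instruction is allowed.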
The second problem occurs with instructions that do not cause a GPF but still fail. Many Xen articles use the term "fail silently" to describe these cases. For example, POPF fails silently when the restored EFLAGS has a different interrupt flag (IF) value than the current EFLAGS: in ring 1, the change to IF is simply ignored and no exception is raised.
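The silent failure of POPF can be modeled in a few lines. This is a simplified sketch, not processor-accurate: it considers only the IF and IOPL fields, ignores reserved bits and virtual-8086 mode, and uses a made-up function name.

```c
#include <stdint.h>

#define EFLAGS_IF        (1u << 9)    /* interrupt flag */
#define EFLAGS_IOPL_MASK (3u << 12)   /* I/O privilege level, bits 12-13 */

/* Simplified model of POPF in protected mode: returns the EFLAGS value
 * that results when `popped` is loaded at privilege level `cpl`.
 * Protected fields are silently left unchanged; no fault is modeled
 * because none is raised. */
static uint32_t popf_model(uint32_t current, uint32_t popped, unsigned int cpl)
{
    uint32_t writable = ~0u;
    unsigned int iopl = (current & EFLAGS_IOPL_MASK) >> 12;

    if (cpl > 0)
        writable &= ~EFLAGS_IOPL_MASK;   /* only ring 0 may change IOPL */
    if (cpl > iopl)
        writable &= ~EFLAGS_IF;          /* IF change is dropped silently */

    return (current & ~writable) | (popped & writable);
}
```

In this model, a ring 1 caller that tries to clear IF while IOPL is 0 gets back an EFLAGS value with IF still set, which is exactly the situation a deprivileged guest kernel cannot detect on its own.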
How does Xen handle these problematic instructions? In some cases, such as the HLT instruction, the instruction in ring 1--where the guest OSes run--is replaced by a hypercall. For example, consider sparse/arch/xen/i386/kernel/process.c in the cpu_idle() method. Instead of calling the HLT instruction, as is done eventually in the Linux kernel, we call the xen_idle() method. It performs a hypercall instead, namely, the HYPERVISOR_sched_op(SCHEDOP_block, 0) hypercall.
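The substitution can be sketched in user space. This is not the actual Xen code: the real HYPERVISOR_sched_op() traps into the hypervisor, whereas the stub below only records its arguments, and the numeric value given to SCHEDOP_block here is an assumption for illustration.

```c
/* Assumed value for illustration; the real constant comes from Xen's
 * public headers. */
#define SCHEDOP_block 1

static int last_sched_cmd = -1;         /* last command seen by the stub */
static unsigned long last_sched_arg;    /* last argument seen by the stub */

/* Stand-in stub: records the request instead of entering the hypervisor. */
static int HYPERVISOR_sched_op_stub(int cmd, unsigned long arg)
{
    last_sched_cmd = cmd;
    last_sched_arg = arg;
    return 0;
}

/* Sketch of the idea behind xen_idle(): rather than executing HLT, which
 * would fault in ring 1, ask the hypervisor to block this virtual CPU. */
static void xen_idle_sketch(void)
{
    HYPERVISOR_sched_op_stub(SCHEDOP_block, 0);
}
```

The point of the design is that the guest kernel never attempts the privileged instruction at all, so no fault has to be taken and decoded.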
A hypercall is Xen's analog to a Linux system call. A system call is an interrupt (0x80) called in order to move from user space (CPL 3) to kernel space (CPL 0). A hypercall also is an interrupt (0x82). It passes control from ring 1, where the guest domains run, to ring 0, where Xen runs. The implementation of a system call and a hypercall is quite similar. Both pass the number of the syscall/hypercall in the eax register. Passing other parameters is done in the same way. In addition, both the system call table and the hypercall table are defined in a file named entry.S, in the Linux and Xen source trees, respectively.
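The table-driven dispatch shared by system calls and hypercalls can be sketched in plain C. The real tables live in assembly and are reached through a software interrupt; everything below (names, the two sample operations, the error value) is hypothetical and only shows the indexing idea.

```c
#include <stddef.h>

#define NR_DEMO_CALLS 2
#define DEMO_ENOSYS   (-38)   /* illustrative "no such call" error value */

typedef long (*demo_call_t)(long arg1, long arg2);

static long demo_add(long a, long b) { return a + b; }
static long demo_max(long a, long b) { return a > b ? a : b; }

/* Dispatch table indexed by call number, in the spirit of a system call
 * table or hypercall table. */
static const demo_call_t demo_call_table[NR_DEMO_CALLS] = {
    demo_add,   /* call number 0 */
    demo_max,   /* call number 1 */
};

/* `nr` plays the role of the value passed in eax; the remaining
 * parameters stand in for the other argument registers. */
static long demo_dispatch(unsigned long nr, long arg1, long arg2)
{
    if (nr >= NR_DEMO_CALLS || demo_call_table[nr] == NULL)
        return DEMO_ENOSYS;
    return demo_call_table[nr](arg1, arg2);
}
```

Bounds-checking the call number before indexing matters here, because the number is supplied by less privileged code.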
You can batch some hypercalls into one multicall by building an array of multicall_entry_t structs, one per hypercall, and then issuing a single hypercall, HYPERVISOR_multicall. This way, the number of entries to and exits from the hypervisor is reduced. Of course, reducing such interprivilege transitions when possible results in better performance. The netback virtual driver, for example, uses this multicall mechanism.
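The batching idea can be modeled as follows. This is a user-space sketch, not the Xen interface: the struct is a simplified stand-in for multicall_entry_t (its field layout is an assumption), the two operations are invented, and a counter takes the place of real ring transitions.

```c
#include <stddef.h>

#define DEMO_MAX_ARGS 6

/* Simplified stand-in for multicall_entry_t; layout is an assumption. */
typedef struct demo_multicall_entry {
    unsigned long op;                    /* which operation to perform */
    long result;                         /* filled in by the dispatcher */
    unsigned long args[DEMO_MAX_ARGS];   /* operation arguments */
} demo_multicall_entry_t;

static unsigned int demo_transitions;    /* counts simulated ring crossings */

/* Two invented operations standing in for real hypercalls. */
static long demo_do_op(unsigned long op, const unsigned long *args)
{
    switch (op) {
    case 0:  return (long)(args[0] + args[1]);
    case 1:  return (long)(args[0] * args[1]);
    default: return -1;
    }
}

/* One simulated entry into the "hypervisor" services the whole batch. */
static int demo_multicall(demo_multicall_entry_t *calls, size_t nr_calls)
{
    size_t i;

    demo_transitions++;
    for (i = 0; i < nr_calls; i++)
        calls[i].result = demo_do_op(calls[i].op, calls[i].args);
    return 0;
}
```

However many entries the array holds, the transition counter goes up by one per batch, which is the saving the multicall mechanism is after.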
Here's another example: the CLTS instruction clears the task switch (TS) flag in CR0. Like HLT, this instruction causes a GPF when issued in ring 1. But the CLTS instruction itself is not replaced by a hypercall. Instead, it is delegated to ring 0 in the following way. When it is issued in ring 1, we get a GPF. This GPF is handled by do_general_protection(), located in xen/arch/x86/traps.c. Note, though, that do_general_protection() is the hypervisor handler, which runs in ring 0. Under certain circumstances, this handler decodes the opcode of the faulting instruction. In the case of CLTS, a two-byte instruction (0x0F 0x06) whose second opcode byte is 0x06, it calls do_fpu_taskswitch(0). Eventually, do_fpu_taskswitch(0) executes the CLTS instruction, but this time from ring 0. Note: be sure _VCPUF_fpu_dirtied is set to enable this.
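The trap-and-emulate path for CLTS can be sketched as a small decoder. This is a user-space model with hypothetical names, not the code in traps.c: a boolean stands in for the TS flag of one virtual CPU, and the stand-in for do_fpu_taskswitch() only flips that boolean.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool demo_ts_flag = true;   /* models CR0.TS for one virtual CPU */

/* Stand-in for do_fpu_taskswitch(): a non-zero argument sets TS,
 * zero clears it. */
static void demo_fpu_taskswitch(int set)
{
    demo_ts_flag = (set != 0);
}

/* Sketch of a fault handler emulating a privileged instruction on the
 * guest's behalf. Returns true if the instruction was emulated, false
 * if the fault should be handled some other way. */
static bool demo_emulate_privileged_op(const uint8_t *insn, size_t len)
{
    if (len < 2 || insn[0] != 0x0F)   /* only two-byte opcodes handled */
        return false;

    switch (insn[1]) {
    case 0x06:                        /* CLTS is encoded as 0x0F 0x06 */
        demo_fpu_taskswitch(0);
        return true;
    default:
        return false;
    }
}
```

Compared with replacing the instruction by a hypercall, this approach leaves the guest code untouched but pays for a fault and an opcode decode each time the instruction runs.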
Those who are curious about further details can look at the emulate_privileged_op() method in that same file, xen/arch/x86/traps.c. The instructions that may "fail silently" usually are replaced by others.