Virtualization in Xen 3.0
The idea behind split devices is safe hardware isolation. Domain 0 is the only one that has direct access to the hardware devices, and it uses the original Linux drivers. But domain 0 has another layer, the backend, that contains netback and blockback virtual drivers. (On a side note, support for usbback will be added in the future, and work on the USB layer is being done by Harry Butterworth.)
Similarly, the unprivileged domains have access to a frontend layer, which consists of netfront and blockfront virtual drivers. The unprivileged domains issue I/O requests to the frontend in the same way that I/O requests are sent to an ordinary Linux kernel. However, because the frontend is only a virtual interface with no access to real hardware, these requests are delegated to the backend. From there they are sent to the real devices.
When an unprivileged domain is created, it creates an interdomain event channel between itself and domain 0. This is done with the HYPERVISOR_event_channel_op hypercall, where the command is EVTCHNOP_bind_interdomain. In the case of the network virtual drivers, the event channel is created by netif_map() in sparse/drivers/xen/netback/interface.c. The event channel is a lightweight channel for passing notifications, such as saying when an I/O operation has completed.
A shared memory area exists between each guest domain and domain 0. This shared memory is used to pass requests and data. The shared memory is created and handled using the grant tables API.
When an interrupt is asserted by the controller, the APIC, we arrive at the do_IRQ() method, which also can be found in the Linux kernel (arch/x86/irq.c). The hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling __do_IRQ_guest(). In fact, the IRQ_GUEST flag is set for all interrupts except for timer and serial interrupts.
__do_IRQ_guest() sends the interrupt by calling send_guest_pirq() for all guests registered on this IRQ. The send_guest_pirq() creates an event channel--an instance of evtchn--and sets the pending flag of this event channel by calling evtchn_set_pending(). Then, asynchronously, Xen notifies this domain of the interrupt, and it is handled appropriately.
Intel currently is developing the VT-x and VT-i technologies for x86 and Itanium processors, respectively, which will provide virtualization extensions. Support for the VT-x/VT-i extensions is part of the Xen 3.0 official code; it can be found in xen/arch/x86/vmx*.c., xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.
The most important structure in Xen's implementation of VT-x/VT-i is the VMCS (vmcs_struct in the code), which represents the VMCS region. The VMCS region contains six logical regions; most relevant to our discussion are the Guest-state area and Host-state area. The other four regions are VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.
Intel added 10 new opcodes in VT-x/VT-i to support Intel Virtualization Technology. Let's take a look at the new opcodes and their wrappers in the code:
VMCALL: (VMCALL_OPCODE in vmx.h) This simply calls the VM monitor, causing the VM to exit.
VMCLEAR: (VMCLEAR_OPCODE in vmx.h) copies VMCS data to memory in case it is written there. wrapper: _vmpclear (u64 addr) in vmx.h.
VMLAUNCH: (VMLAUNCH_OPCODE in vmx.h) launches a virtual machine, and changes the launch state of the VMCS to be launched, if it is clear.
VMPTRLD: (VMPTRLD_OPCODE in vmx.h) loads a pointer to the VMCS. wrapper: _vmptrld (u64 addr) in vmx.h
VMPTRST: (VMPTRST_OPCODE in vmx.h) stores a pointer to the VMCS. wrapper: _vmptrst (u64 addr) in vmx.h.
VMREAD: (VMREAD_OPCODE in vmx.h) read specified field from VMCS. wrapper: _vmread(x, ptr) in vmx.h
VMRESUME: (VMRESUME_OPCODE in vmx.h) resumes a virtual machine. In order it to resume the VM, the launch state of the VMCS should be "clear".
VMWRITE: (VMWRITE_OPCODE in vmx.h) write specified field in VMCS. wrapper _vmwrite (field, value).
VMXOFF: (VMXOFF_OPCODE in vmx.h) terminates VMX operation. wrapper: _vmxoff (void) in vmx.h.
VMXON: (VMXON_OPCODE in vmx.h) starts VMX operation. wrapper: _vmxon (u64 addr) in vmx.h.
When using this technology, Xen runs in VMX root operation mode. The guest domains, which are unmodified OSes, run in VMX non-root operation mode. Because the guest domains run in non-root operation mode, they are more restricted, meaning that certain actions cause a VM exit to occur.
Xen enters the VMX operation in start_vmx() method, xen/arch/x86/vmx.c. This method is called from init_intel() method in xen/arch/x86/cpu/intel.c.; CONFIG_VMX should be defined.
First, we check the X86_FEATURE_VMXE bit in the ecx register to see if the cpuid shows support for VMX in the processor. For IA-32, Intel added a part to the CR4 control register that specifies whether we want to enable VMX. Therefore, we must set this bit to enable VMX on the processor by calling set_in_cr4(X86_CR4_VMXE). It is bit 13 in CR4 (VMXE).
We then call _vmxon to start the VMX operation. If we try to start the VMX operation with _vmxon when the VMXE bit in CR4 is not set, we get an #UD exception, telling us we have an undefined opcode.
Some instructions can cause VM to exit unconditionally, and some can cause VM to exit certain VM-execution control fields. (See the discussion about the VMX region above.) The following instructions cause VM to exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR and all the new VT-x instructions listed above. Other instructions, such as HLT, INVPLG (invalidate TLB entry instruction), MWAIT and others, cause a VM exit if a corresponding VM-execution control was set.
Apart from VM-execution control fields, two bitmaps are used for determining whether to perform a VM exit. The first is the exception bitmap (see EXCEPTION_BITMAP in vmcs_field enum in xen/include/asm-x86/vmx_vmcs.h), which is a 32-bit field. When a bit is set in this bitmap, it causes a VM exit if a corresponding exception occurs. By default, the entries set are EXCEPTION_BITMAP_PG, for page fault, and EXCEPTION_BITMAP_GP, for general protection (see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h).
The second bitmap is the I/O bitmap. In truth, there are two 4KB I/O bitmaps, A and B, which control I/O instructions on various ports. I/O bitmap A contains the ports in the range of 0000-7FFF, and I/O bitmap B contains the ports in the range of 8000-FFFF. (See IO_BITMAP_A and IO_BITMAP_B in vmcs_field enum.)
When a VM exit occurs, we are sent to the vmx_vmexit_handler() in vmx.c. We handle the VM exit according to the exit reason provided, which we can see in the VMCS region. There are 43 basic exit reasons; you can find some of them in vmx.h. The fields start with EXIT_REASON_, such as EXIT_REASON_EXCEPTION_NMI (which is exit reason 0) and so on.
When working with VT-x/VT-i, guest operating systems cannot work in real mode. This is the reason why we load the guests with a special loader, the vmxloader. The vmxloader loads ROMBIOS at 0xF0000, VGABIOS at 0xC0000 and then VMXAssist at D000:0000. VMXAssist is an emulator for real mode that uses the virtual-8086 mode of IA32. After setting virtual-8086 mode, the vmxloader executes in a 16-bit environment.
Certain instructions are not recognized in virtual-8086 mode, however, such as LIRT (load interrupt register table) and LGDT (load global descriptor table). When trying to run these instructions in protected mode, they produce #GP(0) errors. VMXAssist checks the opcode of the instructions being executed and handles them so that they do not cause GPFs.
- Unikernels, Docker, and Why You Should Care
- Server Hardening
- diff -u: What's New in Kernel Development
- Controversy at the Linux Foundation
- 22 Years of Linux Journal on One DVD - Now Available
- Non-Linux FOSS: Snk
- Giving Silos Their Due
- Don't Burn Your Android Yet
- What's New in 3D Printing, Part III: the Software