Virtualization in Xen 3.0

Dive into the new Xen release and find out what it offers for paravirtualization, split drivers and Intel's new virtualization technology.
Virtual Split Drivers

The idea behind split devices is safe hardware isolation. Domain 0 is the only one that has direct access to the hardware devices, and it uses the original Linux drivers. But domain 0 has another layer, the backend, that contains netback and blockback virtual drivers. (On a side note, support for usbback will be added in the future, and work on the USB layer is being done by Harry Butterworth.)

Similarly, the unprivileged domains have access to a frontend layer, which consists of netfront and blockfront virtual drivers. The unprivileged domains issue I/O requests to the frontend in the same way that I/O requests are sent to an ordinary Linux kernel. However, because the frontend is only a virtual interface with no access to real hardware, these requests are delegated to the backend. From there they are sent to the real devices.

When an unprivileged domain is created, it creates an interdomain event channel between itself and domain 0. This is done with the HYPERVISOR_event_channel_op hypercall, where the command is EVTCHNOP_bind_interdomain. In the case of the network virtual drivers, the event channel is created by netif_map() in sparse/drivers/xen/netback/interface.c. The event channel is a lightweight channel for passing notifications, such as saying when an I/O operation has completed.

A shared memory area exists between each guest domain and domain 0. This shared memory is used to pass requests and data. The shared memory is created and handled using the grant tables API.
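To make the request path concrete, here is a minimal sketch of a producer/consumer ring such as a frontend and backend might share in one page. It is a simplified model, not Xen's actual ring macros: the names demo_ring, demo_request, ring_put_request() and ring_get_request() are hypothetical, the ring size is arbitrary, and the memory barriers and grant references that a real split driver needs are left out.

```c
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 8u  /* hypothetical size; must be a power of two */

/* Hypothetical request layout for illustration only. */
struct demo_request {
    uint32_t id;      /* echoed back so the frontend can match replies */
    uint32_t sector;  /* what the frontend wants read */
};

/* Lives in the memory shared by the frontend and the backend. */
struct demo_ring {
    uint32_t req_prod;  /* advanced only by the frontend */
    uint32_t req_cons;  /* advanced only by the backend */
    struct demo_request ring[RING_SIZE];
};

/* Frontend side: queue a request. Returns false if the ring is full. */
static bool ring_put_request(struct demo_ring *r, struct demo_request req)
{
    if (r->req_prod - r->req_cons == RING_SIZE)
        return false;
    r->ring[r->req_prod & (RING_SIZE - 1)] = req;
    r->req_prod++;
    return true;
}

/* Backend side: take the oldest request. Returns false if the ring is empty. */
static bool ring_get_request(struct demo_ring *r, struct demo_request *out)
{
    if (r->req_cons == r->req_prod)
        return false;
    *out = r->ring[r->req_cons & (RING_SIZE - 1)];
    r->req_cons++;
    return true;
}
```

The indices are free-running counters, so the difference between them gives the number of queued requests even after they wrap around.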

When an interrupt is asserted by the controller, the APIC, we arrive at the do_IRQ() method, which also can be found in the Linux kernel (arch/x86/irq.c). The hypervisor handles only timer and serial interrupts. Other interrupts are passed to the domains by calling __do_IRQ_guest(). In fact, the IRQ_GUEST flag is set for all interrupts except for timer and serial interrupts.

__do_IRQ_guest() sends the interrupt by calling send_guest_pirq() for all guests registered on this IRQ. The send_guest_pirq() creates an event channel--an instance of evtchn--and sets the pending flag of this event channel by calling evtchn_set_pending(). Then, asynchronously, Xen notifies this domain of the interrupt, and it is handled appropriately.
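The pending-flag mechanism can be modeled in a few lines. The sketch below is an illustration under simplifying assumptions, not the hypervisor code: demo_shared_info, demo_set_pending() and demo_next_event() are hypothetical names, only 32 ports are modeled, and a single upcall flag stands in for the per-VCPU state.

```c
#include <stdint.h>
#include <stdbool.h>

#define DEMO_NR_CHANNELS 32u  /* hypothetical; only 32 ports are modeled */

/* Simplified stand-in for the state shared by hypervisor and guest. */
struct demo_shared_info {
    uint32_t evtchn_pending;  /* one bit per port: an event is waiting */
    uint32_t evtchn_mask;     /* one bit per port: the guest masked it */
    bool     upcall_pending;  /* the guest should run its event callback */
};

/* Hypervisor side: mark a port pending and request an upcall if unmasked. */
static void demo_set_pending(struct demo_shared_info *s, unsigned port)
{
    if (port >= DEMO_NR_CHANNELS)
        return;
    uint32_t bit = UINT32_C(1) << port;
    bool was_pending = (s->evtchn_pending & bit) != 0;
    s->evtchn_pending |= bit;
    if (!was_pending && !(s->evtchn_mask & bit))
        s->upcall_pending = true;
}

/* Guest side: fetch and clear the lowest pending, unmasked port, or -1. */
static int demo_next_event(struct demo_shared_info *s)
{
    uint32_t ready = s->evtchn_pending & ~s->evtchn_mask;
    for (unsigned port = 0; port < DEMO_NR_CHANNELS; port++) {
        if (ready & (UINT32_C(1) << port)) {
            s->evtchn_pending &= ~(UINT32_C(1) << port);
            return (int)port;
        }
    }
    return -1;
}
```

Setting the bit and delivering the upcall are separate steps, which is why the notification can reach the domain asynchronously.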

Xen and the New Intel VT-x Processors

Intel currently is developing the VT-x and VT-i technologies for x86 and Itanium processors, respectively, which will provide virtualization extensions. Support for the VT-x/VT-i extensions is part of the Xen 3.0 official code; it can be found in xen/arch/x86/vmx*.c, xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

The most important structure in Xen's implementation of VT-x/VT-i is the VMCS (vmcs_struct in the code), which represents the VMCS region. The VMCS region contains six logical regions; most relevant to our discussion are the Guest-state area and Host-state area. The other four regions are VM-execution control fields, VM-exit control fields, VM-entry control fields and VM-exit information fields.

Intel added 10 new opcodes in VT-x/VT-i to support Intel Virtualization Technology. Let's take a look at the new opcodes and their wrappers in the code:

  1. VMCALL: (VMCALL_OPCODE in vmx.h) This simply calls the VM monitor, causing the VM to exit.

  2. VMCLEAR: (VMCLEAR_OPCODE in vmx.h) ensures that any VMCS data cached in the processor is written to the VMCS region in memory, and sets the launch state of the VMCS to "clear". wrapper: _vmpclear (u64 addr) in vmx.h.

  3. VMLAUNCH: (VMLAUNCH_OPCODE in vmx.h) launches a virtual machine and changes the launch state of the VMCS to "launched". The launch state must be "clear" beforehand.

  4. VMPTRLD: (VMPTRLD_OPCODE in vmx.h) loads a pointer to the VMCS. wrapper: _vmptrld (u64 addr) in vmx.h.

  5. VMPTRST: (VMPTRST_OPCODE in vmx.h) stores a pointer to the VMCS. wrapper: _vmptrst (u64 addr) in vmx.h.

  6. VMREAD: (VMREAD_OPCODE in vmx.h) reads the specified field from the VMCS. wrapper: _vmread(x, ptr) in vmx.h.

  7. VMRESUME: (VMRESUME_OPCODE in vmx.h) resumes a virtual machine. In order to resume the VM, the launch state of the VMCS must be "launched".

  8. VMWRITE: (VMWRITE_OPCODE in vmx.h) writes the specified field in the VMCS. wrapper: _vmwrite (field, value).

  9. VMXOFF: (VMXOFF_OPCODE in vmx.h) terminates VMX operation. wrapper: _vmxoff (void) in vmx.h.

  10. VMXON: (VMXON_OPCODE in vmx.h) starts VMX operation. wrapper: _vmxon (u64 addr) in vmx.h.
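The interplay among VMCLEAR, VMLAUNCH and VMRESUME is easiest to see as a small state machine over the launch state of a VMCS. The following is only a model of those rules: demo_vmcs and the demo_* functions are hypothetical and do not execute any VMX instruction.

```c
/* Launch state of a VMCS, as tracked by the processor. */
enum demo_launch_state { DEMO_VMCS_CLEAR, DEMO_VMCS_LAUNCHED };

/* Hypothetical stand-in for a VMCS; only the launch state is modeled. */
struct demo_vmcs {
    enum demo_launch_state state;
};

/* VMCLEAR: the launch state becomes "clear". */
static void demo_vmclear(struct demo_vmcs *v)
{
    v->state = DEMO_VMCS_CLEAR;
}

/* VMLAUNCH: succeeds only from "clear", then the state is "launched".
 * Returns 0 on success and -1 on failure. */
static int demo_vmlaunch(struct demo_vmcs *v)
{
    if (v->state != DEMO_VMCS_CLEAR)
        return -1;
    v->state = DEMO_VMCS_LAUNCHED;
    return 0;
}

/* VMRESUME: succeeds only when the state is already "launched".
 * Returns 0 on success and -1 on failure. */
static int demo_vmresume(const struct demo_vmcs *v)
{
    return v->state == DEMO_VMCS_LAUNCHED ? 0 : -1;
}
```

In this model, the first entry into a guest uses the launch path, and every later entry after a VM exit uses the resume path.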

When using this technology, Xen runs in VMX root operation mode. The guest domains, which are unmodified OSes, run in VMX non-root operation mode. Because the guest domains run in non-root operation mode, they are more restricted, meaning that certain actions cause a VM exit to occur.

Xen enters VMX operation in the start_vmx() method in xen/arch/x86/vmx.c. This method is called from the init_intel() method in xen/arch/x86/cpu/intel.c; CONFIG_VMX should be defined.

First, we check the X86_FEATURE_VMXE bit in the ecx register to see if the cpuid shows support for VMX in the processor. For IA-32, Intel added a part to the CR4 control register that specifies whether we want to enable VMX. Therefore, we must set this bit to enable VMX on the processor by calling set_in_cr4(X86_CR4_VMXE). It is bit 13 in CR4 (VMXE).

We then call _vmxon to start the VMX operation. If we try to start the VMX operation with _vmxon when the VMXE bit in CR4 is not set, we get an #UD exception, telling us we have an undefined opcode.
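The two checks described above come down to testing and setting single bits. The sketch below works on plain integer values so that it can be run anywhere; the DEMO_* constants and demo_* helpers are hypothetical names, and real code would obtain the values with the CPUID instruction and a privileged read of CR4.

```c
#include <stdint.h>
#include <stdbool.h>

#define DEMO_CPUID_ECX_VMX (UINT32_C(1) << 5)   /* CPUID leaf 1, ECX bit 5 */
#define DEMO_CR4_VMXE      (UINT32_C(1) << 13)  /* CR4 bit 13 */

/* Does the ECX value returned by CPUID leaf 1 advertise VMX? */
static bool demo_cpu_has_vmx(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx & DEMO_CPUID_ECX_VMX) != 0;
}

/* Value that would be written back to CR4 to enable VMX. */
static uint32_t demo_cr4_enable_vmx(uint32_t cr4)
{
    return cr4 | DEMO_CR4_VMXE;
}

/* VMXON raises #UD unless CR4.VMXE is already set. */
static bool demo_vmxon_allowed(uint32_t cr4)
{
    return (cr4 & DEMO_CR4_VMXE) != 0;
}
```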

Some instructions cause a VM exit unconditionally, and some cause a VM exit only when certain VM-execution control fields are set. (See the discussion about the VMCS region above.) The following instructions cause a VM exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR and all the new VT-x instructions listed above. Other instructions, such as HLT, INVLPG (invalidate TLB entry instruction), MWAIT and others, cause a VM exit if the corresponding VM-execution control is set.

Apart from VM-execution control fields, two bitmaps are used for determining whether to perform a VM exit. The first is the exception bitmap (see EXCEPTION_BITMAP in vmcs_field enum in xen/include/asm-x86/vmx_vmcs.h), which is a 32-bit field. When a bit is set in this bitmap, it causes a VM exit if a corresponding exception occurs. By default, the entries set are EXCEPTION_BITMAP_PG, for page fault, and EXCEPTION_BITMAP_GP, for general protection (see MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h).
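As a rough illustration of how the exception bitmap is consulted, the sketch below builds a bitmap with the page-fault and general-protection bits set and tests an exception vector against it. The demo_* names are hypothetical, and the extra error-code matching that the processor applies to page faults is left out.

```c
#include <stdint.h>
#include <stdbool.h>

#define DEMO_VECTOR_GP 13u  /* general protection fault */
#define DEMO_VECTOR_PF 14u  /* page fault */

/* A bitmap with the two bits described above set. */
static uint32_t demo_default_exception_bitmap(void)
{
    return (UINT32_C(1) << DEMO_VECTOR_PF) | (UINT32_C(1) << DEMO_VECTOR_GP);
}

/* Bit n of the 32-bit bitmap selects exception vector n. */
static bool demo_exception_causes_exit(uint32_t bitmap, unsigned vector)
{
    return vector < 32u && ((bitmap >> vector) & 1u) != 0;
}
```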

The second bitmap is the I/O bitmap. In truth, there are two 4KB I/O bitmaps, A and B, which control I/O instructions on various ports. I/O bitmap A contains the ports in the range of 0000-7FFF, and I/O bitmap B contains the ports in the range of 8000-FFFF. (See IO_BITMAP_A and IO_BITMAP_B in vmcs_field enum.)
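The split between the two bitmaps is simple arithmetic: each 4KB bitmap holds 32,768 bits, one per port. Here is a minimal sketch of the lookup, with hypothetical names (demo_io_bitmaps, demo_io_intercept(), demo_io_causes_exit()) and assuming a single-byte port access.

```c
#include <stdint.h>
#include <stdbool.h>

#define DEMO_IO_BITMAP_BYTES 4096u  /* 4KB per bitmap, one bit per port */

/* Hypothetical container for the two bitmaps. */
struct demo_io_bitmaps {
    uint8_t a[DEMO_IO_BITMAP_BYTES];  /* ports 0x0000-0x7FFF */
    uint8_t b[DEMO_IO_BITMAP_BYTES];  /* ports 0x8000-0xFFFF */
};

/* Mark a port so that guest accesses to it cause a VM exit. */
static void demo_io_intercept(struct demo_io_bitmaps *m, uint16_t port)
{
    uint8_t *map = port < 0x8000u ? m->a : m->b;
    unsigned idx = port & 0x7FFFu;  /* bit index inside the chosen bitmap */
    map[idx / 8u] |= (uint8_t)(1u << (idx % 8u));
}

/* Would a one-byte access to this port cause a VM exit? */
static bool demo_io_causes_exit(const struct demo_io_bitmaps *m, uint16_t port)
{
    const uint8_t *map = port < 0x8000u ? m->a : m->b;
    unsigned idx = port & 0x7FFFu;
    return ((map[idx / 8u] >> (idx % 8u)) & 1u) != 0;
}
```

For example, intercepting port 0x3F8 sets a bit in bitmap A, while port 0x8000 maps to the very first bit of bitmap B.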

When a VM exit occurs, we are sent to the vmx_vmexit_handler() in vmx.c. We handle the VM exit according to the exit reason provided, which we can see in the VMCS region. There are 43 basic exit reasons; you can find some of them in vmx.h. The fields start with EXIT_REASON_, such as EXIT_REASON_EXCEPTION_NMI (which is exit reason 0) and so on.
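A handler of this kind is essentially a switch on the basic exit reason, which occupies the low 16 bits of the exit-reason field. The sketch below shows the shape of such a dispatch. The DEMO_* names and the returned actions are hypothetical and are not what vmx_vmexit_handler() actually does; the numeric values follow Intel's list of basic exit reasons.

```c
#include <stdint.h>

/* A few basic exit reasons, numbered as in Intel's documentation. */
enum {
    DEMO_EXIT_EXCEPTION_NMI  = 0,
    DEMO_EXIT_CPUID          = 10,
    DEMO_EXIT_HLT            = 12,
    DEMO_EXIT_IO_INSTRUCTION = 30
};

/* Hypothetical actions a monitor might take. */
enum demo_action {
    DEMO_ACT_HANDLE_EXCEPTION,
    DEMO_ACT_EMULATE_CPUID,
    DEMO_ACT_BLOCK_VCPU,
    DEMO_ACT_EMULATE_IO,
    DEMO_ACT_CRASH_DOMAIN
};

/* Choose an action from the exit-reason field read out of the VMCS. */
static enum demo_action demo_handle_exit(uint32_t exit_reason_field)
{
    uint32_t basic = exit_reason_field & 0xFFFFu;  /* basic exit reason */

    switch (basic) {
    case DEMO_EXIT_EXCEPTION_NMI:
        return DEMO_ACT_HANDLE_EXCEPTION;
    case DEMO_EXIT_CPUID:
        return DEMO_ACT_EMULATE_CPUID;
    case DEMO_EXIT_HLT:
        return DEMO_ACT_BLOCK_VCPU;
    case DEMO_EXIT_IO_INSTRUCTION:
        return DEMO_ACT_EMULATE_IO;
    default:
        return DEMO_ACT_CRASH_DOMAIN;  /* reason this sketch does not model */
    }
}
```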

When working with VT-x/VT-i, guest operating systems cannot work in real mode. This is the reason why we load the guests with a special loader, the vmxloader. The vmxloader loads ROMBIOS at 0xF0000, VGABIOS at 0xC0000 and then VMXAssist at D000:0000. VMXAssist is an emulator for real mode that uses the virtual-8086 mode of IA32. After setting virtual-8086 mode, the vmxloader executes in a 16-bit environment.
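The load addresses above mix linear addresses with a real-mode segment:offset pair. The conversion is the usual real-mode rule of shifting the segment left by four bits and adding the offset; the helper name below is hypothetical.

```c
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset. */
static uint32_t demo_real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With this rule, D000:0000 is linear address 0xD0000, which places VMXAssist between VGABIOS at 0xC0000 and ROMBIOS at 0xF0000.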

Certain instructions are not permitted in virtual-8086 mode, however, such as LIDT (load interrupt descriptor table register) and LGDT (load global descriptor table register). When a guest tries to run these instructions in virtual-8086 mode, they produce #GP(0) errors. VMXAssist checks the opcode of the instructions being executed and handles them so that they do not cause GPFs.
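To give a feel for the opcode check, here is a minimal decoder sketch that recognizes the two descriptor-table loads. It is not VMXAssist's decoder: the demo_* names are hypothetical, and instruction prefixes are ignored. It relies only on the IA-32 encoding, in which both instructions share the opcode bytes 0F 01 and differ in the reg field of the ModRM byte (2 for LGDT, 3 for LIDT).

```c
#include <stddef.h>
#include <stdint.h>

enum demo_insn { DEMO_INSN_OTHER, DEMO_INSN_LGDT, DEMO_INSN_LIDT };

/* Classify the instruction at the start of code[0..len). */
static enum demo_insn demo_decode(const uint8_t *code, size_t len)
{
    if (len < 3 || code[0] != 0x0F || code[1] != 0x01)
        return DEMO_INSN_OTHER;

    unsigned mod = code[2] >> 6;         /* addressing mode */
    unsigned reg = (code[2] >> 3) & 7u;  /* opcode extension */

    if (mod == 3u)                       /* register forms are other insns */
        return DEMO_INSN_OTHER;
    if (reg == 2u)
        return DEMO_INSN_LGDT;
    if (reg == 3u)
        return DEMO_INSN_LIDT;
    return DEMO_INSN_OTHER;
}
```

An emulator would then read the memory operand and update its own copy of the guest's descriptor-table register instead of letting the instruction fault.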

______________________

Comments

Novell virtualization information page

Nick Page's picture

Understand me now!

Novell offers various networking and virtualization solutions including 'SUSE Linux Enterprise', which has the added benefit of being able to support numerous operating systems such as Linux, NetWare and Windows in unison (by sharing the same physical servers) due to Novell's collaboration with Microsoft. Users are therefore provided with the best virtualization platform for Windows server consolidation. Novell's virtualization software also includes an integrated suite of tools for virtualization management and automation.

Here is a link to the Novell virtualization information page (http://www.novell.com/linux/virtualization/) using the link text virtualization or novell virtualization. I strongly believe that you readers will benefit from the networking and virtualization information and support offered by our website.

I think this Nick Page guy

David McGloin's picture

I think this Nick Page guy is right; I was just thinking the same myself. I checked out Novell's site and it's filled with quality info. I love open source!

As per the comment on

Anonymous's picture

As per the comment on FreeBSD Jail, Solaris Zones have a very low overhead usually <1%.

typo

Anonymous's picture

There is a typo in the first paragraph under "paravirtualization":

"The applications run in ring 4 without any modification."

I believe that should be "ring 3."

FreeBSD Jails have _no_

Anonymous's picture

FreeBSD Jails have no performance impact! It's simply another technique with other uses.

Have you ever tried OpenVZ

Anonymous's picture

Have you ever tried OpenVZ project?
It is much easier to use and allows you to run more Virtual Servers than Xen.

Easier, maybe, but if performance matters

JohanBV's picture

Perhaps it's easier for home usage or simple installs for your own infrastructure. If you simply need a hosted and installed OS on a good connection, you should look for a VPS. My finding was that OpenVZ servers I've rented were much slower than those from Xen providers. I recommend BudgetDedicated.com's Xen offerings.

--
Johan

I was hoping to see more on alternative operating systems

Ken Yee's picture

Since the VT and Pacifica support was supposed to be the enabler for being able to load WinXP, etc. and run it inside Xen.

The Hypervisor really needs to be integrated into the Linux kernel code...it's too much of a pain to keep patching kernels as they're released...

I agree Xen can be hard to

mangoo's picture

I agree Xen can be hard to set up manually.

On the other hand, kernel and other needed binaries are often shipped with most major distros.

Thanks for useful article

Dobrica Pavlinusic's picture

I was wondering about Xen support on AMD, and this article was very useful. Keep up the good work.
