Quantcast
Username/Email:  Password: 

Virtualization in Xen 3.0

Dive into the new Xen release and find out what it offers for paravirtualization, split drivers and Intel's new virtualization technology.


Editor's Note: This article has been updated since its original posting.

Virtualization has existed for over 40 years. Back in the 1960s, IBM
developed virtualization support on a mainframe. Since then, many virtualization
projects have become available for UNIX/Linux and other operating systems,
including VMware, FreeBSD Jail, coLinux, Microsoft's Virtual
PC and Solaris's Containers and Zones.

The problem with these virtualization solutions is low performance.
The Xen
Project
, however, offers impressive performance results--close
to native--and this is one of its key advantages. Another
impressive feature is live migration, which I discussed in
a previous
article
. After much anticipation, Version 3.0 of Xen recently
was released, and it is the focus of this article.

The main goal of Xen is achieving better utilization of computer
resources and server consolidation by way of paravairtualization and virtual
devices. Here, we discuss how Xen 3.0 implements these ideas. We also
investigate the new VT-x/VT-i processors from Intel, which have built-in support for
virtualization, and their integration into Xen.
Paravirtualization
The idea behind Xen is to run guest operating systems not in ring 0,
but in a higher and less privileged ring. Running guest OSes in a ring
higher than 0 is called "ring deprivileging". The default Xen
installation on x86 runs guest OSes in ring 1, termed
Current Privilege Level 1 (or CPL 1) of the processor. It runs a virtual
machine monitor (VMM), the "hypervisor", in CPL 0. The applications
run in ring 4 without any modification.

About 250 instructions are contained in the IA-32 instruction set, of which
17 are problematic in terms of running them in ring 1. These
instructions can be problematic in two senses. First, running the
instruction in ring 1 can cause a general protection
exception (GPE), which also may be called a general protection fault
(GPF). For example, running HLT immediately
causes a GPF. Some instructions, such as CLI and
STI, may can cause a GPF if a certain condition
is met. That is, a GPF occurs if the CPL is greater than the IOPL of the
current program or procedure and, as a result, has less privilege.

The second problem occurs with instructions that do not cause a GPF but
still fail. Many Xen articles use the term "fail silently" to describe
thess cases. For example, the POPF at the restored EFLAGS has a
different interrupt flag (IF) value than the current EFLAGS.

How does Xen handles these problematic instructions? In some cases,
such as the HLT instruction, the instruction in ring 1--where the guest
OSes run--is replaced by a hypercall. For example, consider
sparse/arch/xen/i386/kernel/process.c in the cpu_idle() method. Instead
of calling the HLT instruction, as is done eventually in the Linux kernel,
we call the xen_idle() method. It performs a hypercall
instead, namely, the HYPERVISOR_sched_op(SCHEDOP_block, 0) hypercall.

A hypercall is Xen's analog to a Linux system call. A system call
is an interrupt (0x80) called in order to move from user space
(CPL3) to kernel space (CPL0). A hypercall also is an interrupt (0x82).
It passes control from ring 1, where the guest domains run, to
ring 0, where Xen runs. The implementation of a system call and a
hypercall is quite similar. Both pass the number of the syscall/hypercall
in the eax register. Passing other parameters is done in the same way.
In addition, both the system call table and the hypercall table are defined
in the same file, entry.S.

You can batch some hypercalls into one multicall by building an array of
hypercalls. You can do this by using a multicall_entry_t struct. You
then can use one hypercall, HYPERVISOR_multicall. This way, the number
of entries to and exits from the hypervisor is reduced.
Of course, reducing such interprivilege transitions when possible
results in better performance. The netback virtual drivers, for
example, uses this multicall mechanism.

Here's another example: the CLTS instruction clears the task switch (TS) flag in CR0.
This instruction causes a GPF, however, when issued in ring 1, as is the case
with HLT. But the CLTS instruction itself is not replaced by some
hypercall. Instead, it is delegated to ring 0 in the following way.
When it is issued in ring 1, we get a GPF. But this GPF is handled
by do_general_protection(), located in xen/arch/x86/traps.c. Note,
though, that do_general_protection() is the hypervisor handler,
which runs in ring 0. From there, do_general_protection() calls do_fpu_taskswitch().
Under certain circumstances, this handler scans the opcode of the instructions
received in the CPU. In the case of CLTS, where the opcode is
0x06, it calls do_fpu_taskswitch(0). Eventually, do_fpu_taskswitch(0)
calls the CLTS instruction, but this time it is called from ring 0.
Note: be sure _VCPUF_fpu_dirtied is set to enable this.

Those who are curious about further details can look at the
emulate_privileged_op() method in that same file, xen/arch/x86/traps.c.
The instructions that may "fail silently" usually are replaced by others.
Virtual Split Drivers
The idea behind split devices is safe hardware isolation. Domain 0
is the only one that has direct access to the hardware devices, and it
uses the original Linux drivers. But domain 0 has another layer, the backend,
that contains netback and blockback virtual drivers. (On a side note, support for
usbback will be added in the future, and work on the USB layer is being
done by Harry Butterworth.)

Similarly, the unprivileged domains have access to a frontend layer, which
consists of netfront and blockfront virtual drivers. The unprivileged
domains issue I/O requests to the frontend in the same way that I/O requests
are sent to an ordinary Linux kernel. However, because the frontend is only
a virtual interface with no access to real hardware, these requests are
delegated to the backend. From there they are sent to the real devices.

When an unprivileged domain is created, it creates an interdomain
event channel between itself and domain 0. This is done with
the HYPERVISOR_event_channel_op hypercall, where the command is
EVTCHNOP_bind_interdomain. In the case of the network virtual drivers,
the event channel is created by netif_map() in sparse/drivers/xen/netback/interface.c.
The event channel is a lightweight channel for passing notifications,
such as saying when an I/O operation has completed.

A shared memory area exists between each guest domain and domain 0.
This shared memory is used to pass requests and data. The shared memory
is created and handled using the grant tables API.

When an interrupt is asserted by the controller, the APIC, we arrive at
the do_IRQ() method, which also can be found in the Linux kernel (arch/x86/irq.c).
The hypervisor handles only timer and serial interrupts. Other
interrupts are passed to the domains by calling __do_IRQ_guest().
In fact, the IRQ_GUEST flag is set for all interrupts except for
timer and serial interrupts.

__do_IRQ_guest() sends the interrupt by calling send_guest_pirq() for all
guests registered on this IRQ. The send_guest_pirq() creates an
event channel--an instance of evtchn--and sets the pending flag of this
event channel by calling evtchn_set_pending(). Then, asynchronously, Xen
notifies this domain of the interrupt, and it is handled appropriately.
Xen and the New Intel VT-x Processors
Intel currently is developing the
VT-x
and VT-i
technologies for x86 and Itanium processors,
respectively, which will provide virtualization extensions.
Support for the VT-x/VT-i extensions is part of the Xen 3.0 official code; it
can be found in xen/arch/x86/vmx*.c., xen/include/asm-x86/vmx*.h and xen/arch/x86/x86_32/entry.S.

The most important structure in Xen's implementation of VT-x/VT-i is the VMCS
(vmcs_struct in the code), which represents the VMCS region. The VMCS
region contains six logical regions; most relevant to our discussion
are the Guest-state area and Host-state area. The other four regions are
VM-execution control fields, VM-exit control fields, VM-entry control
fields and VM-exit information fields.

Intel added 10 new opcodes in VT-x/VT-i to support Intel Virtualization
Technology. Let's take a look at the new opcodes and their wrappers in
the code:

  1. VMCALL: (VMCALL_OPCODE in vmx.h) This simply calls the VM monitor,
    causing the VM to exit.
  2. VMCLEAR: (VMCLEAR_OPCODE in vmx.h) copies VMCS data to memory in case
    it is written there. wrapper: _vmpclear (u64 addr) in
    vmx.h.
  3. VMLAUNCH: (VMLAUNCH_OPCODE in vmx.h) launches a virtual
    machine, and changes the launch state of the VMCS to be launched, if it is
    clear.
  4. VMPTRLD: (VMPTRLD_OPCODE in vmx.h) loads a pointer to the VMCS.
    wrapper: _vmptrld (u64 addr) in vmx.h
  5. VMPTRST: (VMPTRST_OPCODE in vmx.h) stores a pointer to the
    VMCS. wrapper: _vmptrst (u64 addr) in vmx.h.
  6. VMREAD: (VMREAD_OPCODE in vmx.h) read specified field from VMCS.
    wrapper: _vmread(x, ptr) in vmx.h
  7. VMRESUME: (VMRESUME_OPCODE in vmx.h) resumes a virtual
    machine. In order it to resume the VM, the launch state of the VMCS should be
    "clear".
  8. VMWRITE: (VMWRITE_OPCODE in vmx.h) write specified field in
    VMCS. wrapper _vmwrite (field, value).
  9. VMXOFF: (VMXOFF_OPCODE in vmx.h) terminates VMX operation. wrapper:
    _vmxoff (void) in vmx.h.
  10. VMXON: (VMXON_OPCODE in vmx.h) starts VMX operation. wrapper: _vmxon
    (u64 addr) in vmx.h.

When using this technology, Xen runs in VMX root operation mode. The
guest domains, which are unmodified OSes, run in VMX non-root operation
mode. Because the guest domains run in non-root operation mode, they
are more restricted, meaning that certain actions cause a VM exit to
occur.

Xen enters the VMX operation in start_vmx() method, xen/arch/x86/vmx.c. This
method is called from init_intel() method in xen/arch/x86/cpu/intel.c.;
CONFIG_VMX should be defined.

First, we check the X86_FEATURE_VMXE bit in the ecx register to see if the
cpuid shows support for VMX in the processor. For IA-32,
Intel added a part to the CR4 control register that specifies whether we want
to enable VMX. Therefore, we must set this bit to enable VMX on the processor by
calling set_in_cr4(X86_CR4_VMXE). It is bit 13 in CR4 (VMXE).

We then call _vmxon to start the VMX operation. If we try to start the
VMX operation with _vmxon when the VMXE bit in CR4 is not set, we get an
#UD exception, telling us we have an undefined opcode.

Some instructions can cause VM to exit unconditionally, and some can
cause VM to exit certain VM-execution control fields. (See the discussion
about the VMX region above.) The following instructions cause VM to
exit unconditionally: CPUID, INVD, MOV from CR3, RDMSR, WRMSR and all
the new VT-x instructions listed above. Other instructions, such as HLT,
INVPLG (invalidate TLB entry instruction), MWAIT and others, cause a VM
exit if a corresponding VM-execution control was set.

Apart from VM-execution control fields, two bitmaps are used
for determining whether to perform a VM exit. The first is
the exception bitmap (see EXCEPTION_BITMAP in vmcs_field enum in
xen/include/asm-x86/vmx_vmcs.h), which is a 32-bit field. When a bit
is set in this bitmap, it causes a VM exit if a corresponding exception
occurs. By default, the entries set are EXCEPTION_BITMAP_PG, for page
fault, and EXCEPTION_BITMAP_GP, for general protection (see
MONITOR_DEFAULT_EXCEPTION_BITMAP in vmx.h).

The second bitmap is the I/O bitmap. In truth, there are two 4KB
I/O bitmaps, A and B, which control I/O instructions on various ports. I/O bitmap
A contains the ports in the range of 0000-7FFF, and I/O bitmap B contains
the ports in the range of 8000-FFFF. (See IO_BITMAP_A and IO_BITMAP_B in
vmcs_field enum.)

When a VM exit occurs, we are sent to the vmx_vmexit_handler() in vmx.c.
We handle the VM exit according to the exit reason provided, which we
can see in the VMCS region. There are 43 basic exit reasons; you can
find some of them in vmx.h. The fields start with EXIT_REASON_, such
as EXIT_REASON_EXCEPTION_NMI (which is exit reason 0) and so on.

When working with VT-x/VT-i, guest operating systems cannot work in real mode.
This is the reason why we load the guests with a special loader, the
vmxloader. The vmxloader loads ROMBIOS at 0xF0000, VGABIOS at
0xC0000 and then VMXAssist at D000:0000. VMXAssist is an emulator for real
mode that uses the virtual-8086 mode of IA32. After setting virtual-8086 mode,
the vmxloader executes in a 16-bit environment.

Certain instructions are not recognized in virtual-8086 mode, however,
such as LIRT (load interrupt register table) and LGDT (load
global descriptor table). When trying to run these instructions in
protected mode, they produce #GP(0) errors. VMXAssist checks the opcode of
the instructions being executed and handles them so that they do not
cause GPFs.
HVM, a Common Interface for VT-x/VT-i and AMD SVM
VT-x/VT-i and AMD's SVM architectures have much in common, which was the
motivation for developing their common interface layer, the Hardware
Virtual Machine (HVM). The code for the HVM layer was written by Leendert
van Doorn from the Watson Research Center at IBM, and it resides
in
a
separate branch
in the Xen repository.

An example of a common interface for VT-x/VT-i and AMD SVM is the
domain builder, xc_hvm_build(), located in xc_hvm_build.c. Because the loader
now is common to both architectures, the vmxloader now is called the hvmloader.
The hvmloader identifies the processor simply by calling its CPUID;
see tools/firmware/hvmloader/hvmloader.c.

The AMD SVM has a paged real mode, which virtualizes a real mode inside
of a protected mode. So in the case of AMD SVM, we should set operations
to real mode only, SVM_VMMCALL_RESET_TO_REALMODE. In the case of VT-x/VT-i, we
should use VMXAssist, as explained above.

HVM defines a table called hvm_function_table, which is a structure
containing functions that are common to both VT-x/VT-i and AMD
SVM. These methods, including initialize_guest_resources() and
store_cpu_guest_regs(), are implemented differently in VT-x/VT-i and
AMD SVM.

Xen 3.0 also includes support for the AMD SVM processor. One of SVM's
benefits is a tagged TLB: guests are mapped to address spaces different
from what the VMM sets. The TLB is tagged with address space identifiers
(ASIDs), so a TLB flush does not occur when there is a context switch.
Live Migration
One of the fascinating features of Xen is live migration, which can be
used as a solution for load balancing and maintenance. The downtime when
using live migration is quite low--tens of milliseconds. Live migration
implementation in Xen is managed by domain 0.

There are two stages to live migration. The first stage is "pre-copying",
in which the physical memory is copied to the target by way of TCP while the
migrating domain continues to run. After some iterations, during
which only the pages that were dirtied from the last iteration are
copied, the migrating domain stops running. Then, in the second stage,
the remaining pages are copied, and the domain resumes its work on the
target machine.

In addition, Jacob Gorm Hansen, from the University of Copenhagen,
Denmark, is doing some interesting work on "self migration". In self
migration, the unprivileged domain being migrated handles the migration itself. Although there are
some benefits
to having this ability, such as security, self migration is more complex
than live migration. For instance, the memory pages containing the code
that manages the migration are dirtied during the transfer.
Conclusion
In the future, it appears as though all of Intel's new 64-bit processors
will have virtualization extension support, and Xen seems to adopt
mainly CPUs with virtualization support. Currently, Xen has support for VT-x and
VT-i in the official tree, and a branch in the repository has AMD SVM support.

Overall, Xen is an interesting virtualization project with many features and
benefits. And, there's a chance that Xen will be integrated into the official
Linux kernel tree sometime in the future, as happened with UML and LVS.
Resources
/xen/irc/logs

Rami Rosen is a computer science graduate of Technion, the Israel
Institute of Technology, located in Haifa. He works as a Linux
kernel programmer for a networking start-up, and he can be reached at
ramirose@gmail.com. In his spare time, he likes running,
solving cryptic puzzles and helping everyone he knows to move to this
wonderful operating system, Linux.

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Novell virtualization information page

Nick Page's picture

Understand me now!

Novell offers various networking and virtualization solutions including 'SUSE linux enterprise' which has the added benefit of being able to support numerous operating systems such as Linux, Netware and Windows in unison (by sharing the same physical servers) due to Novell's collabiration with Microsoft. Users are therefore provided with the best virtualization platform for Windows server consolidation. Novells virtualization software also includes an integrated suite of tools for virtualization management and automation.

Here is a link to the Novell virtualization information page (http://www.novell.com/linux/virtualization/) using the link text virtualization or novell virtualization. I strongly believe that you readers will benefit from the networking and virtualization information and support offered by our website.

I think this Nick Page guy

David McGloin's picture

I think this Nick Page guy is right, I was just thinking the same myself. I checked out Novell's site and its filled with quality info. I love open source!

As per the comment on

Anonymous's picture

As per the comment on FreeBSD Jail, Solaris Zones have a very low overhead usually <1%.

typo

Anonymous's picture

There is a typo in the first paragraph under "paravirtualization":

"The applications run in ring 4 without any modification."

I believe that should be "ring 3."

FreeBSD Jails have _no_

Anonymous's picture

FreeBSD Jails have no performance impact! It's simply another technique with other uses.

Have you ever tried OpenVZ

Anonymous's picture

Have you ever tried OpenVZ project?
It is much easier to use and allows to run more Virtual Servers than Xen.

Easier, maybe, but if performance matters

JohanBV's picture

Perhaps it's easier for home usage or simple installs for your own infrastructure. If you simply need a hosted and installed OS on a good connection, you should look for a VPS. My finding was that OpenVZ servers I've rented were much slower that those from Xen providers. I recommend BudgetDedicated.com's Xen offerings

--
Johan

I was hoping to see more on alternative operating systems

Ken Yee's picture

Since the VT and Pacifica support was supposed to be the enabler for being able to load WinXP, etc. and run it inside Xen.

The Hypervisor really needs to be integrated into the Linux kernel code...it's too much of a pain to keep patching kernels as they're released...

I agree Xen can be hard to

mangoo's picture

I agree Xen can be hard to set up manually.

On the other hand, kernel and other needed binaries are often shipped with most major distros.

Thanks for useful article

Dobrica Pavlinusic's picture

I was wondering about Xen support on AMD, and this article was very useful. Keep up the good work.

Post new comment

  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <pre> <ul> <ol> <li> <dl> <dt> <dd> <i> <b>
  • Lines and paragraphs break automatically.
  • Use to create page breaks.

More information about formatting options