Introduction to the Xen Virtual Machine
This article is intended mainly for developers who are new to Xen and who
want to know more about it. The first two sections, however, are general
and do not deal with code.
The Xen VMM (virtual machine monitor) is an open-source project that
is being developed in the computer laboratory of the University of
Cambridge, UK. It enables us to create many virtual machines, each of
which runs an instance of an operating system.
These guest operating systems can be a patched Linux kernel, version 2.4
or 2.6, or a patched NetBSD/FreeBSD kernel. User applications can run on
guest OSes as they are, without any change in code. Sun also is working
on a Solaris-on-Xen port.
I have been following the Xen project closely for more than a year.
My interest in Xen began after I read about it in the OLS (Ottawa Linux
Symposium) 2004 proceedings. It grew after I heard an interesting
lecture on the subject at a local UNIX group meeting.
Full virtualization has been done with some hardware emulators; one of the
popular open-source projects is the Bochs IA-32 Emulator.
Another known project is qemu.
The disadvantage of hardware emulators is their performance.
The idea behind the Xen Project (para-virtualization) is not new.
The performance metrics and the high efficiency it achieves,
however, can be seen as a breakthrough. The overhead of running Xen is
very small indeed, about 3%.
As was said in the beginning, Xen currently requires a patched guest
kernel. Future processors, however, will support virtualization so that
the kernel can run on them unpatched. For example, both Intel VT and
AMD Pacifica processors will include such support.
In August 2005, XenSource, a commercial company that develops
virtualization solutions based on Xen, announced at the Intel Developer
Forum (IDF) that it had used Intel VT-enabled platforms with Xen to
virtualize both Linux and Microsoft Windows XP SP2.
Xen with Intel VT or Xen with AMD Pacifica would be competitive with, if
not superior to, other virtualization methods, and competitive with
native operation as well.
In the same arena, VMware is a commercial company that develops the ESX server,
a virtualization solution not based on Xen. VMware announced in early August 2005
that it will be providing its partners with access to VMware ESX Server source
code and interfaces under a new program called
VMware Community Source.
A clear advantage of VMware is that it does not require a patch on the guest
OS. The VMware solution also enables the guest OS to be Windows.
The VMware solution is probably slower than Xen, though, because it uses
shadow page tables whereas Xen uses both direct and shadow page tables.
Xen already is bundled in some distributions, including Fedora Core 4,
Debian and SuSE Professional 9.3, and it will be included in RHEL5.
The Fedora Project has RPMs for installing Xen, and other Linux distros
have prepared installation packages for Xen as well.
In addition, there is a port of Xen to IA-64. Plus, an interesting
Master's Thesis already has been written on the topic,
"HPC Virtualization with Xen on Itanium".
Support for other processors is in progress. The Xen team is working on
an x86_64 port, while IBM is working on Power5 support.
The Xen Web site has some versions available for download, both the 2.0.*
version and the xen-unstable version, also termed xen-3.0-devel. You
also can use the Mercurial source code management system to download
the latest version.
I installed xen-3.0-devel because, at the time, the 2.0.*
version did not have the AGP support I needed. This may have changed
since my installation. I found the installation process to be quite
simple: run make world and make install, update the bootloader
configuration file, and you are ready to boot into Xen. Follow the
instructions in the user manual for best results.
The Return of the Ring
The protection model of the Intel x86 CPU is built from four rings: ring
0 is for the OS and ring 3 is for user applications. Rings 1 and 2 are
not used except in rare cases, such as OS/2; see the
IA-32 Intel Architecture Software Developer's Manual,
Volume 1: Basic Architecture, section 4.5 (privilege levels).
In Xen, a "hypervisor" runs in ring 0, while guest OSes run in ring 1 and
applications run in ring 3. The x86_64 port is a little different in this
respect: both the guest kernel and applications run in ring 3 (see
Xen 3.0 and the Art of Virtualization, section 4.1, in the OLS 2005
proceedings).
Xen itself is called a hypervisor because it operates at a higher privilege
level than the supervisor code of the guest operating systems that
it hosts.
At boot time, Xen is loaded into memory in ring 0. It starts a patched
kernel in ring 1; this is called domain 0. From this domain you can
create other domains, destroy them, perform migrations of domains,
set parameters to a domain and more. The domains you create also run their
kernels in ring 1. User applications run in ring 3. See Figure
1, illustrating the x86 protection rings in Xen.
Figure 1
Currently, domain 0 can be a patched 2.4 or 2.6 Linux kernel. According
to the Xen developer mailing list, however, it seems that in the
future, domain 0 will support only a 2.6 kernel patch. Much of the
work of building domain 0 is done in the construct_dom0() method, in
xen/arch/x86/domain_build.c.
The physical device drivers run only in the privileged domain, domain 0.
Xen relies on Linux or another patched OS kernel to provide
virtually all of its device support. The advantage is that this frees
the Xen development team from having to write its own device drivers.
Using Xen on a processor that has a tagged TLB improves performance.
A tagged TLB enables attaching an address space identifier (ASID) to the TLB
entries. With this feature, there is no need to flush the TLB when the
processor switches between the hypervisor and the guest OSes, and this reduces
the cost of memory operations.
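To make the benefit concrete, here is a minimal toy model in Python. It is
only a sketch: the ToyTLB class, the ASID values and the page numbers are
invented for illustration and do not correspond to Xen code or to any real
hardware interface. It compares an untagged TLB, which must be flushed on
every address-space switch, with a tagged one, which keeps its entries.

```python
HYPERVISOR_ASID = 0   # hypothetical tag values, chosen for the demo
GUEST_ASID = 1


class ToyTLB:
    """Toy translation cache; entries are keyed by (tag, virtual page)."""

    def __init__(self, tagged):
        self.tagged = tagged
        self.entries = {}
        self.current_asid = HYPERVISOR_ASID
        self.flushes = 0

    def _tag(self):
        # An untagged TLB cannot tell address spaces apart.
        return self.current_asid if self.tagged else 0

    def switch_to(self, asid):
        # Without tags, stale entries must be thrown away on every switch.
        if not self.tagged and asid != self.current_asid:
            self.entries.clear()
            self.flushes += 1
        self.current_asid = asid

    def insert(self, vpage, frame):
        self.entries[(self._tag(), vpage)] = frame

    def lookup(self, vpage):
        return self.entries.get((self._tag(), vpage))


def round_trip(tlb):
    """Guest caches a translation, traps to the hypervisor, then resumes."""
    tlb.switch_to(GUEST_ASID)
    tlb.insert(0x10, 0x99)
    tlb.switch_to(HYPERVISOR_ASID)
    tlb.switch_to(GUEST_ASID)
    return tlb.lookup(0x10)
```

In this model, round_trip() on a tagged TLB still finds the guest's
translation after the trip through the hypervisor, while the untagged
variant has lost it and has to walk the page tables again.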
Some manufacturers offer this tagged TLB feature. For example, a document titled
"AMD64 Virtualization Codenamed 'Pacifica' Technology Secure Virtual
Machine Architecture Reference Manual" was published in May
2005. According to it, this architecture uses a tagged TLB.
Next up is an overview of the Xend and XCS layers. These layers are the
management layers that enable users to manage and control both the domains
and Xen. Following it is a discussion of the communication mechanism
between domains and of virtual devices. The Xen Project source code is
quite complex, and I hope this may be a starting point for delving into it.
The Xend Daemon
First, what is the Xend daemon? It is the Xen controller daemon, meaning
it handles creating new domains, destroying extant domains, migration and
many other domain management tasks. A large
part of its activity is based on running an HTTP server. The default
port of the HTTP socket is 8000, which can be configured. Various requests
for controlling the domains are handled by sending HTTP requests for
domain creation, domain shutdown, domain save and restore, live migration
and more. A large part of the Xend code is written in Python, and it
also uses calls to C methods from within Python scripts.
We start the Xend daemon by running xend start from the command line
after booting into Xen. What exactly does this command involve? First,
Xend requires Python 2.3 to support its logging functions.
The work of the Xend daemon is based on interaction with an XCS server,
the control switch. So, when we start the Xend daemon, we check to see if
the XCS is up and running. If it is not, we try to start
XCS. This step is discussed more fully later in this article.
The SrvDaemon is, in fact, the Xend main program; starting
the Xend daemon creates an instance of the SrvDaemon
class (tools/python/xen/xend/server/SrvDaemon.py).
Two log files are created here, /var/log/xend.log and /var/log/xend-debug.log.
We next create a Channel Factory in the createFactories() method. The Channel
Factory has a notifier object embedded inside. Much of the work
of the Xend daemon is based on messages received by this
notifier. This factory creates a thread that reads the notifier in an
endless loop. The notifier delegates the read request to the XCS server;
see xu_notifier_read() in xen/lowlevel/xu.c. This method sends the
read request to the XCS server by calling xcs_data_read().
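The reader-thread pattern described above can be sketched in a few lines of
Python. This is not the Xend code: ToyNotifier, reader_loop() and the STOP
sentinel are hypothetical names, and a local queue stands in for the XCS
server. The real loop is endless; the sentinel exists only so the demo can
finish.

```python
import queue
import threading

STOP = object()   # demo-only sentinel; the real reader loop never exits


class ToyNotifier:
    """Stand-in for the notifier; a queue replaces the XCS data connection."""

    def __init__(self):
        self._pending = queue.Queue()

    def post(self, message):
        self._pending.put(message)

    def read(self):
        # Blocks until a message is available, like a read on a socket.
        return self._pending.get()


def reader_loop(notifier, handle):
    """Read the notifier forever and hand each message to a handler."""
    while True:
        message = notifier.read()
        if message is STOP:
            break
        handle(message)


def run_demo():
    notifier = ToyNotifier()
    received = []
    reader = threading.Thread(
        target=reader_loop, args=(notifier, received.append), daemon=True
    )
    reader.start()
    for message in ("domain-created", "device-connected"):
        notifier.post(message)
    notifier.post(STOP)
    reader.join()
    return received
```

Calling run_demo() returns the messages in the order they were posted,
because a single reader thread drains the queue sequentially.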
Creating a Domain
The creation of a domain is accomplished by using a hypercall
(DOM0_CREATEDOMAIN). What is a hypercall? In the Linux kernel, a system
call lets a user-space program call a method in the kernel; this is done
by an interrupt (Int 0x80). In Xen, the analogous call is a hypervisor
call, through which domain 0 calls a method in the hypervisor. This also
is accomplished by an interrupt (Int 0x82). The hypervisor accesses each
domain by its virtual CPU, struct vcpu in include/xen/sched.h.
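The system call analogy can be illustrated with a toy dispatch table. The
sketch below is an assumption-laden model, not Xen's implementation: the
hypercall number, the error value and the trap() function are all invented,
and a plain Python function call stands in for the software interrupt.

```python
HYPERCALL_TABLE = {}          # hypercall number -> handler in the "hypervisor"
domains = {0: "Domain-0"}     # toy hypervisor state: domain id -> name

CREATE_DOMAIN = 1             # hypothetical number, not Xen's real value
UNKNOWN_HYPERCALL = -1        # hypothetical error code


def hypercall(number):
    """Decorator that registers a handler under a hypercall number."""
    def register(handler):
        HYPERCALL_TABLE[number] = handler
        return handler
    return register


@hypercall(CREATE_DOMAIN)
def create_domain(name):
    # Allocate the next free domain id and record the new domain.
    domid = max(domains) + 1
    domains[domid] = name
    return domid


def trap(number, *args):
    """Stand-in for the interrupt: look up the number and run the handler."""
    handler = HYPERCALL_TABLE.get(number)
    if handler is None:
        return UNKNOWN_HYPERCALL
    return handler(*args)
```

The caller never names the handler directly; it passes only a number and
arguments across the privilege boundary, which is the property a system
call and a hypercall share.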
The XendDomain class and the XendDomainInfo class play a significant
part in creating and destroying domains. The domain_create() method in
the XendDomain class is called when we create a new domain; it starts
the process of creating a domain.
The XendDomainInfo class and its methods are responsible for the actual
construction of a domain. The construction process includes setting up the
devices in the new domain. This involves a lot of messaging between the
front end device drivers in the domain and the back end device drivers in
the back end domain. We talk about the back end and front end device
drivers later.
The XCS Server
The XCS server opens two TCP sockets, the control connection and the data
connection. The difference between the control connection and the data
connection is that the control connection is synchronous
while the data connection is asynchronous. The notifier object, which
was mentioned before, for example, is a client of the XCS server.
A connection to the XCS server is represented by an object of type
connection_t. After a connection is bound, it is added to a list of
connections, connection_list, which is iterated every five seconds to
see whether new control or data messages arrived. Messages, which
can be control or data messages, are handled by handle_control_message()
or by handle_data_message(), respectively.
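The polling and dispatch described above might look roughly like the
following sketch. The real XCS server is written in C; only the names
connection_list, handle_control_message() and handle_data_message() come
from it. The Connection class, its inbox field and poll_connections() are
assumptions made for the example, and the five-second timer is left out.

```python
from dataclasses import dataclass, field

CONTROL = "control"
DATA = "data"


@dataclass
class Connection:
    """Toy counterpart of connection_t: a name plus queued messages."""
    name: str
    inbox: list = field(default_factory=list)   # (kind, payload) tuples


def handle_control_message(conn, payload):
    return f"{conn.name}: control {payload}"


def handle_data_message(conn, payload):
    return f"{conn.name}: data {payload}"


def poll_connections(connection_list):
    """One pass over every bound connection, dispatching by message kind."""
    handled = []
    for conn in connection_list:
        while conn.inbox:
            kind, payload = conn.inbox.pop(0)
            if kind == CONTROL:
                handled.append(handle_control_message(conn, payload))
            else:
                handled.append(handle_data_message(conn, payload))
    return handled
```

A caller would run poll_connections() on a timer; each pass drains whatever
arrived on every connection since the previous pass.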
Creating Virtual Devices When Creating a Domain
The create() method in XendDomainInfo starts a chain of actions to create
a domain. The virtual devices of the domain first are created. The create()
method calls create_blkif() to create a block device interface (blkif);
this is a must even if the VM doesn't use a disk. The other virtual devices
are created by create_configured_devices(), which eventually calls the
createDevice() method of the DevController class (see controller.py). This
method calls the newDevice() method of the corresponding class. All the
device classes inherit from Dev, which is an abstract class representing
a device attached to a device controller. Its attach() abstract (empty)
method is implemented in each subclass of the Dev class; this method
attaches the device to its front end and back end. Figure 2 shows the
device hierarchy, and Figure 3 shows the device controller hierarchy.
Figure 2
Figure 3
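The relationship between the controller and device classes can be sketched
as follows. Dev, DevController, attach(), createDevice() and newDevice() are
names from the Xend source, but the bodies here are assumptions: in
particular, calling attach() from createDevice() is a guess about the flow,
and ToyBlkDev and ToyBlkifController are hypothetical subclasses.

```python
from abc import ABC, abstractmethod


class Dev(ABC):
    """Abstract device attached to a device controller."""

    def __init__(self, controller, config):
        self.controller = controller
        self.config = config
        self.attached = False

    @abstractmethod
    def attach(self):
        """Connect the device to its front end and back end."""


class DevController(ABC):
    """Abstract controller that creates and tracks devices of one kind."""

    def __init__(self):
        self.devices = []

    @abstractmethod
    def newDevice(self, config):
        """Return a Dev subclass instance for this controller."""

    def createDevice(self, config):
        dev = self.newDevice(config)   # the subclass picks the device class
        dev.attach()                   # assumed here; see the note above
        self.devices.append(dev)
        return dev


class ToyBlkDev(Dev):
    def attach(self):
        self.attached = True


class ToyBlkifController(DevController):
    def newDevice(self, config):
        return ToyBlkDev(self, config)
```

With this layout, ToyBlkifController().createDevice({...}) returns an
attached ToyBlkDev, and adding a new device type means adding one Dev
subclass and one controller subclass.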
Domain 0 runs the back end drivers, and the newly created domain runs the
front end drivers. A lot of messages pass between the back end
and front end drivers. The front end driver is a virtual driver in the
sense that it does not use specific hardware details; the code resides
in drivers/xen, in the sparse tree.
Event channels and shared-memory rings are the means of communication
among domains. For example, in the case of the netfront device (netfront.c),
which is the network card front end interface, np->tx and np->rx are the
shared memory pages, one for the transmit buffer and one for the receive
buffer. In send_interface_connect(), we tell the netback end
to bring up the interface. The connect message travels through the event
channel to the netif_connect() method of the back end, in interface.c. The
netif_connect() method calls get_vm_area(2*PAGE_SIZE, VM_IOREMAP).
The get_vm_area() method searches in the kernel virtual mapping area
for an area whose size equals two pages.
In the blkif case, which is the block device front end interface,
blkif_connect() also calls get_vm_area(). In this case, however, it uses only
one page of memory.
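A shared-memory ring is, at heart, a fixed-size buffer with a producer
index and a consumer index. The following sketch shows that idea only; it
is not the Xen ring layout, and ToyRing and its method names are invented.
In Xen, the two sides live in different domains and use an event channel to
tell each other that the indices have moved.

```python
class ToyRing:
    """Fixed-size ring with free-running producer and consumer indices."""

    def __init__(self, size=4):
        self.slots = [None] * size
        self.prod = 0   # advanced only by the producer (front end)
        self.cons = 0   # advanced only by the consumer (back end)

    def put(self, request):
        """Producer side: queue a request, or report that the ring is full."""
        if self.prod - self.cons == len(self.slots):
            return False
        self.slots[self.prod % len(self.slots)] = request
        self.prod += 1
        return True

    def get(self):
        """Consumer side: take the oldest request, or None if empty."""
        if self.cons == self.prod:
            return None
        request = self.slots[self.cons % len(self.slots)]
        self.cons += 1
        return request
```

Because each index is written by only one side, the two sides can share the
page without taking a common lock; the difference between the indices tells
each side how much work is outstanding.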
The interrupts associated with virtual devices are virtual interrupts.
When you run cat /proc/interrupts from a domU,
look at the interrupts with numbers higher than 256; they are labeled "Dynamic-irq".
How are IRQs redirected to the guest OS? The do_IRQ() method was changed
to support IRQs for the guest OS. This method calls __do_IRQ_guest()
if the IRQ is for the guest OS (see xen/arch/x86/irq.c). The __do_IRQ_guest()
method uses the event channel mechanism to send the interrupt to the guest OS
through the send_guest_pirq() method in event_channel.c.
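The routing decision can be modeled in a few lines. This is a hedged
sketch, not the hypervisor's C code: ToyIrqRouter, bind_to_guest() and the
pending_events dictionary are hypothetical, and appending to a list stands
in for raising an event on the guest's event channel.

```python
class ToyIrqRouter:
    """Toy model of forwarding physical IRQs to the domain that owns them."""

    def __init__(self):
        self.guest_irqs = {}       # IRQ number -> owning domain id
        self.pending_events = {}   # domain id -> IRQs waiting for delivery

    def bind_to_guest(self, irq, domid):
        self.guest_irqs[irq] = domid

    def do_irq(self, irq):
        """Entry point: forward guest-owned IRQs, handle the rest locally."""
        domid = self.guest_irqs.get(irq)
        if domid is not None:
            self._do_irq_guest(irq, domid)
            return "guest"
        return "hypervisor"

    def _do_irq_guest(self, irq, domid):
        # Stand-in for sending the interrupt over the event channel.
        self.pending_events.setdefault(domid, []).append(irq)
```

The guest later sees the queued event as one of its virtual interrupts,
which is why such interrupts appear under their own dynamic numbers in
/proc/interrupts.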
Conclusion
The Xen Project is an interesting and promising project that has received
increasing notice over the past year. The code is complex, especially
the virtual memory management, the live migration implementation and
the grant tables mechanism. This article is introductory,
however, and does not deal with these topics. I hope, though, that it
has provided a starting point to those who want to learn more and delve
into the code.
Note: This article refers to Xen-unstable, xen-3.0-devel, which is the
basis for Xen-3.0, which should be released soon. The kernel referred to
for dom0/domU is a 2.6.* kernel. Whenever the term class is used, it
refers to a Python class.
Rami Rosen is a Computer Science graduate of Technion, the Israel
Institute of Technology, located in Haifa. He works as a Linux
kernel programmer for a networking start-up, and he can be reached at
ramirose@gmail.com. In his spare time he likes running,
solving cryptic puzzles and convincing and helping everyone he knows to
move to this wonderful operating system, Linux.














Comments
Some performance test results
I installed Xen on two of our servers. The kernel and Xen are the modern versions (the platform information can be found here ). I have done some tests on this platform; the result is that the CPU performance is nearly 100% while the memory performance is only 90% compared to the physical machine. Details can also be found here . I am satisfied with the CPU performance. Maybe the 10% memory overhead is a bit large. I am wondering whether there is some mistake in my configuration or how to improve it.
Good
Good article - shown few concepts behind Xen and it's useful for beginners.
is it possible to do the memory copy operation between two VM's?
is it possible to do the memory copy operation between two guest VM's directly through XEN without involving dom 0?
Inter-Domain Comms
Hello,
It is a little confusing on how you have described the interaction between XenStore and a domain. How exactly does a Domain interact with Xenstore i.e. TCP ports, sockets, etc...? Since XenStore resides in ring 3 how does it access the hypervisor itself? Thanks.
Mr. B
xen against qemu/bochs
With Xen on x86(_32) running the guest OS kernel in ring 1 and guest OS applications in ring 3, a carefully exploited guest OS is a wide open door to hijack host OS root applications in ring 3 and this way compromise the host OS.
That's something I guess can't happen with qemu/bochs etc. Other words: you trade that for speed.
And at first guess the enhanced CPU architecture will have just tags at descriptors and more complicated descriptor access rules to enable more page tables separated and being loaded simultaneously switched/selected on demand and privileges. But then how can it provide applications/OSes existing in different page tables with similar amount of cpu time to run? Maybe someone can summarize the tech a bit and publish it?
carefuly exploited guest OS
Xen does validation of memory accesses
does xen overhead include OS overhead?
one clarification that i need is if the 3% overhead of xen includes the overhead of running multiple identical guest kernels. yes, xen adds 3% overhead, but is there also some duplication when running 3 linux kernels, whether in memory or in processing?
i recently investigated viritualization for the purposes of consolidating, yet keeping partitioned, a linux server & desktop. as there is very little difference between my current linux server & desktop kernels, i would prefer not to duplicate the linux kernel, but merely have different userlands. i am currently testing linux vserver as it allows me to run a single linux kernel, but maintain multiple userland "instances", each "instance" with its own ip address and other features.
granted vserver, chroot, etc does not help when a user wants to run different operating systems (linux & windows), and if full separation between userland images, even down to the kernel level (kernel-level exploits, user-visible features like nfsd, etc), is desired, then xen is the proper tool for that job. heck, give the xen livecd a test drive and marvel at xen's accomplishment.
just wanted to share my holiday weekend's research to help save someone else some time.
we tested it recently. yup,
we tested it recently. yup, it involves 3% overhead on simple operations, but overhead is more than 20-30% on disk I/O, network etc.
And sure, memory pressure/requirements you mentioned are rather big.
I would recommend you to take a look at OpenVZ project as well. It is more mature, than vserver. We successfully run 30-50 VPSs on 1GB of RAM with it.
disk/network io
Why not use separate drives for each server slice instead of a file system on a file? Perhaps separate network cards also?
This might mitigate the slow down but perhaps satiating the buses.
Anyone doing that?
Cheers,
-b
disk i/o
things run much faster if you give each domU it's own partition. LVM helps a lot here, both to run many small domains on one disk, and to keep track of who owns what partition.
anyone care to write proof of concept exploit?
> (kernel-level exploits,
I guess this may be still issue with xen compared to qemu/bochs. It's not that straight forward but have a look at the access privileges model behind ring. Once you gain ring 1 privileges then the userland of host OS is toast.
windows applications
I am curious, after VT and Pacifica get in and you can then run windows on xen directly, could you run games, graphics, etc...
I guess it depends what kind of drivers xen would provide or allow access too. Anyone?
I.e. work on linux and windows in tandem.
For example, applications that can't or have not yet been ported to linux will work on windows (such as games, proprietery...) and the rest would be linux.
Sadly the SMP support is rath
Sadly the SMP support is rather unstable (and therefore currently only in xen unstable. :-) ).
VMware Community Source is nonsense
VMware's "Community Source" program is exactly like open source, only they don't share their software with anyone except their corporate partners, and don't share the contributed code.
Agreed: VMware Community Source is a load...
I've been reading VMware press releases for the last few weeks with zero substance except how they were going to "open" something up. I went to Intel's Developer Forum and spoke with numerous developers from IBM, HP and Intel at and asked them straight up what the deal was. I asked, "Where is the "open code"? They all kind of (quietly) said the same thing. VMware is getting freaked out by Xen and wanted some press. In reality, they may document a few more APIs, but this is just a load...
Author Response
Hello,
I had written in this article about the advantages and disadvantages of the Xen and VMWare virtualization solutions.
One of the Xen advantages I pointed out was it was
free and open source project.
I felt it will be unfair not to mention that VMware
started that Community Source program in the beginning
of this August.
In the article I wrote about this Community Source program: "..it will be providing its partners with access to VMware ESX Server source code"; also the VMware news release (to which I gave a link) talks about giving source to ***partners***.
I think your comment should be read considering this and in this light.
Regards,
Rami Rosen
NetBSD doesn't need to get patched
NetBSD has native support for Xen for some time in the official releases now, and does NOT need to be patched. See www.NetBSD.org/Ports/xen for more information.
- Hubert
no POWER5 support
I am one of the developers working on the PowerPC port of Xen, and we are supporting the PowerPC 970, not POWER5.
Xen in IBM
Hello,
Please look here:
http://lwn.net/Articles/139964/
It says:
...
IBM is working on Power5 support...
...
Are you sure your team is the only one in IBM working on
Xen ?
Yes, I'm positive. The LWN pa
Yes, I'm positive. The LWN page is also incorrect, though it cites its source so you can see where the information comes from.
Why bother with POWER5 support?
Why would IBM waste resources on POWER5 support? They already have a rock-solid micropartitioning and virtualization environment on the POWER5 that supports Linux, and one that appears to provide even greater protection across partitions then xen does with domains. I'm running my own distribution on one as I write this, and I'm sold. I'd rather manage a SAN-backed POWER5 installation over a blade server any day.
I can see a big advantage for the PPC970, though, given that you can get JS20 blades for their blade center, and the HS20 already.
More on VM and Emulators
--->
http://www.futuredesktop.org/hpc_linux.html#VM
// moma