Home, My Backup Data Center

New Linux users often ask me "what is the best way to learn about Linux?" My advice always comes down to this: install and use Linux (any distribution will do, but something stable works better), and play around with it. Inevitably, you will break something, and then, instead of re-installing, force yourself to fix what you broke. That's my advice because I've personally learned more about Linux by fixing my own problems than just about any other way. After years of doing this, you start to build confidence in your Linux troubleshooting skills, so that no matter what problem comes your way, you figure that if you work at it long enough, you can solve it.

That confidence was put to the test recently when I had a problem with a KVM host. After a power outage, it refused to boot a virtual machine that was my primary personal server for just about everything. In this article, I walk through a problem that almost had me stumped and show how I was able to find a solution in an unorthodox place (at least for me).

The Setup

Before I dive too deep into my problem, it would help to understand my setup. Although I do have servers at home, my primary server is colocated in a data center. I share the server with a friend, so the physical server simply acts as a secured KVM host, and I split the server's RAM and CPU across two virtual machines 50/50. All of my most important services, from primary DNS and e-mail for me and my immediate family, to a number of different Web sites and blogs, and even my main Irssi session, sit on one of those two VMs. I do host secondary DNS and e-mail from a server on my home connection, but due to a one-megabit upstream connection, I don't host much else at home for the outside world.

One day (while a relative happened to be visiting from out of town), I noticed that both my main server and the physical server that was hosting it were unavailable. I notified my contact at the data center, and it ended up being an accidental power outage that affected my cabinet. I was taking my relative out to the coast for the day, far away from decent cell-phone reception. So, since there wasn't much I could do, I assumed that long before I got back into town that afternoon, power would be restored, and other than losing over a year's uptime, I would be back up and running.

Everything but the Sync

The first time I knew there was a real problem was when I got back into town and my main server was still down. I could log in to the physical host, however, so at first I wasn't too worried. After all, I had seen KVM instances fail to recover from a physical host reboot before. In the past, it was either from not setting a VM to start at boot or sometimes even a wayward libvirt AppArmor profile that got in the way. Usually, once I logged in to the physical host, I could change any bad settings, disable any troublesome AppArmor profile, then manually launch my VM with virsh. This time was different.
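In those simpler cases, the fix usually comes down to a couple of virsh commands. Here's roughly what that looks like, with an illustrative guest name:

# Flag the guest to start automatically whenever libvirt starts
$ virsh autostart www.example.net

# Or just launch it by hand once the host is back up
$ virsh start www.example.net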

When my VM wouldn't boot manually, I was ready to blame AppArmor. It had blocked VMs from booting in the past, but this time, neither setting the libvirtd AppArmor profile to complain mode, nor disabling all AppArmor profiles, nor even forcefully stopping AppArmor seemed to help. I even resorted to rebooting the physical host to heed AppArmor's warning that forcibly stopping it while it is running may cause some profiles to misbehave. Nothing helped. When I connected a console to the VM as it booted, I started seeing initial kernel errors that suggested it was having trouble mounting the root filesystem. Great. Did the power outage corrupt my data?
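If you want to try the same AppArmor steps on your own Ubuntu host, they look roughly like this (the exact profile name can vary between releases):

# Put the libvirtd profile into complain mode so it only logs violations
$ sudo aa-complain /usr/sbin/libvirtd

# Or stop AppArmor outright, despite its warning about running processes
$ sudo /etc/init.d/apparmor stop
$ sudo /etc/init.d/apparmor teardown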

The next step in the troubleshooting process was to attempt to boot from a rescue disk. With KVM, it's relatively easy to attach a local ISO image to a guest as though it were a CD-ROM. So without much effort, I discovered I could, in fact, boot a rescue disk, and from it I confirmed that I could mount my VM's drives and that the data did not seem corrupted. So then why wouldn't it boot? After I ran a manual fsck from the rescue disk, I attempted to reload GRUB, and that was when I got my first strange clue about the nature of the problem: even from the rescue disk, I wasn't able to write to the filesystem reliably. I would get virtual ATA resets, even though I seemed to be able to read just fine.
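If you haven't attached an ISO to a KVM guest before, virsh can do it in one command. Something like this, with illustrative paths (you may also need to move the CD-ROM up in the guest's boot order):

# Present a rescue ISO to the guest as a virtual CD-ROM drive
$ virsh attach-disk test1.example.net \
    /var/lib/libvirt/images/rescue.iso hdc --type cdrom --mode readonly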

At that point, I assumed I had some level of corruption with that particular VM, but because my data wasn't affected, I figured in the worst case, I could spawn a fresh VM and migrate the data over. That's what I tried next, using the same ubuntu-vm-builder wrapper script I had used previously to build my VM. The VM seemed to spawn fine; however, once again, even this brand-new VM refused to boot properly and showed the same strange disk errors.
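I no longer have the exact ubuntu-vm-builder command I used, but a vmbuilder invocation for a KVM guest looks something like this (every value here is illustrative):

# Build a fresh Ubuntu 10.04 (lucid) KVM guest and register it with libvirt
$ sudo vmbuilder kvm ubuntu --suite lucid --flavour virtual \
    --hostname test1 --mem 512 --rootsize 8192 --libvirt qemu:///system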

It was at this point that my troubleshooting steps start to get a bit hazy, because I started trying more desperate things. I booted different kernel versions in GRUB (after all, the kernel had been updated a few times in the year the server had been up). I audited all of the filesystem permissions on my VM disk images, and I tried to launch the VMs as root just in case. I even tried converting one VM's disks from qcow2 to raw, with no results. Even Web searches came up empty. This server had been down longer than it ever had before, and I was starting to run out of options.
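The qcow2-to-raw conversion, for what it's worth, is a qemu-img one-liner:

# Convert a qcow2 disk image to a raw disk image
$ qemu-img convert -f qcow2 -O raw disk0.qcow2 disk0.raw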

The Sync

My first break came when I decided to copy the VM I had just spawned over to almost identical hardware I had at home, running the same distribution, to see if I could reproduce the problem there. I picked the new VM simply because qcow2 disk images grow on demand, so it happened to have the smallest disks and was the fastest to sync over. The process was pretty straightforward. First, I exported that KVM instance's configuration XML file with virsh on the colocated host:


$ virsh dumpxml test1.example.net > test1.example.net.xml

Then I copied that XML file to my home server, created a local directory named after this VM to store its disk images and synced them over from the physical host:


$ mkdir test1.example.net
$ rsync -avx --progress remotehost:/var/lib/libvirt/images/test1.example.net/ \
    test1.example.net/

Once the disk images were copied, I had to edit the test1.example.net.xml file, because the disk images were now stored in a new location.
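The edit itself just means pointing each disk's <source> element at the new local path. The relevant snippet in the XML looks something like this (the path here is illustrative):

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <!-- update this path to where the image now lives locally -->
  <source file='/home/kyle/test1.example.net/disk0.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>

After that, I used virsh again to import the XML configuration file and start the VM: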


$ virsh define test1.example.net.xml
$ virsh start test1.example.net

The VM actually started! Although I still had no idea what the problem was on the colocated server, I felt pretty confident that if I could sync over my main server, it would run on this home machine. Of course, with a 12Mb-down, 1Mb-up connection at home, it was going to take quite a bit longer to copy the 45GB of disk images for this VM. Other than the time it took, though, the process was essentially the same as with the test machine, except that once the VM booted, I had to change its network configuration to reflect its new public IP.
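On an Ubuntu guest, that network change is just an edit to /etc/network/interfaces, followed by bringing the interface back up; something along these lines, with made-up addresses:

# /etc/network/interfaces -- point eth0 at the new public IP
auto eth0
iface eth0 inet static
    address 203.0.113.10
    netmask 255.255.255.0
    gateway 203.0.113.1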

With my server back up and running, I just had to change a number of DNS entries and firewall rules to reflect the new IP, and even with my slower upstream connection at home, I at least had some breathing room to troubleshoot the problem on the colocated server.

The Last Resort

Now that my VM and its data were safe and services were restored (if a bit slow), I felt free to perform more drastic steps on my colocated server. The first step was to figure out what was so different about it compared to my home server. Both had the same Ubuntu 10.04 server install and most of the same packages. Luckily, I had a number of old cached libvirt and KVM packages on my home server, so I first iterated through all of those packages to see if the problem was due to some upgrade. Once I exhausted that, I tried different kernel versions on the physical host, still with no results.
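Downgrading from the local package cache is a straightforward dpkg operation; for example (the exact .deb filenames depend on what happens to be in your cache):

# Install a specific older build of the KVM and libvirt packages
$ sudo dpkg -i /var/cache/apt/archives/qemu-kvm_<old-version>_amd64.deb \
    /var/cache/apt/archives/libvirt-bin_<old-version>_amd64.deb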

Believe me when I tell you that during that week I tried every troubleshooting measure I could think of before I finally went to the second-to-last resort. The fact that I was even considering it should tell you how desperate I was getting. The last resort would be a complete re-install from scratch, something I wasn't ready to do yet. I was desperate enough, though, that I went with the second-to-last resort: an in-place distribution upgrade from 10.04 to 12.04. Once the dust settled, I tried my small test image, and it actually worked. We were back in business.
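For the record, on an Ubuntu LTS release the in-place upgrade itself comes down to a single command (with Prompt=lts set in /etc/update-manager/release-upgrades):

# Kick off the interactive 10.04 -> 12.04 upgrade
$ sudo do-release-upgrade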

The Sync Back

Well, we were almost back in business. See, I had been using that server at home for a number of days now, and between the e-mail, blogs and other services, it had a lot of new data on it. This meant I couldn't just start up the image that was already on the colocated server. I had to sync up the changes from my home server.

The real trick to this was that I couldn't just sync the server hot. For one, the disk would be changing all the time, and for another, I didn't want to risk having the same server running in weird states on two different physical hosts. This meant syncing the actual disk images. The problem was that although the 45GB of disk images synced to my house relatively quickly over my 12Mb downstream (plus, the server was already down at the time, so downtime wasn't a consideration), syncing the same data back up over my 1Mb upstream was going to take a long time: too long for a pure cold sync to be the whole solution, as I just couldn't have that much downtime.
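To put rough numbers on it:

# 45GB of disk images over a 1Mb/s uplink, ignoring protocol overhead:
#   45GB x 8 bits/byte = 360Gb = ~360,000Mb
#   360,000Mb / 1Mb/s  = ~360,000s = ~100 hours, or about four days of downtime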

The solution here was going to be twofold, and it was based on a few assumptions I could make:

  • Although a fair number of files had changed on my local VM instance, the actual size of the change was relatively small compared to the size of the disk images.

  • rsync has an excellent mechanism for syncing over only the parts of large files that have changed.

  • A lot of the changes in my qcow2 files were likely going to be at the end of those files anyway.

  • If I used rsync with the --inplace option, it would modify the existing disk image on the remote machine directly, saving disk space and time.

So, my plan for phase 1 was to run rsync from physical host to physical host, syncing the qcow2 disk images hot while the VM was running, and to tell rsync to write the changes in place. Because I could assume the remote images would be somewhat corrupted anyway (that's the downside of syncing a disk image while the disk is in use), I didn't have to care about --inplace leaving behind a potentially corrupted file if it was stopped midway through the sync. I could clean that up later.

The advantage of doing the phase 1 rsync hot was that I could get all of the main differences between the home and colocated images sorted out while the server was still running at home. I could even run that rsync multiple times leading up to phase 2, just to make sure it was as up to date as it could be. Here are the rsync commands I used to perform the phase 1 hot sync:


$ rsync -avz --progress --inplace disk0.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk0.qcow2
$ rsync -avz --progress --inplace disk1.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk1.qcow2

Between rsync's syncing only the bits that changed and the fact that I used -z to compress the data before it was transferred, I was able to sync these files way faster than you would think possible on a 1Mb connection. Of course, these commands ended up saturating my bandwidth at home, so since I wasn't under time pressure for the hot sync to complete, I ended up setting a bandwidth limit of 10 kilobytes per second for the larger disk1.qcow2 image:


$ rsync -avz --progress --inplace --bwlimit=10 disk1.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk1.qcow2

Once phase 1 was complete, I could start phase 2. I needed the phase 2 rsync to run while the VM was powered off, so I could be sure the disk wasn't being written to during the sync; otherwise, I would risk corrupting the filesystem. Because this required downtime, I picked a proper maintenance window for my server, when it would be less busy, finished a final phase 1 hot sync a few hours before, then halted the VM cleanly.
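Halting the guest cleanly and confirming it was down was another quick pair of virsh commands (the guest name, again, is illustrative):

# Ask the guest to shut down cleanly, then confirm it has stopped
$ virsh shutdown www.example.net
$ virsh domstate www.example.net
shut off

With the guest confirmed off, it was safe to run the final cold syncs: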


$ rsync -avz --progress --inplace disk0.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk0.qcow2
$ rsync -avz --progress --inplace disk1.qcow2 \
    remotehost:/var/lib/libvirt/images/www.example.net/disk1.qcow2

Because of the previous work syncing up the disk images, the final cold sync took only an hour or two, with most of that time spent by rsync reading through the local and remote images to confirm they were in sync. Once the commands completed, I was able to power up the server again on my colocated host, change its IPs back, and I was back in business.


______________________

Kyle Rankin is a systems architect and the author of DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks, and Ubuntu Hacks.
