Bare Metal Recovery

Most of us don't take the time to plan for disaster recovery. One excuse is not wanting to figure out what to do. That excuse is now gone: this article gives you the step-by-step procedure.

Second Stage Restoration

As the computer reboots, go back to the BIOS and verify that the clock is more or less correct.

Once you have verified the clock, exit the BIOS and reboot, this time to the hard drive. You will see a lot of error messages, mostly along the lines of “I can't find blah! Waahhh!” Well, if you have done your homework correctly up until now, those error messages won't matter. You don't need linuxconf or apache to do what you need to do.

You should be able to log into a root console (no X, no users, sorry). You should now be able to use the network, for example, to NFS mount the backup of your system.
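As a rough illustration of the NFS step, here is a minimal sketch of a helper that mounts the backup read-only. The host name backuphost, the export /export/backups and the mount point /mnt/backup are hypothetical; substitute whatever your site uses. The MOUNT_CMD variable is my own addition so the command can be previewed before it is run for real.

```shell
#!/bin/sh
# Sketch only: mount a backup export read-only over NFS.
# "backuphost", "/export/backups" and "/mnt/backup" are hypothetical names.

mount_backup() {
    server=$1
    export_path=$2
    mount_point=$3

    mkdir -p "$mount_point" || return 1
    # MOUNT_CMD lets you substitute "echo" for a dry run; it defaults to mount.
    ${MOUNT_CMD:-mount} -t nfs -o ro "$server:$export_path" "$mount_point"
}
```

Calling mount_backup backuphost /export/backups /mnt/backup as root would attempt the real mount; setting MOUNT_CMD=echo first only prints the arguments that would be passed.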

If you did the two-stage backup I suggested for Arkeia, you can now restore Arkeia's database and executables. You should then be able to start the server by running /etc/rc.d/init.d/arkeia start. If you have the GUI installed on another computer with X installed, you should be able to log in to Arkeia on your tape server and prepare your restoration.

When you restore, read the documentation for your restoration programs carefully. For example, tar does not normally restore certain characteristics of files, such as suid bits, and the permissions of restored files are filtered through the user's umask. To restore your files exactly as you saved them, use tar's p option. Similarly, make sure your restoration software will restore everything exactly as you saved it.
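To see the effect of the p option for yourself, something like the following sketch may help. It assumes GNU tar and works entirely in a temporary directory; the file name tool and the function name are made up for the demonstration. It archives a setuid file, then extracts it twice under a restrictive umask, once without and once with p.

```shell
#!/bin/sh
# Sketch: compare a restore with and without tar's p option.
# Assumes GNU tar; everything happens in a throwaway temporary directory.

demo_tar_perms() {
    work=$(mktemp -d) || return 1
    mkdir "$work/src" "$work/plain" "$work/preserved"

    echo data > "$work/src/tool"
    chmod 4775 "$work/src/tool"          # setuid, group-writable
    tar -cf "$work/backup.tar" -C "$work/src" tool

    (
        umask 077
        # --no-same-permissions forces the umask-filtered behaviour even for
        # root, for whom GNU tar otherwise preserves permissions by default.
        tar -xf "$work/backup.tar" -C "$work/plain" --no-same-permissions
        tar -xpf "$work/backup.tar" -C "$work/preserved"
    )

    # Print the first ten characters of the ls -l mode string for each copy.
    echo "plain: $(ls -l "$work/plain/tool" | cut -c1-10)"
    echo "preserved: $(ls -l "$work/preserved/tool" | cut -c1-10)"
    rm -rf "$work"
}
```

Running demo_tar_perms should show the preserved copy with the mode it had in the archive, while the plain copy has lost at least its group and other permissions to the umask.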

To restore the test computer:

[root@tester ~]# restore.all

If you used tar for your backup and restoration, and used the -k (keep old files, don't overwrite) option, you will see a lot of this:

tar: usr/sbin/rpcinfo: Could not create file:  File exists
tar: usr/sbin/zdump: Could not create file:  File exists
tar: usr/sbin/zic: Could not create file:  File exists
tar: usr/sbin/ab: Could not create file:  File exists

This is normal, as tar is refusing to overwrite files you restored during the first stage of restoration.
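The behavior of -k can be reproduced safely in a scratch directory. The sketch below is only an illustration, with invented file contents: it pretends that the first stage already restored a file, then extracts an archive containing a different version of that file with -k. Newer versions of GNU tar treat the refusal as an error and exit with a non-zero status, so the sketch ignores tar's exit code.

```shell
#!/bin/sh
# Sketch: show that tar -k leaves an already-restored file alone.
# Works in a temporary directory; file contents are invented for the demo.

demo_keep_old_files() {
    work=$(mktemp -d) || return 1
    mkdir "$work/src" "$work/root"

    echo "from backup" > "$work/src/zdump"
    tar -cf "$work/backup.tar" -C "$work/src" zdump

    # Pretend the first stage already put a copy in place.
    echo "from first stage" > "$work/root/zdump"

    # -k refuses to overwrite; tar complains and may exit non-zero.
    tar -xkf "$work/backup.tar" -C "$work/root" 2>"$work/errors" || true

    cat "$work/root/zdump"
    rm -rf "$work"
}
```

demo_keep_old_files prints the contents of the file after the second extraction, which are those of the first-stage copy.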

Just to be paranoid, run LILO after you perform your restoration. I doubt it is necessary, but if it is, running it now is a lot easier than the alternative. You will notice I have it in my script, restore.all (see Listing 3).

Listing 3. restore.all Script

Now reboot. On the way down, you will see a lot of error messages, such as “no such pid.” This is a normal part of the process. The shutdown code is using the pid files from dæmons that were running when the backup was made to shut down dæmons that were not started on the last boot. Of course there's no such pid.
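If you are curious which pid files are left over from the backup, a small check along these lines can list them. This is a sketch, not part of the restoration scripts; the function name is my own, and /var/run is assumed to be where your distribution keeps its pid files.

```shell
#!/bin/sh
# Sketch: list pid files whose recorded process is not running.
# The directory to scan is passed as an argument (for example /var/run).

stale_pidfiles() {
    dir=$1
    for f in "$dir"/*.pid; do
        [ -f "$f" ] || continue
        # Keep only the digits from the first line of the pid file.
        pid=$(head -n 1 "$f" | tr -cd '0-9')
        [ -n "$pid" ] || continue
        # kill -0 sends no signal; it only checks that the process can be signalled.
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "$f"
        fi
    done
}
```

Run as root, stale_pidfiles /var/run prints one path per stale file. Run as an ordinary user it can also report processes that exist but belong to someone else, because kill -0 fails for those too.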

Your system should come up normally, with far fewer errors than it had before. The acid test of how well your restore works on an RPM-based system is to verify all packages:

rpm -Va

Some files, such as configuration and log files, will have changed in the normal course of things, and you should be able to mentally filter those out of the report.

If you took my advice earlier and kept RPM metadata as a normal part of your backup process, you should be able to diff the saved report against the new one, thereby speeding up this step considerably.
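One way to do that comparison is sketched below. It assumes you have a saved rpm -Va report from backup time and a fresh one from the restored system; both file names are up to you. Instead of a plain diff it sorts the two reports and uses comm to print only the lines that are new since the backup, which I find easier to read.

```shell
#!/bin/sh
# Sketch: print verification failures that appear in the current
# rpm -Va report but not in the one saved with the backup.
# Usage: new_verify_failures SAVED_REPORT CURRENT_REPORT

new_verify_failures() {
    baseline=$1
    current=$2
    tmp=$(mktemp -d) || return 1

    # comm needs both inputs sorted the same way.
    sort "$baseline" > "$tmp/baseline"
    sort "$current"  > "$tmp/current"

    # -13 suppresses lines unique to the baseline and lines common to both,
    # leaving only lines that are new in the current report.
    comm -13 "$tmp/baseline" "$tmp/current"
    rm -rf "$tmp"
}
```

For example, after rpm -Va > /tmp/rpm-Va.now you could run new_verify_failures with your saved report and /tmp/rpm-Va.now as arguments; anything it prints deserves a closer look.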

You should be up and running. It is time to test your applications, especially those that run as dæmons. The more sophisticated the application, the more testing you may need to do. If you have remote users, keep them off the system, or make it “read only”, while you test it. This is especially important for databases, to prevent making any corruption or data loss worse than it already might be.

If you normally boot to X, and disabled it above, test X before you re-enable it. Re-enable it by changing that one line in /etc/inittab back to: id:5:initdefault:
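If you would rather script that change than edit the file by hand, a sketch such as the following could do it. It assumes a SysV-style inittab with a single initdefault line; the function name is hypothetical. Try it on a copy of the file before pointing it at /etc/inittab.

```shell
#!/bin/sh
# Sketch: rewrite the initdefault line of an inittab-style file.
# Usage: set_default_runlevel LEVEL FILE

set_default_runlevel() {
    level=$1
    file=$2
    tmp=$(mktemp) || return 1

    # Replace whatever single-digit runlevel is currently the default.
    # Writing back with cat keeps the original file's owner and mode.
    sed "s/^id:[0-9]:initdefault:/id:${level}:initdefault:/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For instance, set_default_runlevel 5 /etc/inittab would restore the graphical default, and set_default_runlevel 3 /etc/inittab would switch back to a text console.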

You should now be ready to rock and roll—and for some Aspirin and a couch.

Some Advice for Disaster Recovery

You should take your Zip disk for each computer and the printouts you made, and place them in a secure location in your shop. You should also store copies of these in your off-site storage location. The major purpose of off-site backup storage is to enable disaster recovery, and restoring each host onto replacement hardware is a part of disaster recovery.

You should also have several tomsrtbt floppies and possibly some Zip drives in your off-site storage as well. Have copies of the tomsrtbt distribution on several of your computers so that they back each other up. In addition, you should probably keep copies of this article, with your site-specific annotations on it, with your backups and in your off-site backup storage.

What Now?

This article is the result of experiments on one computer. No doubt you will find some other directories or files you need to back up in your first stage backup. I have not dealt with saving and restoring X on the first stage. Nor have I dealt with other operating systems in a dual boot system, or with processors other than Intel.

I would appreciate your feedback as you test and improve these scripts on your own computers. I also encourage vendors of backup software to document how to do a minimal backup of their products. I'd like to see the whole Linux community sleep just a little better at night.

Charles Curley lives in Wyoming, where he rides horses and herds cattle, cats and electrons. Only the last of those pays well, so he also writes documentation for a small software company headquartered in Redmond, Washington.