Scary Backup Stories
Backups. We all know the importance of making a backup of our most important systems. Unfortunately, some of us also know that realizing the importance of performing backups often is a lesson learned the hard way. Everyone has their scary backup stories. Here are mine.
Like a lot of people, my professional career started out in technical support. In my case, I was part of a help-desk team for a large professional practice. Among other things, we were responsible for performing PC LAN backups for a number of systems used by other departments. For one especially important system, we acquired fancy new tape-backup equipment and a large collection of tapes. A procedure was put in place, and before-you-go-home-at-night backups became a standard. Some months later, a crash brought down the system, and all the data was lost. Shortly thereafter, a call came in for the latest backup tape. It was located and dispatched, and a recovery was attempted. The recovery failed, however, as the tape was blank. A call came in for the next-to-last backup tape. Nervously, it was located and dispatched, and a recovery was attempted. It also failed because this tape also was blank. Amid long silences and pink-slip glares, panic started to set in as the tape from three nights prior was called up. This attempt resulted in a lot of shouting.
All the tapes were then checked, and they were all blank. To add insult to injury, the problem wasn't only that the tapes were blank--they weren't even formatted! The fancy new backup equipment wasn't smart enough to realize the tapes were not formatted, so it allowed them to be used. Note: writing good data to an unformatted tape is never a good idea.
Now, don't get me wrong, the backup procedures themselves were good. The problem was that no one had ever tested the whole process--no one had ever attempted a recovery. Was it any wonder, then, that each recovery failed?
For backups to work, you need to do two things: (1) define and implement a good procedure and (2) test that it works.
To this day, I can't fathom how my boss (who had overall responsibility for the backup procedures) managed not to get fired over this incident. And what happened there has always stayed with me.
When it comes to doing backups on Linux systems, a number of standard tools can help avoid the problems discussed above. Marcel Gagné's excellent book (see Resources) contains a simple yet useful script that not only performs the backup but verifies that things went well. Then, after each backup, the script sends an e-mail to root detailing what occurred.
I'll run through the guts of a modified version of Marcel's script here, to show you how easy this process actually is. This bash script starts by defining the location of a log and an error file. Two mv commands then rename the previous log and error files, which keeps them available for the examination of the next-to-last backup (if required):
#! /bin/bash
backup_log=/usr/local/.Backups/backup.log
backup_err=/usr/local/.Backups/backup.err
mv $backup_log $backup_log.old
mv $backup_err $backup_err.old
With the log and error files ready, a few echo commands append messages (note the use of >>) to each of the files. The messages include the current date and time (which is accessed using the back-ticked date command). The cd command then changes to the location of the directory to be backed up. In this example, that directory is /mnt/data, but it could be any location:
echo "Starting backup of /mnt/data: `date`." >> $backup_log
echo "Errors reported for backup/verify: `date`." >> $backup_err
cd /mnt/data
The backup then starts, using the tried and true tar command. The -cvf options request the creation of a new archive (c), verbose mode (v) and the name of the file/device to back up to (f). In this example, we back up to /dev/st0, the location of an attached SCSI tape drive:
tar -cvf /dev/st0 . 2>>$backup_err
Any errors produced by this command are sent to STDERR (standard error). The above command exploits this behaviour by appending anything sent to STDERR to the error file as well (using the 2>> directive).
When the backup completes, the script then rewinds the tape using the mt command, before listing the files on the tape with another tar command (the -t option lists the files in the named archive). This is a simple way of verifying the contents of the tape. As before, we append any errors reported during this tar command to the error file. Additionally, informational messages are added to the log file at appropriate times:
mt -f /dev/st0 rewind
echo "Verifying this backup: `date`" >>$backup_log
tar -tvf /dev/st0 2>>$backup_err
echo "Backup complete: `date`" >>$backup_log
To conclude the script, we concatenate the error file to the log file (with cat), then e-mail the log file to root (where the -s option to the mail command allows the specification of an appropriate subject line):
cat $backup_err >> $backup_log
mail -s "Backup status report for /mnt/data" root < $backup_log
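With the script as shown, a failure has to be spotted by reading the report. One possible refinement, sketched here and not part of Marcel's script, is to remember whether any step exited with an error and to say so in the subject line. The run_step helper and the status variable are hypothetical names, and the two commands passed to run_step merely stand in for the tar and mt calls:

```shell
#!/bin/bash
# Illustrative sketch: run_step and status are hypothetical names.
backup_err=$(mktemp)      # stand-in for /usr/local/.Backups/backup.err
status="OK"

# Run a command, append its STDERR to the error file and
# remember whether it exited with a non-zero status.
run_step() {
    "$@" 2>>"$backup_err" || status="FAILED"
}

run_step true                      # stands in for a step that succeeds
run_step ls /no/such/directory     # stands in for a step that fails

subject="Backup status report for /mnt/data: $status"
echo "$subject"
```

In the real script, the subject variable would be handed to mail -s in place of the fixed subject text.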
And there you have it, Marcel's deceptively simple solution to performing a verified backup and e-mailing the results to an interested party. If only we'd had something similar all those years ago.
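Listing the archive shows that it can be read, but the first story suggests going one step further and attempting a recovery every so often. The following is a minimal sketch of such a drill, run on scratch data so that it can be tried safely: a temporary directory plays the role of /mnt/data and a temporary file plays the role of /dev/st0. All of the names and sample files are made up for illustration:

```shell
#!/bin/bash
# Restore drill on scratch data; every path here is a temporary stand-in.
src=$(mktemp -d)          # plays the role of /mnt/data
restore=$(mktemp -d)      # scratch area for the trial restore
archive=$(mktemp)         # plays the role of /dev/st0

echo "payroll record" > "$src/ledger.txt"
mkdir "$src/reports"
echo "weekly totals" > "$src/reports/week1.txt"

# Back up from inside the directory, as the script does, but to a file.
( cd "$src" && tar -cf "$archive" . )

# Restore into the scratch area, then compare the result with the source.
tar -xf "$archive" -C "$restore"
if diff -r "$src" "$restore" >/dev/null; then
    restore_status="passed"
else
    restore_status="failed"
fi
echo "Restore test $restore_status"
```

Against a real tape, the same idea applies: rewind, extract into a spare directory and compare what comes back with the live data.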
Some years later I found myself working a new job as an IT manager. One of the first things I was asked to do was define and implement a backup procedure for a newly acquired payroll system. This I duly did.
The procedure was, in my humble opinion, good. There were plenty of formatted blank tapes, which were rotated daily over three weeks. Backup tapes were stored in a locked cabinet in a building other than the one that housed payroll. Tapes had to be signed in and out. Responsibility for performing the backup was "delegated" to payroll. IT provided the equipment, set everything up, tested that the backups were actually taking place (and that the tapes could be used to recover successfully) and trained the payroll staff.
Now, don't get me wrong, it's not that I don't believe in God. It's just that I'm convinced God has an evil twin who enjoys laughing at us. In this case, evil-twin God sent a big bolt of lightning from the heavens and aimed it at the building housing payroll. On scoring a direct hit, the bolt of lightning fried the surge suppressor that protected payroll's hardware and eventually torched the hard disk. A panic-stricken call was made to me, the IT manager.
On assuring the payroll manager that everything was okay and noting the existence of daily backups for the last three weeks, I dispatched a technician to investigate. Sure enough, the hard disk was toast. A replacement was emergency ordered, and it arrived the next morning. We immediately went to work getting things going again.
The most recent full-backup tape was used to restore the system. This went well: the operating system was restored, with all of the applications and user settings in place. Then the most recent backup tape was used to restore the system to its last-known good state before the lightning strike. A quick execution of the payroll application looked okay. We all returned to our desks, happy that we had dodged this particularly nasty bullet.
Then the phone call came: "We are missing payroll data for the last two weeks." How could this be? We checked that the correct tape had been used to do the restore. It had. We checked the logs to see that tapes had been handed out and signed for. They had; backups had been occurring every night as scheduled. We checked the actual backup tape used to restore the system to its last-known good state and, sure enough, there was no payroll data on it. In fact, there wasn't much of anything on it. Checking the logs again, we noticed that for the preceding two weeks, a staff member other than the designated backup person had been signing out the tapes. We checked the tape to discover that the only files backed up belonged to this other staff member. What transpired was that the designated backup person had gone on vacation and delegated the nightly backups to a colleague. Unfortunately (for us), the colleague did not have the necessary permissions to back up the entire hard disk; only the designated backup person had the proper permissions. Trust evil-twin God to test our backup procedure while our designated backup person was on vacation! The moral of this scary backup story is never underestimate the human factor.
So, for backups to work, you actually need to do three things: (1) define and implement a good procedure, (2) test that it works and (3) review your procedure often.
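One check that fits naturally into such a review is whether the account running the backup can actually read everything it is supposed to save. The sketch below is a rough illustration, not something we had at the time: it walks a directory tree and reports anything the current user cannot read. The backup_dir variable points at a temporary stand-in for the real data directory, and the variable names are my own:

```shell
#!/bin/bash
# Pre-backup readability check; backup_dir is a temporary stand-in.
backup_dir=$(mktemp -d)
echo "sample" > "$backup_dir/sample.txt"

# List everything under the tree that the current user cannot read.
unreadable=$(find "$backup_dir" ! -exec test -r {} \; -print)

if [ -n "$unreadable" ]; then
    echo "WARNING: cannot read the following, backup would be incomplete:"
    echo "$unreadable"
    check_status="incomplete"
else
    echo "All files under $backup_dir are readable by $(id -un)."
    check_status="ok"
fi
```

Had a check like this been run under the colleague's account, the warning list would have covered most of the disk.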
Creating a specific user-id for backups would have helped avoid the problems described above, rather than relying on a specific user (with the correct rights) to perform the work. Even better is a situation whereby the reliance on any one user is minimized. Things can be much more foolproof if all the user has to do is swap the tape each day. Again, Linux and a standard tool can help, and that tool's name is cron. Assuming that the above script is called mntdata.backup and that it resides in /usr/local/.Backups/mntdata.backup, a crontab entry like the following runs it at 9 PM every Monday through Friday:
0 21 * * 1-5 /usr/local/.Backups/mntdata.backup
Now all that's required is a manual procedure to swap the tape at the start of each working day. The tape is then labeled and stored securely. A minor improvement to this cron entry would be to capture any output and errors from the script in a log file (just in case there's ever a problem with its execution).
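That improvement comes down to adding shell redirection to the end of the cron entry. The sketch below shows the redirection at work on a stand-in function, so it can be tried without touching a crontab; the cron.log file name and the fake_backup function are assumptions made for illustration:

```shell
#!/bin/bash
# In the crontab, the entry would gain the same redirection, for example:
#   0 21 * * 1-5 /usr/local/.Backups/mntdata.backup >> /usr/local/.Backups/cron.log 2>&1
# (cron.log is an assumed file name.)

cron_log=$(mktemp)        # stand-in for the log file

# Stand-in for the backup script: one line to STDOUT, one to STDERR.
fake_backup() {
    echo "listing files"
    echo "tape not ready" >&2
}

# ">>" appends STDOUT to the log; "2>&1" sends STDERR to the same place.
fake_backup >> "$cron_log" 2>&1
cat "$cron_log"
```

Without the redirection, cron normally mails whatever the job prints to the owner of the crontab, so the log file trades that e-mail for a record kept on disk.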
So, good procedures help make good backups, as does good training. Good technology and a little automation also help. When it comes to making backups--which has never been the most exciting activity--there really is no excuse not to use every advantage available.
Resources
Chapter 17 of Linux System Administration: A User's Guide, Marcel Gagné, Addison-Wesley (Pearson Education), 2002. ISBN: 0-201-71934-7.
Paul Barry no longer worries about backup tapes. For the last five years he has lectured at The Institute of Technology, Carlow in Ireland. He is the author of Programming the Network with Perl, Wiley 2002.