Disk Maintenance under Linux (Disk Recovery)
Here's a hypothetical situation for you to think about. You're working on your Linux box, calling up an application or data file, and Linux hesitates while reading the hard disk. Then, scrolling up the screen (or console box), you see something like this:
Seek error accessing /dev/hdb2 at block 52146, IDE reset (successful).
After some time spent chugging away accessing the drive, Linux continues. If you're lucky, everything is still running along fine. If you're not, your program is refusing to start, or your data file contains garbage.
Chances are, if you're using a hard disk drive that's a few years old, you will begin to see errors when accessing the disk from time to time. At this point, the best prognosis for your disk is that, given time, it'll get worse. So you need to begin resuscitation efforts as soon as possible. Several disk manufacturers have utilities that find and allocate these bad sectors on your hard disk. Unfortunately, these utilities also destroy the information on your disk, and are normally run from DOS, not Linux.
Fortunately, Linux has some system utilities to help you when you are dealing with its (now) native ext2 format. (Utilities are also available for minix. If you need to repair other non-Linux file systems you should use their own native sets of file system utilities.) While not as user-friendly as Norton Disk Doctor or Microsoft ScanDisk, the Linux disk and file system utilities get the job done. In this article, we'll look at a few of the tools to help us overcome the kind of problem I described in the opening paragraph. Other hard disk manipulation utilities can be found in /sbin and /usr/sbin, but they'll have to wait. For now, let's get the hard disk working properly.
Before you dig in, if you're using one of the newer 2.0.x kernels with an IDE drive, check to see if you have the proper bug fixes compiled into the kernel. If you aren't sure which chipset you have in your computer or are unable to ascertain for sure, it is safe to compile in the CMD640, RZ1000 and Intel 82371 options. These options are found under Floppy, ID and other block devices in your make config. This could save your data in the future. These bug fixes may be all you need, but further checks on your hard drive won't hurt.
I hate cliches, although I'm frequently accused of (ab)using them. If it really went without saying that we always do system backups, my income might be somewhat lower than it is. For most people, it's just not true. So, if you've neglected the chore for a while, just let me say that now would be a good time to do that backup. Some of the work I'll be telling you how to do could inadvertently damage or destroy your file system or some of your important files—so be careful and don't say I didn't warn you.
Now that I've gotten the requisite legal protection warnings out up front, let's begin. The safest way to start is with a fairly mundane check of the file system. On my system—a combination Red Hat (I like the SYSVinit style bootup), Slackware, Internet tarball concoction—I have fsck, a front-end program that reads the type of file system on a device (from /etc/fstab), then invokes the appropriate fsck.filesystemtype checker—in my case, fsck.ext2. You may have e2fsck on your system instead of, or in addition to, fsck.ext2. Don't worry, they're the same file. One may be a soft link to the other, but it's better to make that a hard link.
Before starting, let's prepare our systems for the kind of work we're going to be doing. Whenever I perform low-level maintenance on a system, I find it prudent to ensure I am disconnected from the network. Normally this means dropping to single-user mode. You may opt to do some of these tests from init level 2 (with no network connections), but you'll want to ensure that you don't have too many processes running that want to write to the disk, and none that run from the partition you need to work on. Single-user mode was made for this. A simple telinit 1 will get us to single-user mode.
If you're not checking the root file system, unmount the file system you're going to work on before you begin. If you forget, you'll get a prompt from fsck telling you the file system is mounted and asking if you want to continue anyway. Say “No”--running low-level system diagnostics, particularly those that alter the file system by writing directly to the disk as fsck does, with the disk mounted, is a very bad idea. Obviously, we can't unmount the root file system. We should be able to remount it as read-only, but a bug in mount doesn't always allow this option. If you need to check the root file system, you can reboot into single-user mode with the root partition mounted read-only by issuing the -b switch at the LILO prompt. The -b switch will be passed through LILO to init and will cause an emergency boot that does not run any of the startup scripts. If you have always wondered why you would want to create several partitions—for example, for /usr and /home--and restrict the size and scope of the root partition, now you know.
Invoking fsck from the command line on any given partition will probably not result in a check being run, because you have not reached the predetermined maximum mount count; therefore, the system believes the file system is clean and not in need of checking. To force the check, invoke fsck with -f.
At this point, one of two things will happen: fsck will begin to run correctly and check your disk partition (possibly hesitating at the bad spots on the disk and issuing appropriate error messages before continuing) or it will terminate without running, leaving error messages behind. If fsck does not run, you'll have to give the program additional information as indicated in the error messages. Probably the most common information you'll need to pass to e2fsck is the address of the alternate superblock or the block size so that e2fsck can calculate where an alternate superblock is located. The -b switch will tell e2fsck to use the alternate superblock, but we'll have to tell e2fsck where to find one. On ext2 file systems, superblocks are normally located at 8193, 16385 and higher multiples of 8192+1 (see dumpe2fs explanation below). As an alternative, we can pass e2fsck the block size with the -B switch (once we have that information) to allow e2fsck to calculate alternate superblock locations. Later I'll tell you where to get the block size value if you ever need it.
At this point, it's worth mentioning two other mutually exclusive switches available to fsck and e2fsck. The first is the -n switch, which tells fsck to answer no to all queries, and will leave the file system in its original condition making no repairs. The second is the -y switch, which automatically corrects any errors it finds. Generally, to speed things up, you may want to run fsck with the -y switch. So, why don't we just use this option all the time? I strongly recommend against this course of action, if you suspect problems with the file system. While fsck will usually not encounter problems, typing fsck -y and then taking a coffee break, leaving the machine to take care of itself, is not particularly prudent. If, in the interests of speed, you use the automatic answer yes switch to do routine checks, be sure to list your lost+found directories from time to time. Besides, you'll really want to note the block or inode numbers that appear while fsck runs, so that you can check them later to see if they are allocated to files.
The other available options for fsck and e2fsck can be found in the man pages. I consider the fsck and e2fsck man pages fairly well written, as is appropriate considering the importance of these utilities to your file system's health.
You may encounter messages asking if you want fsck to correct an error. Answering no will normally terminate the program so that you may fix the problem and rerun fsck. However, most error messages you're likely to encounter are fairly routine, and you may safely answer yes to them. If you see a message such as inode 1234 unattached, it means the file pointed to by inode (information node) 1234 has, for one reason or another, lost its filename. This can occur for several reasons, including a power failure or a computer reset without a proper disk sync.
Other common errors include zero time inodes, which are also due to the disk not being properly synced before shutdown. If you see these errors frequently and you've been shutting down your system correctly, you may have any number of other problems. In this case, you could begin by checking your power and data connections and your power supply for fluctuations or passing too much noise. Finally, check your hard disk parameters. I must caution you that altering the default hard disk parameters could do serious damage to your file system or corrupt your files—be careful.
One lost+found directory should be located in the root partition of each file system. If you have, for example, two mounted file systems, /usr and /home, you should have three lost+found directories. These directories will contain files whose inodes have become disconnected from their file names. The files in these directories will have the form ./#nnnn, where nnnn is the inode number used as the file name. You may be able to determine what the file is by inspecting it using cat. If cat returns what appears to be garbage, you probably have a binary file. In this case, you can do a chmod +x #nnnn, and then run the file. These procedures should give you enough information to learn what the file is. If the file is important, it can be renamed and moved to its original location; otherwise, it can be deleted.
The next utility we'll look at is dumpe2fs. To invoke this utility, type dumpe2fs device, to get the block group information for a particular device. Actually, you will get more information than you're likely to use, but if you understand the physical file system structure, the output will be comprehensible to you. A sample output is shown in Listing 1.
We really need only the first 22 lines of output. (The very first line with the version number is not part of the output table.) Most of these lines are fairly self-explanatory; however, one or two could use further explanation. The first line tells us the files system's magic number. This number is not random—it is always 0xEF53 for the ext2 file systems. The 0x prefix identifies this number as hexadecimal. The EF53 presumably means Extended Filesystem (EF) version and mod number 53. However, I am unclear about the background of the 53. (Original ext2fs versions had 51 as the final digits, and are incompatible with the current version.) The second line indicates whether a file system is clean or unclean. A file system that has been properly synced and unmounted will be labeled clean. A file system, which is currently mounted read-write or has not been properly synced prior to shutdown (such as with a sudden power failure or computer hard reset), will be labeled not clean. A not clean indication will trigger an automatic fsck on normal system boot.
Another important line for us is the block count (we'll need this later) that tells us how many blocks we have on the partition. We'll use this number when necessary with e2fsck and badblocks. However, I already know how many blocks I have on the partition; I see it every time I invoke df to check my hard drive disk usage. (If this were a game show, the raspberry would have sounded.) Check the output of df against dumpe2fs—it's not the same. The block count in dumpe2fs is the one we need. The number df gives us is adjusted to show us only the number of 1024k blocks we can actually access in one form or another. Superblocks, for example, aren't counted. Have you also noticed that the “used” and “available” numbers don't add up to the number of 1024k blocks? This discrepancy occurs because, by default, the system has reserved approximately five percent of these blocks. This percentage can be changed, as can many other parameters listed in the first 22 lines of the dumpe2fs readout; but again, unless you know what you are doing, I strongly recommend against it.
By the way, the information you are reading in the dumpe2fs is a translation into English of the partition superblock information listed in block one. Copies of the superblock are also maintained at each group boundary for backup purposes. The Blocks per group value tells us the offset for each superblock. The first begins at one, the succeeding are located at multiples of the Blocks per group value plus 1.
While we don't really need to use more than the first 22 lines of information, a quick look at the rest of the listing could be useful. The information is grouped by blocks and reflects how your disk is organized to store data. The superblocks are not specifically mentioned, but they are the first two blocks that are apparently missing from the beginning of each group. The block bitmap is a simple map showing the usage of the blocks in a group. This map contains a one or zero, corresponding to the used or empty blocks, respectively, in the group. The inode (information node) bitmap is similar to the block bitmaps, but corresponds to inodes in the group. The inode table is the list of inodes. The next line is the number of free blocks. Note that, while some groups have no free blocks, they all have free inodes. These inodes will not be used—they are extras. Some files use more than one block to store information, but need only one inode to reference the file, which explains the unused inodes.
Now that we have the information we need (finally), we can run badblocks. This utility does a surface scan for defects and is invoked by typing, as a minimum:
The device is the one we need to check (hda1, sda1, etc.) and the blocks-count is the value we noted after running dumpe2fs (above).
Four options are available with badblocks. The first option is the -b with the block size as its argument. This option is only needed if fsck will not run or is confused about the block size. The second option -o, which has a filename argument, will save to a file the block numbers badblocks considers bad to a file. If this option is not specified, badblocks will send all output to the screen (stdout). The third option is -v for verbose (self- explanatory). The final option is -w, which will destroy all the data on your disk by writing new data to every block and verifying the write. (Once again, you've been warned.)
Your best bet here is to run badblocks with the -o filename option. As bad blocks are encountered, they will be written to the file as a number, one to a line. This will be very helpful later on. In order to run badblocks in this way, the file system you are writing the file to must be mounted read-write. As root—and you should be root to do this maintenance—you can switch to your home directory, which should be located somewhere in the root partition. badblocks will save the file in the current directory unless you qualify the filename with a full pathname. If you need to mount the root partition read-write to write the file, simply type: mount -n -o remount,rw /.
Once you have your list of bad block numbers, you'll want to check these blocks to see if they are in use, and if not, set them as in use. If a block is already marked in use, we may want to clear the block (since the data in it might be corrupted), and reset it as allocated. Print the list of bad blocks—you'll need it later.
The final utility we will discuss is probably the most powerful and dangerous. With debugfs, you can modify the disk with direct disk writes. Since this utility is so powerful, you will normally want to invoke it as read-only until you are ready to actually make changes and write them to the disk. To invoke debugfs in read-only mode, do not use any switches. To open in read-write mode, add the -w switch. You may also want to include in the command line the device you want to work on, as in /dev/hda1 or /dev/sda1, etc. Once it is invoked, you should see a debugfs prompt.
We'll be looking at only a limited set of commands for the purposes of this article. I would refer you to the man pages, but the page for debugfs located on my system is out of date and does not accurately reflect debugfs' commands. To get a list, if not an explanation, at the debugfs prompt type ?, lr or list_requests.
The first command you can try is params to show the mode (read-only or read-write), and the current file system. If you run this command without opening a file system, it will almost certainly dump core and exit.
Two other commands, open and close, may be of interest if you are checking more than one file system. Close takes no argument, and appropriately enough, it closes the file system that is currently open. Open takes the device name as an argument.
If you wish to see disk statistics from the superblock, the command stats will display the information by group.
Now that you've had a chance to look at a few of debugfs' functions, let's get to work fixing our hard disk. From the printed list of bad blocks, we need to see which blocks are in use and which files are using them. For this we'll use testb with each block number as an argument. If the test says the block is not in use, we know we have'nt lost any data here yet.
If the block is marked as in use, you'll want to find out which file is using this block. We can find the inode by using:
which will return the inode that points to the block. From here, we can use
to get the name of the file corresponding to the inode. Now we finally have something we can work with. You may want to try to save the file, but if the block really is bad, you're probably better off reinstalling this file from a backup disk. To free the block, you can use one of several commands; the one I recommend is:
This will deallocate the inode and its corresponding blocks. Remember, you'll have to be in read-write mode to do this. Note that these commands are irrevocable in read-write mode.
Once the bad block has been deallocated, you can use:
to permanently allocate the block, removing the inode that points to it from the pool of free inodes.
That's it. Once the appropriate changes have been made to set the blocks, you can quit debugfs and reboot. You should not see more problems unless you missed a block (or have grown more bad blocks).
Good disk maintenance requires periodic disk checks. Your best tool is fsck, and should be run at least monthly. Default checks will normally be run after 20 system reboots, but if your system stays up for weeks at a time as mine often does, you'll want to force a check from time to time. Your best bet is performing routine system backups and checking your lost+found directories from time to time. The dumpe2fs utility will provide important information regarding hard disk operating parameters found in the superblock, and badblocks will perform surface checking. Finally, surgical procedures to remove areas grown bad on the disk can be accomplished using debugfs.
David Bandel is a Computer Network Consultant specializing in Linux, but he begrudgingly works with Windows and those “real” Unix boxes like DEC 5000s and Suns. When he's not working, he can be found hacking his own system or enjoying the view of Seattle from 2,500 feet up in an airplane. He welcomes your comments, criticisms, witticisms, and will be happy to further obfuscate the issue. You may reach him via e-mail at [email protected] or snail mail c/o Linux Journal.