Recovering from a Hard Drive Failure

Have you ever woken up in the morning and said to yourself, “Today is the day that I'm finally going to back up my workstation!” only to find out that you're a day late and about 320GB short? Well, that's about what happened to me recently, but don't worry, the story has a happy ending. I'm getting ahead of myself, though.

Most people's excuse for not performing routine maintenance or regular backups is that they just don't have time. So when I discovered that I had some down time, I decided to take care of a few issues on my workstation. I performed a system update. Since I leave my system on all the time, I decided to upgrade the kernel and try to get software suspend working so I could cut down on energy consumption and heat production in my office. Finally, I resolved to finish backing up my home directory.

The system update went without incident and the kernel compiled and installed without error. The next step was to reboot into the new kernel. When the kernel panic'ed, I figured that I had missed something in the kernel configuration, so I rebooted back to my older kernel, which also panic'ed. Since this system had been running not 15 minutes ago, I knew things were about to get ugly.

At this point, I remembered that I had been doing some testing with an Ubuntu live CD, so I booted the live CD. At least now, I could get some work done, even though my workstation was “toes up.” This would also give me a platform from which to work on my regular hard drive, or so I thought. When I attempted to mount /dev/sda3, I was told that it didn't exist. Fdisk told me that my partition table was mostly gone! All that was left was /dev/sda1, where I keep my kernel, and /dev/sda2, which is where I swap. I posted a message describing my situation to the Gentoo user's group and was told that I should look into a program called testdisk.

I figured that I should at least assess /dev/sda1, so I tried to mount it. No such luck. The filesystem wasn't recognized. A quick look at /proc/filesystems told me that Ubuntu hadn't loaded ext2 support into the kernel. Further investigation revealed that Ubuntu loaded all of its drivers from an initial ram disk and they weren't immediately available in /lib/modules. I couldn't bring myself to dissect an initial ram disk image on a system that was RUNNING on a ram disk, so out came the Gentoo installation CD.
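The /proc/filesystems check is easy to script. What follows is a minimal sketch rather than the commands used at the time; the function name fs_supported and its optional second argument are hypothetical and exist only for illustration.

```shell
# fs_supported FSTYPE [TABLE]
# Succeeds if FSTYPE is listed in TABLE (default: /proc/filesystems).
# Hypothetical helper, for illustration only.
fs_supported() {
    fstype=$1
    table=${2:-/proc/filesystems}
    # Each line is either "nodev<TAB>name" or "<TAB>name", so the
    # filesystem name is always the last field on the line.
    awk -v want="$fstype" '$NF == want { found = 1 } END { exit !found }' "$table"
}

# Example use (commented out so the block has no side effects):
# fs_supported ext2 || echo "ext2 is not available in the running kernel"
```

If the check fails, the driver may still exist as a loadable module; on the live CD described here, it evidently did not.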

It was while watching the Gentoo CD boot that I saw the IDE disk seek error messages for the first time. I don't reboot my system very often and the Ubuntu live CD hides those messages from you, so who knows how long I'd been working with a drive that needed to be replaced?

Once the Gentoo CD had booted, it was time to try to recover my system. I discovered that testdisk wasn't installed on the CD, so I had to wget and untar it first. Oddly enough, I had to run testdisk and reboot a couple times before I had a partition table that looked sane. When I tried to mount the filesystem, I was told that mount couldn't find a valid filesystem. As a last-ditch effort, I decided to try to fsck the filesystem anyway. The fsck program reported that it couldn't find a superblock, but this was the first good news I had received so far; I knew I could use the -b parameter and ask fsck to use a backup superblock. At least fsck hadn't choked completely. So, I issued a command like fsck -y -t ext2 -b 8192 /dev/sda3 to see what would happen. When fsck started to spew error messages indicating fix-ups it was performing, I decided that the process would take a while and went to bed for the night.
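The number passed to -b depends on the filesystem's block size: the e2fsck man page lists 8193 for 1k-block filesystems, 16384 for 2k and 32768 for 4k, and mke2fs -n will print the full list without writing anything, provided it is given the same parameters used when the filesystem was created. The sketch below computes candidate locations by hand. It assumes default mke2fs settings (the sparse_super layout and 8 x block-size blocks per group), and backup_superblocks is a hypothetical helper name.

```shell
# backup_superblocks BLOCK_SIZE MAX_GROUP
# Print candidate backup-superblock block numbers for an ext2 filesystem,
# assuming default mke2fs settings. Hypothetical helper, illustration only.
backup_superblocks() {
    block_size=$1      # filesystem block size in bytes: 1024, 2048 or 4096
    max_group=$2       # highest block group number to consider
    per_group=$((block_size * 8))
    # With 1 KiB blocks the first data block is block 1, otherwise block 0.
    if [ "$block_size" -eq 1024 ]; then first=1; else first=0; fi
    {
        # With sparse_super, backups live in group 1 and in groups
        # that are powers of 3, 5 and 7.
        echo 1
        for base in 3 5 7; do
            g=$base
            while [ "$g" -le "$max_group" ]; do
                echo "$g"
                g=$((g * base))
            done
        done
    } | sort -n | while read -r g; do
        if [ "$g" -le "$max_group" ]; then
            echo $((g * per_group + first))
        fi
    done
}

# Example: backup_superblocks 4096 10
# prints 32768, 98304, 163840, 229376 and 294912, one per line.
```

Any of the printed values is a reasonable thing to try with fsck -b if the first one also turns out to be damaged.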

When I woke up, I found that fsck had finished so I mounted the resulting filesystem. I was really hoping to see all of my files intact, but no, all I saw was /lost+found. When I cd'ed into the lost+found directory, I got my first glimpse of just how bad things had been. The fsck program had done its job and recovered my filesystem, but it was unable to recover any of the file names at the root of the partition, so it moved the files to the lost+found directory and renamed each file after its inode number. All I had was a list of files and directories with names resembling #19539303. And the directory list was several screens in length; I usually keep a pretty clean / directory, so obviously, fsck had encountered a lot of trouble.

One of these oddly-named directories was my /home directory. I made an educated guess as to which one that was and sure enough, I had user directories. (My /home directory was the one reported with the largest file size.) Deeper inspection revealed that most of my files seemed to be there, and they were properly named! I was in business!
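Sorting the entries in lost+found by disk usage is one way to make that educated guess. This is a sketch under the assumption that the directory you want is among the largest; largest_entries is a hypothetical helper name, and the mount point in the example is made up.

```shell
# largest_entries DIR [COUNT]
# List the COUNT largest entries directly under DIR (default 10),
# biggest first. Hypothetical helper, for illustration only.
largest_entries() {
    dir=$1
    count=${2:-10}
    # du -sk prints "SIZE_IN_KIB<TAB>PATH" for each entry; sort
    # numerically in reverse so the biggest come first.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example (the mount point is an assumption):
# largest_entries /mnt/rescue/lost+found 5
```

Walking a large tree on a failing drive means a lot of reads, so this is best run against a copy of the data if one exists.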

When my new disk arrived, I installed it and started copying my old files onto the new drive. I was immediately struck by how slow this process was going. It was as if I were transferring the files over a dial-up modem! It didn't help that the IDE subsystem had reset a few times in the process. At this rate the new drive would be out of warranty by the time my file recovery was complete, so I had to do something. It turns out that I had accumulated a lot of files in my home directory that I really didn't need. I had downloaded games and other software and simply built them in my home directory rather than installed them on the system. After I had pruned out all of the files and directories that I didn't care about, I was able to recover the rest of my /home directory.

So there you have it. When I started, I had a dead machine, a failing hard drive, a corrupt partition table, and a corrupt filesystem. When I had finished, I had at least recovered the important files from the system and had been able to carry on my day-to-day work without too much interruption, thanks to the Live CD. But there are some lessons to be learned here, which is why I chose to write about my experience.

I should have backed up yesterday. But for the record, my business files were on my server and I have redundant, off-site backups of them. I was mostly interested in recovering my password wallet, a few pictures and videos that I'd saved, and a few miscellaneous documents. OK, lesson learned.

But there's more. I was grateful to be able to keep running using a Live CD. However, I'm a KDE user and the Ubuntu CD that I had was Gnome-based. I got my work done, but it would have been nice to be in an environment that I was accustomed to using. In the future, I'll be keeping a Knoppix or Kubuntu CD handy.

I also found that my Gentoo CD just wasn't up to the task of system recovery. I'll be burning a genuine recovery disk, as soon as I have a system on which to burn CDs.

I really needed to have a set of emergency CDs handy for this situation. I could see having a CD wallet that had a Live CD, a Recovery CD, and an Installation CD. Having these CDs handy would have saved me a lot of time.

That said, I have to say that I'm glad to have been able to recover my data and that I wasn't down too terribly long in the process. I also wanted to mention how helpful the Linux Community is in times like this. I'm a fairly experienced Linux user, but it was sure nice to be able to ask questions before I actually committed changes to disk. I hope my tale of woe serves as both warning and encouragement to you; stuff happens, and you can recover from it.

______________________

Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com.

Comments

Can I use System Rescue CD?

Avanca Linux's picture

For this trick, may I use System Rescue CD?
Thanks.

Don't touch that disk!

JaapvB's picture

Congrats on getting your data back. Whenever this happens to me (sometimes I use old hard disks from the office...), I turn the computer off, grab a cup of coffee and think. Then I boot my computer with my Parted Magic USB stick and connect an extra hard disk by USB (if pressed for hard disk space). Then I image the faulty disk with dd (noerror option!) to a file on the USB hard disk. Then I run fsck on this file (you could make a second copy of this file, to stay on the safe side), so I don't further degrade the data on the disk.
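The imaging step in this comment might look like the sketch below. It images a single partition rather than the whole disk to keep the example short; the function name image_partition and every path in the example are assumptions, not the commenter's actual commands.

```shell
# image_partition SRC DEST
# Copy SRC (a partition such as /dev/sda3) to the image file DEST.
# Hypothetical helper, for illustration only.
image_partition() {
    src=$1
    dest=$2
    # noerror: keep going after a read error instead of stopping.
    # sync:    pad any short or failed read with zeros up to the block
    #          size, so later data stays at the correct offset.
    # A small block size limits how much is zeroed per unreadable block.
    dd if="$src" of="$dest" bs=4096 conv=noerror,sync
}

# Example (device and paths are assumptions):
# image_partition /dev/sda3 /mnt/usb/sda3.img
# cp /mnt/usb/sda3.img /mnt/usb/sda3-work.img   # keep an untouched copy
# e2fsck -f -y /mnt/usb/sda3-work.img           # repair the copy, not the disk
```

Because conv=sync pads the final block, the image can be slightly longer than a source whose size is not a multiple of the block size.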

Parted Magic

Moustafa's picture

I recommend you give Parted Magic a look. It's a Linux-based live CD that runs from a RAM disk and is filled with a number of useful recovery tools.

It uses Xfce to run the desktop, so it's very light and very fast to boot and use.

dd is your friend!

killbox's picture

With a live CD, an external USB HDD and some patience, you can save yourself a lot of data recovery headaches. Use dd to make an image of the whole drive and gzip it to your external HDD. That way you can restore the image back to the HDD (or to another drive) and then try various recovery techniques until you have what you want back.
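A compressed image and its restore can be sketched as a pair of pipelines. The function names save_image_gz and restore_image_gz, and the devices and paths in the example, are hypothetical.

```shell
# save_image_gz SRC DEST
# Image SRC and compress the result into DEST on the fly.
# Hypothetical helper, for illustration only.
save_image_gz() {
    src=$1
    dest=$2
    # The pipeline's exit status is gzip's, so check dd's messages on
    # stderr for read errors as well.
    dd if="$src" bs=4096 conv=noerror,sync | gzip -c > "$dest"
}

# restore_image_gz SRC DEST
# Decompress the image SRC and write it to DEST (a drive or a file).
restore_image_gz() {
    src=$1
    dest=$2
    gzip -dc "$src" | dd of="$dest" bs=4096
}

# Example (devices and paths are assumptions):
# save_image_gz /dev/sda /mnt/usb/sda.img.gz
# restore_image_gz /mnt/usb/sda.img.gz /dev/sdb
```

Restoring onto a device overwrites it completely, so the target name deserves a second look before pressing Enter.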

Wot! No smartmon

Neil's picture

Everyone should be running smartmon on their systems, with a weekly self test. It won't warn you that a spindle is about to snap, but it should detect a gradual failure like this.
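Assuming "smartmon" here means the smartmontools package, its smartctl command covers the checks mentioned. The wrappers below are a sketch with hypothetical function names; the smartctl options themselves (-H, -t long, -l selftest) are standard.

```shell
# Hypothetical wrappers around smartctl, for illustration only.

# Print the drive's overall SMART health assessment.
disk_health() {
    smartctl -H "$1"
}

# Start an extended (long) self-test; it runs inside the drive in the
# background and can take hours on a large disk.
start_long_selftest() {
    smartctl -t long "$1"
}

# Show the results of previous self-tests.
selftest_log() {
    smartctl -l selftest "$1"
}

# A weekly test could be scheduled from root's crontab, for example
# (the device name and the path to smartctl are assumptions):
# 0 3 * * 0 /usr/sbin/smartctl -t long /dev/sda
```

The smartd daemon from the same package can also schedule self-tests and send mail when a drive's attributes start to degrade.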

Learned something about hard disks in my life :>

Wladimir Mutel's picture

First, I would power the system off and probably let the disk cool down. Then, only upon arrival of the new disk, I would copy the old one block-by-block to the new one with ddrescue or dd_rescue/gddrescue. In one pass. Then put that old disk aside forever. Then, if any partitions are lost, I would search for them with testdisk or gpart. Then fsck and whatever. The number of either read or write operations on the old failed disk should indeed be minimized.

But instead, from the start, it should be known that using a single large disk for your important data is certainly a recipe for disaster in the long run. And it should never hurt to spend on 2 or more disks and make a RAID1 or RAID5 array out of them. And to implement some backup strategy, of course, too.
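The block-by-block copy suggested in this comment could be sketched with GNU ddrescue as follows. The function name clone_failing_disk and the device and mapfile names in the example are assumptions.

```shell
# clone_failing_disk OLD NEW MAPFILE
# Copy the failing disk OLD onto NEW, recording progress in MAPFILE.
# Hypothetical helper, for illustration only.
clone_failing_disk() {
    old=$1
    new=$2
    map=$3
    # Refuse the obvious mistake of copying a disk onto itself.
    if [ "$old" = "$new" ]; then
        echo "source and destination are the same device" >&2
        return 1
    fi
    # -f (--force) is required because the output is a block device
    # rather than a regular file. The mapfile records which areas have
    # been copied, so an interrupted run can be resumed by repeating
    # the same command.
    ddrescue -f "$old" "$new" "$map"
}

# Example (device names and mapfile path are assumptions):
# clone_failing_disk /dev/sda /dev/sdb /root/sda.map
```

Keeping the mapfile on a third, healthy disk avoids writing anything at all to the failing drive.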

fsck

coop's picture

I have to echo this: doing fsck with repair is about the worst thing you can do and anyone following this advice can do infinitely more damage than has already occurred. (been there, done that :( ). Stick with dd and do dissection on the image.

PCLOS for rescue

Dulwithe's picture

Ubuntu, Gentoo...???

Next time, try PCLOS for recovery issues. Really great, and KDE as you like.

D.

Recovery distros

xutre's picture

My 2 cents worth: distros such as RIPlinux (recovery is possible), SystemRescue, TRK (TrinityRescueKit) etc and to a lesser extent GParted and PartedMagic, were created for times like yours/those; (and whilst on the subject, rebuilding the LiveCD/USB with sleuthkit and autopsy added, is not a bad idea.)

Nice!

Mike Diehl's picture

As I said in the article, I found myself not as well prepared as I would have liked. Xutre, your reply was a wealth of information. Thank you.

Not fsck!

Aronzak's picture

No, definitely don't try fsck on a dead disk. If fsck notices problems in a partition, it will usually try to fix them. This means that you'll just lose more data. Don't fiddle with dead disks; you could end up losing otherwise recoverable data.

The first thing you should do is use dd or ddrescue to get as much as possible off the disk, and into an image on another. Then you can play around with the data without risk of losing the whole lot. There are great tools that can recover partition information, and others, like foremost, that can attempt to recover files from a disk image. This is much safer than playing around with a partially damaged, dying disk.

Full of typos, misspellings and improper use of it's

Anonymous's picture

Your editors may know some geek, but they certainly don't know how to spell or the difference between IT'S and ITS.

One of several examples is that none of them noticed that in your phrase, "The system update went without indecent," the word that you obviously intended was "incident."

You may have intelligent things to say, but your readers won't know that if you can't communicate them intelligently.

someday LJ will have real editors

Anonymous's picture

It's "panicked", not "panic'ed." Once is a typo, twice is a cry for help :)

Nice article, we've all been there once or twice.

Spelling and grammar police

kenholmz's picture

It would be nice if the spelling and grammar police would offer something helpful related to the article or else just go somewhere else. You are useless here.

Poetic License

Mitch Frazier's picture

If the kernel had just received an audit notification from the IRS then "panicked" would be correct, and in this case "panicked" is not incorrect, but since he's talking about a "kernel panic" this is poetic license. E.g. "I ssh'd into the system" or "I googled it".

The great thing about English is that there's no noun that can't be verbed.

Mitch Frazier is an Associate Editor for Linux Journal.

We have 'em

Webmistress's picture

While we have been known to make mistakes, our editors are in fact so multi-talented that they speak Geek as well as English, and thus are able to discern the finer differences between actual mistakes and the intentional alterations of "geek-speak."

In this case "panic'ed" is accepted terminology. It refers to a kernel panic.

Katherine Druckman is webmistress at LinuxJournal.com. You might find her on Twitter or at the Southwest Drupal Summit.

So the rest of the

Anonymous's picture

So the rest of the misspellings and bad grammar are just geek-speak? Oh I know, it's better to spend all kinds of time making dopey excuses than learning correct English. Sorry, I thought LJ was a professional publication, not some pre-teen's basement blog.

It's called Linux Journal,

cyb's picture

It's called Linux Journal, not English Journal. Nobody really cares about the grammar; we like Linux info/stories.

So you're posting in the

brian_'s picture

So you're posting in the sysadmin category, and list yourself as a self-employed administrator, and you are only just learning this now? Backups are so easy to do these days, and a 1TB drive is now $100.

But I think the biggest problem here is the order of operations:
Step 1. Upgrade kernel
Step 2. Get suspend working
Step 3. Back stuff up (in case step 1 or 2 goes bad??)

Any kind of maintenance/upgrades are exactly the sort of thing you typically need backups for.

Missed the point.

Mike Diehl's picture

I think you missed a few points.

1. My WORK data was backed up off-site. I was hoping to recover some of my PERSONAL data.

2. I started with a corrupt partition table, a corrupt root filesystem, and misnamed files in /lost+found, and recovery was still possible. I think that's nice to know!

3. The kernel upgrade didn't cause the problem. A hard disk that had probably failed WEEKS ago caused the problem. As mentioned in the article, I was able to fall back to the previous version of the kernel.

4. The system continued to run, even though the drive was in quite bad shape.

The point of posting this in Sys Admin wasn't to serve as an example of how to run, say, a data center. The point was to demonstrate that even when bad things catch you by surprise, they may not be as bad as they appear.

Still, I hope it was an interesting read.

Me Too

Karl's picture

I just had a hard drive crash a couple weeks ago, my first one with a RAID 1 setup. It was totally worth the cost of the spare. You should look into it.
