Recovering from a Hard Drive Failure
Have you ever woken up in the morning and said to yourself, “today is the day that I'm finally going to back up my workstation!” only to find out that you're a day late and about 320GB short? Well, that's about what happened to me recently, but don't worry, the story has a happy ending. I'm getting ahead of myself though.
Most people's excuse for not performing routine maintenance or regular backups is that they just don't have time. So when I discovered that I had some down time, I decided to take care of a few issues on my workstation. I performed a system update. Since I leave my system on all the time, I decided to upgrade the kernel and try to get software suspend working so I could cut down on energy consumption and heat production in my office. Finally, I resolved to finish backing up my home directory.
The system update went without incident and the kernel compiled and installed without error. The next step was to reboot into the new kernel. When the kernel panic'ed, I figured that I had missed something in the kernel configuration, so I rebooted back to my older kernel, which also panic'ed. Since this system had been running not 15 minutes ago, I knew things were about to get ugly.
At this point, I remembered that I had been doing some testing with an Ubuntu live CD, so I booted the live CD. At least now, I could get some work done, even though my workstation was “toes up.” This would also give me a platform from which to work on my regular hard drive, or so I thought. When I attempted to mount /dev/sda3, I was told that it didn't exist. Fdisk told me that my partition table was mostly gone! All that was left was /dev/sda1, where I keep my kernel, and /dev/sda2, which is where I swap. I posted a message describing my situation to the Gentoo user's group and was told that I should look into a program called testdisk.
I figured that I should at least assess /dev/sda1, so I tried to mount it. No such luck. The filesystem wasn't recognized. A quick look at /proc/filesystems told me that Ubuntu hadn't loaded ext2 support into the kernel. Further investigation revealed that Ubuntu loaded all of its drivers from an initial ram disk and they weren't immediately available in /lib/modules. I couldn't bring myself to dissect an initial ram disk image on a system that was RUNNING on a ram disk, so out came the Gentoo installation CD.
It was while watching the Gentoo CD boot that I saw the IDE disk seek error messages for the first time. I don't reboot my system very often and the Ubuntu live CD hides those messages from you, so who knows how long I'd been working with a drive that needed to be replaced?
Once the Gentoo CD had booted, it was time to try to recover my system. I discovered that testdisk wasn't installed on the CD, so I had to wget and untar it first. Oddly enough, I had to run testdisk and reboot a couple times before I had a partition table that looked sane. When I tried to mount the filesystem, I was told that mount couldn't find a valid filesystem. As a last-ditch effort, I decided to try to fsck the filesystem anyway. The fsck program reported that it couldn't find a superblock, but this was the first good news I had received so far; I knew I could use the -b parameter and ask fsck to use a backup superblock. At least fsck hadn't choked completely. So, I issued a command like fsck -y -t ext2 -b 8192 /dev/sda3 to see what would happen. When fsck started to spew error messages indicating fix-ups it was performing, I decided that the process would take a while and went to bed for the night.
When I woke up, I found that fsck had finished so I mounted the resulting filesystem. I was really hoping to see all of my files intact, but no, all I saw was /lost+found. When I cd'ed into the lost+found directory, I got my first glimpse of just how bad things had been. The fsck program had done its job and recovered my filesystem, but it was unable to recover any of the file names at the root of the partition, so it moved the files to the lost+found directory and renamed each file after its inode number. All I had was a list of files and directories with names resembling #19539303. And the directory list was several screens in length; I usually keep a pretty clean / directory, so obviously, fsck had encountered a lot of trouble.
One of these oddly-named directories was my /home directory. I made an educated guess as to which one that was and sure enough, I had user directories. (My /home directory was the one reported with the largest file size.) Deeper inspection revealed that most of my files seemed to be there, and they were properly named! I was in business!
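One way to make that kind of educated guess is to rank the entries in lost+found by disk usage. The following is only a sketch under stated assumptions, not the commands actually used here: it builds a scratch directory with two made-up entries (the names and the `mike` subdirectory are hypothetical stand-ins) so it is harmless to run, then lets du and sort pick the largest one.

```shell
#!/bin/sh
# Sketch: rank lost+found entries by disk usage to spot a likely /home.
# Everything lives in a scratch directory; names are hypothetical stand-ins.
set -eu
work=$(mktemp -d)
mkdir -p "$work/lost+found/#19539303/mike" "$work/lost+found/#19539304"

# Give the "home" candidate far more data than the other entry.
dd if=/dev/urandom of="$work/lost+found/#19539303/mike/notes" bs=1024 count=512 2>/dev/null
dd if=/dev/urandom of="$work/lost+found/#19539304/small" bs=1024 count=4 2>/dev/null

# du -sk prints one total (in KiB) per entry; sort -rn puts the largest first.
# The ./ prefix and the quoting keep names that start with '#' from being
# mistaken for comments or options.
largest=$(cd "$work/lost+found" && du -sk ./* | sort -rn | head -n 1 | cut -f2)
echo "largest entry: $largest"

rm -rf "$work"
```

On a real recovery the same du and sort pipeline would be run from inside the mounted lost+found directory, and the top few candidates inspected by hand.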
When my new disk arrived, I installed it and started copying my old files onto the new drive. I was immediately struck by how slow this process was going. It was as if I were transferring the files over a dial-up modem! It didn't help that the IDE subsystem had reset a few times in the process. At this rate the new drive would be out of warranty by the time my file recovery was complete, so I had to do something. It turns out that I had accumulated a lot of files in my home directory that I really didn't need. I had downloaded games and other software and simply built them in my home directory rather than installing them on the system. After I had pruned out all of the files and directories that I didn't care about, I was able to recover the rest of my /home directory.
So there you have it. When I started, I had a dead machine, a failing hard drive, a corrupt partition table, and a corrupt filesystem. When I had finished, I had at least recovered the important files from the system and had been able to carry on my day-to-day work without too much interruption, thanks to the Live CD. But there are some lessons to be learned here, which is why I chose to write about my experience.
I should have backed up yesterday. But for the record, my business files were on my server and I have redundant, off-site backups of them. I was mostly interested in recovering my password wallet, a few pictures and videos that I'd saved, and a few miscellaneous documents. OK, lesson learned.
But there's more. I was grateful to be able to keep running using a Live CD. However, I'm a KDE user and the Ubuntu CD that I had was Gnome-based. I got my work done, but it would have been nice to be in an environment that I was accustomed to using. In the future, I'll be keeping a Knoppix or Kubuntu CD handy.
I also found that my Gentoo CD just wasn't up to the task of system recovery. I'll be burning a genuine recovery disk, as soon as I have a system on which to burn CDs.
I really needed to have a set of emergency CDs handy for this situation. I could see having a CD wallet that had a Live CD, a Recovery CD, and an Installation CD. Having these CDs handy would have saved me a lot of time.
That said, I have to say that I'm glad to have been able to recover my data and that I wasn't down too terribly long in the process. I also wanted to mention how helpful the Linux Community is in times like this. I'm a fairly experienced Linux user, but it was sure nice to be able to ask questions before I actually committed changes to disk. I hope my tale of woe serves as both warning and encouragement to you; stuff happens, and you can recover from it.
Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com
Comments
Can I use System Rescue CD
For this procedure, could I use System Rescue CD?
Thanks
Don't touch that disk!
Congrats at getting your data back. Whenever this happens to me (sometimes I use old hard disks from the office...), I turn the computer off, grab a cup of coffee and think. Then I boot my computer with my Parted Magic usb stick and connect by usb an extra hard disk (if pressed for hard disk space). And then, I image with dd (noerror option!) the faulty disk to a file on the usb hard disk. Then I run fsck on this file (you could make a second copy of this file, to stay on the safe side), thus not further degrading my data on the disk.
Parted Magic
I recommend you give Parted Magic a look. It's a Linux-based live CD that runs from RAM and comes with a number of useful recovery tools.
It uses Xfce to run the desktop, so it's very light and very fast to boot and use.
dd is your friend!
With a live CD, an external USB HDD and some patience, you can save yourself a lot of data recovery headaches. Use dd to make an image of the whole drive and gzip it to your external HDD. That way you can restore the image back to the HDD (or to another drive) and then start with various recovery techniques until you have what you want back.
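The imaging approach described in the comments above can be sketched roughly as follows. This is a minimal illustration rather than a tested rescue procedure: `SRC` and `IMG` are hypothetical names, and `SRC` points at a small stand-in file so the script is safe to run. On a real rescue it would be the failing device (something like /dev/sda, which is an assumption about your system) and `IMG` would live on the external drive.

```shell
#!/bin/sh
# Sketch: image a "disk" with dd, compress it with gzip, verify the round trip.
# SRC is a stand-in file, not a real device; the names are hypothetical.
set -eu
work=$(mktemp -d)
SRC="$work/fake-disk.bin"
IMG="$work/disk.img.gz"

# Build a 1 MiB stand-in "disk" so the example is harmless.
dd if=/dev/urandom of="$SRC" bs=1024 count=1024 2>/dev/null

# conv=noerror keeps dd going after a read error; sync pads short or failed
# reads with zeros so the data that follows keeps its offset in the image.
dd if="$SRC" bs=4096 conv=noerror,sync 2>/dev/null | gzip -c > "$IMG"

# Restore the image to a second file and compare it with the source.
gzip -dc "$IMG" | dd of="$work/restored.bin" bs=4096 2>/dev/null
if cmp -s "$SRC" "$work/restored.bin"; then
    result="image matches source"
else
    result="image differs"
fi
echo "$result"

rm -rf "$work"
```

Recovery tools such as fsck or testdisk can then be pointed at a decompressed copy of the image instead of the failing drive, which is the point several commenters make.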
Wot! No smartmon
Everyone should be running smartmon on their systems, with a weekly self test. It won't warn you that a spindle is about to snap, but it should detect a gradual failure like this.
"Most people's excuse for
"Most people's excuse for not performing routine maintenance or regular backups is that they just don't have time. So when I discovered that I had some down time, I decided to take care of a few issues on my workstation. I performed a system update. Since I leave my system on all the time, I decided to upgrade the kernel and try to get software suspend working so I could cut down on energy consumption and heat production in my office. Finally, I resolved to finish backing up my home directory."
Thanks for the information
Learned something about hard disks in my life :>
First, I would power the system off and probably let the disk cool down. Then, only upon arrival of the new disk, I would copy the old one block-by-block to the new one with ddrescue or dd_rescue/gddrescue. In one pass. Then put that old disk aside forever. Then, if any partitions are lost, I would search for them with testdisk or gpart. Then fsck and whatever. The number of either read or write operations on the old failed disk should indeed be minimized.
But instead, from the start, it should be known that using a single large disk for your important data is certainly a recipe for disaster in the long run. And it should never hurt to spend on 2 or more disks and make RAID1 or RAID5 array out of them. And to implement some backup strategy, of course, too.
fsck
I have to echo, doing fsck with repair is about the worst thing you can do and anyone following this advice can do infinitely more damage than has already occurred. (been there, done that :( ). Stick with dd and do dissection on the image.
PCLOS for rescue
Ubuntu, Gentoo...???
Next time, try PCLOS for recovery issues. Really great, and KDE as you like.
D.
Recovery distros
My 2 cents worth: distros such as RIPlinux (recovery is possible), SystemRescue, TRK (TrinityRescueKit) etc and to a lesser extent GParted and PartedMagic, were created for times like yours/those; (and whilst on the subject, rebuilding the LiveCD/USB with sleuthkit and autopsy added, is not a bad idea.)
Nice!
As I said in the article, I found myself not as well prepared as I would have liked. Xutre, your reply was a wealth of information. Thank you.
Not fsck!
No, definitely don't try fsck on a dead disk. If fsck notices problems in a partition, it will usually try to fix them. This means that you'll just lose more data. Don't fiddle with dead disks; you could end up losing otherwise recoverable data.
The first thing you should do is use dd or ddrescue to get as much as possible off the disk, and into an image on another. Then you can play around with the data without risk of losing the whole lot. There are great tools that can recover partition information, and others, like foremost, that can attempt to recover files from a disk image. This is much safer than playing around with a partially damaged, dying disk.
Full of typos, misspellings and improper use of it's
Your editors may know some geek, but they certainly don't know how to spell or the difference between IT'S and ITS.
One of several examples is that none of them noticed that in your phrase, "The system update went without indecent," the word that you obviously intended was "incident."
You may have intelligent things to say, but your readers won't know that if you can't communicate them intelligently.
someday LJ will have real editors
It's "panicked", not "panic'ed." Once is a typo, twice is a cry for help :)
Nice article, we've all been there once or twice.
Spelling and grammar police
It would be nice if the spelling and grammar police would offer something helpful related to the article or else just go somewhere else. You are useless here.
Poetic License
If the kernel had just received an audit notification from the IRS then "panicked" would be correct, and in this case "panicked" is not incorrect, but since he's talking about a "kernel panic" this is poetic license. E.g. "I ssh'd into the system" or "I googled it".
The great thing about English is that there's no noun that can't be verbed.
Mitch Frazier is an Associate Editor for Linux Journal.
We have 'em
While we have been known to make mistakes, our editors are in fact so multi-talented that they speak Geek as well as English, and thus are able to discern the finer differences between actual mistakes and the intentional alterations of "geek-speak."
In this case "panic'ed" is accepted terminology. It refers to a kernel panic.
Katherine Druckman is webmistress at LinuxJournal.com. You might find her on Twitter or at the Southwest Drupal Summit
So the rest of the
So the rest of the misspellings and bad grammar are just geek-speak? Oh I know, it's better to spend all kinds of time making dopey excuses than learning correct English. sorry, I thought LJ was a professional publication, not some pre-teen's basement blog.
It's called Linux Journal,
It's called Linux Journal, not English Journal.. nobody really cares about the grammar, we like Linux info/stories.
So you're posting in the
So you're posting in the sysadmin category, and list yourself as a self-employed administrator, and you are only just learning this now? Backups are so easy to do these days, and a 1TB drive is now $100.
But I think the biggest problem here is the order of operations:
Step 1. Upgrade kernel
Step 2. Get suspend working
Step 3. Back stuff up (in case step 1 or 2 goes bad??)
Any kind of maintenance/upgrades are exactly the sort of thing you typically need backups for.
Missed the point.
I think you missed a few points.
1. My WORK data was backed up off-site. I was hoping to recover some of my PERSONAL data.
2. I started with a corrupt partition table, a corrupt root filesystem, and misnamed files in /lost+found, and recovery was still possible. I think that's nice to know!
3. The kernel upgrade didn't cause the problem. A hard disk that had probably failed WEEKS ago caused the problem. As mentioned in the article, I was able to fall back to the previous version of the kernel.
4. The system continued to run, even though the drive was in quite bad shape.
The point of posting this in Sys Admin wasn't to serve as an example of how to run, say, a data center. The point was to demonstrate that even when bad things catch you by surprise, they may not be as bad as they appear.
Still, I hope it was an interesting read.
Me Too
I just had a hard drive crash a couple weeks ago, my first one with a RAID 1 setup. It was totally worth the cost of the spare. You should look into it.