Know when your drives are failing, with smartd
“Ka-chunk... ka-chunk... ka-chunk... tick... tick... tick... Ka-chunk... ka-chunk...” That's just not a sound you ever want to hear coming from a hard drive. It's the sound of a hard drive trying to move its read/write heads into a position that they don't seem to want to go to, or trying to read a sector that just isn't there anymore. Of course, modern hard drives have come a long way and are amazingly reliable, but if you work with computers long enough, you're bound to have one fail on you.
I know that a lot has been written about smartd and I was hesitant to pile on even more, but let's just say, for now, that I write about what I know about, and that lately, I've come to know a lot about disk drive failures...
Running smartd on your servers and workstations is just like performing backups; it's something you know you should be doing, but probably aren't, or at least not regularly. I'm hoping that this article will help convince you that now's the time to start.
SMART, or Self-Monitoring, Analysis, and Reporting Technology, is a capability built into almost all modern IDE and SCSI disk drives that allows the drive controller to monitor its own state of health. What that means is that a SMART-capable drive can give you a clue that it's about to fail. All you have to do is listen for these clues, and that's where smartd comes in. Smartd is part of the smartmontools package and runs as a daemon on your system. Periodically, smartd will poll your installed hard drives and, essentially, “ask them” how they're doing. Smartmontools supports ATA/ATAPI/SATA-3 to -8 disks and SCSI, so smartd should be usable across a wide range of drives.
Performing a basic smartd configuration is almost ridiculously simple. Most of us are using Linux distributions that have some sort of package management feature, so installing smartmontools should be fairly simple, since the suite has been packaged for all of the major distributions. Once installed, configuration is as easy as it gets. The tool, by default, will discover and scan all of the drives in your system, and the default configuration file seems reasonable. Most of the time, I don't even bother to make changes to the /etc/smartd.conf file. You only need to change the configuration file if you need smartd to handle ill-behaved hardware, or if you need it to perform specific, non-default functions. Finally, you have to arrange for your system to start smartd as part of the system start-up process. This part depends on which distribution you use. On my Gentoo boxes, I use:
# rc-update add smartd default
and I'm done.
Now that the suite is installed and the daemon is running, we should find the occasional log entry in our syslog. Here's what I see when I restart smartd:
Dec 1 16:50:32 localhost smartd[6219]: smartd received signal 15: Terminated
Dec 1 16:50:32 localhost smartd[6219]: smartd is exiting (exit status 0)
Dec 1 16:50:33 localhost smartd[2549]: smartd version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Dec 1 16:50:33 localhost smartd[2549]: Home page is http://smartmontools.sourceforge.net/
Dec 1 16:50:33 localhost smartd[2549]: Opened configuration file /etc/smartd.conf
Dec 1 16:50:33 localhost smartd[2549]: Drive: DEVICESCAN, implied '-a' Directive on line 23 of file /etc/smartd.conf
Dec 1 16:50:33 localhost smartd[2549]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Dec 1 16:50:33 localhost smartd[2549]: Problem creating device name scan list
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, opened
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, found in smartd database.
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hdc, opened
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hdc, packet devices [this device CD/DVD] not SMART capable
Dec 1 16:50:33 localhost smartd[2549]: Monitoring 1 ATA and 0 SCSI devices
Dec 1 16:50:33 localhost smartd[2573]: smartd has fork()ed into background mode. New PID=2573.
Dec 1 16:50:33 localhost smartd[2573]: file /var/run/smartd.pid written containing PID 2573
This is what a successful smartd start looks like. As you can see, my workstation only has one hard drive that is SMART-capable and a CD/DVD drive that isn't. As long as the daemon is running, we'll see log entries indicating if the health status of my drive changes.
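If you don't want to pick through the whole start-up log, a short filter can list just the drives that smartd agreed to watch. This is only a sketch: the function name smartd_monitored_devices is made up for illustration, and the pattern assumes the exact message wording shown above (version 5.38); other versions may phrase it differently.

```shell
# Hypothetical helper: read syslog text on stdin and print each device that
# smartd added to its "monitor" list. Assumes the 5.38 message wording.
smartd_monitored_devices() {
    sed -n 's/.*smartd.*Device: \([^,]*\), is SMART capable\. Adding to "monitor" list\..*/\1/p'
}
```

Feed it whatever file your syslog writes to, for example smartd_monitored_devices < /var/log/messages (the path is an assumption and varies by distribution).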
But we don't always read our logs. That's OK, the smartmontools suite has a command-line tool that you can use interactively to find out how healthy your drives are. For example, we can use smartctl to find out what type of drive we have:
# smartctl -i /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE family
Device Model: WDC WD3200JB-00KFA0
Serial Number: WD-WCAMR3566562
Firmware Version: 08.05J08
User Capacity: 320,072,933,376 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Dec 1 16:57:28 2008 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Of course, most of this isn't really interesting and only serves to expose just how old my system really is. (From here on, I'll strip the heading information from the output.) But we can also use smartctl to ask the drive about its general state of health:
# smartctl -H /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
At this point, though, we need to discuss what it means for the “overall-health self-assessment” to result in “PASSED.” In practice, I'm finding, it doesn't really mean that much. What it's telling you is that the IDE drive controller has NOT detected any problems. It does NOT mean that problems don't exist, just that the IDE controller hasn't seen them yet. I got a passing result on a drive I knew to be bad. Still, this is a valuable check, because if the IDE controller has found a problem, you REALLY need to know about it.
A more thorough test can be run with the smartctl -t short /dev/hda command. As you might imagine, an even more thorough test can be had by changing “short” to “long” in the command above. However, this command doesn't immediately return any results. It simply tells you to come back later and ask for the results:
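Because a pass proves so little, the useful thing to automate is the opposite case: make noise when the result is anything other than PASSED. Here is a minimal sketch. Both function names are invented for illustration, and the pattern assumes the ATA-style result line shown above; SCSI drives word their result differently, so treat this as a starting point only.

```shell
# Hypothetical helper: read `smartctl -H` output on stdin and succeed only
# if the overall-health line says PASSED (ATA-style wording assumed).
smart_health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Hypothetical wrapper: query one drive and report the result.
# Needs root and a real device, e.g. check_drive /dev/hda
check_drive() {
    if smartctl -H "$1" | smart_health_passed; then
        echo "$1: PASSED (no problem detected yet, which is not the same as healthy)"
    else
        echo "$1: did NOT pass, look at this drive now"
    fi
}
```

Run from cron, check_drive gives you a one-line status per drive without having to read the full smartctl output.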
# smartctl -t short /dev/hda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Dec 1 17:15:50 2008
Use smartctl -X to abort test.
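If you want to script the wait instead of watching the clock, the output above already tells you how long to sleep. A rough sketch, assuming the "Please wait N minutes" wording shown here; the function names are hypothetical, and the 2-minute fallback is just a guess for when that line isn't found.

```shell
# Hypothetical helper: read `smartctl -t` output on stdin and print the
# number of minutes the drive asked us to wait.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes\{0,1\} for test to complete\..*/\1/p'
}

# Hypothetical wrapper: start a short self-test, sleep, then show the log.
# Needs root and a real device, e.g. run_short_test /dev/hda
run_short_test() {
    mins=$(smartctl -t short "$1" | selftest_wait_minutes)
    sleep $(( ${mins:-2} * 60 ))   # fall back to 2 minutes (an assumption)
    smartctl -l selftest "$1"
}
```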
Well, after the test is complete, we can use the smartctl -l selftest /dev/hda command to see the results:
# smartctl -l selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 14942 -
Here is where you see that the drive in my workstation is doing just fine. But in the spirit of spreading the misery, let's take a look at a few drives I have that aren't doing so well.
Here is an ominous message that I found in the log file of my MythTV server:
Nov 27 04:12:51 media smartd[6884]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
The neat thing about it all is that we use this server every day as a family and we've not noticed anything unusual. Still, the drive is reporting a change in one of its pre-failure attributes, and I take that as a warning. Guess what? We'll be replacing it soon. I'll probably duplicate the system and replace the drive in one operation and be back up in 30 minutes.
Here's another example. Two of the drives in my home fileserver are failing. I've backed them up and am waiting for the replacement drives to arrive. In the meantime, here is the self-test log from one of them:
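Lines like this are easy to miss among the rest of the log chatter, so it can help to boil them down to one short line per change. The sketch below assumes the log wording shown above, and the function name is made up. Note that "Prefailure" names the attribute's type and the numbers are the drive's normalized values, so a single change is a reason to look closer rather than proof of failure.

```shell
# Hypothetical helper: read syslog text on stdin and print one line per
# pre-failure attribute change, in the form:
#   device attribute_name(attribute_id) old -> new
smartd_prefailure_changes() {
    sed -n 's/.*Device: \([^,]*\), SMART Prefailure Attribute: \([0-9]*\) \([^ ]*\) changed from \([0-9]*\) to \([0-9]*\).*/\1 \3(\2) \4 -> \5/p'
}
```

Usage attributes (such as temperature) are ignored on purpose, which keeps the output short enough to read every day.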
# smartctl -l selftest /dev/hdd
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 23678 200910
# 2 Extended offline Completed: read failure 90% 23676 200910
# 3 Short offline Completed: read failure 90% 23676 200910
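A self-test log like this one is simple to check from a script. The following is a sketch only: the function name is hypothetical, and it assumes that failed tests are the ones whose status begins with "Completed:" (as in "Completed: read failure" above), while good ones read "Completed without error".

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and print
# one line for every self-test that completed with a failure.
selftest_failures() {
    awk '/^#/ && /Completed: / {
        line = $0
        sub(/^# */, "", line)   # drop the leading "#" so "#10" and "# 1" both work
        split(line, f, " ")     # f[1] is now the test number
        print "test " f[1] ": first error at LBA " $NF
    }'
}
```

On a real system this would be run as smartctl -l selftest /dev/hdd | selftest_failures; no output means no failed tests were logged.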
As it happens, these drive problems were announced via syslog some time before I ever noticed a problem. Take a look:
Nov 27 03:34:10 dominion smartd[30161]: Monitoring 4 ATA and 0 SCSI devices
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdc, 463 Currently unreadable (pending) sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdc, 1210 Offline uncorrectable sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdd, 1430 Currently unreadable (pending) sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdd, 1429 Offline uncorrectable sectors
I've been watching the number of uncorrectable and unreadable sectors increase over the course of a month or so. And that's the point of this article. I knew my drives were failing long before they actually failed. A drive that has one or two bad sectors can be easily recovered or repaired. But this drive isn't going to last long. Because I was using smartd, and watching my logs, I was able to get my data backed up, consider my replacement options, and plan the replacement. The last thing I, or you, need, is to wake up one morning and find that an important server died during the night without warning.... Been there. Done that. Never want to do it again.
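Watching those counts climb over a month is easier if each reading is reduced to a single line. Here is a minimal sketch that does so, assuming the traditional syslog layout and message wording shown above; the function name is invented, and other log formats would need a different pattern.

```shell
# Hypothetical helper: read syslog text on stdin and print
#   month day time device kind count
# for every pending/uncorrectable sector report from smartd.
smartd_bad_sectors() {
    awk '/smartd/ && /Device: / && /(Currently unreadable \(pending\)|Offline uncorrectable) sectors/ {
        for (i = 1; i <= NF; i++) if ($i == "Device:") break
        dev = $(i + 1); sub(/,$/, "", dev)      # strip the trailing comma
        kind = ($(i + 3) == "Currently") ? "pending" : "uncorrectable"
        print $1, $2, $3, dev, kind, $(i + 2)
    }'
}
```

Appending the output to a file once a day gives you a simple history, and a count that keeps growing is your cue to order a replacement.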
Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com
Comments
how to monitor logs
A very helpful tool. I've actually been reading LJ from my cell phone.
a few questions:
1) Which log messages should I worry about?
I see the smart messages in my /var/log/everything. Many of them describe temperature changes. To skip these, I did this command.
# grep -i smart log-2009-02-06-*|grep -iv Celsius|less
How does one best monitor the smartd log entries? What's the "right command" to grep on the logs?
output:
log-2009-02-06-17:23:46:Feb 6 09:06:55 [smartd] Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 102_
log-2009-02-06-17:23:46:Feb 6 09:36:54 [smartd] Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 103_
log-2009-02-06-17:23:46:Feb 6 10:06:55 [smartd] Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 104_
log-2009-02-06-17:23:46:Feb 6 10:06:55 [smartd] Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 57 to 58_
2) What does lifetime mean? # of hours until it fails?
Here are the outputs of selftests on two of my drives, /dev/sda and /dev/sdba
Does one only have 500 hours left (i.e. 20 days)? What does this mean?
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 509 -
# 2 Extended offline Completed without error 00% 947 -
# 3 Short offline Completed without error 00% 946 -
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 15344 -
# 2 Short offline Completed without error 00% 15342 -
Smart script
I've had a shell script published on my wiki since July '07 that monitors my drives and emails their status to me. Your article summarizes the research that I went thru to build my script. Perhaps the script will benefit some LJ readers.
What I learned is that the extended drive information is different between manufacturers, so I can't depend on that as a means of determining status. But I can easily look for the word 'PASSED'. It is always there for a healthy drive. So it's a pretty trivial task to write out a custom text message and deliver it--by email in my case--to avoid potential problems.
Typo
Shouldn't this
At this point, though, we need to discuss what it means for the “overall-health self-assessment” to result in “PASSED.” In practice, I'm finding, it doesn't really mean that much. What its telling you is that the IDE drive controller has detected problems.
be
At this point, though, we need to discuss what it means for the “overall-health self-assessment” to result in “PASSED.” In practice, I'm finding, it doesn't really mean that much. What its telling you is that the IDE drive controller has NOT detected problems.
Google disk report
As one of the world's biggest consumers of disks, Google wrote a report about disk failures in their systems some years ago; I think it is still quite relevant. They are not very positive about SMART.
http://research.google.com/archive/disk_failures.pdf
What do these messages actually mean?
One thing these types of articles never seem to get into is:
What do messages like
Nov 27 04:12:51 media smartd[6884]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252
or (taken from Logwatch on one of my computers)
/dev/hda :
Prefailure: Seek_Time_Performance (8) changed to
250, 249, 248,
/dev/hdb :
Usage: Raw_Read_Error_Rate (1) changed to
109, 110, 111, 112,
actually mean? Not the meta-meaning "You should change your hard drive", but exactly what do these numbers actually indicate?
Ben
SMART is not very reliable, change harddisks
Before you think SMART will help you ensure that harddisks do not fail, read
http://research.google.com/archive/disk_failures.pdf
Harddisk quality is not increasing, it is decreasing. It is like floppy disks: 20 years ago they were rather reliable, but during the last 5 years they were readable only in the same floppy drive where they were written.
The reason for this decline is price. Hard drives are very cheap. In my notebooks, I replace the harddrive yearly, very easily done if you use ghosting tools like g4l or similar. And as a bonus, you get a lot of free space.
How do you view smartd info for SATA drives?
Hi there, great article but how do I check on my SATA drives? Looks like I have to use something called libata.
Thanks,
Brendan
SATA
SATA is supported in the current version of smartmontools. From the FAQ:
"Smartmontools should work correctly with SATA drives under both Linux 2.4 and 2.6 kernels. Depending on which subsystem the SATA controller is in (i.e. drivers/ide, drivers/ata or libata (under drivers/scsi) a SATA drive will appear as /dev/hd* or /dev/sd*. Either way, smartmontools should be able to figure out what is going on and act accordingly. In some cases smartmontools may need a hint in the form of a '-d sat' or '-d ata' option on the smartctl command line or in the /etc/smartd.conf file. There may be a hint to add one of those options in the log file when smartd is run as a daemon or on the command line with smartctl. The '-d ata' option means that even though the drive has a SCSI device name, treat it as an ATA disk. Unfortunately such an approach doesn't often work. The next paragraph has more information about '-d sat'."
http://smartmontools.sourceforge.net/faq.html