Know when your drives are failing, with smartd

“Ka-chunk... ka-chunk... ka-chunk... tick... tick... tick... Ka-chunk... ka-chunk...” That's just not a sound you ever want to hear coming from a hard drive. It's the sound of a hard drive trying to move its read/write heads into a position that they don't seem to want to go to, or trying to read a sector that just isn't there anymore. Of course, modern hard drives have come a long way and are amazingly reliable, but if you work with computers long enough, you're bound to have one fail on you.

I know that a lot has been written about smartd and I was hesitant to pile on even more, but let's just say, for now, that I write about what I know about, and that lately, I've come to know a lot about disk drive failures...

Running smartd on your servers and workstations is just like performing backups; it's something you know you should be doing, but probably aren't, or at least not regularly. I'm hoping that this article will help convince you that now's the time to start.

SMART, or Self-Monitoring, Analysis, and Reporting Technology, is a capability that's built into almost all modern IDE and SCSI disk drives that allows the drive controller to monitor its own state of health. What that means is that a SMART-capable drive can give you a clue that it's about to fail. All you have to do is listen for these clues, and that's where smartd comes in. Smartd is part of the smartmontools package and runs as a daemon on your system. Periodically, smartd will poll your installed hard drives and, essentially, “ask them” how they're doing. Smartmontools supports ATA/ATAPI/SATA-3 to -8 disks and SCSI, so smartd should be usable across a wide range of drives.

Performing a basic smartd configuration is almost ridiculously simple. Most of us are using Linux distributions that have some sort of package management feature, so installing smartmontools should be fairly simple since the suite has been packaged for all of the major distributions. Once installed, configuration is as easy as it gets. The tool, by default, will discover and scan all of the drives in your system, and the default configuration file seems reasonable. Most of the time, I don't even bother to make changes to the /etc/smartd.conf file. You only need to change the configuration file if you need smartd to handle ill-behaved hardware, or if you need it to perform specific, non-default functions. Finally, you have to arrange for your system to start smartd as part of the system start-up process. This part depends on which distribution you use. On my Gentoo boxes, I use:

rc-update add smartd default

and I'm done.

Now that the suite is installed and the daemon is running, we should find the occasional log entry in our syslog. Here's what I see when I restart smartd:

Dec 1 16:50:32 localhost smartd[6219]: smartd received signal 15: Terminated
Dec 1 16:50:32 localhost smartd[6219]: smartd is exiting (exit status 0)
Dec 1 16:50:33 localhost smartd[2549]: smartd version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Dec 1 16:50:33 localhost smartd[2549]: Home page is http://smartmontools.sourceforge.net/
Dec 1 16:50:33 localhost smartd[2549]: Opened configuration file /etc/smartd.conf
Dec 1 16:50:33 localhost smartd[2549]: Drive: DEVICESCAN, implied '-a' Directive on line 23 of file /etc/smartd.conf
Dec 1 16:50:33 localhost smartd[2549]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Dec 1 16:50:33 localhost smartd[2549]: Problem creating device name scan list
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, opened
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, found in smartd database.
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hda, is SMART capable. Adding to "monitor" list.
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hdc, opened
Dec 1 16:50:33 localhost smartd[2549]: Device: /dev/hdc, packet devices [this device CD/DVD] not SMART capable
Dec 1 16:50:33 localhost smartd[2549]: Monitoring 1 ATA and 0 SCSI devices
Dec 1 16:50:33 localhost smartd[2573]: smartd has fork()ed into background mode. New PID=2573.
Dec 1 16:50:33 localhost smartd[2573]: file /var/run/smartd.pid written containing PID 2573

This is what a successful smartd start looks like. As you can see, my workstation only has one hard drive that is SMART-capable and a CD/DVD drive that isn't. As long as the daemon is running, we'll see log entries indicating if the health status of my drive changes.
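If you'd rather not scroll through every smartd line, a small filter can pull out the ones that usually deserve a closer look. This is only a minimal sketch and not part of smartmontools: the function name smartd_warnings is made up for illustration, the keyword list is my own assumption, and the log path in the usage comment varies by distribution.

```shell
#!/bin/sh
# Hypothetical helper: print smartd log lines that usually deserve attention.
# Reads syslog text on stdin. The keyword list is an assumption; tune it.
# "fail" also matches "Prefailure" because the match is case-insensitive.
smartd_warnings() {
    grep -i 'smartd' | grep -E -i 'fail|unreadable|uncorrectable'
}

# Usage (the log file location is an assumption and differs between systems):
#   smartd_warnings < /var/log/messages
```

Routine start-up messages and ordinary usage-attribute changes are not matched by these keywords, so the output should stay short on a healthy machine.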

But we don't always read our logs. That's OK; the smartmontools suite has a command-line tool that you can use interactively to find out how healthy your drives are. For example, we can use smartctl to find out what type of drive we have:

# smartctl -i /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE family
Device Model: WDC WD3200JB-00KFA0
Serial Number: WD-WCAMR3566562
Firmware Version: 08.05J08
User Capacity: 320,072,933,376 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Dec 1 16:57:28 2008 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Of course, most of this isn't really interesting and only serves to expose just how old my system really is. (In the future, I'll strip out the heading information for further output.) But we can also use smartctl to ask the drive about its general state of health:

# smartctl -H /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
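The same check is easy to script. Here is a minimal sketch, assuming ATA-style output worded exactly like the sample above; other drive types may phrase their health line differently. The function name health_passed is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if the "smartctl -H" output
# read from stdin contains the PASSED health line shown above.
health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Usage (the device list is an assumption; adjust it for your system):
#   for d in /dev/hda; do
#       smartctl -H "$d" | health_passed || echo "WARNING: check $d"
#   done
```

Because the function only reports success or failure, it can be dropped into a cron job that mails you whenever the warning line is printed.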

At this point, though, we need to discuss what it means for the “overall-health self-assessment” to result in “PASSED.” In practice, I'm finding, it doesn't really mean that much. What it's telling you is that the IDE drive controller has NOT detected problems. It does NOT mean that other problems don't exist, just that the IDE controller hasn't seen them yet. I got a passing result on a drive I knew to be bad. Still, this is a valuable check, because if the IDE controller has found a problem, you REALLY need to know about it.

A more thorough test result may be obtained with the smartctl -t short /dev/hda command. As you might imagine, an even more thorough test result can be had by changing “short” to “long” in the command above. However, this command doesn't immediately return any results. It simply tells you to come back later and ask for the results:

# smartctl -t short /dev/hda
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Dec 1 17:15:50 2008

Use smartctl -X to abort test.
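If you want a script to come back for the results on its own, it can read the waiting time out of that message. This is a rough sketch, assuming the "Please wait N minutes" wording shown above; the function name selftest_wait_minutes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the number of minutes that "smartctl -t short"
# (or "-t long") asks you to wait, taken from its output on stdin.
# Prints nothing if the expected line is not present.
selftest_wait_minutes() {
    sed -n 's/^Please wait \([0-9][0-9]*\) minutes* for test to complete\.$/\1/p'
}

# Usage:
#   mins=$(smartctl -t short /dev/hda | selftest_wait_minutes)
#   [ -n "$mins" ] && sleep $((mins * 60)) && smartctl -l selftest /dev/hda
```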

Well, after the test is complete, we can use the smartctl -l selftest /dev/hda command to see the results:

# smartctl -l selftest /dev/hda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 14942 -
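A script can scan this log for anything other than a clean result. The sketch below assumes log entries start with "#" followed by the entry number, as in the output above, and treats every status except "Completed without error" as worth a look. The function name failed_selftests is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: from "smartctl -l selftest" output on stdin, print
# the log entries whose status is not "Completed without error".
# Exits 0 if at least one such entry was printed, non-zero otherwise.
failed_selftests() {
    grep '^# *[0-9]' | grep -v 'Completed without error'
}

# Usage:
#   if smartctl -l selftest /dev/hda | failed_selftests; then
#       echo "self-test log needs attention"
#   fi
```

Note that a test that is still running or was aborted also shows up here, since its status is not a clean completion either.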

Here is where you see that the drive in my workstation is doing just fine. But in the spirit of spreading the misery, let's take a look at a few drives I have that aren't doing so well.

Here is an ominous message that I found in the log file of my MythTV server:

Nov 27 04:12:51 media smartd[6884]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252

The neat thing about it all is that we use this server every day as a family and we've not noticed anything unusual. Still, the drive says it's about to fail. Guess what? We'll be replacing it soon. I'll probably duplicate the system and replace the drive in one operation and be back up in 30 minutes.

Here's another example. Two of the drives in my home fileserver are failing. I've backed them up and am waiting for the replacement drives to arrive. In the meantime, here is the result of the short selftest on one of them:

# smartctl -l selftest /dev/hdd
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 23678 200910
# 2 Extended offline Completed: read failure 90% 23676 200910
# 3 Short offline Completed: read failure 90% 23676 200910

As it happens, these drive problems were announced via syslog some time before I ever noticed a problem. Take a look:

Nov 27 03:34:10 dominion smartd[30161]: Monitoring 4 ATA and 0 SCSI devices
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdc, 463 Currently unreadable (pending) sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdc, 1210 Offline uncorrectable sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdd, 1430 Currently unreadable (pending) sectors
Nov 27 03:34:11 dominion smartd[30161]: Device: /dev/hdd, 1429 Offline uncorrectable sectors
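To watch these numbers over time, it helps to boil the log lines down to one short record per device. The following is a minimal sketch that assumes the exact message layout shown above (other smartd versions may format these lines differently); the function name sector_counts and the output layout are my own invention.

```shell
#!/bin/sh
# Hypothetical helper: read syslog text on stdin and print
# "<device> <pending|uncorrectable> <count>" for each smartd sector message.
sector_counts() {
    awk '
    /smartd/ && /Currently unreadable \(pending\) sectors|Offline uncorrectable sectors/ {
        for (i = 1; i < NF; i++) {
            if ($i == "Device:") {
                dev = $(i + 1)
                sub(/,$/, "", dev)      # strip the trailing comma
                count = $(i + 2)
                kind = ($(i + 3) == "Offline") ? "uncorrectable" : "pending"
                if (count ~ /^[0-9]+$/) print dev, kind, count
            }
        }
    }'
}

# Usage (the log path is an assumption):
#   sector_counts < /var/log/messages
```

Saving the output once a day and comparing it with the previous day's file makes a growing count easy to spot.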

I've been watching the number of uncorrectable and unreadable sectors increase over the course of a month or so. And that's the point of this article. I knew my drives were failing long before they actually failed. A drive that has one or two bad sectors can be easily recovered or repaired. But this drive isn't going to last long. Because I was using smartd, and watching my logs, I was able to get my data backed up, consider my replacement options, and plan the replacement. The last thing you or I need is to wake up one morning and find that an important server died during the night without warning... Been there. Done that. Never want to do it again.

______________________

Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com

Comments

how to monitor logs

bill m

A very helpful tool. I've actually been reading LJ from my cell phone.

a few questions:

1) Which log messages should I worry about?

I see the smart messages in my /var/log/everything. Many of them describe temperature changes. To skip these, I ran this command:
# grep -i smart log-2009-02-06-*|grep -iv Celsius|less

How does one best monitor the smartd log entries? What's the "right command" to grep on the logs?

output:
log-2009-02-06-17:23:46:Feb 6 09:06:55 [smartd] Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 102_
log-2009-02-06-17:23:46:Feb 6 09:36:54 [smartd] Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 103_
log-2009-02-06-17:23:46:Feb 6 10:06:55 [smartd] Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 104_
log-2009-02-06-17:23:46:Feb 6 10:06:55 [smartd] Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 57 to 58_

2) What does lifetime mean? # of hours until it fails?

Here are the outputs of selftests on two of my drives, /dev/sda and /dev/sdba

Does one only have 500 hours left (i.e. 20 days)? What does this mean?

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 509 -
# 2 Extended offline Completed without error 00% 947 -
# 3 Short offline Completed without error 00% 946 -

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 15344 -
# 2 Short offline Completed without error 00% 15342 -

Smart script

Chris Freyer

I've had a shell script published on my wiki since July '07 that monitors my drives and emails their status to me. Your article summarizes the research that I went thru to build my script. Perhaps the script will benefit some LJ readers.

What I learned is that the extended drive information is different between manufacturers, so I can't depend on that as a means of determining status. But I can easily look for the word 'PASSED'. It is always there for a healthy drive. So it's a pretty trivial task to write out a custom text message and deliver it--by email in my case--to avoid potential problems.

Typo

Anonymous

Shouldn't this

At this point, though, we need to discuss what it means for the “overall-health self-assessment” to result in “PASSED.” In practice, I'm finding, it doesn't really mean that much. What its telling you is that the IDE drive controller has detected problems.

Be

At this point, though, we need to discuss what it means for the “overall-health self-assessment” to result in “PASSED.” In practice, I'm finding, it doesn't really mean that much. What its telling you is that the IDE drive controller has NOT detected problems.

Google disk report

Anonymous

As one of the world's biggest consumers of disks, Google wrote a report about disk failures in their systems some years ago; I think it is still rather relevant. They do not speak very positively about SMART.

http://research.google.com/archive/disk_failures.pdf

What do these messages actually mean?

Oloryn

One thing these types of articles never seem to get into is:

What do messages like

Nov 27 04:12:51 media smartd[6884]: Device: /dev/hda, SMART Prefailure Attribute: 8 Seek_Time_Performance changed from 251 to 252

or (taken from Logwatch on one of my computers)

/dev/hda :
Prefailure: Seek_Time_Performance (8) changed to
250, 249, 248,

/dev/hdb :
Usage: Raw_Read_Error_Rate (1) changed to
109, 110, 111, 112,

actually mean? Not the meta-meaning "You should change your hard drive", but exactly what do these numbers actually indicate?

Ben

SMART is not very reliable, change harddisks

Anonymous

Before you think SMART will help you ensure that harddisks do not fail, read

http://research.google.com/archive/disk_failures.pdf

Harddisk quality is not increasing, it is decreasing. It is like floppy disks: 20 years ago they were rather reliable, but during the last 5 years they were readable only in the same floppy drive where they were written.

The reason for this decline is price. Hard drives are very cheap. In my notebooks, I replace the harddrive yearly, very easily done if you use ghosting tools like g4l or similar. And as a bonus, you get a lot of free space.

How do you view smartd info for SATA drives?

Brendan Skoreyko

Hi there, great article but how do I check on my SATA drives? Looks like I have to use something called libata.

Thanks,
Brendan

SATA

vicm3

In the current version of smartmontools, SATA is supported. From the FAQ:

"Smartmontools should work correctly with SATA drives under both Linux 2.4 and 2.6 kernels. Depending on which subsystem the SATA controller is in (i.e. drivers/ide, drivers/ata or libata (under drivers/scsi) a SATA drive will appear as /dev/hd* or /dev/sd*. Either way, smartmontools should be able to figure out what is going on and act accordingly. In some cases smartmontools may need a hint in the form of a '-d sat' or '-d ata' option on the smartctl command line or in the /etc/smartd.conf file. There may be a hint to add one of those options in the log file when smartd is run as a daemon or on the command line with smartctl. The '-d ata' option means that even though the drive has a SCSI device name, treat it as an ATA disk. Unfortunately such an approach doesn't often work. The next paragraph has more information about '-d sat'."

http://smartmontools.sourceforge.net/faq.html
