Monitoring Hard Disks with SMART

January 1st, 2004 by Bruce Allen

One of your hard disks might be trying to tell you it's not long for this world. Install software that lets you know when to replace it.

It's a given that all disks eventually die, and it's easy to see why. The platters in a modern disk drive rotate more than a hundred times per second, maintaining submicron tolerances between the disk heads and the magnetic media that store data. Often they run 24/7 in dusty, overheated environments, thrashing on heavily loaded or poorly managed machines. So, it's not surprising that experienced users are all too familiar with the symptoms of a dying disk. Strange things start happening. Inscrutable kernel error messages cover the console and then the system becomes unstable and locks up. Often, entire days are lost repeating recent work, re-installing the OS and trying to recover data. Even if you have a recent backup, sudden disk failure is a minor catastrophe.

Many users and system administrators don't know that Self-Monitoring, Analysis and Reporting Technology (SMART) systems are built into most modern ATA and SCSI hard disks. SMART disk drives internally monitor their own health and performance. In many cases, the disk itself provides advance warning that something is wrong, helping to avoid the scenario described above. Most implementations of SMART also allow users to perform self-tests on the disk and to monitor a number of performance and reliability attributes.

By profession I am a physicist. My research group runs a large computing cluster with 300 nodes and 600 disk drives, on which more than 50TB of physics data are stored. I became interested in SMART several years ago when I realized it could help reduce downtime and keep our cluster operating more reliably. For about a year I have been maintaining an open-source package called smartmontools, a spin-off of the UCSC smartsuite package, for this purpose.

In this article, I explain how to use smartmontools' smartctl utility and smartd dæmon to monitor the health of a system's disks. See smartmontools.sourceforge.net for download and installation instructions and consult the WARNINGS file for a list of problem disks/controllers. Additional documentation can be found in the man pages (man smartctl and man smartd) and on the Web page.

Versions of smartmontools are available for Slackware, Debian, SuSE, Mandrake, Gentoo, Conectiva and other Linux distributions. Red Hat's existing products contain the UCSC smartsuite versions of smartctl and smartd, but the smartmontools versions will be included in upcoming releases.

To understand how smartmontools works, it's helpful to know the history of SMART. The original SMART spec (SFF-8035i) was written by a group of disk drive manufacturers. In Revision 2 (April 1996) disks keep an internal list of up to 30 Attributes corresponding to different measures of performance and reliability, such as read and seek error rates. Each Attribute has a one-byte normalized value ranging from 1 to 253 and a corresponding one-byte threshold. If one or more of the normalized Attribute values is less than or equal to its corresponding threshold, then either the disk is expected to fail in less than 24 hours or it has exceeded its design or usage lifetime. Some of the Attribute values are updated as the disk operates. Others are updated only through off-line tests that temporarily slow down disk reads/writes and, thus, must be run with a special command. In late 1995, parts of SFF-8035i were merged into the ATA-3 standard.

Starting with the ATA-4 standard, the requirement that disks maintain an internal Attribute table was dropped. Instead, the disks simply return an OK or NOT OK response to an inquiry about their health. A negative response indicates the disk firmware has determined that the disk is likely to fail. The ATA-5 standard added an ATA error log and commands to run disk self-tests to the SMART command set.

To make use of these disk features, you need to know how to use smartmontools to examine the disk's Attributes (most disks are backward-compatible with SFF-8035i), query the disk's health status, run disk self-tests, examine the disk's self-test log (results of the last 21 self-tests) and examine the disk's ATA error log (details of the last five disk errors). Although this article focuses on ATA disks, additional documentation about SCSI devices can be found on the smartmontools Web page.

To begin, give the command smartctl -a /dev/hda, using the correct path to your disk, as root. If SMART is not enabled on the disk, you first must enable it with the -s on option. You then see output similar to the output shown in Listings 1–5.

The first part of the output (Listing 1) lists model/firmware information about the disk—this one is an IBM/Hitachi GXP-180 example. Smartmontools has a database of disk types. If your disk is in the database, it may be able to interpret the raw Attribute values correctly.

The second part of the output (Listing 2) shows the results of the health status inquiry. This is the one-line Executive Summary Report of disk health; the disk shown here has passed. If your disk health status is FAILING, back up your data immediately. The remainder of this section of the output provides information about the disk's capabilities and the estimated time to perform short and long disk self-tests.
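For scripting, the health inquiry can be requested on its own with smartctl -H and reduced to a single word. The following is a minimal sketch, not part of smartmontools: the function name health_word is hypothetical, and it assumes the report wording "SMART overall-health self-assessment test result: PASSED" for ATA disks and "SMART Health Status: OK" for SCSI disks, which may differ between versions.

```shell
# health_word: read `smartctl -H` output on stdin and print one word.
# Assumption: the report contains either
#   "SMART overall-health self-assessment test result: PASSED" (ATA) or
#   "SMART Health Status: OK" (SCSI); anything else is reported as UNKNOWN.
health_word() {
    awk '
        /self-assessment test result:/ { sub(/^.*result:[ \t]*/, ""); word = $0 }
        /^SMART Health Status:/        { sub(/^SMART Health Status:[ \t]*/, ""); word = $0 }
        END { print (word == "" ? "UNKNOWN" : word) }
    '
}

# Typical use, as root:
#   smartctl -H /dev/hda | health_word
```

Anything other than PASSED or OK is a reason to read the full smartctl -a report.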

The third part of the output (Listing 3) lists the disk's table of up to 30 Attributes (from a maximum set of 255). Remember that Attributes are no longer part of the ATA standard, but most manufacturers still support them. Although SFF-8035i doesn't define the meaning or interpretation of Attributes, many have a de facto standard interpretation. For example, this disk's 13th Attribute (ID #194) tracks its internal temperature.

Studies have shown that lowering disk temperatures by as little as 5°C significantly reduces failure rates, though this is less of an issue for the latest generation of fluid-drive bearing drives. One of the simplest and least expensive steps you can take to ensure disk reliability is to add a cooling fan that blows cooling air directly onto or past the system's disks.

Each Attribute has a six-byte raw value (RAW_VALUE) and a one-byte normalized value (VALUE). In this case, the raw value stores three temperatures: the disk's temperature in Celsius (29), plus its lifetime minimum (23) and maximum (33) values. The format of the raw data is vendor-specific and not specified by any standard. To track disk reliability, the disk's firmware converts the raw value to a normalized value ranging from 1 to 253. If this normalized value is less than or equal to the threshold (THRESH), the Attribute is said to have failed, as indicated in the WHEN_FAILED column. The column is empty because none of these Attributes has failed. The lowest (WORST) normalized value also is shown; it is the smallest value attained since SMART was enabled on the disk. The TYPE of the Attribute indicates if Attribute failure means the device has reached the end of its design life (Old_age) or it's an impending disk failure (Pre-fail). For example, disk spin-up time (ID #3) is a prefailure Attribute. If this (or any other prefail Attribute) fails, disk failure is predicted in less than 24 hours.
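The comparison between VALUE, WORST and THRESH can be checked mechanically. The sketch below is one possible filter, with a hypothetical function name. It assumes the column order printed in the table header (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, ...) and reports an Attribute when its current or worst normalized value is at or below the threshold.

```shell
# failed_attributes: read a smartctl Attribute table on stdin and report every
# Attribute whose normalized VALUE or WORST is at or below THRESH.
# Assumed column order (as in the table header printed by smartctl):
#   ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
failed_attributes() {
    awk '
        # Data rows start with a numeric ID and carry a hexadecimal FLAG.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (value <= thresh)
                print $1, $2, $7, "FAILING_NOW"
            else if (worst <= thresh)
                print $1, $2, $7, "failed_in_the_past"
        }
    '
}

# Typical use, as root:
#   smartctl -A /dev/hda | failed_attributes
```

A Pre-fail line marked FAILING_NOW corresponds to the predicted-failure case described above; an Old_age line indicates only that the design lifetime has been reached.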

The names/meanings of Attributes and the interpretation of their raw values are not specified by any standard. Different manufacturers sometimes use the same Attribute ID for different purposes. For this reason, the interpretation of specific Attributes can be modified using the -v option to smartctl; please see the man page for details. For example, some disks use Attribute 9 to store the power-on time of the disk in minutes; the -v 9,minutes option to smartctl correctly modifies the Attribute's interpretation. If your disk model is in the smartmontools database, these -v options are set automatically.

The next part of the smartctl -a output (Listing 4) is a log of the disk errors. This particular disk has been error-free, and the log is empty. Typically, one should worry only if disk errors start to appear in large numbers. An occasional transient error that does not recur usually is benign. The smartmontools Web page has a number of examples of smartctl -a output showing some illustrative error log entries. They are timestamped with the disk's power-on lifetime in hours when the error occurred, and the individual ATA commands leading up to the error are timestamped with the time in milliseconds after the disk was powered on. This shows whether the errors are recent or old.
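Two small helpers make the error log easier to use from scripts. This is a sketch with hypothetical function names: the first converts a power-on lifetime in hours into days and hours, and the second pulls the total from the "ATA Error Count:" line that smartctl -l error prints when errors have been logged (assumed wording).

```shell
# hours_to_days: convert a power-on lifetime in hours to "D days + H hours".
hours_to_days() {
    echo "$(( $1 / 24 )) days + $(( $1 % 24 )) hours"
}

# ata_error_count: read `smartctl -l error` output on stdin and print the
# total number of logged errors, or 0 when no "ATA Error Count:" line exists.
ata_error_count() {
    awk '/^ATA Error Count:/ { n = $4 } END { print n + 0 }'
}

# Typical use, as root:
#   smartctl -l error /dev/hda | ata_error_count
#   hours_to_days 9060
```

Comparing the lifetime stamp of the most recent error with the disk's current power-on hours shows how long ago the error occurred.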

The final part of the smartctl output (Listing 5) is a report of the self-tests run on the disk. These show two types of self-tests, short and long. (ATA-6/7 disks also may have conveyance and selective self-tests.) These can be run with the commands smartctl -t short /dev/hda and smartctl -t long /dev/hda and do not corrupt data on the disk. Typically, short tests take only a minute or two to complete, and long tests take about an hour. These self-tests do not interfere with the normal functioning of the disk, so the commands may be used for mounted disks on a running system. On our computing cluster nodes, a long self-test is run with a cron job early every Sunday morning. The entries in Listing 5 all are self-tests that completed without errors; the LifeTime column shows the power-on age of the disk when the self-test was run. If a self-test finds an error, the Logical Block Address (LBA) shows where the error occurred on the disk. The Remaining column shows the percentage of the self-test remaining when the error was found. If you suspect that something is wrong with a disk, I strongly recommend running a long self-test to look for problems.
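A script that runs after a scheduled self-test can scan the log for failures. The sketch below uses a hypothetical function name and assumes that failed entries contain the word "failure" in their Status text (for example "Completed: read failure") and end with the Remaining, LifeTime(hours) and LBA_of_first_error columns.

```shell
# failed_selftests: read `smartctl -l selftest` output on stdin and print
# "<LifeTime hours> <LBA of first error>" for each entry reporting a failure.
# Assumption: log entries start with "#" and failed ones contain "failure".
failed_selftests() {
    awk '/^#/ && /failure/ { print $(NF - 1), $NF }'
}

# Typical use, as root:
#   smartctl -t long /dev/hda                      # start the test
#   smartctl -l selftest /dev/hda | failed_selftests   # later, read the log
```

Empty output means no logged self-test reported a failure; a repeated LBA across several runs points at the same region of the disk.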

The smartctl -t offline command can be used to carry out off-line tests. These off-line tests do not make entries in the self-test log. They date back to the SFF-8035i standard, and update values of the Attributes that are not updated automatically under normal disk operation (see the UPDATED column in Listing 3). Some disks support automatic off-line testing, enabled by smartctl -o on, which automatically runs an off-line test every few hours.

The SMART standard provides a mechanism for running disk self-tests and for monitoring aspects of disk performance. Its main shortcoming is that it doesn't provide a direct mechanism for informing the OS or user if problems are found. In fact, because disk SMART status frequently is not monitored, many disk problems go undetected until they lead to catastrophic failure. Of course, you can monitor disks on a regular basis using the smartctl utility, as I've described, but this is a nuisance.

The remaining part of the smartmontools package is the smartd dæmon that does regular monitoring for you. It monitors the disk's SMART data for signs of problems. It can be configured to send e-mail to users or system administrators or to run arbitrary scripts if problems are detected. By default, when smartd is started, it registers the system's disks. It then checks their status every 30 minutes for failing Attributes, failing health status or increased numbers of ATA errors or failed self-tests and logs this information with SYSLOG in /var/log/messages by default.
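Because smartd reports through SYSLOG, its messages can be picked out of the system log with a short filter. The sketch below uses a hypothetical function name and assumes log lines of the form "smartd[PID]: Device: /dev/hdb, message"; the exact format may vary with the syslog configuration.

```shell
# smartd_messages: read syslog lines on stdin and print "<device> <message>"
# for every line written by smartd about a device.
# Assumption: lines look like "... smartd[PID]: Device: /dev/hdb, <message>".
smartd_messages() {
    awk '
        match($0, /smartd\[[0-9]+\]: Device: /) {
            rest = substr($0, RSTART + RLENGTH)   # "/dev/hdb, 7 Currently ..."
            comma = index(rest, ", ")
            if (comma > 0)
                print substr(rest, 1, comma - 1), substr(rest, comma + 2)
        }
    '
}

# Typical use:
#   smartd_messages < /var/log/messages
```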

You can control and fine-tune the behavior of smartd using the configuration file /etc/smartd.conf. This file is read when smartd starts up, before it forks into the background. Each line contains Directives pertaining to a different disk. The configuration file on our computing cluster nodes looks like this:

# /etc/smartd.conf config file
/dev/hda -S on -o on -a -I 194 -m sense@phys.uwm.edu
/dev/hdc -S on -o on -a -I 194 -m sense@phys.uwm.edu

The first column indicates the device to be monitored. The -o on Directive enables the automatic off-line testing, and the -S on Directive enables automatic Attribute autosave. The -m Directive is followed by an e-mail address to which warning messages are sent, and the -a Directive instructs smartd to monitor all SMART features of the disk. In this configuration, smartd logs changes in all normalized attribute values. The -I 194 Directive means ignore changes in Attribute #194, because disk temperatures change often, and it's annoying to have such changes logged on a regular basis.
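On a machine with several disks, lines of this form can be generated instead of typed by hand. The following sketch uses a hypothetical helper name and a placeholder e-mail address, and simply repeats the Directives from the configuration shown above for each device given.

```shell
# smartd_conf_lines: print one smartd.conf line per device, using the
# Directives -S on -o on -a -I 194 -m <address>.
#   $1      e-mail address for warning messages
#   $2...   device paths to monitor
smartd_conf_lines() {
    mail=$1
    shift
    echo "# /etc/smartd.conf config file"
    for dev in "$@"; do
        echo "$dev -S on -o on -a -I 194 -m $mail"
    done
}

# Typical use, as root (admin@example.com is a placeholder):
#   smartd_conf_lines admin@example.com /dev/hda /dev/hdc > /etc/smartd.conf
```

smartd reads the file only at startup, so it must be restarted after the file changes.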

Usually smartd is started by the normal UNIX init mechanism. For example, on Red Hat distributions, /etc/rc.d/init.d/smartd start and /etc/rc.d/init.d/smartd stop can be used to start and stop the dæmon.

Further information about smartd and its configuration file can be found in the man page (man smartd), and summaries can be found with the commands smartd -D and smartd -h. For example, the -M test Directive sends a test e-mail warning message to confirm that warning e-mail messages are delivered correctly. Other Directives provide additional flexibility, such as monitoring changes in raw Attribute values.

What should you do if a disk shows signs of problems? What if a disk self-test fails or the disk's SMART health status fails? Start by getting your data off the disk and on to another system as soon as possible. Second, run some extended disk self-tests and see if the problem is repeatable at the same LBA. If so, something probably is wrong with the disk. If the disk has failing SMART health status and is under warranty, the vendor usually will replace it. If the disk is failing its self-tests, many manufacturers provide specialized disk health programs, for example, Maxtor's PowerMax or IBM's Drive Fitness Test. Sometimes these programs actually can repair a disk by remapping bad sectors. Often, they report a special error code that can be used to get a replacement disk.

This article has covered the basics of smartmontools. To learn more, read the man pages and Web page, and then write to the support mailing list if you need further help. Remember, smartmontools is no substitute for backing up your data. SMART cannot and does not predict all disk failures, but it often provides clues that something is amiss and has helped to keep many computing clusters operating reliably.

Several developers are porting smartmontools to FreeBSD, Darwin and Solaris, and we recently have added extensions to allow smartmontools to monitor and control the ATA disks behind 3ware RAID controllers. If you would like to contribute to the development of smartmontools, write to the support mailing list. We especially are interested in information about the interpretation and meaning of vendor-specific SMART Attribute and raw values.

Bruce Allen is a professor of Physics at the University of Wisconsin - Milwaukee. He does research work on gravitational waves and the very early universe, and he has built several large Linux clusters for data analysis use.

Linux Journal: delivering readers the advice and inspiration they need to get the most out of their Linux systems since 1994.


I am getting the following error, from my dedicated server

On April 22nd, 2009 Karen (not verified) says:

I just got moved from another server to this one

I am concerned about this error message and the hosting
people are telling me that it is fine, and the only way
to fix it is to turn off the temperature monitor of smart

That the control is set too low

I need to know if this is correct or not.

Here is the error:
S.M.A.R.T Errors on /dev/sda
From Command: /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/sda
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Temperature_Celsius 0x0022 065 039 045 Old_age Always In_the_past 622854179

Your help would be very much appreciated,

Thanks
Karen

didn't the article mention

On August 5th, 2009 Anonymous (not verified) says:

didn't the article mention to ignore 194 (temperature) as the variable changes so often?

"Studies have shown that

On April 10th, 2009 Anonymous (not verified) says:

"Studies have shown that lowering disk temperatures by as little as 5°C significantly reduces failure rates, though this is less of an issue for the latest generation of fluid-drive bearing drives. One of the simplest and least expensive steps you can take to ensure disk reliability is to add a cooling fan that blows cooling air directly onto or past the system's disks."

Which studies are those? Google's study of over 100,000 drives found that disks failed MORE often when they were cooled, and ran better hot:

"In fact, there is a clear trend showing that lower temperatures are associated with higher failure rates. Only at very high temperatures is there a slight reversal of this trend."

-----------

The results from smartctl are very confusing and hard to understand. Wikipedia clarifies what some of the values mean, though there's still a lot of uncertainty:

http://en.wikipedia.org/wiki/Self-Monitoring,_Analysis,_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes

Another tool is http://gsmartcontrol.berlios.de/ , which adds a GUI to smartctl, and provides helpful descriptions when you hover over attributes.

Sense key errors, important or not?

On October 31st, 2008 Anonymous (not verified) says:

I have a critical server running our mail system which has lately been spewing SCSI "sense key errors" to the console. Is this important?

Taking a backup of this server will be a real pain, so do you think the hard drives are OK?

If your mail server is

On November 3rd, 2008 Anonymous (not verified) says:

If your mail server is "critical" to your operation, shouldn't you be doing regular backups anyway? Or, at the very least, use redundant storage like RAID 1?

If you are getting errors, I would back it up without hesitation. Which is the bigger pain, an inconvenient backup or permanent data loss?

Backup and replace the hard drives, the sooner the better. Also, incorporate some redundancy in there.

Hardware Error

On September 24th, 2007 Alex507 (not verified) says:

I'm having the following error.

ce: /dev/sda, SMART Failure: HARDWARE IMPENDING FAILURE TOO MANY BLOCK REASSIGNS

I was wondering if someone has the correct solution for this issue or knows the main cause of this message.

ID 194 shows strange value

On June 9th, 2007 bambid (not verified) says:

I have WDC WD5000YS-01MPB0 and when I read ID 194 from HDD I get this :

194 Temperature_Celsius 0x0022 253 253 000 Old_age Always - 101

which is totally wrong, my HDD can't be at 101 Celsius.

David

Thank you very much for

On January 12th, 2007 Anonymous (not verified) says:

Thank you very much for posting this! I already feared that my hard disk is dying because of the strange noises the PC made on start up. (I don't even know now where the noise came from. Could be the other things too, right?)

USB Harddrives

On January 3rd, 2007 Anonymous (not verified) says:

I was wondering if it's ever going to be possible to get the SMART info off a USB storage device?

Would it need a redesigned USB/ATA interface?

Seems a real shame I can't monitor the health of my many USB drives.

SMART for USB Harddrives?

On December 28th, 2008 mehereno (not verified) says:

I miss SMART for USB disks too. I wonder why I cannot monitor my disk connected over USB in Linux. Is it a Linux driver limit? Or a hardware limit? My USB/ATA controller is based on Genesys Logic (05e3:0702), my disk supports SMART; I know I can read SMART statistics when I connect my disk over a PATA cable.

Monitoring USB Hard Disks with SMART

On January 23rd, 2007 D. L. Sneddon (not verified) says:

Bruce Allen's reference to SMART's ability to "query the disk's health status, run disk self-tests..." suggests that you could at least get some kind of condition report by removing your hard disk from its USB housing, connecting it to an ATA cable in a desktop, then running the query utility. I use a cheap ($6) adapter to connect my 2.5 inch laptop drives to my desktop. Though this ritual does not allow continuous monitoring of the USB drive, it may give a clue as to its current status.

Sad to say, other distractions, such as picking a distribution, have prevented me from trying out my own suggestion.
cheers...

Dead date

On January 3rd, 2007 StuartH (not verified) says:

Is SMARTD able to calculate a dead date?

I came across another SMART tool that did this after it was left running for several weeks.

It gave an estimated dead date.

2 instances of smartd.conf

On September 18th, 2006 Vic (not verified) says:

I'm confused as to why there are 2 smartd.conf files. One in /etc/ and the other in /usr/local/etc/

Why are there 2? Which one do I need to edit? Lastly, how do I make smartctl e-mail me once a week with the SMART results?

Thanks.

interpreting results of smartctl?

On September 5th, 2006 richard (not verified) says:

I ran smartctl -a /dev/hda and got the following error report (one of 8 - all similar on the same day):

Error 4 occurred at disk power-on lifetime: 9060 hours (377 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 01 41 00 00 e0 Error: ICRC, ABRT 1 sectors at LBA = 0x00000041 = 65

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 02 40 00 00 e0 00 00:01:01.697 READ DMA
c8 00 02 40 00 00 e0 00 00:01:01.685 READ DMA
10 00 3f 00 00 00 e0 00 00:01:01.685 RECALIBRATE [OBS-4]
c8 00 02 40 00 00 e0 00 00:01:01.685 READ DMA
c8 00 02 40 00 00 e0 00 00:01:01.681 READ DMA

My question is "What does this mean exactly and should I be worried/how can I fix it?"

Thanks for a brilliant piece of diagnostic software. I only wish I was good enough to do full justice to it!!

Regards

Richard

Summary of bad (pending) sectors

On August 27th, 2006 Kitty (not verified) says:

Hello,
why doesn't smartctl show a summary of bad or pending sectors? One such message can be found in /var/log/messages like "Aug 27 12:17:51 91-64-143-104-dynip smartd[4483]: Device: /dev/hdb, 7 Currently unreadable (pending) sectors", however, it would be more convenient to get this information directly from smartctl. How can I get this information?

Thank u!!!

When do you replace a disk

On July 31st, 2006 Michael Janich (not verified) says:

I've seen all these attributes and things, but my question
is "when do you replace a disk?" I think that is the only
question a typical sysadmin has.

THANKS

Michael

The day before it fails,

On September 9th, 2007 Dotgain (not verified) says:

The day before it fails, obviously.

Nice written & informative article

On July 30th, 2006 fromport (not verified) says:

Thank you for a nice written and informative article.
I tried it on one of my scsi drives which tends to be busy.
It gave me some other information:
The overall status is ok, should i worry about the errors ?

# smartctl -a /dev/sda|less
smartctl version 5.36 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: MAXTOR ATLAS10K5_147SCA Version: JNZ3
Serial number: D404M6EK
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Jul 31 07:47:36 2006 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 26 C
Manufactured in week 04 of year
Current start stop count: 1074003968 times
Recommended maximum start stop count: 1124401151 times
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   13957055        0         0         0          0      10856.427           0
write:         0        0         0         0          0      21552.894           0

Non-medium error count: 564

Well, this is great

On July 22nd, 2006 PhilG (not verified) says:

Well, this is great information (certainly the parts I understand are.....)

Anyway, I am using SMARTMON to monitor the health of the Seagate drive in my Tivo

The last run produced THIS:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 068 049 006 Old_age - 116459253
3 Spin_Up_Time 0x0003 096 095 000 Old_age - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age - 28
5 Reallocated_Sector_Ct 0x0033 100 100 036 Old_age - 18
7 Seek_Error_Rate 0x000f 083 075 030 Old_age - 223581632
9 Power_On_Hours 0x0032 096 096 000 Old_age - 3778
10 Spin_Retry_Count 0x0013 100 100 097 Old_age - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age - 29
194 Temperature_Celsius 0x0022 047 049 000 Old_age - 47
195 Hardware_ECC_Recovered 0x001a 068 048 000 Old_age - 116459253
197 Current_Pending_Sector 0x0012 100 100 000 Old_age - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age - 0
202 Unknown_Attribute 0x0032 100 253 000 Old_age - 0

********************************************************************************

There are some BIG numbers for attributes 1, 7 and 195.

I do have a fairly good understanding of disk architectures, but I cannot get a handle on what these fields might mean so any assistance would be GREATLY appreciated

Basically, I just want to know whether I have LOTS of errors on this disk or whether I just have a small number of "bad spots" that I am hitting very often

Many thanks

Phil G

with attrib 1,7,195 I find

On September 21st, 2006 Dave Rave (not verified) says:

with attrib 1,7,195
I find this with all my seagate drives
which worries me where I read that part about ata4 standard and drives not keeping the attributes anymore

i think my non-seagate drives are now just too dumb to realise they are failing.
if my seagate drives get that error value down in the 60's, they are going out soonish
not real quick today soon
but the system is just iffy and had to play with
if you get spinrite to run over the drive, it will improve, some, for a while

Cannot get rid of SMART warning on startup

On July 21st, 2006 Mark F. (not verified) says:

This is a great article, and the questions following it make it even better. I now understand what that SMART error warning I get whenever my machine starts up means. Thanks for the great tools too!

Now my question: I get the following error every time the machine starts up (I am paraphrasing a bit):

SMART monitoring error
Please backup your data!
Press F1 to continue

Strangely, over several years I simply ignored this error and dutifully pressed F1. Slackware Linux (my former OS) and now NetBSD 2.0 and 3.0 have worked without any problem.

After reading the article and doing a long test, I have the following error report:
watson:~#smartctl -l selftest /dev/wd0d
smartctl version 5.32 Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
Warning: device does not support Self Test Logging
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 29145 -
# 2 Short offline Completed without error 00% 29144 -

If I immediately give the following command:
watson:~#smartctl -l error /dev/wd0d |sed -n '/Error /p'
Warning: device does not support Error Logging
SMART Error Log Version: 1
ATA Error Count: 133 (device log contains only the most recent five errors)
ER = Error register [HEX]
Error 133 occurred at disk power-on lifetime: 29144 hours (1214 days + 8 hours)
Error 132 occurred at disk power-on lifetime: 29144 hours (1214 days + 8 hours)
Error 131 occurred at disk power-on lifetime: 29144 hours (1214 days + 8 hours)
Error 130 occurred at disk power-on lifetime: 29144 hours (1214 days + 8 hours)
Error 129 occurred at disk power-on lifetime: 29144 hours (1214 days + 8 hours)

The problem is I always have to be around to press F1 whenever the system boots up. Other than that, the disk (and the OSes) seem to work fine. I tried disabling BIOS harddrive monitoring but that did not help. Also disabling smart through smartctl and rebooting but that did not help either. Somehow the disk always remembers the SMART error.

The disk is a Maxtor 91531U3.

Is there any way I can get rid of that SMART warning at startup? Any help would be much appreciated.

Mark

Can I switch off SMART detection using this tool?

On June 27th, 2006 Mike (not verified) says:

I get messages from the BIOS when I switch on the laptop, saying "HDD status bad, back up and replace". I want to stop this message appearing so that Windows will load normally. I can't disable it via the BIOS as it has no such option. Will this tool help me?

Thanks!!

Lifetime

On June 14th, 2006 Denis (not verified) says:

First of all, congratulations on the article.

I've been intrigued by some data shown by smartctl -a, about lifetime. I've read around, and I still have a doubt.

194 Temperature_Celsius 0x0022 043 049 000 Old_age Always - 43 (Lifetime Min/Max 0/20)

How am I supposed to read this lifetime Min/Max?
What does 0/20 mean? Does anyone know?

Thanks

Seagate ST340014A 3.06

On February 19th, 2006 Eugene Dzhurinsky (not verified) says:

First of all - thank you for a great article!
I just have a question - smartctl reports PASSED for my drive, but it also reports
Extended offline Completed: read failure 0% 1895 39965820

Does it mean I just have a bad sector which could be remapped, because the other sections report no errors:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 071 067 006 Pre-fail Always - 182149325
3 Spin_Up_Time 0x0003 099 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 077 060 030 Pre-fail Always - 54648620
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 1895
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 098 098 020 Old_age Always - 2198
194 Temperature_Celsius 0x0022 038 045 000 Old_age Always - 38
195 Hardware_ECC_Recovered 0x001a 071 067 000 Old_age Always - 182149325
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0

Yes, this probably means

On June 10th, 2006 ballen (not verified) says:

Yes, this probably means that your disk has a bad sector. Read the BadBlocksHowTo linked from the smartmontools home page, to see how to identify if there is a file being stored on that bad part of the disk, and how to force the drive to reallocate that sector.

Bruce
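
For readers who want to automate this kind of check, here is a minimal Python sketch that scans attribute rows like the ones posted above and reports the sector-related counters whose raw value is nonzero. The WATCHED set, the function names and the embedded SAMPLE text are illustrative assumptions, not part of smartmontools; in practice the text would be captured from smartctl -A.

```python
# Sketch: flag sector-health attributes in smartctl attribute-table text.
# WATCHED and SAMPLE are illustrative choices, not part of smartmontools.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}

SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""


def parse_attributes(text):
    """Return {attribute_name: raw_value} for each well-formed attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID,
        # which also skips the "ID# ATTRIBUTE_NAME ..." header row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attrs[fields[1]] = int(raw)
    return attrs


def suspect_sectors(attrs):
    """Keep only the watched attributes whose raw count is nonzero."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


attrs = parse_attributes(SAMPLE)
flagged = suspect_sectors(attrs)
print(flagged)
```

With the rows from the comment above, the sketch flags Current_Pending_Sector and Offline_Uncorrectable, which is consistent with a single unreadable sector that has not been reallocated yet.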

worst value

On February 2nd, 2006 Web Hosting Tech (not verified) says:

Great article, many thanks!

I have one thing that I cannot quite understand. If I read it correctly, the value is the current snapshot of what smartctl sees. In the case below, that is 045. The funny thing is it states that the "worst" it has seen is 054.

194 Temperature_Celsius 0x0022 045 054 000

Is the temperature attribute the exception to the rule that the worst value is the "smallest value attained since SMART was enabled on the disk"?

I suppose this would make sense as the worst temperature in a real life system would be a high temperature in most cases. Either that or I am way off base!

This must be a SEAGATE disk.

On June 10th, 2006 ballen (not verified) says:

This must be a SEAGATE disk. Seagate ignores the SMART standard and just stores the temperature (in Celsius) in these variables. So your current disk temperature is 45C and the hottest it has ever been is 54C.

Note: this info can also be found in the smartmontools FAQ page.

Bruce Allen
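
As a small illustration of that reading, the following Python sketch takes an attribute 194 row and returns the VALUE and WORST columns directly as degrees Celsius. It assumes the Seagate behaviour described in the reply above; the function name and dictionary keys are hypothetical, and on drives that follow the normalized-value convention these columns should not be read this way.

```python
def seagate_temperature(line):
    """Read VALUE and WORST of attribute 194 directly as degrees Celsius.

    Assumes the Seagate convention described in this thread; other
    vendors may store normalized values in these columns instead.
    """
    fields = line.split()
    if len(fields) < 5 or fields[0] != "194":
        raise ValueError("not an attribute 194 row")
    # Columns: ID, name, flag, VALUE, WORST, ...
    return {"current_c": int(fields[3]), "hottest_c": int(fields[4])}


temps = seagate_temperature("194 Temperature_Celsius 0x0022 045 054 000")
print(temps)
```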

SMART for SATA drives

On November 2nd, 2005 Tracy R T (not verified) says:

I am running CentOS release 4 with SATA drives on the digital video recorders we are building. I want to utilise the SMART suite but I have found that the SMART daemon fails to start during bootup. Do SATA drives support SMART?

regs TT

Yes, smartmontools supports

On June 10th, 2006 ballen (not verified) says:

Yes, smartmontools supports SATA drives via libata. You need a Linux 2.6.15 or greater kernel. A typical command line is:

smartctl -a -d ata /dev/sda

Starting with release 5.37 smartmontools will also support a SCSI to ATA translation layer (SAT). The code is already in CVS. With this you can also use:

smartctl -a -d sat /dev/sda

The latter form allows extra functionality, for example running selective self-tests.

Bruce

SMART support on SATA drives

On November 12th, 2005 sensovision from WKey (not verified) says:

Unfortunately, right now the official libata library in the kernel doesn't support ATA-passthrough calls, and the only way to check SMART status is to use patches like this: http://www.kernel.org/pub/linux/kernel/people/jgarzik/libata/

Here is the quote from developers of smartmontools:
"Smartmontools should work correctly with SATA drives under both Linux 2.4 and 2.6 kernels, if you use the standard IDE drivers in drivers/ide. If you use the new libata drivers, it won't work correctly because libata doesn't yet support the needed ATA-passthrough ioctl() calls. Jeff Garzik, the libata developer, says that this support will be added to libata in the future. When this happens, we'll add support to smartmontools for a new SATA/libata device type '-d sata'. Typically, to force an SATA disk to run using the standard (non-libata) drivers, you must use the BIOS to select "legacy mode" for the controller. If the IDE driver doesn't support your particular SATA controller, or the controller doesn't have a legacy interface, then only libata can be used. Unless the hard disk controller on the system motherboard is Intel, VIA or nVidia, standard IDE drivers may not work.

Note: an unofficial patch to libata that allows smartmontools to be used with the standard '-d ata' device type was posted to the linux kernel mailing list at the end of August 2004. The patch is included in the libata-dev patchset that can be applied to a recent Linux kernel (>= 2.6.9). With a SATA disk driven by a libata driver, smartmontools can now be used by specifying both the device type 'ata' and the SCSI device corresponding to this disk, for example, smartctl -i -d ata /dev/sda. The patch is still under development and it is probably best to make sure that the disk is idle before trying smartmontools."

http://smartmontools.sourceforge.net/#testinghelp

Hope this helps.

good work

On September 19th, 2005 Guest (not verified) says:

Thanks very much for this article. I feel better when I know how my HD's health is. Good work!

S.M.A.R.T

On January 29th, 2006 Thomas Rice (not verified) says:

Well well - I had an ECS AMD mainboard and activated S.M.A.R.T. ... but actually 2 Seagate hard disks died (the slow way - losing information) ... without SMART telling me that there was a problem 8-)

Unfortunately in the real

On June 10th, 2006 ballen (not verified) says:

Unfortunately in the real world SMART only detects about 2/3 of disk problems. The other 1/3 go undetected until the disk fails.

Bottom line: even with SMART you MUST back up data that you need and can not replace.

Bruce Allen

Kernel I/O Error and SMART test result?

On March 12th, 2006 Anonymous (not verified) says:

I would like to know whether there is any direct relationship between the kernel I/O error reports and the SMART test report.

I have a hard disk in a Linux server for which the kernel reported an I/O Seek Complete Error nearly a year ago. I just left that partition unused, used another hard disk to replace the mount point for that partition, and let the server continue running.

After I read this article, I ran a test with smartctl -a on that "kernel-reported problematic" hard disk.

The result is:
SMART overall-health self-assessment test result: PASSED

What does this mean? Is my hard disk healthy despite the Seek Complete Error?
Or do I not understand what the test result actually means?

First part of the Error report is:
Error 9 occurred at disk power-on lifetime: 7557 hours
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 c5 ee 52 e0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Timestamp Command/Feature_Name
-- -- -- -- -- -- -- -- --------- --------------------
25 00 08 c4 ee 52 e0 00 62302.187 READ DMA EXT
25 00 08 7c ee 52 e0 00 62302.186 READ DMA EXT
35 00 08 c9 8f f4 e0 00 62302.186 WRITE DMA EXT
25 00 08 bc ee 52 e0 00 62302.184 READ DMA EXT
25 00 10 7c 5f 53 e0 00 62302.184 READ DMA EXT

Your disk has one or more

On June 10th, 2006 ballen (not verified) says:

Your disk has one or more unreadable sectors. This does NOT mean that the disk is failing, but it has lost some information on those sectors. Run an extended self test:

smartctl -t long /dev/hda (PATA disk)
smartctl -t long -d ata /dev/sda (SATA disk)

After the test is over, the self-test log (smartctl -l selftest) will show which sector is unreadable. This will probably agree with what is shown in SYSLOG. Then look at the BadBlockHowTo (linked from the smartmontools home page) for instructions on how to identify whether there is a file stored on that bad sector. If you have no data that you need, you can fix the problem by overwriting the bad partition with zeros using dd.

But be careful not to zero out regions of the disk that store data that you need!

Bruce Allen
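
To connect the error log in the question with the "unreadable sector" diagnosis, here is a hedged Python sketch that decodes the register row 40 51 00 c5 ee 52 e0. It assumes the classic 28-bit layout (low nibble of DH, then CH, CL, SN) and uses the conventional ATA bit names for the Status and Error registers; the function and table names are my own. The failing commands here are READ DMA EXT, which use 48-bit addresses, so the decoded sector is only right if the upper address bytes (not shown in this log) are zero.

```python
# Conventional ATA bit names (subset); treat the tables as an assumption
# to be checked against the ATA specification for your drive.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_registers(row):
    """Decode one 'ER ST SC SN CL CH DH' row of hex bytes from the error log."""
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split())
    # LBA28 layout (assumed): DH low nibble = bits 24-27, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error": [name for bit, name in ERROR_BITS.items() if er & bit],
        "status": [name for bit, name in STATUS_BITS.items() if st & bit],
        "lba": lba,
    }


decoded = decode_registers("40 51 00 c5 ee 52 e0")
print(decoded)
```

Under those assumptions the row decodes to an uncorrectable read (UNC) with status DRDY, DSC and ERR at sector 5435077 (0x52eec5). That sector lies inside the 8-sector READ DMA EXT starting at 0x52eec4 listed just before the error, and the DSC bit is where the kernel's "Seek Complete" wording comes from.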

re:S.M.A.R.T

On January 29th, 2006 Rick Paste (not verified) says:

Isn't the BIOS just activating the function, and then some other piece of software has to do the checking? I am not sure whether the BIOS gives you an alert when hard disks are getting damaged.

The BIOS does alert you if

On February 11th, 2006 Anonymous (not verified) says:

The BIOS does alert you if the drive is about to fail. However, if the machine is rarely restarted, we still need the monitoring software of course :-/

re:S.M.A.R.T

On January 29th, 2006 Thomas Rice (not verified) says:

Oh yes - thanks a lot for the information.

Smart Tool - Active

On July 7th, 2005 Anonymous (not verified) says:

Dear Mr. Bruce,

We are using the SMART tools to test the 2.5" Fujitsu HD on one of our embedded cPCI dual P-III boards. We used them because of a problem we ran into: a frozen "black screen" after POST (it happened during OS loading). It happened only with the XP OS. The "black screen" forces the user to reset the card manually and hope it will not happen again on the next power cycle.

The command line used to run the test was smartctl -t long -d ata /dev/hda. The test ran for 40 minutes and fixed the problem, but we have no idea what caused the problem or how this tool solved it if it is only supposed to be a TEST TOOL.

Can you assist with the following:
A. What tests is it running? Read only? Writes? Does it change the HD controller's working parameters?
Can you specify the locations of the data written/read, or are they random?
B. Does it change something in the operating system? If it does, then what?
C. Does it change the disk structure? If yes, how?

Thank you
Shahar

A: SMART extended self test.

On June 10th, 2006 ballen (not verified) says:

A. SMART extended self-test. No, it does not change drive parameters.
B. No, it does not change anything in the OS.
C. It does not change the disk structure. However as the previous responder says, when the self-test is run, the disk firmware may find and correct some types of problems on the disk surface.

Bruce Allen

As far as I know the given co

On October 12th, 2005 Anonymous (not verified) says:

As far as I know, the given command line runs the diagnostic procedure *embedded* in the HD firmware, so it is at the HD manufacturer's discretion what is actually done. What may have happened in your case is that there was a bad sector on the disk that couldn't be read most of the time. When you ran the procedure, the sector may have been read successfully once and then immediately remapped (i.e., its data was moved to the disk's spare area and the sector itself was marked as unavailable).

Other than that, the tool doesn't affect the HD state apart from the SMART monitoring enable/disable flag and the health attribute values.

Regards,
DS.

Thanks for the article. Great

On May 29th, 2005 Anonymous (not verified) says:

Thanks for the article. Great stuff!

Thanks

On May 26th, 2005 RalphGL (not verified) says:

Thanks for this very useful article!

Maxtor HD: Smart not be enabled

On February 7th, 2005 peter farge (not verified) says:

Hello Bruce,

thanks for your great work. I have this Maxtor HD:

=== START OF INFORMATION SECTION ===
Device Model: MAXTOR 4K040H2
Firmware Version: A08.1500
User Capacity: 40,037,760,000 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 5
ATA Standard is: ATA/ATAPI-5 T13 1321D revision 1
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

But I can't enable SMART. I use the HD under Win2000. I have tried:

C:\s\bin>smartctl -s on /dev/hda
smartctl version 5.33 [i386-pc-mingw32] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

It's only a note. The HD works fine and I only want to be informed about the status...

Please try a more recent

On June 10th, 2006 ballen (not verified) says:

Please try a more recent release of smartmontools. There were some teething problems with the initial Windows version, but those have been fixed.

Bruce

Have you checked that it is e

On July 12th, 2005 Chris Clemson (not verified) says:

Have you checked that it is enabled in the BIOS?
Usually there is a setting which controls if SMART is enabled or not.

Re: Monitoring Hard Disks with SMART

On September 8th, 2004 Anonymous says:

I have a drive that I can't even get into with fdisk, but I can access mounted file systems. Its temperature is indicating 53 Celsius, but there is no max-min. It's in my machine next to another drive, and I'm going to try moving them apart. But any thoughts on fdisk not being able to get into the partition table even? It reads the other drive fine.

Re: Monitoring Hard Disks with SMART

On September 19th, 2004 ballen (not verified) says:

I don't have any idea why fdisk can't provide information. What's the command line to and error message from fdisk?

The temperature min/max isn't provided by all vendors. IBM/Hitachi and Toshiba (recent disks) have this, but many other vendors and older disks either have no temperature information, or (as with your disk) just the current temperature.

Re: Monitoring Hard Disks with SMART

On September 7th, 2004 Anonymous says:

excellent tool, using this now on a possibly faulty hd from our debian server

Re: Monitoring Hard Disks with SMART

On September 19th, 2004 ballen (not verified) says:

Thank you!
