We have a Linux server that has been in heavy use for three years. We're running a number of virtualized servers on it, some of which have not been well behaved, and for a significant time the server's I/O capacity was exceeded, leading to bad iowait. It has four 500 GB Barracuda SATA drives connected to a 3Com RAID controller. One drive holds the OS, and the other three are set up as RAID-5.
Now we have a debate as to the condition of the drives and whether they are actively failing.
Here's a portion of the output for one of the four disks; they all show relatively similar statistics:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       169074425
  3 Spin_Up_Time            0x0003   095   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       200009354607
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27856
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   060   045    Old_age   Always       -       29 (Lifetime Min/Max 26/37)
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   046   033   000    Old_age   Always       -       169074425
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
My interpretation of this is that we have not had any bad sectors or other indications that any of the drives are actively failing.
However, the high Raw_Read_Error_Rate and Seek_Error_Rate values are being pointed to as indications that the drives are dying.
For Seagate disks (and possibly some old ones from WD too), the Seek_Error_Rate and Raw_Read_Error_Rate are 48-bit numbers, where the most significant 16 bits are an error count and the low 32 bits are the number of operations.
So your disk has performed 2440858991 seeks, of which 46 failed. My experience with Seagate drives is that they tend to fail when the number of errors goes over 1000. YMMV.
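If you want to do the split yourself, a bit of shell arithmetic is enough (a sketch; the raw value is the Seek_Error_Rate figure from the output above):

    # Split Seagate's 48-bit raw value: high 16 bits = error count, low 32 bits = operation count
    RAW=200009354607
    echo "errors:     $(( RAW >> 32 ))"          # -> 46
    echo "operations: $(( RAW & 0xFFFFFFFF ))"   # -> 2440858991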
The "seek error rate" and "raw read error rate" RAW_VALUES are virtually meaningless for anyone but Seagate's support. As others pointed out, raw values of parameters like "reallocated sector count" or entries in the drive's error log are more likely to indicate a higher probability of failure.
But you can take a look at the interpreted data in the VALUE, WORST and THRESH columns, which are meant to be read as gauges:

  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       200009354607

Meaning that your seek error rate is currently considered to be "77% good" and will be reported as a problem by SMART when it reaches "30% good". It had been as low as "60% good" at one point, but has magically recovered since. Note that the interpreted values are calculated internally by the drive's SMART logic; the exact calculation may or may not be published by the manufacturer and typically cannot be tweaked by the user.
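If you want a quick automated check of those gauges, something along these lines should work (a sketch; /dev/sda is an assumed device name, and the column positions match standard smartctl -A output):

    # Print any attribute whose normalised VALUE has dropped to or below its THRESH
    # (a THRESH of 000 means the attribute is not used for failure prediction, so skip those)
    smartctl -A /dev/sda | awk '$1 ~ /^[0-9]+$/ && $6+0 > 0 && $4+0 <= $6+0 {print "tripped:", $2, "value=" $4, "thresh=" $6}'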
Personally, I consider a drive with error log entries to be "failing" and urge replacement as soon as they appear. All in all, though, SMART data has turned out to be a rather weak indicator for failure prediction, as a research paper published by Google uncovered.
In my experience, Seagates have weird numbers for those two SMART attributes. When diagnosing a Seagate I tend to ignore those and look more closely at other fields like Reallocated Sector Count. Of course, when in doubt replace the drive, but even brand new Seagates will have high numbers for those attributes.
I realize this discussion is a bit old, but I want to add my 2 cents. I have found the SMART information to be quite a good pre-failure indicator. When a SMART threshold is tripped, replace the drive; that is what those thresholds are for.
The vast majority of the time you will start to see bad sectors; that is a sure sign the drive is starting to fail. SMART has saved me many times. I use software RAID 1, and it's very helpful since you simply replace the failing drive and rebuild the array.
I also run short and long self-tests weekly.
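For reference, running and checking the tests by hand looks something like this (a sketch; /dev/sdb is an assumed device name):

    smartctl -t short /dev/sdb     # quick test, usually a couple of minutes
    smartctl -t long /dev/sdb      # full surface scan, can take hours
    smartctl -l selftest /dev/sdb  # show the results once a test has finished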
Or add it to /etc/smartd.conf and have smartd email you if there are errors.
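An smartd.conf entry along these lines should do both, i.e. schedule the tests and mail on trouble (a sketch; the device name and address are assumptions, so check man smartd.conf for the exact directives on your version):

    # /etc/smartd.conf: monitor all attributes (-a), run a short self-test daily at 02:00
    # and a long one every Saturday at 03:00 (-s), and mail this address on problems (-m)
    /dev/sdb -a -s (S/../.././02|L/../../6/03) -m admin@example.com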
Make sure to install logwatch, redirect root's mail to a real email address, and check the daily logwatch emails. Tripped smartd flags will show up there, but that's of no help if nobody is monitoring them regularly.
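If root's mail isn't redirected yet, a minimal way to do it (a sketch; the address is an assumption, and it assumes a sendmail-compatible MTA that reads /etc/aliases):

    echo 'root: admin@example.com' >> /etc/aliases   # forward root's mail to a monitored mailbox
    newaliases                                       # rebuild the aliases database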
Sorry to commit necromancy on this post, but in my experience the "Raw Read Error Rate" and "Hardware ECC Recovered" fields on a Seagate drive will quite literally go all over the place, incrementing constantly into the trillions, at which point they cycle back around to zero and start again. I have a Seagate ST9750420AS that has behaved that way since day one and still works great after quite a few years and 3500+ hours of use.
I think those fields can be safely ignored in your case. Just make sure the two fields report the same number and stay in sync. If they don't... well, that might actually indicate a problem.
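A rough way to check that (a sketch; /dev/sda is an assumed device name, and it assumes the raw values of attributes 1 and 195 are plain numbers as in the output above):

    # Compare the raw values of Raw_Read_Error_Rate (1) and Hardware_ECC_Recovered (195)
    smartctl -A /dev/sda | awk '$1 == 1 {rre = $10} $1 == 195 {ecc = $10} END {print (rre == ecc ? "in sync" : "out of sync: " rre " vs " ecc)}'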
To automate the calculations in this answer, use the online JavaScript calculator at:

https://yksi.ml/

It will tell you the error count and the number of operations encoded in the raw value. The calculator is valid for Seagate drives. For further reading on how the normalised values (between 0 and 100) are calculated, see this article.
Add the flags -v 1,raw24/raw32 -v 7,raw24/raw32 so that attributes 1 & 7 (Raw_Read_Error_Rate & Seek_Error_Rate) are interpreted as consisting of a 24-bit error count and a 32-bit total count.
-v stands for --vendorattribute=. Specifying raw24/raw32 is a way to tell smartctl to interpret and display the raw information according to a common format; see the man page.
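Putting that together, the call would look something like this (a sketch; /dev/sda is an assumed device name):

    # Show attributes 1 and 7 with their raw values split into a 24-bit and a 32-bit part
    smartctl -v 1,raw24/raw32 -v 7,raw24/raw32 -A /dev/sda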
The Seagate manual here describes the meaning of each attribute's 7 raw bytes.
Yes, those fields look bad, but I no longer trust the info reported by SMART (my test machine has a drive that should have been dead a long time ago if you go by the smartctl data). The fact is that you have reported high iowait and the drives are three years old. That should be enough reason to change the drives.