I got the following message in /var/log/messages:
Jun 25 06:29:27 server.ru smartd[4477]: Device: /dev/sda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 47
Output of smartctl -a /dev/sda:
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 110 088 006 Pre-fail Always - 28526210
3 Spin_Up_Time 0x0003 093 093 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 24
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 087 060 030 Pre-fail Always - 471723621
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2520
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 41
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 068 052 045 Old_age Always - 32 (Lifetime Min/Max 31/35)
194 Temperature_Celsius 0x0022 032 048 000 Old_age Always - 32 (0 27 0 0)
195 Hardware_ECC_Recovered 0x001a 047 045 000 Old_age Always - 105036390
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0
Does this mean that the disk is failing and I have to replace it? Where can I read about the interpretation of S.M.A.R.T. test results?
According to Steve Gibson of SpinRite fame, SMART values have to be tracked over time, not taken as instantaneous readings. That means a value of 47 isn't necessarily bad if the value has been 47 for months. However, if the value was 42 an hour ago and it's climbing rapidly, then the drive is experiencing difficulty accessing part of the data and may soon be unable to read the sector at all. Depending on the value of the data on that drive, you may wish to replace it.
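To make "taken over time" concrete, here is a minimal Python sketch that turns a series of readings into a rate of change. The function names and the 1.0-per-day limit are made up for illustration; nothing here comes from smartmontools, and a sensible limit depends on the drive and the attribute.

```python
from datetime import datetime, timedelta

def change_per_day(samples):
    """samples: list of (datetime, normalized_value), oldest first.

    Returns the average change per day between the first and last sample.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("need at least two samples taken at different times")
    return (v1 - v0) / days

def is_moving_fast(samples, max_change_per_day=1.0):
    """True if the value drifts faster than the (hypothetical) limit."""
    return abs(change_per_day(samples)) > max_change_per_day

now = datetime(2024, 1, 1)
stable = [(now, 47), (now + timedelta(days=90), 47)]   # 47 for months
rapid = [(now, 42), (now + timedelta(hours=1), 47)]    # 42 an hour ago

print(is_moving_fast(stable))
print(is_moving_fast(rapid))
```

In practice the samples would come from periodic smartctl runs or from the smartd log lines like the one quoted in the question.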
A high value for this attribute is actually pretty good:
https://kb.acronis.com/content/9131
First, lower values are worse for SMART, not higher values (notice how the threshold column is always lower than the current value). So, a value increasing is no cause for worry. (This rule does not apply to the raw values, however.)
SMART values tend to oscillate a bit (yours might be right on the edge between 46 and 47, for instance, so even a small change could flip it to the other value).
Your smartctl -a output shows the worst this value has been is 45, so it oscillating slightly above that is normal. For more information, take a look at Wikipedia: ATA S.M.A.R.T. attributes.
Please note that "lower is worse" only applies to the values in the three columns labeled "Value", "Worst" and "Thresh". It does not necessarily apply to the "Raw Value" column, as the values there are not normalised.
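As a rough illustration of comparing the normalised value against its threshold, the following Python sketch parses one row of the attribute table and reports how far the value sits above the threshold. The helper names are hypothetical, and it assumes the ten-column layout shown in the question's output.

```python
def parse_attribute_line(line):
    """Split one row of the smartctl attribute table into a dict."""
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
    # The raw value may contain spaces, so split at most 9 times.
    fields = line.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9],
    }

def margin(attr):
    """How far the normalised value is above its failure threshold."""
    return attr["value"] - attr["thresh"]

row = "195 Hardware_ECC_Recovered 0x001a 047 045 000 Old_age Always - 105036390"
attr = parse_attribute_line(row)
print(attr["name"], "margin:", margin(attr))
```

A margin that shrinks toward zero is the thing to watch; the raw column is kept as a string because its meaning is vendor specific.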
Keep in mind that even the extensive study that Google conducted found that a large number of drive failures were not predicted by SMART errors. It's possible that what you see is perfectly normal, but as each manufacturer has different metrics for converting the raw values into the reported values, it is hard to say for sure whether your drive is experiencing a lot of errors or not. However, a raw number that large does strike me as odd.
I would recommend reading the whole drive (with dd, or by rsyncing to a new drive) and checking the SMART values as it goes along. If you see that raw number, or the reported values, change a lot, I'd start looking to replace the drive.
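A minimal Python sketch of that "read everything and check as you go" idea is below. The function name, chunk size and checkpoint interval are all assumptions for illustration; on a real system the stream would be something like the block device opened read-only as root, and the callback would run smartctl and record the attributes.

```python
import io

def read_with_checkpoints(stream, on_checkpoint,
                          chunk_size=1024 * 1024, every_bytes=1024 ** 3):
    """Read stream to the end, calling on_checkpoint(total_bytes_read)
    roughly every every_bytes bytes. Returns the total bytes read."""
    total = 0
    next_checkpoint = every_bytes
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total >= next_checkpoint:
            on_checkpoint(total)
            # Skip past any checkpoints a large chunk may have jumped over.
            while next_checkpoint <= total:
                next_checkpoint += every_bytes
    return total

# Demo on an in-memory stream instead of a real disk.
seen = []
total_read = read_with_checkpoints(io.BytesIO(b"x" * 10), seen.append,
                                   chunk_size=2, every_bytes=4)
print(total_read, seen)
```

The read itself is what exercises the drive; the checkpoints only give you a series of SMART readings to compare afterwards.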
IIRC, Hardware_ECC_Recovered counts error corrections on disk reads, which aren't unusual for a disk; drives encode the data with error correction mechanisms for precisely this reason. Some controllers also support redundant information in disk sectors and add another layer of error correction.
As Dave Cheney states, the figures should be monitored over time. Radical changes in these statistics are an indication of a failing drive. Also, keep an eye on grown defect lists: if the grown defect list starts to grow, or the SMART statistics start to change significantly, then you should prophylactically replace the drive.
Nothing wrong with it.
You can always run
Then after a few hours query its result
just to be sure.
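One way to do this, assuming an extended SMART self-test is what is meant here, is smartctl -t long /dev/sda to start the test and smartctl -l selftest /dev/sda to read the self-test log afterwards. A small Python sketch wrapping those two calls follows; the function names are hypothetical, and the device path is taken from the question.

```python
import subprocess

def selftest_start_cmd(device):
    """Command that starts an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]

def selftest_log_cmd(device):
    """Command that prints the SMART self-test log."""
    return ["smartctl", "-l", "selftest", device]

def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

# Usage (needs root and smartmontools installed):
#   print(run(selftest_start_cmd("/dev/sda")))
#   ... wait for the test to finish, then:
#   print(run(selftest_log_cmd("/dev/sda")))
```

The test runs inside the drive in the background, so the machine stays usable while it completes.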