I have a hard drive that's part of a Linux software RAID 5 array. SMART has reported that its multi_zone_error_rate was 0, then 1, then 3. So I figured I'd better start backing up more frequently and prepare to replace the drive. Now, today, the multi_zone_error_rate of that very same drive is back down to 1. It seems that two errors unhappened while I wasn't looking.
I've also seen similar behaviour by inspecting the syslog on the server:
Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 7 21:01:17 FS1 smartd[25593]: Device: /dev/sde, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 7 21:01:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 200 to 100
Jun 8 02:31:18 FS1 smartd[25593]: Device: /dev/sdg, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sdc, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
Jun 8 03:01:17 FS1 smartd[25593]: Device: /dev/sde, SMART Usage Attribute: 7 Seek_Error_Rate changed from 100 to 200
These are the normalized values that smartd tracks, not the raw counts that smartctl -a also reports, but the behaviour is similar: error rates changing, then undoing the change. None of these are the drive that had the multi_zone weirdness. I haven't seen any problems from the RAID; its most recent scrub (< 24 hours ago) came back totally clean. These are the only SMART values behaving strangely.
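For reference, the two forms are easy to compare with plain smartctl (this assumes a stock smartmontools install; column names may differ slightly between versions):

# full attribute table: VALUE/WORST/THRESH are the normalized figures, RAW_VALUE the underlying counter
smartctl -A /dev/sdc
# just the attributes that have been misbehaving
smartctl -A /dev/sdc | grep -E 'Seek_Error_Rate|Multi_Zone_Error_Rate'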
The only thing I can think of is that the SMART reporting circuitry on the drive isn't working properly all of the time. The cables are seated firmly on both the drive and the board. What's going on here?
Since the measure is called a rate, it may well be expected to go down over time if no further errors occur. You'll need to check the drive's documentation to be sure.
If the measure is "occurrences over time" rather than an absolute count since a particular point, then it will fall once the errors stop occurring. The earlier increase may have been down to a localised change in environmental conditions: a sudden jump in temperature (unusual weather, failed air conditioning) or an increase in vibration (work in the same rack around that time knocking things, minor earth tremors if you are in an area affected by them, or even someone getting angry and shouting at the server). If that temporary change has since reverted and not returned, the rate would drop back down.
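If you want to see whether environment is a factor, one rough approach (just a sketch; adjust the device name, interval, and log path, and note that attribute names vary a little between vendors) is to log the drive temperature alongside the suspect attributes and see whether the swings line up with anything happening around the machine:

# /etc/cron.d/smart-log (system crontab format, so the sixth field is the user)
# every 30 minutes, append a timestamped snapshot of temperature and the suspect attributes
*/30 * * * * root (date; smartctl -A /dev/sdc | grep -E 'Temperature_Celsius|Seek_Error_Rate|Multi_Zone_Error_Rate') >> /var/log/smart-sdc.log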
"error" in SMART reading names does not always imply a permanent and/or unrecoverable error. A seek error could perhaps be due to the drive heads missing their mark due to vibration - in this case the drive's electronics will just re-adjust the position (or leave it to settle) and wait for the disc to spin back around so the target sector is available again. This sort of thing is expected with the very tight timings and precise positioning requirements worked to by modern spinning-disk based drives and small numbers of such errors is not an issue.
It may be that the drive remapped the bad sectors to spare ones and "fixed" the problem. A certain amount of that is perfectly tolerable in a drive.
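Whether any remapping has actually happened shows up in the reallocation-related attributes (substitute the device that showed the multi_zone changes; these are the usual smartctl names, though some vendors label them slightly differently):

smartctl -A /dev/sdX | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

A small, stable Reallocated_Sector_Ct is generally nothing to panic about; a count that keeps climbing, or a non-zero Current_Pending_Sector that never clears, is a better reason to swap the drive than a wobbling rate attribute.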