My laptop has recently started to become a bit unreliable, and for some reason I started to suspect that my HDD was starting to fail. After a bit of hunting on the internet, I found Ubuntu's Disk Utility in the System menu and ran the long SMART diagnostics from this.
However, since the documentation for Disk Utility is very poor (palimpsest
?), I'm not sure how to interpret the results:
For example, the Read Error Rate is over 50 million (!), yet the Assessment is rated "Good".
So would someone mind explaining to me how to interpret the results of these tests (especially the Normalized, Worst, Threshold and Value numbers)? And maybe tell me what they think of the results I got for my HDD? (Thanks)
You have a good description of how SMART works on wikipedia. But a quick intro:
Value: This is the raw value that the controller reports. Usually it's an easy to understand value (like power on hours or temperature), but sometimes it isn't (like the read error rate). Different manufacturers can use different structures and meanings for this data.
Normalized: This is the above value normalized so a higher value is always better. So a 114 in read/error rates is better than 113. Again, how your hard drive converts raw data to normalized value is vendor specific.
Worst: The worst normalized value that your drive had in the past (where 99 is likely the factory setting).
Threshold: When the normalized value is lower than this value the drive is likely to fail.
So, your hard disk seem to be ok. The value of the read error rate is not the times that your drive failed, but some data struct that depends on your disk manufacturer.
Yes, generally the raw value for read error rate is nonsense. The values you want to keep an eye on are the reallocated sector count, pending count, and offline uncorrectable. Those are the count of bad sectors that have been, are waiting to be, or can not be corrected, and the raw values there generally make sense and are the count of sectors.
If reading a sector fails, it becomes pending. The next time you try to write to that sector, the drive attempts to rewrite it, and if that works, everything goes back to normal. If it can not correctly write the sector, then it will reallocate the sector from the spare pool. If it can't do that ( maybe it's used up the spare pool already? ), then it just becomes offline_uncorrectable and trying to read or write to it just errors out.
psusi nails it.
If you read the data sheets (white papers) say at seagate.com you will see how HDD's are made, tested and how they really work. There is no perfect HDD, never was, never will be, (history and fact). In the olden days, we had to enter the bad sectors in to the HDD controller from a list on paper that came in the new drive box, so the controller skips them.
Modern drives have error correction. From day 1 sectors are bad.
So they map them out, this means the drive skips bad sectors. In fact they are "logically swapped out" - the bad sector is mapped to a new, good, spare cylinder sector (it has spare cylinders - think of cylinders as tracks). This is all transparent to the outside world - except for the SMART util.
Each manufacturer can do as they please, so some set the error counts to zero, even though there might be 10 bad sectors as soon as the drive is manufactured.
There is a 3 times rule in the drive's firmware - it reads a sector 3 times and if all 3 times it is bad then it may do a "recalibrate" on the fly, and read 3 more times. If the drive is still not ok it will map that sector to one of the spare sectors. This is deep in the firmware, but happens continually in the background, all transparent to the user.
Whether the manufacturer chooses to report raw errors whenever there are 3 bad reads or after the calibrate is up to them. So like he says above, it's not important unless you have many drives of the same kind and you see some odd trends.
Point 2: all HDD have natural read errors, you can learn that at Seagate too, if you want. but they all have errors on the fly. and are read again, and usually pass the test for CRC errors. if not the DRIVE tries to swap it out. if you run the disk cool, it will last a long time and you many never run out of spare cylinders. but do look at that as psusi tells you !
I'm typing this , on an old PC , running one of the first 1gb HDD ever made. and is still good. (im backed up ) (no lack of cooling ever...) heat is the #1 killer and power surges, I run a UPS. cheers and good day. I hope this helps. (ever seen a DatA General hard disk crash? and fill the room with vast amounts of aluminum wool, curly cues? lots of fun back then... never a dull moment....