My story starts out quite simply. I have a light-duty server, running Arch Linux, which stores most of its data on a RAID-1 composed of two SATA drives. It was working without any problems for about four months. Then, suddenly, I started getting read errors on one of the drives. The messages always looked a lot like these:
Apr 18 00:20:15 hope kernel: [307085.582035] ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Apr 18 00:20:15 hope kernel: [307085.582040] ata5.01: failed command: READ DMA EXT
Apr 18 00:20:15 hope kernel: [307085.582048] ata5.01: cmd 25/00:08:08:6a:34/00:00:27:00:00/f0 tag 0 dma 4096 in
Apr 18 00:20:15 hope kernel: [307085.582050] res 51/40:00:0c:6a:34/40:00:27:00:00/f0 Emask 0x9 (media error)
Apr 18 00:20:15 hope kernel: [307085.582053] ata5.01: status: { DRDY ERR }
Apr 18 00:20:15 hope kernel: [307085.582056] ata5.01: error: { UNC }
Apr 18 00:20:15 hope kernel: [307085.621301] ata5.00: configured for UDMA/133
Apr 18 00:20:15 hope kernel: [307085.640972] ata5.01: configured for UDMA/133
Apr 18 00:20:15 hope kernel: [307085.640986] sd 4:0:1:0: [sdd] Unhandled sense code
Apr 18 00:20:15 hope kernel: [307085.640989] sd 4:0:1:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 18 00:20:15 hope kernel: [307085.640993] sd 4:0:1:0: [sdd] Sense Key : Medium Error [current] [descriptor]
Apr 18 00:20:15 hope kernel: [307085.640998] Descriptor sense data with sense descriptors (in hex):
Apr 18 00:20:15 hope kernel: [307085.641001] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 18 00:20:15 hope kernel: [307085.641010] 27 34 6a 0c
Apr 18 00:20:15 hope kernel: [307085.641020] sd 4:0:1:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
Apr 18 00:20:15 hope kernel: [307085.641023] sd 4:0:1:0: [sdd] CDB: Read(10): 28 00 27 34 6a 08 00 00 08 00
Apr 18 00:20:15 hope kernel: [307085.641027] end_request: I/O error, dev sdd, sector 657746444
Apr 18 00:20:15 hope kernel: [307085.641035] ata5: EH complete
Apr 18 00:20:15 hope kernel: [307085.641672] md/raid1:md16: read error corrected (8 sectors at 657744392 on sdd1)
Apr 18 00:20:17 hope kernel: [307087.505082] md/raid1:md16: redirecting sector 657742336 to other mirror: sdd1
Each error complained of a different sector number, and was accompanied by a several-second delay for the user (me) accessing the disk.
I checked the smartctl output and saw the following (irrelevant parts clipped):
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   193   193   051    Pre-fail  Always       -       1606
  5 Reallocated_Sector_Ct   0x0033   194   194   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   162   162   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       51
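For reference, the attribute table above is just smartctl's attribute view; roughly (assuming the drive is /dev/sdd, as in the kernel log):
smartctl -A /dev/sdd    # SMART attribute table only
smartctl -a /dev/sdd    # full report (the output I mention having saved below)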
Looking back in the logs, I found that the errors had actually been happening for a few days, mostly during backups, but also frequently during very light use (meaning about every fifth time I tried to save a text file). I concluded that my disk was dying, that the RAID-1 was handling it appropriately, and that it was time to order a replacement disk, which I did.
Much to my surprise, a day later, the errors... stopped. I had done nothing to fix them. I hadn't rebooted, hadn't taken the drive offline, nothing. But the errors just stopped.
At that point, curious to see whether the bad sectors were just in idle portions of the disk now, I took the disk out of the RAID, put it back in the RAID, and allowed it to complete the ensuing full resync. The resync completed without any errors, 9 hours later (2TB disks take a little while).
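For reference, taking a member out of and back into the array is ordinary mdadm management; something along these lines (a sketch, using the md16/sdd1 names from the kernel log above):
mdadm /dev/md16 --fail /dev/sdd1      # mark the member as faulty
mdadm /dev/md16 --remove /dev/sdd1    # remove it from the array
mdadm /dev/md16 --add /dev/sdd1       # add it back; a full resync starts
cat /proc/mdstat                      # watch the resync progress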
Also, the smartctl output had changed a bit, as follows:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   193   193   051    Pre-fail  Always       -       1606
  5 Reallocated_Sector_Ct   0x0033   194   194   140    Pre-fail  Always       -       43
196 Reallocated_Event_Count 0x0032   162   162   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
So, the part of this that's weirding me out is, of course, "Since when do bad disks fix themselves?"
I suppose it's possible that a very small area of the drive spontaneously went bad, and that the drive simply took 3 days (!) before its sector reallocation code kicked in and it mapped some spare sectors over a bad area of the disk... But I can't say that I've ever seen that happen.
Has anyone else seen this kind of behavior? If so, what was your experience with the drive afterward? Did it happen again? Did the disk eventually fail completely? Or was it just an unexplained glitch that remained unexplained?
In my case, I already have the replacement drive (obtained under warranty), so I'll probably just replace the drive anyway. But I'd love to know if I misdiagnosed this somehow. If it helps, I have the complete 'smartctl -a' output from when the problem was happening. It's just a bit long, so I didn't post it here.
If one specific physical region of the drive surface goes bad, then until those sectors can be successfully mapped out, you'll get unrecovered read errors when you try to read any data that was written to that area. The drive knows that the sectors are bad (after the failures to access the sectors) but cannot remap the sectors because they already hold data. If you format the drive or overwrite the "bad" sectors, then the drive will have an opportunity to map out the bad sectors.
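For example, if you know the failing LBA (the kernel log above reports sector 657746444 on sdd), one way to give the firmware that opportunity is to overwrite just that sector; a sketch, assuming 512-byte logical sectors, and obviously destructive to whatever data is stored there:
dd if=/dev/zero of=/dev/sdd bs=512 count=1 seek=657746444 oflag=direct   # overwrite the suspect sector
hdparm --read-sector 657746444 /dev/sdd                                  # confirm it now reads back cleanly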
Once the bad sectors are mapped out, and as long as more of the drive surface does not fail, you're in good shape.
I don't know enough about drive failure models of current drives to know if there's much correlation between one part of the media surface going bad and the problem spreading or occurring again. If there is no correlation, then once the bad sectors get mapped out, you're in good shape. If there is a correlation, then this is the beginning of the end for the drive.
Most modern drives will "vector out" a block that has gone bad. The drive has a pool of spare blocks and the firmware uses these to replace any blocks that are known to the drive to be bad. The drive cannot do this re-mapping when it fails to READ a block because it cannot supply the correct data. It just returns "read error". It does MARK the block as bad, so if the block ever does read correctly then the block is vectored out and the correct data written to the replacement block. If the OS ever WRITES to a block that is in a "vector out pending" state then the block is vectored out and the data written to the replacement block.
Linux software RAID will, on getting a read error from a device, fetch the correct data from the other elements in the array and then try to WRITE the bad block again. So, if the write works, the data is safe; if not, the drive does just what was described above, vectors out the block, and then performs the write. So, with the help of the RAID system, the drive has just repaired itself!
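If you want to exercise that repair path deliberately, you can ask md to scrub the whole array; a sketch, assuming the array is md16 as in the question's logs:
echo check > /sys/block/md16/md/sync_action   # read every sector of every member; read errors get rewritten from the good copy
cat /proc/mdstat                              # watch the check progress
cat /sys/block/md16/md/mismatch_cnt           # how many mismatches the check found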
Assuming such events are reasonably rare, it is probably safe to carry on. If too many replacement blocks are being used, though, the drive may have a problem. There is a limit to how many bad blocks can be vectored out to spare blocks, and that limit is a function of the drive.
Yes, I have seen this as well, and under very similar circumstances. In my case, it was a "consumer-grade" Western Digital 1TB "Green" drive (WD10EARS) that pulled that stunt on me. The SMART Current_Pending_Sector raw value went from zero to 6, and then to 8, prompting the SMART monitoring daemon to send me some ominous emails.
I mdadm --fail'ed and --remove'd the drive from the array and ran a non-destructive pass of badblocks over it, and yes, there were apparently over two dozen bad blocks. When the replacement drive arrived about a day later, I fixed the array, and life went on.
Shortly thereafter, out of sheer boredom, I reran badblocks on the "failed" drive to see if it had worsened. On the contrary, the drive had completely "repaired" itself: zero bad blocks! Shaking my head, I wiped it and set it aside to be recycled or donated.
The lesson: Don't use consumer-grade drives in servers, unless you are willing and able to put up with all manner of weirdness and unreliability. Corollary: don't cheap out on server components, because you'll eventually end up paying for them anyway, in extra time and aggravation.
It is common practice in server environments to discard drives that have ever shown such errors, fixed or not. Sector hard errors can be a sign of physical surface damage to the medium, and if you scratch a surface you usually either displace material to the sides of the scratch, leaving a burr higher than the plane of the surface, or produce abrasive dust (glass!). Both tend to be rather damaging to a mechanical system that relies on a very thin air cushion between two surfaces assumed to be perfectly smooth; that's why medium errors, once they start reaching a certain count, tend to multiply rather quickly.