I have two hard drives set up as a RAID 1 array on my server (Linux, software RAID using mdadm) and one of them just got me this "present" in syslog:
Nov 23 02:05:29 h2 kernel: [7305215.338153] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:29 h2 kernel: [7305215.338178] ata1.00: irq_stat 0x40000008
Nov 23 02:05:29 h2 kernel: [7305215.338197] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:29 h2 kernel: [7305215.338220] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:29 h2 kernel: [7305215.338221] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:29 h2 kernel: [7305215.338287] ata1.00: status: { DRDY ERR }
Nov 23 02:05:29 h2 kernel: [7305215.338305] ata1.00: error: { UNC }
Nov 23 02:05:29 h2 kernel: [7305215.358901] ata1.00: configured for UDMA/133
Nov 23 02:05:32 h2 kernel: [7305218.269054] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:32 h2 kernel: [7305218.269081] ata1.00: irq_stat 0x40000008
Nov 23 02:05:32 h2 kernel: [7305218.269101] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:32 h2 kernel: [7305218.269125] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:32 h2 kernel: [7305218.269126] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:32 h2 kernel: [7305218.269196] ata1.00: status: { DRDY ERR }
Nov 23 02:05:32 h2 kernel: [7305218.269215] ata1.00: error: { UNC }
Nov 23 02:05:32 h2 kernel: [7305218.341565] ata1.00: configured for UDMA/133
Nov 23 02:05:35 h2 kernel: [7305221.193342] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:35 h2 kernel: [7305221.193368] ata1.00: irq_stat 0x40000008
Nov 23 02:05:35 h2 kernel: [7305221.193386] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:35 h2 kernel: [7305221.193408] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:35 h2 kernel: [7305221.193409] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:35 h2 kernel: [7305221.193474] ata1.00: status: { DRDY ERR }
Nov 23 02:05:35 h2 kernel: [7305221.193491] ata1.00: error: { UNC }
Nov 23 02:05:35 h2 kernel: [7305221.388404] ata1.00: configured for UDMA/133
Nov 23 02:05:38 h2 kernel: [7305224.426316] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:38 h2 kernel: [7305224.426343] ata1.00: irq_stat 0x40000008
Nov 23 02:05:38 h2 kernel: [7305224.426363] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:38 h2 kernel: [7305224.426387] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:38 h2 kernel: [7305224.426388] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:38 h2 kernel: [7305224.426459] ata1.00: status: { DRDY ERR }
Nov 23 02:05:38 h2 kernel: [7305224.426478] ata1.00: error: { UNC }
Nov 23 02:05:38 h2 kernel: [7305224.498133] ata1.00: configured for UDMA/133
Nov 23 02:05:41 h2 kernel: [7305227.400583] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 02:05:41 h2 kernel: [7305227.400608] ata1.00: irq_stat 0x40000008
Nov 23 02:05:41 h2 kernel: [7305227.400627] ata1.00: failed command: READ FPDMA QUEUED
Nov 23 02:05:41 h2 kernel: [7305227.400649] ata1.00: cmd 60/08:00:d8:df:da/00:00:3a:00:00/40 tag 0 ncq 4096 in
Nov 23 02:05:41 h2 kernel: [7305227.400650] res 41/40:08:d8:df:da/00:00:3a:00:00/00 Emask 0x409 (media error) <F>
Nov 23 02:05:41 h2 kernel: [7305227.400716] ata1.00: status: { DRDY ERR }
Nov 23 02:05:41 h2 kernel: [7305227.400734] ata1.00: error: { UNC }
Nov 23 02:05:41 h2 kernel: [7305227.472432] ata1.00: configured for UDMA/133
From what I've read so far, I'm not sure whether read errors alone mean that the drive is dying on me (there have been no write errors so far). When I've had failing drives in the past, the logs always showed failures to write to specific sectors. Not this time.
Should I be replacing the drive? Could something else be causing the problem?
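One thing the log does pin down is where the read fails. Assuming the usual libata layout for the cmd/res dump (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), the 48-bit LBA can be put back together from the six LBA bytes. A rough sketch; taskfile_lba is a made-up helper name, not a standard tool:

```shell
#!/bin/sh
# Sketch: reassemble the 48-bit LBA from a libata taskfile dump.
# Assumed field order: cmd/feature:nsect:lbal:lbam:lbah/
#   hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# taskfile_lba is a hypothetical helper name.
taskfile_lba() {
    # Turn the "/" and ":" separators into spaces so the 12 hex bytes
    # become the positional parameters $1..${12}.
    set -- $(printf '%s' "$1" | tr '/:' '  ')
    # Low three LBA bytes are $4 $5 $6; high three are $9 ${10} ${11}.
    lo=$(( (0x$6 << 16) | (0x$5 << 8) | 0x$4 ))
    hi=$(( (0x${11} << 16) | (0x${10} << 8) | 0x$9 ))
    echo $(( (hi << 24) | lo ))
}

# The "res" value repeated in every ata1.00 error above:
taskfile_lba 41/40:08:d8:df:da/00:00:3a:00:00/00   # prints 987422680
```

All five retries report the same LBA, which suggests one specific unreadable sector rather than errors scattered across the disk.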
I've scheduled a smartctl -t long test that will finish in a couple of hours. I hope this will give me some more info.
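A minimal sketch of how such a test is typically started and read back. /dev/sda is an assumed device name (the post does not say which disk sits behind ata1), and bad_sector_attrs is a hypothetical helper that filters the attribute table down to the counters that track bad sectors:

```shell
#!/bin/sh
# Sketch only. DISK is an assumed device name; substitute the real one.
DISK=/dev/sda

# Hypothetical helper: keep only the SMART attributes that count
# reallocated, pending and uncorrectable sectors.
# Reads `smartctl -A` output on stdin.
bad_sector_attrs() {
    grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
}

# Usage (needs root and smartmontools installed):
#   smartctl -t long "$DISK"                 # start the extended self-test
#   smartctl -l selftest "$DISK"             # read the self-test log afterwards
#   smartctl -A "$DISK" | bad_sector_attrs   # raw values above 0 are a warning sign
```

If the self-test log ends with a read failure, it also reports the LBA of the first error, which can be compared with the sector in the kernel messages.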
UPDATE: Something like a miracle happened. Details below:
I was backing up some files off that machine, preparing to replace the faulty drive. Then, as I was copying those huge files, I got this logcheck email:
Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
Nov 23 17:16:24 h2 kernel: [7359837.963597] end_request: I/O error, dev sdb, sector 1202093816
Nov 23 17:16:41 h2 kernel: [7359855.196334] end_request: I/O error, dev sdb, sector 1202093816
System Events
=-=-=-=-=-=-=
Nov 23 17:14:06 h2 kernel: [7359700.193114] ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Nov 23 17:14:06 h2 kernel: [7359700.193139] ata2.00: irq_stat 0x40000008
Nov 23 17:14:06 h2 kernel: [7359700.193158] ata2.00: failed command: READ FPDMA QUEUED
Nov 23 17:14:06 h2 kernel: [7359700.193180] ata2.00: cmd 60/08:00:58:03:aa/00:00:47:00:00/40 tag 0 ncq 4096 in
Nov 23 17:14:06 h2 kernel: [7359700.193181] res 41/40:08:58:03:aa/00:00:47:00:00/00 Emask 0x409 (media error) <F>
Nov 23 17:14:06 h2 kernel: [7359700.193247] ata2.00: status: { DRDY ERR }
Nov 23 17:14:06 h2 kernel: [7359700.193265] ata2.00: error: { UNC }
Nov 23 17:14:06 h2 kernel: [7359700.194458] ata2.00: configured for UDMA/133
Oops! My hair, if I had some on my shaved head, stood up. See, it's real effing bad sectors on the second drive. Now what? With two faulty drives, what do I do?
I gave it some thought and decided that I:
- Had one drive that I suspect to be faulty
- And another that I'm 100% sure to be faulty with the bad sector complaints in the log.
So I replaced the second one, not the one I originally posted the question about. I had several partitions, each set up as a separate RAID array, and I was hoping that I'd be able to resync at least the root and boot ones, so that I wouldn't have to reinstall everything on the server. I'd probably have to restore the huge data partition from backup, but well, I'd save myself some work.
Replaced the drive, started the resyncs. Root and boot partitions (about 50GB) resynced really fast. No errors. I'm a happy camper!
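For reference, swapping a member out of an mdadm array usually comes down to a few calls per array. The sketch below only prints them instead of running them, since they are destructive. /dev/md0 and /dev/sdb1 are placeholder names, and plan_member_swap is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: print (do not run) the mdadm steps for replacing one array member.
# plan_member_swap is a hypothetical helper; device names are placeholders.
plan_member_swap() {
    md=$1
    part=$2
    echo "mdadm --manage $md --fail $part"     # mark the old member faulty
    echo "mdadm --manage $md --remove $part"   # detach it from the array
    echo "# swap the disk and recreate the same partition layout, then:"
    echo "mdadm --manage $md --add $part"      # add the new member, resync starts
}

# One call per array that has a partition on the replaced disk:
plan_member_swap /dev/md0 /dev/sdb1
```

Review the printed commands and run them by hand as root once the device names are confirmed.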
Just for kicks, let's try resyncing the huge data partition -- it's about 2TB with 500GB of data on it. I started the resync and watched it for a while. It seemed to take forever, and I brought the server online, letting users use their stuff. Resync happening in the background. And, what do you know, about 18 hours later the resync is over with no errors. Server is fully alive now.
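Resync progress can be followed in /proc/mdstat while the server stays in use. A small sketch that keeps only the progress lines; md_progress is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: show only the progress lines of a running resync or recovery.
# md_progress is a hypothetical helper; it reads mdstat-format text on stdin.
md_progress() {
    grep -E '(resync|recovery) *='
}

# Usage on a live system:
#   md_progress < /proc/mdstat
#   watch -n 60 cat /proc/mdstat    # or keep an eye on the whole file
```

No output means no array is currently resyncing or recovering.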
I wonder if I should be replacing the original drive now. I'm sure the server god of hard drives is laughing his butt off at me.
It's not about to die. It's already dead.
Replace it ASAP, and restore from backups if you lose any data.
I can't find a reliable source to back up my opinion, but I really don't think this is hardware damage. It's more of a data-retrieval problem.
If new data is written to the exact location where the read operation failed, that location should then be readable again.
So, as a final note, your current data might not be recoverable on that drive, but since you have a RAID array you can still get your data back from the other drive and make a backup, then format your faulty drive and resynchronize your RAID array.
This kind of problem might be caused by electromagnetic fields altering the contents of the hard drive.
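On Linux md RAID, the rewrite of an unreadable location described above does not have to be done by hand. According to the md documentation, a read error hit during a check pass is handled by fetching the data from the other mirror and writing it back to the failing disk. A minimal sketch using the sysfs interface; md0 is a placeholder array name, and SYSFS_ROOT is a hypothetical override added only so the functions can be tried against a scratch directory:

```shell
#!/bin/sh
# Sketch: ask md to scrub an array so unreadable blocks get rewritten
# from the healthy mirror. Array names passed in are placeholders.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
md_sysfs() {
    echo "${SYSFS_ROOT:-/sys}/block/$1/md"
}

# Start a read-and-compare pass over the whole array (needs root).
start_scrub() {
    echo check > "$(md_sysfs "$1")/sync_action"
}

# Mismatch counter that md reports for the most recent pass.
mismatch_count() {
    cat "$(md_sysfs "$1")/mismatch_cnt"
}

# Usage, e.g. for an array named md0:
#   start_scrub md0
#   mismatch_count md0
```

A sector that still cannot be read after being rewritten, or a growing reallocated-sector count, points back to replacing the drive.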