I just rebooted my monitoring server for the first time in a while, and the following started filling the screen:
Jul 11 23:52:30 monit kernel: [ 25.255908] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 11 23:52:30 monit kernel: [ 25.256170] ata1.00: BMDMA stat 0x24
Jul 11 23:52:30 monit kernel: [ 25.256278] ata1.00: failed command: READ DMA
Jul 11 23:52:30 monit kernel: [ 25.256410] ata1.00: cmd c8/00:c0:20:68:35/00:00:00:00:00/e0 tag 0 dma 98304 in
Jul 11 23:52:30 monit kernel: [ 25.256416] res 51/40:9f:41:68:35/00:00:00:00:00/e0 Emask 0x9 (media error)
Jul 11 23:52:30 monit kernel: [ 25.256809] ata1.00: status: { DRDY ERR }
Jul 11 23:52:30 monit kernel: [ 25.256933] ata1.00: error: { UNC }
Jul 11 23:52:30 monit kernel: [ 25.304388] ata1.00: configured for UDMA/66
Jul 11 23:52:30 monit kernel: [ 25.304430] ata1: EH complete
. . .
Jul 11 23:52:30 monit kernel: [ 25.552451] sd 0:0:0:0: [sda] Unhandled sense code
Jul 11 23:52:30 monit kernel: [ 25.552462] sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 11 23:52:30 monit kernel: [ 25.552475] sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Jul 11 23:52:30 monit kernel: [ 25.552490] Descriptor sense data with sense descriptors (in hex):
Jul 11 23:52:30 monit kernel: [ 25.552498] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jul 11 23:52:30 monit kernel: [ 25.552529] 00 35 68 41
Jul 11 23:52:30 monit kernel: [ 25.552543] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
Jul 11 23:52:30 monit kernel: [ 25.552559] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 00 35 68 20 00 00 c0 00
Jul 11 23:52:30 monit kernel: [ 25.552587] end_request: I/O error, dev sda, sector 3500097
Jul 11 23:52:30 monit kernel: [ 25.556607] ata1: EH complete
I already know I need to replace the HDD (Cost of Data > Cost of HDD), but I want to know for my own knowledge what's actually wrong with it.
Yes, our monitoring server has no RAID, just one HDD... Don't look at me...
Looks like the drive has bad sectors and is unable to reallocate these (possibly because it's run out of spare sectors). The output of
smartctl -a /dev/sda
would give you more information on the state of the drive.
Lassie's saying "arf! arf arf! arf!". Which is dumb, because this has nothing to do with Timmy or wells. This is why you don't take sysadmin advice from dogs.
The drive is giving you an "Unrecovered read error - auto reallocate failed", which basically means "I tried to read, I failed, I tried to recover (read the sector a few more times, apply some ECC, and move the data to a sector that isn't broken), and it didn't work". This probably means (as mgorven says) that the disk is chock full of reallocated sectors already, because the disk's been dying for a while, but I also think it can mean that it wasn't able to recover the sector at all (repeated reads + ECC failed to get a good-looking data block).
Either way, yeah, the drive's very, very cactus. Your data isn't looking real healthy, either.
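For anyone who wants to see where the numbers in the log come from, here is a minimal Python sketch that decodes the Read(10) CDB and the last four bytes of the sense data quoted in the question. The 512-byte sector size is an assumption (it is consistent with the "dma 98304 in" figure in the log), and the function name is only illustrative.

```python
# Values copied from the kernel log in the question.
CDB_HEX = "28 00 00 35 68 20 00 00 c0 00"  # sd ... CDB: Read(10)
SENSE_INFO_HEX = "00 35 68 41"             # last four bytes of the sense data
SECTOR_SIZE = 512                          # assumption: 512-byte logical sectors


def decode_read10(cdb_hex):
    """Return (start_lba, sector_count) for a SCSI Read(10) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a Read(10) CDB")
    start_lba = int.from_bytes(cdb[2:6], "big")  # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")      # bytes 7-8: transfer length
    return start_lba, count


start_lba, count = decode_read10(CDB_HEX)
failed_lba = int.from_bytes(bytes.fromhex(SENSE_INFO_HEX), "big")

print("request starts at sector", start_lba)
print("request length:", count, "sectors =", count * SECTOR_SIZE, "bytes")
print("failing sector:", failed_lba)
print("offset of failure within the request:", failed_lba - start_lba, "sectors")
```

The decoded failing sector matches the "end_request: I/O error, dev sda, sector 3500097" line, so the kernel asked for a 192-sector read and the drive gave up on one sector partway through it.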
I know this is old, but just in case someone is still reading this post: regarding "DD will also try to read the broken sector(s)", gddrescue is useful here. It doesn't keep retrying them (okay, it does read them, but only once).
Make a dd image or rsync copy of that disk now++, unless you have a full backup allowing a convenient restore of that box. And start looking for a compatible and working replacement disk.
BTW, UDMA/66, is that a ten year old PATA disk?
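To illustrate the idea behind imaging a failing disk (read in large blocks, drop to single sectors when a block fails, zero-fill what cannot be read, and keep going), here is a rough Python sketch. It is not how dd or gddrescue are implemented, and it is no substitute for them; `rescue_copy` and `FlakySource` are hypothetical names, and `FlakySource` is only a stand-in for a bad disk so the sketch can be tried without real hardware.

```python
import io


def rescue_copy(src, dst, total_size, block_size=64 * 1024, sector_size=512):
    """Copy total_size bytes from src to dst, zero-filling unreadable sectors.

    Returns the byte offsets of the sectors that could not be read.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        want = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            data = src.read(want)
            if len(data) != want:
                raise OSError("short read")
        except OSError:
            # The big read failed: retry the same range one sector at a time.
            data = bytearray()
            for sec in range(offset, offset + want, sector_size):
                n = min(sector_size, offset + want - sec)
                try:
                    src.seek(sec)
                    chunk = src.read(n)
                    if len(chunk) != n:
                        raise OSError("short read")
                except OSError:
                    chunk = b"\0" * n  # placeholder for lost data
                    bad_offsets.append(sec)
                data += chunk
        dst.seek(offset)
        dst.write(data)
        offset += want
    return bad_offsets


class FlakySource(io.BytesIO):
    """Hypothetical test double: raises OSError for reads touching a bad range."""

    def __init__(self, data, bad_start, bad_end):
        super().__init__(data)
        self.bad_start = bad_start
        self.bad_end = bad_end

    def read(self, n=-1):
        start = self.tell()
        if start < self.bad_end and start + n > self.bad_start:
            raise OSError(5, "Input/output error")
        return super().read(n)
```

On a real box you would point `src` at something like `open("/dev/sda", "rb", buffering=0)` and `dst` at an image file on a different disk, but in practice gddrescue is the better choice because it keeps a log of what it has already recovered.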
As already mentioned, it likely means your drive is nearing its end of life, but not necessarily immediately. You should run an fsck on the disk and try to repair the errors (see the smartmontools wiki for advice on fixing bad blocks), and the disk may be OK for a while longer.
But you should start running smartd (which comes as part of the smartmontools package) and keep an eye on its reports and/or set up email notifications. You can also add custom notifications of your own by creating scripts (in /etc/smartmontools/run.d/) that are called by the smartd-runner.
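As a rough idea of what a check of your own might look at, here is a Python sketch that pulls a few sector-health counters out of the attribute table printed by smartctl -A. The function names and the choice of watched attributes are assumptions for illustration; attribute names vary between drive vendors, and smartctl normally needs root.

```python
import subprocess

# Attributes commonly associated with failing sectors (names vary by vendor).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Extract raw values of the watched attributes from `smartctl -A` output."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit():
                found[fields[1]] = int(raw)
    return found


def read_smart_attributes(device="/dev/sda"):
    """Run smartctl on the device and return the watched raw values."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)
```

Any non-zero value for these counters, and especially one that keeps growing, is a good reason to schedule the replacement sooner rather than later.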