I am running a server on Ubuntu 10.04.4 LTS (Linux xxxx 2.6.32-67-server #134-Ubuntu SMP Wed Sep 24 18:55:00 UTC 2014 x86_64 GNU/Linux) with two hard discs in a software raid 1.
I have repeatedly had the issue that the system becomes completely unresponsive for significant amounts of time (>1 hour), effectively taking down the server. The raid keeps the troubled disk in the array, sometimes starting a rebuild. I've had the same issue on three separate machines (same setup).
Is there a simple way to prevent such downtime? The failing disk itself does not bother me that much (they have all been running non-stop for a few years), but the resulting downtime does. I was under the impression that raid 1 would keep the system going even when one hard disk is failing. It would be perfectly fine if the raid controller just kicked the disk from the array and the system kept working. Even better would be if it tried to work out the issues in the background, without freezing. Some performance degradation would also be no issue as long as the system stays operable.
Here's a sample log entry from such an event:
Nov 14 14:00:10 xxxx kernel: [2137088.775542] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 14 14:00:10 xxxx kernel: [2137088.788591] ata2.00: irq_stat 0x40000001
Nov 14 14:00:10 xxxx kernel: [2137088.801879] ata2.00: failed command: READ DMA EXT
Nov 14 14:00:10 xxxx kernel: [2137088.814988] ata2.00: cmd 25/00:80:d1:b9:89/00:00:16:00:00/e0 tag 0 dma 65536 in
Nov 14 14:00:10 xxxx kernel: [2137088.814991] res 51/40:00:d3:b9:89/00:00:16:00:00/e0 Emask 0x9 (media error)
Nov 14 14:00:10 xxxx kernel: [2137088.867197] ata2.00: status: { DRDY ERR }
Nov 14 14:00:10 xxxx kernel: [2137088.880205] ata2.00: error: { UNC }
Nov 14 14:00:10 xxxx kernel: [2137088.906336] ata2.00: configured for UDMA/133
Nov 14 14:00:10 xxxx kernel: [2137088.906345] sd 1:0:0:0: [sdb] Unhandled sense code
Nov 14 14:00:10 xxxx kernel: [2137088.906347] sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 14 14:00:10 xxxx kernel: [2137088.906351] sd 1:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Nov 14 14:00:10 xxxx kernel: [2137088.906356] Descriptor sense data with sense descriptors (in hex):
Nov 14 14:00:10 xxxx kernel: [2137088.906358] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Nov 14 14:00:10 xxxx kernel: [2137088.906367] 16 89 b9 d3
Nov 14 14:00:10 xxxx kernel: [2137088.906371] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Nov 14 14:00:10 xxxx kernel: [2137088.906376] sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 16 89 b9 d1 00 00 80 00
Nov 14 14:00:10 xxxx kernel: [2137088.906385] end_request: I/O error, dev sdb, sector 378124755
Nov 14 14:00:10 xxxx kernel: [2137088.919172] ata2: EH complete
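For anyone reading the log above, here is a rough Python sketch of how the hex fields relate to the reported sector. It assumes the standard 10-byte SCSI READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and treats the trailing "16 89 b9 d3" sense bytes as the failing LBA. The function names are made up for illustration.

```python
def parse_read10(cdb_hex):
    """Return (start_lba, sector_count) from a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command descriptor block")
    start_lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    sector_count = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return start_lba, sector_count


def lba_from_hex(hex_text):
    """Interpret space-separated hex bytes as a big-endian block address."""
    return int.from_bytes(bytes.fromhex(hex_text), "big")


# Values copied from the log entry above.
start, count = parse_read10("28 00 16 89 b9 d1 00 00 80 00")
failed = lba_from_hex("16 89 b9 d3")

print("read starts at sector", start, "and covers", count, "sectors")
print("failing sector", failed, "is at offset", failed - start, "in that read")
```

The decoded failing address equals the sector in the "end_request: I/O error" line, and 128 sectors of 512 bytes matches the "dma 65536" in the failed command line.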
This is the raid setup (cat /proc/mdstat):
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] [linear] [multipath]
md2 : active raid1 sda3[0] sdb3[1]
726266432 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
2104448 blocks [2/2] [UU]
md0 : active raid1 sdb1[1] sda1[0]
4200896 blocks [2/2] [UU]
unused devices: <none>
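As a side note, output in this format can be checked automatically. The following is a minimal Python sketch, not a replacement for mdadm's own monitoring; the function name and the degraded sample text are hypothetical. It only reports an array once md has actually dropped a member, so it would not flag the situation above, where the disk stays in the array and the status remains [UU].

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/2] [UU]": expected/active devices, then one flag per slot.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Two arrays taken from the output above: both healthy.
healthy = """\
md2 : active raid1 sda3[0] sdb3[1]
726266432 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
2104448 blocks [2/2] [UU]
"""

# Hypothetical example of the same array after sdb3 has been dropped.
broken = """\
md2 : active raid1 sda3[0]
726266432 blocks [2/1] [U_]
"""

print(degraded_arrays(healthy))
print(degraded_arrays(broken))
```

On a live system the text would come from reading /proc/mdstat instead of the sample strings.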
Thanks a lot in advance!
You're using software RAID. You don't have a "RAID controller" to "kick the disk from the array". Instead, you've got the kernel managing the ATA controllers, and when the disks don't respond (because they're having media errors, in this case) the kernel waits. This type of situation doesn't always create visible symptoms, but it certainly can.
The simple thing to do is use a hardware RAID controller. Even then, there's still a chance that an oddball edge-case failure could create a visible symptom, but it's a lot less likely. A real hardware RAID controller stands a much better chance of keeping the machine responsive even in the face of media errors.