I have these lines in /var/log/syslog:
Apr 18 16:53:05 Server kernel: [4487878.816036] ata4: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Apr 18 16:53:05 Server kernel: [4487878.816058] ata4: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Apr 18 16:53:05 Server kernel: [4487878.816059] dhfis 0x1 dmafis 0x1 sdbfis 0x0
Apr 18 16:53:05 Server kernel: [4487878.816093] ata4: ATA_REG 0x40 ERR_REG 0x0
Apr 18 16:53:05 Server kernel: [4487878.816108] ata4: tag : dhfis dmafis sdbfis sacitve
Apr 18 16:53:05 Server kernel: [4487878.816125] ata4: tag 0x0: 1 1 0 1
Apr 18 16:53:05 Server kernel: [4487878.816150] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
Apr 18 16:53:05 Server kernel: [4487878.816200] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)
Apr 18 16:53:07 Server kernel: [4487880.416773] ata4.00: limiting speed to UDMA/133:PIO2
Apr 18 16:53:11 Server kernel: [4487884.676024] ata4: hard resetting link
Apr 18 16:53:11 Server kernel: [4487884.676027] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:12 Server kernel: [4487885.144032] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled
Apr 18 16:53:17 Server kernel: [4487890.144023] ata4: hard resetting link
Apr 18 16:53:17 Server kernel: [4487891.024033] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:17 Server kernel: [4487891.033357] ata4.00: ATA-8: WDC WD20EARS-00S8B1, 80.00A80, max UDMA/133
Apr 18 16:53:17 Server kernel: [4487891.033360] ata4.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 31/32)
Apr 18 16:53:17 Server kernel: [4487891.048347] ata4.00: configured for UDMA/133
Apr 18 16:53:17 Server kernel: [4487891.048361] sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 18 16:53:17 Server kernel: [4487891.048365] sd 3:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
Apr 18 16:53:17 Server kernel: [4487891.048369] Descriptor sense data with sense descriptors (in hex):
Apr 18 16:53:17 Server kernel: [4487891.048371] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 18 16:53:17 Server kernel: [4487891.048378] 00 00 00 00
Apr 18 16:53:17 Server kernel: [4487891.048382] sd 3:0:0:0: [sdc] Add. Sense: No additional sense information
Apr 18 16:53:17 Server kernel: [4487891.048385] sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 e8 e0 88 00 00 00 08 00
Apr 18 16:53:17 Server kernel: [4487891.048393] end_request: I/O error, dev sdc, sector 3907028992
Apr 18 16:53:17 Server kernel: [4487891.048420] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048440] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048458] end_request: I/O error, dev sdc, sector 3907028992
Apr 18 16:53:17 Server kernel: [4487891.048477] md: super_written gets error=-5, uptodate=0
Apr 18 16:53:17 Server kernel: [4487891.048482] raid5: Disk failure on sdc, disabling device.
Apr 18 16:53:17 Server kernel: [4487891.048483] raid5: Operation continuing on 3 devices.
Apr 18 16:53:17 Server kernel: [4487891.048525] ata4: EH complete
Apr 18 16:53:17 Server kernel: [4487891.048554] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048576] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048596] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.
Apr 18 16:53:17 Server kernel: [4487891.048624] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048643] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048663] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048681] sd 3:0:0:0: [sdc] READ CAPACITY failed
Apr 18 16:53:17 Server kernel: [4487891.048683] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048685] sd 3:0:0:0: [sdc] Sense not available.
Apr 18 16:53:17 Server kernel: [4487891.048689] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048709] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048800] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048860] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.049028] sd 3:0:0:0: [sdc] Asking for cache data failed
Apr 18 16:53:17 Server kernel: [4487891.049048] sd 3:0:0:0: [sdc] Assuming drive cache: write through
Apr 18 16:53:17 Server kernel: [4487891.049071] sdc: detected capacity change from 2000398934016 to 0
Apr 18 16:53:17 Server kernel: [4487891.049080] ata4.00: detaching (SCSI 3:0:0:0)
Apr 18 16:53:18 Server kernel: [4487891.061149] sd 3:0:0:0: [sdc] Stopping disk
Apr 18 16:53:18 Server kernel: [4487891.485492] RAID5 conf printout:
Apr 18 16:53:18 Server kernel: [4487891.485496] --- rd:4 wd:3
Apr 18 16:53:18 Server kernel: [4487891.485500] disk 0, o:1, dev:sdb
Apr 18 16:53:18 Server kernel: [4487891.485502] disk 1, o:0, dev:sdc
Apr 18 16:53:18 Server kernel: [4487891.485504] disk 2, o:1, dev:sdd
Apr 18 16:53:18 Server kernel: [4487891.485506] disk 3, o:1, dev:sde
Apr 18 16:53:18 Server kernel: [4487891.497014] RAID5 conf printout:
Apr 18 16:53:18 Server kernel: [4487891.497016] --- rd:4 wd:3
Apr 18 16:53:18 Server kernel: [4487891.497018] disk 0, o:1, dev:sdb
Apr 18 16:53:18 Server kernel: [4487891.497019] disk 2, o:1, dev:sdd
Apr 18 16:53:18 Server kernel: [4487891.497021] disk 3, o:1, dev:sde
Apr 18 16:53:18 Server kernel: [4487891.838719] scsi 3:0:0:0: Direct-Access ATA WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
Apr 18 16:53:18 Server kernel: [4487891.838886] sd 3:0:0:0: Attached scsi generic sg3 type 0
Apr 18 16:53:18 Server kernel: [4487891.838911] sd 3:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Apr 18 16:53:18 Server kernel: [4487891.838964] sd 3:0:0:0: [sdf] Write Protect is off
Apr 18 16:53:18 Server kernel: [4487891.838967] sd 3:0:0:0: [sdf] Mode Sense: 00 3a 00 00
Apr 18 16:53:18 Server kernel: [4487891.838988] sd 3:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 18 16:53:20 Server kernel: [4487891.839147] sdf: unknown partition table
Apr 18 16:53:20 Server kernel: [4487893.130026] sd 3:0:0:0: [sdf] Attached SCSI disk
Right now, I'm unable to do anything on /dev/sdc. Is there any way to try to re-attach it? I don't want to power down the server unless absolutely necessary.
System:
- Debian Stable 2.6.32-5-amd64
- mdadm version 3.1.4-1+8efb9d1
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb[0] sdc[4](F) sde[3] sdd[2]
5860543488 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
unused devices: <none>
mdadm --examine --scan
ARRAY /dev/md0 UUID=1a7744b5:912ec7af:f82a9565:e3b453b4
Try the following with the /proc filesystem:
http://tldp.org/HOWTO/SCSI-2.4-HOWTO/mlproc.html
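As a rough sketch of what that page describes: the kernel accepts "remove-single-device" and "add-single-device" commands written to /proc/scsi/scsi. The host/channel/id/lun values 3 0 0 0 below are read off the "sd 3:0:0:0" lines in the log. The helper name scsi_proc_cmd is made up for illustration, and this assumes the running kernel exposes /proc/scsi/scsi at all. The script only prints the strings; nothing is written to /proc.

```shell
#!/bin/sh
# Build the command string that /proc/scsi/scsi understands.
# Usage: scsi_proc_cmd add|remove HOST CHANNEL ID LUN
# (scsi_proc_cmd is a hypothetical helper, not an existing tool.)
scsi_proc_cmd() {
    case "$1" in
        add|remove) ;;
        *)
            echo "usage: scsi_proc_cmd add|remove HOST CHANNEL ID LUN" >&2
            return 1
            ;;
    esac
    printf 'scsi %s-single-device %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Print the two strings for the device at 3:0:0:0.
# To actually apply one, redirect it as root, for example:
#   scsi_proc_cmd remove 3 0 0 0 > /proc/scsi/scsi
scsi_proc_cmd remove 3 0 0 0
scsi_proc_cmd add 3 0 0 0
```

Note that the end of the log already shows the kernel re-detecting the drive on its own as sdf, so a manual re-add at the SCSI layer may not change much here.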
I'm not sure what you think you're going to gain by adding a failing disk back into your array. These aren't soft errors; the disk is on its way out.
It failed a write command, the link was reset, and the kernel then saw a sector count mismatch on the drive.
It failed to respond to an IDENTIFY command.
It failed to respond to a READ CAPACITY command.
The fact that the disk comes back far enough to present a block device to Linux is a red herring. You should replace it, not spend time trying to get a disk that looks very much like it's failing back into a RAID array. Even if you did get it back in, it would either fail again shortly, silently corrupt your data, or both.
Replacing SATA disks doesn't technically require powering off the server. I appreciate that your chassis may not have hot-swap bays, and may not allow you easy access to replace the disks, but you might consider taking this opportunity to install a SATA hot-swap bay adapter. Something like this from Addonics, for example: it fits into three 5.25" bays and provides five 3.5" hot-swap drive trays. It makes replacing disks a lot easier.
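Once a replacement disk is in, the md side of the swap might look roughly like the following. This is a dry-run sketch: it only prints the commands. /dev/md0 comes from the mdstat output in the question, /dev/sdX is a placeholder for whatever name the new disk gets, and replace_member_cmds is a made-up helper name. It assumes the array uses whole disks as members, as the mdstat output suggests.

```shell
#!/bin/sh
# Print (do not run) the mdadm commands for swapping out a dead member.
# Usage: replace_member_cmds ARRAY NEW_DEVICE
# (replace_member_cmds is a hypothetical helper for illustration.)
replace_member_cmds() {
    array=$1
    new_dev=$2
    # Drop members whose device node is no longer connected (sdc here).
    echo "mdadm --manage $array --remove detached"
    # Add the replacement disk; md then starts rebuilding onto it.
    echo "mdadm --manage $array --add $new_dev"
    # Watch the rebuild progress.
    echo "cat /proc/mdstat"
}

replace_member_cmds /dev/md0 /dev/sdX
```

Check that /dev/sdX really is the new, empty disk before running the printed commands as root; device names can shift after a hot-plug, as the sdc to sdf change in the log shows.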
I had the same problem with a Marvell controller. I disabled NCQ and this did not happen again.
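For anyone wanting to try this without a reboot, one common way to turn NCQ off for a single disk is to set its queue depth to 1 through sysfs. The sketch below assumes the usual /sys/block/DEVICE/device/queue_depth path. ncq_off is a made-up helper name, and its second argument (an alternative sysfs root) exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Set a disk's queue depth to 1, which effectively disables NCQ for it.
# Usage: ncq_off DEVICE [SYSFS_ROOT]   (SYSFS_ROOT defaults to /sys)
# (ncq_off is a hypothetical helper for illustration.)
ncq_off() {
    depth_file="${2:-/sys}/block/$1/device/queue_depth"
    if [ ! -w "$depth_file" ]; then
        echo "cannot write $depth_file" >&2
        return 1
    fi
    echo 1 > "$depth_file"
    # Print the value the kernel now reports.
    cat "$depth_file"
}

# Example, as root:
#   ncq_off sdc
```

The setting does not survive a reboot. The libata.force=noncq kernel parameter is a boot-time alternative, though that does mean restarting the machine.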