I'm fairly new to ZFS and I have a simple mirrored storage pool with 8 drives. After a few weeks of running, one drive seemed to be generating a lot of errors, so I replaced it.
A few more weeks went by and now I'm seeing small errors crop up all around the pool (see the zpool status output below). Should I be worried about this? How can I determine whether an error means a drive needs to be replaced?
# zpool status
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 22.5K in 1h18m with 0 errors on Sun Jul 10 03:18:42 2016
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            enc-a   ONLINE       0     0     2
            enc-b   ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            enc-c   ONLINE       0     0     0
            enc-d   ONLINE       0     0     2
          mirror-2  ONLINE       0     0     0
            enc-e   ONLINE       0     0     2
            enc-f   ONLINE       0     0     1
          mirror-3  ONLINE       0     0     0
            enc-g   ONLINE       0     0     0
            enc-h   ONLINE       0     0     3

errors: No known data errors
ZFS helpfully tells me to "Determine if the device needs to be replaced..." but I'm not sure how to do that. I did read the referenced article, which was helpful but not exactly conclusive.
I have looked at the SMART test results for the affected drives, and nothing jumped out at me (all tests completed without errors), but I can post the SMART data as well if it would be helpful.
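For reference, this is roughly how I've been checking SMART and running the self-tests (the device name is just an example, not necessarily one of the drives above):
# smartctl -a /dev/sda              # full attribute dump plus the self-test log
# smartctl -t long /dev/sda         # start an extended offline self-test
# smartctl -l selftest /dev/sda     # check the result once the test finishes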
Update: While preparing to reboot into Memtest86+, I noticed a lot of errors on the console. I normally SSH in, so I didn't see them before. I'm not sure which log I should have been checking, but the entire screen was filled with errors that look like this (not my exact error line, I just copied this from a different forum):
blk_update_request: I/O error, dev sda, sector 220473440
From some Googling, it seems like this error can be indicative of a bad drive, but it's hard for me to believe that they are all failing at once like this. Thoughts on where to go from here?
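From the same Googling, it sounds like these messages end up in the kernel log, so something like the following should surface them (exact log locations vary by distro; these are just examples):
# dmesg | grep -i blk_update_request
# journalctl -k | grep -i 'I/O error'       # on systemd-based systems
# grep -i 'I/O error' /var/log/messages     # on RHEL/CentOS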
Update 2: I came across this ZOL issue that seems like it might be related to my problem. Like the OP there, I am using hdparm to spin down my drives, and I am seeing similar ZFS checksum errors and blk_update_request errors. My machine is still running Memtest, so I can't check my kernel or ZFS version at the moment, but this at least looks like a possibility. I also saw this similar question, which is kind of discouraging. Does anyone know of issues with ZFS and spinning down drives?
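For context, the spin-down is just hdparm's standby timer. If the spin-down theory pans out, disabling it seems like an easy test; roughly (device name and timeout value are examples, not my exact settings):
# hdparm -S 241 /dev/sda    # set the standby (spin-down) timeout; 241 = 30 minutes
# hdparm -S 0 /dev/sda      # disable the standby timer entirely
# hdparm -C /dev/sda        # report the current power state (active/idle vs standby)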
Update 3: Could a mismatched firmware and driver version on the LSI controller cause errors like this? It looks like I'm running driver version 20.100.00.00 and firmware version 17.00.01.00. Would it be worthwhile to try flashing updated firmware on the card?
# modinfo mpt2sas
filename: /lib/modules/3.10.0-327.22.2.el7.x86_64/kernel/drivers/scsi/mpt2sas/mpt2sas.ko
version: 20.100.00.00
license: GPL
description: LSI MPT Fusion SAS 2.0 Device Driver
author: Avago Technologies <[email protected]>
rhelversion: 7.2
srcversion: FED1C003B865449804E59F5
# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0     SAS2308_2(D1)   17.00.01.00   11.00.00.05   07.33.00.00      00:04:00:00
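If I do go ahead with it, my understanding is that the flash is done with the same sas2flash utility, along these lines (the file names are placeholders for whatever the P20 firmware package actually contains):
# sas2flash -o -f 2308it.bin -b mptsas2.rom    # flash firmware image and BIOS (placeholder file names)
# sas2flash -listall                           # verify the new FW version afterwards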
Update 4: Caught some more errors in the dmesg
output. I'm not sure what triggered these, but I noticed them after unmounting all of the drives in the array in preparation for updating the LSI controller's firmware. I'll wait a bit to see if the firmware update solved the problem, but here are the errors in the meantime. I'm not really sure what they mean.
[87181.144130] sd 0:0:2:0: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144142] sd 0:0:2:0: [sdc] CDB: Write(10) 2a 00 35 04 1c d1 00 00 01 00
[87181.144148] blk_update_request: I/O error, dev sdc, sector 889461969
[87181.144255] sd 0:0:3:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144259] sd 0:0:3:0: [sdd] CDB: Write(10) 2a 00 35 04 1c d1 00 00 01 00
[87181.144263] blk_update_request: I/O error, dev sdd, sector 889461969
[87181.144371] sd 0:0:4:0: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144375] sd 0:0:4:0: [sde] CDB: Write(10) 2a 00 37 03 87 30 00 00 08 00
[87181.144379] blk_update_request: I/O error, dev sde, sector 922978096
[87181.144493] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144500] sd 0:0:5:0: [sdf] CDB: Write(10) 2a 00 37 03 87 30 00 00 08 00
[87181.144505] blk_update_request: I/O error, dev sdf, sector 922978096
[87191.960052] sd 0:0:6:0: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87191.960063] sd 0:0:6:0: [sdg] CDB: Write(10) 2a 00 36 04 18 5c 00 00 01 00
[87191.960068] blk_update_request: I/O error, dev sdg, sector 906238044
[87191.960158] sd 0:0:7:0: [sdh] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87191.960162] sd 0:0:7:0: [sdh] CDB: Write(10) 2a 00 36 04 18 5c 00 00 01 00
[87191.960179] blk_update_request: I/O error, dev sdh, sector 906238044
[87195.864565] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87195.864578] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 37 03 7c 68 00 00 20 00
[87195.864584] blk_update_request: I/O error, dev sda, sector 922975336
[87198.770065] sd 0:0:1:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87198.770078] sd 0:0:1:0: [sdb] CDB: Write(10) 2a 00 37 03 7c 88 00 00 20 00
[87198.770084] blk_update_request: I/O error, dev sdb, sector 922975368
Update 5: I updated the firmware for the LSI controller, but after clearing the ZFS errors and scrubbing, I'm seeing the same behavior (minor checksum errors on a few of the drives). The next step will be updating the firmware on the drives themselves.
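For completeness, the clear-and-rescrub cycle I'm running after each change is just:
# zpool clear storage      # reset the READ/WRITE/CKSUM counters
# zpool scrub storage      # re-read and verify everything
# zpool status storage     # see whether the CKSUM column starts climbing again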
Update 6: I replaced the PCI riser after reading in some forums that other people with the U-NAS NSC800 case have had issues with the provided riser. There was no effect on the checksum errors. I have been putting off the HDD firmware update because the process is such a pain, but I guess it's time to suck it up and make a bootable DOS flash drive.
Update 7: I updated the firmware on three of the Seagate drives. The other drives either didn't have a firmware update available or I wasn't able to get it (Western Digital told me there was no firmware update for my drive). No errors popped up after an initial scrub, but I'm going to give it at least a week or two before I say this solved the problem. It seems highly unlikely to me that the firmware on three drives could be affecting the entire pool like this.
Update 8: The checksum errors are back, just like before. I might look into a firmware update for the motherboard, but at this point I'm at a loss. It will be difficult/expensive to replace the remaining physical components (controller, backplane, cabling), and I'm just not 100% sure that it's not a problem with my setup (ZFS + Linux + LUKS + Spinning down idle drives). Any other ideas are welcome.
Update 9: Still trying to track this one down. I came across this question which had some similarities to my situation. So, I went ahead and rebuilt the zpool using ashift=12
to see if that would resolve the issue (no luck). Then, I bit the bullet and bought a new controller. I just installed a Supermicro AOC-SAS2LP-MV8 HBA card. I'll give it a week or two to see if this solves the problem.
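For reference, rebuilding the pool with the forced sector size was essentially the following (simplified; the real vdevs are the LUKS mappings shown in the status output above):
# zpool create -o ashift=12 storage \
    mirror enc-a enc-b \
    mirror enc-c enc-d \
    mirror enc-e enc-f \
    mirror enc-g enc-h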
Update 10: Just to close this out. It's been about 2 weeks since the new HBA card went in and, at the risk of jinxing it, I've had no checksum errors since. A huge thanks to everyone who helped me sort this one out.
Having those errors across multiple drives seems to indicate a backplane/controller/cabling problem more than a disk or RAM issue.
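One quick way to help separate the two: check the interface CRC counters with smartctl on a few of the drives. If UDMA_CRC_Error_Count (attribute 199) is climbing on several disks at once, that points at the cabling/backplane/controller path rather than the platters themselves (device name is just an example; depending on the controller you may need to add -d sat):
# smartctl -A /dev/sda | grep -i crc    # attribute 199, UDMA_CRC_Error_Count
# smartctl -l sataphy /dev/sda          # SATA PHY event counters, where supported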
My general rule of thumb is that if the errors keep rising unexpectedly, the disk needs to be replaced; if the count stays static, there may have been some transient condition that caused the error and the system is no longer reproducing it.
A few checksum errors don't necessarily indicate anything mechanically wrong with the drive (bit rot happens; ZFS just happens to detect it while other filesystems don't), but errors that accumulate over the course of an hour are a very different situation from errors that accumulate over the course of a year.