I've got a raidz1-0 pool with 5 drives in it. I'm not sure exactly when it started, but all of a sudden the drives went from always being ONLINE with no read, write, or checksum errors to randomly spitting out all sorts of issues.
NAME                                            STATE     READ WRITE CKSUM
Data                                            DEGRADED     0     0     0
  raidz1-0                                      DEGRADED   149   185     0
    gptid/905fe084-a003-11e9-9d12-000c29c8a62a  DEGRADED    57   127     5  too many errors
    gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       7     5     5
    gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  DEGRADED    70   171     5  too many errors
    gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  DEGRADED    51     6    14  too many errors
    gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  FAULTED      8    13     2  too many errors
I've done some basic troubleshooting:
- SMART shows that everything is fine (apart from temperatures warmer than I'd like, around the 40C range), so the drives look like they're in good shape: no bad sectors, no pending sectors, nothing out of the ordinary. All of the drives have been spinning for ~3 years at this point.
- Each drive is connected directly to the motherboard via its own SATA connection. I've reseated and replaced the SATA cables with no success.
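For reference, these are the kinds of SMART attributes I was checking. A minimal sketch with illustrative sample values (not my actual drives' output); the awk filter flags any attribute with a non-zero raw value, and UDMA CRC errors in particular point at the cable or controller rather than the platters:

```shell
# Illustrative excerpt of `smartctl -A /dev/adaX` output (values are made up
# for the example; on my drives these were all clean except temperature).
smart_sample='
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
'

# Flag any attribute whose raw value (last field) is non-zero.
echo "$smart_sample" | awk 'NF { raw = $NF; if (raw + 0 > 0) print $2, "=", raw }'
# prints: UDMA_CRC_Error_Count = 3
```

In real use you'd pipe `smartctl -A /dev/ada0` (and so on for each drive) through the same filter.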
At some point, I replaced the 3rd disk in the pool. At the time, it was spitting out the most errors and was always the first to go into a DEGRADED state. I replaced it with a brand-new drive, which has been running for months now, yet it immediately picked up the same issues as the rest of the pool.
Even after a zpool clear, about 5 hours later I had the following status.
NAME                                            STATE     READ WRITE CKSUM
Data                                            DEGRADED     0     0     0
  raidz1-0                                      DEGRADED     1     0     0
    gptid/905fe084-a003-11e9-9d12-000c29c8a62a  ONLINE       2     4     0
    gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       0     0     0
    gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  FAULTED      1    11     0  too many errors
    gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  ONLINE       1     1     0
    gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  ONLINE       1     6     0
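To quantify "errors keep coming back after a clear", a small sketch I could have run periodically (e.g. hourly from cron): sum the per-disk READ/WRITE/CKSUM counters out of `zpool status`. The sample here is the post-clear status above, embedded as a string so the filter is self-contained:

```shell
# Per-disk lines from the post-clear `zpool status` output above.
status='
gptid/905fe084-a003-11e9-9d12-000c29c8a62a  ONLINE       2     4     0
gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       0     0     0
gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  FAULTED      1    11     0  too many errors
gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  ONLINE       1     1     0
gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  ONLINE       1     6     0
'

# Sum columns 3-5 (READ, WRITE, CKSUM) across the gptid/ device lines.
echo "$status" | awk '$1 ~ /^gptid/ { r += $3; w += $4; c += $5 }
                      END { print "read=" r, "write=" w, "cksum=" c }'
# prints: read=5 write=22 cksum=0
```

In real use you'd replace the embedded sample with `zpool status Data` and watch whether the totals grow between runs after a `zpool clear Data`.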
I'm not exactly sure what's going on here or where else to look.
I don't know if it's a coincidence, but I noticed this started happening after upgrading the ZFS pool as part of one of FreeNAS's updates (I think it was 11.2U; and yes, I'm running FreeNAS).
The last thing I can think of is a bad SATA controller. But before I go down that road, is there anything else I can troubleshoot? This is a hobby home server, and replacing the controller essentially means a whole new server, so I'd like to avoid that if possible. Unfortunately, there are no PCIe slots left to install an add-in controller either.
Thanks in advance!
UPDATE: After almost a month of debugging, it's safe to say that it was indeed the chipset's SATA controller.
@shodanshok brought to my attention that there is a "significant age-related SATA issue" with Intel chipsets, and some extra googling showed that I wasn't the only one affected.
I've bought some new hardware, along with an LSI 9205-8i (H220) HBA to connect all the drives to. Without any changes to the configuration (apart from a more modern motherboard + CPU), the ZFS pool imported with no issue and has been running for a whole day with 0 read/write/checksum errors. By this point the counts would previously have been in the hundreds, which confirms that the issue was the onboard SATA controller.
Hope this helps anyone who is experiencing a similar issue!