I've got a raidz1-0 pool with 5 drives in it. I'm not sure exactly when it started, but all of a sudden the drives went from always being ONLINE with no read, write, or checksum errors to randomly spitting out all sorts of issues.
NAME                                            STATE     READ WRITE CKSUM
Data                                            DEGRADED     0     0     0
  raidz1-0                                      DEGRADED   149   185     0
    gptid/905fe084-a003-11e9-9d12-000c29c8a62a  DEGRADED    57   127     5  too many errors
    gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       7     5     5
    gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  DEGRADED    70   171     5  too many errors
    gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  DEGRADED    51     6    14  too many errors
    gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  FAULTED      8    13     2  too many errors
I've done some basic troubleshooting:
- SMART shows that everything is fine (apart from temperatures warmer than I'd like, around the 40C range), so the drives look like they're in good shape: no bad sectors, no pending sectors, nothing out of the ordinary. All of the drives have been spinning for ~3 years at this point.
- Each drive is connected directly to the motherboard via its own SATA connection. I've reseated and replaced the SATA cables with no success.
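For reference, these are the kinds of SMART attributes I was checking. A minimal sketch with illustrative sample values (not my actual drives' output); the awk filter flags any attribute with a non-zero raw value, and UDMA CRC errors in particular point at the cable or controller rather than the platters:

```shell
# Illustrative excerpt of `smartctl -A /dev/adaX` output (values are made up
# for the example; on my drives these were all clean except temperature).
smart_sample='
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3
'

# Flag any attribute whose raw value (last field) is non-zero.
echo "$smart_sample" | awk 'NF { raw = $NF; if (raw + 0 > 0) print $2, "=", raw }'
# prints: UDMA_CRC_Error_Count = 3
```

In real use you'd pipe `smartctl -A /dev/ada0` (and so on for each drive) through the same filter.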
At some point, I replaced the 3rd disk in the pool. At the time, it was spitting out the most errors and was always the first to go into a DEGRADED state. I replaced it with a brand-new drive, which has been running for months now, yet it immediately picked up the same issues as the rest of the pool.
Even after a zpool clear, about 5 hours later I had the following status.
NAME                                            STATE     READ WRITE CKSUM
Data                                            DEGRADED     0     0     0
  raidz1-0                                      DEGRADED     1     0     0
    gptid/905fe084-a003-11e9-9d12-000c29c8a62a  ONLINE       2     4     0
    gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       0     0     0
    gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  FAULTED      1    11     0  too many errors
    gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  ONLINE       1     1     0
    gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  ONLINE       1     6     0
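To quantify "errors keep coming back after a clear", a small sketch I could have run periodically (e.g. hourly from cron): sum the per-disk READ/WRITE/CKSUM counters out of `zpool status`. The sample here is the post-clear status above, embedded as a string so the filter is self-contained:

```shell
# Per-disk lines from the post-clear `zpool status` output above.
status='
gptid/905fe084-a003-11e9-9d12-000c29c8a62a  ONLINE       2     4     0
gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       0     0     0
gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  FAULTED      1    11     0  too many errors
gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  ONLINE       1     1     0
gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  ONLINE       1     6     0
'

# Sum columns 3-5 (READ, WRITE, CKSUM) across the gptid/ device lines.
echo "$status" | awk '$1 ~ /^gptid/ { r += $3; w += $4; c += $5 }
                      END { print "read=" r, "write=" w, "cksum=" c }'
# prints: read=5 write=22 cksum=0
```

In real use you'd replace the embedded sample with `zpool status Data` and watch whether the totals grow between runs after a `zpool clear Data`.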
I'm not exactly sure what's going on here or where else to look.
I don't know if it's a coincidence, but I noticed this started happening after upgrading the ZFS pool as part of one of FreeNAS's updates (I think it was 11.2U; and yes, I'm running FreeNAS).
The last thing I can think of is a bad SATA controller. But before I go down that road, is there anything else I can troubleshoot? This is a hobby home server, and replacing the controller essentially means a whole new server, so I'd like to avoid that if possible. Unfortunately, there are no PCIe slots left to install an add-in controller either.
Thanks in advance!
UPDATE: After almost a month of debugging, it's safe to say that it was indeed the chipset's SATA controller.
@shodanshok brought to my attention that there is a "significant age-related SATA issue" with Intel chipsets, and some extra googling showed that I wasn't the only one affected.
I've bought some new hardware, along with an LSI 9205-8i (H220) HBA to connect all the drives to. Without any changes to the configuration (apart from a more modern motherboard + CPU), the ZFS pool imported with no issue and has been running for a whole day with 0 read/write/checksum errors. By this point the counts would previously have been in the hundreds, which confirms that the issue was the onboard SATA controller.
Hope this helps anyone who is experiencing a similar issue!