Kendrick

Asked: 2012-11-25 10:14:53 +0800 CST

zfs pool error, how to determine which drive failed in the past


I had been copying data off my pool so that I could rebuild it with a different pool version, moving away from Solaris 11 to one that is portable between FreeBSD, OpenIndiana, etc. The other day it was copying at 20 MB/s, which is about all my desktop drive can handle writing from the network. Suddenly, last night it dropped to 1.4 MB/s. I ran zpool status today and got this:

   pool: store
   state: ONLINE
   status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
   action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
   scan: none requested
   config:

    NAME          STATE     READ WRITE CKSUM
    store         ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        c8t3d0p0  ONLINE       0     0     2
        c8t4d0p0  ONLINE       0     0    10
        c8t2d0p0  ONLINE       0     0     0

It is currently a 3 x 1 TB drive array. What tools would best be used to determine what the error was and which drive is failing?

Per the admin docs:

 The second section of the configuration output displays error statistics. These errors are divided into three categories:

READ – I/O errors occurred while issuing a read request.

WRITE – I/O errors occurred while issuing a write request.

CKSUM – Checksum errors. The device returned corrupted data as the result of a read request.

It says low counts could be anything from a power fluctuation to a disk event, but it gives no suggestions as to which tools to use to check and determine the cause.
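A first step is simply reading the per-device counters in the config table: the leaf devices with nonzero READ, WRITE or CKSUM counts are the ones ZFS is attributing errors to. Below is a minimal Python sketch, assuming the table layout shown above (a NAME STATE READ WRITE CKSUM header and plain integer counts). The function names are hypothetical, and `SAMPLE` is the output from this question pasted in rather than read from a live system.

```python
# Hypothetical helper: parse the config table of `zpool status` output.
SAMPLE = """\
    NAME          STATE     READ WRITE CKSUM
    store         ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        c8t3d0p0  ONLINE       0     0     2
        c8t4d0p0  ONLINE       0     0    10
        c8t2d0p0  ONLINE       0     0     0
"""

HEADER = ["NAME", "STATE", "READ", "WRITE", "CKSUM"]


def parse_error_counts(status_text: str) -> dict:
    """Return {device: {"READ": n, "WRITE": n, "CKSUM": n}} for each table row."""
    counts = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == HEADER:
            in_table = True
            continue
        if not in_table:
            continue
        if fields and fields[0] == "errors:":  # the table ends before this line
            break
        # Assumes plain integer counts, as in the output above.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counts[fields[0]] = dict(zip(HEADER[2:], map(int, fields[2:5])))
    return counts


def devices_with_errors(counts: dict) -> list:
    """Devices with any nonzero counter, worst first."""
    flagged = [name for name, c in counts.items() if any(c.values())]
    return sorted(flagged, key=lambda name: sum(counts[name].values()), reverse=True)


if __name__ == "__main__":
    counts = parse_error_counts(SAMPLE)
    for name in devices_with_errors(counts):
        print(name, counts[name])
```

Against a live system you would pass the text of `zpool status store` instead of `SAMPLE`, for example as captured by `subprocess.run(["zpool", "status", "store"], capture_output=True, text=True).stdout`. The counters only tell you where ZFS noticed bad data, not why, so they narrow the search rather than diagnose the cause.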

notpeter · Answer 1 · 2012-11-29T13:37:30+08:00

Checksum errors occur when data read from disk doesn't match its expected checksum. A noisy SATA cable could cause this corruption either while writing (data corrupted on the way to the disk) or while reading (data corrupted on the way from the disk). Although it could be a failing disk, it was likely caused by a loose or pinched SATA data cable. Try reseating the cables on both ends, or try another known-good cable.
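To see why a bad cable shows up as CKSUM rather than READ errors: ZFS stores a checksum for each block in its parent block pointer when the block is written and verifies it on every read, so a bit flipped in transit produces a mismatch even though the disk reported a successful I/O. Below is a minimal Python sketch of a fletcher4-style sum (fletcher4 is what current ZFS releases use by default for data), simplified to little-endian 32-bit words. The one-bit flip stands in for a noisy cable; this is an illustration, not the actual ZFS code, and `flip_bit` is a hypothetical helper.

```python
import struct

MASK64 = (1 << 64) - 1  # fletcher4 keeps four 64-bit running sums


def fletcher4(data: bytes) -> tuple:
    """Fletcher-4 style checksum over little-endian 32-bit words."""
    if len(data) % 4:
        raise ValueError("data length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a noisy cable by flipping one bit in transit."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


if __name__ == "__main__":
    block = bytes(range(256)) * 16           # a 4 KiB block
    stored = fletcher4(block)                # computed at write time
    noisy_read = flip_bit(block, 12345)      # corrupted on the way from the disk
    print("clean read ok:", fletcher4(block) == stored)       # True
    print("noisy read ok:", fletcher4(noisy_read) == stored)  # False
```

The same mismatch appears whether the bit flipped on the way to the disk or on the way back, which is why the counter alone can't distinguish a bad cable from bad media.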

As for determining which disk, it kind of depends on what hardware you're using. For Sun-branded hardware, cfgadm -alv should give you hard drive serial numbers to match their logical names. If you're using SATA ports on the motherboard, the port numbers correspond to the target IDs (2, 3, 4), so the first port is probably t0. Most of my disks have the WWN printed on the label; you can discover it by enabling multipathing with pfexec stmsboot -e (see this question), which will use the c8tWWNxxxxxxxxd0p0 format instead of c8tNd0p0, but probably only if you're using a SAS controller.
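The logical names encode controller, target and disk (LUN) numbers, optionally followed by pN (an x86 fdisk partition, where p0 is the whole disk) or sN (a slice); with multipathing enabled, the target number is replaced by a hex WWN. The Python sketch below splits those fields apart and applies this answer's rule of thumb that, on onboard SATA ports, target N is probably port N. That mapping is a heuristic, not something the name guarantees, and the function names are hypothetical; confirm against the serial numbers from cfgadm -alv before pulling a disk.

```python
import re

# c<controller>t<target>d<lun>, optionally p<n> (fdisk partition) or s<n> (slice).
# With MPxIO multipathing the target is a hex WWN instead of a small integer.
DEVICE_RE = re.compile(r"^c(\d+)t([0-9A-Fa-f]+)d(\d+)(?:([ps])(\d+))?$")


def parse_device(name: str) -> dict:
    """Split a Solaris logical disk name into its components."""
    m = DEVICE_RE.match(name)
    if m is None:
        raise ValueError(f"not a cXtYdZ device name: {name!r}")
    controller, target, lun, kind, part = m.groups()
    return {
        "controller": int(controller),
        "target": target,  # kept as a string because it may be a WWN
        "lun": int(lun),
        "partition": f"{kind}{part}" if kind else None,
    }


def likely_sata_port(name: str):
    """Heuristic only: on onboard SATA, target N is probably port N.

    Returns None for WWN-style targets, which carry no port number.
    """
    target = parse_device(name)["target"]
    if len(target) >= 16 or not target.isdigit():
        return None
    return int(target)


if __name__ == "__main__":
    for dev in ("c8t3d0p0", "c8t4d0p0", "c8t2d0p0"):
        print(dev, "-> probably SATA port", likely_sata_port(dev))
```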

Your output shows that ZFS was able to correct the error by reconstructing the data from the other two disks and restoring redundancy. It's just letting you know that something bad happened; at this point the fault management system has not yet decided the disk has had enough errors to warrant offlining it (which would result in a DEGRADED pool status). I'd run a scrub (zpool scrub store) to make sure every byte reads cleanly. See the documentation for message ID ZFS-8000-9P for more information.
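The self-healing described above can be illustrated with a toy model of a 3-disk raidz1 stripe: two data columns plus an XOR parity column, with the block's checksum stored separately. When the data as read fails its checksum, each data column is rebuilt from parity in turn until one reconstruction verifies; the column that had to be rebuilt is roughly how ZFS can pin a checksum error on one specific disk. This is a simplified Python sketch, assuming fixed-size columns and SHA-256 as a stand-in checksum; real raidz layout and reconstruction are considerably more involved, and all names here are hypothetical.

```python
import hashlib


def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together (raidz1-style single parity)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


def checksum(data: bytes) -> bytes:
    # Stand-in for the checksum ZFS keeps in the parent block pointer.
    return hashlib.sha256(data).digest()


def read_with_self_heal(columns, parity, expected):
    """Return (data, index_of_rebuilt_column or None).

    If the data as read verifies, return it unchanged. Otherwise assume one
    column is bad, rebuild each column from parity plus the others in turn,
    and keep the first reconstruction whose checksum matches.
    """
    data = b"".join(columns)
    if checksum(data) == expected:
        return data, None
    for bad in range(len(columns)):
        others = [c for i, c in enumerate(columns) if i != bad]
        candidate = list(columns)
        candidate[bad] = xor_bytes(parity, *others)
        data = b"".join(candidate)
        if checksum(data) == expected:
            return data, bad
    raise OSError("unrecoverable: no single-column reconstruction verifies")


if __name__ == "__main__":
    d0, d1 = b"hello, zfs pool!", b"raidz1 parity ok"  # two 16-byte data columns
    parity = xor_bytes(d0, d1)
    expected = checksum(d0 + d1)

    corrupted = bytearray(d1)
    corrupted[3] ^= 0x01  # one flipped bit on the second data disk
    data, rebuilt = read_with_self_heal([d0, bytes(corrupted)], parity, expected)
    print(data == d0 + d1, rebuilt)  # True 1
```

A scrub effectively runs this verify-and-repair path over every allocated block, which is why it is a good way to confirm the pool reads cleanly after the cable is fixed.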
