I have been copying data off my pool so that I can rebuild it with a different pool version, moving away from Solaris 11 to one that is portable between FreeBSD, OpenIndiana, etc. It was copying at 20 MB/s the other day, which is about all my desktop drive can handle when writing from the network. Suddenly, last night, it dropped to 1.4 MB/s. I ran zpool status today and got this:
pool: store
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scan: none requested
config:
        NAME          STATE     READ WRITE CKSUM
        store         ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            c8t3d0p0  ONLINE       0     0     2
            c8t4d0p0  ONLINE       0     0    10
            c8t2d0p0  ONLINE       0     0     0
It is currently a 3 x 1 TB drive array. What tools are best for determining what the error was and which drive is failing?
Per the admin doc:
The second section of the configuration output displays error statistics. These errors are divided into three categories:
READ – I/O errors occurred while issuing a read request.
WRITE – I/O errors occurred while issuing a write request.
CKSUM – Checksum errors. The device returned corrupted data as the result of a read request.
It said that low counts could be anything from a power fluctuation to a disk event, but gave no suggestions as to which tools to use to check and determine the cause.
Checksum errors occur when data is read from disk but doesn't match the expected checksum. A noisy SATA cable could cause this corruption either during writing (data corrupted on the way to the disk) or reading (data corrupted on the way from the disk). Although it could be a failing disk, it was likely caused by a loose or pinched SATA data cable. Try reseating the cables on both ends, or try another known-good cable.
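To see at a glance which devices carry non-zero counters, a small filter over the zpool status text can help. This is only a rough sketch: flag_errors is a hypothetical helper name (not a ZFS command), and it assumes the usual NAME/STATE/READ/WRITE/CKSUM column layout shown in the question.

```shell
#!/bin/sh
# flag_errors is a hypothetical helper, not part of ZFS.
# It reads `zpool status` text on stdin and prints every device line whose
# READ, WRITE or CKSUM counter is not "0".
flag_errors() {
    awk '
        # The header line marks the start of the device table.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # A blank line or the "errors:" line ends the table.
        in_table && (NF == 0 || $1 == "errors:") { in_table = 0 }
        # Inside the table, report any row with a non-zero counter.
        in_table && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Typical use on a live system (requires ZFS; "store" is the pool above):
#   zpool status store | flag_errors
```

Checksum errors spread across several disks at once, as here, tend to point at something shared (cabling, controller, power) rather than at a single dying drive.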
As for determining which disk, that depends on what hardware you're using. For Sun-branded hardware,
cfgadm -alv
should give you hard drive serial numbers to match against their logical names. If you're using SATA ports on the motherboard, the port numbers correspond to the target ID (2, 3, 4), so the first port is probably t0. Most of my disks have the WWN printed on the label; you can discover this by enabling multipathing with
pfexec stmsboot -e
(see: this question), which will use the c8tWWNxxxxxxxxd0p0 format instead of c8tNd0p0, but probably only if you're using a SAS controller.
Your output shows that ZFS was able to correct the error by reconstructing the data from the other two disks and restoring the redundancy. It's just letting you know that something bad happened; at this point the fault management system has not yet decided that the disk has had sufficient errors to warrant offlining it (which would result in a 'degraded' pool status). I'd give it a scrub to make sure every byte reads cleanly. More info for error ZFS-8000-9P here.
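The scrub suggestion could be scripted roughly as follows. Treat it as a sketch under assumptions: scrub_and_report and POLL_SECONDS are hypothetical names, "store" is the pool from the question, and the loop assumes zpool status prints the phrase "scrub in progress" while a scrub is running.

```shell
#!/bin/sh
# scrub_and_report is a hypothetical helper, not a ZFS command.
# Usage: scrub_and_report <pool>
scrub_and_report() {
    pool=$1

    # Start a scrub, which reads and verifies every allocated block.
    zpool scrub "$pool" || return 1

    # Assumption: the status text contains "scrub in progress" until the
    # scrub completes. POLL_SECONDS (hypothetical) sets the polling interval.
    while zpool status "$pool" | grep "scrub in progress" >/dev/null; do
        sleep "${POLL_SECONDS:-60}"
    done

    # -v also lists any files with unrecoverable errors.
    zpool status -v "$pool"
}

# Example call (requires ZFS and sufficient privileges):
#   scrub_and_report store
```

If the counters stop climbing after the cables have been reseated and the scrub completes cleanly, zpool clear store resets them, as the action text in the status output suggests. If they keep growing on one particular disk, that disk becomes the prime suspect.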