I have been running a ZFS RAIDZ1 pool with 5 disks under Ubuntu 12.04 for 3 years now with no problems at all.
Unfortunately the day of the failing disk has come. I have lost a disk in the array: it simply went offline, and after a few days the second one started to show errors as well. When the system detected checksum errors on the second disk, which has started to fail (some bad sectors according to SMART), it began resilvering the array. By the time I got to the PC the resilver was already at 40%, so in order to avoid a catastrophe I decided to stop the server ASAP.
So basically my array looks almost like this, and somewhere it is mentioned that data was lost:
NAME                                    STATE     READ WRITE CKSUM
Misu                                    DEGRADED     0     0     0
  raidz1-0                              ONLINE       0     0     0
    scsi-SATA_ST3000DM001-9YN_Z1F1587B  OFFLINE      0     0     0  (failed hdd)
    scsi-SATA_ST3000DM001-9YN_Z1F14J7V  ONLINE       0     0     0
    scsi-SATA_ST3000DM001-9YN_Z1F14JYL  ONLINE       0     0     0
    scsi-SATA_ST3000DM001-1CH_W1F1G04F  ONLINE       0     0     0
    scsi-SATA_ST3000DM001-1CH_W1F1G1H7  ONLINE     134     5   139  (failing hdd)
Since the resilver process takes some time, I'm quite afraid of replacing the first disk and just hoping that the second one, the one with the checksum errors, will not fail in the meantime. So I have decided to replace the PCB on the first failed disk, since it had PCB problems and not mechanical problems.
So, if I manage to get the first disk running again, what shall I do next? How will ZFS know that the disk was not replaced (I'm not sure, but I believe that changing the PCB will change the serial number and such for that disk) and detect it as the original member?
Any other information that can help me not to make this worse?
Reimport the pool read-only and make a backup while it is still mountable. You have two bad disks in a pool with one level of parity protection, and if the second disk also faults offline, then the problem becomes much worse. Avoid sleeping or power-cycling the hardware until you have a backup.
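Something along these lines is what I mean; the pool name comes from your status output, the backup destination is only a placeholder, and I'm assuming the pool mounts at its default /Misu mountpoint:
    zpool export Misu
    zpool import -o readonly=on Misu
    zpool status -v Misu                     # check what actually came up
    rsync -aHAX /Misu/ /mnt/backup-target/   # /mnt/backup-target/ is a placeholder destination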
OpenZFS can recognize data on the repaired disk regardless of whether changing the PCB changes the disk serial number. If the repaired disk reappears in the system with a different /dev name, then just reimport the pool. Resilvering must happen on the repaired disk before the second failing disk can be replaced, which is when a fatal error is likely to happen.
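As a sketch, assuming the repaired disk comes back with a usable by-id entry (importing with -d /dev/disk/by-id lets ZFS find the pool members by their on-disk labels even if the device name changed); note the pool has to be imported read-write again for the resilver to run:
    zpool export Misu
    zpool import -d /dev/disk/by-id Misu
    zpool online Misu scsi-SATA_ST3000DM001-9YN_Z1F1587B   # bring the repaired disk back in
    zpool status -v Misu                                   # resilver progress shows here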
Note that this may be a 'bathtub' failure because the disks seem to be from the same manufacturing batch. If so, then expect additional failures.
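If you want to keep an eye on the remaining drives while the backup runs, smartmontools can report their error counters; the device path below is just one of the drives from your status output:
    smartctl -a /dev/disk/by-id/scsi-SATA_ST3000DM001-9YN_Z1F14J7V | grep -iE 'reallocated|pending|uncorrect'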
I see this question is very old; however, I'll add the resolution I suggest in case somebody else runs into the same problem.
What you have to do is run the zpool replace command against the drive with errors, not against the drive that is OFFLINE. I'll clarify.
If you replace the OFFLINE drive, your chances of recovering the information disappear: you'll still have one drive with errors, so the files affected by those errors will not be recoverable.
But if you run zpool replace against the drive with errors, i.e. scsi-SATA_ST3000DM001-1CH_W1F1G1H7, you will recover the same information as in the previous case, and you'll also have a chance to recover the rest if you're able to bring the OFFLINE drive back to life.
IMO this is the best way to proceed. ZFS knows which data on the drive with errors is still healthy, and the resilver will be a bit faster because ZFS is reading from more drives. Also, if some other drive has errors in a block of a stripe while the drive that is returning errors still holds that block in good state, ZFS will be able to reconstruct the stripe, so you won't lose data just because errors appear on one of the other drives. So do not OFFLINE scsi-SATA_ST3000DM001-1CH_W1F1G1H7. ZFS is very good at using whatever it can read from failing drives, and it will do its best.
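A sketch of the command for this specific pool; the last argument is a placeholder for whichever new disk you add as the replacement:
    zpool replace Misu scsi-SATA_ST3000DM001-1CH_W1F1G1H7 /dev/disk/by-id/<new-replacement-disk>   # last argument is a placeholder
    zpool status -v Misu   # follow the resilver onto the new drive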
If your pool were composed of 14TB drives and you had 10TB of data on each drive, it would take less than 24 hours to resilver those 10TB onto the new drive (drives at ~200MB/s, 5.4K rpm). Your drives are Seagate Barracuda ST3000DM001 3TB 7.2K 6.0Gb/s SATA desktop drives, so resilvering one of them, assuming the pool was 100% full, would take less than 8 hours.
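As a rough sanity check on that number: 3TB at ~200MB/s of sustained sequential reading is about 3,000,000MB / 200MB/s = 15,000 seconds, roughly 4.2 hours, so "less than 8 hours" leaves plenty of margin for the usual resilver overhead and seeking.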
Cheers