I was under the impression that if an I/O error occurs during a read from a ZFS pool, two things will happen:
- The failure will be recorded in either the READ or CKSUM statistic of the relevant device, propagating upwards toward the pool level.
- Redundant data will be used to reconstruct the requested block, the block will be returned to the caller, and if the duff drive is still functional the block will be rewritten to it, OR
- An I/O error will be returned if redundant data is not available to correct for the read error.
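To make the first expectation concrete: had the errors been surfaced, I would have expected zpool status to show something like the following against the affected leaf vdev (pool and device names here are generic, and the counter values are invented purely for illustration):

        NAME                      STATE     READ WRITE CKSUM
        tank                      ONLINE       0     0     0
          mirror-0                ONLINE       0     0     0
            failing-disk          ONLINE       3     0     1
            healthy-disk          ONLINE       0     0     0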
It appears that one of the disks in my mirror setup has developed a bad sector. That by itself is not alarming; such things happen, and that's exactly why I have redundancy (a two-way mirror, to be exact). Every time I scrub the pool or read through the files in a particular directory (I haven't bothered yet to determine exactly which file is at fault), the following pops up in dmesg, obviously with varying timestamps:
Nov 1 09:54:26 yeono kernel: [302621.236549] ata6.00: exception Emask 0x0 SAct 0x9c10 SErr 0x0 action 0x0
Nov 1 09:54:26 yeono kernel: [302621.236557] ata6.00: irq_stat 0x40000008
Nov 1 09:54:26 yeono kernel: [302621.236566] ata6.00: failed command: READ FPDMA QUEUED
Nov 1 09:54:26 yeono kernel: [302621.236578] ata6.00: cmd 60/a8:78:18:5a:12/00:00:5c:01:00/40 tag 15 ncq 86016 in
Nov 1 09:54:26 yeono kernel: [302621.236580] res 41/40:a8:18:5a:12/00:00:5c:01:00/00 Emask 0x409 (media error) <F>
Nov 1 09:54:26 yeono kernel: [302621.236585] ata6.00: status: { DRDY ERR }
Nov 1 09:54:26 yeono kernel: [302621.236589] ata6.00: error: { UNC }
Nov 1 09:54:26 yeono kernel: [302621.238214] ata6.00: configured for UDMA/133
This is a fairly up-to-date Debian Wheezy, kernel 3.2.0-4-amd64 #1 SMP Debian 3.2.63-2 x86_64, ZoL 0.6.3. Package versions are current at debian-zfs=7~wheezy, libzfs2=0.6.3-1~wheezy, zfs-dkms=0.6.3-1~wheezy, zfs-initramfs=0.6.3-1~wheezy, zfsutils=0.6.3-1~wheezy, zfsonlinux=3~wheezy, linux-image-amd64=3.2+46, linux-image-3.2.0-4-amd64=3.2.63-2. The only package pinning that I know of is for ZoL, for which I have (as provided by the zfsonlinux package):
Package: *
Pin: release o=archive.zfsonlinux.org
Pin-Priority: 1001
Running hdparm -R on the drive reports that Write-Read-Verify is turned on (this is a Seagate drive, so it has that feature, and I use it as an extra safety net; the additional write latency is not a problem since my interactive use pattern is very read-heavy):
/dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX:
write-read-verify = 2
Even given the clear indication that something is amiss, zpool status claims that there is no problem with the pool:
pool: akita
state: ONLINE
scan: scrub repaired 0 in 8h16m with 0 errors on Sat Nov 1 10:46:03 2014
config:
        NAME                        STATE     READ WRITE CKSUM
        akita                       ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5000c50065e8414a  ONLINE       0     0     0
            wwn-0x5000c500645b0fec  ONLINE       0     0     0
errors: No known data errors
This error has been showing up in the logs regularly for the last several days now (since Oct 27), so I'm not terribly inclined to write it off as merely a fluke. I run the disks with quite short SCTERC timeouts: 1.5 seconds read (to recover quickly from read errors) and 10 seconds write. I have confirmed that these values are active on the drive in question.
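For reference, SCTERC can be queried and set with smartctl along these lines (the timeouts are given in units of 100 ms, so 15 = 1.5 s and 100 = 10 s); the device path is the same redacted one as above:

# show the current SCT error recovery control settings
smartctl -l scterc /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX
# set the read timeout to 1.5 s and the write timeout to 10 s
smartctl -l scterc,15,100 /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX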
smartd keeps pestering me (which in itself is a good thing!) about the fact that the ATA error count is climbing:
The following warning/error was logged by the smartd daemon:
Device: /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX [SAT], ATA error count increased from 4 to 5
For details see host's SYSLOG.
Running smartctl --attributes on the drive in question yields the following:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 076 063 044 Pre-fail Always - 48910012
3 Spin_Up_Time 0x0003 091 091 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 97
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 092 060 030 Pre-fail Always - 1698336160
9 Power_On_Hours 0x0032 089 089 000 Old_age Always - 9887
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 98
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 095 095 000 Old_age Always - 5
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 10
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 058 052 045 Old_age Always - 42 (Min/Max 20/45)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 61
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 492
194 Temperature_Celsius 0x0022 042 048 000 Old_age Always - 42 (0 11 0 0)
195 Hardware_ECC_Recovered 0x001a 052 008 000 Old_age Always - 48910012
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
Nothing glaringly out of the ordinary there. Note that this is an enterprise drive, so it carries a five-year warranty and is rated for 24x7 operation (meaning it's meant to be reliable for over 40,000 hours of operation, compared to the just under 10,000 hours under its belt so far). Notice the number 5 in attribute 187 Reported_Uncorrect; that's where the problem is. Also note the fairly low Start_Stop_Count and Power_Cycle_Count values of under 100 each.
Not that I think it's relevant in this case, but yes, the system does have ECC RAM.
Non-default properties of the root file system on the pool are:
NAME PROPERTY VALUE SOURCE
akita type filesystem -
akita creation Thu Sep 12 18:03 2013 -
akita used 3,14T -
akita available 434G -
akita referenced 136K -
akita compressratio 1.04x -
akita mounted no -
akita mountpoint none local
akita version 5 -
akita utf8only off -
akita normalization none -
akita casesensitivity sensitive -
akita usedbysnapshots 0 -
akita usedbydataset 136K -
akita usedbychildren 3,14T -
akita usedbyrefreservation 0 -
akita sync standard local
akita refcompressratio 1.00x -
akita written 0 -
akita logicalused 2,32T -
akita logicalreferenced 15K -
and correspondingly for the pool itself:
NAME PROPERTY VALUE SOURCE
akita size 3,62T -
akita capacity 62% -
akita health ONLINE -
akita dedupratio 1.00x -
akita free 1,36T -
akita allocated 2,27T -
akita readonly off -
akita ashift 12 local
akita expandsize 0 -
akita feature@async_destroy enabled local
akita feature@empty_bpobj active local
akita feature@lz4_compress active local
These lists were obtained by running {zfs,zpool} get all akita | grep -v default.
Now for the questions:
Why isn't ZFS reporting anything about the read problem? It's clearly recovering from it.
Why isn't ZFS automatically rewriting the duff sector that the drive is clearly having trouble reading, in turn hopefully triggering a relocation by the drive, given that sufficient redundancy exists for automatic repair in the read request path?
I suspect the ATA driver is retrying the read operation a couple of times when it receives an error before passing the error back to the filesystem driver.
What this means is that by the time the ZFS filesystem driver gets the result of the read, the data is all there and correct, but it likely took a bit longer than normal to arrive. There is, of course, no error counter for higher-than-average latency, so nothing gets logged.
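One way to sanity-check this theory (assuming your ZoL build ships the zpool events subcommand) is to ask ZFS what I/O events it has actually observed; if the retries succeed before ZFS ever sees a failure, there should be no error events listed:

# show all events ZFS itself has recorded for the pool, verbosely
zpool events -v akita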
The fact that the SMART value for Reported_Uncorrect is not 0 makes me suspect that the cause of the failure is the disk itself, as opposed to say the SATA cable or SATA controller being flakey.
If this is the case, the disk will most likely continue to deteriorate and eventually start failing reads even after however many retries the block device driver attempts. As such, my advice would be to replace the disk.
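If and when you do replace it, the swap itself is a one-liner; WWN-OF-FAILING-DISK and NEW-DISK below are placeholders, since I don't know which of the two mirror members is the ailing one:

# attach the replacement in place of the failing member and let ZFS resilver
zpool replace akita WWN-OF-FAILING-DISK /dev/disk/by-id/NEW-DISK
# then keep an eye on the resilver
zpool status akita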
Triggering a long SMART test would likely fail on the affected blocks. If you wanted to make the error go away, rewriting those blocks (with dd, for example, as sketched below) should cause the disk to swap out those sectors, but in my experience, once a drive starts to go it's better to just replace it and be done with it.
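For completeness, a rough sketch of both of those options; BAD_LBA and /dev/sdX are placeholders, and note that dd'ing underneath a live ZFS vdev clobbers whatever ZFS had in that block, so a subsequent scrub would have to repair it from the other half of the mirror:

# kick off an extended (long) self-test; results show up later under smartctl -a
smartctl -t long /dev/disk/by-id/ata-ST4000NM0033-9ZM170_XXXXXXXX
# rewrite a known-bad sector directly so the drive remaps it
dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=BAD_LBA oflag=direct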
I had a bad SCSI disk in a ZFS raid on Solaris. I was scanning the log files for error messages, to gather proof that the disk was bad and get Oracle to cover it under the hardware maintenance contract. I ran grep for certain strings in the error logs, and the lines showing disk errors appeared on my console screen. But when "Explorer" (the system log and reporting tool for Solaris) was run and sent to Oracle, they said there were no errors on the disks. I still had them in my screen history; when I checked, they were indeed gone from the logs on disk.
Here is what was going on: ZFS promises zero file system errors, not zero data errors. When serious corruption is encountered, it rolls back the transaction, doing whatever is required to keep the file system consistent. As a result, file updates are lost when it rolls back to an earlier version of the file from before the corruption, and therefore data loss may occur. But your file system is error free.
Perhaps for simple errors ZFS can roll back and rewrite the data in another attempt, but in serious cases of bad disk behaviour it seems it can get into a catch-22 where data isn't being recovered and rewritten. If you need to gather evidence of disk errors, get them onto the screen (or copied somewhere else entirely, as in the sketch below) and don't rely on proof that will reside on disk, where data is potentially being reset by ZFS transaction rollbacks.
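A minimal sketch of what I mean on a Linux box (the log path and the remote host otherbox are placeholders; on Solaris the equivalent log file would differ): stream the kernel log as the errors happen, keep it on screen, and append a copy on a machine the faulty system can't touch:

# follow the kernel log live, show it on screen, and keep an off-host copy
tail -f /var/log/kern.log | tee /dev/tty | ssh otherbox 'cat >> zfs-disk-errors.log'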