I'm having an issue that I am finding really hard to debug. While running ZFS, my system "hiccuped", dumped some information into dmesg, and continued working.
My ZFS box hosts VMs for ESXi. When this issue occurs, many of the VMs experience block I/O errors, and some of them drop into read-only mode, requiring an fsck or a restore from backup to repair the filesystems. The issue occurs only every few months. I have hammered the system trying to stress it, and it does not seem to be performance related, so conclusively solving it seems like a pipe dream to me.
First off, some info about my system (CentOS 7, kernel 4.5).
[root@zfs-head ~]# uname -a
Linux zfs-head 4.5.0-1.el7.elrepo.x86_64 #1 SMP Mon Mar 14 10:24:58 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
dmesg entries:
[4331253.022999] sd 2:0:28:0: [sdaa] tag#2 CDB: Read(10) 28 00 10 a8 3d b5 00 00 20 00
[4331253.023006] mpt3sas_cm0: sas_address(0x5000c500837f31f2), phy(8)
[4331253.023008] mpt3sas_cm0: enclosure_logical_id(0x50010c60004d41ff),slot(0)
[4331253.023010] mpt3sas_cm0: enclosure level(0x0003), connector name( )
[4331253.023013] mpt3sas_cm0: handle(0x002d), ioc_status(scsi data underrun)(0x0045), smid(222)
[4331253.023016] mpt3sas_cm0: request_len(131072), underflow(16384), resid(131072)
[4331253.023018] mpt3sas_cm0: tag(0), transfer_count(0), sc->result(0x00000000)
[4331253.023020] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331253.023023] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331253.023030] sd 2:0:28:0: Mode parameters changed
[4331266.475222] sd 2:0:29:0: [sdab] tag#29 CDB: Write(10) 2a 00 09 97 6e c1 00 00 02 00
[4331266.475229] mpt3sas_cm0: sas_address(0x5000c500837f25c6), phy(9)
[4331266.475232] mpt3sas_cm0: enclosure_logical_id(0x50010c60004d41ff),slot(1)
[4331266.475234] mpt3sas_cm0: enclosure level(0x0003), connector name( )
[4331266.475237] mpt3sas_cm0: handle(0x002e), ioc_status(scsi data underrun)(0x0045), smid(139)
[4331266.475239] mpt3sas_cm0: request_len(8192), underflow(1024), resid(8192)
[4331266.475241] mpt3sas_cm0: tag(0), transfer_count(0), sc->result(0x00000000)
[4331266.475244] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331266.475246] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331266.475252] sd 2:0:29:0: Mode parameters changed
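For anyone reading along, the CDB and sense bytes in these entries can be decoded by hand. Below is a minimal Python sketch, assuming the standard 10-byte READ/WRITE CDB layout (opcode in byte 0, big-endian LBA in bytes 2-5, transfer length in bytes 7-8). The function and table names are made up for illustration, and the lookup tables only cover the values that appear in the log above.

```python
# Decode the 10-byte CDBs and the sense triple copied from the dmesg output.
# Lookup tables are deliberately tiny: only values seen in this log are listed.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
SENSE_KEYS = {0x06: "UNIT ATTENTION"}
ASC_ASCQ = {(0x2A, 0x01): "MODE PARAMETERS CHANGED"}


def decode_cdb10(hex_bytes):
    """Split a 10-byte CDB (hex string as printed by the kernel) into fields."""
    cdb = bytes.fromhex(hex_bytes.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),      # starting logical block
        "blocks": int.from_bytes(cdb[7:9], "big"),   # transfer length in blocks
    }


def decode_sense(key, asc, ascq):
    """Translate [sense_key, asc, ascq] into text where the table knows it."""
    return (SENSE_KEYS.get(key, hex(key)),
            ASC_ASCQ.get((asc, ascq), "unknown"))


read_cmd = decode_cdb10("28 00 10 a8 3d b5 00 00 20 00")   # sdaa entry
write_cmd = decode_cdb10("2a 00 09 97 6e c1 00 00 02 00")  # sdab entry

# request_len values are taken from the matching mpt3sas_cm0 lines.
read_block_size = 131072 // read_cmd["blocks"]
write_block_size = 8192 // write_cmd["blocks"]

sense = decode_sense(0x06, 0x2A, 0x01)

print(read_cmd, read_block_size)
print(write_cmd, write_block_size)
print(sense)
```

Read this way, the first entry is a 32-block read and the second a 2-block write, and dividing each request_len by the block count gives 4096 bytes per block in both cases. The sense triple decodes to a UNIT ATTENTION with "mode parameters changed", which agrees with the kernel's own "Mode parameters changed" line; in both entries resid equals request_len and transfer_count is 0, so no data was transferred for these commands.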
pool status:
[root@zfs-head ~]# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

    NAME                                             STATE     READ WRITE CKSUM
    storage                                          ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        s1d1                                         ONLINE       0     0     0
        s2d1                                         ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        s3d1                                         ONLINE       0     0     0
        s4d1                                         ONLINE       0     0     0
      mirror-2                                       ONLINE       0     0     0
        s1d2                                         ONLINE       0     0     0
        s2d2                                         ONLINE       0     0     0
      mirror-3                                       ONLINE       0     0     0
        s3d2                                         ONLINE       0     0     0
        s4d2                                         ONLINE       0     0     0
      mirror-4                                       ONLINE       0     0     0
        s1d3                                         ONLINE       0     0     0
        s2d3                                         ONLINE       0     0     0
      mirror-5                                       ONLINE       0     0     0
        s3d3                                         ONLINE       0     0     0
        s4d3                                         ONLINE       0     0     0
    logs
      ata-Samsung_SSD_850_PRO_128GB_S24ZNXAGA10768M  ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAG918721R  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA59337A  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA69590F  ONLINE       0     0     0

errors: No known data errors
[root@zfs-head ~]#
My vdev map:
[root@zfs-head ~]# cat /etc/zfs/vdev_id.conf
# by-vdev
# name fully qualified or base name of device link
alias s1d1 /dev/disk/by-id/scsi-35000c500837ff247
alias s1d2 /dev/disk/by-id/scsi-35000c500837f15c3
alias s1d3 /dev/disk/by-id/scsi-35000c500837f137f
alias s2d1 /dev/disk/by-id/scsi-35000c500837f377b
alias s2d2 /dev/disk/by-id/scsi-35000c500837f5bf7
alias s2d3 /dev/disk/by-id/scsi-35000c500837f75bf
alias s3d1 /dev/disk/by-id/scsi-35000c500837f14d3
alias s3d2 /dev/disk/by-id/scsi-35000c500837f571b
alias s3d3 /dev/disk/by-id/scsi-35000c500837f604f
alias s4d1 /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2 /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3 /dev/disk/by-id/scsi-35000c500837f14cf
[root@zfs-head ~]#
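To tie the dmesg entries back to pool members, the sas_address values can be compared against the WWNs in the vdev map. The sketch below is only a rough aid and rests on two assumptions that are not confirmed anywhere in this post: that the by-id name is a leading "3" followed by the 16-hex-digit WWN, and that a drive's SAS port address differs from that WWN only in the lowest two bits. The helper names are hypothetical.

```python
# Match mpt3sas sas_address values to vdev aliases.
# ASSUMPTION: by-id name = "3" + 16 hex digit WWN.
# ASSUMPTION: port address and WWN differ only in the low two bits.
VDEV_ID_CONF = """\
alias s1d1 /dev/disk/by-id/scsi-35000c500837ff247
alias s1d2 /dev/disk/by-id/scsi-35000c500837f15c3
alias s1d3 /dev/disk/by-id/scsi-35000c500837f137f
alias s2d1 /dev/disk/by-id/scsi-35000c500837f377b
alias s2d2 /dev/disk/by-id/scsi-35000c500837f5bf7
alias s2d3 /dev/disk/by-id/scsi-35000c500837f75bf
alias s3d1 /dev/disk/by-id/scsi-35000c500837f14d3
alias s3d2 /dev/disk/by-id/scsi-35000c500837f571b
alias s3d3 /dev/disk/by-id/scsi-35000c500837f604f
alias s4d1 /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2 /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3 /dev/disk/by-id/scsi-35000c500837f14cf
"""

LOW_BITS = 0x3  # bits ignored when comparing (assumption, see above)


def parse_aliases(conf_text):
    """Return {wwn_as_int: alias_name} for every 'alias' line."""
    aliases = {}
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] != "alias":
            continue  # skip comments and anything unexpected
        ident = fields[2].rsplit("scsi-", 1)[-1]
        aliases[int(ident[1:], 16)] = fields[1]  # drop the leading "3"
    return aliases


def alias_for_sas_address(sas_address, aliases):
    """Return the single alias whose WWN matches, or None if ambiguous."""
    key = sas_address & ~LOW_BITS
    matches = [name for wwn, name in aliases.items()
               if (wwn & ~LOW_BITS) == key]
    return matches[0] if len(matches) == 1 else None


ALIASES = parse_aliases(VDEV_ID_CONF)
sdaa_alias = alias_for_sas_address(0x5000C500837F31F2, ALIASES)  # phy(8)
sdab_alias = alias_for_sas_address(0x5000C500837F25C6, ALIASES)  # phy(9)
print(sdaa_alias, sdab_alias)
```

If those assumptions hold, the two addresses from the log resolve to s4d1 and s4d2, which sit in mirror-1 and mirror-3 respectively. Checking the actual device links (for example with ls -l /dev/disk/by-id) would confirm or refute the mapping.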
The box didn't restart, or really even acknowledge that there was an issue, apart from the dmesg entries. I have googled those entries as best I can, but did not find anything relevant.
Help appreciated!