I am trying to copy ~7TB of data with rsync between two servers in the same data center. The backend storage is EMC VMAX3.
After copying ~30-40GB of data, multipath starts failing:
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: Recovered to normal mode
Dec 15 01:57:53 test.example.com multipathd: 360000970000196801239533037303434: remaining active paths: 1
Dec 15 01:57:53 test.example.com kernel: sd 1:0:2:20: [sdeu] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
[root@test log]# multipath -ll |grep -i fail
|- 1:0:0:15 sdq 65:0 failed ready running
- 3:0:0:15 sdai 66:32 failed ready running
We are using default multipath.conf
HBA driver version 8.07.00.26.06.8-k
HBA model QLogic Corp. ISP8324-based 16Gb Fibre Channel to PCI Express Adapter
OS: CentOS 64-bit/2.6.32-642.6.2.el6.x86_64
Hardware:Intel/HP ProLiant DL380 Gen9
I have already verified this solution and checked with EMC; everything looks good: https://access.redhat.com/solutions/438403
Some more info:
- There are no dropped or errored packets on the network side.
- Filesystem is mounted with noatime,nodiratime
- Filesystem is ext4 (already tried xfs, but got the same error)
- LVM is in striped mode (started with the linear option and then converted to striped)
- Already disabled THP:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
- Whenever multipath starts failing, the process goes into D state
- System firmware upgraded
- Tried the latest version of the QLogic driver
- Tried different schedulers (noop, deadline, cfq)
- Tried a different tuned profile (enterprise-storage)
I was able to collect a vmcore at the time of the issue:
KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.6.2.el6.x86_64/vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 36
DATE: Fri Dec 16 00:11:26 2016
UPTIME: 01:48:57
LOAD AVERAGE: 0.41, 0.49, 0.60
TASKS: 1238
NODENAME: test.example.com
RELEASE: 2.6.32-642.6.2.el6.x86_64
VERSION: #1 SMP Wed Oct 26 06:52:09 UTC 2016
MACHINE: x86_64 (2297 Mhz)
MEMORY: 511.9 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000018"
PID: 15840
COMMAND: "kjournald"
TASK: ffff884023446ab0 [THREAD_INFO: ffff88103def4000]
CPU: 2
STATE: TASK_RUNNING (PANIC)
After enabling debug mode on the QLogic side:
qla2xxx [0000:0b:00.0]-3822:5: FCP command status: 0x2-0x0 (0x70000) nexus=5:1:0 portid=1f0160 oxid=0x800 cdb=2a200996238000038000 len=0x70000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d42580 cp=ffff88276d249480.
qla2xxx [0000:84:00.0]-3822:7: FCP command status: 0x2-0x0 (0x70000) nexus=7:0:3 portid=450000 oxid=0x4de cdb=2a20098a5b0000010000 len=0x20000 rsp_info=0x0 resid=0x0 fw_resid=0x0 sp=ffff882189d421c0 cp=ffff8880237e0880.
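For reference, the cdb field in these lines can be decoded by hand. A minimal sketch, assuming the commands are 10-byte WRITE(10) CDBs (opcode 0x2a) and the LUN uses 512-byte logical blocks; decode_write10 is a hypothetical helper name, not part of any tool:

```shell
#!/usr/bin/env bash
# Decode a WRITE(10) CDB given as 20 hex digits, as printed in the qla2xxx debug line.
# Assumption: 512-byte logical blocks (used only to convert blocks to bytes).
decode_write10() {
    local cdb=$1
    if [[ ! $cdb =~ ^[0-9a-fA-F]{20}$ ]]; then
        echo "expected 20 hex digits" >&2
        return 1
    fi
    local opcode=$((16#${cdb:0:2}))    # byte 0: operation code
    local flags=$((16#${cdb:2:2}))     # byte 1: WRPROTECT is bits 7-5
    local lba=$((16#${cdb:4:8}))       # bytes 2-5: logical block address
    local blocks=$((16#${cdb:14:4}))   # bytes 7-8: transfer length in blocks
    if (( opcode != 0x2a )); then
        echo "not a WRITE(10) CDB" >&2
        return 1
    fi
    printf 'opcode=0x%02x wrprotect=%d lba=0x%x blocks=%d bytes=0x%x\n' \
        "$opcode" $((flags >> 5)) "$lba" "$blocks" $((blocks * 512))
}

# The two CDBs from the log lines above
decode_write10 2a200996238000038000
decode_write10 2a20098a5b0000010000
```

Under these assumptions the transfer lengths (0x380 and 0x100 blocks) work out to 0x70000 and 0x20000 bytes, which matches the len= values logged by the driver. Byte 1 (0x20) decodes to WRPROTECT=1, which suggests the writes were sent with protection information requested.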
This is an HP ProLiant DL380 Gen9 server. Pretty standard enterprise-class server.
Can you give me information on the server's firmware revision?
Is EMC PowerPath actually installed? If so, check here.
Do you have the HP Management Agents installed? If so, can you post the output of hplog -v?
Have you seen anything in the ILO4 log? Is the ILO accessible?
Can you describe all of the PCIe cards installed in the system's slots?
For RHEL6-specific tuning, I highly recommend XFS, running tuned-adm profile enterprise-storage and ensuring your filesystems are mounted nobarrier (the tuned profile should handle that).
For the volumes, please ensure that you're using the dm (multipath) devices instead of /dev/sdX. See: https://access.redhat.com/solutions/1212233
Looking at what you've presented so far and the check listed at Red Hat's support site (and the description here), I can't rule out the potential for HBA failure or PCIe riser problems. Also, there's a slight possibility that there's an issue on the VMAX side.
Can you swap PCIe slots and try again? Can you swap cards and try again?
Is the firmware on the HBA current? Here's the most recent package from December 2016.
This looks to me like one of your SFPs has soft-failed... Look in your storage switch for errors on the port while you are doing a large copy.
I had a similar issue recently where everything looked great. Server vendor signed off on their stuff, storage vendor said their stuff looks good, swore the SFPs are all fine... SFP still showed as up and functional, until large amounts of data were sent across the MPIO interface and lots of errors on the storage switch port would start getting logged.
I had to replace all fiber cables with new ones, then switch SFPs with spares I had on hand to prove to the vendor that the SFP was bad, even though it appeared fine otherwise.
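The host keeps its own link error counters, which can be compared with what the switch port reports during a large copy. A minimal sketch, assuming a Linux host that exposes the standard fc_host statistics in sysfs; fc_error_counters is a hypothetical helper name and the counter list is only a selection:

```shell
#!/usr/bin/env bash
# Print the non-zero FC link error counters for every HBA port.
# The base directory is a parameter so the function can be pointed at a saved copy of the tree.
fc_error_counters() {
    local base=${1:-/sys/class/fc_host}
    local host counter file value
    for host in "$base"/host*; do
        [ -d "$host/statistics" ] || continue
        for counter in link_failure_count loss_of_sync_count loss_of_signal_count \
                       invalid_tx_word_count invalid_crc_count error_frames; do
            file="$host/statistics/$counter"
            [ -r "$file" ] || continue
            value=$(<"$file")
            # All-ones means the driver does not report this counter
            [ "$value" = "0xffffffffffffffff" ] && continue
            # sysfs prints the counters in hex (e.g. 0x1f); bash arithmetic accepts that form
            if (( value != 0 )); then
                printf '%s %s %d\n' "${host##*/}" "$counter" $((value))
            fi
        done
    done
}

fc_error_counters
```

Running it before and after a large copy and comparing the two outputs shows whether counters such as invalid_crc_count grow under load on a particular port.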
As far as I know, you need to change MULTIPATH=NO to MULTIPATH=YES in /etc/sysconfig/mkinitrd/multipath and, in /etc/multipath.conf, comment out the following:
blacklist {devnode "*"}
Enable multipathd at boot:
chkconfig multipathd on
Load the modules:
modprobe dm-multipath
modprobe dm-round-robin
Auto-configure the paths:
multipath -v2
Reboot the server and check that everything is up:
lsmod | grep dm_
Check the multipath status:
multipath -ll
The issue is finally resolved.
Error: TECH PREVIEW: DIF/DIX support may not be fully supported.
I constantly saw this message in dmesg at the time of the issue and kept ignoring it.
On further debugging, I found that the kernel was in a tainted state.
As per RedHat
As we are using EMC, we decided to disable this feature, and that did the trick.
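The post does not say how the feature was disabled. One possible way, shown here only as a sketch, is the qla2xxx module parameter ql2xenabledif (0 means no DIF support). The helper name disable_qla_dif and the file name qla2xxx.conf are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Add a modprobe option that turns off DIF/DIX in the qla2xxx driver.
# Assumption: the HBA uses the in-kernel qla2xxx driver; the target file name is arbitrary.
disable_qla_dif() {
    local conf=${1:-/etc/modprobe.d/qla2xxx.conf}
    # Append the option only if the parameter is not already configured in this file
    grep -qs 'ql2xenabledif' "$conf" || echo "options qla2xxx ql2xenabledif=0" >> "$conf"
}

# Typical follow-up on RHEL/CentOS 6 (not run here):
#   disable_qla_dif                                   # write the option as root
#   dracut -f                                         # rebuild the initramfs so the option applies at boot
#   reboot
#   cat /sys/module/qla2xxx/parameters/ql2xenabledif  # should print 0 after the reboot
```

If the option took effect, the TECH PREVIEW message should no longer appear in dmesg after the driver loads.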