I'm having a problem with a few Linux boxes running Xen. They are acting as hypervisors and are connected to a SAN using a multipath setup to provide storage to the guest VMs.
Every now and then one of the two paths fails, but it can be quickly restored by running:
multipath
multipath -ll
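As a side note, to timestamp when a path actually drops (so it can be lined up with the kernel messages below), a simple watch loop like this could be left running; the log file name and the interval are arbitrary choices, not part of my setup:
while true; do
    echo "=== $(date) ===" >> /var/log/multipath-watch.log
    multipath -ll >> /var/log/multipath-watch.log 2>&1
    sleep 30
done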
I need to get to the bottom of this and find out why it is happening. I have noticed that it doesn't occur when the hypervisor is not too busy (network- and I/O-wise). I have also eliminated a possible hardware problem by moving all the services onto an identical new chassis. I have collected a few system logs which may point to a NIC module issue or a kernel problem; the failing multipath might only be a consequence of that. Here is the bit of log which always shows up when multipath goes down:
kernel: BUG: soft lockup - CPU#0 stuck for 60s! [swapper:0]
kernel: BUG: soft lockup - CPU#2 stuck for 60s! [events/2:76]
I'll paste the full logs at the end of this post to keep it easy to read. Now a little bit more about my setup:
- Internet access is set up over eth0 and eth2 (bonded)
- SAN multipath access is set up over eth1 and eth3
Server:
- Supermicro SuperServer 6016T-NTRF
- Intel(R) Xeon(R) CPU E5645
- Intel Corporation 82576 Gigabit Network
CentOS release 5.7 (Final) 2.6.18-274.18.1.el5xen
filename: /lib/modules/2.6.18-274.18.1.el5xen/kernel/drivers/net/igb/igb.ko
version: 3.0.6-k2-1
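Since the igb module is one of the suspects, I'm also thinking of checking the driver's error counters on the SAN-facing ports during a busy period. Something like this (eth1 shown here, and the grep pattern is just a convenience filter):
ethtool -S eth1 | grep -Ei 'err|drop|miss'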
- Log 02
If anyone needs more details, please get in touch. Any help will be much appreciated.
Since this appears to be an iSCSI setup, there are a couple of areas where path failovers can occur.
Multipath setups are very sensitive to latency on the wire, and iSCSI + Ethernet is going to have more of that than a Fibre Channel environment. Some flapping is going to be normal.
As this seems to happen when the hypervisor is busy, it suggests that the kernel NIC paths are getting either congested with data or starved for CPU (possibly both), which is triggering the multipath failover. There isn't a lot you can do about that directly, but you can narrow things down so you can better explain why it is doing what it is doing.
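One easy way to narrow it down is to check the ordering of events in syslog: if the soft lockups consistently appear before multipathd reports the path checker failing, multipath is just the messenger. Something along these lines (the exact message strings depend on your path checker and device names):
grep -E 'soft lockup|multipathd|checker' /var/log/messages | less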
Checking server load is pretty easy, and it sounds like you've already done that.
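For completeness, a couple of commands that make it easy to watch dom0 CPU pressure while the problem is happening (assuming the standard sysstat and Xen tools are installed; the intervals are arbitrary):
xentop -d 5        # per-domain CPU usage from the hypervisor's point of view
vmstat 5           # CPU, I/O wait and context switches inside dom0
sar -u 5           # overall CPU utilisation, including %iowait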
Diagnosing congestion is harder. If your network port bandwidth monitors aren't showing a lot of traffic but the log entries you posted happen anyway, that is a sign that the server is clogging up internally. If you can grab a packet capture during one of these events, the time-stamped packets will tell you whether there really are multi-second gaps in passed traffic that line up with the soft lockups; a sure sign that the server is internally clogged.
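If you go the capture route, something like this on one of the SAN-facing interfaces would do; eth1 and the standard iSCSI port 3260 are assumptions about your setup, and -s 96 keeps only headers so the file stays manageable:
tcpdump -i eth1 -s 96 -w /var/tmp/eth1-iscsi.pcap port 3260
Then read it back with full timestamps and look for long gaps between consecutive packets:
tcpdump -tttt -r /var/tmp/eth1-iscsi.pcap | less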
Fixing the problem is likely to be driver-specific, possibly combined with some tuning of the TCP/IP stack tunables.
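As a sketch of the kind of knobs involved (the values are illustrative only, and whether the igb driver accepts them is something to verify on your hardware):
ethtool -g eth1                               # show current RX/TX ring sizes
ethtool -G eth1 rx 4096                       # enlarge the receive ring if the driver supports it
sysctl -w net.core.netdev_max_backlog=30000   # queue more packets per softirq pass
sysctl -w net.core.rmem_max=16777216          # allow larger socket receive buffers
Change one thing at a time so it's clear which adjustment actually made a difference.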