We've got a CentOS server running a cluster of virtuals. Occasionally the cluster's internal network drops out for a minute or so ... and then comes back. The problem is somehow related to the actual network traffic, but it is not a simple load issue. (The system is generally lightly loaded, and the problem occurs irrespective of actual load.)
The setup:
- CentOS 5.6 on Dom0, various CentOS versions on the DomUs
- Hardware - a Dell R710 with a Broadcom NetXtreme II NIC (sigh)
- Using the latest Broadcom drivers for the NIC
- Xen configured to use network-bridge and vif-bridge
- Some iptables tweaks to route an unrelated port to one of the virtuals (a sketch follows this list).
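For reference, the port forwarding is a standard DNAT rule. A minimal sketch, with a made-up internal address (10.0.0.2) and port (2222) rather than our real ones:

```sh
# Forward an external TCP port to a NAT'ed DomU.
# 10.0.0.2 and 2222 are illustrative values, not our actual config.
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2222 \
    -j DNAT --to-destination 10.0.0.2:22
# Let the forwarded traffic through the FORWARD chain:
iptables -A FORWARD -p tcp -d 10.0.0.2 --dport 22 -j ACCEPT
```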
The system has one externally visible IP address, and Dom0 runs an Apache httpd configured with a number of virtual hosts each of which reverse proxies to web servers running on the virtuals. (The virtuals have to be NAT'ed, primarily because we don't have enough allocated public IP addresses.)
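Each virtual host is a plain reverse proxy. A minimal sketch of one vhost, with hypothetical names and an assumed internal address:

```apache
<VirtualHost *:80>
    ServerName app1.example.com
    # 10.0.0.2 is an assumed internal address for one of the DomUs.
    ProxyPass        / http://10.0.0.2:80/
    ProxyPassReverse / http://10.0.0.2:80/
</VirtualHost>
```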
The symptoms:
- Works fine most of the time.
- When someone tries to UPLOAD a large file to one of the virtuals, the internal network drops out ... for all virtuals:
- The Dom0 httpd sees a network timeout talking to the backend server on the virtual and reports a 502.
- A previously established ssh connection from Dom0 to any of the DomUs freezes.
- Our monitoring shows ping failures for traffic between virtuals.
- The Xen consoles to the DomUs do not freeze.
- No log messages in any log files that I can see, on either Dom0 or the DomUs ... apart from the Dom0 httpd logs.
- After a minute or so, the problem clears by itself.
This is 100% reproducible.
What we've tried:
- Downloading, building and installing the latest BNX2 driver on Dom0
- Turning off MSI on the NIC - adding "options bnx2 disable_msi=1" to /etc/modprobe.conf
- Turning off TCP segmentation offload - "ethtool -K eth0 tso off" (a sketch of these steps follows this list).
- Sacrificing a black rooster at midnight.
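For anyone wanting to replicate the non-poultry steps, this is roughly what they look like, assuming the NIC is eth0:

```sh
# Disable MSI for the bnx2 driver (applies on the next module reload or reboot):
echo "options bnx2 disable_msi=1" >> /etc/modprobe.conf

# Turn off TCP segmentation offload immediately ...
ethtool -K eth0 tso off
# ... and verify the offload settings:
ethtool -k eth0

# The ethtool setting does not survive a reboot; one simple way to
# persist it is to append the ethtool command to /etc/rc.local.
```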
I've exhausted all my options apart from switching to KVM ... or slaughtering more roosters.
Any suggestions?
We eventually did find the problem. It turned out to be caused by a problem in our virtual network configuration. For some reason that I can no longer remember exactly, the network traffic for that particular transfer was taking an extra loop through the virtual networks. When a user tried to upload a large file, the transfer tied up all of the available kernel network buffers. That caused the entire network to freeze ... until something timed out and it all unjammed.
I'm sorry that this is all a bit vague, but it may offer some hints for people who run into a similar problem.
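If you suspect something similar, inspecting the bridge topology on Dom0 is a reasonable first step. A sketch of the kind of checks that would have pointed us at the loop (xenbr0 is the Xen default bridge name; yours may differ):

```sh
# Show which vifs are attached to which bridges; a vif on an unexpected
# bridge (or one bridge patched into another) hints at a loop.
brctl show

# Watch the bridge while reproducing the upload; traffic that crosses
# the bridge more than once shows up as duplicated frames.
tcpdump -n -e -i xenbr0

# Check whether the kernel is dropping packets for lack of buffers:
netstat -s | grep -i -E 'drop|overflow'
```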
Perhaps there is a limited pool of network threads connecting the virtuals to the host, and uploading large files eventually takes up all of them, so the rest lose connectivity. I've got no other guesses. Sorry.
You might take a look at memory overcommitment and/or swap configuration(s). If either is "tuned to the hilt", then a large file upload may trigger the management of those resources, making the network unavailable until that management completes.
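A quick way to eyeball those knobs (these are the standard Linux sysctls; whether they are actually implicated here is just a guess):

```sh
# Current overcommit policy and swap behaviour:
sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness
# Memory and swap actually in use while reproducing the upload:
free -m
vmstat 5
```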
Are you sure you do not have MAC address conflicts?
This is just a wild guess, but it happens easily if one copies Xen domU config files and forgets to change the MAC address so that it is unique for each domU and interface. I have seen this cause strange network problems where all connectivity was lost for exactly 60 seconds occasionally.
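A quick way to check for duplicates across the domU configs (assuming they live in /etc/xen, the default location):

```sh
# List every MAC address in the Xen config files and print any that
# appear more than once:
grep -h -o -E '([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}' /etc/xen/* 2>/dev/null \
    | tr 'A-F' 'a-f' | sort | uniq -d
```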