I've run out of ideas with this problem, so I thought an SF question might help.
We have a number of Ubuntu 9.10 servers which we've recently switched from single NICs to bonded NICs using standard kernel network bonding.
This setup works as planned (and has done in the past for various Linux machines); however, we've had some boxes simply drop off the network hours after enabling bonding.
The boxes literally stop responding on the network, however a simple /etc/init.d/networking restart via the KVM brings the connection back online.
My first thoughts were that either 1) the upstream connection stopped, 2) something local on the box blew away the network configuration (e.g. network-manager), or 3) the bonding crashed somehow.
However I quickly hit a wall trying to investigate this across all four servers.
The event is not logged locally on any of the servers (/var/log/*, dmesg, etc). I expected to see a change in link status or similar.
The upstream switches all centrally syslog, which also recorded no change in network state, nor MAC flapping.
/proc/net/bonding/bond0 reported no issues (for reference, healthy output is sketched just below this list).
I can't see anything along the lines of network-manager running.
The only things logged are the changes in network state caused by running the service restart.
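For reference, this is roughly what a healthy mode=1 bond reports in /proc/net/bonding/bond0 (the driver version string and MAC addresses here are placeholders; ours reported the equivalent, with both slaves up, right through the failures):

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:16:3e:aa:bb:01

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:16:3e:aa:bb:02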
Originally we used mode=0 (balance-rr, i.e. active-active), but following the suggestion that it was causing network confusion, with the same MAC present in two places, we switched to mode=1 (active-backup) -- this made no difference and the servers failed again a few hours later.
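Concretely, the mode=1 test just meant changing the options line in /etc/modprobe.d/bonding.conf (the file under Configuration below still shows the original mode=0) and reloading the bonding module:

options bonding mode=1 miimon=100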
It's like the network just "stops". Any ideas folks?
Configuration
/etc/modprobe.d/bonding.conf
alias bond0 bonding
options bonding mode=0 miimon=100
/etc/network/interfaces
auto bond0
iface bond0 inet static
address 192.168.1.10
gateway 192.168.1.1
netmask 255.255.255.0
slaves eth0 eth1
up /sbin/ifenslave bond0 eth0 eth1
down /sbin/ifenslave -d bond0 eth0 eth1
auto eth0
iface eth0 inet manual
auto eth1
iface eth1 inet manual
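For what it's worth, until we catch one in the act we're considering cron'ing a small snapshot script every minute, so the next failure at least leaves a local trace to compare against (the script name and log path are arbitrary choices):

#!/bin/sh
# /usr/local/sbin/bond-watch.sh -- run from cron every minute.
# Appends bond and per-slave link state so a later "silent"
# failure can be lined up against the last healthy snapshot.
LOG=/var/log/bond-watch.log
{
    date
    cat /proc/net/bonding/bond0
    ip -s link show eth0
    ip -s link show eth1
    ip route show
    echo '---'
} >> "$LOG" 2>&1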
You gave very little information to help debug.
Since you say only some boxes fail, compare the failing boxes against the working ones. Ubuntu has had random network failures due to "bad" kernels in specific configurations, even without bonding. Try an alternate kernel, assuming the existing kernel matches a system that is working.
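For example (the version string here is a placeholder; pick whatever alternate image your mirror carries):

# See what's installed now, then pull in a different kernel image
# and choose it from the GRUB menu on the next boot.
dpkg -l 'linux-image-*'
apt-cache search linux-image
sudo apt-get install linux-image-<alternate-version>
sudo reboot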
Depending on the switch and the bonding mode used, even a single NIC failure can cause the connection to hang. Try a dual-channel transparent bridge with a packet analyzer to determine which NIC was last used before the failure. Also look at the last packet type, flags, retransmits, etc. sent on the wire before the failure.
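Even without a dedicated bridge, capturing on each physical slave separately (rather than on bond0) shows which NIC actually carried the last frames; something along these lines:

# Full-length captures per slave; compare the tail of each pcap
# after the next failure to see which NIC went quiet first.
tcpdump -ni eth0 -s 0 -w /var/tmp/eth0.pcap &
tcpdump -ni eth1 -s 0 -w /var/tmp/eth1.pcap &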
Best guess with no more info: a buggy kernel or faulty hardware. Ubuntu would not be my first choice for a server OS; it is targeted at novice desktop Linux users, and the current release targets netbook users. Ubuntu is a good selection for the desktop due to its popularity: larger forums, more desktop-oriented hardware drivers, more desktop apps. Debian and CentOS/RHEL both have larger install bases in "mission critical" production use for Linux servers.