I just had the motherboard replaced in a Dell PowerEdge R410 that functions as one of our virtual servers (running Ubuntu 10.04.3 LTS). I'm fairly new to Linux and was quite surprised that networking was completely broken after the swap. As a further disclaimer, I didn't build our virtual servers in the first place and have a very limited understanding of how Linux-KVM works.

Once the motherboard was swapped, I ran the Lifecycle Controller application and applied a variety of updates, most notably firmware updates for the NICs.

After a lot of research, I finally managed to "fix" networking by editing /etc/udev/rules.d/70-persistent-net.rules. In that file, I removed the two old Broadcom (bnx2) entries for the prior motherboard, renamed the new bnx2 entries from eth2 and eth3 to eth0 and eth1 respectively, and moved eth0 and eth1 to the top of the file. The (igb) entries are for a PCI-based Intel Gigabit NIC that's currently unused. Here are the contents of my 70-persistent-net.rules file:
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.
# PCI device 0x14e4:0x163b (bnx2)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="78:2b:cb:20:9d:71", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# PCI device 0x14e4:0x163b (bnx2)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="78:2b:cb:20:9d:72", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
# PCI device 0x8086:0x10c9 (igb)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="90:e2:ba:0c:7e:f9", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
# PCI device 0x8086:0x10c9 (igb)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="90:e2:ba:0c:7e:f8", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
This fixed the completely broken networking (I couldn't even ping the gateway before), but a much bigger problem has persisted: the server hardware randomly reboots. I can't reproduce the crash on demand; it happens after bringing up the 5 guest OSes that run on the machine and then working them normally (Splunk queries, pings, X11 forwarding to PuTTY, etc.). The hardware itself passes all self-tests, and a Dell technician reviewed a DSET report I collected and said everything looks fine hardware-wise.
Here's my /etc/network/interfaces file:
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

# 10.1.225.x network
auto br0
iface br0 inet static
    address 10.1.225.12
    netmask 255.255.255.0
    network 10.1.225.0
    broadcast 10.1.225.255
    gateway 10.1.225.1
    bridge_ports eth0
    bridge_fd 9
    bridge_hello 2
    bridge_maxage 12
    bridge_stp off

#vlan 231
auto eth1.231
iface eth1.231 inet manual
    up ifconfig eth1.231 up

#KVM bridge, vlan 231, via eth1
iface br231 inet static
    bridge_ports eth1.231
    bridge_fd 9
    bridge_hello 2
    bridge_maxage 12
    bridge_stp off

##vlan 229
#auto eth1.229
#iface eth1.229 inet manual
# up ifconfig eth1.229 up
##KVM bridge, vlan 229, via eth1
#auto br229
#iface br229 inet manual
# bridge_ports eth1.229
# bridge_maxwait 5
# bridge_fd 1
# bridge_stp on
#
# !!!!! NOTE (MGRACE): This *is* used !!!!!
#
#No! Unused
auto br1
iface br1 inet manual
    bridge_ports eth1
    bridge_fd 9
    bridge_hello 2
    bridge_maxage 12
    bridge_stp off

#auto br2
#iface br2 inet manual
# bridge_ports eth1
# bridge_fd 9
# bridge_hello 2
# bridge_maxage 12
# bridge_stp off

#auto br3
#iface br3 inet manual
# bridge_ports eth1
# bridge_fd 9
# bridge_hello 2
# bridge_maxage 12
# bridge_stp off
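In case it's relevant, here's how I look at the bridge/VLAN layout on the host (brctl is from bridge-utils, and eth1.231 is created by the vlan/8021q support); I can post the output of any of these:

# list each bridge and the physical/VLAN port(s) attached to it
brctl show

# confirm the VLAN interface is tagged 231 on top of eth1
cat /proc/net/vlan/eth1.231

# show the MAC br0 is using (it inherits one from its member port, eth0)
ip link show br0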
I've scanned every log I can get my hands on and have yet to find a breadcrumb to follow =(. The Dell technician mentioned this could potentially be as simple as changing the hypervisor's MAC address, but so far I haven't been able to figure out how to do that. Any help is greatly appreciated, and I'd be happy to provide any additional information that might be useful.
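If I'm reading things right, br0 simply inherits eth0's MAC address, so the technician's suggestion presumably means pinning the bridge's MAC explicitly. Something like the line below, added to the br0 stanza, looks like it would do that (the MAC is just a placeholder and I haven't actually tried this, so corrections welcome):

# hypothetical: pin br0's MAC rather than letting it inherit eth0's
post-up ip link set dev br0 address xx:xx:xx:xx:xx:xx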
Thanks, -Snipe
Good news: it turns out a faulty power distribution block inside the server was causing the random reboots. I wouldn't have been able to figure this out if the problem hadn't gotten exponentially worse two Mondays ago, which finally let a Dell tech and me track down the source. Sorry for the misdiagnosis, everyone! =)
-Snipe