Yesterday I spent 4 hours trying to get my network's DHCP/DNS/SMB server back online. Long story short, it took numerous wildly frustrated shots in the dark (no DNS = no internet resources for help) and no fewer than half a dozen reboots to finally restore my server to functioning order.
What precipitated this was configuring and enabling my server's second Ethernet port in /etc/network/interfaces. That's when it all hit the fan. I've finally gotten eth1 disabled again and eth0 is working as before, but this isn't the state I want this server to be in.
eth0 and eth1 are both gigabit ports built into the motherboard (an ASUS something-or-other), and previously they were both bonded together (round-robin, I think); however, the server's been completely reformatted and re-installed since then (hard drive failure precipitated that), so I would think that anything the bonding driver had configured would be dead and gone.
While the server was offline, ifconfig seemed to be showing that it was receiving packets just fine, but every single outgoing packet was being dropped. (I should have saved the output from ifconfig during the issue, but the 'TX' line showed "packets:0" and "dropped:123"; also "errors:0 ... overrun:0 carrier:0".)
eth0 is configured with a static IP; I did the same for eth1. Here is /etc/network/interfaces:
root@odin:~# cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet static
    address 10.12.0.50
    netmask 255.0.0.0
    gateway 10.12.0.2
# The secondary network interface
# Commented out now because this was the only way I could get it to work again
#auto eth1
#iface eth1 inet static
# address 10.12.0.51
# netmask 255.0.0.0
# gateway 10.12.0.2
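As a quick sanity check on the two stanzas above, here is a minimal Python sketch (standard-library ipaddress only) that makes explicit that both configured addresses land in the same 10.0.0.0/8 network. It does not establish that this overlap caused the outage; it only shows the overlap. The helper name same_network is made up for illustration.

```python
import ipaddress


def same_network(addr_a: str, addr_b: str, netmask: str) -> bool:
    """Return True if both addresses fall in the same network for the given netmask."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{netmask}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{netmask}").network
    return net_a == net_b


# Values copied from the interfaces file above.
eth0_net = ipaddress.ip_interface("10.12.0.50/255.0.0.0").network
print(eth0_net)                                               # 10.0.0.0/8
print(same_network("10.12.0.50", "10.12.0.51", "255.0.0.0"))  # True
```

With a 255.0.0.0 netmask, the 10.42.0.206 address that shows up in the syslog entries further down also belongs to this network.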
ethtool shows:
root@odin:~# ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
The output for eth1 is identical, except that it shows "Link detected: no" because it's disabled currently; "Link detected" was always "yes" for either interface when it was supposedly enabled, even when eth0 was apparently unable to send any packets.
/var/log/syslog shows numerous entries like this:
May 11 21:55:08 odin kernel: [ 797.050022] forcedeth 0000:00:08.0: eth0: Got tx_timeout. irq: 00000020
May 11 21:55:08 odin kernel: [ 797.050026] forcedeth 0000:00:08.0: eth0: Ring at 112804000
May 11 21:55:08 odin kernel: [ 797.050029] forcedeth 0000:00:08.0: eth0: Dumping tx registers
May 11 21:55:08 odin kernel: [ 797.050035] forcedeth 0000:00:08.0: eth0: 0: 00000020 000000df 00000003 0001000d 00000000 00000000 00000000 00000000
[bunch more lines like this one, though none reference eth1]
Also in syslog are countless repetitions of the following lines:
May 11 21:54:42 odin kernel: [ 770.480861] martian source 10.12.0.50 from 10.42.0.206, on dev eth1
May 11 21:54:42 odin kernel: [ 770.480865] ll header: ff:ff:ff:ff:ff:ff:00:1e:65:d6:6c:6a:08:06
May 11 21:54:42 odin kernel: [ 770.987932] martian source 10.12.0.51 from 10.12.0.2, on dev eth1
May 11 21:54:42 odin kernel: [ 770.987937] ll header: ff:ff:ff:ff:ff:ff:00:13:46:ed:e2:4a:08:06
The "from" address is different, but it's always eth1 and always "source" 10.12.0.50 or .51. That "martian" thing reminded me that I am running Shorewall, but turning it off (and verifying that iptables -L showed nothing but accepting everything from/to anywhere) had no effect whatsoever. I'm not even sure why eth1 would be seeing traffic intended for eth0's address in the first place, given that they're connected to a switch that (in my understanding, anyway) would only send packets to their intended destinations. (It is an unmanaged gigabit switch, Linksys I think.)
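For reference, the "ll header" field in the martian entries above is the 14-byte Ethernet header of the offending frame: destination MAC, source MAC, then EtherType. The following is a minimal Python sketch for splitting it apart; the function name decode_ll_header is hypothetical, and only two well-known EtherType values are named.

```python
def decode_ll_header(header: str) -> dict:
    """Split a 14-byte Ethernet header (colon-separated hex) into its fields."""
    octets = header.strip().split(":")
    if len(octets) != 14:
        raise ValueError(f"expected 14 octets, got {len(octets)}")
    ethertype = int("".join(octets[12:14]), 16)
    # Only a couple of well-known EtherType values are named here.
    names = {0x0800: "IPv4", 0x0806: "ARP"}
    return {
        "dst_mac": ":".join(octets[0:6]),
        "src_mac": ":".join(octets[6:12]),
        "ethertype": names.get(ethertype, hex(ethertype)),
    }


# Header copied from the first martian entry above.
frame = decode_ll_header("ff:ff:ff:ff:ff:ff:00:1e:65:d6:6c:6a:08:06")
print(frame["dst_mac"])    # ff:ff:ff:ff:ff:ff (the broadcast address)
print(frame["src_mac"])    # 00:1e:65:d6:6c:6a
print(frame["ethertype"])  # ARP
```

Both logged headers decode to broadcast ARP frames. A switch floods broadcast frames to every port, so eth1 receiving them is consistent with normal switch behaviour and does not by itself mean unicast traffic for eth0 is being misdelivered.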
I don't even know how to begin to diagnose or troubleshoot what went wrong here. Frankly, I'm afraid to try to start eth1 again, especially since I don't even know what finally fixed the problem so I don't know that I could get it reverted again to its current state. What can I do to figure out what happened, and to fix it so that I can again turn on eth1 without blowing up the server's networking again? Could the hardware still be mis-configured from the previous system install using the bonding driver? How could I determine that and, if that's the case, fix it?
Both ports worked perfectly independently on the previous install before I set up bonding, and I had no issues at all during that time. I re-installed the system about 4-ish weeks ago, and eth1 has been disabled since then (Ubuntu detected it during the installation routine, but I of course chose eth0 as my "primary" interface during the install and Ubuntu apparently made no effort to configure eth1 after that).
Couple of notes:
mode=active-backup
What you should do:
Run
ip addr flush dev eth1; ip link set up dev eth1
to see if merely bringing up eth1 causes eth0 to fail. If it does, you likely have hardware problems.
Set up a bonding interface (mode=active-backup) with both eth0 and eth1 as slaves and assign the server's IP address to that.

If your NICs were previously bonded together, it's quite possible you need to reconfigure the switch ports. The ports may have been trunked; try plugging your NICs into untagged ports on the same VLAN.
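To make the eth1 bring-up test less of a shot in the dark, it may help to snapshot eth0's transmit counters before and after, since ifconfig output was not saved the first time. Below is a minimal Python sketch that reads the kernel's per-interface counters under /sys/class/net/<iface>/statistics. The function names and the "stalled" heuristic (drops rising while nothing is sent) are assumptions for illustration, not an established diagnostic.

```python
from pathlib import Path


def read_tx_counters(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read the kernel's TX packet, drop, and error counters for one interface."""
    stats = Path(sysfs_root) / iface / "statistics"
    return {
        name: int((stats / name).read_text().strip())
        for name in ("tx_packets", "tx_dropped", "tx_errors")
    }


def tx_delta(before: dict, after: dict) -> dict:
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name] for name in before}


def looks_stalled(delta: dict) -> bool:
    """Heuristic (an assumption): drops rose while no packets went out."""
    return delta["tx_packets"] == 0 and delta["tx_dropped"] > 0


# Illustrative snapshots only; the 0 / 123 figures echo the TX line
# recalled in the question, not a fresh measurement.
before = {"tx_packets": 0, "tx_dropped": 0, "tx_errors": 0}
after = {"tx_packets": 0, "tx_dropped": 123, "tx_errors": 0}
print(looks_stalled(tx_delta(before, after)))  # True
```

On the server itself, the idea would be to call read_tx_counters("eth0") once before bringing eth1 up and again a minute or so afterwards, then compare the two snapshots with tx_delta.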