Is there an underlying administrative or diagnostic interface to the Linux bonding driver to determine what is going on internally?
I've used link aggregation between Linux boxes and Cisco switches for many years. Periodically I run into a dead end when setting up new boxes, where the Linux side simply does not respond to Cisco LACP packets. I meticulously follow a strict set of instructions for each server, but the results appear to vary.
Whether the bond contains one slave or eight, tcpdump shows LACP packets coming from the switch on all bonded interfaces, and no packets are ever transmitted back. In fact, no packets are transmitted, period: rx_packets for the interface shows considerable traffic, but tx_packets is zero. There is nothing interesting in the logs regarding MII or bonding; there aren't even any errors.
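To be concrete about those counters (eth1 as the example slave), this is the sort of check I mean; rx_packets keeps climbing while tx_packets stays at zero:
# cat /sys/class/net/eth1/statistics/rx_packets
# cat /sys/class/net/eth1/statistics/tx_packets
# ip -s link show eth1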
Presently, I'm dealing with a box that has only two NICs. For the moment, I have only eth1 in the bond. Obviously, this is a degenerate configuration. The situation does not change with both eth0 and eth1 in the bond; it just makes it harder to work on the machine when the network stack is completely down. I can reconfigure it for both NICs if necessary and go through an administrative interface (DRAC), but I can't copy and paste from the box that way.
Some preliminaries:
- I tested the NICs, ports, and cables. Everything works as expected when the interfaces are not bonded.
- I have rebooted and confirmed that the modules are loading properly (the rough check is shown after this list).
- I have tried this with and without the VLAN trunking; it should not matter, as link aggregation takes place below that point in the stack.
- The switch has working, trunked channel-groups going to other Linux boxes. The configurations are more-or-less identical even though the distros, kernels, and hardware of the Linux boxes are not.
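The post-reboot check amounts to roughly the following, and none of it turns up anything unusual:
# lsmod | grep bonding
# ip link show bond0
# dmesg | grep -i bond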
This is Debian 8.6, downloaded today.
Linux box 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux
An abbreviated config:
iface eth1 inet manual

auto bond0
iface bond0 inet manual
    slaves eth1
    address 10.10.10.10
    netmask 255.255.255.0
    bond_mode 4
    bond_miimon 100
    bond_downdelay 200
    bond_updelay 200
    bond_xmit_hash_policy layer2+3
    bond_lacp_rate slow
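For completeness, the same bond expressed directly with iproute2, which I can try over the DRAC if that would rule out the ifupdown/ifenslave layer (a sketch, not yet verified on this box; option names per ip-link(8)):
# modprobe bonding
# ip link add bond0 type bond mode 802.3ad miimon 100 updelay 200 downdelay 200 lacp_rate slow xmit_hash_policy layer2+3
# ip link set eth1 down
# ip link set eth1 master bond0
# ip link set bond0 up
# ip addr add 10.10.10.10/24 dev bond0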
Some state:
# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2+3 (2)
MII Status: down
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
bond bond0 has no active aggregator
Slave Interface: eth1
MII Status: down
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 78:2b:cb:5a:2b:3e
Aggregator ID: N/A
Slave queue ID: 0
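Beyond /proc, the only other interface I know of is sysfs, which exposes the same parameters plus per-slave 802.3ad state (the bonding_slave directory may depend on kernel version):
# grep '' /sys/class/net/bond0/bonding/*
# grep '' /sys/class/net/eth1/bonding_slave/*
# cat /sys/class/net/eth1/carrier
# ethtool eth1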
An inbound tcpdump record on eth1 from the switch:
22:18:47.333928 M 44:ad:d9:6c:8d:8f ethertype Slow Protocols (0x8809),
length 126: LACPv1, length 110
Actor Information TLV (0x01), length 20
System 44:ad:d9:6c:8d:80, System Priority 32768, Key 12,
Port 272, Port Priority 32768
State Flags [Activity, Aggregation, Synchronization,
Collecting, Distributing, Default]
Partner Information TLV (0x02), length 20
System 00:00:00:00:00:00, System Priority 0, Key 0, Port 0,
Port Priority 0
State Flags [none]
Collector Information TLV (0x03), length 16
Max Delay 32768
Terminator TLV (0x00), length 0
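A filter along these lines is enough to watch for LACPDUs in both directions on the slave (0x8809 is the Slow Protocols ethertype); only inbound frames ever appear:
# tcpdump -nn -vv -e -i eth1 ether proto 0x8809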
The Cisco side:
interface GigabitEthernet1/0/15
switchport trunk allowed vlan 100,101,102
switchport mode trunk
channel-group 12 mode active
end
interface Port-channel12
switchport trunk allowed vlan 100,101,102
switchport mode trunk
end
Eventually, the switch gives up, and the interface goes into "stand-alone" mode. If there are two interfaces in the channel-group, they both go into stand-alone mode.
#show etherchannel 12 sum
Flags: I - stand-alone
Group Port-channel Protocol Ports
------+-------------+-----------+-----------
12 Po12(SD) LACP Gi1/0/15(I)
I've racked my brain on this all day. I've torn out and rebuilt the Cisco configuration several times. If it weren't for the tcpdump showing LACPv1 packets arriving on the Linux interface, I'd be looking at the Cisco side. Alas, the Linux kernel appears to be completely ignoring the packets. My next stop is the kernel source code and, worst case, a custom kernel for diagnostics.
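One avenue I have not tried yet is dynamic debug on the module, on the (unverified) assumption that the bonding driver carries useful pr_debug output on this kernel; it needs CONFIG_DYNAMIC_DEBUG and a mounted debugfs:
# mount -t debugfs none /sys/kernel/debug
# echo 'module bonding +p' > /sys/kernel/debug/dynamic_debug/control
# dmesg -w
Hopefully, someone has some insight into the bonding driver and what makes it run correctly.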