I am trying to connect a Linux server with two 1 Gbps NICs to a Netgear ProSafe GSM7248V2 switch using bonding, specifically 802.3ad mode. The results are very confusing, and I would be thankful for any hints on what to try next.
On the server side, this is my /etc/network/interfaces:
auto bond0
iface bond0 inet static
address 192.168.1.15/24
gateway 192.168.1.254
dns-nameservers 8.8.8.8
dns-search my-domain.org
bond-slaves eno1 eno2
bond-mode 4
bond-miimon 100
bond-lacp-rate 1
bond-xmit_hash_policy layer3+4
hwaddress aa:bb:cc:dd:ee:ff
The configuration of the switch is as follows:
(GSM7248V2) #show port-channel 3/2
Local Interface................................ 3/2
Channel Name................................... fubarlg
Link State..................................... Up
Admin Mode..................................... Enabled
Type........................................... Dynamic
Load Balance Option............................ 6
(Src/Dest IP and TCP/UDP Port fields)
Mbr     Device/        Port       Port
Ports   Timeout        Speed      Active
------  -------------  ---------  -------
0/7     actor/long     Auto       True
        partner/long
0/8     actor/long     Auto       True
        partner/long
(GSM7248V2) #show lacp actor 0/7
        Sys       Admin  Port      Admin
Intf    Priority  Key    Priority  State
------  --------  -----  --------  -----------
0/7     1         55     128       ACT|AGG|LTO
(GSM7248V2) #show lacp actor 0/8
        Sys       Admin  Port      Admin
Intf    Priority  Key    Priority  State
------  --------  -----  --------  -----------
0/8     1         55     128       ACT|AGG|LTO
(GSM7248V2) #show lacp partner 0/7
        Sys  System             Admin  Prt  Prt    Admin
Intf    Pri  ID                 Key    Pri  Id     State
------  ---  -----------------  -----  ---  -----  -----------
0/7     0    00:00:00:00:00:00  0      0    0      ACT|AGG|LTO
(GSM7248V2) #show lacp partner 0/8
        Sys  System             Admin  Prt  Prt    Admin
Intf    Pri  ID                 Key    Pri  Id     State
------  ---  -----------------  -----  ---  -----  -----------
0/8     0    00:00:00:00:00:00  0      0    0      ACT|AGG|LTO
I believe that the "layer3+4" transmit hash policy is the closest match to the switch's Load Balance Option 6 (Src/Dest IP and TCP/UDP port fields). The first surprising thing is that the switch does not see the MAC address of the LACP partner.
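In case it is relevant for the diagnosis: LACPDUs are sent to the Slow Protocols EtherType 0x8809, so a capture along these lines (just a sketch, run on one of the slave interfaces) should show whether the frames are actually flowing in both directions, and with which MAC addresses:

# watch LACPDUs on one slave; -e prints the link-level (MAC) headers
tcpdump -i eno1 -e -nn ether proto 0x8809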
On the server side, this is the content of /proc/net/bonding/bond0:
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: ac:1f:6b:dc:2e:88
Active Aggregator Info:
Aggregator ID: 15
Number of ports: 2
Actor Key: 9
Partner Key: 55
Partner Mac Address: a0:21:b7:9d:83:6a
Slave Interface: eno1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: ac:1f:6b:dc:2e:88
Slave queue ID: 0
Aggregator ID: 15
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: ac:1f:6b:dc:2e:88
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 1
system mac address: a0:21:b7:9d:83:6a
oper key: 55
port priority: 128
port number: 8
port state: 61
Slave Interface: eno2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: ac:1f:6b:dc:2e:89
Slave queue ID: 0
Aggregator ID: 15
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: ac:1f:6b:dc:2e:88
port key: 9
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 1
system mac address: a0:21:b7:9d:83:6a
oper key: 55
port priority: 128
port number: 7
port state: 61
If I understand this correctly, the Linux bonding driver has correctly determined all the aggregator details (key, port numbers, system priority, port priority, etc.). Despite that, I see the following in dmesg after restarting the networking service:
[Dec14 20:40] bond0: Releasing backup interface eno1
[ +0.000004] bond0: first active interface up!
[ +0.090621] bond0: Removing an active aggregator
[ +0.000004] bond0: Releasing backup interface eno2
[ +0.118446] bond0: Enslaving eno1 as a backup interface with a down link
[ +0.027888] bond0: Enslaving eno2 as a backup interface with a down link
[ +0.008805] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ +3.546823] igb 0000:04:00.0 eno1: igb: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ +0.160003] igb 0000:05:00.0 eno2: igb: eno2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ +0.035608] bond0: link status definitely up for interface eno1, 1000 Mbps full duplex
[ +0.000004] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ +0.000008] bond0: first active interface up!
[ +0.000166] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ +0.103821] bond0: link status definitely up for interface eno2, 1000 Mbps full duplex
Both interfaces are up and the network connection seems entirely normal; I just get that strange warning that there is no 802.3ad-compatible partner.
In addition, when I simultaneously copy two large binary files (10 GB each) from two different machines connected to the very same switch, each at 1 Gbps, the overall throughput of the bond0 interface on the server is well below 1 Gbps, although I would expect something closer to 2 Gbps (read speed is not the limiting factor here: all SSDs, well cached, etc.). When I copy the same files sequentially, one after another, from the same machines, I easily reach throughput close to 1 Gbps.
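For reference, a simple way to watch the bond0 throughput while the copies run is something like this (assuming the sysstat package is installed):

# one-second samples of per-interface RX/TX throughput, including bond0 and both slaves
sar -n DEV 1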
Do you have any idea what could be wrong here? On the diagnostics side, the confusing warning appears both in dmesg (no 802.3ad-compatible partner) and in the switch's show lacp partner output (no partner MAC, although the regular port record shows the correct MAC address of the connected NIC). On the performance side, I cannot see any aggregation at all when using two different connections. I would be very thankful for any hint.
The switch is configured for the long LACP timeout, i.e. one LACPDU every 30 seconds. The Linux system is configured with bond-lacp-rate 1. I can't find what this actually does in Debian, but if it passes the lacp_rate=1 module option to the bonding driver (reference), then that is the fast timeout: one LACPDU every second. This mismatch between the slow and fast LACP rates is a misconfiguration.
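If you want to confirm what the Debian option actually ended up configuring, the bonding driver exposes the active rate through sysfs (a quick check, assuming the bond is named bond0):

# prints "slow 0" or "fast 1"
cat /sys/class/net/bond0/bonding/lacp_rate

The /proc/net/bonding/bond0 output quoted above already says "LACP rate: fast", which is consistent with bond-lacp-rate 1 being passed through as the fast timeout.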
All the example documentation I can find says that Debian accepts bond-lacp-rate slow, which will hopefully correct it for you. You could probably also remove the bond-lacp-rate line from your config file altogether, as the slow rate is the default, then unload the bonding module or reboot to apply the change.
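As a sketch, the bonding part of the /etc/network/interfaces stanza from your question would then become (only the lacp-rate line changes, everything else stays as it is):

bond-slaves eno1 eno2
bond-mode 4
bond-miimon 100
bond-lacp-rate slow
bond-xmit_hash_policy layer3+4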
Don't test throughput with just two streams. The layer3+4 policy does not guarantee that any two streams will each get a separate NIC, only that, given enough streams, traffic should balance somewhat evenly. Test with, say, 16 or 32 concurrent iperf3 TCP streams; the total throughput of all streams should then be close to 2 Gbps.
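A minimal sketch of such a test, assuming iperf3 is installed on the server (192.168.1.15 from the question) and on the two client machines:

# on the server
iperf3 -s

# on each of the two clients, started at roughly the same time
iperf3 -c 192.168.1.15 -P 16 -t 30

With two clients each connected at 1 Gbps, the combined totals reported by the two runs should approach 2 Gbps if the hash spreads the streams across both slaves.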