I've been trying to troubleshoot this for whole day without any success.
I have two servers, server1 and server2, both running Ubuntu 14.04.5 LTS and connected to a Cisco sg200-08 switch via LAG trunk with LACP. The switch ip is 172.128.1.254/24 and the interfaces on the servers are shown below including the route and arp table for the relevant ip's:
On server1:
root@server1:~# ip addr show bond0
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:11:0a:10:03:29 brd ff:ff:ff:ff:ff:ff
inet 172.128.1.129/24 brd 172.128.1.255 scope global bond0
valid_lft forever preferred_lft forever
root@server1:~# ip addr show bond0.53
13: bond0.53@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:11:0a:10:03:29 brd ff:ff:ff:ff:ff:ff
inet 192.168.53.1/24 brd 192.168.53.255 scope global bond0.53
valid_lft forever preferred_lft forever
root@server1:~# ip route get 192.168.53.2
192.168.53.2 dev bond0.53 src 192.168.53.1
cache
root@server1:~# arp -n | grep '192.168.53.2'
192.168.53.2 (incomplete) bond0.53
On server2:
root@server2:~# ip addr show bond0
5: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:15:17:2e:ab:b4 brd ff:ff:ff:ff:ff:ff
inet 172.128.1.130/24 brd 172.128.1.255 scope global bond0
valid_lft forever preferred_lft foreve
root@server2:~# ip addr show bond0.53
22: bond0.53@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 00:15:17:2e:ab:b4 brd ff:ff:ff:ff:ff:ff
inet 192.168.53.2/24 brd 192.168.53.255 scope global bond0.53
valid_lft forever preferred_lft forever
root@server2:~# ip route get 192.168.53.1
192.168.53.1 dev bond0.53 src 192.168.53.2
cache
root@server2:~# arp -n | grep '192.168.53.1'
192.168.53.1 ether 00:11:0a:10:03:29 C bond0.53
When I ping server2 from server1, I can see no arp replies coming back to server1:
root@server1:~# tcpdump -ennqt -i bond0 \( arp or icmp \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 28
but on server2 side I can see the arp request from server1 AND replies being sent back over VLAN53:
root@server2:~# tcpdump -ennqt -i bond0 \( arp or icmp \)
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 64: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 46
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Reply 192.168.53.2 is-at 00:15:17:2e:ab:b4, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 64: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 46
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Reply 192.168.53.2 is-at 00:15:17:2e:ab:b4, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 64: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 46
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Reply 192.168.53.2 is-at 00:15:17:2e:ab:b4, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 64: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 46
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Reply 192.168.53.2 is-at 00:15:17:2e:ab:b4, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 64: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 46
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Reply 192.168.53.2 is-at 00:15:17:2e:ab:b4, length 28
00:11:0a:10:03:29 > ff:ff:ff:ff:ff:ff, 802.1Q, length 64: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.2 tell 192.168.53.1, length 46
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Reply 192.168.53.2 is-at 00:15:17:2e:ab:b4, length 28
For the ping in opposite direction I can only see this on server2:
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 102: vlan 53, p 0, ethertype IPv4, 192.168.53.2 > 192.168.53.1: ICMP echo request, id 6506, seq 1, length 64
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 102: vlan 53, p 0, ethertype IPv4, 192.168.53.2 > 192.168.53.1: ICMP echo request, id 6506, seq 2, length 64
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 102: vlan 53, p 0, ethertype IPv4, 192.168.53.2 > 192.168.53.1: ICMP echo request, id 6506, seq 3, length 64
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 102: vlan 53, p 0, ethertype IPv4, 192.168.53.2 > 192.168.53.1: ICMP echo request, id 6506, seq 4, length 64
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 102: vlan 53, p 0, ethertype IPv4, 192.168.53.2 > 192.168.53.1: ICMP echo request, id 6506, seq 5, length 64
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.1 tell 192.168.53.2, length 28
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.1 tell 192.168.53.2, length 28
00:15:17:2e:ab:b4 > 00:11:0a:10:03:29, 802.1Q, length 46: vlan 53, p 0, ethertype ARP, Request who-has 192.168.53.1 tell 192.168.53.2, length 28
No firewall, arptables or ebtables setup on both sides. Kernel sysctl is not blocking ICMP traffic. The bonds are up and healthy. The switch has 2 ports in each LAG configured as trunk towards each server and carrys vlan's 1 (native/default untagged) and 51,52,53,54 tagged. I can ping both bond0 ip's 172.128.1.129 and 172.128.1.130 from the switch. I can ping 172.128.1.129 (server1) from another linux pc connected to the switch (ip of 172.128.1.5) but not 172.128.1.130 (server2).
Thanks in advance for any pointers, ideas, suggestions.
CORRECTION: I can ping BOTH servers from third host on the network
igorc@client:~$ ip -f inet addr show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
inet 172.128.1.5/24 brd 172.128.1.255 scope global dynamic eth1
valid_lft 22497sec preferred_lft 22497sec
igorc@client:~$ ping -c 2 172.128.1.129
PING 172.128.1.129 (172.128.1.129) 56(84) bytes of data.
64 bytes from 172.128.1.129: icmp_seq=1 ttl=64 time=0.618 ms
64 bytes from 172.128.1.129: icmp_seq=2 ttl=64 time=0.541 ms
--- 172.128.1.129 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.541/0.579/0.618/0.045 ms
igorc@client:~$ ping -c 2 172.128.1.130
PING 172.128.1.130 (172.128.1.130) 56(84) bytes of data.
64 bytes from 172.128.1.130: icmp_seq=1 ttl=64 time=0.645 ms
64 bytes from 172.128.1.130: icmp_seq=2 ttl=64 time=0.693 ms
--- 172.128.1.130 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.645/0.669/0.693/0.024 ms
UPDATE: The bond on both servers
root@server1:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1
Actor Key: 17
Partner Key: 1
Partner Mac Address: 00:00:00:00:00:00
Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:11:0a:10:03:29
Aggregator ID: 1
Slave queue ID: 0
Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 00:11:0a:10:03:28
Aggregator ID: 2
Slave queue ID: 0
root@server2:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 100
Down Delay (ms): 100
802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 1
Actor Key: 17
Partner Key: 1
Partner Mac Address: 00:00:00:00:00:00
Slave Interface: p1p1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:15:17:2e:ab:b4
Aggregator ID: 1
Slave queue ID: 0
Slave Interface: p1p2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:15:17:2e:ab:b5
Aggregator ID: 2
Slave queue ID: 0
Solved. I have mistakinly set the LAG in the Cisco switch to static instead dynamic which prevents LACP from being used. The embeded image will not show up probably due to lack of points in my account but attaching it in any case.
Cisco sg200-08 LAG Management
Now all looks much better:
Changes highlighted in bold (if visible in the code widget), first Number of ports is correctly set to 2 instead of previously 1, then the Aggregator ID now correctly has same value for both slaves, and lastly the Partner Mac Address now has a value (compared to 00:00:00:00:00:00 previously) indicating exchange of LACP UDP messages between the peers.