We have a local network connected to the internet through a Linux gateway, with about 50 workstations on the LAN. We recently started seeing the gateway sporadically stop responding for several seconds. After investigating, we noticed that sometimes, when the gateway cannot ping a workstation, it does not even try to send ARP requests to it.
As an example, we ping 192.168.5.37 from the gateway:
PING 192.168.5.37 (192.168.5.37) 56(84) bytes of data.
From 192.168.5.1 icmp_seq=1 Destination Host Unreachable
From 192.168.5.1 icmp_seq=2 Destination Host Unreachable
From 192.168.5.1 icmp_seq=3 Destination Host Unreachable
From 192.168.5.1 icmp_seq=5 Destination Host Unreachable
From 192.168.5.1 icmp_seq=6 Destination Host Unreachable
From 192.168.5.1 icmp_seq=7 Destination Host Unreachable
64 bytes from 192.168.5.37: icmp_seq=8 ttl=128 time=438 ms
64 bytes from 192.168.5.37: icmp_seq=9 ttl=128 time=0.240 ms
64 bytes from 192.168.5.37: icmp_seq=10 ttl=128 time=0.238 ms
At the same time, tcpdump is running on another console:
sudo tcpdump -nli eth0 host 192.168.5.37
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
14:12:26.944842 IP 192.168.5.37.6112 > 255.255.255.255.6112: UDP, length 16
14:12:31.951145 IP 192.168.5.37.6112 > 255.255.255.255.6112: UDP, length 16
14:12:36.958632 IP 192.168.5.37.6112 > 255.255.255.255.6112: UDP, length 16
14:12:39.914620 arp who-has 192.168.5.37 tell 192.168.5.1
14:12:39.914775 arp reply 192.168.5.37 is-at 00:0b:6a:86:53:14
14:12:39.914781 IP 192.168.5.1 > 192.168.5.37: ICMP echo request, id 50734, seq 8, length 64
14:12:39.914955 IP 192.168.5.37 > 192.168.5.1: ICMP echo reply, id 50734, seq 8, length 64
14:12:40.480035 IP 192.168.5.1 > 192.168.5.37: ICMP echo request, id 50734, seq 9, length 64
14:12:40.480264 IP 192.168.5.37 > 192.168.5.1: ICMP echo reply, id 50734, seq 9, length 64
14:12:41.480037 IP 192.168.5.1 > 192.168.5.37: ICMP echo request, id 50734, seq 10, length 64
14:12:41.480265 IP 192.168.5.37 > 192.168.5.1: ICMP echo reply, id 50734, seq 10, length 64
I suspect that something is wrong with the kernel ARP cache. By default, gc_thresh1 is at 512, and we have about ten times fewer hosts on the LAN (ip nei | wc -l reports about 50).
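To look at the neighbour table in more detail than a plain line count, something like the following could be used. This is only a sketch: the function name neigh_states is made up for illustration, and it assumes the usual iproute2 output where the entry state (REACHABLE, STALE, FAILED, ...) is the last field of each line.

```shell
# neigh_states: hypothetical helper that reads `ip neigh` output on stdin
# and prints how many entries are in each state (the state is the last field).
neigh_states() {
    awk 'NF { count[$NF]++ } END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the gateway (not run here):
#   ip -4 neigh show dev eth0 | neigh_states
#   sysctl net.ipv4.neigh.default.gc_thresh1 \
#          net.ipv4.neigh.default.gc_thresh2 \
#          net.ipv4.neigh.default.gc_thresh3
```

Many FAILED or INCOMPLETE entries while the hosts are known to be up would suggest that ARP resolution itself is failing, rather than the table overflowing.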
What is the problem and how can we fix it?
Check whether another host on your network is using the same IP address as your gateway; a duplicate IP is the most common cause of this kind of problem.
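One way to check for a duplicate is arping in duplicate address detection mode. The sketch below assumes the iputils version of arping (other arping implementations use different options); the function name check_dup_ip is made up for illustration, and the interface and address in the usage line are taken from the question.

```shell
# check_dup_ip IFACE ADDR: hypothetical wrapper around iputils arping.
check_dup_ip() {
    iface=$1
    addr=$2
    # -D: duplicate address detection mode; exit status 0 means that
    #     no host replied to the probes for ADDR.
    # -q: quiet, -c 3: send three probes, -I: interface to probe on.
    if arping -D -q -c 3 -I "$iface" "$addr"; then
        echo "no other host answered for $addr"
    else
        # A non-zero status can also mean arping itself failed
        # (for example, missing privileges), so verify by hand.
        echo "another host answered for $addr (possible duplicate)"
    fi
}

# Typical use on the gateway, as root (not run here):
#   check_dup_ip eth0 192.168.5.1
```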
It seems I have found the root of my problem. The router is connected to the LAN through eth0. On the same interface it also has several VLAN sub-interfaces (one of them is used to access the internet). A shaper is configured on eth0 to limit traffic rates for LAN users. It seems that the Linux traffic shaping code sees all traffic on eth0 (including the sub-interfaces), and that this somehow interfered with ARP requests on eth0.
After moving the LAN connection to a sub-interface and configuring shaping there so that ARP packets are not shaped, the problem disappeared.
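For anyone wanting to do something similar, here is a rough sketch of exempting ARP from shaping. It is not my exact configuration: it assumes an HTB qdisc, and the function name setup_shaper, the class IDs, the rates and the sub-interface name in the usage line are all made up for illustration.

```shell
# setup_shaper DEV RATE: hypothetical helper; class IDs and rates are illustrative.
setup_shaper() {
    dev=$1
    rate=$2
    # Root HTB qdisc; traffic not matched by any filter goes to class 1:20.
    tc qdisc add dev "$dev" root handle 1: htb default 20 || return 1
    # Class 1:10: high rate, effectively unshaped, reserved for ARP.
    tc class add dev "$dev" parent 1: classid 1:10 htb rate 1gbit || return 1
    # Class 1:20: the rate-limited class for ordinary LAN traffic.
    tc class add dev "$dev" parent 1: classid 1:20 htb rate "$rate" || return 1
    # Classify every ARP frame into the unshaped class 1:10.
    tc filter add dev "$dev" parent 1: protocol arp prio 1 \
        u32 match u32 0 0 flowid 1:10
}

# Typical use, as root (not run here; eth0.5 is a made-up sub-interface name):
#   setup_shaper eth0.5 10mbit
```

The important part is the last command: the filter matches on the ARP protocol, so ARP requests and replies never wait behind queued IP traffic in the limited class.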