I run a Linux server in a subnet that is not under my control. From time to time, the server cannot be reached from outside (the Internet) for a few seconds. I'm trying to trace why, and started by looking at the output of ip neigh show (written out to a file regularly by a cron job).
The next time it happened, I looked at the file, and it read:
fe80::1 dev eth0 lladdr 00:22:64:b6:10:5c router STALE
192.168.14.1 dev eth0 FAILED
To me, this looks like the gateway (which is 192.168.14.1) does not respond to ARP requests made by the server. Is this correct?
I tried to find more information, particularly in the iproute2 source code, but couldn't find the condition under which it prints FAILED. But maybe that's because I'm not a C developer.
A FAILED entry in the ARP cache indicates that your server was unable to resolve the gateway's MAC address, i.e. the gateway did not answer its ARP requests. You can test this in your LAN: ping any PC in your LAN, check the ARP status, disconnect the PC, keep pinging it, then check the ARP status again. You will notice the state change from REACHABLE to FAILED. Similarly, if you send an ICMP request and the gateway replies, the MAC address of the gateway will be included in the output when you run the ip neigh show or arp -a command. The state will be REACHABLE at first, but if there is a connectivity problem, it might change to FAILED. You might notice the intermediate states DELAY and PROBE as the server tries to reach the gateway before labeling it as failed.

To identify the cause, you need to ping multiple hosts in your subnet and check your ARP cache status while the connectivity problem occurs. If only the gateway is shown as failed while the other hosts are OK, then the problem is between your server and the gateway. If all the hosts are shown as failed, the problem could be the connectivity between your server and the switch, or just a cable issue.

This may be a problem with your system, with the gateway, or with the connection itself. Can you reach other systems in that subnet? If they are reachable while the gateway is not, this is a hint that something gets reloaded on the gateway (due to firewall / tc updates or whatever). Maybe reconfigurations of the switch (a VLAN change, e.g.) can cause that too, but then the connectivity to all systems should be affected.
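
The comparison described above (gateway versus other hosts) could be automated against the snapshots the cron job already writes. Below is a minimal Python sketch, not a definitive tool: parse_neigh and diagnose are hypothetical helper names, and the line layout is assumed from the sample output in the question. It treats only REACHABLE entries as evidence that a host answers, since a STALE entry merely has not been confirmed recently.

```python
# Sample taken from the question; real input would be the cron job's log file.
SAMPLE = """\
fe80::1 dev eth0 lladdr 00:22:64:b6:10:5c router STALE
192.168.14.1 dev eth0 FAILED
"""


def parse_neigh(text):
    """Parse `ip neigh show` output into a list of dicts.

    Assumes the layout seen in the question: address, "dev", interface,
    optional "lladdr <mac>", optional flags, and the state as the last word.
    Lines that do not fit this layout are skipped.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[1] != "dev":
            continue
        lladdr = None
        # A FAILED entry has no lladdr, because no MAC address was learned.
        if "lladdr" in parts[:-2]:
            lladdr = parts[parts.index("lladdr") + 1]
        entries.append(
            {"addr": parts[0], "dev": parts[2], "lladdr": lladdr, "state": parts[-1]}
        )
    return entries


def diagnose(entries, gateway):
    """Apply the gateway-versus-other-hosts reasoning to one snapshot."""
    states = {e["addr"]: e["state"] for e in entries}
    if states.get(gateway) != "FAILED":
        return "gateway is not marked FAILED in this snapshot"
    others = [state for addr, state in states.items() if addr != gateway]
    if "REACHABLE" in others:
        return "suspect the gateway or the path to it: other hosts are REACHABLE"
    if others and all(state == "FAILED" for state in others):
        return "suspect the server's own link (cable, switch port): every neighbour is FAILED"
    return "inconclusive: ping other hosts in the subnet first, then compare again"


print(diagnose(parse_neigh(SAMPLE), "192.168.14.1"))
```

For the snapshot in the question this reports an inconclusive result, because the only other entry is STALE. That is why pinging a few other hosts in the subnet shortly before each snapshot makes the comparison meaningful.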