One of our Linux (CentOS) servers was unreachable last night.
The server was not reachable in any way except for the remote console. After logging in with the remote console, it turned out I could not ping any outside hosts either.
A simple service network restart solved the issue, but I am still wondering what could have caused this. My log files seem to indicate no error at all (except for the various daemons that need a network connection and failed after the network failure).
Are there any additional steps I can take to find out the cause of this problem?
EDIT: this just happened again. The server was completely unresponsive until I issued a networking service restart. Any advice is welcome. Could this be caused by a faulty hardware component?
As per Madhatter's request, here are some excerpts from the log at the time (the network went down at 20:13):
/var/log/messages:
Dec 2 20:01:05 graviton kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=<stripped> SRC=<stripped> DST=<stripped> LEN=40 TOS=0x00 PREC=0x00 TTL=101 ID=256 PROTO=TCP SPT=6000 DPT=3306 WINDOW=16384 RES=0x00 SYN URGP=0
Dec 2 20:01:05 graviton kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=<stripped> SRC=<stripped> DST=<stripped> LEN=40 TOS=0x00 PREC=0x00 TTL=100 ID=256 PROTO=TCP SPT=6000 DPT=3306 WINDOW=16384 RES=0x00 SYN URGP=0
Dec 2 20:01:05 graviton kernel: Firewall: *TCP_IN Blocked* IN=eth0 OUT= MAC=<stripped> SRC=<stripped> DST=<stripped> LEN=40 TOS=0x00 PREC=0x00 TTL=101 ID=256 PROTO=TCP SPT=6000 DPT=3306 WINDOW=16384 RES=0x00 SYN URGP=0
Dec 2 20:13:34 graviton junglediskserver: Connection to gateway failed: xGatewayTransport - Connection to gateway failed.
The first three messages are simply the result of iptables rules I have set up through the LFD firewall. The last message indicates that JungleDisk, which I use for backups, can no longer connect to the gateway. Apart from this, there are no interesting messages around this time.
EDIT 4 Dec: as per Mattdm's request, here is the output of ethtool eth0:
(Please note that these are the settings that currently work. If things go wrong again, I will post the output again.)
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Link detected: yes
As per Joris' request, here is also the output of route -n:
aron@graviton [~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
xx.xx.xx.58 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.42 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.43 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.41 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.46 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.47 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.44 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.45 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.50 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.51 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.48 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.49 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.54 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.52 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.53 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
xx.xx.xx.0 0.0.0.0 255.255.255.192 U 0 0 0 eth0
xx.xx.xx.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0
0.0.0.0 xx.xx.xx.62 0.0.0.0 UG 0 0 0 eth0
The bottom xx.62 is my gateway.
EDIT December 28th: the problem occurred again and I got the chance to compare some of the outputs of the above tests. What I found is that arp -an returns an incomplete MAC address for my gateway (which is not under my control; the server is in a shared rack):
During failure:
? (xx.xx.xx.62) at <incomplete> on eth0
After service network restart:
? (xx.xx.xx.62) at 00:00:0C:9F:F0:30 [ether] on eth0
Is this something I can fix or is it time for me to contact the data centre?
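To catch the moment the gateway entry goes stale, the comparison above could be scripted and run from cron. This is only a sketch, assuming the arp -an line format shown above; the function name gateway_arp_state and the address 192.0.2.62 are placeholders made up for illustration, not anything from the actual setup.

```shell
#!/bin/sh
# Classify the ARP entry for one IP address, reading `arp -an` output on stdin.
# Prints one of: incomplete | resolved | missing
# Hypothetical helper; assumes lines like "? (IP) at MAC [ether] on eth0".
gateway_arp_state() {
    ip="$1"
    awk -v ip="($ip)" '
        $2 == ip {
            found = 1
            state = ($4 == "<incomplete>") ? "incomplete" : "resolved"
        }
        END { print (found ? state : "missing") }
    '
}

# Example use (placeholder gateway address):
#   arp -an | gateway_arp_state 192.0.2.62
```

Logging the result with a timestamp every minute would show whether the entry becomes incomplete before or after the server drops off the network.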
Check dmesg | less for anything related to your NIC alias (e.g. eth0), and less /var/log/messages as well.
Whilst rare, it could have been an IP address conflict. If this should occur again, try
arping -U <gateway ip> -I <nic alias>
Check this, however, as it's been a long time since I have used arping and this may be incorrect.
If successful, you should regain connection without reloading the network service.
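Since the failure is intermittent, it may help to capture these checks in one go from the remote console before restarting the network, so they can be compared with a known-good state later. A rough sketch follows; the function name collect_snapshot and the output directory are assumptions for illustration, and the commands are simply the ones mentioned in this thread.

```shell
#!/bin/sh
# Save the output of a few diagnostic commands into a timestamped directory
# and print the directory name. Hypothetical helper, not a standard tool.
# Usage: collect_snapshot /root/netdiag eth0
collect_snapshot() {
    base="$1"
    nic="${2:-eth0}"
    dir="$base/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1

    # Run a command only if it is installed; keep stderr with stdout and
    # never abort the snapshot because one command failed.
    run() {
        name="$1"; shift
        if command -v "$1" >/dev/null 2>&1; then
            "$@" >"$dir/$name.txt" 2>&1 || true
        else
            echo "$1 not installed" >"$dir/$name.txt"
        fi
    }

    run dmesg    dmesg
    run ifconfig ifconfig -a
    run ethtool  ethtool "$nic"
    run route    route -n
    run arp      arp -an

    echo "$dir"
}
```

Running it once while everything works and once during the outage gives two directories that can be compared file by file.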
How are you getting your IP address on this network (DHCP or static)? If it happens again, make sure to run ifconfig to look at the state of the interface while it's in its non-functional state. Does it have an address? Are there errors? If you run ethtool, is there a link? (And is it negotiated to the right speed and duplex?)
Based on the issues encountered, I'd be very suspicious of an IP address conflict. Restarting the networking would send a gratuitous ARP, which would take over that IP again and clear things up.
I'd install arpwatch on another host in the same broadcast domain (same network) and see if any other machines are responding to ARP requests for the IP of your server. If so, find out which machine (possibly using MAC address tables from your switches to find out which port it's attached to) and set it to another static address or DHCP.
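As a quick manual alternative to arpwatch, one could ARP for the server's IP from another host on the same network and count how many different MAC addresses answer. The sketch below assumes the iputils arping reply format ("Unicast reply from IP [MAC] time"); the function name count_reply_macs and the address 192.0.2.10 are placeholders for illustration.

```shell
#!/bin/sh
# Count the distinct MAC addresses found in arping reply lines on stdin.
# Assumes one bracketed MAC per line, as printed by iputils arping.
count_reply_macs() {
    sed -n 's/.*\[\([0-9A-Fa-f:]\{17\}\)\].*/\1/p' \
        | tr 'a-f' 'A-F' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Example use from another host in the same broadcast domain
# (placeholder interface and address):
#   arping -I eth0 -c 3 192.0.2.10 | count_reply_macs
```

A result greater than 1 would suggest that more than one machine is answering for that address.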
Maybe the TCP connection pool gets full? Something may be opening more and more connections. Trying netstat (with different options, for example -i to see interfaces) would give insight into which connections are open.
If the actual connections (and the iptables/routes/whatever-you-are-using configuration) are OK, the problem could be, for example, in the network interface configuration.
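To see whether connections are piling up, the netstat output can be summarised per TCP state. A small sketch, assuming the usual netstat -ant column layout where the state is the sixth field; the function name count_tcp_states is made up for this example.

```shell
#!/bin/sh
# Summarise TCP connections by state, reading `netstat -ant` output on stdin.
# Prints "count STATE" lines, highest count first.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Example use:
#   netstat -ant | count_tcp_states
```

A steadily growing number of connections in a single state between runs would support the connection-exhaustion theory; stable numbers would point elsewhere.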
Is your ifconfig -a output sane? That output would tell you if you have network devices that shouldn't be present, for example virtual devices, that are causing packets to go haywire.
The routing table you have pasted looks really strange. Does it work when it is like that, and does it change after the connection stops working? If yes, something is causing the routing table to change, maybe something iptables related.
Finally, a CentOS-specific thing: do you have NetworkManager in use? It is enabled by default in CentOS for some reason, even in virtual machines that don't have X, making this connection doubling, routing changes and other things possible. I suggest switching it off unless you know you need it (e.g. you have connections that go up and down).
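The question of whether the routing table changes when the connection stops working can be answered by saving route -n output while things work and comparing it with the output during a failure. A minimal sketch, with the function name diff_routes and the file paths chosen purely for illustration:

```shell
#!/bin/sh
# Compare two saved `route -n` outputs, ignoring row order.
# Lines only in the baseline are prefixed "- ", lines only in the
# current file are prefixed "+ ". No output means no difference.
diff_routes() {
    baseline="$1"
    current="$2"
    a=$(mktemp) || return 1
    b=$(mktemp) || { rm -f "$a"; return 1; }
    sort "$baseline" >"$a"
    sort "$current" >"$b"
    comm -23 "$a" "$b" | sed 's/^/- /'
    comm -13 "$a" "$b" | sed 's/^/+ /'
    rm -f "$a" "$b"
}

# Example use (hypothetical paths):
#   route -n > /root/routes.good      # while the network works
#   route -n > /root/routes.bad       # during the outage
#   diff_routes /root/routes.good /root/routes.bad
```

If nothing is printed, the routing table is the same in both states and the cause is more likely further down the stack (ARP, link, or hardware).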
This problem was solved quite a while ago: it was apparently hardware-related.
A new NIC has solved the issue.
From where are you testing? Within the subnet or outside of it? How many routes do you have? Automatic gateway selection may do seemingly unpredictable things.
I don't use RedHat or CentOS, but try looking at whatever script is called when you do a service network restart. Since your network returns to normal when something in that script happens, it may help narrow it down.
Hmm. Maybe an accidental change to iptables? It would explain both why the server wasn't reachable and why there is nothing strange in the logs (you probably don't log iptables, do you?).