We have a number of Xen-virtualised servers, all running 64-bit Debian 6. We are having an intermittent problem where a server will occasionally stop responding over the network. When this happens we can't ping the server, and our app logs indicate that it is unable to connect to other servers on the network.
This has happened to a few different unrelated servers now, and the only common factors are the VPS host and associated infrastructure, the OS, and our OS settings. I'm following this up with the host, but really need to get to the bottom of it.
I don't really have much to go on at the moment. The only OS log entry I can find that coincides with the event is one line in the syslog:
Nov 21 19:36:10 xxxxxx ntpd[2460]: xxxx:4f8:xxx:xxx:1:2:3:4 interface xxxx:7e00::xxxx:91ff:xxxx:1bd4 -> (null)
However, I think that is a result of the network connection dying, rather than a clue to its cause.
MTR reports from a working server show nothing useful.
So, how should I go about trying to understand what's happening here? Are there any network-specific logs that I don't know about and should be checking?
Thank you!
I presume that you don't have access to your VPS host, and that you can only debug from inside the VM. So this is what I would do.
I would try to find out where the breakage happens: is it between the VM and the host, between the VM and the gateway, or maybe somewhere within your provider's network?
Set up a script that pings your first hop, i.e. your gateway. If you have other VMs within the same broadcast domain, you can ping them instead of the gateway. You could run screen or tmux and leave the ping running inside it.
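Something along these lines could serve as the ping loop. It is only a sketch: the gateway address 192.0.2.1 and the log path /tmp/gw-ping.log are placeholders I made up, and the `-W` timeout flag assumes the Linux iputils version of ping.

```shell
#!/bin/sh
# Placeholder gateway address and log path (assumptions); replace with your own.
GATEWAY="${GATEWAY:-192.0.2.1}"
LOG="${LOG:-/tmp/gw-ping.log}"

# Send one ping with a 2-second timeout and append a timestamped result to the log.
probe() {
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
        status=ok
    else
        status=FAIL
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1 $status" >> "$LOG"
}

# Loop forever only when invoked with "run"; leave it inside screen or tmux.
if [ "${1:-}" = "run" ]; then
    while true; do
        probe "$GATEWAY"
        sleep 5
    done
fi
```

Start it with the `run` argument inside screen or tmux. After an outage, the first FAIL line in the log tells you when the gateway stopped answering, which you can compare against the time your app started failing.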
When the outage happens, if the gateway is still alive and pings go through, the problem is further downstream. In that case, do a traceroute and ping the next 2-3 hops until you figure out where the outage happens. If the gateway becomes unreachable straight away, then maybe set up a cron job that takes a snapshot of the network info when the outage happens.
You can extend the script with additional info such as uptime (to get the current load), lsof or netstat if you feel you need that info too.
Sometimes the guest's dhclient drops the connection or fails to renew its lease, so any info collected at the time of the outage can help.
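A snapshot script for cron might look roughly like this. Again it is a sketch under assumptions: the gateway address and output directory are placeholders, and it assumes the iproute `ip` tool is installed on the guest.

```shell
#!/bin/sh
# Placeholder gateway address and output directory (assumptions); replace with your own.
GATEWAY="${GATEWAY:-192.0.2.1}"
OUTDIR="${OUTDIR:-/var/tmp/net-snapshots}"

# Write address, route, neighbour and kernel log state to a timestamped file
# and print the file name.
snapshot() {
    mkdir -p "$OUTDIR"
    out="$OUTDIR/snapshot-$(date '+%Y%m%d-%H%M%S').txt"
    {
        echo "== date ==";       date
        echo "== addresses ==";  ip addr show
        echo "== routes ==";     ip route show
        echo "== neighbours =="; ip neigh show
        echo "== dmesg tail =="; dmesg | tail -n 50
    } > "$out" 2>&1
    echo "$out"
}

# When invoked with "check", take a snapshot only if the gateway does not answer.
if [ "${1:-}" = "check" ]; then
    ping -c 3 -W 2 "$GATEWAY" >/dev/null 2>&1 || snapshot
fi
```

Run it from cron every minute with the `check` argument, for example `* * * * * /usr/local/bin/net-snapshot.sh check` (the path is hypothetical). The routes and neighbour table at the moment of failure show whether the default route or the gateway's ARP entry disappeared.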