Problem: I occasionally have network issues with my Ubuntu VPS. I cannot SSH to the box, and I cannot ping it by IP address. I can still access the box via the host's serial console, but from there I can't ping out anywhere (as far as I can tell), even when pinging by IP address. After some amount of time the network comes back, sometimes without my intervention and sometimes while I am fiddling around, so it is hard to tell why. (Edit: it is very consistently out for one hour.)
Questions: How can I proceed in troubleshooting this issue? What can I do to rule out configuration/software problems on my end, so that I can feel more comfortable raising the issue with my VPS host?
Things I have tried:
- Brought eth0 down and up
- Temporarily disabled the firewall
- Checked the VPS host's advisories for network issues; haven't seen any
- Rebooted the server via the web console
- Note: none of these have worked for me (the commands I used are roughly as sketched below)
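For reference, the interface bounce and firewall toggle were done roughly like this (a minimal sketch; eth0 and ufw match my setup described below, the target address is just an example):

# bounce the primary interface
sudo ifdown eth0 && sudo ifup eth0
# temporarily disable the firewall, test connectivity, then re-enable it
sudo ufw disable
ping -c 4 8.8.8.8
sudo ufw enable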
Details:
- Ubuntu 10.04.1 LTS
- Hosted with Xen virtualization
- Have root access (SSH) to perform my own upgrades, installs, etc.
- I have the VPS set up as a VPN server so that I can connect to it "Road Warrior" style and forward all my traffic through the VPS first; that is where the 10.8.X.X addresses come from
- All traffic, including DNS lookups, is forwarded through the VPS
- I use Uncomplicated Firewall (ufw) with some basic rules
- The box also acts as a server for some services, including Mumble and a web server
- I set up a script on the VPS as a cron job to ping some common internet hosts by IP address every 5 minutes; if a ping fails, it logs it to a file (roughly as sketched after this list). Simple enough. Consistently the network outage lasts for an hour. It does not always happen at the same time of day, but on almost all occurrences the network is down for an hour and then "magically" comes back.
- Memory usage on my VPS is typically very high. Usually I am maxed out and using some swap. The memory hog is java, if that detail helps.
- My provider has been very unhelpful. Responses have ranged from "we are sorry, we had an unfortunate issue" to "there is no problem now". This is frustrating because typically I open a ticket when there is a problem, but the problem is gone by the time the ticket is addressed. The most recent suggestion has been to reformat my VPS and start over, which I am not keen on.
- Consistently network outages start on the hour (within 5-10 minutes). That is, network outages do not start around XX:30, XX:45, etc.
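A minimal sketch of that monitoring script (the target IPs, log path, and schedule here are placeholders, not the exact ones I use):

#!/bin/sh
# /usr/local/bin/netcheck.sh - log any failed pings with a timestamp
LOG=/var/log/netcheck.log
for ip in 8.8.8.8 208.67.222.222; do
    if ! ping -c 2 -W 5 "$ip" > /dev/null 2>&1; then
        echo "$(date '+%F %T') ping to $ip failed" >> "$LOG"
    fi
done

# crontab entry: run every 5 minutes
# */5 * * * * /usr/local/bin/netcheck.sh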
netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask          Flags  MSS Window  irtt Iface
10.8.0.2        0.0.0.0         255.255.255.255  UH       0 0          0 tun0
XX.57.166.0     0.0.0.0         255.255.255.128  U        0 0          0 eth0
192.168.50.0    10.8.0.2        255.255.255.0    UG       0 0          0 tun0
10.8.0.0        10.8.0.2        255.255.255.0    UG       0 0          0 tun0
0.0.0.0         XX.57.166.1     0.0.0.0          UG       0 0          0 eth0
ip route list
10.8.0.2 dev tun0  proto kernel  scope link  src 10.8.0.1
XX.57.166.0/25 dev eth0  proto kernel  scope link  src XX.57.166.59
192.168.50.0/24 via 10.8.0.2 dev tun0
10.8.0.0/24 via 10.8.0.2 dev tun0
default via XX.57.166.1 dev eth0  metric 100
cat /etc/network/interfaces
auto eth0
iface eth0 inet static
    address XX.57.166.59
    gateway XX.57.166.1
    netmask 255.255.255.128

auto lo
iface lo inet loopback
Firstly, if you believe this is a vendor issue that they're not addressing, I'd strongly consider migrating away. I gave VPS.net the benefit of the doubt when their SAN kept crashing (taking down all the VPSes in the process), but after a few months of "we've fixed this for good" and it still crashing, I had to vote with my wallet.
It's surprisingly easy to start a VPS company (you really only need a bit of datacenter space and some servers) so they're not all equal in technical ability even before you get to customer service.
But in terms of getting to the bottom of the problem, I'd first look at stopping things from ending up in swap. Leave swap on, but do whatever you have to do so you're not pushing things that far: rein in the Java application or add more RAM, and see what happens. If this is very regular, you shouldn't have to wait long (or pay much) to see a result.
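If the memory hog is a JVM you launch yourself, capping its heap is the simplest way to keep it out of swap. A sketch, assuming the service is started from a script you control (the jar path and heap sizes are placeholders; size them to what the box can actually spare):

# see what is currently in RAM and swap
free -m
# start the Java service with an explicit heap ceiling
java -Xms128m -Xmx256m -jar /opt/myapp/server.jar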
Same with CPU. If you have things running at 100% for extended periods, make sure they're not interfering with other applications. The simplest way to do this is to set the nice value of whatever applications are rampant to something positive; a value of around +10 lets the rest of the system take priority over those applications. Sidebar: nice values basically control how polite a process is about CPU scheduling. Something with a low nice value (e.g. -20) gets prioritised over anything with a higher nice value.
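For example (a sketch; the process and jar names are placeholders):

# start a process at reduced priority
nice -n 10 java -jar /opt/myapp/server.jar
# or lower the priority of something already running
sudo renice 10 -p $(pgrep -f server.jar)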
If you can, expand your testing to other items on the local network. If your provider runs a DNS resolver (as a lot of server companies do), ping that constantly (well, a few times a minute) and log the results. If you can still reach it during the outages, the problem is less likely to be on your end.
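That check could be folded into the same sort of cron script sketched above, e.g. (the resolver address is a placeholder; the gateway XX.57.166.1 comes from your routing table):

# also watch the default gateway and the provider's resolver
for ip in XX.57.166.1 PROVIDER_RESOLVER_IP; do
    ping -c 2 -W 5 "$ip" > /dev/null 2>&1 || \
        echo "$(date '+%F %T') ping to $ip failed" >> /var/log/netcheck.log
done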
And as I say, if this isn't your fault, move. If you spend much more time trying to fix this, you'll outweigh any conceivable benefit of staying with these people. I have personally had a long and very good experience with Linode, but there are lots of good companies out there.