I have two Debian Linux 6.0.4 servers that behave strangely: after 5 to 10 days they hang. By this I mean that they stop answering ping and have to be restarted.
I've been struggling with this problem for a couple of months now. Here are some thoughts and what I have tried so far, without solving the problem:
- I changed the RAM on one of the servers. Since two different servers are affected and a third, identical server does not have the problem, I doubt it is hardware-related.
- I logged the server load, and it is fine (quite low) at the time of the crash.
- I cannot find anything in the server logs; they look normal right up until the server freezes.
- Unfortunately, I don't have console access.
While I have years of admin experience, I have never encountered such an issue, and right now I have no idea where else to investigate.
If you have an idea of what I could try in order to fix the problem, please share it with me :-)
Do the servers really hang or are they just unreachable by ping?
Install a monitoring tool such as Munin (or similar), which will show you graphs of not just the CPU load but also memory stats, disk usage, and various other bits and pieces. You can configure it to monitor lots of aspects. Next time the server hangs, check the graphs for any unusual signs. You will learn what a normal graph looks like, so anything out of the ordinary is suspicious (although not necessarily wrong).
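Until a proper monitoring tool is in place, a tiny sampler can cover the gap. The sketch below is only an illustration and has nothing to do with Munin itself: it builds one log line from the load average and free memory and forces it to disk, so the last sample written before a freeze should survive the reboot. The function names, the log path, and the interval are assumptions; /proc/loadavg and /proc/meminfo are the standard Linux files.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def format_sample(timestamp, loadavg_text, meminfo_text):
    """Build one log line from raw /proc/loadavg and /proc/meminfo contents."""
    load1, load5, load15 = loadavg_text.split()[:3]
    mem = parse_meminfo(meminfo_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    return "%s UTC load=%s/%s/%s memfree_kb=%s swapfree_kb=%s" % (
        stamp, load1, load5, load15,
        mem.get("MemFree", "?"), mem.get("SwapFree", "?"))


def append_sample(path, line):
    """Append a line and fsync it so it reaches the disk before a hang."""
    with open(path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def read_file(path):
    """Return the full contents of a small text file."""
    with open(path) as handle:
        return handle.read()


def sample_forever(log_path, interval_seconds):
    """Write one sample per interval until the process (or the box) stops."""
    while True:
        line = format_sample(time.time(),
                             read_file("/proc/loadavg"),
                             read_file("/proc/meminfo"))
        append_sample(log_path, line)
        time.sleep(interval_seconds)
```

To use it, call something like sample_forever("/var/log/heartbeat.log", 60) from a small wrapper script (the path and the 60 seconds are placeholders). After the next hang, the timestamp of the last line tells you roughly when the machine stopped, and the values show whether memory or swap was running out.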
Are you sure you are checking all server logs? For example, do you run web, mail, FTP, DNS, or other services? Check all of their logs! Don't forget to enable debug logging while troubleshooting.
If the server crashes every week or so, it could be caused by something that happens regularly, e.g. a cron job, log rotation, or something similar.
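One rough way to test the "something regular" idea is to tabulate when the hangs happened. The following is a minimal sketch, assuming you have noted the approximate time of each hang (for instance the timestamp of the last log line before each freeze). The function names are hypothetical and the timestamps in it are made up purely for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def summarize_hangs(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Count hangs per weekday and per hour of day."""
    by_weekday = Counter()
    by_hour = Counter()
    for text in timestamps:
        moment = datetime.strptime(text, fmt)
        by_weekday[WEEKDAYS[moment.weekday()]] += 1
        by_hour[moment.hour] += 1
    return by_weekday, by_hour


def format_summary(by_weekday, by_hour):
    """Render both counters as text, most frequent slot first."""
    lines = []
    for day, count in by_weekday.most_common():
        lines.append("weekday %s: %d hang(s)" % (day, count))
    for hour, count in by_hour.most_common():
        lines.append("hour %02d:00: %d hang(s)" % (hour, count))
    return "\n".join(lines)


# Made-up example timestamps; replace them with your own observations.
EXAMPLE_HANGS = ["2012-03-04 06:31", "2012-03-11 06:29", "2012-03-18 06:40"]
print(format_summary(*summarize_hangs(EXAMPLE_HANGS)))
```

If most hangs land on the same weekday or in the same hour, compare that slot with /etc/crontab, /etc/cron.d, the users' crontabs, and the logrotate configuration. If they are spread evenly, a scheduled job is a less likely culprit.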
Make sure you receive all system emails (root alias). You can also install OSSEC, which is a great tool for keeping an eye on the logs and getting emails when things go wrong. Keep in mind that OSSEC is just an automated way of looking at the logs, so nothing magical.
Networking issues? DHCP lease expired?
Please show the relevant content of /var/log/messages and/or /var/log/kern.log. It's possible the kernel logged a crash report or something else that could shed some light. When I experienced such unexplained hangs, it was due to a bad driver, but because the logging wasn't very verbose I wasn't able to find out exactly which driver.
In my case there were soft lockups (kernel: [XXXX] BUG: soft lockup - CPU#X). After some research I found http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556030 and the last comment provided some insight and a way to make logging more verbose. It's an easy kernel modification but if you don't feel comfortable compiling your own kernel it may not be the best thing to do.
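To check quickly whether your logs contain the same kind of message, something like the sketch below can help. It only searches for the "BUG: soft lockup - CPU#" text quoted above; the function names are made up for this example, and other kernel versions may word the message differently.

```python
import re

# Matches the message quoted above, e.g. "BUG: soft lockup - CPU#1".
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+)")


def find_soft_lockups(lines):
    """Return (line_number, cpu, text) for every soft-lockup line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((number, int(match.group(1)), line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_soft_lockups(handle)
```

Running scan_log("/var/log/kern.log") with sufficient privileges returns one tuple per hit. Rotated logs have to be scanned separately, and an empty result does not rule out a lockup, since a hard freeze may leave no time for the message to reach the disk.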
Just updating the kernel or installing a newer version and rebooting may fix the problem.
Quoting:
Apparently the problem was related to some Python scripts that caused the server to hang. I don't understand why they hung the server, but at least they don't hang it any more.