I have several Ubuntu Server 8.04 machines at a remote location. Every couple of months or so, one of them stops responding and needs to be power cycled. Judging by the log files, all my processes run fine until, at some point, everything just stops.
I suspect it's a hardware problem, but I don't even know how to begin pinpointing the issue. Are there any diagnostic tools or techniques designed to track down this sort of problem?
I know this is a fairly general question, but I'm hoping for a general answer.
Memtest would be the first port of call, although if you can, ask the center to plug in a console the next time it crashes. If the kernel is going down, it should print something to the screen.
I had a similar problem in the past, and it turned out to be heat related. Improving the circulation and adding a fan or two helped big-time.
Also, make sure you've got SMART enabled on your disks and have a look to see if maybe one of them is on its last legs.
You might want to install munin to monitor them all and see what's going on.
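If you want something scriptable for the SMART check, here is a minimal sketch, assuming smartmontools and a Python 3 interpreter are installed on the box. It runs `smartctl -H` and decodes the exit status, which the smartctl man page documents as a bitmask. The function names are my own, and which bits can actually be set depends on the options you pass.

```python
import subprocess

# Meaning of each bit in smartctl's exit status, as listed in the smartctl man page.
SMARTCTL_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed",
    2: "a SMART command to the disk failed",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold in the past",
    6: "device error log contains errors",
    7: "self-test log contains errors",
}


def decode_smartctl_status(code):
    """Return the list of problems encoded in a smartctl exit status."""
    return [
        message
        for bit, message in sorted(SMARTCTL_STATUS_BITS.items())
        if code & (1 << bit)
    ]


def check_disk(device):
    """Run `smartctl -H` on one device (hypothetical helper; needs root)."""
    try:
        result = subprocess.run(
            ["smartctl", "-H", device],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
    except OSError as exc:
        # smartctl is missing or could not be executed at all.
        return ["could not run smartctl: %s" % exc]
    return decode_smartctl_status(result.returncode)
```

Calling `check_disk("/dev/sda")` from a cron job and mailing yourself any non-empty result would be one way to use it; an empty list means smartctl reported nothing wrong for the checks it ran.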
Hook up another machine and configure a serial console to get all of the kernel messages and such that come up. If it's a kernel panic or some other catastrophic problem, you'll see it there. Monitoring temperature and running a memtest are also recommended, especially if the console shows nothing abnormal before the wheels fall off.
Put in a comprehensive remote monitoring solution with something like Zabbix. Monitor system resource usage, as well as any hardware statistics that are available to the operating system (e.g. fan speeds, temperatures and the like). That way, when your system next falls over, you'll have a number of data points to look at to see what the problem is.
With this approach you may find, for example, that you have a process that goes out of control with RAM allocation, pushes the system into swap, and causes the out-of-memory killer to start carving its way through your running processes, leaving the machine unresponsive. Without monitoring, you would have no way of knowing that.
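If a full monitoring system is more than you want to set up right away, a rough stand-in is a small logger that records memory, swap and load to local disk. This is only a sketch, assuming Python 3 is available; the log path and function names are made up for illustration. It reads `/proc/meminfo` and forces every line to disk so the last entries survive a hard hang.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def snapshot_line(meminfo, load1, now):
    """Format one log line: timestamp, 1-minute load, free RAM, used swap."""
    swap_used = meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
    return "%s load1=%.2f memfree_kb=%d swapused_kb=%d" % (
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now)),
        load1,
        meminfo.get("MemFree", 0),
        swap_used,
    )


def log_forever(path="/var/log/resource-snapshots.log", interval=60):
    """Append a snapshot every `interval` seconds (path is hypothetical)."""
    with open(path, "a") as log:
        while True:
            with open("/proc/meminfo") as f:
                meminfo = parse_meminfo(f.read())
            line = snapshot_line(meminfo, os.getloadavg()[0], time.time())
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible hang
            time.sleep(interval)
```

Start `log_forever()` in the background and, after the next lockup, look at the last few lines: steadily growing `swapused_kb` together with a climbing load would point at the runaway-memory scenario described above, while perfectly normal numbers right up to the end make a hardware fault more likely.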
There is too little information here to suggest anything that would definitely work.
It would be good to know how you define "stops responding". Is it just SSH that stops responding, or some other service as well? Any idea whether the console is still responding?
Are there any traces in the log files once the machine is back online after the reboot?
Anyway, here are a few options to help you gather more information: