One of our servers froze yesterday, apparently refusing to serve any HTTP requests. The tech guy on site could not connect remotely to the machine, so he rebooted the (virtual) machine from the VMware Infrastructure Client, and everything was up and running again.
Now I want to figure out what went wrong. I looked at a couple of log files, and they all just stop logging anything at 5:00am, then start again with a boot sequence. I could not find anything suspicious, other than the fact that a number of cron jobs ran at 5:00am. These were all fairly simple jobs, not interacting with anything critical, and there was at least some activity after they completed.
The freeze lasted a couple of hours. We did not have any other issues on other virtual machines on the same box, which all have a very similar configuration.
Where should I start looking for clues? And if this happens again, what can I tell people to do before they just reset the machine? Magic SysRq, maybe?
I'm guessing you've seen this already, but how-can-i-use-syslog-to-diagnose-mysterious-crashes might help. Is your server under stress, or serving a great many clients?
My first action would be to take the server out of service and run a full Memtest86+ pass on it to check that the memory is not failing. Next, check the SMART data from the hard drives for any issues. After that, follow the instructions at http://www.kernel.org/doc/Documentation/networking/netconsole.txt so that kernel messages are sent over the network and anything like this gets captured in future.
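For the netconsole part, here is a rough sketch of the receiving side, in case it helps. The kernel document suggests netcat or a syslogd for this; the Python below is only my stand-in for those, and the function names, the log file name, and the example addresses are all made up for illustration. The `netconsole=` string follows the format given in netconsole.txt (`[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`), with 6665 and 6666 as the documented default ports.

```python
import socket
import time


def netconsole_param(src_ip, dev, tgt_ip, src_port=6665, tgt_port=6666, tgt_mac=""):
    """Build the value for netconsole= in the format from netconsole.txt.

    An empty tgt_mac leaves the target MAC unset, which the kernel
    document describes as falling back to broadcast.
    """
    return f"{src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def open_listener(host="0.0.0.0", port=6666):
    """Open a UDP socket on the machine that should collect the messages."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def log_messages(sock, logfile, max_messages=None):
    """Append each received datagram to logfile with receive time and sender IP.

    Runs until max_messages datagrams have arrived, or forever if it is None.
    Returns the number of datagrams written.
    """
    count = 0
    with open(logfile, "a", encoding="utf-8") as out:
        while max_messages is None or count < max_messages:
            data, (sender, _port) = sock.recvfrom(65535)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            text = data.decode("utf-8", errors="replace").rstrip("\n")
            out.write(f"{stamp} {sender} {text}\n")
            # Flush every message so nothing is lost if the collector is stopped.
            out.flush()
            count += 1
    return count
```

On the collecting machine you would run something like `log_messages(open_listener(), "netconsole.log")`. On the server that freezes, `netconsole_param("192.0.2.10", "eth0", "192.0.2.20")` gives `6665@192.0.2.10/eth0,6666@192.0.2.20/`, which you can pass as `modprobe netconsole netconsole=...` (substitute your own addresses and interface). The timestamps are taken on the collector, so they still work when the frozen machine's own logs go quiet.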