First instance: I had a CentOS 5.4 (64-bit) server with plenty of resources, installed Hudson (http://wiki.hudson-ci.org/display/HUDSON/Meet+Hudson), and everything was hunky-dory. Several days or weeks later (can't remember which), the entire server started freezing at random, requiring a hard reboot. Nothing was running on it other than what Hudson required.
New gig: a freshly installed CentOS 5.5 (64-bit) server. Within a month or so, the freezing started again, for no apparent reason.
We have identical servers running all over the place, serving everything from Tomcat to JBoss to basic Apache stuff, all without ever freezing or crashing.
It seems Hudson is the problem - we just can't figure out what it does differently from typical configs.
So 2 questions:
- Any Hudson experts out there want to chime in?
- Troubleshooting: What are the right logs to be looking at? Where might we find an entry that says "X caused the system to crash" etc.?
The best way I've found is to keep some kind of live log running over a network or serial connection. Sometimes the kernel can print a critical message to a logged-in shell even though it can't save it to a file, so just having a remote shell open can help. You can also tail -f certain log files or, better yet, cat /proc/kmsg and watch live kernel messages over ssh.
A more reliable option is to set up a physical serial port as the console. All my servers support a serial console, and I can log the whole boot with a serial terminal emulator like HyperTerminal or, better, PuTTY on a serial port. Adding the boot option console=ttyS0 sends all kernel messages to COM1, which needs far less of the system to still be working than a network connection does. Most motherboards still have a header for COM1 on the board even if they don't have the connector.
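For the network approach, a minimal sketch of capturing /proc/kmsg over ssh might look like the following. The function names, the host `root@hudson-box`, and the log path are hypothetical placeholders, not anything from your setup. It assumes root login over ssh is allowed, since reading /proc/kmsg needs root. As far as I know, /proc/kmsg also hands each message to only one reader, so if klogd is running on the box some lines may go there instead of to this capture.

```shell
#!/bin/sh
# Prefix every line read from stdin with the local time, so the last
# message before a freeze can be matched against when the box stopped
# responding.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Stream kernel messages from a remote host into a local file.
#   $1 = ssh target (hypothetical example: root@hudson-box)
#   $2 = local log file to append to
capture_kmsg() {
    ssh "$1" 'cat /proc/kmsg' | stamp_lines >> "$2"
}

# Usage, from a machine that stays up (leave it running until the freeze):
#   capture_kmsg root@hudson-box /var/tmp/hudson-box-kmsg.log
```

The point of writing the log on a different machine is that whatever the kernel managed to send before locking up is already off the box, so a hard reboot doesn't lose it.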