On a new Xeon 55XX server with 4x SSD in RAID 10 running Debian 6, I have experienced two random shutdowns within two weeks of the server being built. The bandwidth logs from before each shutdown do not show anything unusual. The server load is usually very low (about 1), and the machine is colocated far away. There seems to have been no power outage when the server went down.
I know that I should look in /var/log, but I am not sure which logs to investigate or what to look for. I would appreciate your hints.
First, I must ask: "shutdowns"? Do you mean that the machine reboots, or does it actually halt? If it halts, it is either misconfigured (perhaps in the BIOS) or something is actively shutting down the machine (e.g. init 0).
If not, your primary candidates would be /var/log/syslog and /var/log/kern.log, as your problem sounds like a kernel panic or a software-triggered hardware fault. Of course, if the server runs some service (e.g. apache), its logs may give you a clue too.
Often, in situations like this, there are log entries generated, but because the machine is having difficulties, it won't manage to write the entries to disk. If the box is colocated, chances are that it is connected to a serial console by the colo partner. That is where I would look if I did not find anything suspicious in the above logs.
If the machine is not connected to a serial console and there is nothing in the log, you may want to consider sending syslog to a different box via network. Perhaps the network interface survives a bit longer, and the log messages can be read on the syslog server. Have a look at rsyslog or syslog-ng.
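As a rough sketch of the remote-logging idea, the snippet below writes an rsyslog drop-in file that forwards every message to another machine. The host name loghost.example.com, port 514 and the file name remote.conf are placeholders, and it assumes rsyslog includes /etc/rsyslog.d/*.conf, as the stock Debian configuration does.

```shell
#!/bin/sh
# Sketch only: forward all syslog messages to a remote collector.
# loghost.example.com, port 514 and remote.conf are assumptions; adjust them.

# write_forward_rule DIR HOST [PORT]
# Writes DIR/remote.conf containing a single forwarding rule.
# A single "@" sends over UDP; "@@" would send over TCP instead.
write_forward_rule() {
    dir=$1
    host=$2
    port=${3:-514}
    printf '*.* @%s:%s\n' "$host" "$port" > "$dir/remote.conf"
}

# Typical use on the crashing server (as root), then restart rsyslog:
#   write_forward_rule /etc/rsyslog.d loghost.example.com
#   /etc/init.d/rsyslog restart
```

The receiving box has to be set up to accept remote messages as well (in rsyslog, through the imudp or imtcp input module). UDP is used here because it needs no connection, which may help when the sender is already in trouble.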
UPDATE:
I agree with @Johann below. The most likely cause of a halt is the processor temperature watchdog. Try checking or plotting the temperature in the box via lm-sensors or smartctl (usually the easiest). I find that collectd is unparalleled at keeping track of a large number of variables over time. It can read IPMI, lm-sensors and hddtemp. Also, some BIOSes log temperature halt events.
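For a quick manual check before setting up collectd, something like the following could flag hot readings. It is a minimal sketch: it assumes the usual lm-sensors text layout ("label: +45.0°C ..."), which varies between drivers, and the 70-degree limit is only an example value.

```shell
#!/bin/sh
# Sketch only: print the lines of `sensors` output whose current reading
# is at or above a limit. Assumes readings look like "+45.0°C"; the first
# number on each temperature line is taken as the current value.

# hot_readings LIMIT   (reads `sensors`-style text on stdin)
hot_readings() {
    awk -v limit="$1" '
        index($0, "°C") && match($0, /[+-][0-9]+\.[0-9]+/) {
            temp = substr($0, RSTART, RLENGTH) + 0
            if (temp >= limit) print
        }'
}

# Typical use (70 is an example threshold, /dev/sda an example device):
#   sensors | hot_readings 70
#   smartctl -A /dev/sda     # disk temperature, where the drive reports it
```

Run from cron with a timestamp and appended to a file, the same filter gives a crude temperature history leading up to the next halt.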
First, you want to check /var/log/syslog. If you are not sure what to look for, you can start by looking for the words error, panic and warning.
If you have system graphs available (e.g. Munin), check them and look for abnormal patterns. If you do not have Munin installed, it might be an idea to install it (apt-get install munin munin-node).
You should also check root's mail for any interesting messages that might be related to your system crash.
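The keyword search in /var/log/syslog suggested above can be sketched with grep. The function name scan_log is made up for illustration.

```shell
#!/bin/sh
# Sketch only: list lines that mention error, panic or warning
# (case-insensitive), prefixed with their line numbers.

# scan_log FILE
scan_log() {
    grep -inE 'error|panic|warning' "$1"
}

# Typical use:
#   scan_log /var/log/syslog | less
```

The interesting entries are usually the last ones written before the gap in timestamps where the machine was down. If the log has been rotated since the crash, the same search may be needed on the older file (for example /var/log/syslog.1).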
Other log files you should check are application error logs, e.g. /var/log/apache2/error.log or similar. They might contain information leading you to the problem.
In my experience, an "unexpected halt" is almost always caused by overheating. Check your temperatures and fan speeds via lm_sensors and make sure that they are good.
Recently we had the same pattern: a server halted about one hour after support manually started it. After that hour, the CPU temperature hit the threshold configured in the BIOS (iirc 60 or 70°C) and the system halted. All these troubles were caused by a broken CPU fan. After replacing the fan, everything returned to normal.
There are a number of log files in the /var/log directory (and its subdirectories), including
and
Start with the files above.
You can find out whether the system knew that it was going down with the following commands
If there is no info, then it could be a loss of power or something else external.
If there is info, search the logs around the reboot/shutdown time.
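One way to run this check, which may not be the exact commands meant here, is last -x, which also lists shutdown and runlevel-change records from /var/log/wtmp. The sketch below keeps only the shutdown and reboot records; the function name shutdown_records is made up for illustration.

```shell
#!/bin/sh
# Sketch only: filter `last -x` output (on stdin) down to the
# "shutdown" and "reboot" pseudo-user records.

shutdown_records() {
    grep -E '^(shutdown|reboot) '
}

# Typical use:
#   last -x | shutdown_records | head
```

A "reboot" record without a "shutdown" record before it suggests the system went down without knowing about it (power loss, hard reset or a hang). A "shutdown" record means something asked the system to stop, and its timestamp tells you where to look in the logs.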
There are two ways of checking what triggered the shutdown. First, check the out-of-band management console for any hardware issues; I would suggest configuring SNMP to receive emails, or adding the traps to your monitoring software so that you get alerted.
Then, through the operating system, you can check either /var/log/messages (Red Hat-based distros) or /var/log/syslog (Debian-based distros).
The disk subsystem is complicated enough to be affected when a problem occurs, and because of that you will hardly get anything in your log files.
Try logging over the serial console. This needs some cabling and another system to pick up the lines, but you have a better chance of actually catching the problem.
Of course, if your node has a built-in management system similar to Oracle's ALOM/ILOM, you can also check for possible problems and log files there.