Question

alfish

Asked: 2012-05-09 00:57:06 +0800 CST

How to investigate unexpected Linux server shut down?

772

On a new Xeon 55XX server with 4 SSDs in RAID 10 running Debian 6, I have experienced two random shutdowns within two weeks of the server being built. The bandwidth logs from before each shutdown do not show anything unusual. The server load is usually very low (about 1), and the machine is colocated far away. There seems to have been no power outage when the server went down.

I know that I should look in /var/log, but I am not sure which logs to investigate or what to look for. I would appreciate your hints.

7 Answers

Bittrance · Answer 1 · 2012-05-09T01:16:22+08:00

First, I must ask: "shutdowns"? Do you mean that the machine reboots, or does it actually halt? If it halts, it is either misconfigured (perhaps in the BIOS) or something is actively shutting down the machine (e.g. init 0).

If it reboots instead, your primary candidates are /var/log/syslog and /var/log/kern.log, as your problem sounds like a kernel panic or a software-triggered hardware fault. Of course, if the server runs some service (e.g. Apache), that service's logs may give you a clue too.

Often, in situations like this, log entries are generated, but because the machine is in trouble, they never get written to disk. If the box is colocated, chances are that the colo partner has connected it to a serial console. That is where I would look if I did not find anything suspicious in the above logs.

If the machine is not connected to a serial console and there is nothing in the log, you may want to consider sending syslog to a different box via network. Perhaps the network interface survives a bit longer, and the log messages can be read on the syslog server. Have a look at rsyslog or syslog-ng.
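As a rough sketch of the remote-logging idea with rsyslog: the sender forwards everything to a second machine, and the receiver listens for it. The address 192.0.2.10, port 514 and the file name are placeholders chosen for illustration, not values from the question, and the directives use rsyslog's legacy syntax.

```
# Sender: e.g. /etc/rsyslog.d/remote.conf (file name is an assumption)
# "@@" forwards over TCP; a single "@" would use UDP instead.
*.* @@192.0.2.10:514

# Receiver: in /etc/rsyslog.conf, enable the TCP listener
$ModLoad imtcp
$InputTCPServerRun 514
```

rsyslog has to be restarted on both machines for the change to take effect.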

UPDATE:

I agree with @Johann below. The most likely cause of a halt is the processor temperature watchdog. Try checking or plotting the temperature in the box via lm-sensors or smartctl (usually the easiest). I find collectd unparalleled at keeping track of a large number of variables over time. It can read IPMI, lm-sensors and hddtemp. Also, some BIOSes log temperature halt events.
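To get a feel for logging temperatures over time, here is a minimal sketch in shell. It does not use lm-sensors or smartctl; instead it assumes the kernel exposes thermal zones under /sys/class/thermal, which is not the case on every machine. The function name print_temps and its optional directory argument are made up for this example.

```shell
# print_temps [BASE_DIR]
# Print a timestamped reading for every thermal zone found under BASE_DIR
# (default: /sys/class/thermal). The kernel reports each "temp" file in
# millidegrees Celsius, so the value is divided by 1000.
print_temps() {
    base=${1:-/sys/class/thermal}
    for f in "$base"/thermal_zone*/temp; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        milli=$(cat "$f")
        zone=$(basename "$(dirname "$f")")
        echo "$(date '+%Y-%m-%d %H:%M:%S') $zone $((milli / 1000))C"
    done
}
```

Calling `print_temps >> /var/log/temps.log` from cron every minute would leave a trail to inspect after the next halt (the log path is only an example). Where lm-sensors is installed, the `sensors` command reports similar readings, and `smartctl -A /dev/sda` lists drive attributes, which often include a temperature.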

pkhamre · Answer 2 · 2012-05-09T01:09:07+08:00

First, you want to check /var/log/syslog. If you are not sure what to look for, you can start by looking for the words error, panic and warning.

grep -i error /var/log/syslog
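The same search can cover all three keywords and several files at once. This is only a sketch: the function name scan_logs is invented here, and which files exist depends on the system.

```shell
# scan_logs FILE...
# Print every line containing "error", "panic" or "warning" (any case),
# prefixed with the file name. Missing or unreadable files are skipped
# silently because of -s.
scan_logs() {
    grep -isHE 'error|panic|warning' "$@"
}
```

For example, `scan_logs /var/log/syslog /var/log/kern.log | less` pages through the matches from both files.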

If you have system graphs available (e.g. Munin), check them and look for abnormal patterns. If you do not have Munin installed, it might be an idea to install it (apt-get install munin munin-node).

You should also check root's mail for any interesting messages that might be related to your system crash.

Other log files you should check are application error logs, e.g. /var/log/apache2/error.log or similar. They might contain information leading you to the problem.

10

ercpe · Answer 3 · 2012-05-09T01:48:04+08:00

In my experience, an "unexpected halt" is almost always caused by overheating. Check your temperatures and fan speeds via lm_sensors and make sure that they are within normal ranges.

We recently had the same pattern: a server halted about one hour after support manually started it. After that hour, the CPU temperature hit the threshold configured in the BIOS (IIRC 60 or 70°C) and the system halted. All this trouble was caused by a broken CPU fan. After replacing the fan, everything returned to normal.

6

Naveen · Answer 4 · 2016-06-15T22:01:23+08:00

There are a number of log files in the /var/log directory (and its subdirectories), including

/var/log/boot

and

/var/log/boot.log

Start with the files above.

2

Ryabchenko Alexander · Answer 5 · 2019-10-30T02:16:23+08:00

You can find out whether the system knew it was going down with the following commands:

sudo last -1x reboot
sudo last -1x shutdown

If there is no info, it could be a loss of power or some other external cause.

If there is info, search the logs around the reboot/shutdown time.
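One way to narrow the logs to that time window is to filter on the timestamp at the start of each line. The sketch below assumes traditional syslog timestamps such as "May  8 00:57:06"; the function name logs_at and the example prefix are illustrative only.

```shell
# logs_at PREFIX FILE...
# Print the lines of each FILE that begin with the literal string PREFIX,
# e.g. a partial syslog timestamp such as "May  8 00:5". Note that syslog
# pads single-digit days with a space.
logs_at() {
    prefix=$1
    shift
    # index() == 1 means PREFIX occurs at the very start of the line.
    awk -v p="$prefix" 'index($0, p) == 1' "$@"
}
```

With the time reported by `last -x`, something like `logs_at 'May  8 00:5' /var/log/syslog /var/log/kern.log` would show the entries from the ten minutes starting at 00:50.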

2

etcshad0vv · Answer 6 · 2016-06-16T00:33:50+08:00

There are two ways of checking what triggered the shutdown. First, check the out-of-band management console for any hardware issues. I would suggest configuring SNMP to send emails, or adding the traps to monitoring software, so that you are alerted to any problem.

Then, through the operating system, you can check either /var/log/messages (Red Hat based distros) or /var/log/syslog (Debian based distros).

1

asdmin · Answer 7 · 2016-06-16T22:43:03+08:00

The disk subsystem is complicated enough to be affected when a problem occurs, so you will hardly get anything in your log files.

Try logging over the serial console. This needs some cabling and another system to pick up the lines, but you have a better chance of actually catching the problem.

Of course, if your node has a built-in management system similar to Oracle's ALOM/ILOM, you can also check for possible problems and log files there.
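For the serial console suggestion, one common setup is to tell the kernel to send its messages to the first serial port as well as the screen. This is a sketch assuming GRUB 2, the ttyS0 port and a speed of 115200 baud; none of these come from the question, so adjust them to the actual hardware.

```
# /etc/default/grub
# Kernel messages go to both the local screen and the first serial port.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
```

After editing the file, run update-grub and reboot. The machine on the other end of the cable can then capture the output with a terminal program.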

0
