So you have this neatly set up Unix server, it's super fast and works swell, and everything is great for months. Then suddenly all kinds of weird errors start showing up for a variety of different services, and none of them make a lot of sense on their own, much less together.
What are cheap things you should check as soon as you get your ssh session to the machine?
I'm especially interested in trauma stories that highlight non-obvious commands and rare situations, but I guess what's obvious varies from person to person, so we can just list them all freely.
First Order: Is it responsive?
If you can't log in, there are bigger problems afoot. This generally comes in two flavors: hardware failure and software failure. Both are potentially catastrophic. To prevent DFA errors, check the general hardware health first - a simple glance-over will usually suffice.
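If the box has out-of-band management, that glance-over can be done without leaving your chair; assuming IPMI is set up, something along these lines works (both are standard ipmitool subcommands):

ipmitool chassis status
ipmitool sel elist

The system event log will usually flag failed PSUs, fans, overheating or ECC errors long before the OS starts complaining.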
Second Order: Are the system's underlying structures in good health and order?
Check the "Golden Triad" of systems:
In the last few decades, the triad has expanded into a "quad" which includes communications (networking):
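A quick glance at all four on a typical Linux box is cheap (use ifconfig instead of ip addr on older systems):

uptime
free -m
df -h
ip addr

That covers load, memory, disk space and the interfaces in a few seconds; anything odd here tells you which of the categories below to dig into.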
Third Order: What is the severity of the issue?
What programs or services are affected? In decreasing order of severity, is it systemic (system-wide), clustered (a group of programs), or isolated (a specific program)? Clusters of programs are typically tripping up because a specific underlying service has failed or gone unresponsive. Systemic issues are sometimes related to this too (think DNS or IP conflicts), but knowing where to look is usually the key.
Fourth Order: Are diagnostic tools providing useful data relevant to the issue? Now that you have info about the health of the system (second order) and which parts of it are experiencing issues (third order), it should be much easier to narrow down where the problem is.
Error messages or log files should be a common waypoint on this journey.
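Cheap starting points, depending on whether the box runs systemd and on where your distro keeps its syslog (/var/log/syslog on Debian/Ubuntu, /var/log/messages on Red Hat derivatives):

journalctl -p err -b
tail -n 100 /var/log/syslog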
CPU issues:
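The usual suspects, assuming the sysstat package is around for mpstat:

uptime
top
mpstat -P ALL 1 5

Compare the load average against the number of cores; a load of 15 on a 2-core box (as in the example further down) is a problem even when CPU usage looks sane.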
Disk space / I-O issues:
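Running out of space and saturating the disks are different failures that look similar from the application side; both are cheap to rule out (iostat comes with sysstat, and df -i catches inode exhaustion, which behaves exactly like a full disk):

df -h
df -i
iostat -x 1 5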
Memory issues:
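free and vmstat cover most of it; non-zero si/so columns in vmstat mean the box is swapping, and the kernel log will tell you if the OOM killer has been culling processes:

free -m
vmstat 1 5
dmesg | grep -i "out of memory"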
Connectivity issues:
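Work from the interface outwards; ss has replaced netstat on modern boxes, and the resolver is worth testing explicitly because DNS trouble masquerades as almost everything else (example.com is just a placeholder, and dig needs the bind utilities installed):

ip addr
ip route
ss -tulpn
ping -c 3 example.com
dig example.com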
Most common complaint (that I hear):
Email is not delivering fast enough (more than a minute from send to receipt by the recipient), or email is rejecting my attempt to send. This usually comes down to the rate limiter in Postfix kicking in during a spam-storm, which impacts the ability to accept internal delivery.
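First stop when that complaint arrives is the queue itself; these are stock Postfix commands (mailq is just the sendmail-compatible alias for postqueue -p, and the log lives at /var/log/mail.log or /var/log/maillog depending on distro):

mailq
postqueue -p | tail -n 1
tail -f /var/log/mail.log

The last line of postqueue -p summarises how many requests are sitting in the queue, which tells you quickly whether you are looking at a backlog or a rejection problem.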
A real-life example:
However, this is not always the case. One time the issue persisted regardless of service restarts, so after 3 minutes it was time to start looking around. CPU was busy but under 100%, yet the load had soared to 15 on a box with just 2 cores, and was threatening to go higher. The top command revealed that the mail system was in overdrive, along with the mail scanner, but there were no amavis child processes to be seen. That was the clue: the mail queue command (mailq) showed some 150+ undelivered messages from the last 20 minutes, over 80% of which were spam. A quick adjustment to lower the rate limiter (which reduced the intake rate of the spam storm) while increasing the number of child email scanner processes (to help process the backlog), followed by a service restart, resolved the issue, and the system was able to complete deliveries in a short time.
The cause of the problem was that the amavis parent process had keeled over dead, and the child processes had eventually all run their course (they self-terminate after so many scans to prevent memory leaks). So there were SMTP processes in postfix attempting to contact...thin air...to do the spam/virus scanning that was needed. The distro I was using had out-of-date packages that would never be updated; as the installation was due to be replaced in a year or so, I manually "overrode" the install to the latest version, which included several bug fixes. I haven't had the same problem since.
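For the curious, the sort of adjustment described above might look roughly like this; the parameter names are from stock Postfix and amavisd-new, but the values are purely illustrative and the right numbers depend on your hardware and mail volume:

$max_servers = 4;     (in amavisd.conf - more scanner children to chew through the backlog)
postconf -e "smtpd_client_message_rate_limit = 30"     (one of the Postfix knobs for throttling per-client intake)
postfix reload

If you raise $max_servers, the concurrency of the amavis transport in Postfix's master.cf needs to match, otherwise the extra children sit idle.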
usually "who" followed by "last"
A heap of issues on machines I've managed over time have been because of a very loose definition of "untouched" - often someone has done something :)
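In practice that's just:

who -u
last -n 20

who -u shows idle times alongside the sessions, and last reads back through /var/log/wtmp so you can see logins from before the trouble started.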
Well, I'll start.
This one bit me once: I spent hours trying thousands of different things, disabling services here and there, rebooting, etc. What was the problem? Totally out of disk space.
So, here's the first thing I type when debugging a suddenly troubled server:
df -h
I never forget that now. It just saved me lots of wasted effort. Thought I'd share.
top (or htop)
If you can, I would always try shutting down all NICs bar the management one.
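On Linux that might be something like the following, where eth1 is just a stand-in for whichever non-management interface you want to silence (and 'up' brings it back afterwards):

ip link set eth1 down
ip link set eth1 up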
First thing I check is 'top' (are there any strange processes, i.e. ones that hog memory or CPU time?).
If nothing turns up there, I'll check 'who' to see if anyone else is on my machine for some reason.
Maybe a filesystem got unmounted; check with 'cat /etc/mtab', and then '/etc/fstab' to make sure everything will come up right on boot.
Check uptime to make sure the # of users on the box is reasonable (should only be you), and then skim through /var/log/auth.log to see if anything is awry there.
These are catch-alls. Depending on the errors your box is throwing, you may need to examine specific processes that are causing the trouble.
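Strung together, that first pass is just a handful of commands (the auth log path is the Debian/Ubuntu one; Red Hat derivatives keep it in /var/log/secure):

top
who
uptime
cat /etc/mtab
cat /etc/fstab
tail -n 50 /var/log/auth.log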
Running something like (at)sar on the host is almost mandatory. The usefulness of being able to get historical snapshots of CPU, network, memory and disk I/O (amongst others) cannot be overstated.
There have been many times that I have been able to diagnose a fault by examining what the host was doing in the past 24 hours, and seeing when things started going awry.
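Assuming the sysstat flavour of sar with its cron collection enabled, pulling those snapshots back out looks like this (the saNN data files sit under /var/log/sa or /var/log/sysstat depending on distro and are named after the day of the month, so sa12 is only an example):

sar -u
sar -r
sar -n DEV
sar -d
sar -q -s 08:00:00 -e 10:00:00
sar -u -f /var/log/sa/sa12

That's CPU, memory, network, disk, run-queue/load for a chosen time window, and CPU for an earlier day of the month, respectively.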
Checking dmesg for any errors - I usually start with a
dmesg | tail
, because chances are things are still going wrong and the server is still trying to do whatever is causing the error.
top, df -h, and ALWAYS check /var/log to make sure that partition hasn't filled up. That has caused total meltdown on me a few times.
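A quick way to find out what actually filled it (sort -h needs a GNU userland):

df -h /var/log
du -xsh /var/log/* | sort -h | tail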
df -ha
to check if the hard drives are full and the warnings somehow haven't reached anyone
htop or top
to check that memory and CPU usage isn't abnormally high.
Alternatively, if the box isn't responding, I go into the VMware client and check CPU/RAM from there.