We have configured Nagios with check_load via the NRPE plugin to monitor server load. It reports when the load is high, but it has no option to take a snapshot of the top processes (like the top command) at that time.
Are there any Nagios NRPE plug-ins for that?
You can do it with event handlers.
First, add an event handler for your Load average definition:
The processes_snapshot command is defined in commands.cfg:

And second, write an event handler script (processes_snapshot.sh):

The command processes_snapshot is defined in nrpe.cfg on the xx host as below:

PS: I haven't tested this config.
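As a rough, untested sketch of the event handler idea above: the script below assumes Nagios calls it with $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$ and $HOSTADDRESS$ as arguments, that check_nrpe lives at the Debian-style path shown, and that the monitored host's nrpe.cfg already defines a processes_snapshot command. The function name handle_load_event is made up for illustration.

```shell
#!/bin/sh
# processes_snapshot.sh: hypothetical event handler sketch, not a tested config.
# Assumed invocation from commands.cfg:
#   processes_snapshot.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$

# Assumption: Debian-style plugin path; adjust for your install.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

handle_load_event() {
    state=$1
    host=$4
    case "$state" in
        WARNING|CRITICAL)
            # Ask the NRPE daemon on the monitored host to run its
            # processes_snapshot command (defined in that host's nrpe.cfg).
            "$CHECK_NRPE" -H "$host" -c processes_snapshot
            ;;
        *)
            # OK or UNKNOWN: nothing to capture.
            ;;
    esac
}

# Run only when executed under the script's own name, so the function
# can also be sourced and exercised on its own.
case "${0##*/}" in
    processes_snapshot.sh) handle_load_event "$@"; exit $? ;;
esac
```

The second and third arguments (state type and attempt number) are ignored here; they could be used to act only on HARD states if soft-state snapshots turn out to be too noisy.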
Here's what I did to get a process list snapshot directly in the notification emails, based on the idea by @quanta. It may contain paths specific to the way Nagios is installed on Debian/Ubuntu machines:
Created a wrapper script /usr/local/sbin/check_load that calls the original and appends the process snapshot if the exit code is 1 (WARNING) or 2 (CRITICAL):

This sets COLUMNS to a large number so the process names/command lines won't be truncated to 40 characters, runs top in batch mode for one iteration (-bn 1), asks for full command lines (-c) and cumulative CPU times (-S) to be shown, then makes sure top's output isn't truncated at the first | character by replacing it with <BAR>.

I find top's default sort order to be adequate. Attempting to re-sort by cumulative CPU time, as suggested in @quanta's answer, puts system daemons like init or crond at the top, which doesn't help me figure out which CGI script was responsible for the CPU spike. Also, this way I get to keep top's header.
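For reference, a wrapper along the lines described above could look roughly like this. It is a sketch, not the exact script: the path to the original plugin is an assumption (the usual Debian/Ubuntu location), the COLUMNS value is arbitrary, and the function names are made up.

```shell
#!/bin/sh
# /usr/local/sbin/check_load: sketch of a wrapper around the stock check_load plugin.

# Assumption: Debian/Ubuntu location of the original plugin.
PLUGIN="${PLUGIN:-/usr/lib/nagios/plugins/check_load}"

process_snapshot() {
    # A wide COLUMNS keeps command lines from being cut short; -b is batch mode,
    # -n 1 is one iteration, -c shows full command lines, -S cumulative CPU times.
    # Nagios treats "|" as the start of performance data, so replace it.
    COLUMNS=1000 top -bn 1 -c -S | sed 's/|/<BAR>/g'
}

check_load_with_snapshot() {
    # Run the real plugin with whatever arguments Nagios passed in.
    "$PLUGIN" "$@"
    rc=$?
    case "$rc" in
        1|2) process_snapshot ;;   # WARNING or CRITICAL: append the snapshot
    esac
    # Preserve the plugin's exit code so Nagios sees the right state.
    return "$rc"
}

# Run only when installed under the name check_load, so the functions
# can also be sourced and tried out separately.
case "${0##*/}" in
    check_load) check_load_with_snapshot "$@"; exit $? ;;
esac
```

Everything after the plugin's first line ends up in $LONGSERVICEOUTPUT$, which is why the notification command has to include that macro.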
Don't forget to chmod +x /usr/local/sbin/check_load.

Edit /etc/nagios-plugins/config/load.cfg and replace the check_load entry with one that runs /usr/local/sbin/check_load instead of the original plugin.
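If you'd rather script that edit than do it by hand, something like the following might work. It assumes the stock entry's command_line calls /usr/lib/nagios/plugins/check_load (check your own load.cfg first); use_wrapper is a made-up helper name.

```shell
# Assumption: the stock entry calls /usr/lib/nagios/plugins/check_load.
use_wrapper() {
    # Prints the edited file to stdout so the change can be reviewed
    # before it replaces the real config.
    sed 's#/usr/lib/nagios/plugins/check_load#/usr/local/sbin/check_load#' "$1"
}
```

For example, use_wrapper /etc/nagios-plugins/config/load.cfg > /tmp/load.cfg.new, then compare the two files with diff before copying the new one into place.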
Edit /etc/nagios3/commands.cfg and update the notify-service-by-email entry so it includes $LONGSERVICEOUTPUT$ in the generated emails. It's too long to paste here; basically find the Info:\n\n$SERVICEOUTPUT$\n" | /usr/bin/mail bit and change it to Info:\n\n$SERVICEOUTPUT$\n$LONGSERVICEOUTPUT$\n" | /usr/bin/mail.

Restart nagios: service nagios3 restart.

I haven't tried this with NRPE.
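The commands.cfg change described above can also be done with sed. This is only a sketch: it assumes the notify-service-by-email line contains the exact text $SERVICEOUTPUT$\n" | (with a literal backslash-n, as in the snippet quoted above), and add_long_output is a made-up helper name.

```shell
add_long_output() {
    # Matches the literal text  $SERVICEOUTPUT$\n" |  and inserts
    # $LONGSERVICEOUTPUT$\n before the closing quote.
    # Prints the result to stdout instead of editing the file in place.
    sed 's/\$SERVICEOUTPUT\$\\n" | /$SERVICEOUTPUT$\\n$LONGSERVICEOUTPUT$\\n" | /' "$1"
}
```

Run it as add_long_output /etc/nagios3/commands.cfg > /tmp/commands.cfg.new and diff the result against the original before installing it and restarting Nagios.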
I prefer: