I've been looking for some hours now for a plugin that will notify me if the CPU load on one of my servers has been over 90% for the past 5 hours. No luck looking around the Nagios Exchange.
Can anyone help out?
Thanks!
CPU load under UNIX is typically defined as the number of processes in a runnable state. We measure this in 1, 5, and 15 minute intervals. The command
uptime
is a common way to output the load average values. check_load takes a tuple of three elements, matching the 1, 5, and 15 minute averages, and accepts both a warning and a critical threshold.
As a rough idea, try
check_load -c 0.9,0.9,0.9
with a check_interval of 1 hour and a max_check_attempts of 5. Also note the -r argument, which divides the load averages by the number of CPUs. This addresses the fact that most CPUs are multi-core and can therefore be fully utilized individually while still having excess capacity in the aggregate.
The basic check_load Nagios check only evaluates /proc/loadavg, which just has the 1, 5, and 15 minute averages. If you need more, you need a backlog reaching that far back. Incidentally, the sysstat package does just that: it evaluates and records performance values at given intervals and makes them available via the sar command line utility. The check_sa Nagios plugin is capable of evaluating that output and averaging the values to match your needs.
I should add that Nagios is a rather poor choice when it comes to defining alarm thresholds based on performance values averaged over a certain period of time, as this needs extensive state-keeping, which Nagios does not support. Other monitoring systems that collect performance data do a better job here. I would suggest looking at OpenNMS, or at least something like Munin if you can't manage the complexity and handle the technical requirements (SNMP) of the former. Both have the advantage of being able to draw fancy RRD graphs, which help you detect trends before you formalize them in evaluation rules.
Astonishing, isn't it?
We had to write a monitor ourselves for this, too.
The standard check_load is pretty meaningless on its own, since the load has to be put in relation to the number of (logical) processors in the system.
So roughly what we do:
- look up how many processors are reported in the system
- divide the current load by that number
That gives you the 90% mark you are after.
We use 100% for warning and 150% for critical.
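The two steps above can be sketched in Python roughly as follows. This is not the poster's actual monitor: the function names, the output format, and the choice of the 5-minute average are assumptions. The exit codes 0, 1, and 2 are the usual Nagios plugin codes for OK, WARNING, and CRITICAL, and the default thresholds are the 100% and 150% mentioned above.

```python
import os

OK, WARNING, CRITICAL = 0, 1, 2  # Nagios plugin exit codes

def classify(load, cpus, warn=1.0, crit=1.5):
    """Divide the load by the CPU count and map it to a Nagios status."""
    per_cpu = load / cpus
    if per_cpu >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif per_cpu >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    message = (f"LOAD {label} - {per_cpu:.0%} per CPU "
               f"(load {load:.2f}, {cpus} CPUs)")
    return status, message

def main():
    # os.getloadavg() is Unix-only; using the 5-minute value is an assumption.
    _, load5, _ = os.getloadavg()
    cpus = os.cpu_count() or 1  # cpu_count() may return None
    status, message = classify(load5, cpus)
    print(message)
    return status
```

A real plugin script would finish with sys.exit(main()) so that Nagios sees the status code. For example, classify(3.6, 4) gives 90% per CPU and status OK, while classify(6.0, 4) gives 150% and CRITICAL.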
Basically, sar gives you status details at 10-minute intervals by default.
So for load avg...
It can report on a number of things, although email server reporting is giving way to the likes of AppDynamics and New Relic, which dig much deeper (but cost money).
IMHO, Nagios is still the best for the money... and hell, you can even integrate it with ircd.
Nagios is definitely the way I would go. It is easy to use the prebuilt plugins or write your own NRPE plugins, and it works well with HipChat, IRC, PagerDuty, or custom alerting systems.