Right now I am using these values:
# y = c * p / 100
# y: nagios value
# c: number of cores
# p: wanted load percent
# 4 cores
# time 5 minutes 10 minutes 15 minutes
# warning: 90% 70% 50%
# critical: 100% 80% 60%
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4
But these values were picked almost at random.
Does anyone have some tested values?
Though it's an old post, I'm replying now because I know check_load threshold values are a big headache for newbies. ;)
My suggestion: a warning alert if the CPU is at 70% for 5 minutes, 60% for 10 minutes, 50% for 15 minutes; a critical alert if the CPU is at 90% for 5 minutes, 80% for 10 minutes, 70% for 15 minutes.
All my findings about CPU load:
What's meant by "the load"? Wikipedia says:
All Unix and Unix-like systems generate a metric of three "load average" numbers in the kernel. Users can easily query the current result from a Unix shell by running the uptime command:
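For example (the load-average values here match the numbers discussed below; the rest of the uptime output is illustrative):
$ uptime
 14:34:03 up 10:43,  4 users,  load average: 0.06, 0.11, 0.09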
From the above output, the load average
0.06, 0.11, 0.09
means (on a single-CPU system) that the CPU had spare capacity over each of the last 1, 5 and 15 minutes, since all three values are below 1.
Wikipedia interprets a load average of
1.73 0.50 7.98
on a single-CPU system as: overloaded by 73% on average during the last minute (1.73 runnable processes), idle 50% of the time on average during the last 5 minutes, and overloaded by 698% on average during the last 15 minutes (7.98 runnable processes).
Nagios threshold value calculation:
For Nagios CPU Load setup, which includes warning and critical:
y = c * p / 100
Where:
y = nagios value
c = number of cores
p = wanted load percent
For a 4 core system:
y = 4 * p / 100
For a single core system:
y = p / 100
Where:
y = nagios value
p = wanted load percent
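As a worked example (assuming the 4-core box and plugin path from the question, with the warning/critical percentages suggested above applied to the three load-average values in order):
# warning:  70%, 60%, 50%  ->  4*0.70, 4*0.60, 4*0.50 = 2.8, 2.4, 2.0
# critical: 90%, 80%, 70%  ->  4*0.90, 4*0.80, 4*0.70 = 3.6, 3.2, 2.8
command[check_load]=/usr/local/nagios/libexec/check_load -w 2.8,2.4,2.0 -c 3.6,3.2,2.8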
A great white paper about CPU load analysis by Dr. Gunther: http://www.teamquest.com/pdfs/whitepaper/ldavg1.pdf. In this article Dr. Gunther digs down into the UNIX kernel to find out how load averages (the “LA Triplets”) are calculated and how appropriate they are as capacity planning metrics.
Linux load is actually simple. Each of the load avg numbers is the sum of the avg load of all the cores:
load avg = core 1 load + core 2 load + ... + core n load
where
0 <= core load < infinity
So if the load is 1 on a 4-core server, it either means each core is 25% used or one core is under 100% load. A load of 4 means all 4 cores are under 100% load. A load of > 4 means the server needs more cores.
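A quick way to sanity-check this on a Linux box (a minimal sketch; nproc and /proc/loadavg are standard, and the first field is the 1-minute average):
awk -v cores="$(nproc)" '{printf "per-core load: %.2f\n", $1 / cores}' /proc/loadavg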
check_load now has the -r (--percpu) option, which divides the load averages by the number of cores. When it is used, you can think of your server as having just one core and write the percent fractions directly without thinking about the number of cores. With -r the warning and critical thresholds are expressed on a per-core scale (1.0 meaning all cores fully loaded), i.e. you don't have to modify your warning and critical values from server to server.
The OP has 5, 10, 15 for the intervals. That is wrong. It is 1, 5, 15 (minutes).
Unless the servers in question have an asynchronous workload where queue depth is the important service metric to manage, it's honestly not even worth monitoring load average. It's just a distraction from the metrics that matter, like service time (service time, and service time).
A good complement to Nagios is a tool like Munin or Cacti; they will graph the different kinds of workload your server is experiencing, be it load average, CPU usage, disk I/O or something else.
Using this information it is easier to set good threshold values in Nagios.
Do you know at what load average your system's performance is affected? We had servers at my last job that would consistently sit at 35-40 load average, but were still responsive. It's a measurement you have to do a bit of detective work to get accurate numbers for.
You might instead want to measure some other metric on the system, like the average connect time for SSH or HTTP; this might be a better indicator of how much load your system is under.
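For instance, a rough sketch of sampling HTTP connect time with curl's built-in timing variables (the URL is a placeholder; substitute the service you care about):
curl -o /dev/null -s -w 'connect: %{time_connect}s  total: %{time_total}s\n' http://example.com/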
To extend Invent Sekar's answer: when using check_load with percentages, I believe you will need the "-r" command-line argument along with the others.
For example:
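A sketch of such a command, reusing the plugin path from the question and the per-core fractions corresponding to the percentages suggested above:
command[check_load]=/usr/local/nagios/libexec/check_load -r -w 0.7,0.6,0.5 -c 0.9,0.8,0.7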