Right now I am using these values:
# y = c * p / 100
# y: nagios value
# c: number of cores
# p: wanted load percent
# 4 cores
# time 5 minutes 10 minutes 15 minutes
# warning: 90% 70% 50%
# critical: 100% 80% 60%
command[check_load]=/usr/local/nagios/libexec/check_load -w 3.6,2.8,2.0 -c 4.0,3.2,2.4
But these values were picked almost at random.
Does anyone have some tested values?
Though it's an old post, I'm replying now because I know check_load threshold values are a big headache for newbies. ;)
My suggestion: a warning alert if the CPU is at 70% for 5 minutes, 60% for 10 minutes, 50% for 15 minutes; a critical alert if the CPU is at 90% for 5 minutes, 80% for 10 minutes, 70% for 15 minutes.
All my findings about CPU load:
What's meant by "the load"? Wikipedia says:
All Unix and Unix-like systems generate a metric of three "load average" numbers in the kernel. Users can easily query the current result from a Unix shell by running the uptime command:
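For example (the load-average values here match the numbers discussed below; the rest of the uptime output is illustrative):
$ uptime
 14:34:03 up 10:43,  4 users,  load average: 0.06, 0.11, 0.09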
From the above output, the load average
0.06, 0.11, 0.09
means (on a single-CPU system) that the CPU had spare capacity over each of the last 1, 5 and 15 minutes, since all three values are below 1.
Wikipedia interprets a load average of
1.73 0.50 7.98
on a single-CPU system as: overloaded by 73% on average during the last minute (1.73 runnable processes), idle 50% of the time on average during the last 5 minutes, and overloaded by 698% on average during the last 15 minutes (7.98 runnable processes).
Nagios threshold value calculation:
For Nagios CPU Load setup, which includes warning and critical:
y = c * p / 100
Where:
y = nagios value
c = number of cores
p = wanted load percent
For a 4 core system:
y = 4 * p / 100
For a single core system:
y = p / 100
Where:
y = nagios value
p = wanted load percent
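As a worked example (assuming the 4-core box and plugin path from the question, with the warning/critical percentages suggested above applied to the three load-average values in order):
# warning:  70%, 60%, 50%  ->  4*0.70, 4*0.60, 4*0.50 = 2.8, 2.4, 2.0
# critical: 90%, 80%, 70%  ->  4*0.90, 4*0.80, 4*0.70 = 3.6, 3.2, 2.8
command[check_load]=/usr/local/nagios/libexec/check_load -w 2.8,2.4,2.0 -c 3.6,3.2,2.8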
A great white paper about CPU load analysis by Dr. Gunther: http://www.teamquest.com/pdfs/whitepaper/ldavg1.pdf. In this article Dr. Gunther digs down into the UNIX kernel to find out how load averages (the “LA Triplets”) are calculated and how appropriate they are as capacity planning metrics.
Linux load is actually simple. Each of the load avg numbers is the sum of the avg load of all the cores:
load avg = core 1 load + core 2 load + ... + core n load
where
0 <= core load < infinity
So if the load is 1 on a 4-core server, it either means each core is 25% used or one core is under 100% load. A load of 4 means all 4 cores are under 100% load. A load of > 4 means the server needs more cores.
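A quick way to sanity-check this on a Linux box (a minimal sketch; nproc and /proc/loadavg are standard, and the first field is the 1-minute average):
awk -v cores="$(nproc)" '{printf "per-core load: %.2f\n", $1 / cores}' /proc/loadavg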
check_load now has the -r (--percpu) option, which divides the load averages by the number of cores. When it is used, you can think of your server as having just one core and write the percent fractions directly without thinking about the number of cores. With -r the warning and critical thresholds are expressed on a per-core scale (1.0 meaning all cores fully loaded), i.e. you don't have to modify your warning and critical values from server to server.
The OP has 5, 10, 15 for the intervals. That is wrong. It is 1, 5, 15 (minutes).
Unless the servers in question have an asynchronous workload where queue depth is the important service metric to manage, it's honestly not even worth monitoring load average. It's just a distraction from the metrics that matter, like service time (service time, and service time).
A good complement to Nagios is a tool like Munin or Cacti; they will graph the different kinds of workload your server is experiencing, be it load average, CPU usage, disk I/O or something else.
Using this information it is easier to set good threshold values in Nagios.
Do you know at what load average your system's performance is affected? We had servers at my last job that would consistently sit at 35-40 load average, but were still responsive. It's a measurement you have to do a bit of detective work to get accurate numbers for.
You might instead want to measure some other metric on the system, like the average connect time for SSH or HTTP; this might be a better indicator of how much load your system is under.
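For instance, a rough sketch of sampling HTTP connect time with curl's built-in timing variables (the URL is a placeholder; substitute the service you care about):
curl -o /dev/null -s -w 'connect: %{time_connect}s  total: %{time_total}s\n' http://example.com/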
To extend Invent Sekar's answer: when using check_load with percentages, I believe you will need the "-r" command-line argument along with the others.
For example:
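A sketch of such a command, reusing the plugin path from the question and the per-core fractions corresponding to the percentages suggested above:
command[check_load]=/usr/local/nagios/libexec/check_load -r -w 0.7,0.6,0.5 -c 0.9,0.8,0.7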