Without any changes to the nagios3 config or the OS (Debian) filesystem, when I add some extra devices (to the 12000+ already on it) I suddenly get:
[1508925621] Warning: Return code of 127 for check of service 'PING' on host 'SOME-HOST.CISCO' was out of bounds. Make sure the plugin you're trying to run actually exists.
[1508925621] SERVICE ALERT: SOME-HOST.CISCO;PING;CRITICAL;HARD;3;(Return code of 127 is out of bounds - plugin may be missing)
All the binaries are readable and executable; none of that has changed since setup.
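For context on the message above: Nagios plugins are expected to exit with 0, 1, 2 or 3 (OK, WARNING, CRITICAL, UNKNOWN), and 127 is the status a POSIX shell conventionally reports when it cannot find the command it was asked to run. A minimal sketch of that convention follows, assuming the check is launched through /bin/sh; the plugin path and the helper name `shell_exit_code` are made up for illustration.

```python
import subprocess


def shell_exit_code(command: str) -> int:
    """Run `command` through /bin/sh -c and return the shell's exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    # Hypothetical path that does not exist: the shell cannot run it.
    print(shell_exit_code("/nonexistent/plugins/check_ping -H 127.0.0.1"))
    # A command that runs and exits with a normal plugin status (CRITICAL).
    print(shell_exit_code("exit 2"))
```

So a 127 only says that the command could not be started; it does not by itself prove that the plugin file is missing.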
It happens for ALL hosts of that type. Bear in mind this is a setup that has worked non-stop for years. The only thing I can think of is that some kind of OS limit is being hit when running the checks, since the only thing that changes is the number of hosts.
I've had max_concurrent_checks=1500
for a long time. (It's a 16-core, 24 GB RAM physical server.)
Apart from the concurrent checks setting, I run with:
check_result_reaper_frequency=25
max_check_result_reaper_time=20
The large group of hosts is configured as follows:
define host{
    use                     generic-cisco
    host_name               SOME_HOST.CISCO
    alias                   SOME_HOST.CISCO
    address                 xxx.xxx.xxx.xxx
    check_command           check-host-alive
    hostgroups              cisco_devices
}
define service{
    use                     generic-service
    host_name               SOME_HOST.CISCO
    service_description     PING
    check_command           check_ping!200.0,20%!600.0,60%
    normal_check_interval   10
    retry_check_interval    5
}
The only thing that returns it to a working state is removing some of the more recently added hosts, then stopping and starting Nagios and hoping it runs fine. Any suggestions?
What fixed it: although I had followed many other performance recommendations, I hadn't disabled environment macros in nagios.cfg:
enable_environment_macros=0
Not a dent in performance now. Apparently the problem was that the OS was struggling to make those environment variables available with that number of hosts. Found through here. I like a good Nagios facepalm.
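One plausible reading of "the OS was struggling" (an assumption, not something confirmed in this thread) is a hard kernel limit rather than slowness. With environment macros enabled, Nagios exports its macros to each plugin as NAGIOS_-prefixed environment variables, and the kernel refuses to start a process whose environment is too large (errno E2BIG, "Argument list too long"). The sketch below shows only that OS behaviour, not Nagios itself; the variable name `DEMO_MACRO`, the helper name and the sizes are made up for illustration.

```python
import errno
import os
import subprocess
import sys


def exec_errno_with_env_value(size_bytes, name="DEMO_MACRO"):
    """Start a trivial child process with one environment variable of
    `size_bytes` characters. Return None if the child started, or the
    errno reported by the OS if it could not be started."""
    env = dict(os.environ)
    env[name] = "x" * size_bytes  # stand-in for a very long macro value
    try:
        subprocess.run([sys.executable, "-c", "pass"], env=env)
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    for size in (1024, 16 * 1024 * 1024):
        err = exec_errno_with_env_value(size)
        status = "started fine" if err is None else errno.errorcode.get(err, str(err))
        print(f"{size}-byte variable: {status}")
```

If this is what happened here, the plugin binaries would be perfectly fine and the failure would appear only once the configuration grows past the limit, which matches the symptom of removing recently added hosts making things work again.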