We currently make use of several Nagios workers to distribute the workload using DNX as described here: https://assets.nagios.com/downloads/general/docs/Distributed_Monitoring_Solutions.pdf. I have not been able to find any information on scaling limits in the official documentation, and most searches just link me back to their website. Ignoring the compute power required (CPU, RAM, etc.), is there any hard limit on how many hosts or services a single Nagios instance can monitor? What about on an individual worker?
I'm not sure if you're asking about config limits or runtime limits. Or both.
If you're asking whether there's a limit on the number of objects (hosts/services/commands/contacts/whatever) that the parser will handle, the answer appears to be "no", judging by the source for the parser, unless you run out of memory while parsing.
As you get into tens-of-thousands of objects (and more) territory, the time required to parse the config can increase dramatically. This is more of a problem with v3 than v4, though. See the docs page on Fast Startup Options for more info.
If you're asking about runtime limitations, again there isn't a preconfigured or hardcoded upper limit. The only thing that really matters is checks/second, and whether or not your hardware can handle it. 10k hosts (or services) with a 5 minute check_interval is the same as 2k hosts/services with a 1 minute interval, in that regard.
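To make that comparison concrete, here is a minimal sketch of the arithmetic. It assumes the default interval_length of 60 seconds (so check_interval is in minutes) and ignores retries and host/service dependencies; the function name is made up for illustration.

```python
def checks_per_second(object_count, check_interval_minutes):
    """Average scheduled check rate, assuming one check per object per interval."""
    # Assumption: interval_length is left at its default of 60 seconds.
    return object_count / (check_interval_minutes * 60)


large = checks_per_second(10_000, 5)  # 10k objects, 5 minute interval
small = checks_per_second(2_000, 1)   # 2k objects, 1 minute interval
print(f"{large:.1f} {small:.1f}")     # prints "33.3 33.3"
```

Both setups come out at roughly 33 checks per second, which is the number the hardware actually has to sustain.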
Watching (and trending) the average check latencies and execution times from nagiostats is a good way to assess Nagios capacity problems.
There are some simple tweaks that can make big differences, like having check results spool to a ramdisk, and using check_icmp instead of check_ping. There are good suggestions on the Tuning Nagios For Maximum Performance page. (But that page also suggests using UltraSCSI disks instead of IDE, to give you some idea of how current it is...)
The reason you're having so much trouble finding definitive information about Nagios dimensioning is that no two installations are the same, and there are too many variables to be able to say "you can have X checks per second per core" or anything similar.
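As a rough sketch of the nagiostats trending suggested above, the following polls nagiostats in MRTG mode and parses the values so they can be fed to whatever graphing tool you use. It assumes nagiostats is on the PATH and that your version supports the AVGACTSVCLAT and AVGACTSVCEXT variables (check nagiostats --help to confirm, along with the units, which I believe are milliseconds). The function names are made up for illustration.

```python
import subprocess

# MRTG variable names assumed to be supported by the installed nagiostats;
# verify against `nagiostats --help` for your version.
VARS = ["AVGACTSVCLAT", "AVGACTSVCEXT"]


def parse_mrtg(output, names):
    """Pair each requested variable name with the value printed for it.

    In MRTG mode the values are expected one per line, in the order requested.
    """
    values = [line.strip() for line in output.splitlines() if line.strip()]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return {name: float(value) for name, value in zip(names, values)}


def sample(binary="nagiostats"):
    """Run nagiostats once in MRTG mode and return the parsed values."""
    result = subprocess.run(
        [binary, "--mrtg", "--data=" + ",".join(VARS)],
        capture_output=True, text=True, check=True,
    )
    return parse_mrtg(result.stdout, VARS)
```

Calling sample() from cron every few minutes and storing the result gives you the trend; steadily climbing latency is usually the first sign that the box is falling behind.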