We are running a new Nagios Core server on Ubuntu 16 Server. Everything was running fine until today, when the site suddenly slowed to a crawl. In top, we see consistent 99-100% CPU usage by either the nagios process or the *.cgi processes (web UI). Nothing changed. Polling latencies have also increased dramatically. We ran into this once before and decided to remove the install, compile a fresh build, and deploy it as new. That was a few weeks ago, and now we are back to the same thing. Has anyone else run into this and found a fix? Thanks.
top - 11:33:30 up 7 days, 22:38, 1 user, load average: 2.00, 1.91, 1.41
Tasks: 161 total, 2 running, 154 sleeping, 0 stopped, 5 zombie
%Cpu(s): 31.1 us, 3.3 sy, 0.0 ni, 63.3 id, 2.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12174388 total, 7690680 free, 1430508 used, 3053200 buff/cache
KiB Swap: 4067324 total, 4067324 free, 0 used. 10267768 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27230 nagios 20 0 782008 767708 2752 D 87.7 6.3 189:32.12 nagios
16175 www-data 20 0 781988 136336 68412 R 48.5 1.1 0:01.46 status.cgi
16174 sysadmin 20 0 41776 3836 3248 R 0.3 0.0 0:00.01 top
31422 www-data 20 0 296772 11440 3424 S 0.3 0.1 0:00.15 apache2
top - 11:33:33 up 7 days, 22:38, 1 user, load average: 2.00, 1.91, 1.41
Tasks: 161 total, 2 running, 154 sleeping, 0 stopped, 5 zombie
%Cpu(s): 24.9 us, 0.8 sy, 0.0 ni, 28.4 id, 45.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 12174388 total, 7550296 free, 1570912 used, 3053180 buff/cache
KiB Swap: 4067324 total, 4067324 free, 0 used. 10127412 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16175 www-data 20 0 922568 413956 205436 R 100.0 3.4 0:04.48 status.cgi
27230 nagios 20 0 782008 767708 2752 D 2.0 6.3 189:32.18 nagios
323 root 20 0 0 0 0 D 1.0 0.0 0:24.04 jbd2/dm-0-8
1 root 20 0 37792 5980 4144 S 0.0 0.0 0:10.31 systemd
I ended up resolving this issue in part by working with the community over on the Nagios site. Here is the solution:
1) Downloaded, compiled and installed a working build of Nagios from GitHub, per the community's recommendation. The version of Nagios we were running (4.4.1) has a bug that causes hosts/services to stay in a soft state, so rechecks happen more frequently.
Maintenance Branch: https://github.com/NagiosEnterprises/na ... tree/maint
2) Renamed the retention.dat and status.dat files. This was also necessary because they had each grown to over 8 GB. Presumably the parsing of these files was causing all the delays.
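For anyone who wants to check for the same symptom in step 2, here is a minimal Python sketch that reports oversized retention.dat and status.dat files and can rename them out of the way. The directory /usr/local/nagios/var is an assumption (the usual location for a source install; check your nagios.cfg), and the 1 GiB threshold and the helper names find_oversized and rename_aside are made up for illustration. Stop Nagios before renaming anything, and keep in mind that retention.dat holds retained state such as comments and scheduled downtime, so that state will not be loaded from the renamed file.

```python
import sys
import time
from pathlib import Path

# Assumed location for a source install; adjust to match your nagios.cfg.
NAGIOS_VAR = Path("/usr/local/nagios/var")
# Hypothetical threshold for "suspiciously large": 1 GiB.
SIZE_LIMIT = 1024 ** 3


def find_oversized(var_dir, names=("retention.dat", "status.dat"), limit=SIZE_LIMIT):
    """Return (path, size_in_bytes) for each named file that exists and exceeds limit."""
    oversized = []
    for name in names:
        path = Path(var_dir) / name
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                oversized.append((path, size))
    return oversized


def rename_aside(path, suffix=None):
    """Rename path to '<name>.<suffix>' in the same directory and return the new path."""
    path = Path(path)
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    target = path.with_name(f"{path.name}.{suffix}")
    path.rename(target)
    return target


if __name__ == "__main__":
    # Report only by default; pass --rename to actually move the files aside.
    do_rename = "--rename" in sys.argv[1:]
    for path, size in find_oversized(NAGIOS_VAR):
        print(f"{path}: {size / 1024 ** 3:.1f} GiB")
        if do_rename:
            print(f"  renamed to {rename_aside(path)}")
```

Run without arguments, it only lists files over the threshold. With --rename, it moves each one to a timestamped name in the same directory so it can be inspected or deleted later.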
It has been working perfectly since then for a few weeks now with no degradation in performance. I hope this helps others.