I am having a really tough time setting up nagios3 to do what I want. There are far too many config files, and I can't tell where the problem is because everything looks correct.
At first, notifications were sent for down hosts and critical services. Then I tried to configure it to also send notifications on recovery, and now it only sends recovery notifications, and not even for everything.
I want every service to use the generic service as a template and then add details only where I need to, but it's not playing ball. Here are my config files; let me know if you see anything wrong:
What I want is simple. Send email when host is down, when service is critical, and when it recovers - that's it!
----File contacts.cfg ---
define contact{
contact_name admin
alias administrator
service_notification_period 24x7
host_notification_period 24x7
service_notification_options u,c,r
host_notification_options d,u,r
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
email [email protected]
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members admin
}
---------------------EOF----------
------file generic-service.cfg ---------
define service{
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_interval 0 ; Only send notifications on status change by default.
is_volatile 0
check_period 24x7
normal_check_interval 1
retry_check_interval 1
max_check_attempts 4
notification_period 24x7
notification_options w,u,c,r
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
---------------EOF--------
----generic-host.cfg file----
define host{
name generic-host ; The name of this host template
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
# check_command check-host-alive
check_command check_tcp_alive
max_check_attempts 10
notification_interval 0
notification_period 24x7
notification_options d,u,r
contact_groups admins
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
----Excerpt from servicegroups.cfg-----
define service {
hostgroup_name Live, inhouse
service_description USERS
check_command check_nrpe_1arg!check_users
use generic-service
normal_check_interval 10
retry_check_interval 10
contact_groups admins
notification_interval 0 ; set > 0 if you want to be renotified
}
# check the LOAD
define service {
hostgroup_name Live, inhouse
service_description LOAD
check_command check_nrpe_1arg!check_load
use generic-service
normal_check_interval 5
retry_check_interval 1
notification_interval 0 ; set > 0 if you want to be renotified
}
# check the HDD
define service {
hostgroup_name Live, inhouse
service_description HDD
check_command check_nrpe_1arg!check_all_disks
use generic-service
normal_check_interval 600
retry_check_interval 30
notification_interval 0 ; set > 0 if you want to be renotified
}
-----EOF-----
--- Excerpt from Hostgroups.cfg----
define hostgroup {
hostgroup_name http-servers
alias HTTP servers
members *
}
----EOF-----
Your configs seem a bit off to me. If a check is not OK, Nagios rechecks it every 'retry_check_interval' (the time between retries) until it has failed 'max_check_attempts' times in a row, and only then sends an alert that something is broken. In the case of your 'HDD' check (30-minute retries, 4 attempts), that means the hard drive has to stay in a non-OK state for roughly 1.5 to 2 hours before you get a notification. If a check returns to an OK state before those conditions are met, no failure notification is sent. However, you will receive the recovery notification. This will very likely happen for the 'LOAD' check, even with its much smaller retry_check_interval, as system usage is often very dynamic.
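To make the timing concrete, here is a rough sketch of the 'HDD' check with a shorter retry window. The three interval values are my own illustrative choices, not recommendations, and the timing in the comments assumes the default interval_length of 60 seconds (so one unit is one minute):

```
# Sketch only: interval values are illustrative assumptions.
define service {
        use                     generic-service
        hostgroup_name          Live, inhouse
        service_description     HDD
        check_command           check_nrpe_1arg!check_all_disks
        normal_check_interval   60      ; check hourly while the service is OK
        retry_check_interval    5       ; recheck every 5 minutes once it is non-OK
        max_check_attempts      3       ; the 3rd non-OK result in a row triggers the alert
        notification_interval   0
}
```

With these values the failure alert should go out within about 10 to 15 minutes of the first bad result, rather than after well over an hour.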
Also, I don't believe in setting a notification interval to '0'. I feel it's a very bad practice that leads to alerts being missed, especially on the generic-* templates. I leave mine at '60' minutes in the template, then use '240' minutes in those few checks I don't want to hear from so often.
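As a sketch of that layout, using the names from your own files (only the two interval values come from my setup; the choice of 'USERS' as the quieter check is just an example):

```
define service {
        name                    generic-service
# ... all other template directives stay as they are ...
        notification_interval   60      ; re-notify every 60 minutes by default
        register                0
}

define service {
        use                     generic-service
        hostgroup_name          Live, inhouse
        service_description     USERS
        check_command           check_nrpe_1arg!check_users
        notification_interval   240     ; quieter check: re-notify every 4 hours
}
```

Keep in mind that a value set in the service itself overrides the template, so the 'notification_interval 0' lines in your service definitions would have to be removed or changed for the template default to apply.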
You should also check that 'Hostgroups.cfg' file again. The hostgroups you reference in your checks ('Live' and 'inhouse') are not defined in the hostgroup config excerpt in your examples.
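If those groups really are missing and not just left out of your excerpt, definitions along these lines would be needed. The aliases and member host names below are placeholders I made up; substitute your real hosts:

```
define hostgroup {
        hostgroup_name  Live
        alias           Live servers
        members         web01, web02    ; hypothetical host names
}

define hostgroup {
        hostgroup_name  inhouse
        alias           In-house servers
        members         office01        ; hypothetical host name
}
```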
In Nagios 3 and above:
'retry_check_interval' changed to 'retry_interval'
'normal_check_interval' changed to 'check_interval'
That said, for backwards compatibility with old versions of config files, all four are still supported - even in Nagios version 4.
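For example, your 'LOAD' check written with the newer directive names would look roughly like this (same values as in your file, only the two names changed):

```
define service {
        use                     generic-service
        hostgroup_name          Live, inhouse
        service_description     LOAD
        check_command           check_nrpe_1arg!check_load
        check_interval          5       ; formerly normal_check_interval
        retry_interval          1       ; formerly retry_check_interval
        notification_interval   0
}
```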