Question

Ulukai

Asked: 2015-12-14 10:41:06 +0800 CST

Nagios erratic behavior when sending notification emails

I am having a really tough time setting up nagios3 to do what I want. There are far too many config files, and I'm not sure where exactly the problem is, as everything seems correct.

At first, notifications were sent for down hosts and critical services. Then I tried to configure it to also send notifications on recovery, and now it sends only recovery notifications, and not even for everything.

I want to use the generic service as a template and then configure additional details where I need to, but it's not playing ball. Here are my config files; see if you can spot anything wrong:

What I want is simple. Send email when host is down, when service is critical, and when it recovers - that's it!

----File contacts.cfg ---

define contact{
        contact_name                    admin
        alias                           administrator
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    u,c,r
        host_notification_options       d,u,r
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        email                           [email protected]
        }


define contactgroup{
        contactgroup_name       admins
        alias                   Nagios Administrators
        members                 admin
        }

---------------------EOF----------

------file generic-service.cfg ---------

define service{
        name                            generic-service ; The 'name' of this service template
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
        notification_interval           0       ; Only send notifications on status change by default.
        is_volatile                     0
        check_period                    24x7
        normal_check_interval           1
        retry_check_interval            1
        max_check_attempts              4
        notification_period             24x7
        notification_options            w,u,c,r
        contact_groups                  admins
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }

---------------EOF--------

----generic-host.cfg file----

define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1       ; Host notifications are enabled
        event_handler_enabled           1       ; Host event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        failure_prediction_enabled      1       ; Failure prediction is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts
#       check_command                   check-host-alive
        check_command                   check_tcp_alive
        max_check_attempts              10
        notification_interval           0
        notification_period             24x7
        notification_options            d,u,r
        contact_groups                  admins
        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

----Excerpt from servicegroups.cfg-----

define service {
        hostgroup_name                  Live, inhouse
        service_description             USERS
        check_command                   check_nrpe_1arg!check_users
        use                             generic-service
        normal_check_interval           10
        retry_check_interval            10
        contact_groups                  admins
        notification_interval           0 ; set > 0 if you want to be renotified
        }

# check the LOAD
define service {
        hostgroup_name                  Live, inhouse
        service_description             LOAD
        check_command                   check_nrpe_1arg!check_load
        use                             generic-service
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           0 ; set > 0 if you want to be renotified
        }

# check the HDD
define service {
        hostgroup_name                  Live, inhouse
        service_description             HDD
        check_command                   check_nrpe_1arg!check_all_disks
        use                             generic-service
        normal_check_interval           600
        retry_check_interval            30
        notification_interval           0 ; set > 0 if you want to be renotified
        }

-----EOF-----

--- Excerpt from Hostgroups.cfg----

define hostgroup {
        hostgroup_name  http-servers
        alias           HTTP servers
        members         *
        }

----EOF-----

1 Answer

Jim Black · Answer 1 · 2015-12-29T09:39:57+08:00

Your configs seem a bit off to me. If a check is not OK, Nagios rechecks it every 'retry_check_interval' (the time between retries) until 'max_check_attempts' (the number of failures in a row) is reached, and only then sends an alert that something is broken. In the case of your 'HDD' check, that means the hard drive needs to be in a non-OK state for roughly 2 hours (30-minute retries x 4 attempts) before you get a notification. If a check returns to an OK state before those conditions are met, no failure notification is sent. However, you will still receive the recovery notification. This will very likely happen with the 'LOAD' check, even with its much smaller retry_check_interval, as system load is often very dynamic.
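
To make the arithmetic concrete, here is a sketch of the HDD check with shorter intervals. The values 60 and 5 are assumptions picked for illustration, not something taken from the original configs, and the minute units assume the default interval_length of 60 seconds.

```
# Hypothetical values. max_check_attempts 4 is inherited from generic-service,
# so 5-minute retries x 4 attempts is about 20 minutes until the problem
# notification, instead of 30 x 4 = 120 minutes.
define service {
        hostgroup_name                  Live, inhouse
        service_description             HDD
        check_command                   check_nrpe_1arg!check_all_disks
        use                             generic-service
        normal_check_interval           60      ; assumed: check hourly instead of every 600 minutes
        retry_check_interval            5       ; assumed: recheck every 5 minutes while non-OK
        notification_interval           0
        }
```

As far as I understand it, the first failing check already counts as attempt 1, so the real delay after the first failure is somewhat shorter than the simple product suggests.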

Also, I don't believe in setting a notification interval to '0'. I feel it's a very bad practice that leads to alerts being missed, especially on the generic-* templates. I leave mine at '60' minutes in the template, then use '240' minutes for the few checks I don't want to hear from so often.
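
A rough sketch of that setup, assuming a derived template so the existing generic-service stays untouched. The template name renotify-service is hypothetical, and applying 240 to the USERS check is only an example.

```
# Hypothetical template: inherits everything from generic-service,
# but re-notifies every 60 minutes while a problem persists.
define service{
        name                            renotify-service
        use                             generic-service
        notification_interval           60
        register                        0       ; template only, not a real service
        }

# A quieter check: overrides the template's 60 with 240 minutes.
define service {
        hostgroup_name                  Live, inhouse
        service_description             USERS
        check_command                   check_nrpe_1arg!check_users
        use                             renotify-service
        notification_interval           240
        }
```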

You should also check that 'Hostgroups.cfg' file again. The hostgroups you reference in your checks ('Live' and 'inhouse') are not defined in the hostgroup config excerpt in your examples.
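
For reference, definitions for those two groups might look something like the sketch below. The aliases and member host names are made up for illustration; each member has to match the host_name of a host you have actually defined.

```
define hostgroup {
        hostgroup_name  Live
        alias           Live servers            ; assumed alias
        members         web01, web02            ; hypothetical host names
        }

define hostgroup {
        hostgroup_name  inhouse
        alias           In-house servers        ; assumed alias
        members         office01                ; hypothetical host name
        }
```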

In Nagios 3 and above:

'retry_check_interval' changed to 'retry_interval'

'normal_check_interval' changed to 'check_interval'

That said, for backwards compatibility with old versions of config files, all four are still supported - even in Nagios version 4.
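
As an illustration, the LOAD check from the question written with the newer directive names would look roughly like this. The values are the ones from the question; only the directive names change.

```
define service {
        hostgroup_name                  Live, inhouse
        service_description             LOAD
        check_command                   check_nrpe_1arg!check_load
        use                             generic-service
        check_interval                  5       ; was normal_check_interval
        retry_interval                  1       ; was retry_check_interval
        ; notification_interval is left out here; it is inherited from generic-service
        }
```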
