Often my users expect me to be just as responsible for knowing when an event hasn't happened as when it has.
I've always had to build custom and brittle solutions with cron'ed shell scripts and lots of date edge case testing.
Centralized logging ought to allow for a better, more maintainable way to get a grip on what didn't happen in the last N hours. Something like logstash noticing and nagios alerting.
Update
toppledwagon's answer was so incredibly helpful . o O (Light. Bulb.) that I now have a dozen batch jobs being freshness checked. I wanted to do his thorough answer justice and follow up with how I've implemented his ideas.
I configured jenkins to emit syslog messages; logstash catches them and sends status updates to nagios via nsca. I also use check_mk to keep everything DRY and organized in nagios.
Logstash filter
    :::ruby
    filter {
      if [type] == "syslog" {
        grok {
          match => [ "message", '%{SYSLOGBASE} job="%{DATA:job}"(?: repo="%{DATA:repo}")?$',
                     "message", "%{SYSLOGLINE}" ]
          break_on_match => true
        }
        date { match => [ "timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ] }
      }
    }
The magic is in that double pair of patterns in grok's match parameter along with break_on_match => true. Logstash will try each pattern in turn until one of them matches.
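To make the fall-through behaviour concrete, here is a minimal Python sketch of the same idea: try the specific pattern first, then the generic one, and stop at the first hit. The regexes are my own simplified stand-ins for SYSLOGBASE and SYSLOGLINE (the real grok patterns are much more thorough), and `parse` is a hypothetical helper name, not anything logstash provides.

```python
import re

# Simplified stand-in for the syslog prefix; an assumption, not the real
# SYSLOGBASE grok pattern.
SYSLOG_PREFIX = (
    r"(?P<timestamp>\w{3} +\d+ \d\d:\d\d:\d\d) "
    r"(?P<logsource>\S+) "
    r"(?P<program>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?: "
)

PATTERNS = [
    # Most specific first: a line carrying job="..." and optionally repo="...".
    re.compile(SYSLOG_PREFIX + r'job="(?P<job>.*?)"(?: repo="(?P<repo>.*?)")?$'),
    # Fallback: any syslog line, with the remainder kept as the message.
    re.compile(SYSLOG_PREFIX + r"(?P<message>.*)$"),
]


def parse(line):
    """Return the fields from the first pattern that matches, else None."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            # Stop at the first hit, like break_on_match => true.
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return None
```

A jenkins line with `job="..."` comes back with `job` (and `repo` when present), while any other syslog line falls through to the second pattern and only gets the generic fields. That is why the output section can safely test `[job]` afterwards.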
Logstash output
We use the logstash nagios_nsca output plugin to let icinga know we saw the jenkins job in syslog.
    :::ruby
    output {
      if [type] == "syslog"
         and [program] == "jenkins"
         and [job] == "Install on Cluster"
         and "_grokparsefailure" not in [tags] {
        nagios_nsca {
          host => "icinga.example.com"
          port => 5667
          send_nsca_config => "/etc/send_nsca.cfg"
          message_format => "%{job} %{repo}"
          nagios_host => "jenkins"
          nagios_service => "deployed %{repo}"
          # 0 = OK: the job was seen, so the passive check is fresh again.
          nagios_status => "0"
        }
      } # if type=syslog, program=jenkins, job="Install on Cluster"
    } # output
icinga (nagios)
Finally, we have arrived at icinga (nagios) by way of nsca. Now we need passive service checks defined for each and every job we want to notice didn't happen on time. That can be a lot of jobs, so let's use check_mk to transform python lists of jobs into nagios object definitions. check_mk is cool like that.
/etc/check_mk/conf.d/freshness.mk
    # check_mk requires local variables be prefixed with '_'
    _dailies = [ 'newyork' ]
    _day_stale = 86400 * 1.5
    _weeklies = [ 'atlanta', 'denver', ]
    _week_stale = 86400 * 8
    _monthlies = [ 'stlouis' ]
    _month_stale = 86400 * 32

    _service_opts = [
        ("active_checks_enabled", "0"),
        ("passive_checks_enabled", "1"),
        ("check_freshness", "1"),
        ("notification_period", "workhours"),
        ("contacts", "root"),
        ("check_period", "workhours"),
    ]

    # Define a new command 'check-periodicaly' that sets the service to UNKNOWN.
    # It is called once the service's freshness threshold (e.g. _week_stale
    # seconds for the weeklies) has passed since the service last checked in.
    extra_nagios_conf += """
    define command {
        command_name check-periodicaly
        command_line $USER1$/check_dummy 3 $ARG1$
    }
    """

    # Loop through all passive checks and assign the new 'check-periodicaly' command to them.
    for _repo in _dailies + _weeklies + _monthlies:
        _service_name = 'deployed %s' % _repo
        legacy_checks += [(('check-periodicaly', _service_name, False), ['lead'])]

    # Look before you leap - python needs the list defined before appending to it,
    # but we can't just create it here because it may already be defined earlier.
    if "freshness_threshold" not in extra_service_conf:
        extra_service_conf["freshness_threshold"] = []

    # Some check_mk wizardry to set when the check has passed its expiration date.
    # Results in (691200, ALL_HOSTS, ['deployed atlanta', 'deployed denver']) for weeklies, etc.
    extra_service_conf["freshness_threshold"] += [
        (_day_stale, ALL_HOSTS, ["deployed %s" % _x for _x in _dailies] ),
        (_week_stale, ALL_HOSTS, ["deployed %s" % _x for _x in _weeklies] ),
        (_month_stale, ALL_HOSTS, ["deployed %s" % _x for _x in _monthlies] ),
    ]

    # Now we assign all the other nagios directives listed in _service_opts
    for _k,_v in _service_opts:
        if _k not in extra_service_conf:
            extra_service_conf[_k] = []
        extra_service_conf[_k] += [(_v, ALL_HOSTS, ["deployed "]) ]
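If it helps to picture what freshness.mk is aiming for, here is a small standalone Python sketch that renders the kind of passive service definition those settings describe. To be clear, this is not check_mk's actual output format: `render_service` and `render_all` are hypothetical helpers, and I only include a handful of the directives (host name `lead` and the command name come from the file above).

```python
DAY = 86400

# Same job lists and staleness windows as freshness.mk.
SCHEDULES = {
    "daily": (["newyork"], int(DAY * 1.5)),
    "weekly": (["atlanta", "denver"], DAY * 8),
    "monthly": (["stlouis"], DAY * 32),
}


def render_service(repo, threshold, host="lead"):
    """Render one passive, freshness-checked service as a nagios object definition."""
    directives = [
        ("host_name", host),
        ("service_description", "deployed %s" % repo),
        ("check_command", "check-periodicaly"),
        ("active_checks_enabled", "0"),
        ("passive_checks_enabled", "1"),
        ("check_freshness", "1"),
        ("freshness_threshold", str(threshold)),
    ]
    body = "\n".join("    %-24s %s" % (key, value) for key, value in directives)
    return "define service {\n%s\n}\n" % body


def render_all():
    """Render a definition for every job in every schedule."""
    return "".join(
        render_service(repo, threshold)
        for repos, threshold in SCHEDULES.values()
        for repo in repos
    )
```

Calling `print(render_all())` emits four `define service` blocks, one per job, each with its own `freshness_threshold` in seconds. Adding a job is then a one-word change to a list, which is the whole appeal of doing this through check_mk.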
I set up passive checks in nagios for various events. Then at the end of the event the passive check is sent to nagios (either via a wrapper script or built into the event itself). If the passive check hasn't been received in freshness_threshold seconds, nagios will run check_command locally. check_command is set up as a simple shell script which returns critical along with the service description.
I don't have code examples handy, but I could add some if there is interest.
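As a rough illustration of that logic (not of how nagios implements it internally), here is a toy Python model of a passive check with a freshness threshold. `PassiveCheck` and its methods are hypothetical names I made up for the sketch, and the "never checked in" case is simply treated as stale, which glosses over what nagios really does at startup.

```python
import time

STATE_OK, STATE_CRITICAL = 0, 2


class PassiveCheck:
    """Toy model of a passive service with check_freshness enabled."""

    def __init__(self, description, freshness_threshold):
        self.description = description
        self.freshness_threshold = freshness_threshold  # seconds
        self.last_result = None  # (timestamp, state, output), None if never seen

    def submit(self, state, output, now=None):
        """Record a passive result, as the event or its wrapper script would send it."""
        self.last_result = (time.time() if now is None else now, state, output)

    def evaluate(self, now=None):
        """Return (state, output), falling back to CRITICAL when the result is stale."""
        now = time.time() if now is None else now
        stale = (
            self.last_result is None
            or now - self.last_result[0] > self.freshness_threshold
        )
        if stale:
            # Stands in for the local check_command that always returns critical.
            return STATE_CRITICAL, "%s did not check in" % self.description
        return self.last_result[1], self.last_result[2]
```

The point of the design is that the event only ever reports success; the "it didn't happen" alert comes from the clock running out, so nothing on the client has to notice its own failure.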
EDIT ONE added code examples:
This assumes that you have done the basic setup for NSCA and send_nsca (make sure password and encryption_method are the same in send_nsca.cfg on the client and nsca.cfg on the nagios server. Then start the nsca daemon on the nagios server.)
First we define a template that other passive checks can use. This goes into services.cfg.
This says that if a notification hasn't come in, run check_failed with $SERVICEDESC$ as an argument.
Let's define the check_failed command in commands.cfg.
Here is the /usr/lib/nagios/plugins/check_failed script.
Having an exit of 2 makes this service critical according to nagios (see below for all nagios service states). Sourcing /usr/lib/nagios/plugins/utils.sh is another way; then you could exit $STATE_CRITICAL. But the above works even if you don't have that.
This gives the added notice of "Is NSCA running" because it might be the case that the service didn't check in properly OR it might be the case that NSCA has failed. This is more common than one might think. If several passive checks go critical at once, check for a problem with NSCA.
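For anyone who would rather not write this in shell, here is a hedged Python stand-in for the same behaviour: report the service that failed to check in, hint at NSCA, and signal critical with exit status 2. The exact message wording and the function names are my own assumptions, not a copy of the check_failed script.

```python
STATE_CRITICAL = 2  # nagios service state for "critical"


def check_failed(service_description):
    """Return the exit code and status line for a service that never checked in."""
    message = "CRITICAL: %s did not check in. Is NSCA running?" % service_description
    return STATE_CRITICAL, message


def main(argv):
    """Print the status line and return the exit code nagios should see."""
    # nagios passes $SERVICEDESC$ as the first argument.
    description = argv[1] if len(argv) > 1 else "unknown service"
    code, message = check_failed(description)
    print(message)
    return code
```

To use it as an actual plugin you would finish the file with `import sys` and `sys.exit(main(sys.argv))`, since nagios reads the state from the process exit status and the status text from the first line of output.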
Now we need a passive check to accept the results. In this example I have a specially crafted cron job that knows about all of the different types of raid controllers in our environment. When it runs it sends in a notification to this passive check. In this example, I don't want to be woken up in the middle of the night (edit notification_period as needed.)
Now there's the cronjob that sends info back to the nagios server. Here's the line in /etc/cron.d/raidcheck
See man send_nsca for options, but the important parts are that 'nagios' is the name of my nagios server, and the string that is printed at the end of this script. send_nsca expects a line on stdin of the form (perl here)
$hostname is obvious, $check in this case is 'raidcheck', $state is the nagios service state (0 = OK, 1 = warning, 2 = critical, 3 = unknown, 4 = dependent) and $status_info is an optional message to send as the status info.
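For reference, here is a small Python sketch of building that line and piping it to send_nsca. It assumes send_nsca's default tab delimiter and uses only the `-H` and `-c` options; the config path `/etc/send_nsca.cfg` and the function names are assumptions for illustration, so adjust them to your install.

```python
import subprocess


def nsca_line(hostname, check, state, status_info=""):
    """Build one tab-separated result line in the form send_nsca reads on stdin."""
    # Tabs or newlines inside the message would break the line format.
    status_info = status_info.replace("\t", " ").replace("\n", " ")
    return "%s\t%s\t%d\t%s\n" % (hostname, check, int(state), status_info)


def send_result(line, nagios_server="nagios", config="/etc/send_nsca.cfg"):
    """Pipe the line to send_nsca and return the command's exit status."""
    command = ["send_nsca", "-H", nagios_server, "-c", config]
    return subprocess.run(command, input=line.encode(), check=False).returncode
```

A cron job would then do something like `send_result(nsca_line("web01", "raidcheck", 0, "all arrays optimal"))`, where `web01` is a made-up host name. The host and check names have to match the host_name and service_description of the passive check on the nagios side, otherwise the result is dropped.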
Now we can test the check on the command line of the client:
This gives us a nagios passive check that expects to be updated every freshness_threshold seconds. If the check isn't updated, check_command (check_failed in this case) is run. The example above is for a nagios 2.X install, but will likely work (maybe with minor modification) for nagios 3.X.
I'm not sure which type you are referring to, as "event doesn't happen" can take different forms: it can be either conditional or unconditional. Examples:
If you are after the first case and need an open source tool, there is a PairWithWindow rule in SEC and an Absence rule in nxlog. (Note that I'm affiliated with the latter.)
The second type is simpler and both tools can handle that too afaik.
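To show the difference between the two forms, here is a minimal Python sketch under my own reading of them: "conditional" means an expected event fails to follow a trigger event within a time window, and "unconditional" means an event simply hasn't been seen for too long. The function names and the (timestamp, name) event format are hypothetical; this is not how SEC or nxlog implement their rules.

```python
def conditional_absences(events, trigger, expected, window):
    """Yield trigger timestamps not followed by `expected` within `window` seconds.

    `events` is an iterable of (timestamp, name) pairs sorted by timestamp.
    Assumes the log extends at least `window` seconds past each trigger.
    """
    events = list(events)
    for i, (start, name) in enumerate(events):
        if name != trigger:
            continue
        followed = any(
            later_name == expected and later <= start + window
            for later, later_name in events[i + 1:]
        )
        if not followed:
            yield start


def unconditional_absence(events, expected, now, max_age):
    """Return True if `expected` has not been seen in the last `max_age` seconds."""
    seen = [ts for ts, name in events if name == expected and ts <= now]
    return not seen or now - max(seen) > max_age
```

The conditional form needs state per trigger (which is what a pair-style rule keeps for you), while the unconditional form only needs the timestamp of the last sighting, which is why it is the simpler of the two.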