I'm trying to write a general rule that fires an alert when a discovered target goes missing, in particular for Kubernetes pods annotated for scraping and auto-discovered using kubernetes_sd_configs.
Expressions of the form
absent(up{job="kubernetes-pods"}==1)
do not return any of the additional labels that were available on the up time series. If a pod is deleted (say, by mistake), it disappears as a target from Prometheus. An alert based on absent() fires, but I have no information about which pod has gone missing.
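For illustration, absent() derives its output labels only from the equality matchers in the selector, so when the last matching target vanishes you get something like:

    absent(up{job="kubernetes-pods"})
    # => {job="kubernetes-pods"}  1
    # Only the matcher label is present; pod name, namespace, etc. are gone.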
I think the same happens for auto-discovered Kubernetes services: if one is deleted by mistake, it just disappears as a monitored target. I'm not sure whether the behavior is the same for target_groups (https://prometheus.io/blog/2015/06/01/advanced-service-discovery/) with an IP range - that is, whether the metrics just stop when a physical node is turned off, so that up == 0 is not available.
What is the correct way to detect, in a general way, that an auto-discovered target is gone? Or do I need to hard-code rules for each service/node/pod explicitly, even though it was auto-discovered?
Yes, you need a rule for every individual thing you want to alert on going missing, as Prometheus doesn't know their labels from anywhere once they're gone: service discovery is no longer returning them.
The usual alert is
absent(up{job="kubernetes-pods"})
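A sketch of what that looks like in practice, with one hand-written rule per thing you care about (the rule names, regexes, and for durations below are hypothetical, and the kubernetes_pod_name label in particular depends on your relabel_configs):

    groups:
      - name: missing-scrape-targets
        rules:
          # One explicit rule per pod/service, since absent() carries over
          # only the equality matchers from its selector, not pod labels.
          - alert: MyAppPodsMissing
            expr: absent(up{job="kubernetes-pods", kubernetes_pod_name=~"my-app-.*"})
            for: 5m
            annotations:
              summary: "No my-app pods are being scraped"
          - alert: MyDbPodsMissing
            expr: absent(up{job="kubernetes-pods", kubernetes_pod_name=~"my-db-.*"})
            for: 5m
            annotations:
              summary: "No my-db pods are being scraped"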
We've been solving something similar. Our setup: when some service starts somewhere, some metrics appear with a non-zero value. Then, if any of those metrics go missing, we want an alert.
In our case, the proper expression to achieve that is (metric and label names here are placeholders for your own):

    count(some_metric offset 1h) by (some_label)
    unless
    count(some_metric) by (some_label)

This returns a vector containing the metrics that were present an hour ago but aren't present now. The values of those metrics are the
count(...)
from the LHS (which can even be useful). You can use any LHS/RHS. Read more about the unless operator.
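Putting it together as an alerting rule, a sketch (metric name, label name, and durations are placeholders to adapt to your setup):

    groups:
      - name: disappeared-targets
        rules:
          - alert: TargetDisappeared
            expr: >
              count(some_metric offset 1h) by (some_label)
              unless
              count(some_metric) by (some_label)
            for: 10m
            annotations:
              summary: "{{ $labels.some_label }} had metrics an hour ago but has disappeared"

Unlike absent(), each firing alert keeps the grouping labels, so you can see exactly which target went missing.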