I'm trying to write a general rule that fires an alert when a discovered target goes missing, in particular for Kubernetes pods annotated for scraping and auto-discovered using kubernetes_sd_configs.
Expressions of the form
absent(up{job="kubernetes-pods"}==1)
do not return any of the additional labels that were available on the up time series. If a pod is deleted (say, by mistake), it disappears as a target from Prometheus. An alert based on absent() fires, but I have no information about which pod has gone missing.
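For illustration, absent() derives its output labels only from the equality matchers in the selector, so when the last matching target vanishes you get something like:

    absent(up{job="kubernetes-pods"})
    # => {job="kubernetes-pods"}  1
    # Only the matcher label is present; pod name, namespace, etc. are gone.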
I think the same happens for auto-discovered Kubernetes services: if one is deleted by mistake, it just disappears as a monitored target. I'm not sure whether the behavior is the same for target_groups (https://prometheus.io/blog/2015/06/01/advanced-service-discovery/) with an IP range - that is, whether the metrics just stop when a physical node is turned off, so that up == 0 is not available.
What is the correct way to detect, in a general way, that an auto-discovered target is gone? Or do I need to hard-code rules for each service/node/pod explicitly, even though it was auto-discovered?
Yes, you need a rule for every individual thing you want to alert on going missing, as Prometheus doesn't know their labels from anywhere once they're gone: service discovery is no longer returning them.
The usual alert is
absent(up{job="kubernetes-pods"})
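A sketch of what that looks like in practice, with one hand-written rule per thing you care about (the rule names, regexes, and for durations below are hypothetical, and the kubernetes_pod_name label in particular depends on your relabel_configs):

    groups:
      - name: missing-scrape-targets
        rules:
          # One explicit rule per pod/service, since absent() carries over
          # only the equality matchers from its selector, not pod labels.
          - alert: MyAppPodsMissing
            expr: absent(up{job="kubernetes-pods", kubernetes_pod_name=~"my-app-.*"})
            for: 5m
            annotations:
              summary: "No my-app pods are being scraped"
          - alert: MyDbPodsMissing
            expr: absent(up{job="kubernetes-pods", kubernetes_pod_name=~"my-db-.*"})
            for: 5m
            annotations:
              summary: "No my-db pods are being scraped"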
We've been solving something similar. Our setup: when some service starts somewhere, some metrics appear with a non-zero value. Then, if any of those metrics go missing, we want an alert.
In our case, the proper expression to achieve that is (metric and label names here are placeholders for your own):

    count(some_metric offset 1h) by (some_label)
    unless
    count(some_metric) by (some_label)

This returns a vector containing the metrics that were present an hour ago but aren't present now. The values of those metrics are the
count(...)
from the LHS (which can even be useful). You can use any LHS/RHS. Read more about the unless operator.
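Putting it together as an alerting rule, a sketch (metric name, label name, and durations are placeholders to adapt to your setup):

    groups:
      - name: disappeared-targets
        rules:
          - alert: TargetDisappeared
            expr: >
              count(some_metric offset 1h) by (some_label)
              unless
              count(some_metric) by (some_label)
            for: 10m
            annotations:
              summary: "{{ $labels.some_label }} had metrics an hour ago but has disappeared"

Unlike absent(), each firing alert keeps the grouping labels, so you can see exactly which target went missing.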