I have set up 5 alerts in Prometheus. 3 of them work as expected, but 2 are never triggered. I am really confused and need some help here.
The 2 rules that do not work are:
alert: CriticalDiskSpace
expr: node_filesystem_free{filesystem!~"^/run(/|$)",fstype!~"tmpfs",job="{{ $labels.job }}"} / node_filesystem_size{job="{{ $labels.job }}"} < 0.25
for: 4m
labels:
  severity: critical
annotations:
  description: '{{ $labels.instance }} of job {{ $labels.job }} has less than 25% space remaining.'
  summary: Instance {{ $labels.instance }} - Critical disk space usage
alert: CriticalCPULoad
expr: (100 * (1 - avg by(instance) (irate(node_cpu{job="{{ $labels.job }}",mode="idle"}[2m])))) > 75
for: 2m
labels:
  severity: critical
annotations:
  description: '{{ $labels.instance }} of job {{ $labels.job }} has Critical CPU load for more than 2 minutes.'
  summary: Instance {{ $labels.instance }} - Critical CPU load
When I run the expressions manually in Prometheus, I get the correct values. For example, for disk space, I have a test instance where the filesystem is 79% full, so the alert should fire.
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       50G   40G   11G  79% /
node_filesystem_free{filesystem!~"^/run(/|$)",fstype!~"tmpfs",fstype!~"rootfs", job="ec2_eu_west_1_discovery"} / node_filesystem_size{job="ec2_eu_west_1_discovery"} < 0.25
And of course, Prometheus has the correct value:
Element:
{device="/dev/xvda1",fstype="xfs",instance="Grafana Test",job="ec2_eu_west_1_discovery",mountpoint="/"}
Value:
0.21932882130469517
I have found a way to make the rule fire.
So, if I change the expression from this:
to this:
I get an alert. So now I need to understand why I can use {job="{{ $labels.job }}"} in the rules browser but not in the rules.yml file.
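To illustrate the kind of change described above, here is a minimal sketch of the disk-space rule with the templated job matcher dropped from expr. It is a reconstruction under an assumption, not necessarily the exact edit: it assumes the only difference is removing job="{{ $labels.job }}". As far as I understand, {{ $labels.* }} templates are expanded only in labels and annotations, while expr is evaluated as plain PromQL, so a matcher like job="{{ $labels.job }}" would be compared against that literal string and match no series.

```yaml
# Sketch only: same rule as above, with the templated job matcher removed from expr.
# Assumption: templates stay in annotations, where Prometheus expands them per alert.
alert: CriticalDiskSpace
expr: node_filesystem_free{filesystem!~"^/run(/|$)",fstype!~"tmpfs"} / node_filesystem_size < 0.25
for: 4m
labels:
  severity: critical
annotations:
  description: '{{ $labels.instance }} of job {{ $labels.job }} has less than 25% space remaining.'
  summary: Instance {{ $labels.instance }} - Critical disk space usage
```

With no job matcher, the expression is evaluated across all jobs, and the job label of each resulting series is still available to the annotation templates. To restrict the rule to one job, a literal value such as job="ec2_eu_west_1_discovery" could be used in expr instead.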