How can I configure Zabbix to allow hosts to be unreachable at unplanned times?
We have some clusters running where usually not all nodes are fully utilized (VMware, Slurm, OpenStack), and we are currently evaluating where we can shut down hosts to conserve power when possible.
Both VMware and Slurm can do this easily, but they more or less randomly decide which hosts to shut down and when, so maintenance windows are out.
On Slurm I have to write a script to handle the shutdown and power-up anyway, so I could add an API call there that disables and enables monitoring of the shut-down server, but on VMware I can't do this.
The only approach I can currently think of is trigger dependencies, but I don't really see how I could configure them in a useful way.
Some clarifications:
- I want to "pause" monitoring of nodes that are shut down intentionally
- I still need alarms to be triggered when nodes are down unintentionally
- I would prefer not to install anything on an ESXi host, except maybe as a VIB that is going to be supported for the foreseeable future
- I would prefer to have everything directly in Zabbix or on the Zabbix host
- I would prefer to have the same method for all types of hosts, so it's easier to maintain
My current idea (work in progress):
- Monitoring of the node state via control plane
- A trigger that is triggered when the state is in planned standby
- An action that either enables/disables the host or creates/deletes maintenance periods
This needs a small script on the Zabbix host that utilizes the API, but this is fine for me
You can define a so-called maintenance period (planned downtime).
You can also define it via the Zabbix API.
I continued with my work in progress from the question and want to share the result in case someone else has a similar issue.
First I'm getting the state of the relevant nodes from their specific control plane.
Slurm
Agent configuration on one of the control nodes:
Zabbix discovery rule:
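- Item prototype name: Slurm node {#NODENAME} status
- Item prototype key: slurm.node.state[{#NODENAME}]
- Trigger prototype name: Slurm node {#NODENAME} in stand by
- Operational data: {#NODENAME}
- Expression: find(/controlnodename/slurm.node.state[{#NODENAME}],#3,"regexp","[~#]$")=1
- Tag: powersave:standby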
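The trigger expression relies on the suffix Slurm appends to a node state: "~" marks a node in power saving mode and "#" a node that is being powered up again. With #3, find() returns 1 if any of the last three values matches. A minimal Python sketch of the same check, useful for trying the regular expression against sample states before putting it into the trigger (the function name and the sample states are illustrative and not part of my configuration):

```python
import re

# Same pattern as in the trigger: the state ends in "~" (powered down to
# save energy) or "#" (being powered up again).
POWERSAVE_SUFFIX = re.compile(r"[~#]$")


def is_planned_standby(recent_states):
    """Mimic find(..., #3, "regexp", "[~#]$")=1: true if any of the
    given recent values matches the pattern."""
    return any(POWERSAVE_SUFFIX.search(state) for state in recent_states)


if __name__ == "__main__":
    # Hypothetical samples: a node that went into power saving, and a node
    # that is down without Slurm having powered it off.
    print(is_planned_standby(["idle", "idle~", "idle~"]))
    print(is_planned_standby(["idle", "down*", "allocated"]))
```

A node that is down unintentionally carries no "~" or "#" suffix, so it does not match and the normal availability triggers still fire.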
VMware vSphere
I've written a small script to get the powerstate of the ESXi hosts from the vCenter API:
This was the only way I could find to get the status of hosts even when they are turned off.
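The script itself is not reproduced here. As a rough sketch of the idea, assuming the pyVmomi library and connection settings taken from environment variables (VC_HOST, VC_USER and VC_PASSWORD are hypothetical names), an external check could look like this. Zabbix passes the item key parameter as the first command line argument and stores whatever the script prints.

```python
#!/usr/bin/env python3
"""Print the power state of one ESXi host as reported by vCenter."""
import os
import sys


def power_state_of(hosts, name):
    """Return the power state of the host called `name`, or "unknown"."""
    for host in hosts:
        if host.name == name:
            # vCenter reports poweredOn, poweredOff, standBy or unknown.
            return str(host.runtime.powerState)
    return "unknown"


def main(name):
    # Imported here so the helper above works without pyVmomi installed.
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(
        host=os.environ["VC_HOST"],        # assumption: set on the Zabbix host
        user=os.environ["VC_USER"],
        pwd=os.environ["VC_PASSWORD"],
    )
    try:
        content = si.RetrieveContent()
        # Collect every HostSystem below the root folder, recursively.
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True
        )
        print(power_state_of(view.view, name))
    finally:
        Disconnect(si)


# Only contact vCenter when a host name was passed on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Because the state is read from vCenter and not from the ESXi host, it is still available while the host is in standby.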
Zabbix discovery rule:
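- Discovery rule key: vmware.hv.discovery[{$VMWARE.URL}] (Type: Simple check)
- Item prototype name: Power state of Hypervisor {#HV.NAME}
- Item prototype key: vsphere_host_powerstate.py[{#HV.NAME}]
- Trigger prototype name: VMware Host {#HV.NAME} in stand by
- Operational data: {#HV.NAME}
- Expression: last(/vc/vsphere_host_powerstate.py[{#HV.NAME}])="standBy"
- Tag: powersave:standby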
Common configuration
The field Operational data is important, as I use it later when the scripts are called with the macro {EVENT.OPDATA}. The tag powersave with the value standby in triggers is used so I can use a single action for all hosts.
I debated if machines that are shut down for power saving should be disabled in Zabbix or placed into a maintenance window.
I decided that I find it more intuitive to have the machines in a maintenance window.
I've written a script to add and remove the machines in maintenance windows:
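The script is not reproduced here. The following Python sketch shows one way such a helper could talk to the Zabbix API; it is not the original zbx-set-host.sh. It assumes Zabbix 6.4 or later with an API token sent as a Bearer header, that {EVENT.OPDATA} expands to the technical host name in Zabbix, and one maintenance period per host. The environment variables ZBX_URL and ZBX_TOKEN, the "powersave: " name prefix and the 7-day upper bound are all made up for illustration.

```python
#!/usr/bin/env python3
"""Add or remove a per-host maintenance period through the Zabbix API."""
import json
import os
import sys
import time
import urllib.request

# Assumptions: URL and token come from the environment (hypothetical names).
API_URL = os.environ.get("ZBX_URL", "http://localhost/api_jsonrpc.php")
API_TOKEN = os.environ.get("ZBX_TOKEN", "")
MAX_STANDBY = 7 * 24 * 3600  # arbitrary upper bound for the window, seconds


def maintenance_name(host):
    return f"powersave: {host}"


def create_params(host, hostid, now, duration=MAX_STANDBY):
    """Parameters for maintenance.create covering one host from `now` on."""
    return {
        "name": maintenance_name(host),
        "active_since": now,
        "active_till": now + duration,
        "hosts": [{"hostid": hostid}],
        # One-time period (type 0) spanning the whole active range.
        "timeperiods": [
            {"timeperiod_type": 0, "start_date": now, "period": duration}
        ],
    }


def call(method, params):
    """Send one JSON-RPC request and return its result."""
    body = json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    )
    request = urllib.request.Request(
        API_URL,
        data=body.encode(),
        headers={
            "Content-Type": "application/json-rpc",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]


def main(host, action):
    name = maintenance_name(host)
    if action == "maintenance_add":
        hosts = call("host.get", {"filter": {"host": [host]}, "output": ["hostid"]})
        if not hosts:
            raise SystemExit(f"host not found in Zabbix: {host}")
        call("maintenance.create",
             create_params(host, hosts[0]["hostid"], int(time.time())))
    elif action == "maintenance_remove":
        found = call("maintenance.get",
                     {"filter": {"name": [name]}, "output": ["maintenanceid"]})
        if found:
            call("maintenance.delete", [m["maintenanceid"] for m in found])
    else:
        raise SystemExit(f"unknown action: {action}")


# Expected call: <script> <host name> maintenance_add|maintenance_remove
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Naming the maintenance period after the host keeps removal simple: the recovery operation only has to look the period up by name and delete it.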
Script configuration:
- Name: Run Script zbx-set-host.sh maintenance_add
- Command: /path/to/zbx-set-host.sh {EVENT.OPDATA} maintenance_add
- Name: Run Script zbx-set-host.sh maintenance_remove
- Command: /path/to/zbx-set-host.sh {EVENT.OPDATA} maintenance_remove
Trigger action:
- Name: Disable host in stand by (power saving)
- Condition: Value of tag powersave equals standby
- Operation: Run script "Run Script zbx-set-host.sh maintenance_add"
- Recovery operation: Run script "Run Script zbx-set-host.sh maintenance_remove"