Question

Gerald Schneider

Asked: 2022-08-20 00:58:06 +0800 CST

Zabbix monitoring and power conservation

How can I configure Zabbix to allow hosts to be unreachable at unplanned times?

We run several clusters (VMware, Slurm, OpenStack) where usually not all nodes are fully utilized, and we are currently evaluating where we can shut down hosts to conserve power when possible.

Both VMware and Slurm can do this easily, but they more or less randomly decide which hosts to shut down and when, so maintenance windows are out.

On Slurm I have to write a script to handle the shutdown and power-up, so I could add an API call there that disables and enables monitoring of the server being shut down. On VMware I can't do this.

The only approach I can currently think of is trigger dependencies, but I don't really see how I could configure them in a useful way.

Some clarifications:

- I want to "pause" monitoring of nodes that are shut down intentionally
- I still need alarms to be triggered when nodes are down unintentionally
- I would prefer not to install anything on an ESXi host, except maybe as a VIB that is going to be supported for the foreseeable future
- I would prefer to have everything directly in Zabbix or on the Zabbix host
- I would prefer to have the same method for all types of hosts, so it's easier to maintain

My current idea (work in progress):

- Monitoring of the node state via the control plane
- A trigger that fires when the state is planned standby
- An action that either enables/disables the host or creates/deletes maintenance periods
- This needs a small script on the Zabbix host that uses the API, but that is fine for me

2 Answers

Romeo Ninov · Answer 1 · 2022-08-20T01:13:55+08:00

You can define a so-called maintenance period (planned downtime).

- Navigate to Configuration → Maintenance
- Click on the Create maintenance period button
- Type in the maintenance period name
- Select the maintenance type and the activity time window
- Add a period during which your maintenance will take place
- Select hosts and/or host groups
- Optionally, specify tags to suppress only the matching problems
- Add the maintenance period
- Wait until the configuration changes are picked up by the Zabbix server
- Navigate to Monitoring → Problems
- Confirm if the problems on the host are suppressed

You can also define the above via the Zabbix API.
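
As a rough sketch of what such an API call could look like, the following Python builds a JSON-RPC request body for `maintenance.create`. The function name `build_maintenance_request`, the token placeholder and the host ID are made up for illustration. The parameter names (`active_since`, `active_till`, `hostids`, `timeperiods`) are assumed to match the Zabbix 6.0 API; newer versions may expect `hosts` instead of `hostids`. Nothing is sent over the network here.

```python
import json
import time


def build_maintenance_request(name, hostids, start, duration, auth_token):
    """Build a JSON-RPC body for maintenance.create (one-time period).

    start is a Unix timestamp, duration is in seconds.
    """
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "active_since": start,
            "active_till": start + duration,
            "hostids": list(hostids),  # assumption: 6.0-style parameter
            "timeperiods": [
                {
                    "timeperiod_type": 0,  # 0 = one-time period
                    "start_date": start,
                    "period": duration,
                }
            ],
        },
        "auth": auth_token,
        "id": 1,
    }


if __name__ == "__main__":
    body = build_maintenance_request(
        "Planned downtime", ["10084"], int(time.time()), 3600, "<api-token>"
    )
    print(json.dumps(body, indent=2))
```

The resulting body would be POSTed to `api_jsonrpc.php` with a `Content-Type: application/json-rpc` header.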

Gerald Schneider · Answer 2 · 2022-10-05T01:52:42+08:00

I continued with my work in progress from the question and want to share the result in case someone else has a similar issue.

First I'm getting the state of the relevant nodes from their specific control plane.

Slurm

Agent configuration on one of the control nodes:

UserParameter=slurm.nodes.discovery,sinfo -N -h -o '{"{#NODENAME}":"%N"}' |jq -sr '[.[]]'
UserParameter=slurm.node.state[*],sinfo -h -o %T -n $1
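
To make the discovery output more concrete, here is a small Python sketch of the JSON that the `slurm.nodes.discovery` pipeline is meant to produce: one object per node, keyed by the `{#NODENAME}` LLD macro. The function name `to_lld` and the node names are invented. The sketch also drops duplicate names, which the `sinfo | jq` pipeline above does not do; I believe `sinfo -N` can list a node once per partition, so treat that part as an assumption.

```python
import json


def to_lld(node_names):
    """Turn node names into Zabbix low-level discovery rows."""
    seen = set()
    rows = []
    for name in node_names:
        name = name.strip()
        # Skip empty lines and names that were already emitted
        if name and name not in seen:
            seen.add(name)
            rows.append({"{#NODENAME}": name})
    return rows


if __name__ == "__main__":
    sample = ["node01", "node02", "node02"]
    print(json.dumps(to_lld(sample)))
```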

Zabbix discovery rule:

Key: slurm.nodes.discovery
- Item prototype
  - Name: Slurm node {#NODENAME} status
  - Type: Zabbix agent
  - Key: slurm.node.state[{#NODENAME}]
  - Type of information: Text
- Trigger prototype
  - Name: Slurm node {#NODENAME} in stand by
  - Operational data: {#NODENAME}
  - Severity: Information
  - Expression: find(/controlnodename/slurm.node.state[{#NODENAME}],#3,"regexp","[~#]$")=1
  - Tags: powersave: standby
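
The regular expression in the trigger prototype relies on the suffix that Slurm appends to the node state. The following Python sketch checks which states it would match. The helper name `in_power_save` and the sample states are illustrative, and the meaning of the suffixes (`~` for a node powered down by power saving, `#` for a node that is powering up) is taken from the sinfo documentation as I remember it.

```python
import re

# Same pattern as in the trigger expression: state ends in '~' or '#'
STANDBY_SUFFIX = re.compile(r"[~#]$")


def in_power_save(state):
    """Return True if a sinfo %T state ends in '~' or '#'."""
    return STANDBY_SUFFIX.search(state.strip()) is not None


if __name__ == "__main__":
    for state in ["idle", "idle~", "idle#", "down*", "allocated"]:
        print(state, in_power_save(state))
```

As I read the trigger function, `find(...,#3,...)=1` looks at the last three values, so the problem should only resolve once none of them matches any more.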

VMware vSphere

I've written a small script to get the powerstate of the ESXi hosts from the vCenter API:

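#!/usr/bin/env python3
"""Print the power state of one ESXi host as reported by vCenter."""
import sys
from pyVim import connect
from pyVmomi import vim

vc_url = "vcenter.example.com"
vc_user = "zabbix@vsphere.local"
vc_password = "aHR0cDovL2JpdC5seS8xVHFjd243Cg=="

# sys.argv[0] is the script name, so one argument means len(sys.argv) == 2
if len(sys.argv) < 2:
    print("host parameter missing", file=sys.stderr)
    sys.exit(3)

esx_name = sys.argv[1]

# Connect without certificate validation. Newer pyVmomi releases may no longer
# ship ConnectNoSSL; there, SmartConnect(..., disableSslCertValidation=True)
# should be the equivalent.
my_cluster = connect.ConnectNoSSL(vc_url, 443, vc_user, vc_password)

content = my_cluster.RetrieveContent()
# Collect all HostSystem objects below the root folder
object_view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)

host_list = object_view.view
object_view.Destroy()

# Print the power state of the requested host
for host in host_list:
    if host.name == esx_name:
        print(host.runtime.powerState)

connect.Disconnect(my_cluster)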

This was the only way I could find to get the status of hosts even when they are turned off.

Zabbix discovery rule:

Key: vmware.hv.discovery[{$VMWARE.URL}]
Type: Simple check
- Item prototype
  - Name: Power state of Hypervisor {#HV.NAME}
  - Type: External check
  - Key: vsphere_host_powerstate.py[{#HV.NAME}]
  - Type of information: Text
- Trigger prototype
  - Name: VMware Host {#HV.NAME} in stand by
  - Operational data: {#HV.NAME}
  - Severity: Information
  - Expression: last(/vc/vsphere_host_powerstate.py[{#HV.NAME}])="standBy"
  - Tags: powersave: standby

Common configuration

The field Operational data is important, as I use it later when the scripts are called with the macro {EVENT.OPDATA}. The tag powersave with the value standby in triggers is used so I can use a single action for all hosts.

I debated whether machines that are shut down for power saving should be disabled in Zabbix or placed into a maintenance window.

I decided that a maintenance window is more intuitive:

- If a host is disabled, it is not obvious to everybody who manages the machines in Zabbix why it is disabled
- The description of the maintenance window is displayed directly in the events

I've written a script to add machines to and remove them from a maintenance window:

#!/usr/bin/env bash
# set -xv
ZBX_URL="https://zabbix.example.com/zabbix/api_jsonrpc.php"
ZBX_USERNAME="zabbix-api"
ZBX_PASSWORD="aHR0cDovL2JpdC5seS8xVHFjd243Cg=="
ZBX_MAINTENANCE_ID=431

function usage {
    echo "Usage:"
    echo "  $0 <hostname> <maintenance_add|maintenance_remove>"
}

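function api_error {
    # Write to stderr so the message is not swallowed when the caller
    # captures stdout with $(...)
    echo -n "API ERROR: " >&2
    echo "$*" | jq -r .error.data >&2
    exit 3
}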

if [[ $# -lt 2 ]]; then
    echo "Error: argument missing"
    usage
    exit 1
fi

function zbx_request {
    REQUEST='{
        "jsonrpc": "2.0",
        "method": "'$1'",
        "params": '$2',
        "auth": "'$AUTH'",
        "id": 1
    }'
    # echo "$REQUEST" |jq

    RESPONSE=$(curl -s -X POST \
    -H 'Content-Type: application/json-rpc' \
    -d "$REQUEST" "$ZBX_URL")

    echo "$RESPONSE" | jq --exit-status .result > /dev/null || api_error "$RESPONSE"
    echo "$RESPONSE"
}

function host_get {
    PARAMS='{
            "filter": {
                "host": [
                    "'$1'"
                ]
            }
        }'
    RESPONSE=$(zbx_request "host.get" "$PARAMS")
    echo "$RESPONSE" | jq -r .result[0].hostid
}

function maintenance_get {
    PARAMS='{
            "maintenanceids": [ '$ZBX_MAINTENANCE_ID' ],
            "selectHosts": [ "hostid" ]
        }'
    RESPONSE=$(zbx_request "maintenance.get" "$PARAMS")
    echo "$RESPONSE" |jq .result[0].hosts | jq -r 'map(.hostid)'
}

function maintenance_add {
    MAINTENANCE_HOSTS=$(maintenance_get)
    HOSTID=$(host_get "$1")

    if jq -e '. | index("'"$HOSTID"'")' <<<"$MAINTENANCE_HOSTS" > /dev/null; then return; fi

    HOSTS=$(jq '. += ["'"$HOSTID"'"]' <<<"$MAINTENANCE_HOSTS")

    PARAMS='{
            "maintenanceid": "'$ZBX_MAINTENANCE_ID'",
            "hostids": '$HOSTS'
        }'
    zbx_request "maintenance.update" "$PARAMS" > /dev/null
}

function maintenance_remove {
    MAINTENANCE_HOSTS=$(maintenance_get)
    HOSTID=$(host_get "$1")

    HOSTINDEX=$(jq '. | index("'"$HOSTID"'")' <<<"$MAINTENANCE_HOSTS")
    if [ "$HOSTINDEX" == "null" ]; then return; fi
    HOSTS=$(jq 'del(.['"$HOSTINDEX"'])' <<<"$MAINTENANCE_HOSTS")
    PARAMS='{
            "maintenanceid": "'$ZBX_MAINTENANCE_ID'",
            "hostids": '$HOSTS'
        }'
    zbx_request "maintenance.update" "$PARAMS" > /dev/null
}

RESPONSE=$(curl -s -X POST -H 'Content-Type: application/json-rpc' \
-d '
{"jsonrpc":"2.0","method":"user.login","params":
{"user":"'"$ZBX_USERNAME"'","password":"'"$ZBX_PASSWORD"'"},
"id":1,"auth":null}
' "$ZBX_URL")
echo "$RESPONSE" | jq --exit-status .result > /dev/null || api_error "$RESPONSE"
AUTH=$(echo "$RESPONSE" | jq -r .result)

case $2 in
    maintenance_add)
        maintenance_add "$1"
    ;;
    maintenance_remove)
        maintenance_remove "$1"
    ;;
    *)
        echo "invalid state: $2"
        usage
        exit 2
    ;;
esac
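
The two maintenance functions in the script follow the same pattern: read the current host list of the maintenance period, add or remove one host ID, and send the full list back with `maintenance.update`, because the update call replaces the list of hosts rather than applying a delta. Here is a minimal Python sketch of just that list handling, with hypothetical function names and host IDs, and no API calls:

```python
def add_host(current_hostids, hostid):
    """Return the host list with hostid included (no duplicates)."""
    if hostid in current_hostids:
        return list(current_hostids)
    return list(current_hostids) + [hostid]


def remove_host(current_hostids, hostid):
    """Return the host list without hostid."""
    return [h for h in current_hostids if h != hostid]


def update_params(maintenance_id, hostids):
    """Build the params object for maintenance.update."""
    return {"maintenanceid": str(maintenance_id), "hostids": list(hostids)}


if __name__ == "__main__":
    hosts = ["10084", "10085"]
    print(update_params(431, add_host(hosts, "10090")))
    print(update_params(431, remove_host(hosts, "10085")))
```

Both operations are idempotent, so a repeated trigger or recovery event does not change the result. As far as I know, a maintenance period needs at least one host or host group, so removing the last host may be rejected by the API.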

Script configuration:

Name: Run Script zbx-set-host.sh maintenance_add
- Type: Script
- Commands: /path/to/zbx-set-host.sh {EVENT.OPDATA} maintenance_add
Name: Run Script zbx-set-host.sh maintenance_remove
- Type: Script
- Commands: /path/to/zbx-set-host.sh {EVENT.OPDATA} maintenance_remove

Trigger action:

Name: Disable host in stand by (power saving)
Conditions:
- Value of tag powersave equals standby
Operations:
- Operation: Run script "Run Script zbx-set-host.sh maintenance_add"
- Target list: current host
Recovery operations:
- Operation: Run script "Run Script zbx-set-host.sh maintenance_remove"
- Target list: current host
