My architecture in AWS is as follows:
There are 2 identical Zabbix agents (based on zabbix/zabbix-agent:centos-4.0.11), each running on a different EC2 instance. The Zabbix server runs on a third instance (also dockerized, via dockbix, version 4.0 as well), and all three instances are in the same VPC.
The idea is to have a Network Load Balancer listening on the port both agents use (10050), with the two aforementioned instances registered in its target group. The DNS name of this NLB is then provided as the interface in the Zabbix host configuration. The goal is to have multiple Zabbix hosts target the same NLB and have their requests routed, according to traffic load, to either agent. Each host has a Zabbix agent item that invokes a UserParameter (a Python script) defined in each of the two agent conf files.
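Roughly, the setup looks like this with the AWS CLI (all names, IDs, ARNs, and the item key/script path below are placeholders, not my real values):

# Internal NLB inside the VPC
aws elbv2 create-load-balancer \
    --name zabbix-agents-nlb --type network --scheme internal \
    --subnets subnet-aaaa1111

# TCP target group on the agent port
aws elbv2 create-target-group \
    --name zabbix-agents-tg --protocol TCP --port 10050 \
    --vpc-id vpc-0123456789

# Register both agent instances in the target group
aws elbv2 register-targets \
    --target-group-arn <tg-arn> \
    --targets Id=i-agent1 Id=i-agent2

# Listener that forwards 10050 to the target group
aws elbv2 create-listener \
    --load-balancer-arn <nlb-arn> --protocol TCP --port 10050 \
    --default-actions Type=forward,TargetGroupArn=<tg-arn>

# On each agent instance, the UserParameter line in the agent conf
# (the key name and script path are made up for illustration)
cat >> /etc/zabbix/zabbix_agentd.d/custom.conf <<'EOF'
UserParameter=custom.stats,/usr/bin/python /opt/scripts/stats.py
EOF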
My problem is as follows: zabbix_get (and the equivalent call made automatically according to the interval set in the host configuration) occasionally times out. One time I get a successful response:
{"response":"success","info":"processed: 4; failed: 0; total: 4; seconds spent: 0.000106"}
(the Python script used is pretty fast; it takes just 1 second), and other times I get a response such as:
zabbix_get [4515]: Timeout while executing operation.
And it happens in strict alternation: one request succeeds, the next times out, then the next succeeds, and so on.
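The alternation is easy to reproduce with a quick loop from the server instance (the hostname below is a placeholder for the NLB DNS name):

# Every other call fails with "Timeout while executing operation."
for i in $(seq 1 10); do
    zabbix_get -s my-nlb-xxxx.elb.eu-west-1.amazonaws.com -p 10050 -k agent.ping || echo "request $i timed out"
done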
I have tried testing the connection with telnet, and it works every time. I have even tried a simple TCP echo container, which also worked fine every time.
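For reference, the echo test was along these lines; the socat image and exact invocation here are illustrative rather than my actual commands:

# On an agent instance: a plain TCP echo server on the agent port
docker run -d --name echo-test -p 10050:10050 alpine/socat \
    TCP-LISTEN:10050,fork,reuseaddr EXEC:cat

# From the server instance, via the NLB:
telnet my-nlb-xxxx.elb.eu-west-1.amazonaws.com 10050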
Any ideas on what might be wrong would be greatly appreciated :)
EDIT: Just wanted to note that this behavior occurs not only with my custom UserParameter script, but also with built-in agent items such as agent.version, agent.ping, net.tcp.port[<serverIp>,10051], and so on.
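For example (the hostname is again a placeholder for the NLB DNS name):

zabbix_get -s my-nlb-xxxx.elb.eu-west-1.amazonaws.com -k agent.version
zabbix_get -s my-nlb-xxxx.elb.eu-west-1.amazonaws.com -k agent.ping
zabbix_get -s my-nlb-xxxx.elb.eu-west-1.amazonaws.com -k "net.tcp.port[<serverIp>,10051]"

These bypass my script entirely, so the problem is not in the UserParameter itself.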
EDIT2: Running tcpdump src <serverIp> inside the agent instances shows similar traffic for both a successful and a timed-out request.
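The capture was roughly this, run on each agent instance (the interface name is a guess; substitute the instance's primary interface):

tcpdump -i eth0 -nn src <serverIp> and tcp port 10050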
So apparently I needed to enable cross-zone load balancing for my internal NLB. All my instances were in a single Availability Zone, while the NLB had a node in each of its enabled AZs; without cross-zone load balancing, requests that landed on the NLB node in the AZ with no registered targets could not be forwarded anywhere, which is why every second request timed out.
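For anyone hitting the same thing: cross-zone load balancing is an attribute on the load balancer itself, e.g. with the AWS CLI (the ARN is a placeholder):

aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn <nlb-arn> \
    --attributes Key=load_balancing.cross_zone.enabled,Value=true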