This is a follow-on question from this previous question, created because I found out more information and it's cleaner to pose this as a new question.
I'm using syslog-ng OSE v3.31.2 to receive RFC3164 syslog messages over UDP port 514 from a bunch of clients, write them to a file, and forward them to telegraf as RFC5424 over non-TLS TCP on port 601 for insertion into an InfluxDB database.
My syslog-ng config is:
@version: 3.29
@include "scl.conf"
options {
flush-lines(1);
};
source s_network {
udp(ip(0.0.0.0) port(514));
};
destination d_file {
file("/var/log/messages");
};
destination d_telegraf {
syslog("telegraf" port(601) transport(tcp));
};
log {
source(s_network);
destination(d_telegraf);
destination(d_file);
};
The relevant part of my telegraf config looks like this:
[global_tags]
[agent]
interval = "100ms"
round_interval = true
metric_buffer_limit = 10000
flush_buffer_when_full = true
collection_jitter = "0s"
flush_interval = "100ms"
flush_jitter = "0s"
debug = true
quiet = false
[[outputs.influxdb]]
urls = ["http://influxdb:8086"]
database = "logs_db"
[[inputs.syslog]]
server = "tcp://telegraf:601"
Essentially syslog-ng is set up to forward syslog entries over a TCP connection to telegraf.
The problem is that I'm seeing syslog-ng suffer frequent TCP disconnections from telegraf. These show up in the syslog-ng log as:
[2021-11-17T02:55:32.662972] EOF occurred while idle; fd='12'
[2021-11-17T02:55:32.663102] Syslog connection closed; fd='12', server='AF_INET(192.168.0.6:601)', time_reopen='60'
[2021-11-17T02:56:32.719139] Syslog connection established; fd='12', server='AF_INET(192.168.0.6:601)', local='AF_INET(0.0.0.0:0)'
This disconnection is usually triggered when I send a log to syslog-ng with:
logger -i -d --server localhost test
But if I just leave it all idle I'll also get:
[2021-11-17T02:57:05.392356] EOF on control channel, closing connection;
In these cases, 192.168.0.6 is the telegraf server.
Although I can set the option time-reopen(1) to speed up the reconnection, I'd prefer to find the root cause and prevent the disconnection in the first place.
Is it possible that there is an incompatibility between syslog-ng and telegraf that is causing this EOF and an unclean disconnection?
All of this is running within a docker-compose stack on a single host.
EDIT: I've started looking into RFC5424 and RFC6587. Using Wireshark to sniff packets out of syslog-ng, destined for telegraf, I've determined that these are using octet-stuffing (aka non-transparent framing), rather than octet-counting, which telegraf expects by default. The payload of each syslog message to telegraf begins with a "<" character rather than an integer.
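To make the two RFC6587 framings concrete, here is a minimal Python sketch of how each one wraps a message, plus a first-byte check like the one I did by eye in Wireshark. The function names and the sample message are made up for illustration; this is not code from syslog-ng or telegraf, and the first-byte check is only a heuristic.

```python
def frame_octet_counting(msg: bytes) -> bytes:
    """Octet-counting: decimal message length, one space, then the message."""
    return str(len(msg)).encode("ascii") + b" " + msg


def frame_non_transparent(msg: bytes, trailer: bytes = b"\n") -> bytes:
    """Non-transparent framing: the message followed by a trailer (LF here)."""
    return msg + trailer


def guess_framing(payload: bytes) -> str:
    """Heuristic only: inspect the first byte of a captured TCP payload."""
    first = payload[:1]
    if first.isdigit() and first != b"0":
        # A length prefix starts with a non-zero digit.
        return "octet-counting"
    if first == b"<":
        # The message starts directly with the <PRI> field.
        return "non-transparent"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical RFC5424-style message, used only to show the framing.
    sample = b"<13>1 2021-11-17T02:55:32Z host logger 1234 - - test"
    for framed in (frame_octet_counting(sample), frame_non_transparent(sample)):
        print(guess_framing(framed), framed)
```

One caveat with inspecting captures this way: a TCP segment does not have to start on a frame boundary, so the first byte of an arbitrary segment is not always the first byte of a frame.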
I hypothesise that telegraf is accepting these messages but getting stuck parsing them, and therefore closing the connection. The first FIN to close the connection comes from telegraf.
Unfortunately, when I set telegraf to accept non-transparent framing it rejects the entire entry, and I haven't worked out why yet.
I also haven't yet figured out how to configure syslog-ng to output messages with octet-counting framing.
At least the EOF message and disconnection have stopped happening, but I'm not sure that means much if telegraf is rejecting all messages outright.
I've determined that syslog-ng is sending octet-counting framed messages to telegraf.
The cause of this issue is that telegraf is disconnecting the TCP connection from syslog-ng after 5 seconds without receiving a message. This is contrary to the documentation provided with the telegraf syslog plugin, which states that this timeout only applies to the time to receive a single message, and not the time between messages. It may be an English language / interpretation issue though. Setting read_timeout to 0 in the telegraf config is sufficient to prevent telegraf from disconnecting.
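The behaviour is easy to reproduce outside telegraf. The Python sketch below is not telegraf's implementation (the helper name `serve_one` and the use of `socket.socketpair()` are my own assumptions for the demo); it only shows how a per-read deadline on the receiving side closes a connection that has gone quiet, and how the sender then sees EOF, which matches the "EOF occurred while idle" line in the syslog-ng log.

```python
import socket
from typing import Optional


def serve_one(conn: socket.socket, read_timeout: Optional[float]) -> str:
    """Read from conn until the peer closes or one read exceeds read_timeout.

    read_timeout=None means wait forever, which is the effect of disabling
    the timeout on the receiver. Returns the reason the loop stopped.
    """
    conn.settimeout(read_timeout)
    try:
        while True:
            data = conn.recv(4096)
            if not data:
                return "peer closed"
            # A real receiver would parse the syslog frames here.
    except socket.timeout:
        # No bytes arrived within the deadline: the idle link is dropped.
        return "idle timeout"
    finally:
        conn.close()


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    sender.sendall(b"10 <13>1 test")     # one message, then silence
    print(serve_one(receiver, 0.2))      # short deadline standing in for 5 s
    sender.settimeout(1.0)
    print(sender.recv(1))                # empty bytes means EOF on the sender side
    sender.close()
```

With the deadline applied to every read rather than to the completion of a message already in progress, any gap between log lines longer than the timeout ends the connection, which is what I observed.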