On a production server, I have an automated task that sends the same SSH commands and the same amount of data over the wire once a minute to a remote production server. The only thing that may change is a few values in the object. This process has been running in a program for years without issues. Without any local changes, we started seeing random instances of ECONNRESET and Connection lost before handshake errors. It started with a few a day and grew to multiple per hour. The destination server admin says their logs aren't providing useful info; they just show Received disconnect from <origin_ip> port 21549:11 or pam_unix(sshd:session): session closed for user <username>.
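For the escalation it helps to have the failure rate as a number rather than "multiple per hour". Assuming the automation writes its errors to a log whose lines start with an ISO-style timestamp (both the log path and that format are assumptions here), something like this groups the failures by hour:

# count ECONNRESET / handshake failures per hour (first 13 chars = YYYY-MM-DDTHH)
grep -hE 'ECONNRESET|Connection lost before handshake' /var/log/automation.log \
  | cut -c1-13 | sort | uniq -c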
Since the connection is initially successful (socket connected), ssh -vvv or the equivalent inside my SSH tooling hasn't been helpful in gathering additional data when the connection is broken before all of the data is sent. Sometimes connections break less than 12 seconds after the socket is connected.
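Since ssh -vvv stops being useful once the transport drops, one thing that can still be collected locally is the raw TCP exchange; a packet capture shows which side sends the first RST or FIN when a session dies (the interface name and file path below are placeholders):

# capture only the SSH traffic to the destination for later inspection in Wireshark/tcpdump
sudo tcpdump -i <iface> -n -s 0 -w ssh_to_dest_$(date +%s).pcap host <destination_ip> and tcp port 22

If the first RST consistently arrives inbound from the destination's address, the teardown is coming from something beyond the origin server (the path, a firewall in front of the destination, or sshd itself) rather than from the local machine.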
I ran mtr <destination_ip> to inspect the route. Across 9 hops there was packet loss only at the last hop, the destination, usually between 12% and 20% and never less than 6%. But since plain mtr uses ping/ICMP, which is sometimes throttled, I don't think it reliably confirms a problem with the SSH connection. So I ran mtr -T -P 22 <destination_ip> to probe over TCP on the SSH port instead. That frequently shows 0% loss across the first 8 hops and as much as 29% packet loss only at hop 9, the destination. Less frequently it shows as much as 50% packet loss at each of the first 8 hops and never reaches hop 9. Confusing.
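For something easier to hand to the remote team than an interactive mtr screen, report mode over TCP produces a plain-text snapshot (the probe count and output filename are arbitrary choices):

# 100 TCP probes to port 22, wide report written to a timestamped file
mtr -T -P 22 -c 100 -r -w <destination_ip> > mtr_tcp_$(date -u +%Y%m%dT%H%M%SZ).txt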
While running tests like the above, or just letting the automations retry on their own, the destination server eventually blocks all of my SSH connections. At that point ssh -vvv <user@destination_ip> hangs and then reports connection timed out:
ssh -vvv <user@destination_ip>
OpenSSH_7.6p1 Ubuntu-4ubuntu0.7, OpenSSL 1.0.2n 7 Dec 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug2: resolving "<destination_ip>" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to <destination_ip> [<destination_ip>] port 22.
debug1: connect to address <destination_ip> port 22: Connection timed out
ssh: connect to host <destination_ip> port 22: Connection timed out
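While in that blocked state, it's worth checking whether the TCP handshake itself ever completes; a plain port probe separates "sshd is unhappy" from "packets are dropped before sshd ever sees them" (nc here is the OpenBSD netcat shipped with Ubuntu; -z scans without sending data, -v prints the result, -w sets a timeout in seconds):

nc -vz -w 5 <destination_ip> 22

A timeout here, while other hosts can still reach port 22, points at a firewall or rate limiter in front of (or on) the destination blocking the origin IP rather than at the SSH daemon itself.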
To resolve the connection timed out, the destination server admin said he restarts the SSH server. At that point I can connect again, but the random disconnects continue until I'm eventually blocked completely again.
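The pattern of "blocked until the admin restarts something, then blocked again later" is what an automatic ban tool or rate limiter tends to produce, so one concrete thing to ask the destination admin to check (purely an assumption that they run fail2ban or plain iptables; adjust to whatever they actually use) is:

# does an sshd jail currently list the origin IP as banned?
sudo fail2ban-client status sshd
# is there a firewall rule matching the origin IP?
sudo iptables -L -n | grep <origin_ip>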
pfSense is the firewall for the origin server's network, along with Ubiquiti switches. The origin firewall shows no blocked SSH connections and never more than 2-3 SSH connections to the destination server at the same time.
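On the pfSense side, the state table can also be checked directly from a shell on the firewall, which shows whether the SSH sessions sit in a normal ESTABLISHED state or are being recycled rapidly (pfctl is part of pf; the grep target is just the destination address):

pfctl -ss | grep <destination_ip>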
Is the above sufficient to suggest that the problem is at least not my server and is likely the destination server (hop 9)? Is there anything else I should be looking at locally to isolate whether the cause is local?
I have full control over the local production server. The problem is that, without sufficient evidence to confirm the issue is not local, I'm having a hard time getting the remote team to do additional research on their end.
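To build an evidence trail that can be lined up against the destination's sshd log, one option is a simple loop that timestamps a TCP traceroute every minute (the log path, probe count, and interval are arbitrary):

# append a UTC timestamp plus a 60-probe TCP report once a minute
while true; do
    date -u +%FT%TZ >> /var/log/mtr_tcp_evidence.log
    mtr -T -P 22 -c 60 -r -w <destination_ip> >> /var/log/mtr_tcp_evidence.log
    sleep 60
done

If the loss spikes in that log line up with the ECONNRESET timestamps, and the packet captures show the resets arriving from outside, that is about as much as can be demonstrated from this side.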
End of story: the problem is identified. Since the TCP probes on port 22 show the loss at the destination, there is no room for a debate about throttled pings. If they will not take ownership of the issue, escalate.