I have three servers, fully connected via wireguard. They run Ubuntu Server 22.04 and a postgresql repmgr cluster with streaming replication.
All computers have a public address, but the PostgreSQL instances and the database clients use the internal addresses (on the wireguard VPN).
On one of the computers, I see this in the logs:
2024-07-26 07:23:14.463 UTC [147915] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:25:56.242 UTC [148509] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:28:17.567 UTC [148818] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:33:13.234 UTC [149090] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 07:48:42.721 UTC [149723] FATAL: terminating walreceiver due to timeout
2024-07-26 07:52:17.298 UTC [151521] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:01:25.141 UTC [151889] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:02:16.337 UTC [152868] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:06:13.169 UTC [152951] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
2024-07-26 08:22:04.180 UTC [153377] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
Also, when I try to connect to the primary db from a Go or Python program, I sometimes see "connection timeout", "connection reset by peer", "connection was closed in the middle of operation" and similar messages. It is important to note that these only happen on one computer, and not on the others.
On the server side (primary postgresql) I see this in the logs:
2024-07-26 12:31:36.667 UTC [3778655] telegraf@telegraf LOG: could not receive data from client: Connection reset by peer
2024-07-26 12:31:36.897 UTC [3777638] telegraf@telegraf LOG: could not receive data from client: Connection reset by peer
2024-07-26 12:31:39.462 UTC [3775606] telegraf@telegraf LOG: could not receive data from client: Connection reset by peer
2024-07-26 12:31:39.480 UTC [3780628] telegraf@telegraf LOG: could not receive data from client: Connection reset by peer
These errors happen just a few times per hour. They are intermittent, but frequent enough to make my applications unreliable. I ran this ping test between the public addresses:
ping -c 3600 primary.public.com
# waited an hour...
--- primary.public.com ping statistics ---
3600 packets transmitted, 3600 received, 0% packet loss, time 3603052ms
rtt min/avg/max/mdev = 72.849/73.214/101.325/0.881 ms
I also ran a ping test on the private IP address:
ping -c 1008 primary.private.com
# waited...
--- primary.private.com ping statistics ---
1008 packets transmitted, 783 received, 22.3214% packet loss, time 1013304ms
rtt min/avg/max/mdev = 80.742/91.383/256.720/16.133 ms
In other words, 22% of the ping packets are lost over wireguard.
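To compare the public and the wireguard path repeatedly without reading the summaries by eye, the loss figure can be pulled out of the ping output. This is only a minimal sketch: `loss_percent` is a helper name made up for illustration, and it assumes the Linux iputils summary format shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ping output on stdin and print the
# packet-loss percentage from the summary line (iputils format assumed).
loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

Usage would be something like `ping -c 100 primary.public.com | loss_percent`, followed by the same command against `primary.private.com`.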
The MTU value of all wireguard devices is the default 1420:
3: dev0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
link/none
inet 10.241.64.3/32 scope global dev0
valid_lft forever preferred_lft forever
I also tried to test the MTU using this script:
size=1272
while ping -s $size -c1 -M do primary.internaladdress.com >&/dev/null; do
((size+=4))
done
echo "Max MTU size: $((size-4+28))"
And it also printed 1420.
Please note that the problem only exists between two of the three computers. E.g. it is bad between A and B, but it is good between B and C.
It must be noted that the problematic computer is far away (on a different continent). But that should not cause this.
As far as I understand, wireguard encapsulates IP packets into encrypted UDP packets, and the TCP protocol takes care of resending the packets that are lost.
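The encapsulation cost can be worked out with a bit of shell arithmetic. This is a rough sketch, not taken from any tool: `wg_outer_size` is a made-up helper, and it assumes the usual header sizes (20 bytes outer IPv4 or 40 bytes outer IPv6, 8 bytes UDP, 32 bytes WireGuard data-message overhead).

```shell
#!/usr/bin/env bash
# Hypothetical helper: size of the outer packet that carries one
# inner packet of the given size through the tunnel.
#   $1 = inner packet size (e.g. the tunnel MTU)
#   $2 = address family of the outer (public) connection: ipv4 or ipv6
wg_outer_size() {
  local inner="$1" family="$2" ip_hdr
  case "$family" in
    ipv4) ip_hdr=20 ;;   # outer IPv4 header
    ipv6) ip_hdr=40 ;;   # outer IPv6 header
    *) return 1 ;;
  esac
  # 8 bytes UDP header + 32 bytes WireGuard data-message overhead
  echo $(( inner + ip_hdr + 8 + 32 ))
}

wg_outer_size 1420 ipv4   # prints 1480
wg_outer_size 1420 ipv6   # prints 1500
```

Under these assumptions a full-size inner packet at MTU 1420 exactly fills a 1500-byte link when the outer connection is IPv6, and leaves 20 bytes spare over IPv4.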
It is very strange that IP packets between public addresses have a 0% drop rate, but wireguard/UDP packets have more than 20%. Is it possible that UDP packets are dropped by some router or switch? Maybe QoS is happening?
These servers are rented, and they are very far away from each other. Obviously, I can't do anything to eliminate packet drops. I understand that UDP will always be unreliable. But I wonder if I can fix TCP connections somehow. Even if they slow down sometimes (even if they cannot communicate for one or two seconds), they should not reset the connection. What are my options?
All right, here is what I have found. All three servers had IPv6 and IPv4 addresses, and these addresses were assigned to their FQDNs. For some reason, servers B and C used the IPv4 address when they established the connection, but A and B used the IPv6 address. It seems that encapsulating IPv4 packets in IPv6 UDP (wireguard) packets is problematic. I could not figure out the exact reason; it might be something outside of my VPS servers. But it is a fact that over IPv6, there was 20-70% packet loss, and it was the worst kind (e.g. no loss for one minute, then 100% loss for several seconds). Also, response times were ridiculous, 20-90 ms within the same datacenter.
Then I removed all IPv6 addresses from the public interfaces and forced all wireguard traffic to IPv4. All of a sudden, response time went down to about 2 ms on average, with 0% packet loss.
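To see which address family each peer connection actually uses, the output of `wg show <interface> endpoints` can be classified. The following is a minimal sketch: the function names are made up for illustration, `dev0` is the interface name from the question, and it relies on `wg` printing IPv6 endpoints as `[addr]:port` and IPv4 endpoints as `addr:port`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify one endpoint string as printed by
# `wg show <iface> endpoints`.
endpoint_family() {
  case "$1" in
    \[*\]:*)   echo ipv6 ;;     # e.g. [2001:db8::1]:51820
    *.*.*.*:*) echo ipv4 ;;     # e.g. 203.0.113.5:51820
    *)         echo unknown ;;  # e.g. (none)
  esac
}

# Hypothetical helper: print public key, endpoint and address family
# for every peer of an interface. Needs the wg tool and root privileges.
report_endpoints() {
  local iface="${1:-dev0}" key endpoint
  wg show "$iface" endpoints | while read -r key endpoint; do
    printf '%s %s %s\n' "$key" "$endpoint" "$(endpoint_family "$endpoint")"
  done
}
```

Running `report_endpoints dev0` on each server should show which pairs talk over IPv6. A less drastic alternative to removing the IPv6 addresses might be to put the peer's IPv4 address (instead of the FQDN) into the `Endpoint` setting of the wireguard configuration, but I have not tested that here.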
I could not determine the exact cause, but it is almost certain that the problem is NOT because the network is oversubscribed. It is not a real "solution", but it worked for me, and it might work for others.