We have a seemingly complex network issue that I have been struggling with for a while and wondered if anyone had any thoughts
We have a number of servers sitting in AWS London and another set of servers sitting in CoLos in London and Frankfurt
We connect from AWS to these CoLos using AWS DirectConnect
The AWS servers mainly connect to the CoLo servers using PGSQL, with the CoLo servers hosting the databases. Queries are executed back and forth all day long.
The CoLo servers also push a large volume of data back into AWS via Rsync and Logstash into Elastic. We never see any issues with the Rsync or Logstash pushes into Elastic.
In general everything works fine, except every so often one specific type of PGSQL query hangs between AWS and CoLo.
Going through the PCAP files we see the following:
- For most of the session the server (CoLo) sends 1460 Bytes of data and the client (AWS) ACKs that data. This carries on back and forth with no problems
- Then we see the server send usually between 5 and 20 segments that do not arrive at the client. The drop could occur anywhere, as the network path goes through 2-3 different carriers plus DirectConnect
- The client then correctly starts ACKing up to the last segment it received
- By the time the server has received the ACK indicating packet loss, it has sent another 10-20 segments
- Eventually the server starts resending the first missed segment which is received and ACKed right away.
- The server then sends two of the "current" segments back to back, but again the client ACKs to say it is still not caught up yet
- At this point the server starts an exponential backoff between sending a "current" segment and an old segment, until it reaches a 120-second gap
- The process repeats as follows:
  1. Server sends the most recent missing segment
  2. Client ACKs this segment right away
  3. Server sends two "current" segments in quick succession
  4. Client ACKs these with the same ACK number as in step 2
  5. Server backs off exponentially and goes back to step 1 (the gap grows until it reaches 120s)
- Eventually, after a long enough period (sometimes up to an hour), the server is able to resend all the lost segments and carries on as normal.
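For what it's worth, the 120s ceiling above matches Linux's retransmission timer behaviour: the RTO doubles on every unanswered retransmit and is capped at TCP_RTO_MAX (120s by default). A rough sketch of that arithmetic (the 200ms initial RTO is an assumed illustrative value; the real one is derived from the measured RTT):

```python
# Sketch of Linux TCP retransmission-timer backoff: the RTO doubles
# on each unanswered retransmission, capped at TCP_RTO_MAX (120s).
RTO_MAX = 120.0

def backoff_gaps(initial_rto=0.2, retries=15):
    """Return the wait before each successive retransmission attempt."""
    gaps, rto = [], initial_rto
    for _ in range(retries):
        gaps.append(rto)
        rto = min(rto * 2, RTO_MAX)  # exponential backoff, capped
    return gaps

gaps = backoff_gaps()
print(gaps)                     # 0.2, 0.4, 0.8, ... then pinned at 120.0
print(sum(gaps) / 60, "minutes across one backoff cycle")
```

If the connection has to grind through something like this cycle for each of the 10-20 lost segments, with the gaps already pinned at 120s, a recovery time approaching an hour seems entirely plausible.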
The following should be noted:
- At no point anywhere in the capture is a zero window size set; in fact, during the recovery the client is increasing its window by ~6 bytes with every ACK
- Running netstat shows no receive or transmit queuing during the incident
- Running the "ss" command seems to show that the CUBIC congestion-avoidance algorithm has kicked in
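One thing worth double-checking on the "~6 bytes" observation: the 16-bit window field in the TCP header is multiplied by a scale factor negotiated in the SYN (RFC 7323 window scaling). If the capture did not include the handshake, analysis tools can only display the raw field, so the real window growth may be 6 << scale bytes per ACK. A minimal sketch of that relationship (the scale of 7 below is purely illustrative, not taken from your capture):

```python
# The advertised receive window is the raw 16-bit header field shifted
# left by the window-scale option negotiated in the SYN (RFC 7323).
def effective_window(raw_window, scale):
    """Actual receive window in bytes given the raw header field."""
    return raw_window << scale

print(effective_window(6, 7))   # a "+6" in the raw field with scale 7 is 768 real bytes
```

So a seemingly tiny per-ACK increase in the capture may be less alarming than it looks, depending on whether your tool saw the SYN and applied the scale.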
Obviously any packet loss is not ideal, but my question is: surely it is not correct behaviour that losing 10-20 segments takes up to an hour to recover?
This looks very much like our issue: https://engineering.skroutz.gr/blog/uncovering-a-24-year-old-bug-in-the-linux-kernel/
I might speak to our CoLo server hosts and see about a kernel upgrade.