I have server A and server B, with A attempting to open TCP connections to server B.
curl shows that approximately 5-10% of connection attempts time out.
mtr shows 0% loss to B and at intermediate hops (except those hops that do not respond at all).
mtr --tcp shows ~5% loss to B and 0% loss at intermediate hops (with the same caveat).
I've verified these results on multiple occasions.
Server A is an AWS EC2 instance which we've set up in a dedicated VPC to rule out any issues on our end. We've been unable to reproduce this issue from Azure, GCP, a business ISP and a consumer ISP. Interestingly, when originating from AWS, we have been able to reproduce it via two different transit providers. However, I can't say for sure that they use the same hops leading into B, as we don't get mtr responses for a few hops prior to B.
Server B is a CDN edge server. (We're their customers as well but this doesn't seem relevant as we're now troubleshooting at the IP address level and it only affects a few of their edge servers.)
The issue is reproducible with both TCP and UDP, but we have been unable to reproduce it over ICMP, regardless of packet size. In addition, the route taken appears to differ meaningfully between TCP/UDP and ICMP: ICMP packets seem to follow one route and one route only, whereas TCP/UDP packets follow multiple routes, none of which is the ICMP route.
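Since TCP/UDP packets take several routes, one thing that could be checked is whether the timeouts cluster on particular source ports. This rests on an assumption not confirmed above: that the path is chosen per flow by hashing the address/port tuple. Below is a minimal Python sketch of such a probe. The function names (probe, survey, suspect_ports) and the port range are hypothetical, and SO_LINGER is set to zero only so that the same source port can be reused immediately.

```python
import socket
import struct
from collections import Counter


def probe(host, port, src_port, timeout=3.0):
    """Try one TCP connect from a fixed source port.

    Returns "ok", "timeout", or "error" (e.g. bind failure or refusal).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        # Linger of 0 makes close() send a RST, so the 4-tuple does not sit
        # in TIME_WAIT and can be reused by the next attempt.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
        s.settimeout(timeout)
        try:
            s.bind(("", src_port))
            s.connect((host, port))
            return "ok"
        except socket.timeout:
            return "timeout"
        except OSError:
            return "error"


def survey(host, port, src_ports, attempts=5, timeout=3.0, probe_fn=probe):
    """Probe each source port several times and tally the outcomes."""
    results = {}
    for sp in src_ports:
        results[sp] = Counter(
            probe_fn(host, port, sp, timeout) for _ in range(attempts)
        )
    return results


def suspect_ports(results):
    """List (source_port, timeout_fraction) for ports that timed out, worst first."""
    rates = []
    for sp, counts in results.items():
        total = sum(counts.values())
        if total and counts["timeout"]:
            rates.append((sp, counts["timeout"] / total))
    return sorted(rates, key=lambda r: r[1], reverse=True)
```

For example, suspect_ports(survey("203.0.113.10", 443, range(40000, 40064))) would list the source ports that timed out, where 203.0.113.10 is a placeholder for B's address. If a few ports fail every time while the rest never do, that would suggest a single bad path among the multiple TCP/UDP routes. Failures spread evenly across ports would point elsewhere.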
Is there any further troubleshooting/investigation I can meaningfully do to narrow down why/where this is happening? AWS and the server B CDN provider are pointing the finger at each other unhelpfully, so any further evidence would be extremely helpful.