I'm handling large numbers of concurrent downloads (approx. 500 per server) using Java.
All the files are being downloaded from Amazon S3, and the downloading server is an EC2 m1.large instance.
Occasionally, two or more of the streams are broken simultaneously, resulting in a java.net.SocketException; sometimes as many as 10 streams break at once.
I see the same behavior downloading from both Amazon S3 and Akamai servers, and it only happens once the load gets fairly high (200 or more concurrent downloads).
I'm well within normal CPU, network load and memory bounds.
I strongly suspect the problem is on my server rather than on S3's or Akamai's side. How can I debug this and track down the cause?
You could capture the traffic with `tcpdump` and look at it after the connections break. Wireshark, for example, has a "Follow TCP Stream" option that lets you easily isolate a broken connection once you locate its last packet. It may still be quite a lot of data to go through, but since you say it only happens when the load is quite high, I don't think there's a way around that.
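A minimal capture could look like the sketch below; the interface name (`eth0`) and the port filter are assumptions, so adjust them to your setup:

```
# Capture full packets (-s 0) of the download traffic and write them to a file.
# Interface name and port filter are examples; adjust to your environment.
sudo tcpdump -i eth0 -s 0 -w /var/tmp/downloads.pcap 'tcp port 80 or tcp port 443'
```

After a failure, open the file in Wireshark, locate the last packet of a broken connection (for example by filtering on the client port reported in your application's exception log), and use "Follow TCP Stream" on it.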
As a start, you could look at the errors reported by the network interface (via `ifconfig`) and see whether that number increases significantly when connections are dropped. Is there a firewall/NAT on the path between you and S3?
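A rough sketch of what to check, again assuming the interface is `eth0`:

```
# RX/TX error and drop counters reported by the interface; run this repeatedly
# while the problem occurs and watch whether the numbers climb.
ifconfig eth0 | grep -E 'errors|dropped|overruns'

# Driver/NIC-level counters (ring buffer drops etc.), if the driver supports it:
ethtool -S eth0 | grep -iE 'drop|err'

# Kernel TCP statistics such as retransmits and resets:
netstat -s | grep -iE 'retrans|reset'
```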
Could you simultaneously capture (`tcpdump -w file -s 0`) the traffic at two points, between your server and the firewall and between the firewall and S3, and then compare the dumps? Before launching tcpdump, make sure the clocks of the capturing hosts are precisely synchronized using NTP. Then compare both captures at the point in time when the connection was dropped.
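As a sketch, on each capture host (interface names and the filter are placeholders for your actual setup):

```
# Verify the clock is NTP-synchronized before capturing
# (the offset column should be small, ideally a few milliseconds).
ntpq -p

# On the capture point between your server and the firewall:
sudo tcpdump -i eth0 -s 0 -w inside.pcap 'tcp port 80 or tcp port 443'

# On the capture point between the firewall and S3/the Internet:
sudo tcpdump -i eth0 -s 0 -w outside.pcap 'tcp port 80 or tcp port 443'
```

With synchronized clocks you can line the two dumps up by timestamp and see whether the packet that kills the connection (an unexpected RST, for instance) originates on your side of the firewall or beyond it.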
I had a similar elusive problem, and by comparing network traffic dumps I discovered that it was due to SACK being enabled on my Linux server but improperly handled by the Cisco ASA firewall that processed the traffic coming from the Internet. I had to disable SACK using sysctl (`net.ipv4.tcp_sack`).
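If you want to test whether SACK is involved in your case too, the change is a one-line sysctl; note that disabling SACK can hurt throughput on lossy links, so treat it as a diagnostic step rather than a permanent fix unless it clearly helps:

```
# Check the current value (1 = SACK enabled)
sysctl net.ipv4.tcp_sack

# Disable SACK at runtime for a test
sudo sysctl -w net.ipv4.tcp_sack=0

# Persist across reboots only if it turns out to help
echo 'net.ipv4.tcp_sack = 0' | sudo tee -a /etc/sysctl.conf
```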