I started running a cluster of multiple machines on AWS EC2. Since I started this project, I have been seeing costs for regional traffic in my billing information:
regional data transfer - in/out/between EC2 AZs or using elastic IPs or ELB
According to the name, there are three possibilities:
- different Availability Zones
- communication using elastic IPs
- using an Elastic Load Balancer (ELB)
My machines were in different AZs, which would explain the charge. So I fixed that: all machines are now in the same AZ, but the costs have kept increasing for 24 hours now (there were 3 billing updates during that time). So it seems that putting all machines into the same AZ did not solve it.
However, I use neither Elastic IPs nor an ELB. When I open the corresponding pages in the AWS console, I just get an empty list with a message that I do not have any such components at the moment.
Another serverfault question mentions that this also happens when public IP addresses are used for communication, but a GitHub discussion states that even the public DNS name is resolved to the internal IP when used from inside EC2 (traffic addressed to the public IP itself, however, always goes through the external network and would indeed trigger costs).
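You can check the DNS part yourself from inside an instance (the hostname below is only a placeholder for a real instance's public DNS name):
getent hosts ec2-203-0-113-25.compute-1.amazonaws.com
# inside the VPC this resolves to the private 172.31.x.x address;
# from outside AWS it resolves to the public IP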
If I track my network communication from the master and one of the slaves in my cluster using
sudo tcpdump -i eth0 | grep -v $MY_HOSTNAME
I can see only internal traffic like this:
IP ip-172-31-48-176.ec2.internal.56372 > ip-172-31-51-15.ec2.internal.54768
So my question: how can I find out which component is causing this regional traffic?
tl;dr
The huge amount of regional traffic was caused by an
apt-get update
on startup of the machine. At first I suspected the software I am running on the cluster, because it sends out a hell of a lot of DNS requests - it probably does not do any DNS caching, and the DNS server is in another Availability Zone.
Full way to debug such stuff
I debugged this with a friend; here is how we arrived at the solution, so that everyone with this issue can follow along:
First of all, from the billing dashboard you can see that the cost is $0.01 per GB. This matches the corresponding points on the EC2 pricing page, which go into a bit more detail.
Next I checked the AWS explanation of Availability Zones and Regions. What I have to pay for is definitely traffic that stays within the same region (us-east-1 in my case). It can either be traffic passing from one AZ to another AZ (which we knew before) or traffic using a public IP address or Elastic IP address within the same AZ (which we also knew from the other serverfault question). However, it now seems that this list is indeed exhaustive. So I knew I had to check the following:
Peered VPC
VPC is its own service, so go to the VPC console. There you can see how many VPCs you have. In my case there was only one, so peering is not possible at all. But you can still go to Peering Connections and check whether anything is set up there.
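The same check works from the AWS CLI (assuming it is installed and configured):
aws ec2 describe-vpcs --query 'Vpcs[].[VpcId,CidrBlock]' --output table
aws ec2 describe-vpc-peering-connections --output table
# a single VPC and an empty peering list rule out cross-VPC traffic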
Subnets
From the Subnets section of the VPC console we also found an important clue for further debugging: the IP ranges of the different Availability Zones in us-east-1:
- 172.31.0.0/20 for us-east-1a
- 172.31.16.0/20 for us-east-1b
- 172.31.32.0/20 for us-east-1e
- 172.31.48.0/20 for us-east-1d
Since all my machines should be in us-east-1d, I verified that. And indeed they all had IPs starting with 172.31.48, 172.31.51 and 172.31.54. So far, so good.
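For reference, the same overview of subnets, their CIDR blocks and Availability Zones can also be pulled with the AWS CLI (assuming it is installed and configured):
aws ec2 describe-subnets --query 'Subnets[].[AvailabilityZone,CidrBlock,SubnetId]' --output table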
tcpdump
This then finally helped us set the right filters for tcpdump. Now that I knew which IPs I should be communicating with in order to avoid costs (only the 172.31.48.0/20 network), we set up a filter for tcpdump. This removed all the noise that had kept me from seeing the external communication. Besides, before this I did not even know that communication with [something].ec2.internal could be a problem, since I did not know enough about regions, AZs and their respective IP ranges. First we came up with this tcpdump filter:
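A minimal sketch of such a filter, using only the 172.31.48.0/20 range identified above:
sudo tcpdump -i eth0 'not (src net 172.31.48.0/20 and dst net 172.31.48.0/20)'
# drops packets where both endpoints are inside the us-east-1d subnet,
# so everything that remains involves an address outside 172.31.48.0/20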
This should show all traffic coming in from everywhere but us-east-1d. It showed a lot of traffic from my SSH connection, but I also saw something weird flying by - an ec2.internal address. Shouldn't those all have been filtered out, because we no longer show AZ-internal traffic? But this one is not internal! It is from another AZ, namely us-east-1a: it comes from the DNS system. I extended the filter to check how many of these messages occur:
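Again only a sketch, building on the filter above by dropping the SSH session and keeping only DNS traffic:
sudo tcpdump -i eth0 'not (src net 172.31.48.0/20 and dst net 172.31.48.0/20) and not port 22 and udp port 53'
# stop with Ctrl-C after a few seconds; tcpdump reports how many packets it captured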
I waited 10 seconds, stopped the logging, and there had been 16 DNS responses! Since the DNS server sits in another Availability Zone and the cluster software apparently does not cache lookups, I installed dnsmasq as a local DNS cache.
Next days, still the same problem
However, after installing dnsmasq nothing changed. There were still several GB of traffic whenever I used the cluster.
Day by day I removed more tasks from the cluster, and finally tried one day without any startup scripts (no unexpected traffic!) and one day with only the startup scripts plus an immediate shutdown (traffic!).
The analysis of the startup script revealed that apt-get update and apt-get install ... are the only parts pulling large amounts of data. Through some googling I learned that Ubuntu indeed hosts a package repository inside AWS. This can also be seen from the sources.list. Resolving the repository hostname leads to four IP addresses.
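To reproduce this check, look at the repository hostnames in sources.list and resolve them; the us-east-1.ec2.archive.ubuntu.com name below is the usual pattern for Ubuntu images on EC2 and should be verified against your own sources.list:
grep '^deb ' /etc/apt/sources.list
getent ahosts us-east-1.ec2.archive.ubuntu.com
# lists the addresses the repository hostname resolves to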
So I set up VPC Flow Logs and logged the cluster during boot. Then I downloaded the log files and ran them through a Python script that sums up all bytes transferred to any of these 4 IP addresses. The result matches my traffic: during the last test I had 1.5 GB of traffic with 3 clusters of 5 machines each, and according to the flow logs each machine pulls about 100 MB from the Ubuntu repository.
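As a rough shell/awk equivalent of that Python script - assuming the default flow log record format (destination address in column 5, bytes in column 10), with IP1..IP4 standing in for the four resolved repository addresses and flowlog.txt for the downloaded log file:
awk '$5 == "IP1" || $5 == "IP2" || $5 == "IP3" || $5 == "IP4" { sum += $10 }
     END { printf "%.1f MB\n", sum / 1024 / 1024 }' flowlog.txt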