I have a Docker Swarm cluster with 12 nodes. Containers deployed on a single node can reach each other fine via an overlay network, but when they are deployed on different nodes there is a connectivity issue: hostnames resolve and I can ping one container from another, but when I try to reach the other container over TCP (for example with telnet), I get a long wait and then a connection timeout. The firewall on each node is already set up for Docker Swarm, with ports 2377, 7946 and 4789 open.
Example: on my master node I ran these commands to create services scheduled on different nodes:
docker network create -d overlay test_net
docker service create --constraint node.labels.first==true --name first --network test_net ubuntu/nginx:1.18-20.04_beta
docker service create --constraint node.labels.second==true --name second --network test_net ubuntu/nginx:1.18-20.04_beta
Then, from the container of service first, I run:
root@37be801ebe8b:/# ping second
PING second (10.0.5.18): 56 data bytes
64 bytes from 10.0.5.18: icmp_seq=0 ttl=64 time=0.092 ms
64 bytes from 10.0.5.18: icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from 10.0.5.18: icmp_seq=2 ttl=64 time=0.083 ms
64 bytes from 10.0.5.18: icmp_seq=3 ttl=64 time=0.073 ms
^C--- second ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.067/0.079/0.092/0.000 ms
But then, when I try to connect to the other container with telnet (nginx in that container is listening on port 80):
root@37be801ebe8b:/# telnet second 80
Trying 10.0.5.18...
telnet: Unable to connect to remote host: Connection timed out
Can someone suggest a workaround for this problem?
Found the answer here: https://stackoverflow.com/questions/66251422/docker-swarm-overlay-network-icmp-works-but-not-anything-else
The problem was bad checksums on the outbound packets, which caused the network interface to drop them.
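One way to look for this symptom is to capture the VXLAN traffic on the receiving node and let tcpdump verify checksums. This is only a rough sketch: the interface name eth0 is an assumption, and 4789 is the default Swarm data port mentioned in the question. Capturing on the sending node can be misleading, because with offloading enabled the checksum is filled in after tcpdump sees the packet.

```shell
# Run on the node hosting the "second" container while telnet is retried from "first".
# eth0 is a placeholder; use the interface that carries inter-node traffic.
# -vv makes tcpdump verify and report UDP checksums; -nn disables name lookups.
sudo tcpdump -nn -vv -i eth0 udp port 4789
```

Packets reported with a bad UDP checksum while the TCP connection attempt is hanging would be consistent with the explanation above.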
The solution was to disable checksum offloading using ethtool.
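A minimal sketch of what that can look like, assuming the interface carrying the overlay traffic is named eth0 (a placeholder; check yours with `ip link`). It needs root and has to be run on every node:

```shell
# Show the current offload settings for the interface (eth0 is an assumption).
sudo ethtool -k eth0 | grep -i checksum

# Turn off transmit checksum offloading so the kernel computes checksums itself.
sudo ethtool -K eth0 tx off
```

Note that lowercase `-k` only displays the settings, while uppercase `-K` changes them. The change is lost on reboot unless it is reapplied by whatever network configuration mechanism the node uses.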
I deployed my Docker Swarm cluster on multiple VMware ESXi VMs, tried the solution @hattivatt suggested, and it worked. However, the setting does not persist across reboots, so additional effort was required, and honestly it didn't seem like the proper way to do this.
I changed the data path port that Docker Swarm uses (4789 by default) and it worked!
According to Portainer and VMware, this can happen when NSX is in use, because it conflicts on the VXLAN port. However, I'm not using NSX!
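For reference, a rough sketch of how the data path port can be changed. As far as I know it can only be set when the swarm is initialized, so the cluster has to be re-created; the port 7789 here is an arbitrary example, not necessarily the value used above, and the new UDP port also has to be opened in the firewall on every node.

```shell
# On every node: leave the existing swarm.
# WARNING: this tears down the current cluster, including its services and overlay networks.
docker swarm leave --force

# On the first manager: re-create the swarm with a non-default VXLAN data port.
# 7789 is an example value chosen for illustration.
docker swarm init --data-path-port=7789

# Check which data path port the swarm is using.
docker info | grep -i "data path port"
```

After that, the other nodes rejoin with the `docker swarm join` command printed by `docker swarm init`, and the overlay networks and services need to be created again.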