I have a question regarding connection reliability when running a TCP server on an EC2 instance.
We are currently serving mobile customers around the world from the Oregon region using a c3.4xlarge EC2 instance. Our product is a live game server written in Python using the gevent framework. Right now we serve about 200-300 customers concurrently.
The issue is that a lot of our customers on the other side of the world have trouble connecting and staying connected to the server. The server consistently sees clients time out without the socket ever being closed. We're seeing gaps of more than 30 seconds without a heartbeat response.
Is it wrong of us to assume that a mobile client can establish a long-lived TCP connection from around the world and have it not be interrupted?
If so, what would be the best way to mitigate this problem?
If not, does anyone have any strategies for debugging the lost connections?
Thanks in advance :)
Yes, it's very wrong to assume that TCP is going to be completely reliable. You need to design your application with fault tolerance in mind. TCP connections will break, time out, and otherwise behave poorly given the vast array of client devices out there.
How you fix this depends greatly on your application, and is very off topic for Server Fault. You'd probably have better luck on Stack Overflow or the Game Development Stack Exchange.
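As one illustration of what designing for fault tolerance can look like on the client side, here is a minimal Python sketch of reconnecting with capped exponential backoff. The function names, the retry limits, and the 30-second read timeout (borrowed from the heartbeat gap mentioned in the question) are assumptions made for the example, not part of the original game code.

```python
import random
import socket
import time


def backoff_delays(base=1.0, cap=30.0, attempts=6):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def connect_with_retry(host, port, heartbeat_timeout=30.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Try to connect, sleeping with jittered backoff between failures.

    `connect` and `sleep` are injectable only so the sketch is easy to test.
    """
    last_error = None
    for delay in backoff_delays():
        try:
            sock = connect((host, port), timeout=10.0)
            # A recv() that waits longer than this raises socket.timeout;
            # the caller can treat that as a dead link and reconnect.
            sock.settimeout(heartbeat_timeout)
            return sock
        except OSError as exc:
            last_error = exc
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ConnectionError("server unreachable after retries") from last_error
```

The caller would wrap its read loop in a try/except for socket.timeout and OSError, call connect_with_retry again on failure, and then resume the session using whatever session token the game protocol provides.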
Spin up a micro or m1.small instance in an AWS region close to your end-users, with HAProxy installed on it.
Configure the proxy in TCP mode to listen on the appropriate port and relay the connections over to Oregon.
The proxy will actually be managing 2 separate connections for each session, one in each direction (from the user, and to your server) and you may find this setup helps stabilize things. The proxy will listen for connections, and each time one comes in, it will make a separate connection outbound to your server. Once that connection comes up, the proxy will blindly tie the data pipes from those connections together and hold the connections up until one end or the other drops -- or the proxy's internal idle timeout timer expires, which will also close the connections, so you may need to increase the timeouts from their default values.
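A minimal haproxy.cfg along these lines might look like the fragment below. Treat it as a sketch: the section name, the port 9000, the address 203.0.113.10 (a documentation-only address standing in for the Oregon server), and the 5-minute idle timeouts are all placeholders to adapt, not values taken from the original setup.

```
# Minimal haproxy.cfg sketch; names, port, and address are placeholders.
defaults
    mode tcp
    timeout connect 10s
    # Idle timeouts for each side of the relay, raised here so that
    # quiet sessions are not cut off between heartbeats.
    timeout client  5m
    timeout server  5m

listen game_relay
    # Port the mobile clients connect to on the proxy.
    bind *:9000
    # The real game server in Oregon (placeholder address).
    server oregon 203.0.113.10:9000
```

The idle timeouts should comfortably exceed the game's heartbeat interval; otherwise the proxy itself becomes a source of dropped connections.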
Theoretically, this should not matter, but in practice, the quality and reliability of the connections between the users and the proxy -- and between the proxy and your main server -- may be much better than the "direct" connections, making end-to-end connectivity more reliable.
You should find that HAProxy can handle hundreds of simultaneous connections on a very small server. It's not the only tool for this application, but it's the one with which I'm most familiar.