We have a server that receives data as a TCP client, processes it, and serves the processed data to clients as a TCP server. It also stores this data on disk and can serve it from files instead of the real-time stream.
The problem is that this service has to be available 24x7, with no interruptions allowed. Right now this is done with two servers, one acting as a hot backup: clients maintain connections to both servers, and if something happens to the primary server, they just switch to the backup. While this solution has worked for about 15 years, it is inconvenient and puts a lot of failover logic on the clients.
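To make it concrete what kind of failover logic ends up on the clients, here is a minimal Python sketch of the idea. It is not our actual client code: the function name `stream_with_failover` and its parameters are made up for illustration, and it ignores the harder problem of lining up the exact stream position between the two servers at the moment of the switch.

```python
import selectors
import socket


def stream_with_failover(primary, backup, bufsize=4096):
    """Yield data from the primary socket, switching to the backup on failure.

    Hypothetical helper for illustration only. Both sockets are drained
    continuously; data from the standby is discarded so that it does not
    pile up while the active connection is healthy.
    """
    sel = selectors.DefaultSelector()
    sel.register(primary, selectors.EVENT_READ)
    sel.register(backup, selectors.EVENT_READ)
    active = primary
    live = {primary, backup}
    while live:
        # Handle the active socket first, so that its failure is noticed
        # before any standby data from the same round is thrown away.
        events = sorted(sel.select(), key=lambda e: e[0].fileobj is not active)
        for key, _ in events:
            sock = key.fileobj
            try:
                chunk = sock.recv(bufsize)
            except OSError:
                chunk = b""  # treat a socket error like a closed connection
            if not chunk:
                sel.unregister(sock)
                live.discard(sock)
                if sock is active and live:
                    active = next(iter(live))  # promote the surviving socket
                continue
            if sock is active:
                yield chunk
            # else: chunk came from the standby and is dropped
    sel.close()
```

Even this simplified version has to watch two connections at once and decide when to switch, and every client has to carry logic like this.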
Lately people have started talking about using a cluster to ensure availability of this service, but no matter how hard I search, I can't find any clustering solution that allows transparent TCP connection failover, so that no one would notice that something happened to the server. There are some research papers around, but I was unable to find any working implementations. Here is how I think it should work:
Both servers receive the data via TCP. Ideally it should look like a single connection to the "outside" world, to save bandwidth and, more importantly, to ensure that both servers receive identical data streams.
When a client connects to the cluster IP, it receives the processed data over a single connection, but both servers should see this connection and produce the data. Only one of the streams actually reaches the client; the backup one goes to /dev/null, so to say.
When the active server fails (that is, it doesn't transmit any data for some time, say 5 seconds), the client should continue receiving the same stream within the same connection. The switch needs to happen quickly, so that the overall streaming delay doesn't exceed approximately 10 seconds.
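The failure criterion above (no data for about 5 seconds) can be sketched in Python as a simple silence watchdog. This is only an illustration of the rule, not part of any existing clustering product: the class name `SilenceWatchdog` and its methods are hypothetical, and the 5-second default is just the example value from the description.

```python
import time


class SilenceWatchdog:
    """Declare a stream failed when no data has been seen for `timeout` seconds.

    Hypothetical helper for illustration. `clock` is injectable so the
    rule can be exercised without real waiting.
    """

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_data = clock()  # start counting from creation time

    def data_seen(self):
        """Call whenever the monitored server transmits data."""
        self.last_data = self.clock()

    def has_failed(self):
        """Return True once the silence has lasted at least `timeout` seconds."""
        return self.clock() - self.last_data >= self.timeout
```

Whatever component does the switching would call `data_seen()` for every chunk sent by the active server and poll `has_failed()` to decide when to let the backup stream through. With a 5-second detection window, roughly 5 more seconds remain for the switch itself if the total delay is to stay under 10 seconds.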
Reliability is the most important thing here; quick failover comes next. Open source Linux solutions are preferred, but if commercial and/or non-Linux near-perfect solutions exist, I would like to know about them too. Solutions that impose a lot of restrictions or require modifications to the server application software are perfectly acceptable too.