We have a server that receives data (acting as a TCP client), processes it in some way, and serves the processed data to clients (acting as a TCP server). It also stores this data on disk and can serve it from files instead of the real-time stream.
The problem is that this service has to be available 24x7, with no interruptions allowed. Right now this is done by having two servers, one acting as a hot backup: clients maintain connections to both servers, and if something happens to the primary server, they just switch to the backup. While this solution has worked for about 15 years, it is kind of inconvenient and puts a lot of failover logic on the clients.
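For illustration, here is a rough sketch (in Python, with invented host names, ports, and timeouts) of the kind of logic each client currently has to carry: it keeps connections to both servers, reads and discards the standby stream, and promotes the backup if the active server goes quiet for too long. Reconnection and error handling are omitted.

    import selectors
    import socket
    import time

    # Invented addresses and timeout; the real servers and protocol differ.
    PRIMARY = ("primary.example.com", 9000)
    BACKUP = ("backup.example.com", 9000)
    SILENCE_TIMEOUT = 5.0  # seconds of silence on the active stream before switching


    def consume(handle_chunk):
        """Keep connections to both servers, but deliver only the active stream."""
        sel = selectors.DefaultSelector()
        for name, addr in (("primary", PRIMARY), ("backup", BACKUP)):
            sock = socket.create_connection(addr)
            sock.setblocking(False)
            sel.register(sock, selectors.EVENT_READ, name)

        active = "primary"
        last_data = time.monotonic()

        while True:
            for key, _ in sel.select(timeout=1.0):
                chunk = key.fileobj.recv(4096)
                if not chunk:
                    # Server closed the connection; stop watching it.
                    sel.unregister(key.fileobj)
                    key.fileobj.close()
                elif key.data == active:
                    handle_chunk(chunk)
                    last_data = time.monotonic()
                # Data from the standby server is read and thrown away.

            # Fail over if the active server has been quiet for too long.
            if time.monotonic() - last_data > SILENCE_TIMEOUT:
                active = "backup" if active == "primary" else "primary"
                last_data = time.monotonic()


    if __name__ == "__main__":
        consume(lambda chunk: print(f"got {len(chunk)} bytes"))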
Lately people have started talking about using a cluster to ensure availability of this service, but no matter how hard I search, I just can't find any clustering solutions that would allow transparent TCP connection failover, so that no one would notice that something happened to the server. There are some research papers around, but I was unable to find any working implementations. Here is how I think it should work:
Both servers receive the data via TCP. Ideally it should look like a single connection to the "outside" world, to save bandwidth and, more importantly, to ensure that both servers receive identical data streams.
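For concreteness, this is roughly the kind of stream duplication I mean, sketched as a tiny relay in Python (host names and ports are invented). Of course, a relay like this is itself a single point of failure, which is exactly the kind of thing I am trying to avoid:

    import socket

    # Invented addresses: where the upstream feed connects in, and the two cluster nodes.
    LISTEN_ADDR = ("0.0.0.0", 9100)
    NODES = [("node-a.example.com", 9000), ("node-b.example.com", 9000)]


    def relay():
        """Accept one upstream feed and copy every byte to both servers."""
        listener = socket.create_server(LISTEN_ADDR)
        upstream, _ = listener.accept()
        nodes = [socket.create_connection(addr) for addr in NODES]
        try:
            while True:
                chunk = upstream.recv(4096)
                if not chunk:
                    break  # upstream feed closed
                for node in nodes:
                    node.sendall(chunk)  # both servers see an identical byte stream
        finally:
            for node in nodes:
                node.close()
            upstream.close()
            listener.close()


    if __name__ == "__main__":
        relay()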
When a client connects to the cluster IP, it receives the processed data over a single connection, but both servers should see this connection and provide the data; only one of the streams actually reaches the client, while the backup stream goes to /dev/null, so to speak.
When the active server fails (i.e. it doesn't transmit any data for some time, say 5 seconds), the client should continue receiving the same stream within the same connection. The switchover needs to happen quickly enough that the overall streaming delay doesn't exceed approximately 10 seconds.
Reliability is the most important thing here; quick failover is the next. Open source Linux solutions are preferred, but if commercial and/or non-Linux near-perfect solutions exist, I would like to know about them too. Solutions that impose a lot of restrictions or require modifications to the server application software are also acceptable.
You could get a PhD in this stuff -- it's an immensely complicated problem. Or you could take the easier approach and fix the protocol so that it isn't so temperamental about connection failures. SMTP is a decent model for how to avoid most forms of failure-induced data loss.
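For example, if every record in the stream carried a sequence number, a client could reconnect to either server after a failure and ask it to replay everything it has missed. The sketch below (Python, with an invented line-delimited JSON framing and made-up field names) is only meant to show the idea; it would require changing the server software, which the question says is acceptable:

    import json
    import socket

    # Invented wire format: one JSON object per line, each carrying a "seq" number.
    SERVERS = [("serv-a.example.com", 9000), ("serv-b.example.com", 9000)]


    def stream_from(last_seq):
        """Connect to whichever server answers and resume after `last_seq`."""
        for addr in SERVERS:
            try:
                sock = socket.create_connection(addr, timeout=5)  # 5 s of silence counts as failure
            except OSError:
                continue  # that server is down, try the next one
            # Ask the server to replay everything newer than what we already have.
            sock.sendall(json.dumps({"resume_after": last_seq}).encode() + b"\n")
            return sock
        raise RuntimeError("no server reachable")


    def consume(handle_record):
        last_seq = 0
        while True:
            sock = stream_from(last_seq)
            try:
                for line in sock.makefile("rb"):
                    record = json.loads(line)
                    handle_record(record)
                    last_seq = record["seq"]  # remember progress for the next resume
            except OSError:
                pass  # connection died; loop around and resume where we left off
            finally:
                sock.close()

With something like this in place, a dropped TCP connection loses nothing that cannot simply be re-requested, so failover stops being a transport-level problem.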
You should look at HAProxy. It is usually run in HTTP mode, but it can handle raw TCP connections as well. It load-balances between backend servers and uses health checks to detect when an instance is down; Heartbeat can be added on top to keep the HAProxy node itself highly available.
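A minimal TCP-mode configuration would look something like this (addresses, ports, and check intervals are placeholders):

    global
        daemon
        maxconn 256

    defaults
        mode tcp
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend stream_in
        bind *:9000
        default_backend stream_servers

    backend stream_servers
        option tcp-check
        server srv1 10.0.0.1:9000 check inter 2s fall 2
        server srv2 10.0.0.2:9000 check inter 2s fall 2 backup

Keep in mind that HAProxy only routes new connections away from a failed server: a TCP connection already established to the dead backend is still dropped, so clients have to reconnect.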
If your setup needs to be totally transparent (the backend servers seeing the clients' source IPs rather than that of the HAProxy server), you might have to patch your Linux kernel for TProxy, or find a distribution that ships TProxy support in the kernel or as a module.
That's the best open source solution. If you need something more comprehensive than that, you'd have to look at commercial offerings such as Citrix NetScaler or F5's BigIP.