We've already got web servers that are load balanced. And even though outages shouldn't happen, they do, for a variety of reasons (central switch failure, misconfigured ISP routers, backbone failures, a DoS attack on shared infrastructure). I want to put a second set of servers in a completely different geographical location with entirely different connections. I can sync the SQL servers with a number of different techniques, so that's not a problem. But what I don't know how to do is transparently redirect existing user web sessions to the backup servers when the primary goes down or becomes unreachable.
AFAIK, the three most common ways of dealing with this are:
- DNS load balancing, which uses a very low TTL to intelligently resolve DNS requests to server IPs in the best environment.
- Intelligent redirection, which uses a third site to authoritatively redirect users to well-known but secondary DNS names like na1.mysite.com and eu.mysite.com.
- An intelligent, minimal proxy server, hosted in the cloud somewhere, that relays requests to whichever site is up (a rough sketch of this follows the list).
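To make that third option concrete, here is a minimal sketch (not a production design) of what such a failover proxy could look like, written in Python purely for illustration; the backend hostnames, the port, and the GET-only handling are all placeholder assumptions:

```python
# Minimal failover proxy sketch (GET only): forward each request to the
# first backend that answers, falling back to the DR site on failure.
# Backend hostnames and the listen port are placeholders, not real infrastructure.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request, error

BACKENDS = ["https://primary.example.com", "https://dr.example.com"]  # placeholders

class FailoverProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        for backend in BACKENDS:
            try:
                # Note: urlopen raises HTTPError on 4xx/5xx, so this naive
                # sketch also fails over on application errors, not just outages.
                upstream = request.urlopen(backend + self.path, timeout=5)
                body = upstream.read()
                self.send_response(upstream.status)
                self.send_header("Content-Type",
                                 upstream.getheader("Content-Type", "text/html"))
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
                return
            except (error.URLError, OSError):
                continue  # this backend is unreachable: try the next site
        self.send_error(502, "All backends are down")

if __name__ == "__main__":
    HTTPServer(("", 8080), FailoverProxy).serve_forever()
```

Of course, this sketch is itself the single point of failure discussed below: whatever runs the proxy has to be more available than the sites behind it.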
But in the case of a site failure, the first leaves users unable to reach the site until the TTL expires and clients re-query DNS and resolve to the DR site (or, with a very short TTL, it generates excessive extra DNS requests). The second method still leaves us with a potential single point of failure (although I could see multiple A records being used to duplicate the master "login" role across environments), and it still doesn't redirect users when the site they're currently using goes down. And the third isn't redundant at all if the cloud provider goes down (as they all have from time to time).
From what I know about networking, isn't there a way I can give two different servers in two geographically separated environments the same IP address and let IP packet routing take over and route traffic to whichever server is accepting requests? Is this only feasible with IPv6? What is it called, and why don't DR site failovers currently use such a technique? Update: this is called anycast. How do I make it happen, and is it worth the trouble?
To clarify: this question is specific to HTTP traffic only, with a service interruption of up to 60 seconds allowed. Users should not need to close their browser, go back to the login page, or refresh anything. Mobile users cannot accept an extra DNS query for every page request.
I've been here before.
A few times.
Here are some of my past questions.
The general TL;DR is that DNS isn't a solution, for many reasons: some of which you've identified, and some of which are in the answers to the questions linked above.
The only real way to do geographic resilience is with BGP: subdivide a /23 into two /24s, have those advertised by your upstreams, and then do the individual DNS work from there.
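In case it helps picture the moving parts: the announce/withdraw decision is usually driven by a local health check feeding whatever BGP daemon you run. The sketch below assumes something like ExaBGP's text API, where a helper process writes announce/withdraw commands to stdout; the prefix and health-check URL are placeholders:

```python
#!/usr/bin/env python3
# Rough sketch of a health-check helper for a BGP daemon such as ExaBGP:
# announce this site's /24 while the local web tier is healthy, withdraw it
# when checks fail so routing moves traffic to the other site.
# The prefix and the health-check URL are placeholders.
import sys
import time
from urllib import request, error

PREFIX = "192.0.2.0/24"                 # placeholder: this site's /24
CHECK_URL = "http://127.0.0.1/healthz"  # placeholder: local health endpoint

def healthy() -> bool:
    try:
        return request.urlopen(CHECK_URL, timeout=2).status == 200
    except (error.URLError, OSError):
        return False

announced = False
while True:
    up = healthy()
    if up and not announced:
        sys.stdout.write(f"announce route {PREFIX} next-hop self\n")
        sys.stdout.flush()
        announced = True
    elif not up and announced:
        sys.stdout.write(f"withdraw route {PREFIX} next-hop self\n")
        sys.stdout.flush()
        announced = False
    time.sleep(5)
```

How quickly traffic converges on the remaining announcement depends on your upstreams, but the failover is routing-driven rather than TTL-driven, which is the whole point.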
Then you get the irritating problem of synchronisation between them, but that's another story.
Well, it's not a problem you've had yet.
If you used intelligent redirection, either by changing the hostname or by proxying the request, then you've got yet another problem: "Where do you put the proxy so that it's not a SPOF?"
Otherwise you'd have N geographically separate sites but one single point of failure (the proxy/redirect engine).
I suppose that, in theory, you could use MPLS instead to make your locations appear to be on the same L2 network, although I'm uncertain how this would actually improve resilience to failure.
DNS by itself doesn't provide automatic failover capability. But combined with the browser's client-side retry across multiple A records, it does offer a free (in terms of network investment) and low-latency (~1 s) solution. See the references below for more details, and the client-side sketch after them.
http://blog.engelke.com/2011/06/07/web-resilience-with-round-robin-dns/
Multiple data centers and HTTP traffic: DNS Round Robin is the ONLY way to assure instant fail-over?
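To show what that "browser retry" behaviour amounts to in practice, here is a small Python sketch of the logic a client applies with round-robin A records: resolve every address for the name and try each in turn until one accepts the connection. The hostname and timeout are placeholders.

```python
# Sketch of the client-side behaviour round-robin DNS relies on: resolve all
# A/AAAA records for the name and try each address in turn, moving on as soon
# as a connection attempt fails. Hostname and timeout are placeholders.
import socket

def fetch_first_reachable(host: str, port: int = 80, path: str = "/") -> bytes:
    # getaddrinfo returns every record the resolver handed back, in order
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(3)  # a dead site fails fast, much like a browser
                s.connect(sockaddr)
                s.sendall(
                    f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                    "Connection: close\r\n\r\n".encode()
                )
                chunks = []
                while chunk := s.recv(4096):
                    chunks.append(chunk)
                return b"".join(chunks)
        except OSError:
            continue  # this address is down: try the next one
    raise ConnectionError(f"No reachable address for {host}")

if __name__ == "__main__":
    print(fetch_first_reachable("www.example.com")[:200])
```

The failover cost a user sees is roughly one connection timeout per dead address, which is where the low-latency figure above comes from, rather than waiting out a DNS TTL.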