I own and operate visualwebsiteoptimizer.com/. The app provides a code snippet which my customers insert into their websites to track certain metrics. Since the code snippet is external JavaScript (placed at the top of the site code), a visitor's browser contacts our app server before the customer's website is shown. If our app server goes down, the browser will keep trying to establish the connection until it times out (typically 60 seconds). As you can imagine, we cannot afford to have our app server down in any scenario, because it would negatively affect the experience of not just our own website's visitors but our customers' website visitors too!
We are currently using a DNS failover mechanism with one backup server located in a different data center (actually a different continent). That is, we monitor our app server from 3 separate locations and, as soon as it is detected to be down, we change the A record to point to the backup server's IP. This works fine for most browsers (as our TTL is 2 minutes), but IE caches DNS for 30 minutes, which might be a deal killer. See this recent post of ours: visualwebsiteoptimizer.com/split-testing-blog/maximum-theoretical-downtime-for-a-website-30-minutes/
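To make that concrete, the monitoring/failover logic boils down to something like the sketch below (Node.js, a single monitoring location only; the addresses are examples, and updateARecord() is just a placeholder for whatever update API your DNS provider actually exposes):

    // Simplified health check: after a few consecutive failures, repoint the A record.
    var http = require('http');

    var PRIMARY_IP = '203.0.113.10';    // example address, not a real server
    var BACKUP_IP = '198.51.100.20';    // example address
    var FAILURES_BEFORE_SWITCH = 3;
    var consecutiveFailures = 0;

    function updateARecord(name, ip) {
      // Placeholder: replace with your DNS provider's update API call.
      console.log('Would point ' + name + ' at ' + ip);
    }

    function checkPrimary() {
      var req = http.get({ host: PRIMARY_IP, path: '/healthcheck', timeout: 5000 }, function (res) {
        consecutiveFailures = 0;   // any response counts as "up"
        res.resume();              // drain the body
      });
      req.on('timeout', function () {
        req.destroy();             // treat a hung server the same as a down one
      });
      req.on('error', function () {
        consecutiveFailures += 1;
        if (consecutiveFailures >= FAILURES_BEFORE_SWITCH) {
          updateARecord('app.example.com', BACKUP_IP);
        }
      });
    }

    setInterval(checkPrimary, 30 * 1000);   // poll every 30 seconds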
So, what kind of setup can we use to ensure an almost instant failover in case the app data center suffers a major outage? I read here www.tenereillo.com/GSLBPageOfShame.htm that having multiple A records is a solution, but we can't afford session synchronization (yet). Another strategy we are exploring is having two A records: one pointing to the app server and the second to a reverse proxy (located in a different data center) which forwards to the main app server while it is up and to the backup server when the main one is down. Do you think this strategy is reasonable?
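For what it's worth, in code terms that reverse-proxy idea amounts to roughly the following (a hand-rolled Node.js sketch purely to illustrate the flow; the hostnames, port and timeout are made up, and a real deployment would more likely use nginx or HAProxy with proper health checks):

    var http = require('http');

    var PRIMARY = { host: 'app-main.example.com', port: 80 };   // illustrative names
    var BACKUP = { host: 'app-backup.example.com', port: 80 };

    // Forward a GET (the snippet is only ever fetched via GET) to one target,
    // calling onError if that target cannot be reached in time.
    function forward(clientReq, clientRes, target, onError) {
      var proxyReq = http.request({
        host: target.host,
        port: target.port,
        path: clientReq.url,
        method: 'GET',
        headers: clientReq.headers,
        timeout: 3000                       // fail fast so the fallback kicks in quickly
      }, function (proxyRes) {
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
      });
      proxyReq.on('timeout', function () { proxyReq.destroy(); });
      proxyReq.on('error', onError);
      proxyReq.end();
    }

    http.createServer(function (req, res) {
      forward(req, res, PRIMARY, function () {
        // Primary unreachable: retry against the backup data center.
        forward(req, res, BACKUP, function () {
          if (!res.headersSent) res.writeHead(204);   // worst case: empty but completed response
          res.end();
        });
      });
    }).listen(8080);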
Just to be sure of our priorities: we can afford to keep our own website or app down, but we can't let customers' websites slow down because of our downtime. So, in case our app servers are down, we don't need to respond with the default application response. Even a blank response will suffice; we just need the browser to complete that HTTP connection (and nothing else).
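To illustrate just how little that fallback needs to do, it could be something as dumb as the sketch below: answer every request immediately with an empty but valid JavaScript body, so the visitor's browser completes the request instead of hanging (the port and cache lifetime here are arbitrary choices):

    var http = require('http');

    http.createServer(function (req, res) {
      res.writeHead(200, {
        'Content-Type': 'application/javascript',
        'Content-Length': '0',
        'Cache-Control': 'max-age=60'   // keep it short so recovery is picked up quickly
      });
      res.end();                        // empty body: the page loads, just without tracking
    }).listen(80);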
Reference: I read this thread, which was useful: serverfault.com/questions/69870/multiple-data-centers-and-http-traffic-dns-round-robin-is-the-only-way-to-assure
Your situation is fairly similar to ours. We want split datacentres and network-layer failover.
If you've got the budget to do it, then what you want is two datacentres, multiple IP transit providers at each, and a pair of edge routers running BGP sessions to your transit providers, advertising your IP addresses to the global internet.
This is the only way to do true failover. When the routers notice that the route to your servers is no longer valid (which you can detect in a number of ways), they stop advertising that route and traffic goes to the other site.
The problem is that, for a pair of edge routers, you're looking at a fairly high initial cost to get this set up.
Then you need to set up the networking behind all this, and you might want to consider some kind of Layer 2 connectivity between your sites as a point-to-point link, so that you can route traffic arriving at one datacentre directly to the other in the event of a partial failure of your primary site.
BGP Multihomed/Multi-location best practice and Best way to improve resilience? are questions that I asked about similar issues.
The GSLB page of shame does raise some important points, which is why, personally, I'd never willingly choose a GSLB to do the job of BGP routing.
You should also look at the other points of failure in your network. Make sure all servers have 2 NICs (connected to 2 separate switches) and 2 PSUs, and that your service is made up of multiple backend servers, either as redundant pairs or load-balanced clusters.
Basically, DNS "load balancing" via multiple A records is just "load-sharing" as the DNS server has no concept of how much load is on each server. This is cheap (free).
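To see why it's only load-sharing: the resolver simply hands back every address it knows about (often in rotated order), with no health or load information attached, e.g. (Node.js, with example.com standing in for your own name):

    var dns = require('dns');

    dns.resolve4('example.com', function (err, addresses) {
      if (err) throw err;
      // The client typically picks the first address; a dead server's address is
      // still handed out, and the browser only gives up on it after its own timeout.
      console.log(addresses);
    });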
A GSLB service has some concept of how loaded the servers are and of their availability, and provides somewhat greater resistance to failure, but it is still plagued by the problems of DNS caching and pegging. This is less cheap, but slightly better.
A BGP-routed network, backed by a solid infrastructure, is, IMHO, the only way to truly guarantee good uptime. You could save some money by using software route servers instead of Cisco/Juniper/etc. routers, but at the end of the day you need to manage those servers very carefully indeed. This is by no means a cheap option, or something to be undertaken lightly, but it is a very rewarding solution, and it brings you into the internet as a provider rather than just a consumer.
OK, this was asked a while ago, but I'm first seeing it now.
You should:
Doing anything else is irresponsible, really. I assume you already have this in place.
You should not base your service on BGP routing tricks unless you have or obtain the know-how to do so. Complex BGP routing scenarios are decidedly non-trivial to implement; don't do this yourself if you don't have the domain-specific knowledge.
Your question itself is a little confused. Analysis of how to create a highly available service begins with the application data, because that's your "state". The stateless parts are easy to make highly available; the stateful parts are not. So instead of focusing on your servers and DNS, look at where your application maintains state. Start by optimizing there, and possibly ask for algorithm advice on Stack Overflow. Could you, for example, implement a notion of transactions and smart server retry in your JavaScript file?
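For example, here is a rough, hypothetical sketch of what "smart server retry" inside the snippet could look like: load the tracking script asynchronously, race it against a short timeout, and fall back to a second hostname (or give up silently) instead of blocking the customer's page. The hostnames, file name and 2-second budget below are made up:

    (function () {
      var HOSTS = ['app-main.example.com', 'app-backup.example.com'];
      var TIMEOUT_MS = 2000;

      function load(hostIndex) {
        if (hostIndex >= HOSTS.length) return;        // give up silently; the page stays usable

        var script = document.createElement('script');
        script.src = 'http://' + HOSTS[hostIndex] + '/tracker.js';
        script.async = true;

        var timer = setTimeout(function () {
          script.onload = script.onerror = null;      // stop listening to the slow host
          load(hostIndex + 1);                        // and try the next one
        }, TIMEOUT_MS);

        script.onload = function () { clearTimeout(timer); };
        script.onerror = function () {
          clearTimeout(timer);
          load(hostIndex + 1);
        };

        document.getElementsByTagName('head')[0].appendChild(script);
      }

      load(0);
    })();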
Actually, what you want could be upgraded to help your split-testing activities as well, if you combine GeoDNS and DNS failover.
Sending group A to IP 1 and group B to IP 2, even if they were on the same server, would let you separate your testing groups, with group A and group B drawn from different geographical regions. To be fair, the next day/week/month you flip the groups to make sure you allow for geographic differences, just to be rigorous in your methodology.
The GeoDNS/failover DNS service at http://edgedirector.com can do this.
Disclosure: I am associated with the link above; I stumbled in here while researching an article on applying stupid DNS tricks to split testing.