We have a SaaS application that we need to be highly available. We already have an expensive, well-maintained Hyper-V failover cluster, but today the datacenter where we host that cluster had a five-hour power outage that knocked us completely offline. So now we're wondering if a better approach might be to use servers at two separate datacenters. Assuming we get all the back-end file replication and data replication working between these two sites, we're wondering how to handle the front-end routing -- no matter how we approach the problem, we always wind up with the load balancer being a single point of failure.
So the question is ... how can we set up load-balancing between two hosting sites such that the load balancer isn't the single point of failure? Is there a way to use two separate load balancers, one at each site? Should we be considering round-robin DNS?
There are two common ways of doing this properly. One simple, one... not.
DNS
Round-Robin DNS isn't quite what you want, because chances are you want all requests to go to the primary DC, with the secondary DC used only when the primary is down.
What you can do, though, is set a very low TTL on your DNS records (say, 30 seconds to 5 minutes). That way, if your primary DC does go down, you just update your DNS, and within 5 minutes or so all of your clients will be pointing at your other DC.
Because your two DCs will have different IP layouts, you'll need to account for this when setting up the secondary datacenter.
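To make that concrete, here's a minimal sketch of a scripted DNS failover check, assuming dnspython is installed, the zone accepts RFC 2136 dynamic updates from the monitoring host, and the hostnames, addresses, and nameserver below are placeholders:

```python
# Minimal DNS-failover sketch (assumptions: dnspython installed, the zone
# accepts RFC 2136 dynamic updates from this host, and every name/IP below
# is a placeholder).
import urllib.request

import dns.query   # pip install dnspython
import dns.update

ZONE = "example.com"
RECORD = "www"
TTL = 60                        # keep the TTL low so a change propagates quickly
PRIMARY_IP = "203.0.113.10"     # primary datacenter VIP
SECONDARY_IP = "198.51.100.10"  # secondary datacenter VIP
NAMESERVER = "192.0.2.53"       # authoritative server that accepts dynamic updates


def primary_is_healthy() -> bool:
    """Probe the primary DC directly by IP so DNS caching can't hide an outage."""
    request = urllib.request.Request(
        f"http://{PRIMARY_IP}/healthz",
        headers={"Host": f"{RECORD}.{ZONE}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status == 200
    except OSError:
        return False


def point_record_at(ip: str) -> None:
    """Replace the A record for www.example.com via an RFC 2136 dynamic update."""
    update = dns.update.Update(ZONE)
    update.replace(RECORD, TTL, "A", ip)
    dns.query.tcp(update, NAMESERVER)


if __name__ == "__main__":
    point_record_at(PRIMARY_IP if primary_is_healthy() else SECONDARY_IP)
```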
BGP
Basically, if you're asking this question, then this is out of your reach. In short, your IP addresses stay the same, but they are "moved" from one datacenter to the other. This involves expensive routers, provider-independent IP ranges, and an expensive subscription to your regional registry for an AS number and those ranges.
Your BGP routers stop advertising your prefixes at your primary datacenter and start advertising them at your secondary datacenter. The internet then routes around the offline datacenter and sends traffic to your secondary DC.
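If you do go down this road, the announce/withdraw step is often automated with something like ExaBGP, which runs a health-check script and reads routing commands from its stdout. Here's a rough sketch of that idea; the prefix and health URL are placeholders, and a real setup still needs your own AS number, provider-independent address space, and upstreams willing to accept your announcements:

```python
# Rough sketch of automating BGP failover with ExaBGP's process API: ExaBGP
# runs this script and reads "announce"/"withdraw" commands from its stdout.
# The prefix and health URL are placeholders; announcing a route for real also
# requires your own AS number and provider-independent address space.
import sys
import time
import urllib.request

PREFIX = "203.0.113.0/24"                 # placeholder provider-independent range
HEALTH_URL = "http://127.0.0.1/healthz"   # local check that this site is serving


def service_is_up() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=3) as response:
            return response.status == 200
    except OSError:
        return False


announced = False
while True:
    healthy = service_is_up()
    if healthy and not announced:
        sys.stdout.write(f"announce route {PREFIX} next-hop self\n")
        announced = True
    elif not healthy and announced:
        sys.stdout.write(f"withdraw route {PREFIX} next-hop self\n")
        announced = False
    sys.stdout.flush()
    time.sleep(10)
```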
If you are virtualised with ESXi and vSphere, VMware has a pretty good product that we trialled once called VMware Site Recovery Manager, which basically does everything for you. It keeps your VM configs in sync and powers them up on the second site when the first site goes offline. It is big bucks, though.
Years later…but for those still looking these seem to be the most affordable / simple solutions for DNS failover:
You need to load-balance the load-balancers.
You can do this with DNS round-robin, but that approach has many problems: you cannot control clients that cache entries longer than you'd like, and you cannot force traffic to go to a particular location.
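As a quick illustration of why caching matters, here's a small dnspython sketch (www.example.com stands in for a name published with multiple A records) showing the record set and TTL a round-robin name hands back; each client or resolver simply picks one of the candidates on its own schedule:

```python
# Quick look at what round-robin DNS actually hands back to a client
# (assumes dnspython is installed; www.example.com stands in for a name
# published with multiple A records).
import dns.resolver  # pip install dnspython

answer = dns.resolver.resolve("www.example.com", "A")
print("TTL the resolver will cache these records for:", answer.rrset.ttl)
for record in answer:
    # Each client/resolver just picks one of these on its own schedule --
    # you don't control which, or for how long it keeps using it.
    print("candidate address:", record.address)
```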
You can also do this with Global Server Load Balancing (GSLB). This is a more advanced way to leverage DNS to give you visibility into multiple data centers from the internet. In short, you set up some mechanism to split your traffic into slices and use DNS to pick a slice. We use a hash of the DNS resolver that does lookups on behalf of the client; other folks use geography to route to the "closest" data center. You'll also need some mechanism to quickly remove an IP from the GSLB should a single point of failure for that data center or cluster go down.
http://www.eukhost.com/web-hosting/kb/global-server-load-balancing/
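To make the "slice" idea concrete, here's a toy sketch of the hashing approach described above; the datacenter names, VIPs, and health flags are made-up placeholders, and in practice this logic lives inside the GSLB's authoritative DNS server rather than a standalone script:

```python
# Toy illustration of the "hash the resolver to pick a slice" idea; the
# datacenter names, VIPs, and health flags are made up, and a real GSLB does
# this inside its authoritative DNS server, not a standalone script.
import hashlib

DATACENTERS = [
    {"name": "dc-east", "vip": "203.0.113.10", "healthy": True},
    {"name": "dc-west", "vip": "198.51.100.10", "healthy": True},
]


def pick_datacenter(resolver_ip: str) -> dict:
    """Deterministically map a client's DNS resolver to a healthy datacenter slice."""
    candidates = [dc for dc in DATACENTERS if dc["healthy"]] or DATACENTERS
    digest = hashlib.sha256(resolver_ip.encode()).digest()
    return candidates[digest[0] % len(candidates)]


# The authoritative server would answer the client's A query with the chosen VIP:
print(pick_datacenter("8.8.8.8")["vip"])
print(pick_datacenter("9.9.9.9")["vip"])
```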
Finally, some really advanced folks tackle this problem with Anycast DNS. This again leverages the "closest" data center approach. Anycasting your service means you will need to eliminate any "statefulness," which may prove difficult.