I know many variants of this question have been asked already, but I still can't find a good answer for my needs.
What I want to do is set up a few (2 at minimum) VPS's to host my web apps on. I'd like to provide some load balancing (which is fairly easy to achieve with, say, Varnish) and relatively high availability, which is my problem.
Using a load balancer (which I'd need to host on one of the VPS's) introduces a single point of failure, which is almost as bad as having just one machine to serve the content.
http://i.stack.imgur.com/lFafj.png
And AFAIK the DNS round-robin method is not only a bad idea for load balancing, but also provides no failover mechanism. If one of the servers goes down, some people (with a cached DNS IP) will still try to connect to the unavailable server. And forget short TTLs - that is not the correct solution.
http://i.stack.imgur.com/mTLRf.png
One very important thing to consider: I want to have my VPS's spread across several datacenters, so that if electricity or the ISP fails in one datacenter, the website won't go down.
The only 2 solutions I can think of are to either rely on DNS round robin (and in case of a server failure at least serve the content to some percentage of users until recovery), or buy a dedicated server in a datacenter well prepared for blackouts and equipped with several internet connections (which is insanely expensive compared to renting even 10 VPS's).
So the question is: What is the correct way to avoid a single point of failure while having several load-balanced VPS's?
Please excuse the images. They're just as-basic-as-possible examples of what I meant.
Notes:
How much are you willing to spend? I've yet to see someone who relies on VPSs and is really willing to spend the money to cover a datacenter failure.
Regarding your drawings:
The failure in the first one is real if (and only if) the load balancer is a single machine. If it's a single system (as in, a system built from multiple hosts), that's no longer true.
SPA (Shortest possible answer):
Really short answer: You need to get a service IP that is available in all your locations. And set up BGP routing.
A little bit longer: Typically this is done by using BGP and announcing the IP at 2 different locations. You can set it up so that the IPs are announced all the time, but one announcement has a lower preference than the other. This way, under normal circumstances your traffic goes to only one site; if that site fails, its BGP route is dropped and traffic switches over to the location where the IP is still announced.
We have a few setups similar to this; the typical layout is:
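To make the "same IP, two announcements, different preference" idea concrete, here is a toy model in Python. It is not real BGP path selection, just a sketch of the behaviour described above; the site names and preference numbers are made up, and "higher preference wins" is an assumption of the model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Announcement:
    site: str        # hypothetical site name
    preference: int  # toy model: the higher value wins


def select_site(announcements: List[Announcement]) -> Optional[str]:
    """Return the site that receives traffic for the service IP,
    or None if no site is announcing it any more."""
    if not announcements:
        return None
    return max(announcements, key=lambda a: a.preference).site


# Both sites announce the same service IP; site-a is preferred.
routes = [Announcement("site-a", 200), Announcement("site-b", 100)]
normal = select_site(routes)

# site-a fails, so its announcement is withdrawn and only site-b remains.
after_failure = select_site([a for a in routes if a.site != "site-a"])
```

The point of the model is that nothing has to be "switched" by hand: the backup announcement is already there, and it simply becomes the best remaining route once the preferred one disappears.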
(per location):
2 loadbalancers
This is the place where BGP also runs and announces its IPs. Usually Quagga and some IPVS setup (we use keepalived)
n servers to handle the load (FE)
The failure cases:
Any 1 Loadbalancer (at a single site) fails
Any n-k of the FEs fail (k being the number of FEs that can fail without us experiencing issues)
n-(k+1) FEs fail (at a single site)
any major outage at a single site
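The list above names the failure cases but not how each one is handled, so the following is only one possible reading, sketched in Python. It assumes 2 loadbalancers per site (as in the layout above), takes k as "the number of FEs that can fail without issues", and assumes a site should stop announcing its IP once it has no loadbalancer left or more than k FEs are down. The function name and the three state labels are hypothetical.

```python
def classify_site(lbs_up: int, fes_failed: int, k: int) -> str:
    """Classify one site under the assumptions stated above.

    lbs_up     -- loadbalancers still alive at this site (out of 2)
    fes_failed -- frontend servers currently down at this site
    k          -- FEs that may fail without the site having issues
    """
    # No loadbalancer left, or more FEs down than the site can absorb:
    # assume the announcement is withdrawn so traffic moves to the other site.
    if lbs_up == 0 or fes_failed > k:
        return "withdraw"
    # Something is broken, but the site can still carry its traffic.
    if lbs_up < 2 or fes_failed > 0:
        return "degraded"
    return "healthy"
```

For example, with k = 1, losing one loadbalancer or one FE leaves the site "degraded" but still announced, while losing two FEs or both loadbalancers gives "withdraw".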
I'm sorry, I'm not in the mood right now to go further into the details of doing this manually. My guess is you'll be better (and cheaper) off renting a loadbalancer service that will do the magic for you. I've read that Amazon provides these, but I don't know whether they can be used without the rest of their infrastructure.
I'm trying to achieve exactly the same thing; if you find a good solution, please post! :)
What I've got so far is Amazon EC2 "Elastic IP" (and also "Elastic Load Balancing"), which can be routed to instances across different datacenters in one region. (Ironically, they once had an outage that took down all the datacenters in one region.)
Also, I've googled this one: http://www.fibercloud.com/MatrixTechnology - it looks like they also provide what you are looking for. (I myself didn't dare ask about pricing :)
So far I see that the ultimate answer is managing your own BGP, but at least for me that's out of the question.
As for the DNS option, I'd generally agree that it's not perfect because of some unavoidable caching, mainly inside end-user browsers. I also agree that a low TTL is not perfect: for small sites I think it will cause a minor slow-down, since for most users the recursive DNS resolver will not have the record in its cache. (Though it's worth mentioning that google.com has a TTL of 300 seconds.) BTW, AFAIK browsers will fail over to the second IP advertised in DNS once they time out connecting to the first one, so it won't completely fail; I'd rather call it degraded.
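The "try the next advertised IP after a timeout" behaviour can be sketched from the client side in Python. This is only an illustration of the idea, not how any particular browser implements it; the function name is hypothetical and the 3-second default timeout is an arbitrary choice.

```python
import socket
from typing import Iterable, Optional, Tuple


def connect_first_available(
    addresses: Iterable[Tuple[str, int]], timeout: float = 3.0
) -> Optional[socket.socket]:
    """Try each (host, port) in order and return the first socket that
    connects, or None if every address fails or times out."""
    for host, port in addresses:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            # Refused, unreachable or timed out: move on to the next IP.
            continue
    return None


# Usage (addresses are placeholders from the documentation ranges):
# conn = connect_first_available([("192.0.2.10", 80), ("198.51.100.10", 80)])
```

Note the cost: a user whose first IP is dead waits up to the full timeout before the second one is tried, which is why this is degraded service rather than clean failover.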
I'm thinking about combining both: using 2 VPS providers, each hosting 2 hosts. Between hosts at the same datacenter use IP failover, and across datacenters use DNS (normally both IPs are advertised with a low TTL, and once one of them fails, the failed IP is removed).
You should be aware of the split-brain case, where both sites fight to remove each other's IP because each thinks the other is unavailable. I think I've found a good solution for it: run your own DNS server on each host, so that upon split brain each site removes the other only from its own DNS server. A user who is able to reach one host's DNS server will then get the name resolved to that very host, which is accessible (he reached it, right?).
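Here is a small Python simulation of that per-site DNS idea, to show why split brain stays harmless. It is a sketch only: the class and method names are hypothetical, the IPs are placeholders from the documentation ranges, and a real setup would update an actual DNS server rather than an in-memory object.

```python
from typing import List


class SiteDns:
    """Toy model of the DNS server running at one site."""

    def __init__(self, own_ip: str, peer_ip: str) -> None:
        self.own_ip = own_ip
        self.peer_ip = peer_ip
        self.peer_reachable = True

    def observe_peer(self, reachable: bool) -> None:
        # Result of this site's own health check against the other site.
        self.peer_reachable = reachable

    def answer(self) -> List[str]:
        # A site always advertises itself, and the peer only while
        # it believes the peer is up.
        records = [self.own_ip]
        if self.peer_reachable:
            records.append(self.peer_ip)
        return records


site_a = SiteDns(own_ip="192.0.2.10", peer_ip="198.51.100.10")
site_b = SiteDns(own_ip="198.51.100.10", peer_ip="192.0.2.10")
normal_a = site_a.answer()

# Split brain: the link between the sites breaks and each side
# concludes that the other one is down.
site_a.observe_peer(False)
site_b.observe_peer(False)
split_a = site_a.answer()
split_b = site_b.answer()
```

During the split, whichever DNS server a user manages to reach answers with its own site's IP only, so neither site can remove itself from the answers the other side's users would need.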
If having 4 hosts is too expensive, I think that it's better to use just IP failover at some reliable provider and not rely on DNS alone.
Regards, Alex