From reading, it seems like DNS failover is not recommended just because DNS wasn't designed for it. But if you have two webservers on different subnets hosting redundant content, what other methods are there to ensure that all traffic gets routed to the live server if one server goes down?
To me it seems like DNS failover is the only failover option here, but the consensus is it's not a good option. Yet services like DNSmadeeasy.com provide it, so there must be merit to it. Any comments?
By 'DNS failover' I take it you mean DNS Round Robin combined with some monitoring, i.e. publishing multiple IP addresses for a DNS hostname, and removing a dead address when monitoring detects that a server is down. This can be workable for small, less trafficked websites.
By design, when you answer a DNS request you also provide a Time To Live (TTL) for the response you hand out. In other words, you're telling other DNS servers and caches "you may store this answer and use it for x minutes before checking back with me". The drawbacks come from this caching: until the TTL expires, some resolvers and clients will keep using a dead address, so failover is never instantaneous and you have no control over exactly when every client moves over.
The more common methods of getting good uptime involve redundancy within a single site (load balancers, redundant servers) rather than relying on DNS.
A very small minority of web sites use multi-datacenter setups, with 'geo-balancing' between datacenters.
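To make the mechanics concrete, here is a minimal sketch of what Round Robin looks like from a client's point of view. It assumes the third-party dnspython package (any resolver API would do), and the hostname is just a hypothetical round-robin name:

    # Minimal sketch: what DNS Round Robin looks like from the client side.
    # Assumes the third-party "dnspython" package (pip install dnspython);
    # "www.example.com" is a hypothetical round-robin hostname.
    import dns.resolver

    answer = dns.resolver.resolve("www.example.com", "A")

    # The TTL tells caches how long they may reuse this answer, which is
    # exactly the window during which a dead address can keep getting traffic.
    print(f"TTL: {answer.rrset.ttl} seconds")
    for record in answer:
        print("server address:", record.address)

Every cache that stores that answer is free to keep handing it out until the TTL runs out, which is the window described above.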
DNS failover definitely works great. I have been using it for many years to shift traffic between datacenters manually, or automatically when monitoring systems detected outages, connectivity issues, or overloaded servers. When you see the speed at which it works, and the volumes of real-world traffic that can be shifted with ease, you'll never look back. I use Zabbix for monitoring all of my systems, and the graphs that show what happens during a DNS failover put all my doubts to an end. There may be a few ISPs out there that ignore TTLs, and there are still some users on old browsers, but when you are looking at traffic from millions of page views a day across two datacenter locations and you do a DNS traffic shift, the residual traffic coming in that ignores TTLs is laughable. DNS failover is a solid technique.
DNS was not designed for failover, but it was designed with TTLs that work amazingly well for failover when combined with a solid monitoring system. TTLs can be set very short; I have effectively used TTLs of 5 seconds in production for lightning-fast DNS failover. You have to have DNS servers capable of handling the extra load, and named won't cut it. However, PowerDNS fits the bill when backed by a replicated MySQL database on redundant name servers. You also need a solid distributed monitoring system that you can trust for the automated failover integration. Zabbix works for me: I can verify outages from multiple distributed Zabbix systems almost instantly, update the MySQL records used by PowerDNS on the fly, and provide nearly instant failover during outages and traffic spikes.
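As a rough illustration of the kind of record update such a monitoring hook can perform (a sketch, not the author's actual tooling), here is what it might look like against the stock PowerDNS gmysql schema. The connection details, addresses, and hostname are hypothetical, and it assumes the PyMySQL package:

    # Rough sketch of the failover action described above: when monitoring
    # decides a site is down, point the A record at the surviving datacenter
    # by updating the stock PowerDNS gmysql "records" table.
    # Assumes the third-party PyMySQL package; host, credentials, and names
    # are hypothetical placeholders.
    import pymysql

    FAILOVER_IP = "203.0.113.20"      # surviving datacenter
    RECORD_NAME = "www.example.com"   # hypothetical hostname

    conn = pymysql.connect(host="127.0.0.1", user="pdns",
                           password="secret", database="pdns")
    try:
        with conn.cursor() as cur:
            # Keep the TTL very low so resolvers come back quickly after a shift.
            cur.execute(
                "UPDATE records SET content = %s, ttl = 5 "
                "WHERE name = %s AND type = 'A'",
                (FAILOVER_IP, RECORD_NAME),
            )
        conn.commit()
    finally:
        conn.close()

In practice something like this would be triggered by whatever action your monitoring system runs on an alert, and MySQL replication carries the change to the redundant name servers.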
But hey, I built a company that provides DNS failover services after years of making it work for large companies, so take my opinion with a grain of salt. If you want to see some Zabbix traffic graphs of high-volume sites during an outage, to see for yourself exactly how well DNS failover works, email me; I'm more than happy to share.
The issue with DNS failover is that it is, in many cases, unreliable. Some ISPs will ignore your TTLs, the switch doesn't happen immediately even when they do respect them, and when your site comes back up you can see some weirdness with sessions as users' DNS caches time out and they end up heading over to the other server.
Unfortunately, it is pretty much the only option, unless you're large enough to do your own (external) routing.
The prevalent opinion is that with DNS RR, when an IP goes down, some clients will continue to use the broken IP for minutes. This was stated in some of the previous answers to the question, and it is also stated on Wikipedia.
Anyway,
http://crypto.stanford.edu/dns/dns-rebinding.pdf explains that this is not true for most current browsers; they will try the next IP within seconds.
http://www.tenereillo.com/GSLBPageOfShame.htm seems to make an even stronger claim.
Maybe some expert can comment and give a clearer explanation of why DNS RR is not good for high availability.
Thanks,
Valentino
I ran DNS RR failover on a moderately trafficked but business-critical production website (across two geographies) for many years.
It works fine, but there are at least three subtleties I learned the hard way.
1) Browsers will fail over from a non-working IP to a working IP after about 30 seconds (last time I checked), provided both are considered active in whatever cached DNS is available to your clients. This is basically a good thing.
But having "half" your users wait 30 seconds is unacceptable, so you will probably want to set your TTLs to a few minutes, not a few days or weeks, so that in case of an outage you can rapidly remove the down server from your DNS. Others have alluded to this in their responses.
2) If one of the nameservers serving your round-robin domain goes down (or one of your two geographies entirely), and in particular if it is the primary, I vaguely recall you can run into other issues trying to remove that downed nameserver from DNS if you have not also set the SOA TTL/expiry to a sufficiently low value. I could have the technical details wrong here, but there is more than just one TTL setting you need to get right to really defend against single points of failure.
3) If you publish web APIs, REST services, etc., those are typically not called by browsers, and that, in my opinion, is where DNS failover starts to show real flaws. This may be why some say, as you put it, that "it is not recommended". Here's why I say that. First, the apps that consume those URLs typically are not browsers, so they lack the 30-second failover logic of common browsers. Second, whether the second DNS entry is tried, or DNS is even re-polled, depends very much on the low-level details of the networking libraries in the programming languages used by these API/REST clients, plus exactly how they are called by the client app. (Under the covers, does the library call getaddrinfo, and when? If sockets hang or close, does the app re-open new sockets? Is there some sort of timeout logic? etc.)
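To illustrate point 3, here is a rough sketch of the failover logic an API client would have to implement for itself, since it cannot count on browser-style behavior. The hostname, port, and timeout are made-up placeholders:

    # Rough sketch of the failover logic an API/REST client has to implement
    # itself: resolve every published address and try each with a short
    # timeout, instead of hanging on the first dead IP.
    # Hostname, port, and timeout are hypothetical.
    import socket

    def connect_with_failover(host, port=443, timeout=3.0):
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        last_error = None
        for family, socktype, proto, _canonname, sockaddr in infos:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            try:
                sock.connect(sockaddr)  # next address is tried only on failure
                return sock
            except OSError as exc:
                last_error = exc
                sock.close()
        raise ConnectionError(f"all addresses for {host} failed") from last_error

    # conn = connect_with_failover("api.example.com")  # hypothetical service

Plenty of HTTP libraries do nothing like this, which is exactly the gap described above.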
It's cheap, well-tested, and "mostly works". So as with most things, your mileage may vary.
There are a bunch of people that use us (Dyn) for failover. It covers the same ground as putting up a status page when you have downtime (think of things like Twitter's Fail Whale), or you can simply reroute the traffic based on the TTLs. Some people may think that DNS failover is a hack, but we seriously designed our network with failover in mind from the beginning, so that it would work as well as hardware. I'm not sure how DME does it, but we have 3 of our 17 anycasted PoPs, the ones closest to your server, monitor it. When two of the three detect that it's down, we simply reroute the traffic to the other IP. The only downtime is for users whose resolvers cached the old answer, for the remainder of that TTL interval.
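The two-out-of-three idea boils down to something like the following toy sketch. This is an illustration, not Dyn's actual implementation; the addresses are hypothetical, and three local probes stand in for genuinely distributed monitoring PoPs:

    # Toy sketch of the "two of three monitors agree" idea described above.
    # In reality each check would run from a different PoP; here three local
    # probes of a hypothetical address stand in for them.
    import socket

    def tcp_check(host, port, timeout=5.0):
        """One monitor's view: can we open a TCP connection to the server?"""
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def quorum_down(checks, quorum=2):
        """True if at least `quorum` independent monitors report the server down."""
        failures = sum(1 for check in checks if not check())
        return failures >= quorum

    monitors = [lambda: tcp_check("198.51.100.10", 443) for _ in range(3)]
    if quorum_down(monitors):
        print("two of three monitors agree: fail over to the standby IP")

Requiring agreement from more than one vantage point is what keeps a single flaky monitoring link from triggering a pointless traffic shift.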
Some people like to use both servers at once, and in that case can do something like round-robin load balancing, or geo-based load balancing. For those that actually care about performance, our real-time traffic manager will monitor each server and, if one is slower, reroute the traffic to the fastest one based on which IPs you link in your hostnames. Again, this works based on the values you put in place in our UI/API/portal.
I guess my point is, we engineered DNS failover on purpose. While DNS wasn't made for failover when it was originally created, our DNS network was designed to implement it from the get-go. It usually can be just as effective as hardware, without the depreciation or the cost of the hardware. Hope that doesn't make me sound lame for plugging Dyn; there are plenty of other companies that do it, I'm just speaking from our team's perspective. Hope this helps...
Another option would be to set up name server 1 in location A and name server 2 in location B, but set each one up so that all A records on NS1 point traffic to IPs in location A, and all A records on NS2 point to IPs in location B. Then set your TTLs to a very low number, and make sure your domain record at the registrar has been set up for NS1 and NS2. That way it will automatically load balance, and fail over should one server or one link to a location go down.
I've used this approach in a slightly different way. I have one location with two ISPs and use this method to direct traffic over each link. Now, it may be a bit more maintenance than you're willing to do... but I was able to create a simple piece of software that automatically pulls NS1 records, updates A record IP addresses for select zones, and pushes those zones to NS2.
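That sync tool could look roughly like the sketch below, assuming the dnspython package for the zone transfer. The zone name, server address, and IP mapping are hypothetical, and actually pushing the rewritten zone file to NS2 (rsync, a provisioning API, etc.) is left out:

    # Rough sketch of the sync tool described above: transfer the zone from
    # NS1, rewrite location-A addresses to their location-B equivalents, and
    # write out a zone file for NS2. Assumes the third-party "dnspython"
    # package; zone name, server IP, and address mapping are hypothetical.
    import dns.query
    import dns.zone

    ZONE = "example.com"
    NS1_IP = "192.0.2.53"                       # authoritative server, location A
    IP_MAP = {"192.0.2.10": "198.51.100.10",    # location-A IP -> location-B IP
              "192.0.2.11": "198.51.100.11"}

    # AXFR the zone from NS1 (the host running this must be allowed to transfer it).
    zone = dns.zone.from_xfr(dns.query.xfr(NS1_IP, ZONE))

    # Dump to text and swap every mapped address; everything else is kept as-is.
    text = zone.to_text()
    for a_ip, b_ip in IP_MAP.items():
        text = text.replace(a_ip, b_ip)

    with open(f"{ZONE}.ns2.zone", "w") as fh:
        fh.write(text)

Pushing the resulting file to NS2 and reloading it is the part you'd wire up to whatever deployment mechanism you already trust.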
The alternative is a BGP-based failover system. It's not simple to set up, but it should be bulletproof. Set up site A in one location and site B in a second, each with local IP addresses, then get a class C (/24) or other portable block of IPs and set up redirection from the portable IPs to the local IPs.
There are pitfalls, but it's better than DNS-based solutions if you need that level of control.
One option for multi-datacenter failover is to train your users. We advertise to our customers that we provide multiple servers in multiple cities, and in our signup emails and such we include links directly to each "server" so that users know that, if one server is down, they can use the link to the other server.
This totally bypasses the issue of DNS failover by just maintaining multiple domain names. Users who go to www.company.com or company.com and log in get directed to server1.company.com or server2.company.com, and have the choice of bookmarking either of those if they notice they get better performance using one or the other. If one goes down, users are trained to go to the other server.
All of these answers have some validity to them, but I think it really depends on what you are doing and what your budget is. Here at CloudfloorDNS, a large percentage of our business is DNS: not only fast DNS, but low-TTL options and DNS failover. We wouldn't be in business if this didn't work, and work well.
If you are a multinational corporation with an unlimited uptime budget, then yes, hardware GSLB load balancers and tier 1 datacenters are great, but your DNS still needs to be fast and rock solid. As many of you know, DNS is a critical aspect of any infrastructure; other than the domain name itself, it's the lowest-level service that every other part of your online presence rides on. Starting with a solid domain registrar, DNS is just as critical as not letting your domain expire. If DNS goes down, the whole online aspect of your organization goes down with it!
When using DNS failover, the other critical aspects are server monitoring (always check from multiple geographic locations, and always have multiple checkers, at least 3, to avoid false positives) and managing the DNS records properly once a failure is detected. Low TTLs and some of the failover options can make this a seamless process, and it beats the heck out of waking up to a pager in the middle of the night if you are a sysadmin.
Overall, DNS failover really does work and can be very affordable. In most cases, from us or from most managed DNS providers, you'll get Anycast DNS along with server monitoring and failover for a fraction of the cost of the hardware options.
So the real answer is yes, it works, but is it for everyone and every budget? Maybe not, but until you try it and do the tests for yourself, it's tough to ignore if you are a small to medium business with a limited IT budget that wants the best uptime possible.