I thought primary/secondary DNS for redundancy purposes was straightforward. My understanding is that you should have a primary and at least one secondary, and that the secondary should be in a geographically different location and behind a different router (see for example https://serverfault.com/questions/48087/why-are-there-several-nameservers-for-my-domain)
Currently, we have two name servers both in our main data center. Recently, we've suffered some outages for various reasons that took out both name servers, and left us and our customers without working DNS for a few hours. I've asked my sysadmin team to finish setting up a DNS server in another data center and configure it as the secondary name server.
However, our sysadmins claim that this doesn't help much if the other data center is not at least as dependable as the primary one. They claim that when the primary data center is down, most clients will still fail to resolve our names, or will take too long to time out.
Personally, I'm convinced we're not the only company with this kind of problem, and that it has most likely already been solved; I can't imagine every internet company being affected by it. However, I can't find good online documentation that explains what happens in the failure cases (for example, client timeouts) and how to work around them.
What arguments can I use to poke holes in our sysadmins' reasoning? And are there any online resources I can consult to better understand the problems they claim exist?
Some additional notes after reading the replies:
- we're on Linux
- we have additional complicated DNS needs; our DNS entries are managed by some custom software, with BIND currently slaving from a Twisted DNS implementation, and some views in the mix as well. However, we're completely capable of setting up our own DNS servers at another data center.
- I'm talking about authoritative DNS for outsiders to find our servers, not recursive DNS servers for our local clients.
There is a really great, albeit quite technical, "Best Practices" document that may prove useful when combating your sysadmin: http://www.cisco.com/web/about/security/intelligence/dns-bcp.html
If he/she doesn't recognize the validity of articles written by Cisco, then you might as well stop arguing with the sysadmin - go up a level of management.
Many other "Best Practices" document recommend separating your primary and secondary nameservers not only by IP block, but by physical location. In fact, RFC 2182 recomends that secondary DNS services be geographically separated. For many companies, this means renting a server in another datacenter, or subscribing to a hosted DNS provider such as ZoneEdit or UltraDNS.
Ah, so the sticking point is "dependable". It sounds like they are taking a jab at your link to the outside world rather than at secondary DNS itself. All the same, do set up secondary DNS and proceed from there. It will help with the load and will prop things up in a pinch... but do inquire as to why they think the other location is not dependable.
You're not the only company, and this has probably been rehashed a million times in companies the world over.
You can do all kinds of things, including setting up an external DNS service that is registered as the authority for your zone, but secretly making the (outside) authoritative servers secondaries to your own (inside) DNS servers. This configuration is horrible, wrong, shows that I am truly an evil SysAdmin, and a kitten dies every time I recommend it. But it does two things:
The reasons that this is the wrong thing to do:
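For what it's worth, a minimal sketch of that hidden-primary arrangement in BIND might look like this (the zone name and addresses are placeholders; 203.0.113.10 stands in for the outside provider's transfer address):

```
// named.conf fragment on the internal, hidden primary
zone "example.com" {
    type master;
    file "zones/example.com.db";
    notify yes;
    also-notify { 203.0.113.10; };     // push NOTIFY to the outside secondary on changes
    allow-transfer { 203.0.113.10; };  // let only that secondary pull the zone
};
```

The NS records you publish (and register) would list only the outside provider's servers, so the inside primary never appears in the delegation.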
Unfortunately, the Linux stub resolver doesn't have direct support for detecting failed DNS servers and failing over. It keeps sending requests to your primary resolving nameserver, waits for the configured timeout, tries again, and so on.
This often means delays of up to 30 seconds on every request, because the secondary is never tried first, even while the primary is down.
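You can take some of the edge off with glibc resolver options in /etc/resolv.conf, though this is a mitigation rather than real failover; a sketch with placeholder addresses:

```
# /etc/resolv.conf (sketch; addresses are placeholders)
nameserver 10.0.0.2            # primary resolver
nameserver 8.8.8.8             # fallback resolver
options timeout:1 attempts:2 rotate
# timeout:1  - wait 1 second per query instead of the 5-second default
# attempts:2 - make at most 2 passes over the nameserver list
# rotate     - round-robin across the listed servers instead of always asking the first
```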
I wanted to solve this because our Amazon EC2 resolving nameserver is unreachable for many of our workers. That causes big delays in our processes, and even downtime in some cases, because we rely on name resolution. I wanted a good failover to the Google / Level3 nameservers in case Amazon's went down again, and a fallback to Amazon's resolver as soon as it recovered, because Amazon resolves hostnames to internal addresses where applicable, resulting in lower latency for instance-to-instance communication.
But whatever the use case, there's a need for better failover, and I wanted to solve it. I wanted to stay away from proxying daemons, services, and the like, as those would just introduce more single points of failure. I wanted to use as archaic and robust a technology as I could.
I decided to use crontab & bash, and wrote nsfailover.sh. Hope this helps.
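The core idea is roughly the following (a sketch in the same spirit, not the actual nsfailover.sh; the addresses, the probe name, and the script path are placeholders):

```
#!/usr/bin/env bash
# Cron-driven resolver failover sketch. Run it every minute or so from root's crontab:
#   * * * * * /usr/local/sbin/ns-failover-check.sh
PRIMARY="172.16.0.23"     # e.g. the Amazon-provided resolver
FALLBACK="8.8.8.8"        # e.g. a public resolver to fall back to

# Is the primary answering queries right now?
if dig +time=2 +tries=1 @"$PRIMARY" example.com A >/dev/null 2>&1; then
    WANT="$PRIMARY"
else
    WANT="$FALLBACK"
fi

# Rewrite resolv.conf only when it doesn't already point at the server we want.
if ! grep -q "nameserver $WANT" /etc/resolv.conf; then
    printf 'nameserver %s\noptions timeout:1 attempts:2\n' "$WANT" > /etc/resolv.conf
fi
```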
It sounds like the problem is that clients (which could be anyone, anywhere) see two DNS servers, and if one fails they either do not fail over to the secondary server or take a long time to do so.
I agree that the primary and secondary DNS servers should be located at different facilities as a best practice, but I don’t see how that would fix this particular problem.
If the client is going to insist on querying a specific IP address, ignoring the secondary’s IP address (or taking a while to timeout to it), then you simply have to come up with a solution that keeps that IP address working, even if the primary server is down.
Some directions to explore would be a load balancer that can redirect traffic for a single IP address to multiple servers at different data centers; or perhaps anycast routing.
As long as each of your data centers is on different circuits (ideally with different upstream providers far up into the cloud), you can set up pretty reliable DNS with just the two data centers. You simply need to make sure your registrar of choice pushes the appropriate glue records up to the big servers in the sky (the parent zone's name servers).
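A quick way to confirm the delegation and glue actually made it to the parent zone (substitute your own zone and TLD) is to ask a parent-zone server directly:

```
# Non-recursive query to a .com gTLD server: the answer shows the NS records
# delegated for the zone, and the additional section carries the glue A records.
dig +norecurse @a.gtld-servers.net example.com NS
```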
Our setup is:
This setup has been effective enough to give us roughly five nines of uptime over the last 6 or 7 years, even with the occasional server downtime for updates and the like. If you're willing to spend a few additional dollars, you can look at outsourced hosting of the zone with someone like UltraDNS...
As to the load conversation that KPWINC mentioned, that is 100% correct. If your smallest datacenter can't handle 100% of your load, then you are likely boned anyway because your outage is going to occur when you least want it =)
I take the maximum load from all my edge routers, add them together, and then divide by 0.65; that is the minimum bandwidth we must have at each data center. I put that rule into place about 5 years ago, with some documents I gathered from CCO and around the internet to justify it, and it has never failed us. However, you must check those stats at least quarterly. We had our traffic almost triple between November and February last year, and I was not prepared for it. The bright side is that the situation allowed me to generate some very clear, hard data showing that at 72% load on our WAN circuit we start dropping packets. No additional justification has ever been required of me for more bandwidth.
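To make that rule concrete with made-up numbers: if the edge routers peak at a combined 400 Mbit/s, then 400 / 0.65 ≈ 615 Mbit/s is the minimum circuit capacity each data center needs, which keeps you comfortably below the roughly 72% utilisation where packet loss was observed to begin.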
Thomas,
After reading your update I've revised my post (previous post has reference to Windows software).
It almost sounds to me like your sysadmin(s) are telling you that your secondary location doesn't have the necessary hardware to handle the FULL LOAD?
It sounds as if he's saying, "Hey buddy, if our primary location (which includes the primary DNS) goes down then DNS is the LEAST of our worries because if COLO1 is down then COLO2 can't handle the load anyway."
If THAT is the case, then I would suggest you look over your infrastructure and try and come up with a better design. This is easier said than done, especially now that you're live in a production environment.
All that aside, in a perfect world, COLO1 and COLO2 would be able to stand alone and handle your load.
Once that is in place, the DNS part is really nothing more than having enough DNS servers with a fast enough refresh, so that if one side fails you can rewrite your DNS to point to the servers that are UP.
I've used this method in small to reasonable sized environments and it works great. Failover typically takes less than 10 minutes.
You just have to make sure your DNS servers can handle the extra load of a short TTL (time to live).
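To illustrate with a hypothetical zone, the records you plan to repoint would carry a short TTL such as 300 seconds while everything else keeps a longer default:

```
; fragment of a hypothetical zone file
$TTL 3600                              ; default TTL for the zone
www    300  IN  A  203.0.113.20        ; 5-minute TTL so a repoint takes effect quickly
mail   300  IN  A  203.0.113.25
```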
Hope this helps.
I realized from reading your description that it's not clear whether you mean authoritative DNS for outsiders to find your servers, or recursive DNS servers for your local clients. The behavior of those two is very different.
For authoritative DNS servers, the "clients" will be other DNS servers which have caching and plenty of intelligence. They'll tend to try multiple servers at once if the first one is at all slow, and will tend to prefer the one that gives them faster replies. Downtime for one data center in that case would have a very slight performance impact.
For recursive DNS servers, the clients are your local clients that probably have the DNS servers listed in DHCP. They'll try their servers in the listed order every time, with a painfully long (several seconds) timeout before moving from the first server to the second server.
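You can see that worst case for yourself by querying an address that doesn't answer (192.0.2.1 is in a documentation range, so it should simply time out):

```
time dig +tries=2 +time=5 @192.0.2.1 example.com A
# roughly 10 seconds of silence before dig gives up, comparable to what a
# stub resolver waits before moving on to the next "nameserver" entry
```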
If your primary data center is down, nobody will be able to reach those servers anyway, but often the errors from that are more intelligible than the errors from unreachable DNS servers: "couldn't contact server" or "connection timed out" instead of "couldn't find server" or "no such server". For instance, most SMTP servers will queue mail for a week if they can see the server in DNS but just can't reach it; if they can't find it in DNS at all, they may immediately refuse to even try to deliver mail to your domain.
Secondary DNS being geographically and network-separated is a good thing. You might be able to trade secondary DNS with a friendly company, and there's plenty of DNS providers you can pay to do it for you. Some registrars have secondary DNS as a service, too.
Your sysadmins are (mostly) wrong.
The recursive servers that query your authoritative servers will notice very rapidly if either site is unresponsive.
Yes, there's some chance that clients may experience very modest DNS resolution delays when there's an outage, but they'll only be a second or two, and once the client's own DNS servers have learned that one of the servers is down they'll use the remaining servers in preference to the failed one.
If necessary (to appease the sysadmins) continue to run two servers at your primary data center, but do put at least one more outside.
A secondary DNS server never hurts; depending on where it is hosted, it will give you more or less benefit.
If your primary host fails, a secondary can take over whether it's sitting next to it or in a remote location. If, however, your data center uplink fails, you might still get DNS replies from the server in the other data center, but you won't be able to reach your servers anyway, so your end users won't directly benefit from the secondary DNS in the remote location.
Different clients react in different ways to DNS servers being unavailable, so there's some truth to the claim about clients timing out, but it doesn't apply to all of them.
A secondary DNS server in a remote data center will, however, still be able to resolve the IP address of the server you want to reach, so you can debug the routing and see when things come back up. And if you have set up secondary MX servers correctly, you won't even lose any mail.
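For completeness, "set up correctly" here mostly means publishing more than one MX record with sensible preferences (hypothetical hosts shown); the backup host then queues and forwards mail while the primary site is unreachable:

```
; hypothetical MX records - the lower preference value is tried first
example.com.  3600  IN  MX  10 mail1.example.com.   ; primary, at the main site
example.com.  3600  IN  MX  20 mail2.example.com.   ; backup, at the remote site
```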