We have a small datacenter with about a hundred hosts pointing to 3 internal DNS servers (BIND 9). Our problem comes when one of the internal DNS servers becomes unavailable. At that point all the clients that point to that server start performing very slowly.
The problem seems to be that the stock Linux resolver doesn't really have the concept of "failing over" to a different DNS server. You can adjust the timeout and number of retries it uses (and set rotate so it will work through the list), but no matter what settings we use, our services perform much more slowly if a primary DNS server becomes unavailable. At the moment this is one of the largest sources of service disruptions for us.
My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...", but if that's an option I haven't seen it.
I was wondering how other folks handle this issue.
I can see 3 possible types of solutions:
Use linux-ha/Pacemaker and failover IPs (so the DNS VIPs are "always" available). Alas, we don't have a good fencing infrastructure, and without fencing Pacemaker doesn't work very well; in my experience it actually lowers availability.
Run a local DNS server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage.
Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd seems to have the right feature set: it marks DNS servers as up or down, and won't use 'down' DNS servers.
Anycast seems to work only at the IP routing level, and depends on route updates to handle server failure. Multicast seemed like it would be a perfect answer, but BIND does not support broadcast or multicast, and the docs I could find suggest that multicast DNS is aimed more at service discovery and auto-configuration than at regular DNS resolution.
Am I missing an obvious solution?
A couple of options. Both will distribute the DNS load across your DNS servers. Add
options rotate
to resolv.conf. This will minimize the impact of the primary server being down; if one of the other servers is down, lookups that rotate onto it will still be slow. This can be combined with
options timeout:1 attempts:5
Increase the attempts if you decrease the timeout, so you can still handle slow external servers.
Depending on your router configuration you may be able to configure your DNS servers to take over the primary DNS server's IP address when it is down. This can be combined with the above techniques.
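Putting the resolver options together, a client's resolv.conf would look something like this (the name server addresses are placeholders):
nameserver 10.0.0.11
nameserver 10.0.0.12
nameserver 10.0.0.13
options rotate timeout:1 attempts:5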
NOTE: I have run for years without unscheduled DNS outages. As others have noted, I would work on solving the issues causing the DNS servers to fail. The above steps also help with misconfigured clients that specify unreachable name servers.
Check out "man resolv.conf". You can add a timeout option to resolv.conf. The default is 5 seconds, but adding the following to resolv.conf should bring it down to 1 second:
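options timeout:1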
Clustering software such as Heartbeat or Pacemaker/Corosync is your friend here. As an example, we've set up Pacemaker/Corosync to keep a floating virtual IP for DNS alive on whichever name server is healthy.
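The exact configuration will vary, but with the pcs shell a floating DNS VIP looks roughly like this (resource name and addresses are just examples):
# A floating address that clients list in resolv.conf; Pacemaker moves it to a
# surviving node if the host carrying it fails.
pcs resource create dns-vip ocf:heartbeat:IPaddr2 ip=10.0.0.53 cidr_netmask=24 \
    op monitor interval=10s
# In practice you would also constrain the VIP to run only on nodes where named
# is actually healthy (e.g. a cloned named resource plus a colocation constraint).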
Production hours are 24x7, but we strongly believe that it should be possible for every server to fail without impacting customers. options rotate is merely a workaround; I wouldn't do that.
FWIW, this is the only workable solution that I have found for this problem. You do need to restrict the server to only listen on localhost, but it has completely eliminated users noticing DNS outages in our environment.
One interesting side effect is that if the localhost server goes down for some reason, the standard resolver libraries seem to handle the failover to the next server much faster than in the standard case.
We have been doing this for about 3 years now and I've not seen a single issue that can be related to the failure/outage of a dns server running on localhost.
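For illustration, a local forwarder is only a few lines of configuration; here sketched with dnsmasq (which software you run is up to you, and the upstream addresses are placeholders):
# /etc/dnsmasq.conf -- caching forwarder bound to localhost only
listen-address=127.0.0.1
bind-interfaces
no-resolv              # don't take upstream servers from /etc/resolv.conf
server=10.0.0.11       # the internal DNS servers
server=10.0.0.12
server=10.0.0.13
all-servers            # query all upstreams in parallel, first answer wins
cache-size=10000

# /etc/resolv.conf on the host then just points at the local cache:
nameserver 127.0.0.1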
If a nameserver is going down for maintenance, it is normal procedure to reduce the timeouts in the SOA for that domain ahead of time, so that when the maintenance occurs, changes (like removing NS records before the maintenance and putting them back after the maintenance) propagate quickly. Note that this is a server-side approach - changing resolvers is a client-side approach and ... unless you can talk to each and every one of your clients and get them to make this adjustment on their machine ... might not be the right approach. Well, I guess you did say only a hundred clients all in a data center using internal DNS servers, but really do you want to change the config on a hundred clients when you can just change the zone?
I'd tell you which values in the SOA to adjust, but I was surfing the web to find out that exact info when I ran across this question.
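For what it's worth, the knobs live in the zone itself: the TTLs on the records control how long resolvers cache them, and the SOA refresh/retry values control how quickly secondaries pick up the change. A BIND-style fragment, with purely illustrative names and values:
example.com.  300  IN  SOA  ns1.example.com. hostmaster.example.com. (
        2024060101  ; serial
        3600        ; refresh - how often secondaries poll for changes
        600         ; retry   - retry interval after a failed refresh
        604800      ; expire  - when secondaries stop serving the zone
        300 )       ; minimum - negative-caching TTL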
Perhaps you can put your DNS servers behind a load balancer? Apparently LVS can balance UDP. Obviously make your LB highly available so it's not a single point of failure.
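A rough sketch with ipvsadm (addresses are placeholders; in practice you'd use something like keepalived to health-check the real servers and to keep the director itself redundant):
# UDP virtual service on the DNS VIP, round-robin across the real servers
ipvsadm -A -u 10.0.0.53:53 -s rr
ipvsadm -a -u 10.0.0.53:53 -r 10.0.0.11:53 -m
ipvsadm -a -u 10.0.0.53:53 -r 10.0.0.12:53 -m
ipvsadm -a -u 10.0.0.53:53 -r 10.0.0.13:53 -m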
A more network-centric solution would be to use two DNS servers with the same (dedicated) IP and anycast routing. (I haven't noticed this answer in this thread so far, but that's what is used here.)
As long as both are up, the nearest server is used. If one goes down, traffic for that IP will be routed to the other node until it comes up again. This especially makes sense if you have two or more locations or data centers.
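In outline (addresses are placeholders): each server carries the shared address on its loopback and BIND listens on it, the /32 is announced into the routing protocol, and the announcement is withdrawn whenever named stops answering so traffic isn't blackholed on a failed node.
# On each DNS server, add the shared anycast address to the loopback:
ip addr add 10.53.0.1/32 dev lo
# and tell BIND to answer on it (named.conf, options block):
#   listen-on { 127.0.0.1; 10.53.0.1; };
# The /32 is then announced into the IGP (OSPF/BGP via bird, quagga, etc.)
# and must be withdrawn when named fails.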
I know this might sound trite, but how about building a more stable, resilient DNS infrastructure as a permanent solution to the problem?