We have a small datacenter with about a hundred hosts pointing to 3 internal DNS servers (BIND 9). Our problem comes when one of the internal DNS servers becomes unavailable. At that point all the clients that point to that server start performing very slowly.
The problem seems to be that the stock Linux resolver doesn't really have the concept of "failing over" to a different DNS server. You can adjust the timeout and number of retries it uses (and set rotate so it will work through the list), but no matter what settings we use, our services perform much more slowly if a primary DNS server becomes unavailable. At the moment this is one of the largest sources of service disruptions for us.
My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...", but if that's an option I haven't seen it.
I was wondering how other folks handle this issue.
I can see 3 possible types of solutions:
Use linux-ha/Pacemaker and failover IPs (so the DNS VIPs are "always" available). Alas, we don't have a good fencing infrastructure, and without fencing Pacemaker doesn't work very well; in my experience it actually lowers availability.
Run a local DNS server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage.
Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd seems to have the right feature set: it marks DNS servers as up or down, and won't use 'down' DNS servers.
Anycast seems to work only at the IP routing level, and depends on route updates to handle server failure. Multicast seemed like it would be a perfect answer, but BIND does not support broadcast or multicast, and the docs I could find suggest that multicast DNS is aimed more at service discovery and auto-configuration than at regular DNS resolution.
Am I missing an obvious solution?
A couple of options. Both will distribute the DNS load across your DNS servers. Add
options rotate
to resolv.conf. This will minimize the impact of the primary server being down; if one of the other servers is down, lookups that rotate onto it will still be slow. This can be combined with
options timeout:1 attempts:5
Increase the attempts if you decrease the timeout, so you can still handle slow external servers.
Depending on your router configuration you may be able to configure your DNS servers to take over the primary DNS server's IP address when it is down. This can be combined with the above techniques.
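Putting the resolver options together, a client's resolv.conf would look something like this (the name server addresses are placeholders):
nameserver 10.0.0.11
nameserver 10.0.0.12
nameserver 10.0.0.13
options rotate timeout:1 attempts:5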
NOTE: I have run for years without unscheduled DNS outages. As others have noted, I would work on solving the issues causing the DNS servers to fail. The above steps also help with misconfigured clients that specify unreachable name servers.
Check out "man resolv.conf". You can add a timeout option to resolv.conf. The default is 5 seconds, but adding the following to resolv.conf should bring it down to 1 second:
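options timeout:1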
Clustering software such as Heartbeat or Pacemaker/Corosync is your friend here. As an example, we've set up Pacemaker/Corosync to keep a floating virtual IP for DNS alive on whichever name server is healthy.
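The exact configuration will vary, but with the pcs shell a floating DNS VIP looks roughly like this (resource name and addresses are just examples):
# A floating address that clients list in resolv.conf; Pacemaker moves it to a
# surviving node if the host carrying it fails.
pcs resource create dns-vip ocf:heartbeat:IPaddr2 ip=10.0.0.53 cidr_netmask=24 \
    op monitor interval=10s
# In practice you would also constrain the VIP to run only on nodes where named
# is actually healthy (e.g. a cloned named resource plus a colocation constraint).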
Production hours are 24x7, but we strongly believe that it should be possible for every server to fail without impacting customers. options rotate is merely a workaround; I wouldn't do that.
FWIW, this is the only workable solution that I have found for this problem. You do need to restrict the server to only listen on localhost, but it has completely eliminated users noticing DNS outages in our environment.
One interesting side effect is that if the localhost server goes down for some reason, the standard resolver libraries seem to handle the failover to the next server much faster than in the standard case.
We have been doing this for about 3 years now and I've not seen a single issue that can be related to the failure/outage of a dns server running on localhost.
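For illustration, a local forwarder is only a few lines of configuration; here sketched with dnsmasq (which software you run is up to you, and the upstream addresses are placeholders):
# /etc/dnsmasq.conf -- caching forwarder bound to localhost only
listen-address=127.0.0.1
bind-interfaces
no-resolv              # don't take upstream servers from /etc/resolv.conf
server=10.0.0.11       # the internal DNS servers
server=10.0.0.12
server=10.0.0.13
all-servers            # query all upstreams in parallel, first answer wins
cache-size=10000

# /etc/resolv.conf on the host then just points at the local cache:
nameserver 127.0.0.1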
If a nameserver is going down for maintenance, it is normal procedure to reduce the timeouts in the SOA for that domain ahead of time, so that when the maintenance occurs, changes (like removing NS records before the maintenance and putting them back after the maintenance) propagate quickly. Note that this is a server-side approach - changing resolvers is a client-side approach and ... unless you can talk to each and every one of your clients and get them to make this adjustment on their machine ... might not be the right approach. Well, I guess you did say only a hundred clients all in a data center using internal DNS servers, but really do you want to change the config on a hundred clients when you can just change the zone?
I'd tell you which values in the SOA to adjust, but I was surfing the web to find out that exact info when I ran across this question.
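For what it's worth, the knobs live in the zone itself: the TTLs on the records control how long resolvers cache them, and the SOA refresh/retry values control how quickly secondaries pick up the change. A BIND-style fragment, with purely illustrative names and values:
example.com.  300  IN  SOA  ns1.example.com. hostmaster.example.com. (
        2024060101  ; serial
        3600        ; refresh - how often secondaries poll for changes
        600         ; retry   - retry interval after a failed refresh
        604800      ; expire  - when secondaries stop serving the zone
        300 )       ; minimum - negative-caching TTL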
Perhaps you can put your DNS servers behind a load balancer? Apparently LVS can balance UDP. Obviously make your LB highly available so it's not a single point of failure.
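A rough sketch with ipvsadm (addresses are placeholders; in practice you'd use something like keepalived to health-check the real servers and to keep the director itself redundant):
# UDP virtual service on the DNS VIP, round-robin across the real servers
ipvsadm -A -u 10.0.0.53:53 -s rr
ipvsadm -a -u 10.0.0.53:53 -r 10.0.0.11:53 -m
ipvsadm -a -u 10.0.0.53:53 -r 10.0.0.12:53 -m
ipvsadm -a -u 10.0.0.53:53 -r 10.0.0.13:53 -m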
A more network-centric solution would be to use two DNS servers with the same (dedicated) IP and anycast routing. (I haven't noticed this answer in this thread so far, but that's what is used here.)
As long as both are up, the nearest server is used. If one goes down, traffic for that IP will be routed to the other node until it comes up again. This especially makes sense if you have two or more locations or data centers.
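In outline (addresses are placeholders): each server carries the shared address on its loopback and BIND listens on it, the /32 is announced into the routing protocol, and the announcement is withdrawn whenever named stops answering so traffic isn't blackholed on a failed node.
# On each DNS server, add the shared anycast address to the loopback:
ip addr add 10.53.0.1/32 dev lo
# and tell BIND to answer on it (named.conf, options block):
#   listen-on { 127.0.0.1; 10.53.0.1; };
# The /32 is then announced into the IGP (OSPF/BGP via bird, quagga, etc.)
# and must be withdrawn when named fails.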
I know this might sound trite, but how about building a more stable, resilient DNS infrastructure as a permanent solution to the problem?