Whenever one of the servers in /etc/resolv.conf is unreachable, Linux/glibc/whatever isn't smart enough to stop retrying it for a while. This makes a lot of services unavailable, because many of them (like SSH) do reverse lookups on all incoming connections, and each lookup hangs for the full time-out of the query to the first DNS server.
How can I make my Ubuntu boxes be smart about the DNS servers they use? I could hack together a bash script that runs every minute and inserts a REJECT rule into iptables for the servers that don't respond to dig queries, but I'd rather not do it that way...
I'm told that Windows does this properly, BTW.
Edit: I worked around it a little bit by putting this in /etc/resolv.conf (or /etc/resolvconf/resolv.conf.d/base):
options timeout:2 rotate
Still not perfect, but more workable.
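To check what a box will actually use after an edit like the options line above, here is a minimal Python sketch that reads resolv.conf-style text and reports the nameservers plus the timeout, attempts and rotate settings. The function name parse_resolv_conf and the sample addresses (from the 192.0.2.0/24 documentation range) are made up for illustration; the defaults of 5 seconds and 2 attempts are the ones documented in resolv.conf(5), and other options are simply ignored here.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) parsed from resolv.conf-style text."""
    nameservers = []
    # Defaults as documented in resolv.conf(5); only these three options are tracked.
    options = {"timeout": 5, "attempts": 2, "rotate": False}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines starting with '#' or ';'.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                name, _, value = opt.partition(":")
                if name in ("timeout", "attempts") and value.isdigit():
                    options[name] = int(value)
                elif name == "rotate":
                    options["rotate"] = True
    return nameservers, options


# Hypothetical file contents; the addresses are placeholders.
SAMPLE = """\
# example only
nameserver 192.0.2.1
nameserver 192.0.2.2
options timeout:2 rotate
"""

if __name__ == "__main__":
    servers, opts = parse_resolv_conf(SAMPLE)
    print(servers)
    print(opts)
```

On a real machine you would pass in open("/etc/resolv.conf").read() instead of SAMPLE.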
Why are the DNS servers becoming unavailable? That's the issue we should focus on fixing...
You should omit the rotate directive if you want to have a deterministic retry order. rotate basically gives you round-robin lookups, which can have undesirable results in your situation.
My DNS /etc/resolv.conf tends to look like:
Short of that, you do have the option of using a caching DNS service on your local machine, or even enabling the Name Service Cache Daemon (nscd). That will help buffer the delays that come with unreliable DNS resolvers.
Ugh. I've come across this same problem in my systems. When the primary DNS server goes offline, the entire system becomes incredibly slow at best.
In fact, I asked a similar question about this quite some time ago: "DNS/resolv.conf settings for a Primary DNS Server failure?" There were some really good answers there that you might find useful.
I wound up just editing /etc/resolv.conf with lower timeout values (options timeout:1), largely because it was the easiest workaround rather than the most effective one. This change means the servers spend less time waiting for dead resolvers: lookups take 2 seconds rather than 10. This is still terrible if you're trying to do anything that isn't a batch job, but at least it resulted in very few service failures.
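The arithmetic behind "2 seconds rather than 10" can be sketched roughly as follows. This is a simplified model, not glibc's actual algorithm: it assumes each unanswered query costs exactly one timeout, whereas a real resolver may scale the timeout between attempts. The function names are made up for illustration, and reading the 10-second figure as two timeouts at the default of 5 seconds is an assumption.

```python
def delay_before_answer(timeout, unanswered_queries):
    """Seconds lost to queries that time out before one finally gets a reply."""
    return timeout * unanswered_queries


def delay_before_giving_up(timeout, attempts, servers):
    """Rough upper bound in seconds when no listed server answers at all."""
    return timeout * attempts * servers


if __name__ == "__main__":
    # Two timed-out queries at the default timeout of 5 seconds.
    print(delay_before_answer(5, 2))
    # The same two timed-out queries with "options timeout:1".
    print(delay_before_answer(1, 2))
    # Every server dead: 3 servers, default 2 attempts, timeout:1.
    print(delay_before_giving_up(1, 2, 3))
```

Under this model the wait scales linearly with the timeout, which is why lowering it helps but never removes the stall completely.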