I have two internal DNS servers set up, and all my servers have both of them in resolv.conf. Our main DNS server went down and suddenly none of the servers could see each other. I manually edited resolv.conf on a few of the servers and commented out the first (down) DNS server, and each of those machines could instantly ping again. What did I do wrong? Does the resolver not automatically switch to the secondary DNS server when the first one times out?
# File managed by puppet
nameserver 192.168.146.100
nameserver 192.168.159.101
;nameserver 72.14.188.5
domain example.com
search example.com
It's likely that the default timeout is too long and that apps are breaking as a result. Keep in mind that the resolver starts with the first entry in /etc/resolv.conf every time it's called (notwithstanding cached entries), so every lookup waits on the dead server before moving on to the next one.
Try adding something like "options timeout:1" (the value is a whole number of seconds and the default is 5; see the man page - http://linux.die.net/man/5/resolv.conf) to let the local resolver try alternate name servers sooner. Be careful of making this value too low, as some recursive lookups can legitimately take quite a while.
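For illustration, a resolv.conf along these lines would shorten the wait on a dead server. This is only a sketch reusing the nameserver addresses from the question; the timeout and attempts values are assumptions to tune for your own network, not tested recommendations.

```
# Illustrative values only; adjust for your environment
nameserver 192.168.146.100
nameserver 192.168.159.101
# timeout:  seconds to wait for a nameserver before trying the next one (default 5)
# attempts: number of times the resolver queries its nameservers before giving up (default 2)
options timeout:1 attempts:2
```

Since the file in the question is marked as managed by puppet, the change presumably has to be made in whatever Puppet source generates it, otherwise a manual edit will be overwritten on the next run.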
In addition to decreasing the timeout, you may want to add
options rotate
which causes the resolver to vary which nameserver it starts with. This means that when the first server is unavailable, at least some of the time the resolver will start by querying the second server. Of course, it does mean that the effect of the second nameserver failing is more noticeable.
It's supposed to be transparent and just work. All I can think of is that there is an abnormal timeout on the first server. Can you reliably reproduce the problem?