Day One
I have to hide the actual host names, so I'm hoping there is still enough information to answer this question...
I'm trying to resolve a certain host name (let's pretend it's www.example.com
, but this is not the actual host name). A simple dig
request works, but when I try to do a series of dig
starting from a root nameserver, I hit a dead-end. Here's an example:
# Starting with arbitrarily-chosen root nameserver
$ dig @198.41.0.4 www.example.com
(returns the usual list of TLD .com nameservers)
# Using a.gtld-servers.net
$ dig @192.5.6.30 www.example.com
(returns a list of 5 example.com authorities)
At this point, I tried each of the 5 example.com
authorities. Three of them fail with status SERVFAIL
, and the remaining two time out. Here's a SERVFAIL
example:
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 33577
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.example.com. IN A
;; Query time: 74 msec
;; SERVER: <intentionally removed>
;; WHEN: Tue Mar 8 10:10:33 2011
;; MSG SIZE rcvd: 37
I tried this multiple times, from my own machine at home and from a remote machine in our co-lo, and both machines consistently get the same results.
However,
- As I mentioned above,
dig www.example.com
(without specifying an@server
) works fine. - This DNS trace utility is able to resolve the host name, and it clearly shows that it's using one of the name servers that times out for me!
Can anybody help me figure out what's going on?
EDIT 1: In case it helps, what should happen is that this host name should ultimately resolve to a CNAME record pointing to www.example.com.edgesuite.net
, which should in turn resolve to another CNAME record pointing to an Akamai edge server.
EDIT 2: Per Joris's recommendation, I ran dig +trace www.example.com
, and it actually failed to find a result. It gets to the same list of example.com
authorities that I found before, and stops there.
Caching seems like a very likely culprit (and I did think of this earlier), but the weird part is that the actual host name isn't that popular. Would it be cached on two different ISP local nameservers if I'm the first person to request it? :-)
Day Two
OK, I've discovered a few things:
- The two
example.com
authorities that I thought were timing out (as opposed to the other three, that were returningSERVFAIL
) are not actually timing out. They just require a much longer timeout. If I usedig +time=10
, for example, then I do eventually get back a result. - I've tried this from several servers around the U.S., and the story is the same -- using
dig www.example.com
returns a result very quickly, butdig @ns1.example.com
(or@ns2.example.com
) requires using a large timeout parameter.
So my new questions are:
- Could the result really be cached on a variety of proxying DNS servers, even though it's not a commonly-used host name? The TTL is 54,000 (or 15 hours, if I understand correctly).
- If not, then is it possible that
ns1.example.com
is somehow configured to return a result more quickly to proxying DNS servers than to my owndig
queries (some kind of white list)? Or is that just crazy talk?
Don't obscure your DNS data when asking for help with DNS problems. It's pointless and silly, and this is a classic example of how it has served to obscure the actual problem that you have.
There are two major possibilities here:
traceroute
or some such to determine that you really do have IP connectivity to them all. Test UDP/IP connectivity to port 53, specifically, if your tool is capable of it.CNAME
resource record set (which you cannot obtain) was served up, and then cached by your resolving proxy DNS server, when mapping the names of some of the higher level content DNS servers to addresses. Given that your resolving proxy DNS server sometimes works, you can find out how it discovered the answer by looking at its query/response logs. (Some DNS server softwares have to have this logging explicitly turned on. Some have it on by default. How to do it for your particular software, which you haven't stated, is the domain of a separate question.)Note that the only caching that occurrs here is local, on your resolving proxy DNS server. The content DNS servers that you are querying don't cache. (Or, more strictly speaking, if they do cache they cache the back-end databases that they are working from, in ways that have little to nothing to do with resource record TTLs, and that aren't publicly visible through the DNS protocol.)
There are a couple of minor and fairly unlikely possibilities as well, including daft firewalls at your site rewriting DNS traffic on the fly as it passes through them. But since you've not provided proper data, there's little more to narrow down and rule-out the possibilities that you can obtain from random passers-by out on Internet.
To eliminate any chance of faulty queries, could you try dig +trace example.com? It will follow the chain for you. If that succeeds, (it will only try one of the authorities at each level), you have at least one working trail.
If multiple tries all fail, there is something broken. Chances are you're seeing cached answers with the "normal" request; expect the breakage to surface as soon as the TTL expires.
I get two authorities:
Both resolve
www.foo.com
correctly.Did you change the number of authorized name servers for your domain ? At the level that you are querying, these are handled at the registrar level. If you saw 5 before and now you see 2, I'd have to guess you made a change to your authorized name server entries ?
Next time try a simple telnet to port 53 of the authorized name servers
This sounds like an ACL issue on the name servers. The best practice is to separate resolvers and authoritative servers. The domain sounds to have 2x authoritative servers and three caching resolvers that have restrictive acls preventing your query for the domain. Your "however" case works due to the request being a query and not asking for recursion.
In bind you should have the following three options specified or the default will be applied, which has changed with bind versions.
allow-recursion { none; };
allow-query { any; };
allow-query-cache { ournets; };
I just found a solution to a similar problem:
Debian server registered at a Windows Server 2003 was not resolved correctly. sometimes I was able to reach the server through its host name, sometimes not.
the problem was an ipv6 address of the Linux machine. disabling ipv6 solved the problem.
Just like to point out one additional possibility, since I just ran into this: someone set the TTL of the DNS record to zero. This actually resolved fine when querying the name server directly, but gave a SERVFAIL when asked trough the local (ubuntu) resolver.
The problem is that you are not using the right queries. You have to ask the root servers for the NS for the TLD (e.g.
dig @ROOT-NS com.
), then ask the TLD bout your domain (e.g.dig @TLD-NS example.com
) then ask the NS for example.com about www.example.com (e.g.dig @eaxmple.com-NS-IP www.example.com
)Edit: Here is a full example: