When the accuracy of a DNS cache is in question, dig +trace tends to be the recommended way of determining the authoritative answer for an internet-facing DNS record. This seems to be particularly useful when paired with +additional, which also shows the glue records.
Occasionally there seems to be some disagreement on this point. Some people say that it relies on the local resolver to look up the IP addresses of the intermediate nameservers, but the command output offers no indication that this is happening beyond the initial list of root nameservers. It seems logical to assume that this wouldn't be the case if the purpose of +trace is to start at the root servers and trace your way down (at least if you have the right list of root nameservers).
Does dig +trace really use the local resolver for anything past the root nameservers?
This is obviously a staged Q&A, but this tends to confuse people often and I can't find a canonical question covering the topic.
dig +trace is a great diagnostic tool, but one aspect of its design is widely misunderstood: the IP of every server that will be queried is obtained from your resolver library. This is very easily overlooked and often only becomes a problem when your local cache has the wrong answer cached for a nameserver.
Detailed Analysis
This is easier to break down with a sample of the output; I'll omit everything past the first NS delegation.
The query for . IN NS (root nameservers) hits the local resolver, which in this case is Comcast (75.75.75.75). This is easy to spot.
The next query is for serverfault.com. IN A and runs against e.root-servers.net., randomly selected from the list of root nameservers we just got. It has an IP address of 192.203.230.10, and since we have +additional enabled it appears to be coming from the glue.
That server then refers us to the com. TLD nameservers.
What the output does not show is that dig did not derive the IP address of e.root-servers.net. from the glue.
In the background, this is what really happened:
+trace cheated and consulted the local resolver to obtain the IP address of the next hop nameserver instead of consulting the glue. Sneaky!
This is usually "good enough" and won't cause a problem for most people. Unfortunately, there are edge cases. If for whatever reason your upstream DNS cache is providing the wrong answer for the nameserver, this model breaks down entirely.
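One way to spot this kind of edge case is to compare the glue from a referral against what the local resolver hands back for the same nameserver names. The following is only a rough sketch under stated assumptions: the function name and the dictionaries are hypothetical, the data is typed in by hand rather than fetched from the network, and 203.0.113.53 is a documentation address standing in for a bad cache entry. A mismatch is a hint worth investigating, not proof that the cache is wrong, since glue itself is not authoritative.

```python
def find_glue_mismatches(glue, cached):
    """Return nameservers whose locally cached addresses disagree with glue.

    glue   -- {nameserver name: IPs seen in the referral's ADDITIONAL section}
    cached -- {nameserver name: IPs returned by the local resolver}
    """
    mismatches = {}
    for name, glue_ips in glue.items():
        cached_ips = cached.get(name)
        # Only compare names the local resolver actually answered for.
        if cached_ips is not None and set(cached_ips) != set(glue_ips):
            mismatches[name] = {"glue": set(glue_ips), "cached": set(cached_ips)}
    return mismatches


# Hypothetical inputs. The glue address is the one from the example above;
# the "stale" entry is made up to illustrate a bad upstream cache.
glue = {"e.root-servers.net.": {"192.203.230.10"}}
stale_cache = {"e.root-servers.net.": {"203.0.113.53"}}
good_cache = {"e.root-servers.net.": {"192.203.230.10"}}

print(find_glue_mismatches(glue, stale_cache))  # reports e.root-servers.net.
print(find_glue_mismatches(glue, good_cache))   # {}
```

If the two sources disagree, querying the nameserver's glue address directly (rather than by name) takes the local cache out of the picture.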
Real world example:
In the above case, +trace will suggest that the domain owner's own nameservers are the source of the problem, and you're one call away from incorrectly telling a customer that their servers are misconfigured. Whether it's something you can (or are willing to) do something about is another story, but it's important to have the right information.
dig +trace is a great tool, but like any tool, you need to know what it does and doesn't do, and how to troubleshoot the issue manually when it proves insufficient.
Edit:
It should also be noted that dig +trace will not warn you about NS records that point at CNAME aliases. This is an RFC violation that ISC BIND (and possibly others) will not attempt to correct. +trace will be completely happy to accept the A record it gets from your locally configured nameserver, whereas if BIND were performing full recursion it would reject the entire zone with a SERVFAIL.
This can be tricky to troubleshoot if glue is present; this will work just fine until the NS records are refreshed, then suddenly break. Glueless delegations will always break BIND's recursion when an NS record points at an alias.
Another way of tracing DNS resolution without using the local resolver for anything except finding the root nameservers is to use dnsgraph (full disclosure: I wrote this). It has a command line tool and a web version, of which you can find an instance at http://ip.seveas.net/dnsgraph/
Example for serverfault.com, which actually has a DNS problem right now:
Very late to this thread, but I think the part of the question as to why a dig +trace uses recursive queries to local resolvers hasn't been directly explained, and this explanation is relevant to the accuracy of dig +trace's results.
After the initial recursive query for the NS records of the root zone, dig may issue subsequent queries to local resolvers under the following conditions:
A referral response needed for the next iterative query is truncated because it exceeds 512 bytes.
dig selects an NS record from the AUTHORITY section of the referral response for which the corresponding A record (glue) is missing from the ADDITIONAL section.
Because dig has only a domain name from the NS record, dig must resolve the name to an IP address by querying the local DNS server. This is the root cause (pun intended, sorry).
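The rule described above can be modelled in a few lines. This is a sketch of the behaviour as I've described it, not dig's actual source: the function name, the nameserver names, and the 192.0.2.2 address are all hypothetical, and real dig picks an NS record at random, whereas this model takes the first one so the result is predictable.

```python
def plan_next_query(ns_names, glue):
    """Decide where the address of the next-hop nameserver comes from.

    ns_names -- NS names from the AUTHORITY section of a referral
    glue     -- {NS name: list of IPs} from the ADDITIONAL section
    """
    chosen = ns_names[0]  # stand-in for dig's random choice
    addresses = glue.get(chosen, [])
    if addresses:
        # Glue is present: the next iterative query can go straight to it.
        return {"server": chosen, "address": addresses[0], "source": "glue"}
    # No glue for the chosen name: only a name is known, so it has to be
    # resolved by asking the local resolver.
    return {"server": chosen, "address": None, "source": "local resolver"}


# Hypothetical referral in which only ns2 came with glue.
referral_ns = ["ns1.example.net.", "ns2.example.net."]
referral_glue = {"ns2.example.net.": ["192.0.2.2"]}

print(plan_next_query(referral_ns, referral_glue))                  # ns1: local resolver
print(plan_next_query(list(reversed(referral_ns)), referral_glue))  # ns2: glue
```

Which branch is taken depends entirely on which NS record happens to be selected, which is why the same trace can behave differently from one run to the next.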
AndrewB has an example which is not fully consonant with what I just described, in that the root zone NS record chosen:
. 121459 IN NS e.root-servers.net.
has a corresponding A record:
e.root-servers.net. 354907 IN A 192.203.230.10
Note, however, that there is no corresponding AAAA record for e-root, nor for some of the other root servers.
Also, note the size of the response:
;; Received 496 bytes from 75.75.75.75#53(75.75.75.75) in 10 ms
496 bytes is a common size for responses that have been truncated (i.e. the next glue record would have been > 16 bytes, putting the response over 512 bytes). In other words, in a query for the NS records of root, a complete AUTHORITY and complete ADDITIONAL (both A and AAAA records) will exceed 512 bytes, so any UDP-based query which does not specify a larger query size via EDNS0 options will get a response that is cut off somewhere in the ADDITIONAL section, as the above trace shows (only f, h, i, j, and k have A and AAAA glue records).
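The byte counts can be checked with some back-of-the-envelope arithmetic. This sketch assumes the standard DNS wire format with full name compression, a question of ". IN NS", 13 root NS records, and no EDNS0 OPT record in the response; the constant and function names are mine. Under those assumptions an A glue record takes 16 bytes and an AAAA glue record takes 28.

```python
HEADER = 12                     # fixed DNS header
QUESTION = 1 + 2 + 2            # root name (1 byte) + QTYPE + QCLASS
RR_FIXED = 2 + 2 + 4 + 2        # TYPE + CLASS + TTL + RDLENGTH
FIRST_NS = 1 + RR_FIXED + 20    # owner "." + "a.root-servers.net." spelled out
OTHER_NS = 1 + RR_FIXED + 4     # owner "." + length byte, letter, 2-byte pointer
A_GLUE = 2 + RR_FIXED + 4       # compressed owner name + IPv4 address = 16
AAAA_GLUE = 2 + RR_FIXED + 16   # compressed owner name + IPv6 address = 28


def response_size(a_records, aaaa_records, ns_records=13):
    """Estimated size in bytes of a '. IN NS' response with the given glue."""
    authority = FIRST_NS + (ns_records - 1) * OTHER_NS
    additional = a_records * A_GLUE + aaaa_records * AAAA_GLUE
    return HEADER + QUESTION + authority + additional


print(response_size(13, 0))    # 436: every A record, no AAAA
print(response_size(13, 13))   # 800: complete glue, well over 512
print(response_size(8, 5))     # 496: one mix that matches the trace size
```

The last line is only one combination that lands on exactly 496. The five AAAA records correspond to f, h, i, j, and k as noted above, but the count of eight A records is an inference from the arithmetic, not something read off the trace. With 16 bytes left before the 512-byte limit, one more A record would fit exactly, but a 28-byte AAAA record would not.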
The lack of an AAAA record for e.root-servers.net and the size of the response to the "NS ." query strongly suggest that the next recursive query was done for the reason I'm claiming. Perhaps the client OS is IPv6-capable and prefers AAAA records, or perhaps there is some other reason.
But in any case, after reading this thread, I looked into the phenomenon of dig +trace performing recursive queries subsequent to the initial one for root. The correspondence between selecting an NS record without a corresponding glue A/AAAA record and dig then sending a recursive query for that record to the local DNS is 100%, in my experience. And the reverse is true--I haven't seen recursive queries when the NS record selected from the referral has a corresponding glue record.