We're experiencing a peculiar issue with our bind
installation (version 9.8.4).
In this scenario, bind
is configured as a caching name server for a small network. For the large majority of queries, everything works fine.
However, we've noticed that for some hosts configured with a very low TTL, we sometimes get NXDOMAIN responses even though the host name exists.
As an example, take www.cdn77.com—here's the output of dig
when run on the name server itself:
$ dig www.cdn77.com
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> www.cdn77.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34440
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 6, ADDITIONAL: 12
;; QUESTION SECTION:
;www.cdn77.com. IN A
;; ANSWER SECTION:
www.cdn77.com. 196 IN CNAME 1669655317.rsc.cdn77.org.
1669655317.rsc.cdn77.org. 0 IN A 185.59.220.12
;; AUTHORITY SECTION:
org. 170517 IN NS a2.org.afilias-nst.info.
org. 170517 IN NS c0.org.afilias-nst.info.
org. 170517 IN NS b0.org.afilias-nst.org.
org. 170517 IN NS d0.org.afilias-nst.org.
org. 170517 IN NS a0.org.afilias-nst.info.
org. 170517 IN NS b2.org.afilias-nst.org.
;; ADDITIONAL SECTION:
a0.org.afilias-nst.info. 170517 IN A 199.19.56.1
a0.org.afilias-nst.info. 170517 IN AAAA 2001:500:e::1
a2.org.afilias-nst.info. 170517 IN A 199.249.112.1
a2.org.afilias-nst.info. 170517 IN AAAA 2001:500:40::1
b0.org.afilias-nst.org. 170517 IN A 199.19.54.1
b0.org.afilias-nst.org. 170517 IN AAAA 2001:500:c::1
b2.org.afilias-nst.org. 170517 IN A 199.249.120.1
b2.org.afilias-nst.org. 170517 IN AAAA 2001:500:48::1
c0.org.afilias-nst.info. 170517 IN A 199.19.53.1
c0.org.afilias-nst.info. 170517 IN AAAA 2001:500:b::1
d0.org.afilias-nst.org. 170517 IN A 199.19.57.1
d0.org.afilias-nst.org. 170517 IN AAAA 2001:500:f::1
;; Query time: 42 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Dec 2 14:27:03 2015
;; MSG SIZE rcvd: 487
And here's an example of when an NXDOMAIN response is returned:
$ dig www.cdn77.com
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> www.cdn77.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 28771
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;www.cdn77.com. IN A
;; ANSWER SECTION:
www.cdn77.com. 327 IN CNAME 1669655317.rsc.cdn77.org.
;; AUTHORITY SECTION:
cdn77.org. 59 IN SOA ns1.cdn77.org. admin.cdn77.com. 1449062655 10800 180 604800 60
;; Query time: 34 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Dec 2 14:24:52 2015
;; MSG SIZE rcvd: 115
We use Google's public name servers as forwarders, and they never seem to respond with NXDOMAIN:
$ dig www.cdn77.com @8.8.8.8
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> www.cdn77.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35091
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.cdn77.com. IN A
;; ANSWER SECTION:
www.cdn77.com. 851 IN CNAME 1669655317.rsc.cdn77.org.
1669655317.rsc.cdn77.org. 0 IN A 185.59.220.11
;; Query time: 40 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Dec 2 14:29:16 2015
;; MSG SIZE rcvd: 85
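For reference, the Google forwarders mentioned above are set up along these lines in named.conf (a simplified sketch; the exact address list and the forward only directive are assumptions, not a copy of the real configuration):
options {
    // Hand recursion to Google Public DNS over both IPv4 and IPv6
    forwarders {
        8.8.8.8;
        2001:4860:4860::8888;
    };
    forward only;  // assumption: no fallback to iterative resolution
};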
The authoritative answer, by the way, looks like this:
$ dig 1669655317.rsc.cdn77.org @ns1.cdn77.org
; <<>> DiG 9.8.4-rpz2+rl005.12-P1 <<>> 1669655317.rsc.cdn77.org @ns1.cdn77.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11529
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;1669655317.rsc.cdn77.org. IN A
;; ANSWER SECTION:
1669655317.rsc.cdn77.org. 1 IN A 185.59.220.12
;; Query time: 20 msec
;; SERVER: 37.235.105.100#53(37.235.105.100)
;; WHEN: Wed Dec 2 14:32:57 2015
;; MSG SIZE rcvd: 58
Interestingly, even though the authoritative TTL for the record is one second, Google's public name servers always reduce it to zero (see this article for an interesting read about this behavior). I don't think this has anything to do with the problem though, as the successful responses from our bind
also show TTL zero.
I've increased bind's logging level, but I find it very hard to identify any entries that might be related to the problem. Even with querylog
activated, all that's visible is the query itself and
resolver: debug 1: createfetch: 1669655317.rsc.cdn77.org A
lines.
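For completeness, the extra logging was enabled with rndc along these lines (the debug level is arbitrary, and the logging channels in named.conf may need adjusting to actually capture the output):
$ rndc querylog   # toggles query logging
$ rndc trace 3    # raises the debug level of the running named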
Any pointers towards how to better diagnose (or even solve) this issue would be greatly appreciated.
The upstream forwarders seem to have inconsistent data, although the cause of this is not clear. One forwarder in your round-robin is returning NXDOMAIN, which is then cached locally: Google's Public DNS over IPv6 (2001:4860:4860::8888) is currently returning NXDOMAIN, despite 8.8.8.8 working correctly (i.e., matching the authoritative answer).
The short-term solution is to remove the offending forwarder, then restart BIND or clear the negative cache.
See Alex Dupuy's answer for a clear explanation of the root cause.
The problem is that the authoritative nameservers for cdn77.org fail to properly handle ECS (EDNS Client-Subnet) options when they contain an IPv6 client subnet, although they handle IPv4 client subnets just fine.
If you build dig with EDNS client-subnet support, you can check this on the command line; alternatively, you can use the online KeyCDN DNS Lookup tool (select the details checkbox, de-select the recursive checkbox, and omit the @ before ns1 when you give it as Custom DNS):
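For example, with a dig build that understands the +subnet option, a query along these lines (the IPv6 prefix below is purely illustrative) sends an IPv6 client subnet and gets the bogus NXDOMAIN back from the authoritative server:
$ dig 1669655317.rsc.cdn77.org A @ns1.cdn77.org +subnet=2001:db8::/56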
The same query with an IPv4 client address works just fine:
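Again assuming +subnet support, and with an illustrative IPv4 prefix:
$ dig 1669655317.rsc.cdn77.org A @ns1.cdn77.org +subnet=192.0.2.0/24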
When you send your query to an IPv6 address for Google Public DNS, your client IP subnet is of course an IPv6 subnet, and when the authoritative server answers NXDOMAIN, the (cached?) answer for IPv6 clients is NXDOMAIN too. If you send your query to an IPv4 address for Google Public DNS, your client subnet is an IPv4 subnet, and you get the correct (possibly cached) answer.
Sorry for the inconvenience; this bug has been causing problems for only a handful of our clients, and Alex Dupuy has provided a great explanation of it. We have added IPv6 EDNS support and enabled IPv6 anycast on our DNS servers, and this problem is now gone.