I haven't changed anything related to the DNS entry for serverfault.com, but some users were reporting today that the serverfault.com DNS fails to resolve for them.
I ran a JustPing query and I can sort of confirm this -- serverfault.com DNS appears to be failing to resolve in a handful of countries, for no particular reason that I can discern. (Also confirmed via What's My DNS, which does some worldwide pings in a similar fashion, so it's confirmed as an issue by two different sources.)
Why would this be happening, if I haven't touched the DNS for serverfault.com?
Our registrar is (gag) GoDaddy, and I use default DNS settings for the most part without incident. Am I doing something wrong? Have the gods of DNS forsaken me?
Is there anything I can do to fix this? Any way to goose the DNS along, or force the DNS to propagate correctly worldwide?
Update: as of Monday at 3:30 am PST, everything looks correct. JustPing reports the site is reachable from all locations. Thank you for the many very informative responses; I learned a lot and will refer to this question the next time this happens.
This is not directly a DNS problem, it's a network routing problem between some parts of the internet and the DNS servers for serverfault.com. Since the nameservers can't be reached the domain stops resolving.
As far as I can tell the routing problem is on the (Global Crossing?) router with IP address 204.245.39.50.
As shown by @radius, packets to ns52 (as used by stackoverflow.com) pass from here to 208.109.115.121 and from there work correctly. However packets to ns22 go instead to 208.109.115.201.
Since those two addresses are both in the same /24 and the corresponding BGP announcement is also for a /24 this shouldn't happen.
I've done traceroutes via my network which ultimately uses MFN Above.net instead of Global Crossing to get to GoDaddy and there's no sign of any routing trickery below the /24 level - both name servers have identical traceroutes from here.
The only times I've ever seen something like this it was broken Cisco Express Forwarding (CEF). This is a hardware level cache used to accelerate packet routing. Unfortunately just occasionally it gets out of sync with the real routing table, and tries to forward packets via the wrong interface. CEF entries can go down to the /32 level even if the underlying routing table entry is for a /24. It's tricky to find these sorts of problems, but once identified they're normally easy to fix.
I've e-mailed GC and also tried to speak to them, but they won't create a ticket for non-customers. If any of you are a customer of GC, please try and report this...
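To illustrate why a stray /32 entry can send one name server down a different path than its neighbour in the same /24, here is a minimal sketch of longest-prefix matching. The forwarding table is hypothetical: it is not taken from any real router, and the /32 entry is only an assumption about what a stale CEF entry might look like. The destination addresses and next hops are the ones quoted in this thread.

```python
import ipaddress

# Destination addresses quoted in this thread for ns52 and ns22.
NS52 = ipaddress.ip_address("208.109.255.26")
NS22 = ipaddress.ip_address("208.109.255.11")

# Hypothetical forwarding table (an assumption, not real router state):
# the /24 mirrors the BGP announcement, the /32 plays the stale entry.
FORWARDING_TABLE = {
    ipaddress.ip_network("208.109.255.0/24"): "208.109.115.121",
    ipaddress.ip_network("208.109.255.11/32"): "208.109.115.201",
}


def same_slash24(a, b):
    """Return True if both addresses fall inside the same /24."""
    net_a = ipaddress.ip_network(f"{a}/24", strict=False)
    net_b = ipaddress.ip_network(f"{b}/24", strict=False)
    return net_a == net_b


def next_hop(dest, table):
    """Return the next hop of the most specific prefix containing dest."""
    matches = [net for net in table if dest in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]


if __name__ == "__main__":
    print(same_slash24(NS52, NS22))            # both live in one /24
    print(next_hop(NS52, FORWARDING_TABLE))    # follows the /24 entry
    print(next_hop(NS22, FORWARDING_TABLE))    # hijacked by the /32 entry
```

With only the /24 in the table, both addresses would get the same next hop; the more specific entry wins as soon as it exists, which is consistent with the behaviour seen in the traceroutes.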
UPDATE at 10:38 UTC: As Jeff has noted the problem has now cleared. Traceroutes to both servers mentioned above now go via the 208.109.115.121 next hop.
Your DNS servers for serverfault.com [ns21.domaincontrol.com, ns22.domaincontrol.com] have been unreachable for the last ~20h, at least from a couple of major ISPs in Sweden [Telia, Tele2, Bredband2].
At the same time the 'neighbor' DNS servers for stackoverflow.com & superuser.com [ns51.domaincontrol.com, ns52.domaincontrol.com] are reachable.
sample traceroute to ns52.domaincontrol.com:
and to ns21.domaincontrol.com
Maybe it's screwed-up filtering, or someone triggered some unwanted DDoS protection and blacklisted some parts of the internet. You should probably contact your DNS service provider, Go Daddy.
You can verify if the problem is [partially] solved by:
edit: traceroutes from working places
poland
germany
edit: all works fine now indeed.
My suggestions: as explained by Alnitak, the problem is not DNS but routing (probably BGP). The fact that nothing was changed in the DNS setup is normal, since the problem was not in the DNS.
serverfault.com today has a very poor DNS setup, certainly insufficient for an important site like this:
We've just seen the result: a routing glitch (something which is quite common on the Internet) is sufficient to make serverfault.com disappear for some users (depending on their operators, not on their countries).
I suggest adding more name servers, located in other ASes. This would provide resilience to failures. You can either rent them from private companies or ask serverfault users to offer secondary DNS hosting (maybe only if the user has > 1000 rep :-)
I can confirm that NS21.DOMAINCONTROL.COM and NS22.DOMAINCONTROL.COM are also unreachable from the ISP Free.fr in France.
Like pQd's traceroute, mine also ends after 208.109.115.201 for both ns21 and ns22.
But ns52.domaincontrol.com (208.109.255.26) does work and is in the same subnet as ns22.domaincontrol.com (208.109.255.11)
As you can see, this time after 204.245.39.50 we go to 208.109.115.121 instead of 208.109.115.201. And pQd has the same traceroute. From a working place I did not cross this 204.245.39.50 router (Global Crossing).
More traceroutes from working and non-working places would help, but it's highly probable that Global Crossing has a bogus routing entry for 208.109.255.11/32 and 216.69.185.11/32, as 208.109.255.10, 208.109.255.12, 216.69.185.10 and 216.69.185.12 are working well.
Why it has a bogus routing entry is hard to know. Probably 208.109.115.201 (Go Daddy) is advertising a non-working route for 208.109.255.11/32 and 216.69.185.11/32.
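The reasoning behind the /32 guess can be written down as a small sketch: starting from a failing address, widen the prefix one bit at a time and stop as soon as it would swallow an address known to work. The function name `widest_clean_prefix` is made up for this illustration, and the lists only contain the addresses reported in this thread, so the result is an inference from a handful of observations, not proof of what the router holds.

```python
import ipaddress

# Observations reported in this thread.
FAILING = ["208.109.255.11", "216.69.185.11"]
WORKING = [
    "208.109.255.10", "208.109.255.12",
    "216.69.185.10", "216.69.185.12",
]


def widest_clean_prefix(failing, working):
    """Widest prefix containing `failing` but none of the `working` hosts."""
    working_addrs = [ipaddress.ip_address(w) for w in working]
    net = ipaddress.ip_network(failing)  # a bare address parses as a /32
    while net.prefixlen > 0:
        wider = net.supernet()  # one bit shorter prefix
        if any(addr in wider for addr in working_addrs):
            break  # widening further would include a working host
        net = wider
    return net


if __name__ == "__main__":
    for host in FAILING:
        print(host, "->", widest_clean_prefix(host, WORKING))
```

Because the immediate neighbours .10 and .12 both work, the loop cannot widen past /32 for either failing address, which matches the host-route hypothesis above.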
EDIT: You can telnet to route-server.eu.gblx.net to connect to the Global Crossing route server and do traceroutes from within the Global Crossing network.
EDIT: It seems that the same problem already occurred with other name servers a few days ago, see: http://www.newtondynamics.com/forum/viewtopic.php?f=9&t=5277&start=0
It would be handy to see a detailed resolution trace from the locations that are failing, to see which layer of the resolution path is failing. I'm not familiar with the service you're using, but perhaps it's an option somewhere.
Failing that, it's most likely that the problems are "lower down" in the tree, as failures at the root or TLDs would affect more domains (you'd hope). To increase resilience, you can delegate to a second DNS service to ensure better redundancy in resolution if there are problems with domaincontrol's network(s).
I'm surprised you don't host your own DNS. The advantage of doing it that way is if the DNS is reachable, so is (hopefully) your site.
From UPC at least, I get this response when trying to get your A record from your authoritative server (ns21.domaincontrol.com).
When I try the same thing from a machine on a different network (OVH), I get an answer
I get similar behaviour for a couple of other domains, so I assume that UPC (at least) is silently redirecting DNS queries to their own caching nameserver, and spoofing the replies. If your DNS had misbehaved briefly, this could explain it as UPC's nameservers may be caching the NXDOMAIN response.
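For anyone who wants to repeat this comparison from several networks without dig, here is a rough sketch that sends a single non-recursive A query straight to a name server over UDP and reports the response code, the AA (authoritative answer) flag and the answer count. The helper names `build_query`, `parse_header` and `ask` are invented for this example, and it only reads the 12-byte header. A reply without the AA flag, or an NXDOMAIN that other networks don't see, is a hint (not proof) that something in between answered instead of the real server.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal non-recursive DNS query (RFC 1035); qtype 1 = A."""
    # Header: id, flags (all clear, so RD is off), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN


def parse_header(response):
    """Return (id, rcode, aa_flag, answer_count) from a DNS response."""
    query_id, flags, _qd, ancount, _ns, _ar = struct.unpack(
        "!HHHHHH", response[:12]
    )
    return query_id, flags & 0x000F, bool(flags & 0x0400), ancount


def ask(server, name, timeout=3.0):
    """Query `server` directly; return the parsed header or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        try:
            data, _addr = sock.recvfrom(512)
        except socket.timeout:
            return None
    return parse_header(data)


# Example (needs network access):
#   ask("ns21.domaincontrol.com", "serverfault.com")
# rcode 0 is NOERROR and rcode 3 is NXDOMAIN; None means no reply in time.
```

Running the same call from two different ISPs and comparing the tuples gives a quick view of whether they are really talking to the same server.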