I have two web servers at our colocation facility running CentOS 6.0. One runs our main marketing website (the production server) and the other is a staging server for it, so it's almost an exact replica. Both are behind a firewall and have private IP addresses. The firewall is connected to our main office with a site-to-site VPN tunnel. Both servers are set up to use our internal DNS servers here in our main office as their nameservers.
On the production server, I'm facing this exact same issue, even with the same hostname of phx1-ss-2-lb.cnet.com. Whenever I ping a domain name that doesn't exist, I get that cnet.com hostname in return. Even on my own domains, if I ping somestupidsubdomain.mydomain.com, it returns the CNET address. In that thread, they said it was NXDOMAIN hijacking and that they should use different nameservers. In my situation, this production server uses the same nameservers as everyone else in the company, but nobody else has the issue. Even the staging server that mirrors the production server doesn't have it.
I've checked the /etc/hosts file and there's nothing out of the ordinary in it. I looked up how to flush the local DNS cache through either nscd or BIND, and neither is even installed. I used nslookup to query my two assigned DNS servers, and they came back with domain-not-found errors, as expected.
Where should I look next?
EDIT
I ran tcpdump on port 53, then pinged a gibberish domain, and this is the output I got:
14:55:39.884442 IP 192.168.4.11.59726 > 192.168.0.22.domain: 27749+ A? asdfjjjf.com. (30)
14:55:39.905778 IP 192.168.0.22.domain > 192.168.4.11.59726: 27749 NXDomain 0/1/0 (103)
14:55:39.905930 IP 192.168.4.11.46752 > 192.168.0.22.domain: 18476+ A? asdfjjjf.com.com. (34)
14:55:39.926982 IP 192.168.0.22.domain > 192.168.4.11.46752: 18476 2/0/0 CNAME phx1-ss-2-lb.cnet.com., A 64.30.224.112 (82)
14:55:39.962067 IP 192.168.4.11.44686 > 192.168.0.22.domain: 5275+ PTR? 112.224.30.64.in-addr.arpa. (44)
14:55:39.983324 IP 192.168.0.22.domain > 192.168.4.11.44686: 5275 1/0/0 PTR phx1-ss-2-lb.cnet.com. (79)
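A trace like this is easier to read once each reply is paired with the query that shares its DNS transaction ID (27749, 18476, 5275). The Python sketch below does that for the six packets above. The regular expressions are an assumption that only covers the line shapes shown here (`ID+ TYPE? name` for queries, `ID summary` for replies), not tcpdump's full output grammar, and `pair_queries` is a hypothetical helper name.

```python
import re

# The six packets from the capture above, one per line.
TRACE = """\
14:55:39.884442 IP 192.168.4.11.59726 > 192.168.0.22.domain: 27749+ A? asdfjjjf.com. (30)
14:55:39.905778 IP 192.168.0.22.domain > 192.168.4.11.59726: 27749 NXDomain 0/1/0 (103)
14:55:39.905930 IP 192.168.4.11.46752 > 192.168.0.22.domain: 18476+ A? asdfjjjf.com.com. (34)
14:55:39.926982 IP 192.168.0.22.domain > 192.168.4.11.46752: 18476 2/0/0 CNAME phx1-ss-2-lb.cnet.com., A 64.30.224.112 (82)
14:55:39.962067 IP 192.168.4.11.44686 > 192.168.0.22.domain: 5275+ PTR? 112.224.30.64.in-addr.arpa. (44)
14:55:39.983324 IP 192.168.0.22.domain > 192.168.4.11.44686: 5275 1/0/0 PTR phx1-ss-2-lb.cnet.com. (79)
"""

# Query lines: ": <id>+ <type>? <name> (<len>)"
# Reply lines: ".domain > <client>: <id> <summary> (<len>)"
# (assumption: only the formats that appear in this capture)
QUERY_RE = re.compile(r": (\d+)\+ (\w+)\? (\S+) \(\d+\)$")
REPLY_RE = re.compile(r"\.domain > \S+: (\d+) (.+) \(\d+\)$")


def pair_queries(trace):
    """Return (qtype, qname, reply_summary) tuples, matched by DNS ID."""
    pending = {}
    pairs = []
    for line in trace.splitlines():
        q = QUERY_RE.search(line)
        if q:
            pending[q.group(1)] = (q.group(2), q.group(3))
            continue
        r = REPLY_RE.search(line)
        if r and r.group(1) in pending:
            qtype, qname = pending.pop(r.group(1))
            pairs.append((qtype, qname, r.group(2)))
    return pairs


for qtype, qname, answer in pair_queries(TRACE):
    print(f"{qtype:4} {qname:30} -> {answer}")
```

Laid out this way, the CNET CNAME belongs to the second query, `asdfjjjf.com.com.`, and not to the name that was actually typed.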
So if I'm reading this right, does that mean my DNS server is definitely responding with the cnet.com address? If I use nslookup, point it at the 192.168.0.22 server, and query a gibberish domain's A record, it returns nothing.
Aha! You've got a search suffix of com: your first query, to asdfjjjf.com, got the proper NXDOMAIN, while the second, to asdfjjjf.com.com, came back with the accurate information for what's apparently a wildcard CNAME at *.com.com. Drop that search suffix, and you should be fine.
There's now a more detailed discussion going on over at
http://centos.org/modules/newbb/viewtopic.php?topic_id=36693&forum=59
Using "strace" on "ping" has made it clear that the problem really is in the local libraries. The trace shows the DNS calls, and the local library really is sticking an extra ".com" on DNS request retries. The trace clearly shows the library making a DNS request of "noexample.com", then trying "noexample.com.com", then using the result from "noexample.com.com" for pinging.
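The retry described above can be modeled in a few lines. The Python sketch below is a simplified model of the stub resolver's search logic as I understand glibc's res_search, not its actual code. A name with a trailing dot is queried only as typed. A name with at least `ndots` dots (default 1) is tried as typed first, then with each search domain appended. A name with fewer dots gets the search domains first and the bare name last. The search list `["com"]` matches the suffix seen in the tcpdump capture; `candidate_names`, `resolve`, and the `fake_dns` upstream with a wildcard under com.com are all hypothetical.

```python
def candidate_names(name, search, ndots=1):
    """Return the names a stub resolver would query, in order (simplified model).

    The resolver uses the first candidate that gets a positive answer, so a
    hit on a suffixed name is used even after the name as typed got NXDOMAIN.
    """
    if name.endswith("."):
        # Absolute name: the search list is not used at all.
        return [name]
    as_typed = name + "."
    suffixed = [f"{name}.{domain.strip('.')}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as typed first, then each search domain.
        return [as_typed] + suffixed
    # Too few dots: search domains first, the bare name last.
    return suffixed + [as_typed]


def fake_dns(qname):
    """Hypothetical upstream DNS: a wildcard under com.com plus one real name."""
    if qname.endswith(".com.com."):
        return "CNAME phx1-ss-2-lb.cnet.com."
    if qname == "example.com.":
        return "A (real record)"
    return None  # NXDOMAIN


def resolve(name, search, lookup, ndots=1):
    """Return (queried_name, answer) for the first candidate `lookup` answers."""
    for candidate in candidate_names(name, search, ndots):
        answer = lookup(candidate)
        if answer is not None:
            return candidate, answer
    return None, None  # every candidate got NXDOMAIN


search = ["com"]  # the search suffix seen in the capture
print(candidate_names("noexample.com", search))  # ['noexample.com.', 'noexample.com.com.']
print(candidate_names("example", search))        # ['example.com.', 'example.']
print(resolve("noexample.com", search, fake_dns))  # ('noexample.com.com.', 'CNAME phx1-ss-2-lb.cnet.com.')
print(resolve("noexample.net", search, fake_dns))  # (None, None)
```

In this model the .net name fails only because the hypothetical wildcard sits under com.com, which would fit the ".com only" observation later in the thread.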
I've seen exactly the same situation on a dedicated server co-located at Codero. It's a full dedicated server, 64-bit CentOS 6, no virtualization, administered with Webmin. It doesn't run "named"; all DNS queries are sent to Codero's in-house DNS servers. As with the example above, "ping" (and anything that uses getaddrinfo), given a nonexistent domain in ".com", will return a host at CNET.
However, "nslookup" and "host" correctly fail to find "noexample.com", so Codero's DNS servers aren't doing this.
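One way to reproduce what ping sees without using ping is Python's `socket.getaddrinfo`, which on Linux calls the C library's getaddrinfo and so goes through the same resolv.conf handling. Comparing a name with and without a trailing dot is a quick check, because a trailing dot marks the name as absolute and the search list is skipped. This is a minimal sketch; `addresses` and `compare` are hypothetical helper names, and `noexample.com` is just the example name from this thread.

```python
import socket


def addresses(name):
    """Return the sorted IPv4 addresses libc's getaddrinfo gives for `name`.

    Returns an empty list if the lookup fails, e.g. when every candidate
    name (as typed and with search domains appended) got NXDOMAIN.
    """
    try:
        infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def compare(name):
    """Print getaddrinfo's result for `name` as typed and as an absolute name."""
    absolute = name if name.endswith(".") else name + "."
    for candidate in (name, absolute):
        print(f"{candidate:20} -> {addresses(candidate) or 'not found'}")
```

On an affected machine, I'd expect `compare("noexample.com")` to show the CNET address for the name as typed and "not found" for `noexample.com.`, which would point at the local search list rather than the DNS servers.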
/etc/resolv.conf (generated by Webmin) is just this:
nameserver 69.64.66.11
nameserver 69.64.66.10
If I try "noexample.net", it doesn't find an IP address. It's only a .com problem.
I've noticed that "getaddrinfo" now tries sticking a ".com" on the end of things that don't resolve. If I try to resolve "example", it finds "example.com". So I get the A record idea.
This looks like a bug in "getaddrinfo". It should never add ".com" to something that already has it.
Here's what I think is going on. See the man page for "resolv.conf":
http://linux.die.net/man/5/resolv.conf
Note what the default is:
domain Local domain name. Most queries for names within this domain can use short names relative to the local domain. If no domain entry is present, the domain is determined from the local hostname returned by gethostname(2); the domain part is taken to be everything after the first '.'. Finally, if the hostname does not contain a domain part, the root domain is assumed.
In this case, the server's hostname is "sitetruth.com", so the "domain part" is "com", and any failed lookup is retried with ".com" appended.
Why doesn't this happen all the time? Because most servers have names assigned by some hosting service, like "gator123.hostgator.com". In such cases, the default domain is "hostgator.com", and that's what gets appended on failed domain searches. If your server has a two-component name as its main name, though, there's a problem.
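The fallback rule quoted from the man page is simple enough to write down. This Python sketch models only that rule (everything after the first '.' of the hostname, or the root domain if there is none); it ignores the LOCALDOMAIN environment variable and any domain or search line in resolv.conf, both of which take precedence. `default_domain` is a hypothetical helper name.

```python
def default_domain(hostname):
    """Return the resolver's fallback local domain for `hostname`.

    Per resolv.conf(5): with no domain/search entry, the domain is everything
    after the first '.' in the hostname; with no '.', the root domain is used
    (represented here as an empty string, i.e. effectively nothing appended).
    """
    _, dot, rest = hostname.partition(".")
    return rest if dot else ""


for host in ("sitetruth.com", "gator123.hostgator.com", "localhost"):
    suffix = default_domain(host) or "(none)"
    print(f"{host:24} -> {suffix}")
```

For "sitetruth.com" this gives "com", while "gator123.hostgator.com" gives "hostgator.com": a bare second-level name as the hostname leaves a top-level domain as the search suffix.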
The default in "resolv" is badly chosen.
Going back to the original question, where the problem occurred only on the production server, I'll bet that the production server has a name like "companyname.com", while the test server has a longer name, like "test.companyname.com". That's enough to create this situation.
Setting "ndots" to 0, or providing an empty "search" line ought to disable this behavior, but so far, it's not doing so. So I don't have a fix yet.
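One possible reason the empty "search" line had no effect, offered as an assumption from my reading of glibc's res_init rather than anything verified on CentOS 6: a "domain" or "search" keyword with nothing after it appears to be skipped, so the resolver still falls back to the hostname. The Python sketch below models that; `effective_search_list` is a hypothetical helper, and the model ignores LOCALDOMAIN and the "options" line. Separately, in the tcpdump capture earlier in the thread, asdfjjjf.com was queried as typed before asdfjjjf.com.com was tried, which suggests "ndots" only changes the order of the attempts rather than suppressing the suffixed retry.

```python
def effective_search_list(resolv_conf, hostname):
    """Return the search list a glibc-style resolver would use (simplified model).

    Assumptions (not verified against glibc source): the last 'domain' or
    'search' line wins, a keyword with no arguments is skipped, and with
    neither present the domain is taken from the hostname after the first '.'.
    """
    search = []
    for line in resolv_conf.splitlines():
        words = line.split()
        if not words or words[0].startswith(("#", ";")):
            continue  # blank line or comment
        keyword, args = words[0], words[1:]
        if keyword in ("domain", "search") and args:
            # 'domain' takes a single name; 'search' takes a list.
            search = args[:1] if keyword == "domain" else args
    if not search:
        # Hostname fallback; no '.' means the root domain (nothing appended).
        _, dot, rest = hostname.partition(".")
        search = [rest] if dot else []
    return search


CONF = "nameserver 69.64.66.11\nnameserver 69.64.66.10\n"
print(effective_search_list(CONF, "sitetruth.com"))               # ['com']
print(effective_search_list(CONF + "search\n", "sitetruth.com"))  # ['com']
# Hypothetical: a non-empty search line replaces the hostname fallback.
print(effective_search_list(CONF + "search sitetruth.com\n", "sitetruth.com"))  # ['sitetruth.com']
```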