For analytics purposes, I'm looking at large sets of IP addresses in server log files. I'm trying to perform reverse-DNS lookups to understand where traffic is coming from - e.g. what percentage of IPs resolve to corporations, schools, government, international etc.
Despite a bunch of optimizations, doing an individual reverse-DNS lookup for every IP address still appears to be fairly expensive. So:
is there any way to obtain an entire range of IPs from a single reverse-DNS lookup?
If yes, this could greatly reduce the number of actual reverse-DNS lookups.
Example (numbers slightly obfuscated):
- Log file contains a request from an IP
128.151.162.17
- Reverse DNS resolves to
11.142.152.128.in-addr.arpa 21599 IN PTR alamo.ceas.rochester.edu
- (So this is a visitor from the University of Rochester, rochester.edu)
- Now, would it be safe to assume that at least all IPs from
128.151.162.*
will also resolve to rochester.edu? - What about
128.151.*.*
? Is there a way to get the exact IP range?
Not really, no; in extremely rare cases you might be able to do a DNS zone transfer query to get all the records in the zone (the whole /24, generally), but there's a very low chance that the name server you're querying will respond to this request. Expect one query per address for reverse DNS (sorry!).
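For reference, the one-query-per-address pattern looks roughly like this in Python. This is a minimal sketch using only the standard library; `ptr_name` and `reverse_lookup` are illustrative names, not part of any existing tool.

```python
import ipaddress
import socket
from typing import Optional


def ptr_name(ip: str) -> str:
    """Return the in-addr.arpa (or ip6.arpa) name that a PTR query asks for."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip: str) -> Optional[str]:
    """Resolve one address to a hostname, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        # No PTR record, or the resolver could not answer.
        return None
```

`ptr_name("128.151.162.17")` gives `17.162.151.128.in-addr.arpa`, which is the name the resolver actually queries. Each call to `reverse_lookup` is a separate, blocking query.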
Generally speaking, probably: as a university, they're likely to own the whole /24. However, that's not a good rule to apply in the general case; a smaller school might not have a whole /24, or might not have all of it in reverse DNS.
The reverse DNS itself is going to be pretty hit-or-miss - in many cases it'll be just generated names under the ISP's hostnames or no records at all. For better data, we're going to make things even more expensive - you should also look at data from whois.
For example, here's the info from that Rochester IP - it shows the size of the allocation (the whole /16 range, so in this case that applies to
128.151.*.*
) and the organization it's allocated to. The whois info should provide a great source of truth for the info you want, and has the upside of showing which range it applies to. The downside is that for smaller allocations, a range will often just show as belonging to the ISP instead of the end customer. Combining both whois and reverse DNS should provide the best information (and be ridiculously slow).
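Because whois returns a range, one whois answer can cover many log entries. Here is a rough Python sketch of that idea, assuming you already have some way to run the whois query: `fetch_whois` is a hypothetical caller-supplied function returning a `(network, organization)` pair, and `NetblockCache` is just an illustrative name.

```python
import ipaddress
from typing import Callable, List, Optional, Tuple

# (network, organization) as extracted from a whois response
Netblock = Tuple[ipaddress.IPv4Network, str]


class NetblockCache:
    """Remember whois allocations so one lookup covers every IP in the range."""

    def __init__(self, fetch_whois: Callable[[str], Netblock]):
        # fetch_whois is hypothetical: it runs the real whois query for an IP
        # and returns the allocation that contains it.
        self._fetch_whois = fetch_whois
        self._blocks: List[Netblock] = []

    def _find(self, addr) -> Optional[Netblock]:
        matches = [block for block in self._blocks if addr in block[0]]
        # Netblocks can nest; prefer the most specific (longest prefix) match.
        return max(matches, key=lambda block: block[0].prefixlen, default=None)

    def organization(self, ip: str) -> str:
        addr = ipaddress.ip_address(ip)
        block = self._find(addr)
        if block is None:
            # Cache miss: pay for one whois query, then reuse it for the range.
            block = self._fetch_whois(ip)
            self._blocks.append(block)
        return block[1]
```

With a `/16` cached for `128.151.0.0/16`, every later address in `128.151.*.*` is answered without another whois query. One caveat of this sketch: if a smaller, more specific allocation exists inside a range that is already cached, it will never be discovered.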
You can generally get info about netblocks from whois (eg
whois 128.151.162.17
refers to CIDR: 128.151.0.0/16
), but you'll probably find that there's some variation in the format of the responses you get depending on which registry is involved, and also that whois servers are likely to cap the number of requests you can make. Also note that netblocks are typically nested, with smaller ones inside larger ones, so you may get info about multiple netblocks for one IP.
The DNS packet format allows multiple questions in one request, but in practice servers rarely support more than one, so don't count on that for a speed-up. The main techniques you need are to parallelise requests and to cache responses.
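A minimal Python sketch of the "parallelise and cache" approach, under a few assumptions: `resolve` is any blocking lookup function you supply (one query per call), `resolve_all` is an illustrative name, and the default of 32 workers is an arbitrary starting point rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def resolve_all(
    ips: Iterable[str],
    resolve: Callable[[str], Optional[str]],
    max_workers: int = 32,
) -> Dict[str, Optional[str]]:
    """Resolve each distinct IP once, running the lookups concurrently."""
    # De-duplicate while keeping first-seen order, so repeated log entries
    # for the same address cost only one lookup.
    unique = list(dict.fromkeys(ips))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each IP with its result.
        return dict(zip(unique, pool.map(resolve, unique)))
```

Threads suit this workload because each lookup spends nearly all its time waiting on the network. The returned dictionary doubles as the cache for the batch; keep `max_workers` modest so you don't flood the resolver.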
General advice about this kind of algorithm:
Generally you'll find the data is nearly infinitely cacheable. The data changes so rarely that you might as well do the lookups in batch and save the results to an on-disk cache that all your code uses. The TTL on the data might be 1 hour, but when I was on the internet mapping project we found that, as far as which domain an address belonged to, the data was stable for more than a year.
If you are doing a lot of DNS queries, rate-limit how many you send to any particular DNS server. Otherwise it is rude at best, and a DoS attack at worst.
If you are doing the lookups "on demand", just use some kind of write-through cache and you should be fine.
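The last two points can be sketched together in Python. Everything here is an assumption for illustration: the class names, the use of SQLite as the on-disk store, and the choice to ignore TTLs entirely (per the advice above). Neither class is thread-safe as written.

```python
import sqlite3
import time
from typing import Callable, Dict, Optional


class RateLimiter:
    """Enforce a minimum interval between queries to each DNS server."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}  # server -> time of last query

    def wait(self, server: str) -> None:
        """Block until it is polite to query this server again."""
        last = self._last.get(server)
        if last is not None:
            remaining = self._min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last[server] = self._clock()


class WriteThroughCache:
    """Keep answers in memory and write each new one straight to SQLite."""

    def __init__(self, path: str, resolve: Callable[[str], Optional[str]]):
        self._resolve = resolve  # caller-supplied lookup function
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS ptr (ip TEXT PRIMARY KEY, host TEXT)"
        )
        # Load whatever earlier runs already stored on disk.
        self._memory: Dict[str, Optional[str]] = dict(
            self._db.execute("SELECT ip, host FROM ptr")
        )

    def lookup(self, ip: str) -> Optional[str]:
        if ip in self._memory:
            return self._memory[ip]
        host = self._resolve(ip)
        self._memory[ip] = host
        with self._db:  # commits the insert on success
            self._db.execute(
                "INSERT OR REPLACE INTO ptr (ip, host) VALUES (?, ?)", (ip, host)
            )
        return host
```

A typical arrangement would be to call `limiter.wait(server)` inside the `resolve` function passed to the cache, so that only cache misses count against the rate limit. Failed lookups (`None`) are cached too, which avoids re-querying addresses that have no PTR record.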