Is there an official API to iplists.com from where I can get the list of spiders?
My intention is to whitelist these IPs for site scraping.
Not that I know of, and it could change at any time at the discretion of the bot operators.
Google offers some specific guidance and explanation on this, and they suggest using a DNS check (forward and reverse) to verify that a visitor claiming to be their crawler really is one.
This is probably the best general advice, but it is somewhat resource-intensive (each verification costs a pair of DNS lookups).
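A minimal sketch of that forward-and-reverse check in Python, using only the standard library. The `.googlebot.com`/`.google.com` suffixes are the ones Google documents for Googlebot; other crawlers use their own domains, so treat this as a template rather than a complete verifier:

```python
import socket

def is_verified_googlebot(ip):
    """Double DNS check: reverse-resolve the IP, require a Google
    crawler hostname, then forward-resolve that hostname and make
    sure it maps back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward lookup
    except OSError:
        return False
    return ip in addrs
```

Since both lookups hit the network, you would normally cache the result per IP rather than re-checking on every request.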
There's no list of IP addresses for "good" search engine bots that I know of, and if there were it would be horribly out of date pretty quickly, as you've already discovered.
One thing you can do is to create a bot trap. This is simple in theory: you create a page that is linked to in your web site but hidden from normal users (e.g. via CSS tricks), and then `Disallow` it in `robots.txt`. You then wait a week, since legitimate search engines may cache `robots.txt` for that long, and then start banning anything that hits the trap page (e.g. with fail2ban).

Some major crawlers do publish their IP ranges or verification docs:

Google bot: https://developers.google.com/search/apis/ipranges/googlebot.json
Bing bot: https://www.bing.com/toolbox/bingbot.json
Facebook: https://developers.facebook.com/docs/sharing/webmasters/crawler/
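If you go the whitelist route, the published range files above are JSON lists of CIDR prefixes, which Python's `ipaddress` module can match directly. A sketch using a small inline sample in the same shape as Google's `googlebot.json` (the real file has many more prefixes, and you would fetch and refresh it periodically rather than hard-coding it):

```python
import ipaddress
import json

# Sample data mimicking the structure of googlebot.json.
SAMPLE = json.loads("""
{
  "prefixes": [
    {"ipv4Prefix": "66.249.64.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
""")

def load_networks(doc):
    """Parse the prefix list into ipaddress network objects."""
    nets = []
    for entry in doc["prefixes"]:
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        nets.append(ipaddress.ip_network(prefix))
    return nets

def ip_in_ranges(ip, nets):
    """True if the IP falls inside any published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)

nets = load_networks(SAMPLE)
print(ip_in_ranges("66.249.64.5", nets))  # True: inside 66.249.64.0/27
print(ip_in_ranges("203.0.113.9", nets))  # False: not in any range
```

Remember the caveat above: these files change at the operators' discretion, so re-download them on a schedule instead of baking the ranges into your config.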