Googlebot is crawling my site right now and it's killing my server. It's only crawling one or two pages a second, but those pages are really CPU-intensive. I have already added those CPU-intensive files to the robots.txt file, but Googlebot hasn't picked up those changes yet. I want to block Googlebot at the apache.conf level so my site can come back right now. How can I do this? This one Apache instance is hosting a few PHP sites and a Django-powered site, so I can't use .htaccess files. The server is running Ubuntu 10.04.
I see you are currently trying to use glob patterns in your robots.txt.
From the Web Robots pages: "Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines."
You would either need to do what Arenstar or Tom O'Connor recommend (that is, use an Apache ACL to block them, or drop the traffic at the IP level), or possibly route the IP addresses via 127.0.0.1 (that would stop them from establishing TCP sessions in the first place).
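For the routing variant, something along these lines should do it (the address is only a placeholder for whichever Googlebot IP shows up in your logs):

    # Send replies destined for the crawler's address to loopback,
    # so its TCP handshakes never complete. 66.249.66.1 is a placeholder.
    route add -host 66.249.66.1 gw 127.0.0.1
    # Roughly equivalent with iproute2:
    ip route add blackhole 66.249.66.1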
Long-term, consider whether you can place all your CPU-intensive pages under a common prefix; then you'll be able to use robots.txt to instruct crawlers to stay away from them.
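For example, if the heavy pages all sat under a hypothetical /reports/ prefix, a robots.txt in the document root like this would tell well-behaved crawlers to skip them:

    # /reports/ is an assumed prefix for the CPU-intensive pages
    User-agent: *
    Disallow: /reports/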
First, use a robots.txt file in your document root directory; spiders and bots normally look for this file before beginning a scan.
Second, use a .htaccess file (this could also be put in your Apache configs, though the syntax needs a small change); the article below describes the approach, and a sketch follows it.
http://www.besthostratings.com/articles/block-bad-bots.html
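That approach boils down to flagging requests by User-Agent and denying them. A minimal sketch for Apache 2.2 (the <Directory> path and the exact user-agent match are assumptions to adapt to your own vhosts):

    # Flag requests whose User-Agent contains "Googlebot" (case-insensitive)
    SetEnvIfNoCase User-Agent "Googlebot" block_bot
    <Directory "/var/www">
        Order Allow,Deny
        Allow from all
        # Deny anything flagged above; everyone else is still allowed
        Deny from env=block_bot
    </Directory>

This way the expensive PHP/Django code never runs for those requests, and the block is trivial to remove once Googlebot has re-read robots.txt.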
Hope this helps.. :D
If you know Googlebot's IP address, you could set a DROP rule in iptables, but that's a real hack.
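Something along these lines (a sketch; adjust the chain to match your firewall layout):

    iptables -I INPUT -s [source ip] -j DROP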
where [source ip] is the googlebot's IP.
This'd definitely stop them, instantly, but it's a bit... low-level.
To unblock:
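Presumably the mirror-image command, deleting the rule you inserted above:

    iptables -D INPUT -s [source ip] -j DROP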
Assuming you don't actually want your site delisted from Google (which the accepted answer will eventually cause), set a crawl-delay value for your site in Google Webmaster Tools. It is reported that Google does not support Crawl-Delay in robots.txt, though you may wish to set that value for other search engines and crawlers to use.
We wanted to block a specific directory from robots. We had a robots.txt entry, but it's being ignored by many robots, so we added the snippet below to our Apache configuration file; note that we commented out the Wget line because we wanted to allow that. It works by blocking based on the HTTP_USER_AGENT.
The list comes (obviously) from http://www.javascriptkit.com/howto/htaccess13.shtml; when we modify configuration files with information we get from the Web, we always put in the back-pointer so we know where it came from.
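A trimmed sketch of that kind of block, with only a few illustrative user-agent entries from the linked list and a placeholder path standing in for the directory we wanted to protect (the Wget condition is left commented out, as described above):

    RewriteEngine On
    # Illustrative subset of the bad-bot list from the article above
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
    # Commented out because we want to allow Wget:
    # RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC]
    # /private/ is a placeholder for the directory being blocked
    RewriteRule ^/private/ - [F,L]

mod_rewrite has to be enabled for this, and in the main server config (as opposed to .htaccess) the RewriteRule pattern matches the full URL path, leading slash included.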