Googlebot is crawling my site right now and it's killing my server. It's only crawling one or two pages a second, but those pages are really CPU-intensive. I have already added those CPU-intensive files to the robots.txt file, but Googlebot hasn't picked up those changes yet. I want to block Googlebot at the apache.conf level so my site can come back right now. How can I do this? This one Apache instance is hosting a few PHP sites and a Django-powered site, so I can't use .htaccess files. The server is running Ubuntu 10.04.
I see you are currently trying to use glob patterns in your robots.txt.
From the Web Robots pages: "Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines."
You would either need to do what Arenstar or Tom O'Connor recommend (that is, use an Apache ACL to block them, or drop the traffic at the IP level), or possibly route the IP addresses via 127.0.0.1 (that would stop them from establishing TCP sessions in the first place).
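For the routing variant, something along these lines should do it (the address is only a placeholder for whichever Googlebot IP shows up in your logs):

    # Send replies destined for the crawler's address to loopback,
    # so its TCP handshakes never complete. 66.249.66.1 is a placeholder.
    route add -host 66.249.66.1 gw 127.0.0.1
    # Roughly equivalent with iproute2:
    ip route add blackhole 66.249.66.1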
Long-term, consider whether you can place all your CPU-intensive pages under a common prefix; then you'll be able to use robots.txt to instruct crawlers to stay away from them.
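For example, if the heavy pages all sat under a hypothetical /reports/ prefix, a robots.txt in the document root like this would tell well-behaved crawlers to skip them:

    # /reports/ is an assumed prefix for the CPU-intensive pages
    User-agent: *
    Disallow: /reports/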
First, use a robots.txt file in your document root directory; spiders and bots normally look for this file before beginning a scan.
Second, use a .htaccess file (this could also be put in your Apache configs, though the syntax needs a small change); the article below describes the approach, and a sketch follows it.
http://www.besthostratings.com/articles/block-bad-bots.html
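That approach boils down to flagging requests by User-Agent and denying them. A minimal sketch for Apache 2.2 (the <Directory> path and the exact user-agent match are assumptions to adapt to your own vhosts):

    # Flag requests whose User-Agent contains "Googlebot" (case-insensitive)
    SetEnvIfNoCase User-Agent "Googlebot" block_bot
    <Directory "/var/www">
        Order Allow,Deny
        Allow from all
        # Deny anything flagged above; everyone else is still allowed
        Deny from env=block_bot
    </Directory>

This way the expensive PHP/Django code never runs for those requests, and the block is trivial to remove once Googlebot has re-read robots.txt.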
Hope this helps.. :D
If you know Googlebot's IP address, you could set a DROP rule in iptables, but that's a real hack.
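Something along these lines (a sketch; adjust the chain to match your firewall layout):

    iptables -I INPUT -s [source ip] -j DROP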
where [source ip] is the googlebot's IP.
This'd definitely stop them, instantly, but it's a bit... low-level.
To unblock:
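Presumably the mirror-image command, deleting the rule you inserted above:

    iptables -D INPUT -s [source ip] -j DROP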
Assuming you don't actually want your site delisted from Google (which the accepted answer will eventually cause), set a crawl-delay value for your site in Google Webmaster Tools. It is reported that Google does not support Crawl-Delay in robots.txt, though you may wish to set that value for other search engines and crawlers to use.
We wanted to block a specific directory from robots. We had a robots.txt entry, but it's being ignored by many robots, so we added the snippet below to our Apache configuration file; note that we commented out the Wget line because we wanted to allow that. It works by blocking based on the HTTP_USER_AGENT.
The list comes (obviously) from http://www.javascriptkit.com/howto/htaccess13.shtml; when we modify configuration files with information we get from the Web, we always put in the back-pointer so we know where it came from.
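A trimmed sketch of that kind of block, with only a few illustrative user-agent entries from the linked list and a placeholder path standing in for the directory we wanted to protect (the Wget condition is left commented out, as described above):

    RewriteEngine On
    # Illustrative subset of the bad-bot list from the article above
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
    # Commented out because we want to allow Wget:
    # RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC]
    # /private/ is a placeholder for the directory being blocked
    RewriteRule ^/private/ - [F,L]

mod_rewrite has to be enabled for this, and in the main server config (as opposed to .htaccess) the RewriteRule pattern matches the full URL path, leading slash included.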