I'd like to block some spiders and bad bots by User-Agent string for all of my virtual hosts via httpd.conf, but so far without success. Below are the relevant contents of my httpd.conf file. Any ideas why this isn't working? env_module is loaded.
SetEnvIfNoCase User-Agent "^BaiDuSpider" UnwantedRobot
SetEnvIfNoCase User-Agent "^Yandex" UnwantedRobot
SetEnvIfNoCase User-Agent "^Exabot" UnwantedRobot
SetEnvIfNoCase User-Agent "^Cityreview" UnwantedRobot
SetEnvIfNoCase User-Agent "^Dotbot" UnwantedRobot
SetEnvIfNoCase User-Agent "^Sogou" UnwantedRobot
SetEnvIfNoCase User-Agent "^Sosospider" UnwantedRobot
SetEnvIfNoCase User-Agent "^Twiceler" UnwantedRobot
SetEnvIfNoCase User-Agent "^Java" UnwantedRobot
SetEnvIfNoCase User-Agent "^YandexBot" UnwantedRobot
SetEnvIfNoCase User-Agent "^bot*" UnwantedRobot
SetEnvIfNoCase User-Agent "^spider" UnwantedRobot
SetEnvIfNoCase User-Agent "^crawl" UnwantedRobot
SetEnvIfNoCase User-Agent "^NG\ 1.x (Exalead)" UnwantedRobot
SetEnvIfNoCase User-Agent "^MJ12bot" UnwantedRobot
<Directory "/var/www/">
Order Allow,Deny
Allow from all
Deny from env=UnwantedRobot
</Directory>
<Directory "/srv/www/">
Order Allow,Deny
Allow from all
Deny from env=UnwantedRobot
</Directory>
EDIT - @Shane Madden: I do have .htaccess files in each virtual host's document root containing the following:
order allow,deny
deny from xxx.xxx.xxx.xxx
deny from xx.xxx.xx.xx
deny from xx.xxx.xx.xxx
...
allow from all
Could that be creating a conflict? Sample VirtualHost config:
<VirtualHost xx.xxx.xx.xxx:80>
ServerAdmin [email protected]
ServerName domain.com
ServerAlias www.domain.com
DocumentRoot /srv/www/domain.com/public_html/
ErrorLog "|/usr/bin/cronolog /srv/www/domain.com/logs/error_log_%Y-%m"
CustomLog "|/usr/bin/cronolog /srv/www/domain.com/logs/access_log_%Y-%m" combined
</VirtualHost>
Try this, and if it fails, try it in a .htaccess file...
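Something along these lines (a mod_rewrite sketch using just a few of the user-agents from the question as an example; adapt the pattern lists to your own set):
# Requires mod_rewrite to be loaded
RewriteEngine On
# Flag unwanted user-agents; [NC] = case-insensitive, [OR] = or the next condition
RewriteCond %{HTTP_USER_AGENT} (BaiDuSpider|Yandex|Exabot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (Sogou|MJ12bot) [NC]
# No [OR] on the last condition; matching requests get a 403 Forbidden
RewriteRule .* - [F,L]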
Follow that pattern, and don't put an [OR] on the very last RewriteCond.
EDIT: New solution:
If you want to block all (friendly) bots, make a file called "robots.txt" and put it where your index.html is. Inside it, put this:
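# Tell every compliant crawler not to fetch anything
User-agent: *
Disallow: /
That tells every crawler that honors the robots exclusion standard to stay away from the whole site.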
You'd still need to maintain a list like the one in my original answer (above) to block the bots that ignore robots.txt.
For the benefit of those who may read this later, here's the deal:
I deleted the order allow,deny directives from my .htaccess files and was then able to trigger the expected behavior for certain user-agents by spoofing them with User Agent Switcher in Firefox, so there does appear to have been a conflict. Other user-agents on my list, however, were still not blocked -- but that's because I had misunderstood the significance of the caret (^) as used in my httpd.conf. The regular-expression tutorials I read state this, but it didn't really sink in at first: the caret anchors the match to the very beginning of the entire User-Agent string (not to individual words within it, as I originally thought) when the connection request is parsed. Since the key identifying string for some of the spiders and bots I want to block occurs later in the User-Agent string, I needed to drop the caret to get things working.
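For example, dropping the caret turns the match into a substring search; a minimal sketch of the adjusted directives (not my full list) looks like this:
# No leading ^ -- SetEnvIfNoCase now matches the pattern anywhere in the header
SetEnvIfNoCase User-Agent "bot" UnwantedRobot
SetEnvIfNoCase User-Agent "spider" UnwantedRobot
SetEnvIfNoCase User-Agent "crawl" UnwantedRobot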