Our websites are being crawled by content thieves on a regular basis. We obviously want to let through the nice bots and legitimate user activity, but block questionable activity.
We have tried IP blocking at our firewall, but the block lists become difficult to manage. We have also used IIS handlers, but that complicates our web applications.
Is anyone familiar with network appliances, firewalls or application services (say for IIS) that can reduce or eliminate the content scrapers?
If the scrapers are bots and not humans, you could try creating a honeypot directory that they would crawl into and then be blocked (by IP address) automatically via a "default page" script in that directory. Humans could easily unblock themselves, but it would thwart bots, as they would get a 403 "Forbidden" error on any further access. I use a technique like this to block bad robots that disobey robots.txt without permanently blocking humans who either share the same IP or "accidentally" navigate to the blocking script. That way, if a shared IP gets blocked, it's not permanent. Here's how:
I set up a default (scripted) page in one or more subdirectories (folders) that are disallowed in robots.txt. That page, if loaded by a misbehaving robot -- or a snooping human -- adds their IP address to a blocked list. But I have a 403 ("Forbidden") error handler that redirects these blocked IPs to a page explaining what's going on and containing a captcha that a human can use to unblock the IP. That way, if an IP is blocked because one person used it one time for a bad purpose, the next person to get that IP won't be permanently blocked -- just inconvenienced a little. Of course, if a particular IP keeps getting re-blocked a lot, I can take further steps manually to address that.
Here is the logic:
That's it! One script file to handle the block notice and the unblock captcha submission. One entry (minimum) in the robots.txt file. One 403 redirection in the .htaccess file.
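For anyone who wants to see the moving parts, here is a minimal Python sketch of the block/unblock flow described above. It is only an illustration under stated assumptions, not the script I actually run: the `/trap/` directory, the `/unblock` path, the `BlockList` class and the `handle_request` function are all made-up names, the list lives in memory rather than in a file or database, and the captcha check is reduced to a boolean. The `/trap/` directory is assumed to be listed under `Disallow` in robots.txt so that well-behaved robots never request it.

```python
from dataclasses import dataclass, field

# Hypothetical paths, chosen only for this sketch.
HONEYPOT_PREFIXES = ("/trap/",)  # assumed to be disallowed in robots.txt
UNBLOCK_PATH = "/unblock"        # page with the block notice and the captcha


@dataclass
class BlockList:
    """In-memory block list; a real site would persist this somewhere."""
    blocked: set = field(default_factory=set)
    block_counts: dict = field(default_factory=dict)

    def block(self, ip):
        self.blocked.add(ip)
        # Count repeat offences so frequently re-blocked IPs can be reviewed by hand.
        self.block_counts[ip] = self.block_counts.get(ip, 0) + 1

    def unblock(self, ip):
        self.blocked.discard(ip)


def handle_request(blocklist, ip, path, captcha_ok=False):
    """Return (status, body) for a single request.

    captcha_ok stands in for a real captcha verification step.
    """
    if path == UNBLOCK_PATH:
        # The unblock page must stay reachable for blocked IPs.
        if ip in blocklist.blocked and captcha_ok:
            blocklist.unblock(ip)
            return 200, "unblocked"
        return 200, "block notice with captcha"
    if ip in blocklist.blocked:
        return 403, "blocked; see " + UNBLOCK_PATH
    if path.startswith(HONEYPOT_PREFIXES):
        # Anything that loads the honeypot page ignored robots.txt.
        blocklist.block(ip)
        return 403, "blocked; see " + UNBLOCK_PATH
    return 200, "normal content"
```

A request to `/trap/` puts the caller's IP on the list, every later request from that IP gets a 403 until the captcha is solved at `/unblock`, and `block_counts` keeps a tally of how often each IP has been re-blocked.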
Check the request headers? Depending on whether the scrapers are script kiddies or not, that may be enough.
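As a rough idea of what a header check could look like, here is a small Python sketch. The token list and the `looks_suspicious` function are assumptions made for illustration; the tokens are substrings of the default User-Agent strings sent by a few common scripting tools, and anyone who bothers to fake browser headers will get straight past this.

```python
# Illustrative substrings of default User-Agent values; not an exhaustive list.
SCRAPER_TOKENS = ("curl", "wget", "python-requests", "scrapy", "libwww-perl")


def looks_suspicious(headers):
    """Return the reasons these request headers look automated (empty if none)."""
    # Header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}
    reasons = []

    user_agent = h.get("user-agent", "").lower()
    if not user_agent:
        reasons.append("missing User-Agent")
    elif any(token in user_agent for token in SCRAPER_TOKENS):
        reasons.append("scripting-library User-Agent")

    # Browsers normally send both of these; bare scripts often do not.
    if "accept" not in h:
        reasons.append("missing Accept")
    if "accept-language" not in h:
        reasons.append("missing Accept-Language")
    return reasons
```

A non-empty result is probably better used as a signal to log or rate-limit than as a reason for an outright block, since some legitimate clients also send sparse headers.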
You want a hardware firewall that does HTTP inspection. This won't come cheap, I'm afraid.
I seem to recall that a Cisco ASA 5520 will do this, but the list price for one of these is about £4600 ~= $6900.
You could probably do something similar with a Linux box running a firewall application, for a fraction of the cost.