The problem:
I manage a website with lots of dynamically generated pages. Every day, bots from Google, Yahoo and other search engines download 100K+ pages, and sometimes I have problems with "hackers" trying to massively download the whole site.
I would like to block the IP addresses of the "hackers" while letting the search engine bots keep crawling the pages. What is the best way to do this?
Note:
Right now I am solving the problem as follows: I save the IP of each page request to a file every X seconds, and a crontab script counts repeated IPs every 30 minutes. For IPs that appear too many times, the script checks the hostname; if it doesn't belong to Google/Yahoo/Bing/etc., it is a candidate for banning.
But I don't really like my solution, and I think auto-banning could be done better, or with some out-of-the-box solution.
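For reference, the cron job does roughly the following (the log path, threshold and exact commands here are simplified placeholders, not my actual script):

    #!/bin/sh
    # Count requests per IP over the last window and flag heavy hitters
    # whose reverse DNS does not look like a known search engine crawler.
    LOG=/var/log/request-ips.log   # one IP per request, appended by the web app
    THRESHOLD=5000                 # requests per 30-minute window

    sort "$LOG" | uniq -c | awk -v t="$THRESHOLD" '$1 > t {print $2}' |
    while read -r ip; do
        host=$(dig +short -x "$ip")
        case "$host" in
            *.googlebot.com.|*.google.com.|*.search.msn.com.|*.crawl.yahoo.net.)
                ;;                 # legitimate crawler, skip
            *)
                echo "$ip ($host) is a candidate for banning"
                ;;
        esac
    done
    : > "$LOG"                     # start the next window with an empty log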
You didn't state your OS, so I will happily tell you the OpenBSD version: in pf.conf, place something like the following in your ruleset (for a maximum of 100 connections per 10 seconds). You could add a whitelist, and a cron job that kicks addresses out of bad_hosts after a day or two.
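A minimal sketch of such a ruleset, assuming the web server listens on port 80 and using a table called bad_hosts as the overload target:

    table <bad_hosts> persist
    block in quick from <bad_hosts>
    pass in on egress proto tcp to any port 80 keep state \
        (max-src-conn-rate 100/10, overload <bad_hosts> flush global)

Any source address exceeding 100 connections in 10 seconds gets moved into the bad_hosts table, and flush global tears down its existing states as well.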
I would have thought fail2ban is the answer.
You can use whitelists to stop the search engines from getting blocked.
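Something along these lines in jail.local, for example. The jail/filter name, log path, thresholds and crawler ranges below are placeholders: you would also need a filter that matches one access-log line per request, and the whitelisted ranges should come from the search engines' published lists.

    [http-flood]
    enabled  = true
    filter   = http-flood
    logpath  = /var/log/apache2/access.log
    # ban an IP that makes more than 2000 requests in 10 minutes, for one day
    findtime = 600
    maxretry = 2000
    bantime  = 86400
    # never ban localhost or known crawler ranges (example ranges only)
    ignoreip = 127.0.0.1/8 66.249.64.0/19 157.55.39.0/24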
Have a look at Simple Event Correlator (SEC). It can automatically run commands (e.g. add a block to iptables) after a certain number of lines matching a regular expression have been seen within a window of time. It can also define a "context" that expires; when the context expires, you can unblock the IP in question (e.g. remove it from iptables).
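A sketch of such a rule, assuming the client IP is the first field of each access-log line and picking arbitrary threshold, window and lifetime values:

    # More than 200 requests from one IP within 60 seconds: block it with
    # iptables, and create a context that unblocks it again after 24 hours.
    type=SingleWithThreshold
    ptype=RegExp
    pattern=^(\d+\.\d+\.\d+\.\d+)\s
    desc=Flood from $1
    action=shellcmd /sbin/iptables -I INPUT -s $1 -j DROP; \
           create banned_$1 86400 ( shellcmd /sbin/iptables -D INPUT -s $1 -j DROP )
    thresh=200
    window=60

You would then point SEC at the access log, e.g. sec --conf=ban.sec --input=/var/log/apache2/access.log, where ban.sec is the file holding the rule above.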