We were a little surprised to see this on our Cacti graphs for June 4 web traffic:
We ran Log Parser on our IIS logs, and it turns out this was a perfect storm of Yahoo and Google bots indexing us. In that 3-hour period, we saw 287k hits from 3 different Google IPs, plus 104k from Yahoo. Ouch?
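A Log Parser query along these lines is enough to pull out the top client IPs and user agents (sketch only; it assumes the default W3C extended log fields and IIS-style ex*.log file naming, so adjust the path and pattern to your setup):

    LogParser.exe -i:IISW3C "SELECT TOP 20 c-ip, cs(User-Agent), COUNT(*) AS Hits FROM ex*.log GROUP BY c-ip, cs(User-Agent) ORDER BY Hits DESC"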
While we don't want to block Google or Yahoo, this has come up before. We have access to a Cisco PIX 515E, and we're thinking about putting that in front so we can dynamically deal with bandwidth offenders without touching our web servers directly.
But is that the best solution? I'm wondering if there is any software or hardware that can help us identify and block excessive bandwidth use, ideally in real time? Perhaps some bit of hardware or open-source software we can put in front of our web servers?
We are mostly a Windows shop but we have some Linux skills as well; we're also open to buying hardware if the PIX 515E isn't sufficient. What would you recommend?
If your PIX is running version 7.2 or greater of the OS, or can be upgraded to it, then you can implement QoS policies at the firewall level. In particular, this allows you to shape traffic and should let you limit the bandwidth used by bots. Cisco has a good guide to this here.
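As a rough sketch only (exact syntax varies between 7.x releases, so check the command reference; the address range and rate shown are just placeholders for whatever crawler ranges and limits you pick), policing a matched class with the Modular Policy Framework looks something like this:

    ! ACL matching the crawler source range you identified (placeholder range shown)
    access-list CRAWLERS extended permit ip 66.249.64.0 255.255.224.0 any
    !
    class-map crawler-class
      match access-list CRAWLERS
    !
    policy-map outside-policy
      class crawler-class
        police input 1000000 conform-action transmit exceed-action drop
    !
    service-policy outside-policy interface outside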
You can configure how frequently Google's bot indexes your site; have a look at Google Webmaster Tools. I'm not sure if Yahoo has anything similar. At any rate, that'll reduce your traffic by up to 50%.
Alternatively, some web servers can limit traffic per connection, so you can try that. I personally would stay away from hardware solutions, since they're most likely going to cost more.
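On IIS 7, for instance, you can cap a site's total bandwidth in bytes per second without any extra hardware (note it's a per-site cap rather than a true per-connection limit, and IIS 6 has an equivalent bandwidth-throttling setting in the site properties). Something like this, assuming the default site name, with appcmd.exe from %windir%\system32\inetsrv:

    appcmd.exe set site "Default Web Site" /limits.maxBandwidth:1048576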
To reduce the crawling load, use the Crawl-delay directive in robots.txt. This only works with Microsoft and Yahoo; for Google, you'll need to specify a slower crawling speed through their Webmaster Tools (http://www.google.com/webmasters/).
Be VERY careful when implementing this because if you slow down the crawl too much, robots won’t be able to get to all of your site, and you may lose pages from the index.
Here are some examples (these go in your robots.txt file); see the sketch below.

Slightly off-topic, but you can also specify a Sitemap or Sitemap index file.
If you’d like to provide search engines with a comprehensive list of your best URLs, you can also provide one or more Sitemap autodiscovery directives. Please note that user-agent does not apply to this directive, so you cannot use this to specify a sitemap to some but not all search engines.
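A sketch of the kind of robots.txt entries meant above (the delay values and the sitemap URL are just placeholders; Googlebot ignores Crawl-delay, which is why you set Google's rate in Webmaster Tools instead):

    User-agent: msnbot
    Crawl-delay: 10

    User-agent: Slurp
    Crawl-delay: 10

    Sitemap: http://www.example.com/sitemap.xml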
We use a Watchguard firewall (ours is an X1000, which is end-of-life now). They have many features revolving around blocking domains or IPs that show up time and time again or are using an obsessive amount of bandwidth.
This would need some tweaking because you obviously would not want to block Jon Skeet on stackoverflow :)
I'd recommend Microsoft ISA Server 2006. Specifically for this requirement, it will limit to 600 HTTP requests/min per IP by default and you can apply an exception for Jon Skeet (sorry, I realise that "joke" has been made already!).
You have the additional benefits of application-level filtering, the ability to load-balance across multiple webservers (instead of NLB on those servers), VPN termination, etc. There are a number of commercial extensions available, and you can even write your own ISAPI filter if you're feeling brave.
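If you do go down the ISAPI filter route, the skeleton is pretty small. Here's a minimal sketch in C (compile it into a DLL and export GetFilterVersion/HttpFilterProc) that counts requests per client IP in a fixed one-minute window and returns a 503 past a threshold; the table size, window, and limit are arbitrary placeholders, and a real filter would need smarter bookkeeping than this:

    #include <windows.h>
    #include <httpfilt.h>
    #include <string.h>

    #define MAX_TRACKED 256     /* arbitrary: number of IPs tracked at once, never evicted */
    #define WINDOW_MS   60000   /* fixed one-minute window */
    #define MAX_HITS    600     /* placeholder threshold: requests per IP per window */

    static CRITICAL_SECTION g_lock;
    static struct { CHAR ip[64]; DWORD hits; DWORD windowStart; } g_table[MAX_TRACKED];

    /* Naive fixed-window counter: returns TRUE once an IP exceeds MAX_HITS. */
    static BOOL IsOverLimit(const CHAR *ip)
    {
        DWORD now = GetTickCount();
        BOOL over = FALSE;
        int i, freeSlot = -1;

        EnterCriticalSection(&g_lock);
        for (i = 0; i < MAX_TRACKED; i++)
        {
            if (g_table[i].ip[0] == '\0') { if (freeSlot < 0) freeSlot = i; continue; }
            if (strcmp(g_table[i].ip, ip) == 0)
            {
                if (now - g_table[i].windowStart > WINDOW_MS)
                {
                    g_table[i].windowStart = now;   /* start a fresh window */
                    g_table[i].hits = 0;
                }
                over = (++g_table[i].hits > MAX_HITS);
                LeaveCriticalSection(&g_lock);
                return over;
            }
        }
        if (freeSlot >= 0)   /* first time we've seen this IP (silently ignored if table is full) */
        {
            strncpy(g_table[freeSlot].ip, ip, sizeof(g_table[freeSlot].ip) - 1);
            g_table[freeSlot].windowStart = now;
            g_table[freeSlot].hits = 1;
        }
        LeaveCriticalSection(&g_lock);
        return FALSE;
    }

    BOOL WINAPI GetFilterVersion(PHTTP_FILTER_VERSION pVer)
    {
        InitializeCriticalSection(&g_lock);
        pVer->dwFilterVersion = HTTP_FILTER_REVISION;
        pVer->dwFlags = SF_NOTIFY_ORDER_DEFAULT | SF_NOTIFY_PREPROC_HEADERS;
        strcpy(pVer->lpszFilterDesc, "Per-IP rate limiter (sketch)");
        return TRUE;
    }

    DWORD WINAPI HttpFilterProc(PHTTP_FILTER_CONTEXT pfc,
                                DWORD notificationType, LPVOID pvNotification)
    {
        (void)pvNotification;   /* not needed for this notification type */

        if (notificationType == SF_NOTIFY_PREPROC_HEADERS)
        {
            CHAR ip[64];
            DWORD len = sizeof(ip);
            if (pfc->GetServerVariable(pfc, "REMOTE_ADDR", ip, &len) && IsOverLimit(ip))
            {
                /* Reject this request with a bare 503 and stop processing it. */
                pfc->ServerSupportFunction(pfc, SF_REQ_SEND_RESPONSE_HEADER,
                                           (PVOID)"503 Service Unavailable",
                                           (ULONG_PTR)"Content-Length: 0\r\n\r\n", 0);
                return SF_STATUS_REQ_FINISHED;
            }
        }
        return SF_STATUS_REQ_NEXT_NOTIFICATION;
    }

You'd then register the DLL as an ISAPI filter on the site (or globally) in IIS Manager.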
It's obviously not open source, but it has benefits for a Windows shop and runs on commodity hardware.
We use Foundry load-balancers (specifically SI850s) to handle this kind of shaping issue; they also handle quite a lot of other nasties like SYN floods, etc. Might be a bit of overkill for you guys, though.
Bluecoat (formerly Packeteer) PacketShaper products can dynamically throttle excessive usage on traffic they manage.
You can even perform rudimentary rate-limiting with any regular Cisco router of any decent capacity/vintage. Are you using a Cisco router?
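If it is and it's running a reasonably modern IOS, a sketch using CAR (committed access rate) would look roughly like this; the ACL, interface name, and rates are placeholders, and class-based policing is the newer equivalent:

    ! ACL matching the crawler source range (placeholder range shown)
    access-list 120 permit ip 66.249.64.0 0.0.31.255 any
    !
    interface FastEthernet0/0
      rate-limit input access-group 120 1000000 187500 375000 conform-action transmit exceed-action drop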