I want to block all requests from the yandex.ru search bot. It is very traffic-intensive (2 GB/day). I first blocked one class C IP range, but it seems this bot appears from different IP ranges.
For example:
spider31.yandex.ru -> 77.88.26.27
spider79.yandex.ru -> 95.108.155.251
etc.
I could put a deny (Disallow) in robots.txt, but I'm not sure whether the bot respects it. I am thinking of blocking a list of IP ranges instead.
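For reference, the robots.txt deny I have in mind would be something like this (assuming the bot honors the "Yandex" user-agent token it documents):

    # hypothetical robots.txt at the site root
    User-agent: Yandex
    Disallow: /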
Can somebody suggest a general solution?
Don't believe what you read on forums about this! Trust what your server logs tell you. If Yandex obeyed robots.txt, you would see the evidence in your logs. I have seen for myself that Yandex robots do not even READ the robots.txt file!
Quit wasting time with long IP lists that only serve to slow down your site drastically.
Enter the following lines in .htaccess (in the root folder of each of your sites):
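Something along these lines works (a sketch in Apache 2.2 syntax, assuming mod_setenvif is available; adapt as needed):

    # Tag any request whose User-Agent contains "Yandex" (case-insensitive)
    SetEnvIfNoCase User-Agent "Yandex" bad_bot
    # Let everyone else through, deny the tagged requests with a 403
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot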
I did, and all Yandex gets now are 403 Access denied errors.
Good bye Yandex!
I'm too young here (reputation) to post all the URLs I need to as hyperlinks, so pardon my parenthesized URLs, please.
The forum link from Dan Andreatta, and this other one, have some but not all of what you need. You'll want to use their method of finding the IP numbers, and script something to keep your lists fresh. Then you want something like this, to show you some known values including the sub-domain naming schemes they have been using. Keep a crontabbed eye on their IP ranges, maybe automate something to estimate a reasonable CIDR (I didn't find any mention of their actual allocation; could just be google fail @ me).
Find their IP range(s) as accurately as possible, so you don't have to waste time doing a reverse DNS lookup while users are waiting for (http://yourdomain/notpornipromise), and instead you're only doing a comparison match or something. Google just showed me grepcidr, which looks highly relevant. From the linked page: "grepcidr can be used to filter a list of IP addresses against one or more Classless Inter-Domain Routing (CIDR) specifications, or arbitrary networks specified by an address range." I guess it is nice that it's a purpose-built tool with known I/O, but you know you can reproduce the function in a billion different ways.
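For example, something along these lines would pull their hits out of an access log, assuming you keep their CIDRs in a file of your own (the file name and ranges below are placeholders, not verified allocations):

    # yandex-ranges.txt: one CIDR or address range per line, refreshed by your cron job, e.g.
    #   77.88.0.0/18
    #   95.108.128.0/17
    # Pull the client IPs out of a combined-format access log and keep only the Yandex ones
    awk '{print $1}' access.log | grepcidr -f yandex-ranges.txt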
The most, "general solution", I can think of for this and actually wish to share (speaking things into existence and all that) is for you to start writing a database of such offenders at your location(s), and spend some off-hours thinking and researching on ways to defend and counter attack the behavior. This takes you deeper into intrusion detection, pattern analysis, and honey nets, than the scope of this specific question truly warrants. However, within the scope of that research are countless answers to this question you have asked.
I found this due to Yandex's interesting behavior on one of my own sites. I wouldn't call what I see in my own log abusive, but spider50.yandex.ru consumed 2% of my visit count and 1% of my bandwidth... I can see where the bot would be truly abusive to large files and forums and such, neither of which is available for abuse on the server I'm looking at today. What was interesting enough to warrant investigation was the bot looking at /robots.txt, then waiting 4 to 9 hours and asking for a /directory/ not listed in it, then waiting another 4 to 9 hours, asking for /another_directory/, then maybe a few more, then /robots.txt again, repeated ad infinitum. As far as frequency goes, I suppose they're well behaved enough, and the spider50.yandex.ru machine appeared to respect /robots.txt.
I'm not planning to block them from this server today, but I would if I shared Ross' experience.
For reference on the tiny numbers we're dealing with in my server's case, today:
That's on a shared host that doesn't even bother capping bandwidth anymore, and if the crawl took some DDoS-like form, they would probably notice and block it before I would. So I'm not angry about that. In fact, I much prefer having the data they write to my logs to play with.
Ross, if you really are angry about the 2GB/day you're losing to Yandex, you might spampoison them. That's what it's there for! Reroute them from what you don't want them downloading, either by HTTP 301 directly to a spampoison sub-domain, or roll your own so you can control the logic and have more fun with it. That sort of solution gives you the tool to reuse later, when it's even more necessary.
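A sketch of the 301 variant in .htaccess, keying on the Yandex User-Agent (the subdomain is a placeholder for whatever link spampoison hands you):

    RewriteEngine On
    # Anything identifying itself as Yandex gets bounced to the poison pages
    RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
    RewriteRule ^ http://your-link-here.spampoison.com/ [R=301,L]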
Then start looking deeper in your logs for funny ones like this:
Hint: No /user/ directory, nor a hyperlink to such, exists on the server.
According to this forum, the Yandex bot is well behaved and respects robots.txt. In particular they say:
Personally, I do not have issues with it, and Googlebot is by far the most aggressive crawler for the sites I have.
My current solution is this (for NGINX web server):
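A minimal version of it, placed inside the server block, looks like this:

    # Close the connection, without any response, for anything whose User-Agent contains "Yandex"
    if ($http_user_agent ~* Yandex) {
        return 444;
    }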
This is case-insensitive and returns response 444. The directive looks at the User-Agent string, and if "Yandex" is detected, the connection is closed without sending any headers. 444 is a non-standard status code understood by the Nginx daemon: it tells Nginx to close the connection without sending a response.
Get nasty by adding these lines to your .htaccess file to target all visitors from 77.88.26.27 (or whatever the IP is) who try to access a page ending in .shtml:
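A sketch with mod_rewrite (the target here is the usual rickroll video; substitute whatever amuses you):

    RewriteEngine On
    # Match requests from the offending IP and bounce their .shtml requests elsewhere
    RewriteCond %{REMOTE_ADDR} ^77\.88\.26\.27$
    RewriteRule \.shtml$ https://www.youtube.com/watch?v=dQw4w9WgXcQ [R=302,L]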
That Yandex bot now gets rickrolled every time it tries to index your site. Problem solved.
Please, people, look up the OSI model. I recommend blocking these networks at the routing level, i.e. layer 3 (with layer 4, transport, above it). If you block them at the server level, you are already up at layer 4 (really 5, 6, 7) and the request has already passed through the stack. The kernel can also handle those requests a hundred times better than an Apache server can. RewriteRule upon RewriteRule, SetEnv directives and so on just bog down your server, regardless of the cool 403 you present. A request is a request, and Yandex, like Baidu, makes a lot of them, while Google is also scanning in the background. Do you really want to be flooded by requests? It costs you web server slots, and Baidu is known for doing this intentionally.
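As a sketch of what blocking at the routing level looks like in practice (the CIDRs are examples only; verify their current allocations first), either drop the traffic with netfilter or null-route it so the kernel never hands it to Apache:

    # Drop the ranges in the kernel before the web server ever sees them (example CIDRs)
    iptables -A INPUT -s 77.88.0.0/18 -j DROP
    iptables -A INPUT -s 95.108.128.0/17 -j DROP

    # Or null-route them so no reply is generated at all
    ip route add blackhole 77.88.0.0/18
    ip route add blackhole 95.108.128.0/17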
New Ranges (updated Tue, May 8th, 2012):
New Ranges (updated Sun, May 13th, 2012):