How do large sites (e.g. Wikipedia) deal with bots that sit behind an IP masker such as a NAT gateway? For instance, in my university everybody searches Wikipedia, generating significant load. But as far as I know, Wikipedia can only see the IP of the university router, so if I set up an "unleashed" bot (with only a small delay between requests), can Wikipedia ban my bot without banning the whole organization? Can a site actually ban a single machine behind an organizational network?
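For concreteness, here is a sketch of what I imagine the server side could do (purely hypothetical; I have no idea what Wikipedia actually runs): rate-limit per IP, but fold extra signals such as a session cookie or the User-Agent into the key, so one misbehaving client behind the NAT can be throttled without cutting off the whole university.

    # Hypothetical sketch, not Wikipedia's actual mechanism: throttle per client
    # key rather than per bare IP, so clients sharing a NAT aren't all punished.
    import time
    from collections import defaultdict, deque

    WINDOW = 60          # seconds
    MAX_REQUESTS = 120   # requests allowed per window, per client key

    hits = defaultdict(deque)

    def allow(ip, user_agent, session_cookie):
        # Machines behind the same NAT share the IP, so include per-client
        # signals in the key; fall back to the bare IP if nothing else is sent.
        key = (ip, session_cookie or user_agent or "")
        now = time.time()
        q = hits[key]
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) >= MAX_REQUESTS:
            return False     # throttle or block just this client
        q.append(now)
        return True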
I have found out that McAfee SiteAdvisor has flagged my website as one that "may be having security issues".
I care little about whatever McAfee thinks of my website (I can secure it myself and if not, McAfee definitely is not the company I'd be asking for help, thank you very much). What bothers me, though, is that they have, apparently, crawled my website without my permission.
To clarify: there's almost no content on my website yet, just a placeholder page and some files for my personal use. There are no Terms of Service.
My question is: does McAfee have the right to download content from / crawl my website? Can I forbid them from doing so? I have a feeling there should be some kind of "my castle, my rules" principle, but I basically know nothing about all the legal stuff.
Update: I probably should have mentioned my server provider sends me emails about SiteAdvisor's findings on a regular basis - that's how I found out about their 'rating' and that's why I'm annoyed.
I installed Apache a while ago, and a quick look at my access.log shows that all sorts of unknown IPs are connecting, mostly receiving status codes 403, 404, 400, and 408. I have no idea how they're finding my IP, because I only use the server for personal purposes, and I added a robots.txt hoping it would keep search engines away. Directory indexes are blocked and there's nothing really important on the server.
How are these bots (or people) finding the server? Is it common for this to happen? Are these connections dangerous/what can I do about it?
Also, lots of the IPs come from all sorts of countries and don't resolve to a hostname.
Here's a bunch of examples of what comes through:
In one large sweep, this bot tried to find phpMyAdmin:
"GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1" 403 243 "-" "ZmEu"
"GET /3rdparty/phpMyAdmin/scripts/setup.php HTTP/1.1" 404 235 "-" "ZmEu"
"GET /admin/mysql/scripts/setup.php HTTP/1.1" 404 227 "-" "ZmEu"
"GET /admin/phpmyadmin/scripts/setup.php HTTP/1.1" 404 232 "-" "ZmEu"
I get plenty of these:
"HEAD / HTTP/1.0" 403 - "-" "-"
lots of "proxyheader.php", i get quite a bit requests with http:// links in the GET
"GET http://www.tosunmail.com/proxyheader.php HTTP/1.1" 404 213 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
"CONNECT"
"CONNECT 213.92.8.7:31204 HTTP/1.0" 403 - "-" "-"
"soapCaller.bs"
"GET /user/soapCaller.bs HTTP/1.1" 404 216 "-" "Morfeus Fucking Scanner"
And this really sketchy hex garbage:
"\xad\r<\xc8\xda\\\x17Y\xc0@\xd7J\x8f\xf9\xb9\xc6x\ru#<\xea\x1ex\xdc\xb0\xfa\x0c7f("400 226 "-" "-"
empty
"-" 408 - "-" "-"
That's just the gist of it. I get all sorts of junk, even requests with Windows 95 user agents.
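To get a sense of the scale, a quick script along these lines can summarize which IPs and user agents are behind the 4xx noise (a rough sketch; it assumes Apache's default "combined" log format and the usual Debian/Ubuntu log path):

    # Rough sketch: tally the client IPs and user agents behind the 4xx noise.
    # Assumes Apache's default "combined" log format; adjust LOG to your setup.
    import re
    from collections import Counter

    LOG = "/var/log/apache2/access.log"
    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "([^"]*)" (\d{3}) \S+ "[^"]*" "([^"]*)"')

    ips, agents = Counter(), Counter()
    with open(LOG, errors="replace") as f:
        for line in f:
            m = line_re.match(line)
            if not m:
                continue
            ip, request, status, agent = m.groups()
            if status.startswith("4"):      # the 400/403/404/408 probes above
                ips[ip] += 1
                agents[agent] += 1

    print(ips.most_common(10))
    print(agents.most_common(10))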
Thanks.
I'm having an issue with a certain individual who keeps scraping my site aggressively, wasting bandwidth and CPU resources. I've already implemented a system that tails my web server's access logs, adds each new IP to a database, keeps track of the number of requests made from that IP, and then blocks the IP via iptables if it goes over a certain threshold of requests within a certain time period. It may sound elaborate, but as far as I know there is no pre-made solution designed to limit a given IP to a certain amount of bandwidth/requests.
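Stripped down, the blocking logic looks roughly like this (just a sketch: the real version keeps its counters in a database, and the window/threshold values here are invented):

    # Sketch of the blocking logic: record() is called for every new access-log
    # line as it is tailed; past the threshold the IP is dropped with iptables.
    import subprocess
    import time
    from collections import defaultdict, deque

    WINDOW = 300       # seconds
    THRESHOLD = 500    # requests per window before the IP gets blocked
    hits = defaultdict(deque)
    blocked = set()

    def record(ip):
        now = time.time()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) > THRESHOLD and ip not in blocked:
            subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"],
                           check=True)
            blocked.add(ip)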
This works fine for most crawlers, but an extremely persistent individual gets a new IP from their ISP's pool each time they're blocked. I would like to block the ISP entirely, but I don't know how to go about it.
Doing a whois on a few sample IPs, I can see that they all share the same "netname", "mnt-by", and "origin/AS". Is there a way I can query the ARIN/RIPE database for all subnets using the same mnt-by/AS/netname? If not, how else could I go about getting every IP belonging to this ISP?
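For reference, this is the kind of query I have in mind: an inverse whois lookup that returns every route object registered with the ISP's origin AS, assuming their address space is in the RIPE database (AS64496 below is just a placeholder; a similar "-i mnt-by" query exists for maintainer objects):

    # Sketch: ask the RIPE whois server for all route objects whose origin is
    # the ISP's AS. The "route:" lines in the reply are the prefixes to block.
    # AS64496 is a placeholder ASN; substitute the one from your whois output.
    import socket

    def whois(query, server="whois.ripe.net", port=43):
        with socket.create_connection((server, port), timeout=10) as s:
            s.sendall((query + "\r\n").encode())
            chunks = []
            while True:
                data = s.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode(errors="replace")

    print(whois("-i origin AS64496"))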
Thanks.
I have several sites in a /24 network that all get crawled by Google on a pretty regular basis. Normally this is fine. However, when Google starts crawling all the sites at the same time, the small set of servers that back this IP block can take a pretty big hit on load.
With Google Webmaster Tools you can rate-limit Googlebot for a given domain, but I haven't found a way to limit the bot across an entire IP network. Has anyone dealt with this? How did you fix it?
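The closest workaround I can think of is to shed crawler traffic at the application level when the servers are already loaded, since Googlebot backs off when it receives 503 responses. A rough WSGI sketch follows (nothing official; the load threshold is made up), though I'd prefer a supported way to do this:

    # Sketch: when the machines behind this /24 are overloaded, answer Googlebot
    # with 503 + Retry-After so it backs off, instead of serving the request.
    import os

    MAX_LOAD = 8.0   # 1-minute load average above which crawler traffic is shed

    def crawl_shed_middleware(app):
        def wrapper(environ, start_response):
            agent = environ.get("HTTP_USER_AGENT", "")
            if "Googlebot" in agent and os.getloadavg()[0] > MAX_LOAD:
                start_response("503 Service Unavailable",
                               [("Retry-After", "3600"),
                                ("Content-Type", "text/plain")])
                return [b"Crawl rate temporarily limited, please retry later.\n"]
            return app(environ, start_response)
        return wrapper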