This is a rapidly changing event that has no answer yet.
Please do not post your findings or assumptions as answers; reserve the answer field for when you actually have an answer.
If you have something new to add, please edit it directly into the question.
Since the beginning of the year, I have been getting a lot of traffic with the user agent:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729).
My access logs show 40-60% of requests coming from that user agent. That's strange because the user agent claims to be Firefox 3.0.10 (is anybody using that browser in 2012? Definitely not 40-60% of visitors on a normal website).
Also, the logs show that this user agent only requests the HTML document and none of the referenced assets such as images, CSS, or JS files.
I checked the IPs of the requests with that UA. They come from all over the world. I noticed that those IPs sometimes also send a mobile user agent.
So my suspicion is a mobile app that is doing a lot of "spider requests". It would be good to know the root cause of the traffic from that user agent.
Can anybody identify the root cause?
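The two checks described above (the share of requests per user agent, and whether that user agent ever fetches assets) can be reproduced with a short script. This is only a minimal sketch, assuming the access log is in Apache's combined format; the function name `ua_summary`, the regex, and the list of asset extensions are illustrative choices, not something taken from the original logs.

```python
import re
from collections import Counter

# The exact user agent string quoted in the question.
SUSPECT_UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) "
              "Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)")

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical list of extensions treated as "referenced assets".
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")


def ua_summary(lines):
    """Return (share of requests sent by SUSPECT_UA, asset requests it made)."""
    per_ua = Counter()
    suspect_assets = 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not look like combined format
        ua = m.group("ua")
        per_ua[ua] += 1
        path = m.group("path").split("?", 1)[0].lower()
        if ua == SUSPECT_UA and path.endswith(ASSET_EXTENSIONS):
            suspect_assets += 1
    total = sum(per_ua.values())
    share = per_ua[SUSPECT_UA] / total if total else 0.0
    return share, suspect_assets
```

Passing an open log file to `ua_summary` returns the share as a fraction between 0 and 1. An asset count of zero alongside a large share matches the pattern described in the question.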
In the last couple of weeks, we noticed that the traffic from that UA dropped and other traffic increased. It looks like the bot/crawler is now using a more common UA and is therefore more difficult to block. Somebody else said this in an answer to this question, but it was removed when Server Fault decided to rearrange this question.
OLD answers as reference
Update from Dee
I run my own fairly high-traffic website and I'm seeing the exact same thing in our Apache logs for the last month or so (I haven't had a chance to check further back yet). 40% of all requests come from this user agent, which is nuts, obviously.
And I also noticed the requests always seem to say the requesting browser doesn't support gzip compression -- resulting in all webpage requests being sent uncompressed and our bandwidth usage spiking through the roof!
But so far I've been unable to determine what's really going on -- all I suspect so far is that it may be some kind of proxy server or such for a mobile device that is sending a fake useragent string.
EDITED TO ADD: Just did some more research and it looks like it might be antivirus software: http://www.webmasterworld.com/search_engine_spiders/4428772.htm
Update from jamur21
Yes, we've noticed similar traffic across multiple sites.
We're still looking for the root cause, but some of our findings include:
- If it's a spider, it's doing a pretty poor job. It seems to hammer only one or two URLs per domain for a while (maybe a couple of hours), until it moves on to another URL. The content is always relatively "current", though, which lends credence to Google News being a factor, as posited in the link Dee posted in his/her answer (all of our sites are news sites).
- While the IPs are spread out geographically, for us most of them seem located near the origin site (most of our sites are local news outlets, so they don't get a lot of national traffic). Almost none of the requests come from outside the USA. Again, this lends credence to the URLs getting slurped from Google News (I'm guessing people who have localized Google News by zip code will see our content).
- Most of the time, the requests can be written off as background noise (albeit an especially noisy one), but a couple of times a day we'll spike and this UA alone will account for ~100 Mbps of traffic for about 15-30 minutes.
Unfortunately, while Google News seems like a possible vector for these URLs to be discovered, everything we've seen is circumstantial and we still don't have any smoking gun for exactly how or why these URLs are getting hammered.
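To see when such spikes occur, the response bytes logged for this user agent can be bucketed per minute and converted to an approximate bandwidth. The following is a rough sketch, assuming Apache's combined log format. The function name `mbps_per_minute` and the `ua_fragment` default are hypothetical, and the logged size normally excludes response headers, so the figures slightly underestimate real traffic.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} (?P<size>\S+) '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def mbps_per_minute(lines, ua_fragment="Gecko/2009042316 Firefox/3.0.10"):
    """Map each minute to the average Mbps served to the matching user agent."""
    bytes_per_minute = defaultdict(int)
    for line in lines:
        m = LOG_RE.match(line)
        if not m or ua_fragment not in m.group("ua"):
            continue
        size = m.group("size")
        if not size.isdigit():
            continue  # "-" is logged when no response body was sent
        ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        bytes_per_minute[ts.replace(second=0)] += int(size)
    # bytes -> bits, averaged over 60 seconds, expressed in megabits
    return {minute: total * 8 / 60 / 1e6
            for minute, total in sorted(bytes_per_minute.items())}
```

Minutes whose value approaches the capacity of the uplink are the bursts worth correlating with publication times or Google News appearances.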
Update from Bannow Bay
We have a big news site - our stories get picked up by Google News several times a week. We have been getting traffic from this source since late November - and it is growing week by week - maybe 30 million impressions in February.
Appearance on the front page of Google News US is a trigger for this traffic - about 75 per cent purports to be from US IPs. But whatever it is, it is making great efforts to obscure itself. And that is not friendly.
We have not found a smoking gun either, but a major security vendor has kindly agreed to investigate further on our behalf.
Update from Artem Russakovskii
Just had the same thing happen to a news site (AndroidPolice.com) for the first time. About 10 minutes of these random requests spiked QPS to over 5000% of our average (5000 qps, which is the limit of Linode's NodeBalancer). The CPU started idling as the requests were eating up I/O and network - it was a real DDoS.
I'd really like to get to the bottom of this, but at the moment it seems completely puzzling.
Update from Mark
Just adding a +1. We are seeing the same behavior on our site. Not a ton of new information to add here, but here's the general shape of our traffic:
- Traffic is highly distributed, coming from roughly 60k unique IPs.
- Vast majority of the traffic is hitting a single URL, typically a recent URL listed on Google News (though Google News does not always appear to be the vector)
- All of this traffic is coming from the same Firefox/3.0.10 user agent as noted in this thread, though we have seen some oddball mobile agents here and there.
- All of the traffic coming in from this agent contains no referrer data.
- Bursts occur once or twice a week for 30-60 minutes and then go away.
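The traffic shape listed above (number of unique IPs, concentration on a single URL, missing referrer) can be measured directly from an access log. This is a minimal sketch under the assumption that the log uses Apache's combined format; `profile_user_agent` and its `ua_fragment` parameter are hypothetical names, and matching on a fragment will also count any genuine Firefox 3.0.10 visitors.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)


def profile_user_agent(lines, ua_fragment="Firefox/3.0.10"):
    """Summarise requests whose user agent contains ua_fragment."""
    ips = set()
    paths = Counter()
    total = 0
    no_referer = 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m or ua_fragment not in m.group("ua"):
            continue
        total += 1
        ips.add(m.group("ip"))
        paths[m.group("path")] += 1
        # Apache logs a missing Referer header as "-".
        if m.group("referer") in ("", "-"):
            no_referer += 1
    top = paths.most_common(1)
    return {
        "requests": total,
        "unique_ips": len(ips),
        "top_url": top[0][0] if top else None,
        "top_url_share": top[0][1] / total if total else 0.0,
        "no_referer_share": no_referer / total if total else 0.0,
    }
```

A `top_url_share` and `no_referer_share` both close to 1.0, together with a very large `unique_ips`, would correspond to the pattern described in this update.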
Update from Don Ireland
The last post was April 13 but the traffic certainly has not ended. The strangest part of this may be the fact that any malware author worth his salt could surely (would surely) use a user-agent string from a modern browser, making the block-user-agent defense worthless. This fact makes it seem as if a 'harmless' news aggregator or some other application is the source. So far, though, I also have been unable to reach any real conclusion and hope anyone with information will post it here.
We are seeing the same pattern: a story is picked up by Google News, followed by very high spikes of traffic requesting the story (but not accessory files such as images). The outbound response traffic causes spikes which can saturate the network (or did, until we began responding with only a 503 error). These attacks (what else can we call them?) last about 30 minutes on average, but very popular stories can have high traffic for an hour or more (I am speaking of the Firefox 3.0.10 traffic; of course normal traffic also remains high for a while).
In a one-hour period (for a single server in a load-balanced group) we saw 200,000 requests, of which 97,000 were Firefox 3.0.10 requests, nearly 50% of all requests. And when you consider that a page normally generates 10 or more requests for the main file and accessory files, the 97,000 looms much larger. I note that the 97,000 requests came from 51,000 unique IP addresses. And I am talking about a single hour (actually it was closer to 45 minutes). Whatever is causing this is quite widespread.
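The "looms much larger" point can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above and rests on two loudly stated assumptions: every other request belongs to an ordinary page view that costs exactly 10 requests, and every Firefox 3.0.10 request is one full page fetch.

```python
# Figures quoted in the post above.
total_requests = 200_000
suspect_requests = 97_000
unique_ips = 51_000

# Assumption: an ordinary page view costs exactly 10 requests
# (the post says "10 or more", so this is a lower bound per view).
requests_per_page_view = 10

suspect_share = suspect_requests / total_requests
other_requests = total_requests - suspect_requests
normal_page_views = other_requests / requests_per_page_view
page_fetch_ratio = suspect_requests / normal_page_views
requests_per_ip = suspect_requests / unique_ips

print(f"share of all requests:        {suspect_share:.1%}")
print(f"ordinary page views (at most): {normal_page_views:,.0f}")
print(f"suspect fetches per page view: {page_fetch_ratio:.1f}")
print(f"suspect requests per IP:       {requests_per_ip:.1f}")
```

Under these assumptions the suspect user agent accounts for 48.5% of requests but roughly nine times as many page fetches as ordinary visitors, at about two requests per IP address.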
Update from user119708
We have the same issue on a huge French high-tech news website.
Whenever a news story is published and visible on Google News, traffic to that story increases greatly, with about 50 to 100 visits per IP and the user agent "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)".
All IP addresses seem to be located in France or in French-speaking countries and send no referer. It seems to be a bot, but why would a single remote address come back 50 or 100 times to the same story within a few minutes? Could it be infected computers? Why does the phenomenon appear when the story is visible on Google News? Is Google responsible for this strange traffic?
If someone in this topic has found the explanation, I think it would help many medium or big websites to control their traffic!
EDIT: http://2bits.com/botnet/botnet-hammering-web-site-causing-outages.html If it is indeed infected computers, it is very worrying given the number of addresses involved. We will implement this script for Apache to block this traffic:
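# Make sure mod_rewrite is active in this context
RewriteEngine On
# Referer is empty
RewriteCond %{HTTP_REFERER} ^$
# User agent is bogus old browser (dots escaped so they match literally)
RewriteCond %{HTTP_USER_AGENT} "Gecko/2009042316 Firefox/3\.0\.10"
# Forbid the request
RewriteRule ^(.*)$ - [F,L]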
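Before deploying a rule like this, it may be worth estimating how many logged requests it would have rejected. The sketch below mirrors the two conditions (empty referer and the bogus user agent) in Python and replays them over an access log. It assumes Apache's combined log format; `would_block` and `dry_run` are hypothetical helper names, and the dots in the pattern are escaped so they match literally.

```python
import re

# Assumed layout: Apache "combined" log format.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Same fragment as the RewriteCond on HTTP_USER_AGENT.
UA_RE = re.compile(r"Gecko/2009042316 Firefox/3\.0\.10")


def would_block(referer, user_agent):
    """Mirror the two RewriteCond lines: empty referer AND bogus user agent."""
    # Apache logs a missing Referer header as "-", so treat that as empty.
    referer_empty = referer in ("", "-")
    return referer_empty and UA_RE.search(user_agent) is not None


def dry_run(lines):
    """Return (requests the rule would forbid, parsable requests seen)."""
    blocked = 0
    seen = 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in combined format
        seen += 1
        if would_block(m.group("referer"), m.group("ua")):
            blocked += 1
    return blocked, seen
```

Requests that carry the old user agent together with a referer are deliberately not counted, since the rule lets them through. A large number of those would suggest the referer condition is too lenient for a given site.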
Update from Ernesto
Medium-sized Spanish general news site; for the past few days we have noticed high traffic on some irrelevant news stories.
Whoever it is, it loads the complete HTML; we notice it because of the "page view" count that we increment via a database update once the page is loaded.
We only notice one or two URLs targeted each day.
Lots of requests (7000-12000) to the same URL within a few seconds, distributed over the day and coming from different IPs. On the following days other URLs are targeted.
No referer.
The targeted articles appeared on Google News, but we can't be sure it is related.
Google Analytics doesn't count it as legitimate traffic. We have articles with more than 8000 hits and GA only reports 25 or so (I assume that the JavaScript is not being interpreted).
Update from Old Pro
Adding a few data points for you.
Bots vs. Browsers does not consider this UA to be a bot (yet).
On the most highly trafficked site for which I have logs, May 2012 usage to date shows this UA as less than 1% of traffic. A significant portion of the UA requests appear legitimate (loading all the expected resources, for example). This is basically the same as for Feb 2012.
This site's front page is rarely updated and all the dynamic content is blocked by robots.txt.
This is likely from Genieo. They have updated their application to use a new user agent: Mozilla/5.0+(compatible;+Genieo/1.0+http://www.genieo.com/webfilter.html). It hits with the same pattern as the original user agent but now they seem to identify themselves. If you look at the URL in their user agent they even acknowledge that they may have been or may still be generating too much traffic to certain web sites. -dflaw
Update from Mike Fagan
We've been fighting what we assumed were DDoS attacks for weeks now. We just started seeing Genieo as the user agent for these attacks. Previously we saw "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)" and a ton of requests from "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0". We saw 10k+ different IPs and upwards of 1 million requests per day to just 3 or 4 pages, where the same IP was requesting pages 100+ times and not pulling any additional assets or ads. My finding is that none of these IPs actually went to any other pages on our site.
I contacted Genieo and this is their response:
"Thank you for contacting us.
Old version of Genieo might have caused the traffic loads you describe. We apologize for any inconvenience this may have caused. We released and updated yesterday that address this, data load from our application should fade away in the next 24 hours. We believed we were doing a good service to your site by introducing it to new users. We didn't assess properly that as our install base is growing it may have induce overload on some sits.
Genieo is a personal newspaper or a smart RSS reader. It’s a client side RSS reader with smart semantic personalization filtering. Genieo application follow RSS data from the user’s favorite sites “read” the articles by performing semantic analysis and filter them with respect to the users areas of interest. If the article matches the user interests the application displays the title and snippet of the article in the user homepage. Clicking on the title will lead to the article’s site - your site. Genieo agent is autonomous (for privacy reasons); it runs on the end users machine, this is why you see the agent access your site from many different IP’s.
Most of Genieo data comes from user’s normal RSS feeds, but Genieo also adds some content from new news sites that were not previously registered by the users (for serendipity and diversity). Genieo algorithms looks for “hot” articles, Twitter top hits, YouTube most viewed, and Google news highlights and checks if they match the user’s interest
We were not aware that this was causing load issue for some site. Once this was brought to our attention we update the current users with a new version that prevents load spikes.
Best regards,
-Dotan