I have been getting a lot of hits in my logs from clients that crawl most of the top-level pages of my site and report a Java version as the User-Agent.
I see different variants of the Java version in the User-Agent, e.g. Java/1.6.0_04, Java/1.4.1_04, Java/1.7.0_25, etc.
Sometimes, but not always, I get a 404 for /contact/, but not for any of the other pages shown below.
The IPs are almost always spam harvesters and bots, according to Project Honeypot.
78.129.252.190 - - [24/Jan/2014:01:28:52 -0800] "GET / HTTP/1.1" 200 6728 "-" "Java/1.6.0_04" 198 7082
78.129.252.190 - - [24/Jan/2014:01:28:55 -0800] "GET /about HTTP/1.1" 301 - "-" "Java/1.6.0_04" 203 352
78.129.252.190 - - [24/Jan/2014:01:28:55 -0800] "GET /about/ HTTP/1.1" 200 29933 "-" "Java/1.6.0_04" 204 30330
78.129.252.190 - - [24/Jan/2014:01:28:56 -0800] "GET /articles-columns HTTP/1.1" 301 - "-" "Java/1.6.0_04" 214 363
78.129.252.190 - - [24/Jan/2014:01:28:57 -0800] "GET /articles-columns/ HTTP/1.1" 200 29973 "-" "Java/1.6.0_04" 215 30370
78.129.252.190 - - [24/Jan/2014:01:28:58 -0800] "GET /contact HTTP/1.1" 301 - "-" "Java/1.6.0_04" 205 354
78.129.252.190 - - [24/Jan/2014:01:28:58 -0800] "GET /contact/ HTTP/1.1" 200 47424 "-" "Java/1.6.0_04" 206 47827
What are they looking for? A vulnerability?
Can I block these visits by their Java User-Agent? If so, how? Or with a PHP function?
I know how to block IPs in .htaccess, but blocking by User-Agent would be a more proactive method for me.
Update 2/04/14: I'm not able to block the Java User-Agent with either of these two rules:
RewriteCond %{HTTP_USER_AGENT} Java/1.6.0_04
RewriteRule ^.*$ - [F]
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^.*$ - [F]
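For reference, a fuller version of the second attempt, with the engine explicitly enabled and a case-insensitive match (a sketch only; on shared hosting it takes effect only if the host's AllowOverride setting permits mod_rewrite directives in .htaccess):

```apache
# Return 403 Forbidden to any client whose User-Agent starts with "Java"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Java [NC]
RewriteRule ^ - [F]
```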
Note: I'm on shared hosting and have limited access to the Apache configuration.
User-Agent string matching is not a reliable method, as any client can set that header to whatever it likes.
From my experience, every internet-facing web server is bound to be crawled and surfed (that's THE point, right? :).
If anything, they're probably just crawling your web server for indexing of some sort. If you want to frustrate those clients or limit their request rate, I'd suggest Apache's mod_evasive (formerly mod_dosevasive) or mod_qos, which can limit the number of concurrent connections per IP per second, among other things.
Keep in mind that this approach could lead to your web server blocking legitimate requests, for example many clients sharing a single NAT-routed IP.
If the bots eventually learn your mod_evasive rate limits, you'll then need to return the 403 Forbidden yourself, with a set of rules in your PHP app based on observed crawling behaviour.
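A minimal sketch of that app-level check in PHP (the `is_java_client` helper is my own illustrative name, and matching only on a `Java/` prefix is an assumption based on the log excerpts above, not a complete rule set):

```php
<?php
// Reject clients whose User-Agent begins with "Java/" before doing any real work.
function is_java_client(?string $ua): bool
{
    // Case-insensitive prefix match, mirroring UAs like "Java/1.6.0_04".
    return $ua !== null && stripos($ua, 'Java/') === 0;
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? null;
if (is_java_client($ua)) {
    http_response_code(403);
    exit('Forbidden');
}
```

This runs entirely inside the app, so it works even where .htaccess overrides are restricted, but it still relies on the client telling the truth about its User-Agent.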
Is AllowOverride set to All?
As a more robust solution, I would recommend using mod_evasive to block excessive scanning by any client. Note that blocking at the firewall level additionally requires iptables, which mod_evasive can invoke on your behalf.
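A starting-point mod_evasive configuration might look like the following (the thresholds are illustrative and should be tuned to your traffic; on shared hosting you typically cannot load or configure server modules yourself):

```apache
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    DOSPageCount        5       # max requests for the same page per page interval
    DOSPageInterval     1       # page-count interval, in seconds
    DOSSiteCount        50      # max requests for the whole site per site interval
    DOSSiteInterval     1       # site-count interval, in seconds
    DOSBlockingPeriod   60      # how long an offending IP receives 403s, in seconds
    # Optional: hand the offending IP (%s) to the firewall via iptables
    DOSSystemCommand    "sudo /sbin/iptables -I INPUT -s %s -j DROP"
</IfModule>
```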