I often have to add new rules to the apache-badbots.conf file, and every time I wonder whether it is still working...
For example, this is my current apache-badbots.conf file:
[Definition]
badbotscustom = MQQBrowser|LieBaoFast|Mb2345Browser|zh-CN|python-requests|LinkpadBot|MegaIndex|Buck|SemrushBot|SeznamBot|JobboerseBot|AhrefsBot|AhrefsBot/6.1|MJ12bot|[email protected]|SemrushBot/6~bl|cortex|Cliqzbot|Baiduspider|serpstatbot|Go 1.1 package http|Python-urllib|StormCrawler|archive.org_bot|CCBot|BLEXBot|ltx71|DotBot|EmailCollector|WebEMailExtrac|Track$
badbots = Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrow$
#failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
failregex = ^<HOST> -.*"(GET|POST).*HTTP.*".*(?:%(badbots)s|%(badbotscustom)s).*"$
ignoreregex =
datepattern = ^[^\[]*\[({DATE})
              {^LN-BEG}
Yesterday I added "MQQBrowser|LieBaoFast|Mb2345Browser|zh-CN" and today I see a lot of MQQBrowser and LieBaoFast in my access logs.
sudo awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -n
...
3408 Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36
3418 Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0
3444 Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN
3473 Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3
What's wrong? Is it working? Is there a way to tell whether there's an error, and what that error is?
Update: I'm still missing something here; for example, today I found more bots in my logs that should have been banned.
Just to make sure I understand: this filter looks for the strings I add to apache-badbots.conf in the server's access log, and when it finds one, it adds a ban rule to fail2ban, right?
- So, for example, is there a difference if I write "netEstate NE Crawler" or just "netEstate"?
- Why does the string "atSpider/1\.0" have all these backslashes? Must every "." be preceded by a backslash? (China Local Browse 2\.6|DataCha0s/2\.0|DBrowse 1\.4b)
- Can an email address be used as a string? (e.g. [email protected])
- Are strings with spaces like "Go 1.1 package http" correct, or do they generate an error?
- Can the "-" character be used? (e.g. python-requests, Python-urllib)
- Can the "_" character be used? (e.g. archive.org_bot)
Do you have a robots.txt? That is for telling bots where and how to crawl. There is more than one question here, so let's look at them one by one.
If you want to check whether it is working, just look at the fail2ban log. To understand HOW it works, please read /usr/share/doc/fail2ban/README or the other docs. In short: fail2ban reads your log file using the defined filters and creates firewall rules for the problematic IP addresses found in problematic log lines. If the offending bot uses a single IP address, it will be banned after a few attempts. In this particular case there are thousands of requests with MQQBrowser and LieBaoFast coming from a lot of different IPs, so as long as they still have some IP that is not yet blocked, new log entries will keep appearing.
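A quick way to test the filter itself is fail2ban-regex, which replays a log file through a filter and reports how many lines matched. The paths and the jail name apache-badbots below are the Debian/Ubuntu defaults, so adjust them if your setup differs:

# Replay the access log through the badbots filter and count matching lines
fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-badbots.conf

# Show which addresses the jail has actually banned (jail name may differ on your system)
sudo fail2ban-client status apache-badbots

# Watch bans arriving in the fail2ban log
sudo grep 'Ban' /var/log/fail2ban.log | tail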
These config lines are regular expressions, and that is the short answer to most of the other questions:
- In a regex, "." matches any single character, so a literal dot is escaped as "\.", which is why atSpider/1\.0, China Local Browse 2\.6 and DataCha0s/2\.0 contain backslashes.
- Spaces, "-", "_", "@" and "/" have no special meaning in a regex, so entries like "Go 1.1 package http", python-requests, archive.org_bot or an email address work as they are.
- A longer entry such as "netEstate NE Crawler" matches only user agents containing that exact phrase, while the shorter "netEstate" matches anything that contains it, so the short form casts a wider net.
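As a rough sketch (these particular entries are only examples, not taken from your log), the strings from your list would go into badbotscustom like this, with literal dots escaped and everything else left as-is:

badbotscustom = netEstate NE Crawler|Go 1\.1 package http|python-requests|Python-urllib|archive\.org_bot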
If you really want to select your clients based on the User-Agent string, you should use .htaccess, BBQ or some other plugin.
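A minimal sketch of the .htaccess approach, assuming mod_rewrite is enabled and using a few of your bot names purely as examples:

# Return 403 to any request whose User-Agent contains one of these strings
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (MQQBrowser|LieBaoFast|Mb2345Browser) [NC]
RewriteRule .* - [F,L]

The difference is that this blocks the request at the web server level regardless of IP address, while fail2ban only bans the individual IPs it has already seen misbehaving.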