We log visits and visitors on page hits, and bots are clogging up our database. We can't use CAPTCHA or similar techniques because this happens before we ever ask for human input: we are simply logging page hits, and we would like to log only the hits made by humans.
Is there a list of known bot IPs out there? Does checking known bot user-agents work?
I would think from the sysadmin standpoint repeated hits from a single IP at a regular interval would indicate a likely bot. You could find this by simply parsing the logs.
I might first pick out the IPs with a large number of hits, then fill an array with the times of those hits and look at the standard deviation of the intervals between them. A standard deviation that is small relative to the average interval suggests something running on a timer rather than a person.
The distinct advantage of a solution like this is that you get to write something fairly interesting if you are working as a full-time admin ;-)
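To make the interval idea concrete, here is a minimal sketch. It assumes you have already parsed the logs into (ip, unix_timestamp) pairs; the function name `regular_hitters` and the thresholds `min_hits` and `max_cv` are made up for illustration and would need tuning against real traffic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def regular_hitters(hits, min_hits=20, max_cv=0.1):
    """Return {ip: cv} for busy IPs whose hits arrive at near-constant intervals.

    hits: iterable of (ip, unix_timestamp) pairs (assumed already parsed).
    cv is the coefficient of variation (stddev / mean) of the gaps between
    consecutive hits; a low value means very regular timing.
    """
    times_by_ip = defaultdict(list)
    for ip, ts in hits:
        times_by_ip[ip].append(ts)

    suspects = {}
    for ip, times in times_by_ip.items():
        if len(times) < min_hits:
            continue  # too few hits to say anything about regularity
        times.sort()
        intervals = [b - a for a, b in zip(times, times[1:])]
        avg = mean(intervals)
        # avg == 0 means every hit shares one timestamp: treat as fully regular
        cv = pstdev(intervals) / avg if avg else 0.0
        if cv <= max_cv:
            suspects[ip] = cv
    return suspects
```

Dividing by the mean keeps the threshold independent of how fast the client polls, so a bot hitting every 5 seconds and one hitting every 5 minutes are judged the same way. Bots that add random jitter will slip past this check.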
There are a few elements to pursue with this.
The user-agent string is one signal, but it can be trivially spoofed.
I've found a reasonably useful heuristic is to do a bit of pre-processing, then look at traffic:
Parse your access logs, adding host, ASN, CIDR, and ASN name information. Reduce URLs to their non-variant part (generally by stripping everything past '?', though YMMV). If you've got specific search or utility pages, focus on these (typically I've seen problems either with bots hitting some sort of user verification service, or with search).
Look for single IPs with high volumes of traffic.
Look for single CIDR blocks or ASNs with high volumes of traffic.
Rule out legitimate search traffic (Google, Bing, Yahoo, Baidu, Facebook, and similar bots / network space). This is probably going to be one of your larger areas of ongoing maintenance, since this stuff changes all the time.
Rule out legitimate user traffic, especially from high-volume users of your site.
Identify what normal patterns of usage are, for both end-users and search bots. If a typical user visits 1-3 pages per minute, with a typical session of 5-10 minutes, and Googlebot limits itself to, say, 10 searches per minute, and you suddenly see a single IP or CIDR block lighting up with hundreds or thousands of searches per minute, you may have found your problem.
Investigate the origins of high-volume / high-impact (in a negative sense) traffic. Frequently a WHOIS query will reveal that this is some sort of hosting space -- not typically where you'll see a lot of legitimate user traffic. Patterns may appear in user-agent strings, request URLs, referrer strings, etc., that tip you off to additional patterns.
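The counting steps above can be sketched roughly as follows. This is not the awk tooling I actually use; it is a small stand-in that assumes Apache/nginx Common Log Format and, since ASN data needs an external database, groups addresses by /24 (IPv4) or /48 (IPv6) as a crude substitute for real CIDR/ASN lookups. The name `summarize` is made up for the example.

```python
import ipaddress
import re
from collections import Counter

# Assumes Common Log Format; adjust the pattern to match your own logs.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" \d{3}')


def summarize(lines):
    """Count hits per IP, per network block, per query-stripped path,
    and the busiest single minute seen for each IP."""
    by_ip, by_block, by_path = Counter(), Counter(), Counter()
    by_ip_minute = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not look like CLF
        ip, stamp, url = m.groups()
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # logged a hostname rather than an address
        # Assumption: fixed-size blocks stand in for real CIDR/ASN data.
        prefix = 24 if addr.version == 4 else 48
        block = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        by_ip[ip] += 1
        by_block[str(block)] += 1
        by_path[url.split("?", 1)[0]] += 1  # keep only the non-variant part
        # CLF stamps look like 10/Oct/2023:13:55:01 +0000; 17 chars = minute
        by_ip_minute[(ip, stamp[:17])] += 1

    peak_per_minute = Counter()
    for (ip, _minute), n in by_ip_minute.items():
        peak_per_minute[ip] = max(peak_per_minute[ip], n)
    return by_ip, by_block, by_path, peak_per_minute
```

Calling `by_ip.most_common(20)` or `by_block.most_common(20)` on the results gives the heavy hitters, and `peak_per_minute` is what you would compare against your idea of a normal user's 1-3 pages per minute.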
A caching whois client can be a big help if you end up doing a lot of WHOIS lookups, both to speed up the process and to avoid rate-limiting/throttling by registrars (for some reason, they don't take kindly to entities conducting thousands of repeat/automated lookups). You may be able to go to registrars directly for more information, though I haven't pursued this.
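If you don't have a caching client handy, the idea is simple enough to sketch. The snippet below is only an illustration: `CachingWhois` and its parameters are made-up names, the one-day TTL is arbitrary, and `whois.arin.net` is just one registry's server (it only covers ARIN-managed space), so treat the server choice as an assumption. The raw query speaks plain WHOIS over TCP port 43.

```python
import socket
import time


def whois_query(query, server="whois.arin.net", timeout=10):
    """Send one plain WHOIS query over TCP port 43 and return the reply text."""
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break  # server closes the connection when it is done
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


class CachingWhois:
    """In-memory cache so repeated lookups never hit the registry twice
    within `ttl` seconds. `fetch` and `clock` are injectable for testing."""

    def __init__(self, fetch=whois_query, ttl=86400, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._cache = {}  # query -> (time stored, response text)

    def lookup(self, query):
        now = self._clock()
        hit = self._cache.get(query)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        text = self._fetch(query)
        self._cache[query] = (now, text)
        return text
```

For real use you would probably want the cache on disk and keyed by network block rather than by single IP, so that neighbouring addresses share one lookup.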
Checks against various reputation databases (spam lookups, SenderBase, there's now some Google stuff along these lines) may also help confirm that traffic is coming from poorly policed network space.
I'd love to say I've got something to sell you along these lines, but what I'm working with is mostly some awk and other tools to pull this together. It'll parse a million lines of log a minute or so (plus a bit of preparatory overhead to prepare hashes for IPs and ASN/CIDR information) on a modest workstation. Not fully automated, but it'll give me a decent picture of an issue with a few minutes of work.
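For the "rule out legitimate search traffic" step, one check that doesn't rely on maintaining IP lists is a reverse DNS lookup followed by a forward lookup, which is the method the big search engines themselves describe for verifying their crawlers. A rough sketch, where the function name and the suffix list are my own assumptions (check each operator's current documentation before trusting the suffixes):

```python
import socket

# Assumed hostname suffixes for Google and Bing crawlers; verify before use.
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def _reverse(ip):
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    """Forward lookup: hostname -> set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_verified_crawler(ip, reverse=_reverse, forward=_forward,
                        suffixes=CRAWLER_SUFFIXES):
    """True only if the PTR name has a known crawler suffix AND that name
    resolves back to the same IP. Lookup failures count as not verified."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        return ip in forward(host)
    except OSError:  # socket.herror and socket.gaierror are both OSErrors
        return False
```

The forward lookup matters because anyone who controls the reverse zone for their own address space can make the PTR record claim to be googlebot.com. DNS lookups are slow, so you'd want to cache the results the same way as the WHOIS lookups.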
A quick Google search revealed this site. It could be a good starting point.
Checking only for the user agent may not be enough since the user agent can be easily forged.
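That said, a user-agent check is cheap and does catch the bots that identify themselves honestly, so it is still worth having as a first filter. A minimal sketch, where the token list and the function name are my own assumptions rather than any authoritative list:

```python
import re

# Assumed tokens that commonly appear in self-identifying bot user agents.
BOT_UA_RE = re.compile(
    r"bot|crawl|spider|slurp|curl|wget|python-requests", re.IGNORECASE
)


def is_declared_bot(user_agent):
    """True if the user agent is empty or contains a known bot token.

    This only catches bots that announce themselves; a forged browser
    user agent passes straight through.
    """
    if not user_agent or not user_agent.strip():
        return True  # real browsers always send something
    return bool(BOT_UA_RE.search(user_agent))
```

Expect the occasional false positive from substring matching, and combine this with traffic-pattern checks for anything that matters.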
There's actually a new technology out there that is meant to combat bots on a larger scale. This can be helpful to programmatic media buyers. It's called device fingerprinting, and it essentially replaces cookie-based visitor tracking. The premise behind it is that cookies are often abused by fraudsters, and IP addresses can be changed via VPNs. Fingerprints, on the other hand, are tied to the combination of device, IP and geolocation and are much harder to change. There are a couple of websites that provide this solution - fraudhunt.net, CPA Detective and Distil - just to name a few.
This technology certainly has its limitations. If you don't want to dig deeper into it and install other tools, you can filter bots out in Google Analytics (GA). Here are a few known bot domains you should definitely block:
darodar.com (and various subdomains)
econom.co
ilovevitaly.co
semalt.com (and various subdomains)
buttons-for-website.com
see-your-website-here.com
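If you are logging hits yourself rather than relying on GA filters, the same domain list can be applied to the Referer header before a hit is written. A small sketch, where `is_spam_referrer` is a made-up helper name and matching every subdomain of every listed domain is my own simplification:

```python
from urllib.parse import urlsplit

# Domains taken from the list above.
SPAM_DOMAINS = {
    "darodar.com",
    "econom.co",
    "ilovevitaly.co",
    "semalt.com",
    "buttons-for-website.com",
    "see-your-website-here.com",
}


def is_spam_referrer(referrer, blocked=SPAM_DOMAINS):
    """True if the referrer's host is a blocked domain or one of its subdomains."""
    if "//" not in referrer:
        referrer = "//" + referrer  # let urlsplit treat a bare host as a host
    host = urlsplit(referrer).hostname or ""
    parts = host.split(".")
    # Check the host itself and each parent domain: a.b.com, b.com, com
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))
```

Matching on whole labels rather than a plain substring avoids blocking an unrelated domain that merely ends with the same letters.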