I'll begin by telling you what we do.
The measures we have implemented catch a lot of spiders, but we have no idea how many we are missing. Currently, we apply a set of measures that partially overlap:
monitor requests for our robots.txt file, then filter all other requests from the same IP address + user agent (a rough sketch of this filtering appears after this list)
compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose (the second sketch below shows this kind of lookup)
pattern analysis: we certainly don't have pre-set thresholds for these metrics, but we still find them useful. We look at (i) page views as a function of time (i.e., clicking a lot of links with 200 msec spent on each page is probative); (ii) the path by which the 'user' traverses our site, i.e., whether it is systematic and complete or nearly so (like following a back-tracking algorithm); and (iii) precisely-timed visits (e.g., 3 am each day). The third sketch below computes these kinds of features.
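
To make the first measure concrete, here is a minimal sketch of the robots.txt filter. It assumes Apache-style combined log lines; the regex, field names, and function names are just illustrative, not our actual pipeline:

```python
import re

# Assumes Apache "combined" log lines; the regex and field names are illustrative.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def flag_robots_txt_clients(log_lines):
    """Collect every (IP, user agent) pair that ever requested /robots.txt,
    then return all of those clients' other requests for filtering."""
    entries = [e for e in (parse_line(l) for l in log_lines) if e]
    robots_clients = {(e["ip"], e["ua"]) for e in entries if e["path"] == "/robots.txt"}
    flagged = [e for e in entries
               if (e["ip"], e["ua"]) in robots_clients and e["path"] != "/robots.txt"]
    return robots_clients, flagged
```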
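
The list comparison is essentially a lookup against locally cached copies of the published lists. The one-entry-per-line file format below is an assumption on my part, not something either site mandates:

```python
def load_published_lists(ua_list_path, ip_list_path):
    """Load locally cached copies of the published lists (e.g., exports from
    iab.net or user-agents.org). One entry per line is assumed here; the real
    downloads need their own parsing."""
    with open(ua_list_path) as f:
        ua_substrings = {line.strip().lower() for line in f if line.strip()}
    with open(ip_list_path) as f:
        listed_ips = {line.strip() for line in f if line.strip()}
    return ua_substrings, listed_ips

def is_listed_bot(entry, ua_substrings, listed_ips):
    """Flag a parsed request if its IP is listed or its user agent contains
    any listed substring (case-insensitive)."""
    ua = entry["ua"].lower()
    return entry["ip"] in listed_ips or any(s in ua for s in ua_substrings)
```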
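
Finally, a sketch of the pattern-analysis features, again with invented helper names and no hard thresholds; it reuses the entry dicts produced by parse_line above, and `site_paths` is assumed to be the set of our site's page URLs:

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def traffic_features(entries, site_paths):
    """Per (IP, user agent) client, compute the features we eyeball:
    (i) median gap between requests, (ii) fraction of the site's known
    pages covered, (iii) how tightly visit hours cluster."""
    by_client = defaultdict(list)
    for e in entries:
        # Combined-log timestamp, e.g. "10/Oct/2000:13:55:36 -0700"
        ts = datetime.strptime(e["ts"].split()[0], "%d/%b/%Y:%H:%M:%S")
        by_client[(e["ip"], e["ua"])].append((ts, e["path"]))

    features = {}
    for client, hits in by_client.items():
        hits.sort()
        times = [t for t, _ in hits]
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        features[client] = {
            # (i) ~200 msec per page, over many pages, is probative
            "median_gap_s": median(gaps) if gaps else None,
            # (ii) systematic, (nearly) complete traversal of the site
            "coverage": len({p for _, p in hits}) / max(len(site_paths), 1),
            # (iii) precisely-timed visits, e.g. always around 3 am
            "distinct_hours": len({t.hour for t in times}),
        }
    return features
```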
Again, I am fairly sure we're only getting the low-hanging fruit, but I'm interested in hearing views from the community.
The newsletter posts tagged "Web Log Analysis" on the site of Nihuo (a commercial web log analyzer) could be useful reading.