We have been using AWStats for some time now to parse our Apache server logs into a format for the billing department.
A set of custom Python scripts is used to generate the merged logs from those passed up by each of the servers in the hosting cluster/farm.
The issue I am currently facing is that our logs have grown considerably for certain projects, some generating ~30GB/day of uncompressed logs. AWStats is not the most memory-efficient of parsers and will use upwards of 1GB of memory to process these logs (by comparison, a Python script + regex of mine will run within 450KB of memory).
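For reference, a minimal sketch of that kind of streaming approach (not my actual script): it reads the log line by line and keeps only running totals, so memory stays roughly flat regardless of file size. The regex assumes the Apache combined log format and is illustrative only.

```python
import re

# Apache combined log format is assumed here; adjust the regex to your LogFormat
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def tally(path):
    """Stream the file line by line; only the running totals and the
    set of unique hosts are kept in memory."""
    total_bytes, hosts = 0, set()
    with open(path) as f:
        for line in f:
            m = LOG_RE.match(line)
            if not m:
                continue  # skip malformed lines
            hosts.add(m.group('host'))
            if m.group('bytes') != '-':
                total_bytes += int(m.group('bytes'))
    return total_bytes, len(hosts)
```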
What I need is a replacement for AWStats that can handle large log files on a regular basis and produce a "billing friendly" output.
Stats should include bandwidth, unique visitors, visits per unique visitor, pages served, etc.
Ideally this should also allow us to import the historical AWStats data (which is currently in text files).
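For anyone attempting the same import: AWStats stores its data as plain-text files organized into BEGIN_XXX/END_XXX sections, so a rough first pass could just split the file by section. The layout assumed below is based on those data files; the meaning of each column would still have to be mapped per section.

```python
def read_awstats_sections(path):
    """Split an AWStats plain-text data file into its BEGIN_X/END_X sections.
    The section layout is an assumption; column semantics are not interpreted."""
    sections, name, rows = {}, None, []
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith('BEGIN_'):
                # e.g. "BEGIN_DAY 31" -> section name "DAY"
                name, rows = line.split(' ')[0][len('BEGIN_'):], []
            elif line.startswith('END_'):
                if name is not None:
                    sections[name] = rows
                name = None
            elif name is not None:
                rows.append(line.split(' '))
    return sections
```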
So, in summary, my question is: is there any software available to do this?
As this has not been answered in over a year, I thought I would post an update on my plans.
I'll be leveraging Python's multiprocessing module to provide distributed processing of the logs, using a custom map + reduce methodology; a sketch of the approach is below.
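Roughly what I have in mind, using multiprocessing.Pool: each worker maps one log file (or chunk) down to partial, mergeable stats, and a reduce step folds them into cluster-wide totals. The regex, file names and the exact stats counted are placeholders, not the final design.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Apache combined-style log line; adjust to your LogFormat (this regex is an assumption)
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<page>\S+)[^"]*" \d{3} (?P<bytes>\d+|-)'
)

def map_chunk(path):
    """Map step: boil one log file down to partial, mergeable stats."""
    stats = {'bytes': 0, 'hits': Counter(), 'hosts': set()}
    with open(path) as f:
        for line in f:
            m = LOG_RE.match(line)
            if not m:
                continue  # skip malformed lines
            stats['hosts'].add(m.group('host'))
            stats['hits'][m.group('page')] += 1
            if m.group('bytes') != '-':
                stats['bytes'] += int(m.group('bytes'))
    return stats

def reduce_stats(results):
    """Reduce step: merge the per-file stats into overall totals."""
    total = {'bytes': 0, 'hits': Counter(), 'hosts': set()}
    for r in results:
        total['bytes'] += r['bytes']
        total['hits'] += r['hits']
        total['hosts'] |= r['hosts']
    return total

if __name__ == '__main__':
    paths = ['web1.log', 'web2.log', 'web3.log']  # hypothetical per-server logs
    with Pool() as pool:
        totals = reduce_stats(pool.map(map_chunk, paths))
    print('bandwidth:', totals['bytes'], 'uniques:', len(totals['hosts']))
```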
If you find this question and do not want to "roll your own", there are a few Hadoop projects around that may help (I suggest looking at Pig).