I need to analyze logs generated by an Apache2 web server. I'm considering either piping the log to a script that just waits on stdin for input, or analyzing the logs nightly with a batch job.
One of my requirements involves writing some of the information to a database (i.e., execution time, size, URI, remote IP). For that reason I'm apprehensive about piping to a script and keeping a database connection open, so I'm leaning towards a nightly analysis.
Is anyone using a setup like this in a production environment, or does anyone have any thoughts?
I would recommend not using a pipe. For each request, Apache has to write the log entry to the script's pipe, and if the script is slow to read it, Apache waits on the script before it can free its resources. So if, for some reason, your database becomes very slow at doing INSERTs, you could end up with all Apache threads/processes waiting for your script to finish its work and unable to process new user requests.
You could try using mod_log_sql to log to a database on every request. My preference is to do something nightly, when you have lower load, but it depends on how up to date you need the data to be.
There are several advantages to running log analysis as a nightly job if you don't require real-time update of data:
You have control of the load on the web server and the database server caused by the log analysis, i.e., it is unaffected by traffic peaks. You can even move your log files off to a different server for analysis after they've been rotated, if you like. This is very useful if your app is running on multiple web servers.
You can perform aggregations on the data before inserting it into the database, if required.
Several database engines support some kind of bulk insertion (for example, LOAD DATA INFILE in MySQL or COPY in PostgreSQL), which is much faster than inserting the same data one record at a time.
Error handling (bugs in your analysis code, unavailable database etc.) is simpler -- just rerun the same script on the relevant log data after the problem is fixed.
I'm sure there are others, depending on your requirements and hosting situation. Personally I wouldn't even consider piping from Apache unless real-time updates were an absolute must, and even then I would extract only what was needed in the piped script and handle everything else with a nightly job.
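To make the nightly approach a bit more concrete, here is a minimal sketch of such a batch job in Python. It is only an illustration under assumptions: the tab-separated log format, the function names and the `uri_stats` table are all made up for the example, and SQLite stands in for whatever database you actually use. It aggregates per URI first and then writes everything with one batched insert inside a single transaction.

```python
import sqlite3
from collections import defaultdict

# Assumption: Apache is configured with a custom tab-separated format such as
#   LogFormat "%a\t%U\t%B\t%D" timing
# i.e. remote IP, URL path, response size in bytes, time taken in microseconds.

def parse_line(line):
    """Return (ip, uri, size, micros), or None if the line is malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 4:
        return None
    ip, uri, size, micros = parts
    try:
        return ip, uri, int(size), int(micros)
    except ValueError:
        return None

def aggregate(lines):
    """Sum hits, bytes and time per URI before touching the database."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that do not match the assumed format
        _, uri, size, micros = record
        entry = totals[uri]
        entry[0] += 1
        entry[1] += size
        entry[2] += micros
    return [(uri, hits, size, micros)
            for uri, (hits, size, micros) in sorted(totals.items())]

def load(conn, rows):
    """Write all aggregated rows in one transaction with a batched insert."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS uri_stats "
        "(uri TEXT, hits INTEGER, bytes INTEGER, micros INTEGER)"
    )
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany("INSERT INTO uri_stats VALUES (?, ?, ?, ?)", rows)
```

A cron job could open the rotated log file, pass the file object to `aggregate`, and hand the result to `load`. If the database is down or the parser has a bug, nothing is lost: fix the problem and run the script again on the same file.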
If you're in need of a close to real-time solution, you should consider using a message queue system (like ActiveMQ, for instance), and post messages to the queue through the Apache log pipe.
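As a rough sketch of what the piped script could look like in that case (Python, standard library only): the reader thread does nothing but take lines off stdin and hand them to a bounded in-memory buffer, while a separate worker does the slow part. The names `pump`, `consume`, `run` and `publish` are made up for this example, and `publish` is a placeholder for whatever call your message queue client provides. The in-process `queue.Queue` is only a buffer in front of the broker, not a replacement for it.

```python
import queue
import threading

def pump(stream, q):
    """Move log lines into the buffer without ever blocking on the consumer.

    If the buffer is full the line is dropped and counted, so a slow
    consumer cannot stall the process that writes to our stdin.
    """
    dropped = 0
    for line in stream:
        try:
            q.put_nowait(line)
        except queue.Full:
            dropped += 1
    return dropped

def consume(q, publish):
    """Forward buffered lines to `publish` until the None sentinel arrives."""
    while True:
        line = q.get()
        if line is None:
            break
        publish(line)  # hypothetical: e.g. send the line to the message broker

def run(stream, publish, maxsize=10000):
    """Wire reader and worker together; return the number of dropped lines."""
    q = queue.Queue(maxsize=maxsize)
    worker = threading.Thread(target=consume, args=(q, publish))
    worker.start()
    dropped = pump(stream, q)
    q.put(None)  # input has ended, so a blocking put is acceptable here
    worker.join()
    return dropped
```

The script would call `run(sys.stdin, publish)` (after `import sys`) and be referenced from a piped `CustomLog` directive. Dropping lines when the buffer is full is a design choice for this sketch: it trades completeness of the real-time data for never holding up Apache, and the nightly job can still fill the gaps from the log files.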