I run a few applications which create their own logs. I also run cron scripts on the same server to import data for my app. When these cron jobs error out, the default behaviour is to email the user that runs the cron job.
There are just too many places I need to check logs and mail for things that might have gone wrong. My question is: what is the best way to handle this? Even better, is there a log parser application that will go through all the system logs and alert me when something really goes wrong, instead of me having to go through them daily?
Logwatch is a good solution, but you are still dealing with lots of emails. I favor feeding everything into syslog and then collecting those syslogs on a central logging machine. You can then do various sorts of processing and event correlation on the logs all in one place.
First, how to get your application logs into syslog? There are a few ways. For the simplest case you can call
logger
in shell scripts to create syslog messages. If you are running Perl scripts, you can retrofit them to use Log::Log4perl to redirect their logs into syslog. There are similar approaches available for other languages. You should also consider replacing the stock system syslog with something like syslog-ng for better performance and the ability to filter logs as they move through the pipeline. syslog-ng also supports blocking pipes, so you can redirect an existing program's output directly into syslog-ng without modifying the program and without losing data.
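If your cron jobs or importers happen to be written in Python, for instance, the standard library can do the same job as logger or Log::Log4perl. Here's a minimal sketch; it assumes a local syslog socket at /dev/log (as on most Linux systems), and the program name "import-cron" is just made up for the example:

    import logging
    import logging.handlers

    # Hand messages to the local syslog daemon (syslogd, rsyslog or syslog-ng),
    # which then routes them like any other system log entry.
    log = logging.getLogger("import-cron")
    log.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(logging.Formatter("import-cron: %(levelname)s %(message)s"))
    log.addHandler(handler)

    log.info("nightly import started")
    log.error("nightly import failed: upstream feed unreachable")  # example failure message

Anything you log this way ends up wherever your syslog configuration sends it, so the cron job no longer needs to mail its errors around.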
Once you get your logs in one place, you can set up tools like Simple Event Correlator to find patterns. You can also run tools like logstash to save logs in a database and enable more powerful queries and graphing.
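To give a feel for what an event-correlation tool does, here's a rough sketch in Python (not real SEC rule syntax; the log path, pattern and thresholds are invented for illustration). It follows a log file like tail -f and alerts when an error pattern repeats too often within a time window:

    import re
    import time
    from collections import deque

    LOGFILE = "/var/log/syslog"              # adjust for your system
    PATTERN = re.compile(r"ERROR|CRITICAL")  # pattern you care about
    WINDOW_SECONDS = 60                      # alert if THRESHOLD matches fall inside this window
    THRESHOLD = 5

    def follow(path):
        # Yield new lines appended to the file, like `tail -f`.
        with open(path) as fh:
            fh.seek(0, 2)                    # start at the end of the file
            while True:
                line = fh.readline()
                if line:
                    yield line
                else:
                    time.sleep(0.5)

    hits = deque()
    for line in follow(LOGFILE):
        if PATTERN.search(line):
            now = time.time()
            hits.append(now)
            while hits and now - hits[0] > WINDOW_SECONDS:
                hits.popleft()
            if len(hits) >= THRESHOLD:
                print("ALERT: %d matching lines in the last %ds, latest: %s"
                      % (len(hits), WINDOW_SECONDS, line.strip()))
                hits.clear()

A real correlator handles many rules, event pairing and suppression, but the principle is the same: the machine reads every line so you only have to read the interesting ones.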
Of course there are commercial tools to do this sort of thing as well. One of the most popular is Splunk, which is free to try and free for limited amounts of data. Splunk comes with a client that you can run on multiple servers, which saves you the step of getting all your logs into a central syslog server. If you have more money than developers, something like Splunk might be worth considering.
Finally, here's a nice central logging mini-howto that covers a lot of the same ground as I just did.
To my mind, logwatch is best. It comes by default with many distros, and although it's a pig to get the hang of the config syntax, once the work's put in it becomes very much a sit-back-and-relax job. It works in batch mode, running every so often to digest recent logfiles, and sort and summarise the entries.
If you don't want to work that hard, swatch is less bang for much less buck. It runs interactively instead, eating each new line of a logfile as it appears and alerting you as requested if it matches certain conditions.
Whatever approach you take, I'd personally recommend against a "sort through the logfiles when things go wrong" approach, in favour of "sort through the logfiles all the time, so I can get the hang of my system".
Firstly, the logs of the observed failure may not be the logs of the root of the failure. Your web server can scream that the data from the cookie cache file is in the wrong format (assertion failure!) until it's blue in the face, but if you're not looking at the system log that says
/cache
is 100% full, and inferring from that that no cookie data can be written to the cache, you won't really know what's gone wrong, by way of example.

Secondly, to my mind it's unreasonable to expect an application to know what sort of logs your system produces, either in normal or pathological operation. Knowing the intricacies of your system is your job as a sysadmin; most sysadmins will go a step further and automate the exclusion of all normal behaviours and the notification of all pathological ones, using regular tools (such as those above) customised for their system, or by writing their own.
Another solution, if you have the resources, is Splunk. You create a Splunk server on your network and ship all the logs from all your servers and all your apps to it. It indexes the logs and syncs them against a timeline.
It's an awesome concept and can really help with debugging issues. It's free for up to 500MB a day of logs:
http://www.splunk.com/view/free-vs-enterprise/SP-CAAAE8W
Andrew
You can also take a look at Octopussy (disclaimer: my project). It's quite difficult at first, but really powerful in the end.