I have a server with a faulty power button that likes to reboot itself. Usually there are warning signs: the acpid log file in /var/log starts spamming garbage for about 10 hours or so.
Is there an easy way I can have something monitor the acpid log and email me when it has new activity?
I wouldn't consider myself extremely advanced so any "guides" you may have for accomplishing something like this would be very helpful and much appreciated. Thank you!
You could use something like LogWatch. Or even a simple script like this (it's pseudocode; you'll need to modify it for your environment):
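A minimal sketch of what such a script might look like in Python, not the script originally posted. It assumes the log lives at /var/log/acpid and remembers how far it has read in a hypothetical state file, /var/tmp/acpid.offset; both paths are placeholders to adjust. It only prints the new lines, relying on cron to mail a job's output to the crontab owner (or the MAILTO address), which requires a working local mail setup.

```python
#!/usr/bin/env python3
"""Print any lines appended to a log file since the previous run."""
import os
import sys

LOG_PATH = "/var/log/acpid"            # assumption: adjust to your acpid log
STATE_PATH = "/var/tmp/acpid.offset"   # hypothetical file storing the last read position


def read_new_lines(log_path, state_path):
    """Return lines added to log_path since the offset saved in state_path."""
    try:
        with open(state_path) as f:
            offset = int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        offset = 0  # first run, or unreadable state: start from the beginning

    if os.path.getsize(log_path) < offset:
        offset = 0  # log shrank, so it was probably rotated: start over

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()

    # Remember where we stopped so the next run only sees newer lines.
    with open(state_path, "w") as f:
        f.write(str(new_offset))

    return data.decode("utf-8", errors="replace").splitlines()


if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
    if os.path.exists(log):
        lines = read_new_lines(log, STATE_PATH)
        if lines:
            # Any output here ends up in the mail cron sends for the job.
            print(f"{len(lines)} new line(s) in {log}:")
            print("\n".join(lines))
```

Note that the very first run reports the whole existing log, since there is no saved offset yet; after that, a quiet log produces no output and therefore no mail.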
Put that in cron to run every hour or so and you should get an email letting you know when it's getting weird.
You can use OSSEC HIDS to set up rules on log files and, at the same time, get security information from your host.
Setting it up is very easy:
Add your rules to /var/ossec/rules/local_rules.xml as specified below, then start OSSEC with /var/ossec/bin/ossec-control start.
local_rules.xml
Rules can be very flexible and complex. See this table to get an idea of the parameters involved in a rule.
If you don't want or need the other security features, you can deactivate them by removing the include lines under the rules tag.
I would suggest Nagios. It's what we run where I work for monitoring multiple machines on our network. It's very good; I've not used it specifically for what you're doing, but you can certainly set it up to email you when errors occur.
There is a guide for installing it on Ubuntu at http://beginlinux.com/blog/2008/11/install-nagios-3-on-ubuntu-810/ and one for installing it on Debian at http://www.debianhelp.co.uk/nagiosinstall.htm.
And you can send it with something like this:
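A minimal sketch of sending such an alert with Python's standard library, offered as one possible approach rather than the snippet originally posted. It assumes a mail server is listening on localhost port 25; the addresses and the function names `build_alert` and `send_alert` are hypothetical placeholders.

```python
import smtplib
from email.message import EmailMessage


def build_alert(sender, recipient, log_path, new_lines):
    """Build a plain-text alert message listing the new log lines."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"{len(new_lines)} new line(s) in {log_path}"
    msg.set_content("\n".join(new_lines) + "\n")
    return msg


def send_alert(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (assumption: a local MTA on port 25)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


# Example use (placeholder addresses, replace with your own):
# msg = build_alert("root@example.com", "you@example.com",
#                   "/var/log/acpid", ["first new line", "second new line"])
# send_alert(msg)
```

Keeping message construction separate from delivery makes it easy to check the message contents without a mail server, and to swap in a different SMTP host later.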
Download and install Splunk on the server. It's similar to LogWatch, but provides you with a search engine for your logs.
You can configure it to index your logs; you can then search them to find patterns and errors, and look at what other logs were doing at that specific point of failure.
It can also be set to send alerts or execute scripts at certain thresholds. So if a particular error starts being spammed to your log, you can script it to automatically restart the offending service.
We use Splunk in our server cluster and it has been a lifesaver!
I'm using Zabbix with IPMI tools to restart faulty servers on demand. I think OSSEC is a good choice too, but you really need to experiment and debug before putting it in production.
At a previous employer, we used logsurfer+ to monitor logs in real time and send email alerts. It does take a lot of time and configuration to tune for false positives, but we had a ruleset that worked quite well for a variety of findings and alerting, far more valuable than Nagios was for similar purposes.
Unfortunately I don't have access to the config file anymore to provide samples of what we filtered, but the site should provide more information and examples.
You can also take a look at my Octopussy project.