We have several LAMP servers that each run a special script we wrote to report on various system metrics. The report runs daily, and the idea is to be able to make a quick pass and spot any potential issues on the system.
Each LAMP server runs Red Hat Enterprise Linux and hosts 40-50 (and growing) public-facing websites (a mix of HTML, custom PHP, and Drupal sites).
Here is what the script currently includes:
- Server load and users logged in
- Last 10 logins and times
- Disk usage
- Last 10 lines from various logs (qmail, mysql, secure, apache error, package)
- username, port and last login time for every account
- top dump
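For reference, the list above could be sketched as a single shell script along these lines (log paths and section choices are assumptions; adjust for your distribution):

```shell
#!/bin/sh
# daily_report.sh -- a minimal sketch of the daily report described above.
# Log paths are assumptions; adjust for your distribution.
daily_report() {
    echo "== Load and logged-in users =="
    uptime
    echo "== Last 10 logins =="
    last -n 10
    echo "== Disk usage =="
    df -h
    echo "== Apache error log (last 10 lines) =="
    tail -n 10 /var/log/httpd/error_log
    echo "== Per-account last login =="
    lastlog
    echo "== Top snapshot =="
    top -b -n 1 | head -n 15
}
daily_report
```

Running it from cron and mailing the output to yourself keeps the "quick daily pass" workflow you describe.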
The report is already long, so I'm interested in brevity as much as is feasible.
Have you found other metrics important to include in such a script? Would you drop any from this list?
Thanks, team.
j
I would test to make sure your environment is sane. Test that PHP is running correctly (write a simple PHP script that echoes something, wget it, and make sure you received what you expected), that your database is up (just connect and make sure you can see your databases), etc.
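A minimal sketch of that kind of check, assuming you drop a `health.php` containing `<?php echo "PHP OK"; ?>` under each vhost (the URL and expected string are assumptions for your setup):

```shell
#!/bin/sh
# check_url: fetch a URL and verify the body is exactly what we expect.
check_url() {
    url="$1"; expected="$2"
    body=$(curl -fsS "$url" 2>/dev/null) || { echo "FAIL: $url unreachable"; return 1; }
    if [ "$body" = "$expected" ]; then
        echo "OK: $url"
    else
        echo "FAIL: $url returned unexpected content"
        return 1
    fi
}

# Usage (hypothetical vhost URL):
#   check_url "http://www.example.com/health.php" "PHP OK"
# For the database, something as simple as:
#   mysql -u monitor -e 'SHOW DATABASES' >/dev/null || echo "FAIL: mysql"
```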
Also, if you're doing SSL on those, check the certificate for expiration, unexpected changes, etc.
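For the expiration part, `openssl x509 -checkend` works well. A sketch that checks a PEM file on disk (for a live site you would first fetch the cert with `openssl s_client`; paths and the 30-day window are assumptions):

```shell
#!/bin/sh
# check_cert_expiry: warn when a certificate expires within N days (default 30).
# For a live site, fetch the cert first, e.g.:
#   echo | openssl s_client -servername www.example.com -connect www.example.com:443 \
#     2>/dev/null | openssl x509 > /tmp/site.pem
check_cert_expiry() {
    pem="$1"; days="${2:-30}"
    # -checkend takes seconds; exit status 0 means still valid at that horizon
    if openssl x509 -noout -checkend $((days * 86400)) -in "$pem" >/dev/null 2>&1; then
        echo "OK: $pem valid for at least $days more days"
    else
        echo "WARN: $pem expires within $days days (or could not be read)"
    fi
}
```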
I would recommend using an automated gathering tool such as Cacti that will gather and report on various metrics over time. This will allow you to easily spot trends and plan for the future. There is an excellent book by John Allspaw called The Art of Capacity Planning which goes into this topic in great detail. I highly recommend this for anyone who needs to track metrics on servers.
My advice would be not to report any of those things routinely. You will be swamped with information, and human nature dictates that when an issue does arise, you may well overlook it.
Instead, report only when one of those variables is abnormal, perhaps even checking more frequently than once a day. You could use a monitoring and graphing system such as Cacti, which will alert you to such changes and keep historical data for future reference.
For your script: for performance checks, I'd add the contents of the MySQL slow query log (you need to enable it in MySQL's my.cnf). If you have queries taking more than a few seconds, there is probably a performance bottleneck somewhere.
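For reference, enabling the slow query log looks something like this in my.cnf (option names are version-dependent; on older 5.0-era servers the option is `log-slow-queries` rather than `slow_query_log`, so check your version's manual):

```ini
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time     = 2   # seconds; queries slower than this are logged
```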
Put a consistency check (Tripwire, Integrit, or something similar) on places that are not supposed to be modified (/etc? binaries?). Maybe also check FTP logs against a GeoIP database; you probably don't expect to see successful logins from China or the former Soviet Union.
Set up at least iptables logging for outgoing connections, and include in your report any unusual destination ports that show up in the logs. You probably do not want outgoing HTTP connections (except perhaps the update check within Drupal); those can be a sign of attempts to download a malicious payload. Or, even better, drop all outgoing traffic and add exceptions, e.g. outgoing tcp/25 if you send mail.
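A sketch of that default-drop policy with iptables (the allowed ports are assumptions for your setup; test from a console rather than over SSH in case you lock yourself out):

```
# Allow what you expect; log and drop everything else leaving the box.
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT                 # DNS lookups
iptables -A OUTPUT -p tcp --dport 25 -j ACCEPT                 # outgoing mail
iptables -A OUTPUT -j LOG --log-prefix "OUTBOUND-DROP: " --log-level 4
iptables -A OUTPUT -j DROP
```

The LOG rule is what feeds your report: grep the kernel log for `OUTBOUND-DROP:` and list the destination ports.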
Add graphical trend monitoring, e.g. with Munin. Plot charts of load, requests per second, memory and swap usage, MySQL queries per second, slow queries, total Ethernet traffic, and mail traffic.
And as Dan C suggests, you'll drown in logs and start ignoring them. Set up Nagios or any other checking system that will report to you when bad things happen.
You really want to have constant monitoring. We use Nagios to check each of our webservers several times a minute to make sure they're still working. We also monitor our databases and anything else we can think of. Over time you'll have outages and discover things that you should have monitored.
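A minimal Nagios service definition for that kind of check might look like this (host, template, and interval are assumptions for your setup):

```
define service {
    use                   generic-service
    host_name             web01
    service_description   HTTP
    check_command         check_http
    normal_check_interval 1     ; minutes between checks when the service is OK
}
```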
The other side of monitoring is some sort of graphing. We use Munin, but Cacti or Ganglia are common solutions. Graphing is invaluable for spotting trends in your system.
I would also suggest fetching the server-status page with a script and parsing out the values you need to get runtime information.
Add this to your Apache configuration (it is already present, commented out, in most distributions):
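The stanza in question is typically along these lines (Apache 2.2 access-control syntax; the allowed address is an assumption, so restrict it to wherever your script runs):

```
<IfModule mod_status.c>
    ExtendedStatus On
    <Location /server-status>
        SetHandler server-status
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
    </Location>
</IfModule>
```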
This handler is provided by mod_status. You can see current requests, requests per second, etc.
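The machine-readable `?auto` variant of the page is the easiest to parse. A sketch (the localhost URL and the chosen fields are assumptions):

```shell
#!/bin/sh
# parse_status: extract a few key metrics from mod_status "?auto" output.
# Fetch and parse with, e.g.:
#   curl -s "http://localhost/server-status?auto" | parse_status
parse_status() {
    # "?auto" output is "Key: value" lines; keep only the ones we report on
    awk -F': ' '/^(ReqPerSec|BusyWorkers|IdleWorkers)/ { print $1 "=" $2 }'
}
```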