I'm looking to see if there is a good "Best Fit" solution for what I'm being asked to deliver.
The IT company I work for contracts to a local business. Although we have been long-term partners, we recently signed an updated long-term agreement with this business. One of the things they want is downtime tracking.
There are a LOT of options out there for downtime tracking... enough that I'd like to ask the experts what my best options are.
I want to track downtime at the following levels:
- Server - When did it go down, why, and for how long?
- Connection - Tracking ISP, phone, and server-to-server connections. The server is acting fine, but "downtime" was caused because someone hit the pole that holds the line between two sites.
- Application - The server is fine and the connection is fine, but the application stops responding. We have servers that host anywhere from one to many applications.
- Extreme lag - I also want to count bad/extreme lag as downtime, at any of the levels above. The server/connection/app is running, but it takes 5 minutes for a page to load or a document to print.
Some of the applications we want to track are home-grown, making it possible to add test parts/procedures for application level tracking of downtime. Some applications are web based making it possible to use simple, periodic web-page loads to track downtimes... but we have many closed applications that are going to be tricky, if not plain impossible, to write tests for.
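For the web-based applications, the "simple, periodic web-page load" idea could look roughly like the sketch below. It is only an illustration: the function names, the 5-second "slow" threshold, and the example URL are all assumptions of mine, not something any particular monitoring tool prescribes. It times one page load and classifies the result as UP, SLOW, or DOWN, so extreme lag can be recorded alongside hard outages.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Tuple

# Assumed threshold: a page that takes longer than this counts as "SLOW".
SLOW_THRESHOLD_SECONDS = 5.0


def classify(ok: bool, elapsed: float,
             slow_after: float = SLOW_THRESHOLD_SECONDS) -> str:
    """Turn one probe result into a coarse status string."""
    if not ok:
        return "DOWN"
    if elapsed > slow_after:
        return "SLOW"
    return "UP"


def probe(url: str, timeout: float = 30.0) -> Tuple[bool, float]:
    """Load the page once; return (succeeded, seconds taken)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # pull the whole body so the timing is realistic
            ok = 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # DNS failure, refused connection, timeout, HTTP 4xx/5xx, etc.
        ok = False
    return ok, time.monotonic() - start


# Example use (hypothetical URL), e.g. run from cron every few minutes:
#   ok, elapsed = probe("http://intranet.example/status")
#   print(time.strftime("%Y-%m-%d %H:%M:%S"), classify(ok, elapsed), round(elapsed, 2))
```

Logging one line per run gives a raw history from which outage start and end times can be derived later.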
So what I'm thinking is that, unfortunately, I'm going to need a system that has an "easy" way to add or edit downtime statistics.
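To make concrete what I mean by statistics that can be both automated and hand-edited, here is a rough sketch of the kind of record I have in mind. Everything in it (the `Outage` fields, the `source` flag, the function names) is hypothetical and not taken from any product. Automated checks and manual entries go into the same list, and overlapping entries are counted only once when totalling downtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Outage:
    target: str            # e.g. a server, link, or application name
    level: str             # "server", "connection", or "application"
    start: datetime
    end: datetime
    cause: str = ""
    source: str = "auto"   # "auto" for monitoring, "manual" for hand-entered


def total_downtime(outages: List[Outage]) -> timedelta:
    """Sum outage durations, counting overlapping intervals only once."""
    total = timedelta()
    current_start = current_end = None
    for outage in sorted(outages, key=lambda o: o.start):
        if current_end is None or outage.start > current_end:
            # Gap before this outage: close the previous merged interval.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = outage.start, outage.end
        else:
            # Overlaps the current merged interval: extend it if needed.
            current_end = max(current_end, outage.end)
    if current_end is not None:
        total += current_end - current_start
    return total


def availability_percent(outages: List[Outage],
                         period_start: datetime,
                         period_end: datetime) -> float:
    """Availability over a period; assumes all outages fall inside it."""
    period = period_end - period_start
    return 100.0 * (1 - total_downtime(outages) / period)
```

A manual correction is then just another `Outage` with `source="manual"`, which keeps the hand-edited data distinguishable in reports.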
Given a diverse set of platforms (Windows, Unix, AIX, DB2), connections, and applications, what is a good system for tracking this information in a way that is as automated as possible BUT still easy to edit manually where required, with appropriate reporting options?
I'm looking at Zabbix now... Just wondering if it's even remotely in the right area of what I want/need.
Personally, I like Nagios for all my monitoring. It supports SNMP checks through its plugins and can be extended with your own scripts and probing commands. I use it for network monitoring, application monitoring (where the application has a TCP port that should respond), and some custom queue monitoring (mail queue sizes, etc.), which I expose on the target machines through SNMP.
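As an illustration of the "your own scripts" part, here is a minimal sketch of a custom check for an application with a TCP port. It follows the Nagios plugin convention of one status line plus an exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function names and the 1-second warning threshold are my own assumptions, and in practice the stock `check_tcp` plugin may already cover this case.

```python
import socket
import sys
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, warn_seconds=1.0, timeout=10.0):
    """Try to open a TCP connection; return (exit_code, status_line)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as exc:
        # Refused, unreachable, timed out, or the name did not resolve.
        return CRITICAL, f"CRITICAL - {host}:{port} not responding ({exc})"
    if elapsed > warn_seconds:
        return WARNING, f"WARNING - {host}:{port} connected in {elapsed:.3f}s"
    return OK, f"OK - {host}:{port} connected in {elapsed:.3f}s"


def main(argv):
    """Parse 'script host port', print the status line, return the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print(f"UNKNOWN - usage: {argv[0]} <host> <port>")
        return UNKNOWN
    code, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return code


# To run this as a plugin, end the file with:
#   sys.exit(main(sys.argv))
```

Nagios reads the exit code to set the service state, so the WARNING branch is one way to surface lag before it becomes a full outage.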
You can also look at extending Nagios, Zabbix, SiteScope, etc. with probes that use image recognition on the screen... iMacros or Eggplant, maybe?