For a more comprehensive list of monitoring tools and their features, check out this Wikipedia page.
As the question states: what are the most commonly used tools for this task, and what are their strengths and weaknesses?
I've used Nagios in the past with success. It's very extensible (over 200 add-ons), relatively easy to use, and offers lots of reports. One negative is the initial setup.
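Much of that extensibility comes from the plugin convention: a check is just an executable that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Here is a minimal sketch of a disk-space check along those lines. The function names, the checked path and the 80/90 thresholds are assumptions made up for this example, not taken from any shipped plugin.

```python
"""Sketch of a Nagios-style check: one status line plus an exit code."""
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to (exit_code, status_line).

    The 80/90 thresholds are arbitrary defaults chosen for this example.
    """
    if used_pct >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif used_pct >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after the pipe is optional performance data: label=value;warn;crit
    perfdata = f"used_pct={used_pct:.1f}%;{warn:.0f};{crit:.0f}"
    return code, f"DISK {label} - {used_pct:.1f}% used | {perfdata}"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate(100.0 * usage.used / usage.total)
    print(line)
    return code

# As a standalone plugin the script would finish with:
#     import sys; sys.exit(main())
```

Keeping the threshold logic in `evaluate` separate from the disk lookup makes it easy to try out without touching a real filesystem.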
Cacti is a very good web-based frontend to RRDTool, providing very handy graphs and stats. RRDTool is the part that stores and graphs the data, while Cacti's poller gathers a wide range of technical data from multiple systems.
We're using the Cacti/RRDTool combination to monitor Unix and Windows systems. We get a lot of useful metrics, including load, CPU/RAM usage, HD space, users logged in, network traffic, running processes, and so on.
You will find more information on Cacti on the What is Cacti? page.
Personally, I love Munin, which is very easy to install and to write plugins for, as it has a very straightforward architecture. There are already plenty of plugins around for almost any purpose you could imagine, so you probably won't have to write one in the first place.
It also provides beautiful graphs and the option to configure (very basic) alerts.
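To give an idea of how little a plugin needs: Munin calls the plugin with the argument `config` to learn how to draw the graph, and with no argument to fetch the current values. The sketch below follows that pattern for the 1-minute load average. The graph title and the field name `load` are arbitrary choices for this example, and `os.getloadavg()` only exists on Unix-like systems.

```python
"""Sketch of a Munin-style plugin reporting the 1-minute load average."""
import os


def config_lines():
    """Graph metadata, printed when Munin calls the plugin with 'config'."""
    return [
        "graph_title Load average (example)",
        "graph_vlabel load",
        "graph_category system",
        "load.label load",
    ]


def value_lines(load=None):
    """Current reading, printed when the plugin is called with no argument."""
    if load is None:
        load = os.getloadavg()[0]  # Unix-only; first item is the 1-minute average
    return [f"load.value {load:.2f}"]


def run(argv):
    """Return the text the plugin should print for the given argv."""
    lines = config_lines() if argv[1:2] == ["config"] else value_lines()
    return "\n".join(lines)

# As an installed plugin the script would end with:
#     import sys; print(run(sys.argv))
```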
Zabbix. It's open source and reasonably simple to set up and customise. We have a lot of custom monitoring scripts that feed into the Zabbix server, which takes care of centralising that data, displaying it appropriately, sending notifications (email, IM, SMS, Twitter, etc.), and so forth.
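One common way for a custom script to feed a value into the server is the `zabbix_sender` command-line tool, which pushes a value to a trapper item. This is only a rough sketch of that approach, not our actual scripts. The server address, host name and item key in the usage note are hypothetical, and the item has to exist on the server as a trapper item first.

```python
"""Sketch: push one custom metric to a Zabbix server via the zabbix_sender CLI."""
import subprocess


def sender_command(server, host, key, value):
    """Build the zabbix_sender argument list for a single value.

    -z: Zabbix server address, -s: host name as registered in Zabbix,
    -k: item key, -o: value to send.
    """
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", str(value)]


def send(server, host, key, value):
    """Run zabbix_sender; raises CalledProcessError on a non-zero exit.

    Also raises FileNotFoundError if zabbix_sender is not installed.
    """
    return subprocess.run(
        sender_command(server, host, key, value),
        check=True,
        capture_output=True,
        text=True,
    )
```

A monitoring script would then call something like `send("zabbix.example.com", "web01", "app.queue_length", 42)` after taking its measurement.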
I have been doing rollouts of Spiceworks at our company, and we are finding it to be a great tool for monitoring not just servers but everything else on the network.
It does things like automatic inventory and custom monitoring that emails you when there is a problem (e.g. a printer is down to 10% ink, or a server's hard drive has 20% space left).
Its downside is probably the density of information per computer. Don't get me wrong, it has A LOT of data per machine, but for things like servers, where you might want a lot of stats, you might need to use another tool.
EDIT: Oh, did I mention its business model is based around it being free forever?
Smokeping not only checks the availability of various servers and services but also keeps track of their latency, while providing easy-to-use, nice-looking graphs that are quick to display.
A wide range of latency measurement plugins is available out of the box. If you know some Perl, it is easy to create your own for any exotic needs.
Large installations will benefit from the master/slave system for distributed measurement.
A highly configurable alerting system will help you notice issues before they start affecting users or evolve into a major outage.
Smokeping is free, open-source software written in Perl by Tobi Oetiker, the creator of MRTG and RRDtool.
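To illustrate the kind of numbers a latency graph is built from, here is a small sketch that summarises one round of probes into typical latency, spread and loss. It is not Smokeping's own code (which is Perl). The function name and the convention of using `None` for a lost probe are assumptions for this example.

```python
"""Sketch: summarise one round of latency probes."""
import statistics


def summarize(samples_ms):
    """Summarise a round of probes.

    samples_ms holds one round-trip time in milliseconds per probe,
    with None standing in for a probe that got no reply.
    """
    replies = sorted(s for s in samples_ms if s is not None)
    lost = len(samples_ms) - len(replies)
    return {
        "median_ms": statistics.median(replies) if replies else None,
        "min_ms": replies[0] if replies else None,
        "max_ms": replies[-1] if replies else None,
        "loss_pct": 100.0 * lost / len(samples_ms) if samples_ms else 0.0,
    }
```

The median is used rather than the mean so that a single slow reply does not distort the picture of typical latency.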
Zenoss Core is of some use. We have been using it for about a year for lightweight monitoring of servers, network switches and UPSs.
OpenNMS is used where I work to monitor more than a thousand Linux machines. We monitor the hardware of each machine and the applications running on them.
I've used:
Nagios is great since it's free and there are plenty of plugins for it. However, the UI and config are very difficult.
Its exact opposite in pros and cons, and also great, is Microsoft System Center Operations Manager (SCOM), which is not free and has fewer plugins, but its setup and config are brilliant and easy.
I must admit that if I were in a primarily Microsoft company, had very high reliability requirements (i.e. couldn't afford for monitoring to break), or had to think about getting developers to work with it, then SCOM would be my recommendation over Nagios.