I've got a nagios server set up for monitoring ~30 Windows servers. I want to add some trending charts. I've read that nagios graphing plugins are simple and that many people use separate, standalone charting/trending tools.
What are the restrictions of the nagios graphing plugins vs standalone products like ganglia/munin/cacti?
I'm interested in specific features and advantages that standalone packages offer and nagios graphing plugins don't.
I concur with lynxman. NAGIOS is for immediate qualitative data (is X OK or not?); munin is for historical quantitative data (how full is X now, and how full has it been this year?). All my NAGIOS installations, some of which monitor several hundred services, are linked to munin systems to do the quantitative monitoring.
Note also that munin has specific hooks for feeding data into NAGIOS. It understands the concept of WARNING and CRITICAL thresholds, and where notification (and a view on the NAGIOS "big board") is required it's very very easy to have a single munin variable inform the state of a single NAGIOS service.
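To make the threshold idea concrete, here is a minimal Python sketch of how one numeric reading can be turned into one NAGIOS service state. The exit codes 0/1/2/3 are the standard NAGIOS plugin convention; the helper name `state_for` and the example thresholds are my own illustration, not munin's actual code, and it assumes that higher values are worse.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def state_for(value, warn, crit):
    """Map a single reading to a Nagios state (hypothetical helper).

    Assumes "higher is worse", e.g. percent of disk used.
    """
    if value is None:
        return UNKNOWN  # no data collected
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# Example: disk at 85% full with warn=80 and crit=90 is a WARNING.
print(state_for(85, warn=80, crit=90))  # prints 1
```

A real check would print a one-line status message and exit with that code.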
The usual workflow is that no one looks at the munin graphs until NAGIOS alerts that a threshold has been breached, but then the munin graphs become invaluable for finding out whether something has been slowly ramping up over time, or this is an out-of-the-blue increase, or we have a weekly up-and-down cycle which is slowly increasing in amplitude, or what.
As lynxman says, the UNIX way is "one task, one tool". Making a toolchain of munin and NAGIOS works very well for me to provide quantitative and qualitative monitoring as well as notifications. It also has the distinct advantage of keeping the interfaces clean: when you look at NAGIOS, you see a simple view of how well things are working right now, with no historical data cluttering up the view; when you look at munin, you see historical information pertinent to the issue ready for your analysis, without "host is down" or "sshd won't talk to me" errors cluttering the view.
given that you already have a nagios installation, consider nagiosgraph or pnp4nagios.
nagiosgraph and pnp4nagios do a pretty nice job of plotting nagios performance data. nagiosgraph has a parameter-based approach to configuration, pnp4nagios has a template-based approach.
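for reference, the performance data these tools plot is the text a plugin prints after the '|' in its output, in the form label=value[UOM];warn;crit;min;max. here is a rough python sketch of reading that string. `parse_perfdata` is a made-up helper for illustration, not code from nagiosgraph or pnp4nagios, and it skips anything it cannot read as a plain number.

```python
import re
import shlex

# value followed by an optional unit of measurement, e.g. "40%" or "0.42"
VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)([A-Za-z%]*)$")


def parse_perfdata(text):
    """Parse a Nagios perfdata string into a dict keyed by label (sketch)."""
    metrics = {}
    # shlex handles labels quoted with single quotes, e.g. 'disk C'
    for token in shlex.split(text):
        label, _, rest = token.rpartition("=")
        fields = rest.split(";")
        fields += [""] * (5 - len(fields))  # warn/crit/min/max are optional
        match = VALUE_RE.match(fields[0])
        if not label or not match:
            continue  # skip values such as "U" (unknown)
        metrics[label] = {
            "value": float(match.group(1)),
            "unit": match.group(2),
            "warn": fields[1],  # kept as text: thresholds may be ranges
            "crit": fields[2],
            "min": fields[3],
            "max": fields[4],
        }
    return metrics


sample = "load1=0.42;5;10;0; 'disk C'=40%;80;90"
for name, data in parse_perfdata(sample).items():
    print(name, data["value"], data["unit"])
```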
slicing and dicing the data are pretty important, imho. for example, you can view all services on a single host, or view all hosts with a specific service, or view arbitrary collections of graphs for arbitrary hosts and services.
installation is not trivial, but not difficult. a lot depends on how much you want to customize things. for example, nagiosgraph is 'install.pl' or 'rpm -i nagiosgraph.rpm' or 'dpkg -i nagiosgraph.deb'. pnp4nagios is './configure; make; make install'.
n2rrd can do some of these things as well, but it is not as polished and requires more work to configure.
rrdtool has quirks wrt data storage, and any system will have sampling issues. rrdtool does some data smoothing by default, but you can capture (and graph) maximums and/or minimums in addition to averages if necessary.
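to show why keeping maximums and minimums matters, here is a simplified python sketch of consolidation. rrdtool does this with its AVERAGE, MIN and MAX consolidation functions; the `consolidate` function below is only an illustration of the idea, not rrdtool's actual algorithm, and the sample values are invented.

```python
def consolidate(samples, step):
    """Collapse every `step` samples into (average, minimum, maximum)."""
    rows = []
    for i in range(0, len(samples), step):
        bucket = samples[i:i + step]
        rows.append((sum(bucket) / len(bucket), min(bucket), max(bucket)))
    return rows


# 30 one-minute readings: a steady 10 with a two-minute burst of 100.
per_minute = [10] * 14 + [100] * 2 + [10] * 14

# One 30-minute row: the average hides the burst, the maximum keeps it.
print(consolidate(per_minute, 30))  # prints [(16.0, 10, 100)]
```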
every rrdtool-based approach suffers from data/graph staleness since the schema in each rrd file is static and most systems use the rrd filename to identify the data. data are typically never lost when a hostname or service name changes; the rrd files still exist on disk. but some user interfaces provide ways to see 'stale' rrd files, others require manual housekeeping via command line. on many installations this is only an issue when initially configuring the system, but in dynamic environments (e.g. monitoring virtual machines whose lifetime is only a few months) it can become tedious.
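if your user interface does not show stale files, the manual housekeeping can be scripted. a minimal python sketch, assuming that an rrd file which has not been modified for a while is no longer being updated; the function name, the directory path and the 7-day cutoff are all placeholders of mine.

```python
import time
from pathlib import Path


def stale_rrd_files(root, max_age_days=7, now=None):
    """Return .rrd files under `root` not modified in `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        path for path in Path(root).rglob("*.rrd")
        if path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Hypothetical location; use wherever your graphing tool keeps its files.
    for path in stale_rrd_files("/var/lib/rrd"):
        print(path)
```

this only lists candidates; deleting or archiving them is left as a separate, deliberate step.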
one final note. there are actually two parts to trending: data collection and data display. if you go with a standalone graphing system rather than extending your existing nagios installation, then you might have to install additional components on your windows machines in order to collect the data.
Nagios graphing plugins, as you say, are very restricted. They offer a very basic rrdtool interface and the UI design is a bit counterintuitive. It's basically a hack on top of nagios; I tried one just for fun, but it broke several times without warning.
Going for a standalone product (especially munin or ganglia) gives you a wide range of features that nagios can't provide. As the unix mantra goes, it's better to be good at just one thing than to try to be good at many: nagios is amazing for monitoring, and munin/ganglia/cacti are amazing at graphing.
At Stack Overflow we use n2rrd, which is a Nagios plugin for graphing performance data. To an extent I would agree with lynxman that it does have a bit of a hackish feel.
However:
The rrd graphs are stored according to the server names, so if you change the name of something you sort of lose the data... You could always just rename the files or symlink them, though, and you won't lose the data.
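Here is a rough Python sketch of the symlink approach. It assumes the files are named `<host>_<service>.rrd` in a single directory, which is only a guess for illustration (check how your installation actually names them), and `link_renamed_host` is a made-up helper name.

```python
import os
from pathlib import Path


def link_renamed_host(rrd_dir, old_name, new_name):
    """Symlink <new_name>_*.rrd to the existing <old_name>_*.rrd files.

    Assumes a hypothetical "<host>_<service>.rrd" naming scheme.
    Returns the list of links that were created.
    """
    created = []
    for old_file in sorted(Path(rrd_dir).glob(f"{old_name}_*.rrd")):
        new_file = old_file.with_name(new_name + old_file.name[len(old_name):])
        if os.path.lexists(new_file):
            continue  # never overwrite data already written under the new name
        new_file.symlink_to(old_file.name)  # relative link, same directory
        created.append(new_file)
    return created
```

Run it before the first update arrives under the new name; otherwise a fresh file already exists there and the sketch leaves it alone.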
I have some examples of these graphs up at my recent Server Fault Blog post, "Some Tips for Better RRD Graphs". Also, the n2rrd page includes both the cacti demo and rrd2graph.
I think the bottom line is that going the Nagios route might be lacking in a feature or two but is pretty complete if you don't mind getting your hands dirty with the details of writing rrd templates yourself. It is probably going to take more of your time, but it will encourage you to develop more expertise in rrd.
I demand accurate data and rrd's data display is not accurate - it's normalized! For most users this is fine because they're not using very accurate data to begin with. They're using data whose sample rates are often at a minute or more and that isn't going to give you a very accurate description of what is happening. This also means that if you have a spike in your data somewhere you may never see it.
Consider this - say your Gb network is humming along at about 10MB/sec and all of a sudden there is a spike of 100MB/sec for a couple of minutes. Also note that if it was only a 30 second spike you might not even see it at sampling rates of a few minutes. If you look at the data for the day, that 'spike' may only show up as 15MB/sec, though the actual value depends on a number of other factors as well. There's also a good chance you'll assume your network is happy when it isn't!
What's even more frustrating for me is that the data is normalized to the physical width of the graph and the range of the x-axis. What this means is that spike I mentioned you didn't see? If you zoom in it magically appears! I'll stick to gnuplot - the graphs may not be as pretty but they're rock solid and gnuplot never modifies the data before displaying it.
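Here's a small Python sketch of that zoom effect, using the numbers from my example. It is a deliberate simplification and not rrdtool's real algorithm: `render_columns` is a name I made up, and it just averages the samples that land in each pixel column of a 48-column graph.

```python
def render_columns(samples, start, end, width):
    """Average samples[start:end] into `width` columns of a fixed-width graph."""
    window = samples[start:end]
    per_column = len(window) / width
    columns = []
    for col in range(width):
        lo = int(col * per_column)
        hi = max(lo + 1, int((col + 1) * per_column))
        chunk = window[lo:hi]
        columns.append(sum(chunk) / len(chunk))
    return columns


day = [10] * 1440          # one reading per minute at 10 MB/sec
day[720:722] = [100, 100]  # a two-minute 100 MB/sec spike at noon

# Whole day on a 48-column graph: 30 minutes per column.
print(max(render_columns(day, 0, 1440, 48)))   # prints 16.0

# Zoomed in to 48 minutes around noon: 1 minute per column.
print(max(render_columns(day, 700, 748, 48)))  # prints 100.0
```

Same data, same graph width, and the peak goes from 16 to 100 just by changing the time range.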
-mark
I find using pnp4nagios works quite well for graphing. It supports zoom as well. It is not the easiest to implement, but nothing with nagios ever is.