Good Day.
As you all know, there is a lot of monitoring software out there (open source or not): Nagios, Hyperic, OpenNMS, Tivoli, Microsoft...
As you also know, the best way to extract information from a Tomcat server is through a correctly secured JMX bean.
Well, my problem is twofold:
First: The default polling interval of the monitoring servers is far too long. A five-minute poll cannot detect a problem or a usage peak, or even a restart.
The solution is as simple as reducing the polling interval to 5 seconds (or less).
This will probably saturate the server, but that's easy to solve (more iron).
Anyway, this polling interval leads us to the second problem.
Second:
If I poll the JMX counters at a 5-second interval, and I collect about 5 counters for each Tomcat instance, and we have more than 15 servers...
That's 375 samples every 5 seconds, or 4500 samples per minute. Yes, the database will grow very fast.
Why so many samples:
I don't really need each sample, only their average over a given time period (10 minutes). But if I poll the counters at a 10-minute interval, I will lose a lot of information that I need for detecting problems, for monthly usage graphs...
Question:
So, the question is simple: is there any software that can poll frequently but store only the average of the samples over a given period?
Manual workaround:
Of course, there are "manual" options for this problem... maybe daily tasks on the DB that calculate the average of the table over a period...
Or a Perl-based script that does the iteration... and then stores the result in the DB.
But before programming that, I'm searching the net and asking here.
Thanks in advance
First, I think you misunderstand the purpose of a monitoring system. Detecting every small peak is overkill in most situations, and for detecting restarts of your server, reading logfiles and maybe logging/graphing uptime information is a better way to go.
That said, many graphing systems like MRTG, Munin or Cacti use the excellent rrdtool by Tobi Oetiker to do exactly what you want: it stores data at, e.g., one-minute intervals for a day (1440 values), 5-minute averages for the last week, hourly averages for a month, etc. After a day, it will overwrite the oldest values in the daily database (hence the name: Round Robin Database tool).
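To make the round-robin idea concrete, here is a minimal Java sketch of a fixed-size archive that consolidates every N raw samples into one average and overwrites its oldest slot when full. It is only an illustration of the principle, not rrdtool's actual implementation; the class name RoundRobinArchive and its methods are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a round-robin archive with AVERAGE consolidation.
class RoundRobinArchive {
    private final double[] slots;     // fixed-size storage, never grows
    private final int samplesPerSlot; // raw samples consolidated into one stored value
    private int next;                 // index of the slot that will be written next
    private int filled;               // number of slots that currently hold data
    private double sum;               // running sum of the pending raw samples
    private int pending;              // raw samples seen since the last consolidation

    RoundRobinArchive(int slotCount, int samplesPerSlot) {
        this.slots = new double[slotCount];
        this.samplesPerSlot = samplesPerSlot;
    }

    // Accept one raw sample; store an average once enough samples have arrived.
    void add(double sample) {
        sum += sample;
        pending++;
        if (pending == samplesPerSlot) {
            slots[next] = sum / samplesPerSlot;        // overwrites the oldest value when full
            next = (next + 1) % slots.length;
            filled = Math.min(filled + 1, slots.length);
            sum = 0;
            pending = 0;
        }
    }

    // Stored averages, oldest first.
    List<Double> values() {
        List<Double> out = new ArrayList<>();
        int start = (next - filled + slots.length) % slots.length;
        for (int i = 0; i < filled; i++) {
            out.add(slots[(start + i) % slots.length]);
        }
        return out;
    }
}
```

With 5-second polling, something like new RoundRobinArchive(144, 120) would keep one day of 10-minute averages in 144 slots, no matter how long the poller runs.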
I completely agree with SvenW; nonetheless, I think you can do that with Zabbix (other monitoring systems may be able to as well). Setting collection intervals as small as 5 seconds seems feasible; the Zabbix housekeeper daemon process will then compute the trends based on the retention parameters you have set on the monitored item.
You might consider having a look at jmx4perl, which comes with a very powerful Nagios plugin, check_jmx4perl. For your use case, the history mode of the jmx4perl agent might be especially interesting. It keeps a configurable number of recently queried values in the agent's memory and returns them on each request. With this data, an average can easily be calculated without the need for client-side storage.
Currently check_jmx4perl uses this history mode to monitor increase rates (e.g. how fast memory gets allocated); for calculating averages, there's nothing out of the box yet. But this would be a nice addition to check_jmx4perl, so I will consider it for one of the next releases. Nevertheless, you need to adjust your Nagios polling interval. With the help of so-called bulk requests you can fetch the values of all timers for a certain server at once, though.
It is still questionable whether a 5s polling interval makes sense. For your use case, it would be much better to install an MBean (maybe within a dedicated servlet) with internal scheduling (a thread) that queries the timers internally at such a high frequency and exposes only the average as a single JMX attribute, which can then be queried. That shouldn't be that hard to code.
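A rough Java sketch of such an averaging MBean could look like the following. Everything here is an assumption made for illustration: the names AveragingJmxSketch and WindowAverage, the ObjectName, the 5-second and 10-minute periods, and the use of JVM heap usage as a stand-in for the real Tomcat counters. It samples internally at a high frequency and publishes only the average of the last completed window.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.DoubleSupplier;
import javax.management.ObjectName;

class AveragingJmxSketch {

    // Management interface; for a standard MBean it must be public and named <Class>MBean.
    public interface WindowAverageMBean {
        double getAverage();    // average of the last completed window
        long getSampleCount();  // number of samples behind that average
    }

    public static class WindowAverage implements WindowAverageMBean {
        private double sum;
        private long count;
        private double lastAverage = Double.NaN; // NaN until the first window completes
        private long lastCount;

        // Called at high frequency by the internal scheduler.
        public synchronized void sample(double value) {
            sum += value;
            count++;
        }

        // Called once per window: publish the average and start a new window.
        public synchronized void rollOver() {
            lastAverage = (count == 0) ? Double.NaN : sum / count;
            lastCount = count;
            sum = 0;
            count = 0;
        }

        @Override public synchronized double getAverage() { return lastAverage; }
        @Override public synchronized long getSampleCount() { return lastCount; }
    }

    public static void main(String[] args) throws Exception {
        WindowAverage avg = new WindowAverage();

        // Hypothetical data source: heap usage stands in for the real counter.
        DoubleSupplier source =
            () -> ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();

        // Expose the bean under an example ObjectName (assumed, pick your own).
        ManagementFactory.getPlatformMBeanServer()
            .registerMBean(avg, new ObjectName("example.monitoring:type=WindowAverage"));

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> avg.sample(source.getAsDouble()), 0, 5, TimeUnit.SECONDS);
        timer.scheduleAtFixedRate(avg::rollOver, 10, 10, TimeUnit.MINUTES);
    }
}
```

The monitoring server can then poll the Average attribute every 10 minutes and store a single value per window, while the 5-second sampling stays inside the JVM.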