Suppose I wanted to monitor 1,000 hosts. For each host, there are 100 or more variables I want to monitor: ping, disk I/O and latency, free RAM, swap, and so on. That's 100,000 data points every 5-10 minutes, stored for 5 years.
What system scales this large?
What if I had 10x the number of hosts? What would you select then?
You'll need to answer a few more questions before we can really give you a suggestion. For starters, do you want to store raw data for 5 years, or is rolled-up data good enough? This matters more than you might think, and this feature alone may determine what your options are.
When you're talking about a 5 year time span, you're almost always talking about trending information that's going to be rolled up, losing precision over time. If you don't roll up the data, you're dealing with a monstrous volume, and very few systems (both software and hardware) will be able to handle it.
Luckily, that's why RRDtool and Round Robin Databases (RRDs) were invented. If you don't recognize the name, that's okay: if you're looking at open source tools, you'll see practically everything built on top of it. Almost any open source program that trends data over time and gives you pretty graphs is probably using RRDtool under the hood. RRDtool creates fixed-size databases that automatically roll up data, storing it at a fixed precision for specified time windows. For example, you might have it store 30 days of data at 5 minute precision, 90 days at 30 minute precision, 180 days at 1 hour precision, 365 days at 1 day precision, 3 years at 1 week precision, and 10 years at 1 month precision. It's all configurable, and every time you add a new data point, it recalculates the rolled-up data.
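To make that concrete, here's a minimal sketch of creating an RRD with exactly those retention tiers, assuming the rrdtool CLI is installed and using a hypothetical "load" gauge as the only data source:

```python
# A minimal sketch: create an RRD with the retention tiers described above.
# Assumes the rrdtool CLI is installed; "load" is a hypothetical data source.
import subprocess

STEP = 300  # base resolution: one sample every 5 minutes

# (consolidation steps, rows kept) for each retention tier
TIERS = [
    (1,    8640),   # 30 days at 5 min  (30 * 24 * 12 rows)
    (6,    4320),   # 90 days at 30 min
    (12,   4320),   # 180 days at 1 hour
    (288,  365),    # 365 days at 1 day
    (2016, 156),    # ~3 years at 1 week
    (8640, 120),    # ~10 years at 1 month (30-day months)
]

cmd = ["rrdtool", "create", "host.rrd", "--step", str(STEP),
       "DS:load:GAUGE:600:U:U"]            # heartbeat = 2x the step
cmd += [f"RRA:AVERAGE:0.5:{steps}:{rows}" for steps, rows in TIERS]

subprocess.run(cmd, check=True)
```

Each RRA line tells RRDtool how many base steps to average together and how many of those rows to keep, which is all a retention tier really is.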
Now, once you figure out for sure what your data retention requirements are, you need to figure out how you're planning to monitor the systems. If there's a wide variety of devices, especially a lot of network devices, SNMP is the standard. There are also plenty of devices that can't be monitored by anything other than SNMP (UPSes, generators, printers, etc.), so at least some level of SNMP support is important. If you have a lot of servers, you may want to go with an agent-based system where you install a monitoring agent on each device to be monitored. This will often give you more detailed information, but it significantly increases the management overhead.
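For the SNMP route, a poller can be as simple as a wrapper around net-snmp's snmpget. This is just a sketch, assuming net-snmp is installed; the community string and hostname are placeholders, and sysUpTime.0 stands in for whatever OIDs you actually care about:

```python
# A rough sketch of SNMP polling via net-snmp's snmpget, assuming net-snmp
# is installed. The community string and hostname are placeholders, and
# sysUpTime.0 stands in for whatever OIDs you actually need.
import subprocess

def snmp_get(host: str, oid: str, community: str = "public") -> str:
    """Fetch a single OID from a device and return just the value."""
    result = subprocess.run(
        ["snmpget", "-v", "2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(snmp_get("switch01.example.com", "1.3.6.1.2.1.1.3.0"))  # sysUpTime.0
```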
Next, you need to know your projected growth beyond "what handles X and what handles 10 times X". Even at the listed numbers, 1k hosts is a hugely different beast than 10k hosts. Lots of systems will handle 1k, but as you approach 10k you'll often need a distributed system to share the load. Also, you mention 100 variables per system that you want to monitor... are you sure about that? Not many monitoring systems support monitoring that many variables. That's a lot of information to be pulling from each device.
Finally, you need to consider much more than the monitoring system itself when you start approaching large scales. Pulling back 100 variables from 1k (or 10k) devices at a 5 minute resolution is going to require some pretty serious bandwidth. Be prepared for that, or you could find that your monitoring system is negatively impacting your network. This is particularly important if your systems are spread across multiple sites and you're crossing WAN links.
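For a rough sense of that load, here's the back-of-envelope math; the bytes-per-variable figure is an assumption for illustration, since real SNMP overhead varies with version and how well you batch OIDs per request:

```python
# Back-of-envelope numbers for the polling load described above. The
# bytes-per-variable figure is an assumption for illustration; real SNMP
# overhead varies with version and how well you batch OIDs per request.
hosts = 10_000
variables = 100
interval_s = 300          # 5 minute polling cycle
bytes_per_var = 200       # assumed on-the-wire cost per variable polled

samples_per_sec = hosts * variables / interval_s
bandwidth_mbps = samples_per_sec * bytes_per_var * 8 / 1e6

print(f"{samples_per_sec:,.0f} samples/sec sustained")      # ~3,333/sec
print(f"~{bandwidth_mbps:.1f} Mbit/s of polling traffic")   # ~5.3 Mbit/s
```

A few Mbit/s of sustained traffic is nothing on a LAN, but it's very noticeable on a WAN link shared with production traffic.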
There are a few open source systems that claim to be competitive at this scale of network monitoring, but not many. Nagios has been around for a long time and has been known to monitor 1k+ systems. Zenoss offers both an open source core product and a commercially supported product, and is attempting to challenge some of the "big hitters". Zabbix is fully open source, with the company backing it offering support.
When it comes to the large companies with thousands of devices and systems that need monitoring, though, the biggest players are CA's Spectrum/eHealth/Unicenter, IBM's Tivoli suite, and HP's OpenView. Each of these can handle huge scale, but they also come with huge price tags.
Note: My day job is the implementation and maintenance of network monitoring tools; we monitor over 5k network devices and 8k servers. Finding tools that work well at these scales is hard.
Nagios seems to be the default answer to these types of questions, and there are some installations at this scale using it.
On top of scaling well, it's flexible and easy to customize.
I'd say either Nagios or Zenoss:
Nagios http://www.nagios.org
Zenoss http://www.zenoss.com
Either one should be able to handle your requirements if configured properly.
At work, we use Opsview for this. It's built on Nagios, and handles recording data and whatnot. Monitoring requests are handled by a cluster of supervisory nodes, which report to a master. This can be handy if you have multiple data centers, but we mainly use it for redundancy and load balancing. I thought it used RRDtool, but it seems to use MySQL.
However, your request is a bit ridiculous. First off, 5 years of data may exceed the lifetime of a given host. Secondly, you didn't mention anything about querying that data. Do you just want aggregate numbers to estimate provisioning? Do you discard the data when a host fails? Do you even want to drill down into specific hosts? Storing all samples for five years will be a bear to process, let alone store.
Next, the amount of data you're storing is on the order of 80 MB per host per year, assuming you can actually fit 100 samples into 800 bytes (RRD needs around 8 bytes per sample). The entire system would consume 80 GB per year and would be a pain to query. At 10x that, you'd need Google's help. If you do something stupid like record the output of "ps", woe unto you. Seriously Tom, just tell us what Google invented this time, or get your damn company to write what you need on MapReduce and BigTable. At Google's scale, seriously re-engineering formats like RRD to better fit the redundancy in your data might be the best plan.
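For what it's worth, the arithmetic roughly checks out; here's the back-of-envelope version, assuming one sample per variable every 5 minutes:

```python
# Sanity check on the storage arithmetic above, assuming one sample per
# variable every 5 minutes and RRD's ~8 bytes per sample.
bytes_per_sample = 8
variables = 100
samples_per_year = 365 * 24 * 12          # 5 minute intervals in a year

per_host_year = variables * bytes_per_sample * samples_per_year
print(f"~{per_host_year / 1e6:.0f} MB per host per year")         # ~84 MB
print(f"~{1000 * per_host_year / 1e9:.0f} GB/year for 1k hosts")  # ~84 GB
```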
Check out this Slashdot thread for tons of suggestions ;)
http://ask.slashdot.org/story/09/07/08/210241/What-Would-You-Want-In-a-Large-Scale-Monitoring-System
We use Zabbix to monitor 150 hosts and 10 servers.
It should handle your needs.
I would also suggest Nagios, but I'm really not sure if it will store data for 5 years, as I've never run it on one machine for that long. Other than that, I see no reason not to use it.