I just started at a new company, and one of my first assignments is to look for alternatives to their in-house monitoring system.
Their current solution is a .NET application that checks various devices over the WAN (they are an IT consulting firm that provides 24/7 support/"maintenance"). Devices range from routers, switches and printers to MS servers and services.
After reading countless posts on the site and googling extensively, it seems the consensus is that some sort of Nagios/Munin mix is the way to go.
Which brings me to my question(s):
A) Is it possible to have a Nagios server running locally at the company and monitor various external sites over the WAN? (They don't want a local Nagios server at each site, as most sites are relatively small (10-25 hosts) and the number of sites is quite large (75-100).)
B) If so, how would the agents contact the Nagios backend? Through SSH? HTTP?
C) Aside from the fact that it would be susceptible to WAN-link failures, what would the immediate drawbacks of such a solution be?
Any feedback is appreciated, and I apologize in advance for any misconceptions, as I'm quite new to the industry.
Monitoring over a WAN is possible, but generally not ideal. If the WAN link goes down or blips, all checks will fail and you are blind to what is happening at the remote location. The added latency also makes the results less useful as a measure of performance as seen from the LAN. That said, if you go this way, you probably want to set up dependencies so you don't get flooded with alerts when the WAN link has issues.
The most common setup I have seen for communication between a monitoring system and the services it monitors is a site-to-site VPN tunnel. Communication is then no different from the local network. Also, Nagios is often pull-based (although it doesn't have to be), so Nagios contacts the services and servers it monitors, not the other way around.
Lastly, a more ideal solution is a distributed monitoring setup. With Nagios, one option is described at http://nagios.sourceforge.net/docs/3_0/distributed.html .
It kind of depends on what you are going to be monitoring over the WAN. For the most part, if you are only doing ping checks, service checks, disk checks, etc. and stick to Nagios's default 5-minute check interval, I can't see it causing you an issue.
Again, what the checks talk over depends on what you are checking. If you are checking Windows hosts, you can just use WMI queries and not even need an agent running on the box.
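To put a rough number on that, here is a back-of-the-envelope estimate in Python using the upper bounds from the question (100 sites, 25 hosts each) and the 5-minute interval. The figure of 4 service checks per host is purely an assumption for illustration; substitute your own.

```python
def checks_per_second(sites: int, hosts_per_site: int,
                      services_per_host: int, interval_s: int) -> float:
    """Average number of checks the central server must start per second."""
    total_checks = sites * hosts_per_site * services_per_host
    return total_checks / interval_s


# 100 sites and 25 hosts per site come from the question.
# 4 services per host is an assumed value, not a measured one.
rate = checks_per_second(sites=100, hosts_per_site=25,
                         services_per_host=4, interval_s=5 * 60)
print(f"{rate:.1f} checks/s")  # prints "33.3 checks/s"
```

Whether that rate is comfortable depends on the server, the plugins used and the WAN links, so treat it as a sizing starting point only.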
This is certainly possible, via several different methods.
If the "distributed setup" is out of the question, then you need to do at least one of the following:
I would suggest #3, because it requires the least firewall hole-poking, and also simplifies configuration. It's sort of a slimmed-down version of the distributed setup, in that it doesn't require a full Nagios instance at each site.
To do this, you can set up NRPE (or use check_by_ssh) and have this "proxy" run all of the other checks against the other hosts on the network. This has the added benefit of the performance data that you get back being relative to the proxy, so it won't be affected by WAN lag.
Also, you can then use parent/child setups to make every host at the remote site a child of its proxy, to reduce false-positive notifications. You might also want to make all of the services dependent on a check_nrpe (or check_ssh) service of the proxy. See the network reachability docs for more info.
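The effect of the parent/child relationship can be sketched in a few lines of Python. This is not Nagios code, just a simplified model of the reachability idea: a failing host whose parents are all failing is treated as UNREACHABLE rather than DOWN, so only the proxy needs to raise an alert. The host names and the `classify` helper are hypothetical, and the parent map is assumed to contain no cycles.

```python
from typing import Dict, List, Set


def classify(host: str, parents: Dict[str, List[str]], failing: Set[str]) -> str:
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for a single host."""
    if host not in failing:
        return "UP"
    host_parents = parents.get(host, [])
    # No parents: the host is checked directly, so a failure means DOWN.
    if not host_parents:
        return "DOWN"
    # If any parent is UP, the path to the host works and the host itself is DOWN.
    if any(classify(p, parents, failing) == "UP" for p in host_parents):
        return "DOWN"
    # Every parent is failing too, so the real problem is further upstream.
    return "UNREACHABLE"


# Hypothetical remote site: every host is a child of the site's proxy.
parents = {
    "site1-proxy": [],
    "site1-printer": ["site1-proxy"],
    "site1-switch": ["site1-proxy"],
}

# WAN link down: every check against the site fails.
wan_down = {"site1-proxy", "site1-printer", "site1-switch"}
for name in parents:
    print(name, classify(name, parents, wan_down))
```

With the WAN link down, only `site1-proxy` comes out as DOWN and the other two hosts as UNREACHABLE. If just the printer fails while the proxy is fine, the printer is DOWN and would be alerted on as usual.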
No matter which method you go with, it's very important that you adjust default timeouts appropriately, to account for the added lag of going across the WAN links.
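As an illustration of why the timeout matters, here is a minimal Python sketch of a pull-style TCP check with an explicit timeout. It is not one of the stock Nagios plugins; the `check_tcp` name and the 10-second default are assumptions for the example. The return values follow the standard plugin exit codes (0 for OK, 2 for CRITICAL).

```python
import socket

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_tcp(host: str, port: int, timeout_s: float = 10.0) -> int:
    """Return OK if a TCP connection succeeds within timeout_s, else CRITICAL."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return OK
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts alike.
        return CRITICAL


# Example call with a hypothetical address and a more generous WAN timeout:
# status = check_tcp("192.0.2.10", 22, timeout_s=30.0)
```

A timeout tuned for a LAN will report CRITICAL for a healthy host whenever the WAN link is merely slow, which is why the value should be raised to match the latency you actually see across the links.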