I consult in a small business environment where I have two Hyper-V hosts (with fewer than 10 VMs) and a couple of other servers.
I recently had an issue where one of the Hyper-V hosts had a CPU problem and went down, taking most of my non-critical VMs with it, along with the free software I use for network and system availability monitoring. Because of this, and because the iDRAC also locked up, I did not get any alerts about the crash.
So I am wondering how I can (cheaply) put a redundant availability-monitoring system in place. Is it as simple as running Nagios or Zenoss (or whatever) on two different Hyper-V hosts?
Running more than one copy of Nagios/Zenoss/etc. just seems like it could be expensive and add a lot of overhead.
Thoughts?
Yes.
Redundancy means having more than one of each critical component. Monitoring is a critical component, so you need more than one monitoring host. To solve the immediate problem in your question, you just need to set up a second canary on a separate host.
(Note that this doesn't have to be a full duplicate of your monitoring environment if continuity of monitoring isn't critical to you: it could just be a Nagios check or similar that confirms the main monitoring host is up and running.)
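As a rough illustration of that lightweight option, here is a minimal Nagios-style check in Python that could run on the second Hyper-V host (or any other box) and simply tries to open a TCP connection to the primary monitoring server. It is a sketch, not a finished plugin: the host, the port and the 2-second warning threshold are assumptions you would adjust. It follows the usual Nagios plugin convention (exit 0 = OK, 1 = WARNING, 2 = CRITICAL), so a small second Nagios instance, or anything else that understands those codes and can send its own alerts, could schedule it.

```python
#!/usr/bin/env python3
"""Minimal Nagios-style canary: is the primary monitoring host reachable?

Assumption: the primary monitoring server listens on some TCP port
(for example its web UI). Host and port are supplied on the command line.
"""
import socket
import sys
import time

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0, warn_after=2.0):
    """Attempt a TCP connect and return (exit_code, status_line)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except socket.timeout:
        return CRITICAL, f"CRITICAL - {host}:{port} timed out after {timeout:.1f}s"
    except OSError as exc:
        # Connection refused, no route to host, DNS failure, etc.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
    if elapsed > warn_after:
        return WARNING, f"WARNING - {host}:{port} slow to respond ({elapsed:.2f}s)"
    return OK, f"OK - {host}:{port} answered in {elapsed:.2f}s"


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Usage: check_primary.py HOST PORT
    # e.g.   check_primary.py monitor01.example.local 443  (hypothetical host)
    code, message = check_tcp(sys.argv[1], int(sys.argv[2]))
    print(message)
    sys.exit(code)
```

A TCP connect only proves the host is up and something is listening, not that checks are actually being executed, so treat this as a canary rather than a full health check.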
More complex solutions include monitoring failover, and you may also want to consider an external monitor (many services, such as Pingdom, offer this) for your more critical customer-facing systems. At first glance, though, both seem like overkill for what you want: you just need to be told when your monitoring system has died.
See How do you monitor a monitoring server?
In a nutshell, use an external monitoring system to monitor your local monitoring, provided you can expose it to the web. That can be as simple as hosting a website on your internal monitoring server and having a good third-party service monitor it.
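One way to make that "website" say something useful is a tiny status page that reports whether the monitoring tool is still doing its job, so the external service can tell apart "host down" (no answer at all) and "host up but monitoring stalled" (HTTP 503). The Python sketch below assumes the monitoring tool regularly rewrites some file (for instance, Nagios periodically rewrites its status.dat, though its location depends on the install), or that you touch one from a scheduled check. The file path, the 5-minute staleness limit and the port are all hypothetical.

```python
#!/usr/bin/env python3
"""Tiny health endpoint for an external uptime service to poll.

Returns HTTP 200 while a heartbeat file is fresh and 503 once it is stale
or missing. HEARTBEAT_FILE and MAX_AGE_SECONDS are hypothetical values.
"""
import os
import sys
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

HEARTBEAT_FILE = "/var/lib/monitoring/heartbeat"  # hypothetical path
MAX_AGE_SECONDS = 300  # treat monitoring as dead after 5 minutes of silence


def heartbeat_status(path, max_age, now=None):
    """Return (http_status, body) based on the heartbeat file's age."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return 503, "STALE: heartbeat file missing\n"
    if age > max_age:
        return 503, f"STALE: last heartbeat {age:.0f}s ago\n"
    return 200, f"OK: last heartbeat {age:.0f}s ago\n"


class HealthHandler(BaseHTTPRequestHandler):
    """Answers every GET with the current heartbeat status."""

    def do_GET(self):
        status, body = heartbeat_status(HEARTBEAT_FILE, MAX_AGE_SECONDS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: health_endpoint.py PORT   (e.g. 8080, a hypothetical choice)
    port = int(sys.argv[1])
    # Binds on all interfaces; expose only this one port to the internet.
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

The external service then only needs a plain HTTP check that alerts on anything other than a 200, and most uptime services support that.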
Spiceworks is a free monitoring tool that can help you monitor your non-critical VMs easily.
Why not set up something like Pingdom to monitor the monitoring host?