I've installed Sun Grid Engine on 10 nodes plus one virtual master host.
Now I have to monitor all the resources before launching it into production, but I don't know the best way to do that. I've tried xml-qstat, but it seems unstable.
Any tips or suggestions?
Does anyone have experience with this?
Thanks.
You could use Ganglia. We use Ganglia with thousands of nodes at the Holland Computing Center, and for the most part it works fairly well, especially if you are looking for historical graphs. For active monitoring, we use Nagios.
If I understand you correctly, you need to monitor a bunch of grid servers. What kind of monitoring do you have in mind? Perhaps something like Nagios with some additional scripting could fit your needs?
There is an example over here.
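Separately from the linked example, here is a rough sketch of what "Nagios with some additional scripting" might look like for Grid Engine. It assumes that plain `qhost` prints one row per execution host with the load average in the fourth column, and a "-" there when the host is not reporting. Check that against your own output. The thresholds and the function names (`parse_qhost`, `evaluate`) are made up for illustration. The exit codes 0/1/2/3 are the standard Nagios plugin convention for OK/WARNING/CRITICAL/UNKNOWN.

```python
#!/usr/bin/env python3
"""Nagios-style check for SGE execution host load (illustrative sketch)."""
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_qhost(text):
    """Map hostname -> load average (None when qhost prints '-').

    Assumes the column order HOSTNAME ARCH NCPU LOAD ... in plain qhost output.
    """
    loads = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and the 'global' pseudo-host.
        if len(fields) < 4 or fields[0] in ("HOSTNAME", "global"):
            continue
        try:
            loads[fields[0]] = float(fields[3])
        except ValueError:
            loads[fields[0]] = None  # host is not reporting a load value
    return loads


def evaluate(loads, warn=8.0, crit=16.0):
    """Return (exit_code, message); warn/crit are placeholder thresholds."""
    if not loads:
        return UNKNOWN, "UNKNOWN - no execution hosts found in qhost output"
    silent = sorted(host for host, load in loads.items() if load is None)
    if silent:
        return CRITICAL, "CRITICAL - no load report from: " + ", ".join(silent)
    host, load = max(loads.items(), key=lambda item: item[1])
    if load >= crit:
        return CRITICAL, f"CRITICAL - {host} load {load:.2f}"
    if load >= warn:
        return WARNING, f"WARNING - {host} load {load:.2f}"
    return OK, f"OK - {len(loads)} hosts reporting, max load {load:.2f} on {host}"


def main():
    """Run qhost, print one status line, and return the Nagios exit code."""
    try:
        result = subprocess.run(
            ["qhost"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"UNKNOWN - could not run qhost: {exc}")
        return UNKNOWN
    code, message = evaluate(parse_qhost(result.stdout))
    print(message)
    return code

# To use this as a plugin, end the file with: sys.exit(main())
```

Parsing is kept separate from the `qhost` call so you can test the logic against saved output before pointing Nagios at it. The script has to run somewhere the SGE environment is set up, for example on the master host.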
Just for the record, Munin (http://munin-monitoring.org/) is also very nice.
It sounds like you're more interested in metrics than in uptime or availability. Circonus (http://circonus.com/) is a good fit here. You can correlate virtually any metrics, which can be imported via the Resmon XML DTD.