I have a simple four-node Oracle VM environment: a management server running in VMware, an NFS server for shared storage, and two Oracle VM servers running the actual hypervisor.
The node running the pool master service will suddenly reboot for no obvious reason. I'm fairly sure it's a software issue, possibly a cluster watchdog of some sort. Just to be clear, it's the VM server/hypervisor that reboots, not the guest machines.
Has anyone seen similar issues, or does anyone have suggestions as to where I should start looking for the root cause?
I don't see anything suspicious in the /var/log/ovs*/ logs. Is there any other place I should look?
The documentation from Oracle leaves a little something to be desired.
I'm not sure whether you have the graphs that come with the VM Management. If you do, they give a decent amount of insight into what the memory, CPU and disks are doing. Perhaps there is some correlation with the reboots? From there you can start looking at top and ps to see exactly what is running, and what resources are in use, when the server bounces.
Also, can you put the servers into debug mode? Do they support that?
I hope this helps get you started at the very least.
It turns out the nodes were not communicating correctly, because each node's hostname was listed on the loopback address in /etc/hosts. The cluster services would then silently force a reboot to protect the shared storage.
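For anyone who wants to check their own nodes for the same /etc/hosts problem, here is a minimal Python sketch. The function name loopback_aliases and the sample hostnames are made up for illustration; it only assumes the standard hosts file layout of an address followed by one or more names.

```python
import ipaddress


def loopback_aliases(hosts_text, hostname):
    """Return the loopback addresses in hosts_text that list hostname as a name.

    hosts_text is the content of a hosts file; hostname may be short or
    fully qualified. Both forms are compared by their short name.
    """
    short = hostname.split(".")[0]
    hits = []
    for line in hosts_text.splitlines():
        # Drop comments and blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # not a plain IP address, skip the line
        if ip.is_loopback and any(
            n == hostname or n.split(".")[0] == short for n in names
        ):
            hits.append(addr)
    return hits


# Hypothetical sample: the node name sits on the loopback line.
SAMPLE = """\
127.0.0.1   localhost ovs1.example.com ovs1
192.168.1.11 ovs2.example.com ovs2
"""
print(loopback_aliases(SAMPLE, "ovs1.example.com"))
```

On a real node you would pass in the content of /etc/hosts and the value of socket.gethostname(). An empty list means the hostname is not mapped to a loopback address.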
Are you using OCFS2? If so, increase the OCFS2 timeout in /etc/sysconfig/o2cb.conf
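To see which timeouts are currently set before changing anything, something like the sketch below could help. It is only an illustration: the function names are made up, the setting name O2CB_HEARTBEAT_THRESHOLD is how I recall it from the OCFS2 documentation, and the conversion assumes a 2-second heartbeat interval, so please verify both against your installed version.

```python
def o2cb_settings(config_text):
    """Collect O2CB_* KEY=VALUE pairs from the text of the o2cb config file."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.startswith("O2CB_"):
            settings[key.strip()] = value.strip().strip('"')
    return settings


def heartbeat_timeout_seconds(threshold):
    """Convert a heartbeat threshold to seconds.

    Assumption: one heartbeat every 2 seconds, so the node is fenced
    after (threshold - 1) * 2 seconds without a successful disk heartbeat.
    """
    return (threshold - 1) * 2


# Hypothetical sample content, not taken from a real system.
SAMPLE = """\
# O2CB_ENABLED: load the driver on boot
O2CB_ENABLED=true
O2CB_HEARTBEAT_THRESHOLD=31
"""
settings = o2cb_settings(SAMPLE)
print(settings)
print(heartbeat_timeout_seconds(int(settings["O2CB_HEARTBEAT_THRESHOLD"])))
```

Under that assumption, raising the threshold gives a slow NFS or network path more time before the node fences itself, at the cost of slower detection of a node that is really dead.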