I've been on hold for an hour waiting for VMware support and am betting Server Fault can beat them to the answer!
I am running ESX 4.0 and 4.1 on 6 HP blades, using Fibre Channel LUN storage. We did some FC network maintenance over the weekend and took down 2 of the 4 paths the ESX hosts have to the storage array (EMC CLARiiON). When this happened, all 6 ESX hosts shut down all of their VMs.
I saw messages like this in the events:
Path redundancy to storage device naa.600.... degraded. Path vmhba0:.... down. 2 remaining active paths Affected datastores: ....
This was expected. Then, 3 minutes later:
Guest OS shutdown for vm1
(this was by the vpxuser)
vm1 is powered off
(user "User")
Why would it do this if there were still good paths? I don't see any setting like this anywhere. Thanks!
As we figured out in the comments, this looked like, and turned out to be, the HA isolation response.
To add a bit more value to the answer: to avoid such mishaps, I recommend giving HA a second network path by configuring another service console (ESX) or management port (ESXi) on a path completely separate from your main network stack (vSwitch, pNICs, physical switch, UPS, power circuit).
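To illustrate why a second path helps, here is a rough Python sketch of the isolation decision. It is a simplified model, not VMware's actual algorithm: the class, field, and function names are hypothetical, and it assumes a host treats itself as isolated only when none of its management paths receives heartbeats or can reach its isolation address.

```python
from dataclasses import dataclass


@dataclass
class ManagementNetwork:
    """One HA heartbeat path (hypothetical model, not a VMware API)."""
    name: str
    heartbeats_received: bool         # heartbeats from other hosts still arrive on this path
    isolation_address_pingable: bool  # the isolation address answers on this path


def host_is_isolated(networks):
    """Assumption: the host is isolated only if every management path is dead."""
    return all(
        not n.heartbeats_received and not n.isolation_address_pingable
        for n in networks
    )


# A single management network that drops during maintenance.
single = [ManagementNetwork("vswif0", False, False)]

# The same outage, but a second, physically separate path stays up.
redundant = single + [ManagementNetwork("vswif1", True, True)]

print(host_is_isolated(single))     # True: the isolation response would fire
print(host_is_isolated(redundant))  # False: the VMs keep running
```

In this model, one surviving path is enough to keep the host from declaring itself isolated, which is the whole point of keeping the second path independent all the way down to the power circuit.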