We are working through our environment to disable every mechanism by which an HP server can reboot automatically. This is in response to a major incident in which our servers began flapping, causing a service outage for several million customers. The request from "on high" is that servers shut down and stay down until a human manually brings them back online once the "coast is clear" (we have several geographically redundant sites).
So far, I have identified the following possible causes:
- HP ASR (Automatic Server Recovery) automatically reboots a host. This can be disabled by switching off the ASR timer.
- iLO automatic power-on powers a host back on. I believe this is only triggered when power is removed and then re-applied to the host. This can be disabled in iLO.
However, I assume there is another setting that applies when one of the server's sensors crosses a critical threshold, for example when the ambient temperature sensor exceeds 40 °C. That should absolutely shut down the host, but I'm unsure where to disable the automatic reboot once the ambient temperature drops again. Or is this also controlled by HP ASR?
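To keep the audit repeatable across hosts, I'm thinking of tracking the three paths above with something like the rough sketch below. The class, the field names, and the host name are placeholders I made up for illustration; they are not real iLO, BIOS, or ASR setting names, and the values would have to be collected by hand or by whatever tooling we end up using.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostRestartSettings:
    """Per-host audit record. Field names are hypothetical placeholders,
    not actual iLO/BIOS/ASR setting names."""
    name: str
    asr_enabled: bool
    auto_power_on: bool
    # None means we have not yet found where this behaviour is configured.
    restart_after_thermal_shutdown: Optional[bool]


def restart_paths(host: HostRestartSettings) -> List[str]:
    """Return the automatic-restart paths that are still enabled or unverified."""
    findings = []
    if host.asr_enabled:
        findings.append("ASR timer still enabled")
    if host.auto_power_on:
        findings.append("automatic power-on still enabled")
    if host.restart_after_thermal_shutdown is None:
        findings.append("thermal-shutdown restart behaviour unverified")
    elif host.restart_after_thermal_shutdown:
        findings.append("restart after thermal shutdown still enabled")
    return findings


if __name__ == "__main__":
    # Example record with made-up values.
    host = HostRestartSettings(
        name="example-host",
        asr_enabled=False,
        auto_power_on=True,
        restart_after_thermal_shutdown=None,
    )
    for finding in restart_paths(host):
        print(f"{host.name}: {finding}")
```

A host only counts as "safe" when the returned list is empty, so an unknown thermal behaviour is flagged rather than silently treated as disabled.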
I just want to make sure I haven't missed any scenario that could bite us in the butt in production.
Any help would be appreciated.