We are currently working across our environment and disabling all ways that an HP server can automatically reboot. This is in response to a massive outage which caused our servers to begin flapping, causing a service outage for several million customers. The request from "on high" is to have the servers shut down, but not reboot until a human can manually guide them back online when the "coast is clear" (we have several geographically redundant sites).
So far, I have identified the following possible causes:
- HP ASR (Automatic Server Recovery) automatically reboots a host. This can be disabled by switching off the ASR timer.
- Automatic power-on in iLO powers a host back on. This can be disabled in iLO. I believe it is only triggered when power is removed and then re-applied to the host.
However, I assume there is yet another setting that applies when one of the server sensors passes a critical threshold, for example if the ambient temperature sensor exceeds 40 °C. That should absolutely shut down a host, but I'm unsure where the setting lives that disables the automatic reboot after the ambient temperature drops again. Or is this also controlled by HP ASR?
I just want to ensure that there aren't any scenarios that I have forgotten that could bite us in the butt in production.
Any help would be appreciated.
The cleanest approach is to control your environment.
The ambient temperature thresholds for these server platforms are well documented, so focus on keeping your facility within those thresholds.
If you have the number of customers described, this task falls on your facilities and/or datacenter team, right?
At the local server level, your only other parameter is the BIOS Thermal Shutdown option.
If you're experiencing this type of issue, it's rarely sudden and unexpected; there is usually enough time to automate power-off of the affected systems via iLO.
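As a rough illustration of automating that power-off, here is a minimal Python sketch. It assumes the iLO firmware exposes the standard Redfish reset action at `/redfish/v1/Systems/1/Actions/ComputerSystem.Reset` and that a session token has already been obtained; neither is confirmed for your hardware generation. The function names, hostnames, the 40 °C default and the source of the temperature readings are all hypothetical, and the supported `ResetType` values vary by firmware.

```python
import json
import urllib.request

# Assumption: the iLO exposes the standard Redfish reset action at this path.
RESET_PATH = "/redfish/v1/Systems/1/Actions/ComputerSystem.Reset"


def hosts_to_power_off(ambient_c, threshold_c=40.0):
    """Return the hosts whose ambient reading is at or above the threshold.

    ambient_c maps a hostname to its ambient temperature in degrees C.
    The readings are assumed to come from your own monitoring system.
    """
    return sorted(host for host, temp in ambient_c.items() if temp >= threshold_c)


def build_power_off_request(ilo_host, token, reset_type="ForceOff"):
    """Build (but do not send) a Redfish power-off request for one iLO.

    reset_type is an assumption: check which ResetType values your
    firmware actually accepts before relying on it.
    """
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{ilo_host}{RESET_PATH}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
    )


if __name__ == "__main__":
    # Hypothetical readings; nothing is sent over the network here.
    readings = {"web01": 41.5, "web02": 35.0, "db01": 40.0}
    for host in hosts_to_power_off(readings):
        req = build_power_off_request(f"ilo-{host}.example.net", "dummy-token")
        print(req.get_method(), req.full_url)
```

Passing each request to `urllib.request.urlopen` would actually send it. Keeping the decision and the request construction separate makes it possible to dry-run the logic against monitoring data before letting it touch production hosts.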