Our new server has been running basically fine for a few months. Twice, however, it shut itself down for no apparent reason.
The most recent occurrence was at 11:41pm a few days ago. The event logs show nothing untoward, and the last entry is a fairly mundane audit entry in the Security log. The UPS log shows no power issues. It was after hours, so nothing in particular was running except the nightly backup, which starts at 10pm. The backup log also shows nothing interesting; it just stops in the middle of the backup. Although the server is configured to write a kernel dump and restart, there is no memory dump and the system did not restart. It's an HP ProLiant ML330 G6 Series server.
When the server was restarted manually the following morning, the following events were logged:
Log Name: System
Source: EventLog
Date: 4/16/2011 8:20:22 AM
Event ID: 6008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The previous system shutdown at 11:41:26 PM on 4/15/2011 was unexpected.
and
Log Name: System
Source: Microsoft-Windows-Kernel-Power
Date: 4/16/2011 8:20:00 AM
Event ID: 41
Task Category: (63)
Level: Critical
Keywords: (2)
User: SYSTEM
Computer: XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The system has rebooted without cleanly shutting down first. This error could be
caused if the system stopped responding, crashed, or lost power unexpectedly.
and
Log Name: System
Source: USER32
Date: 4/16/2011 8:22:34 AM
Event ID: 1076
Task Category: None
Level: Warning
Keywords: Classic
User: XXXXXXXXXXXXXXX\Administrator
Computer: XXXXXXXX.xxxxxxxxxxxxxxxx.local
Description:
The reason supplied by user XXXXXXXXXXXXXXX\Administrator for the last unexpected
shutdown of this computer is: Other Failure: System Unresponsive
Reason Code: 0x8000005
Problem ID:
Bugcheck String:
Comment:
I've spent some time researching this and found very little of use. Anyone have any ideas?
UPDATE: Here are the relevant portions of the iLO2 log:
305 04/15/2011 23:42:00 Server reset.
306 04/15/2011 23:42:00 Server power removed.
307 04/15/2011 23:42:00 iLO 2 network link down.
308 04/15/2011 23:42:00 iLO 2 network link up at 100 Mbps.
309 04/16/2011 08:17:00 Server power restored.
UPDATE: I increased the size of the paging file to allow for full kernel dumps, so if it's really a Windows crash, I'll be able to see what happened - the next time it happens.
UPDATE: The server firmware was already up to date.
UPDATE: There were a lot of updates available for drivers and system software. I've installed most of them and now I'm just waiting to see if the problem happens again.
UPDATE 2018Jun06: after six years of trouble-free operation, this problem has returned, occurring twice in the last week or so. I'm looking into the possibility that the front panel and its wiring are faulty.
UPDATE 2018Nov30: Finally swapped out the front panel cable assembly, but the problem still occurs. Next up is the power supply.
It's most likely a faulty power switch/LED cable kit. My ML310 G5 was doing the same thing, and that is what fixed the problem. Apparently, it is a known issue with HP.
459186-001-02 HEWLETT-PACKARD PROLIANT ML310 G5 SYSTEM FRONT LED TO SYS/BRD CABLE P/N: 459186-001-02 - HEWLETT-PACKARD ORIGINALS
I had this EXACT issue happening on my Server 2008 R2 box. It turns out that the Xeon 5000 series CPUs, which your machine does use, have an issue with 2008 R2 and the Hyper-V role. I'm going out on a limb here and assuming you have the Hyper-V role installed, based on the issue being identical to the one I was having.
There is a hotfix available from Microsoft. I installed it on my system, and it has been trouble-free since then.
I'm going to go waaaaaaay out on a limb here, and say that you might need a firmware update. Source. We had something similar with our DL380 G6 a while back.
Is the machine overheating? Check the fans and vents for dust bunnies.
Do you have the HP management agent software installed? You mention Windows event logs and backup logs but not the "hardware" logs. You need to look there too because spontaneous shutdowns might be related to a hardware issue that you won't be able to see info about anywhere else.
If that really was a system crash, you would have found an event such as this in the System log:
Also, since the server is configured to save a kernel dump and then reboot, it would have done just that.
The absence of such an event and of a subsequent reboot means the shutdown was caused by something external to Windows (power loss, hardware fault...). Also, your iLO logs seem to confirm that a power failure was the actual reason.
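To make that correlation concrete, here is a rough Python sketch that parses the iLO 2 excerpt from the question and checks whether a "Server power removed" entry falls close to the time reported by Event ID 6008. The line format is inferred only from the five lines posted, the two-minute window is an arbitrary choice, and the function names are made up for illustration. It also assumes the iLO clock and the Windows clock are roughly in sync.

```python
from datetime import datetime, timedelta

# iLO 2 log excerpt copied from the question.
ILO_LOG = """\
305 04/15/2011 23:42:00 Server reset.
306 04/15/2011 23:42:00 Server power removed.
307 04/15/2011 23:42:00 iLO 2 network link down.
308 04/15/2011 23:42:00 iLO 2 network link up at 100 Mbps.
309 04/16/2011 08:17:00 Server power restored.
"""


def parse_ilo_log(text):
    """Return (entry_number, timestamp, message) tuples.

    Assumes each line looks like '<number> MM/DD/YYYY HH:MM:SS <message>',
    which is only inferred from the excerpt above.
    """
    entries = []
    for line in text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # skip anything that does not match the assumed layout
        number, date, time, message = parts
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        entries.append((int(number), when, message))
    return entries


def power_removed_near(entries, shutdown, window=timedelta(minutes=2)):
    """Return 'power removed' entries within `window` of the shutdown time."""
    return [
        entry
        for entry in entries
        if "power removed" in entry[2].lower()
        and abs(entry[1] - shutdown) <= window
    ]


# Time of the unexpected shutdown as reported by Event ID 6008.
shutdown = datetime(2011, 4, 15, 23, 41, 26)
matches = power_removed_near(parse_ilo_log(ILO_LOG), shutdown)

for number, when, message in matches:
    gap = abs(when - shutdown).total_seconds()
    print(f"iLO entry {number}: {message} ({gap:.0f}s from the Windows timestamp)")
```

A match like this only shows that power to the server was cut at about the time Windows stopped logging. It does not say whether the cause was the power supply, the front panel wiring, or something else.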