We have a Dell PowerEdge 2950 running Windows Server 2003 R2, Enterprise x64 with Service Pack 2 installed.
Recently, that server has been throwing repeated STOP errors. Fortunately it is only in place as a failover machine, so it is not currently affecting our production environment. The error that shows up in the server log is this:
Event Type: Error
Event Source: System Error
Event Category: (102)
Event ID: 1003
Description:
Error code 000000000000009c, parameter1 0000000000000004,
parameter2 fffffadf90881240, parameter3 00000000f2000000,
parameter4 0000000000060151.
So far the best I've been able to track down is that the 9C error is some sort of generic hardware problem. The other parameters have been no use in narrowing this one down.
There have been no hardware changes since the machine was brought into service last year. It has an identical twin box (the primary that this one acts as a failover for) that is not showing the same behavior. The last software change was on 4/16/2009, when several security updates were applied. The blue screens started happening on 5/9/2009.
Are there any diagnostics that may help with this problem?
See Kazna3's answer at http://www.d-a-l.com/archive/index.php/t-49205.html He/she writes:
In other words, your hardware is likely borked. Possibly a brown-out, or high heat. Just because a component is solid-state doesn't mean it can't fail. Eg: RAM fails all the time - there's a reason it ships in static-resistant bags.
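The parameters in the event are not entirely opaque. Here is a rough sketch of how they could be decoded. It assumes the layout Microsoft documents for bug check 0x9C on pre-WHEA systems with MCA support (parameter1 = bank number, parameter3 and parameter4 = high and low 32 bits of that bank's MCi_STATUS register) and Intel's architectural MCi_STATUS bit layout. The function and table names are mine, and the model-specific bits may mean something different on this particular CPU, so treat the result as a hint rather than a diagnosis.

```python
import re

# Event 1003 description text copied from the question.
EVENT_TEXT = """Error code 000000000000009c, parameter1 0000000000000004,
parameter2 fffffadf90881240, parameter3 00000000f2000000,
parameter4 0000000000060151."""

# Sub-fields of Intel's architectural "cache hierarchy" error code,
# which has the bit pattern 0000 0001 RRRR TTLL.
REQUEST = {0: "generic error", 1: "generic read", 2: "generic write",
           3: "data read", 4: "data write", 5: "instruction fetch",
           6: "prefetch", 7: "eviction", 8: "snoop"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def parse_params(text):
    """Return the four bugcheck parameters as integers, keyed 1..4."""
    found = re.findall(r"parameter(\d)\s+([0-9a-fA-F]+)", text)
    return {int(number): int(value, 16) for number, value in found}


def decode_status(high32, low32):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    status = ((high32 & 0xFFFFFFFF) << 32) | (low32 & 0xFFFFFFFF)

    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    fields = {
        "status": status,
        "valid": bit(63),
        "overflow": bit(62),          # an earlier error was overwritten
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),   # processor state may be unreliable
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": code,
        "meaning": "not a cache hierarchy code (not decoded here)",
    }
    # Only the cache hierarchy family is decoded in this sketch.
    if (code & 0xFF00) == 0x0100:
        fields["meaning"] = "cache hierarchy error: %s, %s cache, %s" % (
            REQUEST.get((code >> 4) & 0xF, "reserved"),
            TRANSACTION.get((code >> 2) & 0x3, "reserved"),
            LEVEL[code & 0x3],
        )
    return fields


params = parse_params(EVENT_TEXT)
result = decode_status(params[3], params[4])
print("bank", params[1])
for name, value in result.items():
    print(name, hex(value) if name.endswith(("status", "code")) else value)
```

Under those assumptions the status word is valid, uncorrected, and flags processor context as corrupt, and the error code falls in the cache hierarchy family (instruction fetch, level 1). That would fit a CPU fault rather than anything in the OS.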
Do you have physical access to the machine? Does the status LCD give an error code when this happens, or does it seem oblivious?
If you have OpenManage installed, you have a leg up already. Check the OpenManage logs to see if it has logged any hardware errors. OpenManage also includes a pretty full-featured diagnostic suite. Check out http://www.dell.com/downloads/global/power/ps1q06-20050259-Thathireddy.pdf for an explanation on using it. Dell's support usually has you run a couple of CLI diagnostic tests, so it may be best to get in contact with them.
As a generic step (and to preclude Support asking you to do this), update your BIOS and Embedded Server Management BMC firmware.
Replace your CPU if you have a spare.
Also, it may sound strange, but if you have a DRAC installed, remove it. I had a 2850 that was giving CPU error codes (E07F0), freezing randomly, and occasionally failing to boot. Swapping out the DRAC corrected it and it's been problem-free ever since.
If none of this works, it's time to give Dell a call. This is 100% below the OS layer.
See Microsoft KB 939315: the storport driver can cause this. Did you see the error on reboot, on shutdown, or simply while running?