I have an HP DL380e Gen8 with a P420 RAID controller. It was powered on 24 hours a day at my previous job for 7 months, running a few VMs without any issues. After changing jobs, I kept the server at home, turned off, for about 3 months. I turned it on today after adding a 10GbE network card (HP NC523SFP). The server booted fine, I logged in to the OS (CentOS 7), and everything looked fine. After about 45 minutes I heard the server fans spin up to 100% and then drop back to normal. I attached a monitor and found a red screen of death with an NMI error. According to iLO, the error refers to PCI-E slot 1 on riser card 1, which is where the P420 controller is installed (the 10GbE card is in slot 3 on riser card 1). I thought the issue was caused by the 10GbE card, but after removing it the server still gets the red screen of death. I also tried moving the 10GbE card to the PCI-E slot on the other side of the riser card, but nothing changed. I also tried removing the Smart Cache module with the battery and moving the P420 to slot 3. What can I check? The only thing I haven't tried is booting without the HDDs attached and/or with the backplane cables removed from the P420. Is it possible that having the 10GbE card on the same side of the riser card has damaged the P420 controller? When I booted the first time with the 10GbE card installed, I remember a message about a boot disk/ROM option that I hadn't seen before, which I completely ignored.
RAID controllers do fail quite often, so I wouldn't be too surprised if it just decided to die on you.
The easiest way to troubleshoot these things is to start with the minimum boot configuration, which you can usually find in most vendors' service manuals. This is essentially 1 CPU, 1 stick of RAM, and nothing else attached. Then you add components back one at a time until the error reappears; the component you added last is the likely culprit.
Also keep in mind that cables are semi-active components and can fail too. I've seen service techs swap motherboards and RAID controllers when the faulty component was actually a PCI riser or a SAS cable.
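The add-one-component-at-a-time procedure can be sketched in a few lines of Python. This is only an illustration of the logic, not a tool that talks to the server: the function name `find_faulty_component`, the `boots_ok` callback, and the component names are all made up for the example. In practice `boots_ok` is you installing the parts, powering on, and waiting long enough for the fault to show (about 45 minutes in the question).

```python
from typing import Callable, Iterable, List, Optional, Sequence


def find_faulty_component(
    minimum: Sequence[str],
    extras: Iterable[str],
    boots_ok: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the first added component that makes the test fail, or None.

    `boots_ok` is a hypothetical callback standing in for a manual test run
    with exactly the listed components installed.
    """
    installed = list(minimum)
    # If even the bare minimum fails, the fault is in the base system
    # (board, CPU, RAM, PSU) and adding parts tells you nothing.
    if not boots_ok(installed):
        raise RuntimeError("minimum configuration already fails")

    for part in extras:
        installed.append(part)  # add exactly one component per test run
        if not boots_ok(installed):
            return part         # the last part added is the prime suspect
    return None                 # everything passed with all parts installed


# Simulated test for illustration only: pretend the RAID controller is bad.
def simulated_boot(installed: List[str]) -> bool:
    return "P420 RAID controller" not in installed


culprit = find_faulty_component(
    minimum=["CPU 1", "DIMM 1"],
    extras=["PCIe riser 1", "P420 RAID controller", "SAS cable", "10GbE NIC"],
    boots_ok=simulated_boot,
)
print(culprit)  # P420 RAID controller
```

Adding the riser and the cables as separate steps matters, since those are the parts that tend to get overlooked.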
Remove the system board from the metal tray. Remove the heatsink from the southbridge chipset. Scrape all the petrified thermal paste off the chip and heatsink. Put some decent-quality thermal paste on the chip. Replace the heatsink. Pop the system board back onto the metal tray and reassemble the server. The problem should now be gone, and you will be able to see the B320i RAID in the config manager (this is the adapter I was using).
This has worked on the last two DL380e Gen8 boards that showed red screen of death messages.