For a project we have 50 servers, all equipped with (generally) the same hardware. The issue we are facing is very serious and happens on all machines. Despite a lot of effort and contacting both the manufacturers and the software developers, everyone points at each other and refuses to give me even a clue about what is going on.
First, let me describe the setup. This is 'server-grade' hardware. As a first experience with it, server-grade has been the biggest disappointment of my life.
- SuperMicro X10SDV-8C+-LN2F
- Intel Xeon D-1540 (embedded on the motherboard)
- Custom designed 1U case or SuperMicro original case
- 480 watt server PSU or 200 watt SuperMicro original PSU
- Samsung Evo 850 500 GB SSD
- 32 GB DDR4-2133 ECC or NON-ECC (but not mixed in the same server)
- Asus GT730 4GB DDR3 GPU
- GPU is mounted on a PCIe riser card (not a ribbon cable), either a no-name one from China or a SuperMicro original
Running on the system:
- Windows Server 2012 R2 Enterprise
- VMware Workstation 12
- VMs run GPU-intensive tasks
- The system is stock; there is no over- or underclocking at all
Symptoms:
- Random BSOD 0x9C (aka MACHINE_CHECK_EXCEPTION): sometimes the system runs for a week with no problems, sometimes it crashes after just 10 minutes, but most of the time it runs for a few hours.
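For anyone who wants to look at the dumps: a minimal sketch (my own, not something we have in production) of how the `!analyze -v` output could be collected from all the minidumps with cdb.exe from the Windows Debugging Tools. The install path, symbol cache location and dump folder below are assumptions about a default setup, adjust as needed.

```python
"""Batch-run '!analyze -v' over collected minidumps with cdb.exe (sketch only)."""
import pathlib
import subprocess

# Assumed default locations; change these to match the actual install.
CDB = r"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe"
SYMBOLS = r"srv*C:\symbols*https://msdl.microsoft.com/download/symbols"
DUMP_DIR = pathlib.Path(r"C:\Windows\Minidump")  # default minidump location

for dump in sorted(DUMP_DIR.glob("*.dmp")):
    # "!analyze -v" prints the bugcheck code, its parameters and the suspected
    # component; "q" quits cdb so the loop can move on to the next dump.
    result = subprocess.run(
        [CDB, "-z", str(dump), "-y", SYMBOLS, "-c", "!analyze -v; q"],
        capture_output=True, text=True, check=False,
    )
    report = pathlib.Path(dump.stem + ".analysis.txt")
    report.write_text(result.stdout)
    print(f"{dump.name} -> {report.name}")
```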
Already tried/checked:
- BIOS updated to the latest version (I believe this improved how long the system stays stable, but that could have been random).
- Windows updated to the latest version.
- VMware updated to the latest version.
- Swapped all components and tried every different combination; even tried a desktop ATX PSU and an M.2 SSD.
- Installed all systems from scratch with Ubuntu. I'm not familiar with Linux; I have never seen a Linux equivalent of a BSOD, and I still haven't, since the servers are headless and I ran this test in the DC. RESULT: the system would hang, and after a reboot Linux reported an Xorg crash (GPU related). See the log-scanning sketch after this list.
- Changed the GPU setting in the BIOS to 'Above 4G'; the rest of the BIOS is at factory defaults.
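Because the Ubuntu test systems are headless, the most useful trace after a hang is the kernel log of the previous boot. A rough sketch (mine, with keyword guesses rather than confirmed log strings) that pulls that log and flags machine-check / PCIe / GPU-related lines:

```python
#!/usr/bin/env python3
"""Flag MCE/PCIe/GPU-related lines from the previous boot's kernel log (sketch)."""
import re
import subprocess

# "journalctl -k -b -1" reads the kernel log of the *previous* boot, which is
# where the messages from a hang usually end up. This needs a persistent
# journal; otherwise fall back to plain "dmesg" for the current boot.
log = subprocess.run(
    ["journalctl", "-k", "-b", "-1", "--no-pager"],
    capture_output=True, text=True, check=False,
).stdout

# Keywords are assumptions about what might show up (MCE, PCIe AER, NVIDIA
# driver messages), not strings confirmed from these particular machines.
pattern = re.compile(
    r"\bmce\b|machine check|hardware error|\baer\b|pcie bus error|nvrm|\bxid\b",
    re.IGNORECASE,
)

hits = [line for line in log.splitlines() if pattern.search(line)]
print(f"{len(hits)} suspicious kernel log lines:")
for line in hits:
    print(line)
```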
Also informative:
- Systems are located in a datacenter. Temperature, air, power and network are optimal.
- Temperatures are well below the factory maximums.
- We have the exact same software setup running on desktop computers (with desktop hardware). Those systems run fine, with roughly 1 out of 100 PCs crashing every month.
- I have contacted VMware; they say this is a hardware issue.
- I have contacted SuperMicro; they say nothing useful, only suggest things we have already tried, and add that this could still be a software issue.
We are desperate here. Luckily, the application we run is somewhat redundant: if a server and the VMs on it drop out, it's not a big issue, since other servers take over the load within 5 minutes. But at this rate I am required to be online all day just to restart servers.
I have a lot of hardware knowledge, but this goes past it; I've been searching this all day for over a month, trying all sorts of different things. The fact that these motherboards are used by hosting providers on a large scale makes me suspect the board itself is OK. This is definitely not a one-off hardware defect suitable for RMA, since all 50 boards show the same symptoms. The only thing different in our setup is the GPU. That, in combination with the Linux experiment, makes me suspect something on the PCIe lanes. The GPU itself is stable on desktop motherboards. Despite its large memory capacity, this is a small GPU that does not draw much power. I would suspect the Chinese riser cards, but then again we also use SuperMicro-certified risers and they show no improvement at all.
I am very desperate to find a solution here, and that starts with determining the exact cause. We are willing to pay a nice bounty to an expert who can analyse some dumps and give us more details (or, better yet, a solution).
Kind regards,
Simon
Well, this is super late; I imagine the issue is resolved by this point? Either way, 0x9C usually means an MCE hardware fault. Our GPU systems ran Linux as the host OS, which reports these errors a bit more verbosely than Windows does.
Anyway, these were randomly popping up for us on similar hardware made by HP a while back. It ended up being insufficient power delivery to the GPU, specifically the 75 W that is supposed to be supplied by the PCIe slot itself.
We confirmed it with a multimeter on a PCIe breakout board: the voltage dropped when both the GPU and the 10 GbE network cards were being hit hard at the same time. While the motherboard was capable of delivering 75 W to the x16 slot, its power delivery section struggled a bit when the other cards were all drawing power as well.
The riser may be suspect here, dropping voltage under high current loads.
Thanks for your reply. It's now 3 years later. Supermicro has refused to help us in every possible way. We sent over multiple machines (built exactly as we build them). According to them, they stress-tested the machines for weeks and they never crashed.
As for the riser, the same error occurs with the GPU directly in the slot.
Supermicro keeps putting the blame on VMware, something I was inclined to believe until I got my hands on their new release of the same board. Without any comment from Supermicro, the board with the Xeon D-1540 was updated to a Xeon D-1541 after just a few months. The new board is basically the same aside from the newer CPU (which is itself essentially the same chip with a slightly higher clock speed). The updated board also features an extra fan header.
These boards no longer crash. Under exactly the same load they run for months without a problem. I even cloned machines here; they run the exact same hardware and software as the crashing ones.
This more or less confirms my suspicion: Supermicro knows there is a problem with these boards but does not want to tell me why, because I ended up with almost 100 of them being useless due to the crashes. There was never an RMA, a fix, or even a BIOS update for it, so it must have been something on the board itself.
Needless to say, this was my first and last time with Supermicro. This could happen to any brand, of course, but the support was sub-zero.