I recently built a small cluster for running Solr. The cluster consists of 12 Supermicro blades, each with an E3-1270V2 CPU and 32 GB of RAM.
11 of these servers are running fine. One of them crashes on me constantly. When the server crashes it typically produces some output on the terminal. The first time it was:
double fault: 0000 [#1]
Hmm... that's pretty cryptic. Since then I've recreated the problem and gotten some more interesting messages.
Here's another equally cryptic message...
Another interesting wrinkle is that I can fire up sysbench and max out the CPU without it crashing, but it's not until I start Java that it crashes reliably.
I've tried turning off the following CPU features:
- Turbo Mode
- C States
- T States
- XHCI
Is this just a bad CPU?
Many thanks!
I have had this type of experience with Nehalem and Westmere CPUs on HP ProLiant servers. In my case, the server would POST properly and recognize all RAM, but would generate machine-check exceptions tied to a particular slot after application load was applied.
If you haven't already, please try isolating the issue to a particular DIMM or DIMM slot: swap modules between slots and see whether the error follows the module or stays with the slot. If the error persists and is tied to a specific slot, I'd suggest examining the CPU socket. Check the CPU socket(s) on the motherboard and take note of any bent pins.
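If the box stays up long enough to log memory errors, the kernel's EDAC counters can help with that isolation. Below is a minimal Python sketch, assuming a Linux host with an EDAC driver loaded for the memory controller, ECC RAM, and the usual sysfs layout under `/sys/devices/system/edac/mc`. The helper names `read_error_counts` and `suspects` are made up for this illustration, and how a csrow maps to a physical slot depends on the board, so treat the output as a hint rather than a diagnosis.

```python
from pathlib import Path

# Assumed location of the Linux EDAC sysfs tree. It only exists when an
# EDAC driver for the memory controller is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_error_counts(root=EDAC_ROOT):
    """Return {"mc0/csrow1": {"ce": 3, "ue": 0}, ...} for each csrow found.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    A value of None means the counter file was missing or unreadable.
    """
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts
    for csrow in sorted(root.glob("mc*/csrow*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((csrow / filename).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[f"{csrow.parent.name}/{csrow.name}"] = entry
    return counts


def suspects(counts):
    """Return the locations that have logged at least one error."""
    return [
        location
        for location, entry in counts.items()
        if (entry["ce"] or 0) > 0 or (entry["ue"] or 0) > 0
    ]


if __name__ == "__main__":
    counts = read_error_counts()
    if not counts:
        print("No EDAC csrow entries found (driver not loaded or no ECC?)")
    for location in suspects(counts):
        print(location, counts[location])
```

Run it once under load, move the suspect module to a different slot, and run it again. If the non-zero counters move with the module, the DIMM is the likely culprit; if they stay on the same csrow, look at the slot or socket.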
This is Supermicro gear, so I don't know the warranty terms. Hopefully it's only the RAM, as that's an easier replacement than a system board.