We just received a brand new dual-CPU server and it keeps crashing with a kernel panic shortly after booting. This even happened during the OS setup, while the machine was idle. I was able to get the OS installed and enable mcelog to try to understand what is happening, but I'm not sure what to make of the output. Reading online made me think this might be a defective DIMM on one of the sockets (socket 1), but I ran memtest for several passes and found no errors. Is it possible this is a software issue instead? I've already tried two OSes and the same thing happened on both, although it happened far more often on Debian/Proxmox than on CentOS.
Server Specs:
Dual Intel 8-Core Xeon E5-2620v4
2 x 32GB DDR4 2400MHz RECC DIMM
MB SuperMicro X10DRL-i
It's not the CPU thermals, because those never went above 35°C during memtest or the OS install. I was also able to run some short benchmarks on the CPU before it crashed, and the temps were OK.
How can I figure out what is going on here? I have access to the server for a few minutes before it crashes. I've already downloaded the vmcore dump, but I'm not sure what to do with it.
Here's the mce log from a boot where the errors started about 50 seconds in and the kernel then panicked:
[ 56.367615] mce: [Hardware Error]: Machine check events logged
[ 70.420914] mce: [Hardware Error]: Machine check events logged
[ 71.886789] Disabling lock debugging due to kernel taint
[ 71.886894] mce: [Hardware Error]: CPU 24: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.887009] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.887122] mce: [Hardware Error]: TSC 206cc7cd362
[ 71.887184] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 11 microcode b00001d
[ 71.887289] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.889392] mce: [Hardware Error]: CPU 30: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.889489] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.889595] mce: [Hardware Error]: TSC 206cc7cd11d
[ 71.889657] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1d microcode b00001d
[ 71.889760] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.891804] mce: [Hardware Error]: CPU 14: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.891901] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.892007] mce: [Hardware Error]: TSC 206cc7cd10e
[ 71.892068] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1c microcode b00001d
[ 71.892171] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.894217] mce: [Hardware Error]: CPU 13: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.894314] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.894420] mce: [Hardware Error]: TSC 206cc7cd23c
[ 71.894480] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1a microcode b00001d
[ 71.894585] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.896634] mce: [Hardware Error]: CPU 29: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.896730] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.896835] mce: [Hardware Error]: TSC 206cc7cd194
[ 71.896896] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1b microcode b00001d
[ 71.897000] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.899053] mce: [Hardware Error]: CPU 28: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.899150] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.899256] mce: [Hardware Error]: TSC 206cc7cd719
[ 71.899335] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 19 microcode b00001d
[ 71.899438] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.901485] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.901582] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.901687] mce: [Hardware Error]: TSC 206cc7cd720
[ 71.901748] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 18 microcode b00001d
[ 71.901851] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.903934] mce: [Hardware Error]: CPU 10: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.904031] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.904136] mce: [Hardware Error]: TSC 206cc7cd851
[ 71.904197] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 14 microcode b00001d
[ 71.904300] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.906306] mce: [Hardware Error]: CPU 26: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.906403] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.906508] mce: [Hardware Error]: TSC 206cc7cd863
[ 71.906569] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 15 microcode b00001d
[ 71.909482] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.914367] mce: [Hardware Error]: CPU 11: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.917304] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.920287] mce: [Hardware Error]: TSC 206cc7cd515
[ 71.923159] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 16 microcode b00001d
[ 71.926031] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 71.930820] mce: [Hardware Error]: CPU 27: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.933685] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.936557] mce: [Hardware Error]: TSC 206cc7cd449
[ 71.939384] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 17 microcode b00001d
[ 71.944180] mce: [Hardware Error]: CPU 9: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.947059] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.949956] mce: [Hardware Error]: TSC 206cc7cd766
[ 71.952786] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 12 microcode b00001d
[ 71.957580] mce: [Hardware Error]: CPU 25: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.960480] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.963366] mce: [Hardware Error]: TSC 206cc7cd751
[ 71.966210] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 13 microcode b00001d
[ 71.971031] mce: [Hardware Error]: CPU 31: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.973919] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.976817] mce: [Hardware Error]: TSC 206cc7cd7f7
[ 71.979690] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1f microcode b00001d
[ 71.984474] mce: [Hardware Error]: CPU 15: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 71.987371] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 71.990290] mce: [Hardware Error]: TSC 206cc7cd803
[ 71.993151] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 1e microcode b00001d
[ 71.997992] mce: [Hardware Error]: CPU 8: Machine Check Exception: 5 Bank 20: fa00004000020e0f
[ 72.000918] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8138fb97> {intel_idle+0xd7/0x160}
[ 72.003828] mce: [Hardware Error]: TSC 206cc7cd374
[ 72.006692] mce: [Hardware Error]: PROCESSOR 0:406f1 TIME 1487438906 SOCKET 1 APIC 10 microcode b00001d
[ 72.011533] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 72.014436] Kernel panic - not syncing: Fatal machine check
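For anyone trying to read the status value in the log by hand, here is a rough sketch of decoding the architectural part of an IA32_MCi_STATUS word such as fa00004000020e0f. The flag bits and the compound error-code layout follow my reading of the machine-check architecture chapter of the Intel SDM; the function name and dictionary keys are made up for illustration. It does not interpret the model-specific bits or say what bank 20 is on this CPU, which needs the model-specific tables (or `mcelog --ascii`, as the kernel suggests).

```python
# Sketch only: decodes the architectural fields of IA32_MCi_STATUS.
# decode_mci_status and its dictionary keys are hypothetical names.

# Flag bits in the top byte of the status register.
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # IA32_MCi_MISC holds extra info
    58: "ADDRV",  # IA32_MCi_ADDR holds an address
    57: "PCC",    # processor context corrupt
}

# Field encodings for bus/interconnect compound codes (000F 1PPT RRRR IILL).
PP = {0: "SRC (originated request)", 1: "RES (responded to request)",
      2: "OBS (observed as third party)", 3: "generic"}
RRRR = {0: "ERR (generic)", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
II = {0: "memory", 1: "reserved", 2: "I/O", 3: "other"}
LL = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_mci_status(status: int) -> dict:
    """Return the flags and error-code fields of a 64-bit status value."""
    result = {
        "flags": [name for bit, name in FLAGS.items() if (status >> bit) & 1],
        "mca_error_code": status & 0xFFFF,            # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }
    code = result["mca_error_code"]
    # Bus/interconnect errors have bit 11 set and bits 15:13 clear
    # (bit 12 is the filtering bit and is ignored here).
    if code & 0xE800 == 0x0800:
        result["class"] = "bus/interconnect"
        result["participation"] = PP[(code >> 9) & 0x3]
        result["timeout"] = bool((code >> 8) & 0x1)
        result["request"] = RRRR.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = II[(code >> 2) & 0x3]
        result["level"] = LL[code & 0x3]
    return result


if __name__ == "__main__":
    for key, value in decode_mci_status(0xFA00004000020E0F).items():
        print(f"{key}: {value}")
```

For the value in the log, this reports the flags VAL, OVER, UC, EN, MISCV and PCC, and classifies the error code 0x0e0f as a generic bus/interconnect error. The PCC flag lines up with the "Processor context corrupt" line right before the panic, and every event in the log carries the same status and the same SOCKET 1.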
Late reply, I know, but I completely forgot to follow up. It turned out that one of the CPUs was improperly seated, or maybe it came loose during shipping. At least that's what the vendor told me; they say they didn't replace anything.
After they shipped it back, everything was working.