We are running a KVM node that crashes at irregular intervals and shows very strange behaviour when it does. The interesting thing is that we already had this problem with another node, which crashed every 1-2 weeks. As we could not find a hardware issue, we began to migrate the VMs to a new node. About one week after we had migrated 50% of the VMs, the new node crashed, while the "old" one has been running fine since then (uptime 3 weeks; we have not seen an uptime that long for months).
When a node crashes, we sometimes see these strange things on the Supermicro IPMI:
We also saw:
- "No signal" like the server has been powered off (of course it was not, and it was also never shown as powered off on the IPMI main page)
- The normal login screen or other normal output from the server, but frozen
What we never saw was a kernel panic, or even any messages in the logs before the crash. There is complete silence until the lights suddenly go out.
As the problem "moved" from one server to another (a brand-new machine), there are only a few options left in my opinion:
- A specific VM is causing the issue
- Kernel bug
- A hardware issue specific to our setup
More information about the machines:
- CentOS 7 with latest kernel (3.10.0-514.2.2.el7.x86_64)
- Supermicro Case with redundant power supplies
- Supermicro X10DRi / X10DRWi with latest BIOS version
- Intel Xeon E5-2630 v3 / v4
- 512 GB DDR4 ECC RAM (Samsung Server RAM)
- 145 VMs running (RAM and CPU far from saturated, partly thanks to KSM)
- Software RAID-10 with 8 / 16 SSDs
Has anyone seen this behaviour, or can anyone say something about the strange "messages" on the console? I have never seen anything like this and do not even know how to describe it for a Google search. At the moment we have no good idea what to do next, as it could be anything.
Thanks in advance!
This might be a CPU bug. Intel published an erratum about this problem and also provides a microcode update for the E5 v3/v4 CPUs (datecode 20170707). CentOS 7.4 already ships a newer microcode version, 0xb000021 (in CentOS 7.3 it was 0xb00001e). It may help to update the microcode or upgrade to 7.4. I also had a lot of trouble with these system freezes. I replaced the mainboard (X10DRi), RAM, CPU and power supply without success. I can't say for sure whether this is the solution, because I do not have enough uptime yet since I updated the microcode. Supermicro still does not provide an updated BIOS with the current Intel microcode. You may be able to get an unofficial prerelease for the X10DRi from your distributor.
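To see which microcode revision a node is actually running, you can read the `microcode` field that Linux exposes in `/proc/cpuinfo` on x86. The following is only a rough sketch in Python: the function names are made up for illustration, and the default threshold is simply the 0xb000021 value mentioned above. Revision numbers are only comparable within the same CPU model, so treat the threshold as an assumption to adjust for your hardware.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux the line looks like: "microcode\t: 0xb00001e"
        match = re.match(r"microcode\s*:\s*(0x[0-9a-fA-F]+)", line)
        if match:
            revisions.add(int(match.group(1), 16))
    return revisions


def is_at_least(cpuinfo_text, minimum=0xB000021):
    """True if every reported revision is >= minimum (assumed threshold)."""
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= minimum


def check_host(path="/proc/cpuinfo", minimum=0xB000021):
    """Print the revisions found on this host and whether they meet the threshold."""
    with open(path) as handle:
        text = handle.read()
    found = sorted(hex(rev) for rev in microcode_revisions(text))
    print("microcode revisions:", ", ".join(found) or "none found")
    print("meets threshold:", is_at_least(text, minimum))
```

Calling `check_host()` on the node prints the revisions reported by the cores. If it lists more than one value, the cores are not all on the same microcode, which would be worth looking into by itself.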
A short update on this: after upgrading to the newest LTS kernel (4.4.39), the server is stable. Uptime is 19 days now, so I think we got it. Although we do not really know the root cause, we think the CentOS 7 kernel (3.10) might be too old for some very modern hardware. As we cannot provide a helpful error message (a kernel panic in the best case), we decided not to report this to the CentOS developers.