We are running a KVM node which crashes irregularly and shows very strange behaviour. The interesting thing is that we already had this problem with another node, which crashed every 1-2 weeks. As we could not find a hardware issue, we began to migrate the VMs to a new node. About one week after we had migrated 50% of the VMs, the new node crashed, while the "old" one has been running fine since then (uptime 3 weeks; we have not seen such a long uptime for months).
When a node crashes, we sometimes see these strange things on the Supermicro IPMI:
We also saw:
- "No signal" like the server has been powered off (of course it was not, and it was also never shown as powered off on the IPMI main page)
- The normal login screen or other normal output from the server, but frozen
What we never saw was a kernel panic, or even any messages in the logs before the crash. There is complete silence until suddenly the lights go out.
As the problem "moved" from one server to another (a brand-new machine), there are only a few options left in my opinion:
- A specific VM is causing the issue
- Kernel bug
- Hardware issue specific to our setup
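To narrow down the first option, one idea would be to note which VMs were on the affected node at each crash and intersect those sets. A minimal sketch in Python, assuming we had such records; the VM names are made up and `suspect_vms` is a hypothetical helper, not a tool we have in place:

```python
def suspect_vms(crashes):
    """Return the VMs that were on the crashing node at every recorded crash."""
    suspects = None
    for vms_on_node in crashes:
        vms = set(vms_on_node)
        # Keep only VMs that were also present at all earlier crashes.
        suspects = vms if suspects is None else suspects & vms
    return suspects or set()


# Hypothetical data: one set of VM names per crash (not our real inventory).
crashes = [
    {"vm-a", "vm-b", "vm-c"},
    {"vm-b", "vm-c", "vm-d"},
    {"vm-c", "vm-e"},
]
print(sorted(suspect_vms(crashes)))
```

With 145 VMs this would only help after several crashes with different VM placements, and an empty result would point more towards the other two options.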
More information about the machines:
- CentOS 7 with latest kernel (3.10.0-514.2.2.el7.x86_64)
- Supermicro Case with redundant power supplies
- Supermicro X10DRi / X10DRWi with latest BIOS version
- Intel Xeon E5-2630 v3 / v4
- 512 GB DDR4 ECC RAM (Samsung Server RAM)
- 145 VMs running (RAM and CPU far from being saturated, also thanks to KSM)
- Software RAID-10 with 8 / 16 SSDs
Has anyone seen this behaviour, or can anyone say something about the strange "messages" on the console? I have never seen anything like this and do not even know how to describe it for a Google search. At the moment we have no good idea of what to do next, as it could be anything.
Thanks in advance!