
Question

smartenbergen

Asked: 2016-12-24 07:10:25 +0800 CST

Server freezes without kernel panic


We are running a KVM node that crashes at irregular intervals and shows very strange behaviour. Interestingly, we already had this problem with another node, which crashed every 1-2 weeks. As we could not find a hardware issue, we began to migrate the VMs to a new node. About one week after we had migrated 50% of the VMs, the new node crashed, while the "old" one has been running fine since then (uptime 3 weeks; we have not seen such a long uptime for months).

When a node crashes, we sometimes see these strange things on the Supermicro IPMI:

We also saw:

"No signal", as if the server had been powered off (of course it was not, and it was never shown as powered off on the IPMI main page)
The normal login screen or other normal output from the server, but frozen

What we never saw was a kernel panic, or even any messages in the logs before the crash. There is complete silence until the lights suddenly go out.

As the problem "moved" from one server to another (a brand-new machine), there are only a few options left in my opinion:

A specific VM is causing the issue
Kernel bug
Hardware issue regarding our setup

More information about the machines:

CentOS 7 with latest kernel (3.10.0-514.2.2.el7.x86_64)
Supermicro Case with redundant power supplies
Supermicro X10DRi / X10DRWi with latest BIOS version
Intel Xeon E5-2630 v3 / v4
512 GB DDR4 ECC RAM (Samsung Server RAM)
145 VMs running (RAM and CPU far from saturated, also thanks to KSM)
Software RAID-10 with 8 / 16 SSDs

Has anyone seen this behaviour, or can anyone say something about the strange "messages" on the console? I have never seen anything like this and do not even know how to describe it for a Google search. At the moment we have no good idea what to do next, as the cause could be anything.

Thanks in advance!

2 Answers

Voted

Bernhard · Answer 1 · 2017-09-28T23:59:15+08:00


This might be a CPU bug. Intel published an erratum about this problem and also provides a microcode update for the E5 v3/v4 CPUs (datecode 20170707). CentOS 7.4 already ships a newer microcode version, 0xb000021 (in CentOS 7.3 it was 0xb00001e). It may help to update the microcode or upgrade to 7.4. I also had a lot of trouble with these system freezes. I replaced the mainboard (X10DRi), RAM, CPU and power supply without success. I can't say for sure whether this is the solution, because I have not had enough uptime since I updated the microcode. Supermicro still does not provide an updated BIOS with the current Intel microcode. You may be able to get an unofficial prerelease from your distributor for the X10DRi.
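To see which microcode revision a host is actually running, one option is to read the `microcode` field that Linux exposes in `/proc/cpuinfo` on x86. The following is a minimal sketch, not an official tool: the helper names are made up for illustration, and the default threshold is simply the 0xb000021 value quoted above, so the comparison is only meaningful for CPUs in the same microcode series.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in cpuinfo text."""
    pattern = r"^microcode\s*:\s*(0x[0-9a-fA-F]+)"
    return {int(m, 16) for m in re.findall(pattern, cpuinfo_text, re.MULTILINE)}


def needs_update(revisions, minimum=0xB000021):
    """True if any logical CPU reports a revision older than `minimum`.

    The default is the revision mentioned in the answer (an assumption about
    what counts as "new enough"); adjust it for your own CPU model.
    """
    return any(rev < minimum for rev in revisions)


def check_host(path="/proc/cpuinfo"):
    """Read the local cpuinfo file and report the revisions found."""
    with open(path) as f:
        revisions = microcode_revisions(f.read())
    return sorted(hex(r) for r in revisions), needs_update(revisions)
```

Calling `check_host()` on the node returns the distinct revisions reported by the logical CPUs and whether any of them is older than the threshold.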

2

smartenbergen · Answer 2 · 2017-01-14T05:31:38+08:00

Best Answer


A short update on this: after upgrading to the newest LTS kernel (4.4.39), the server is stable. Uptime is 19 days now, so I think we got it. Although we do not really know the root cause, we think the CentOS 7 kernel (3.10) might be too old for some very modern hardware. As we cannot provide a helpful error message (ideally a kernel panic), we decided not to report this to the CentOS developers.
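When several nodes are involved, it can help to check quickly which of them still run a kernel older than the one reported stable here. This is a rough sketch under stated assumptions: the function names are hypothetical, the 4.4.39 threshold is just the version from this answer, and since the root cause was never identified, passing the check does not prove a node is unaffected.

```python
import platform
import re


def kernel_version(release):
    """Parse the leading numeric part of a release string such as
    '3.10.0-514.2.2.el7.x86_64' into a tuple like (3, 10, 0)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) if part else 0 for part in m.groups())


def is_at_least(release, minimum=(4, 4, 39)):
    """True if `release` is the same as or newer than `minimum`."""
    return kernel_version(release) >= minimum


def running_kernel_is_at_least(minimum=(4, 4, 39)):
    """Apply the check to the kernel this script is running on."""
    return is_at_least(platform.release(), minimum)
```

On the CentOS 7 kernel from the question, `is_at_least("3.10.0-514.2.2.el7.x86_64")` is false, while `is_at_least("4.4.39")` is true.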

0
