We have a SuperMicro GPU server with:
- 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
- 512GB memory
- more than enough disk space
- X10DRG-O+-CPU (BIOS version 2.0a, current)
- X9DRG-O-PCIE PCI-E expander card
- 8x GTX 1080
It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0. When it runs, it works fine for a while. With the stock kernel (v4.4), however, it is completely unusable: the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (only the GPUs are from a different maker) shows exactly the same behaviour.
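For reference, "non-trivial" does not mean much here: a trivial load generator along the lines of the sketch below is enough to trigger the freeze. This is just an illustrative example (file name, buffer size and iteration counts are arbitrary, and it assumes the CUDA 8.0 toolchain); any real training job behaves the same way.

```
// stress_min.cu -- keep every GPU busy with FMAs until something hangs.
// Build (assumed toolchain): nvcc -o stress_min stress_min.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(float *buf, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = buf[i];
    for (int k = 0; k < iters; ++k)        // burn SM cycles with FMAs
        v = v * 1.000001f + 0.000001f;
    buf[i] = v;
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const int N = 1 << 20;                 // 1M floats per device, arbitrary
    for (int round = 0; ; ++round) {       // loops until the machine freezes
        for (int dev = 0; dev < n; ++dev) {
            cudaSetDevice(dev);
            float *d = nullptr;
            cudaMalloc(&d, N * sizeof(float));
            cudaMemset(d, 0, N * sizeof(float));
            spin<<<N / 256, 256>>>(d, 10000);
            cudaError_t err = cudaDeviceSynchronize();
            printf("round %d, GPU %d: %s\n", round, dev, cudaGetErrorString(err));
            cudaFree(d);
        }
    }
    return 0;
}
```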
To make it run reasonably well for some time, you have to downgrade the kernel to v3.14.1-trusty (we tried almost every version before settling on that one). Even then there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times only the GPU-related processes hang.
There seem to be other [1] people [2] running into this issue, but no solution has been posted there.
Is anyone having the same experience with this type of machine?
Update: The machines seem to run stably (regardless of the software stack) if the cards are installed on only one side of the PCI-E expander, which means all cards are driven by the same CPU. Another machine, however, now seems to run stably with all 8 cards (about 4 months of uptime right now) on kernel 3.19, after months of showing the problems described above. Bizarre.
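In case anyone wants to check the same thing on their box: which CPU socket a given card hangs off can be read from sysfs via the card's PCI address. The following is only a small sketch (CUDA runtime API plus Linux sysfs; the file name is made up) that prints the NUMA node, i.e. the socket, for every GPU, so you can tell whether all cards end up on the same CPU.

```
// gpu_numa.cu -- print the NUMA node (CPU socket) each GPU is attached to.
// Build (assumed toolchain): nvcc -o gpu_numa gpu_numa.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        // Linux exposes the NUMA node of every PCI function under sysfs.
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%04x:%02x:%02x.0/numa_node",
                 p.pciDomainID, p.pciBusID, p.pciDeviceID);
        int node = -1;                       // -1 means the kernel can't tell
        FILE *f = fopen(path, "r");
        if (f) { fscanf(f, "%d", &node); fclose(f); }
        printf("GPU %d (%s, %04x:%02x:%02x.0): NUMA node %d\n",
               dev, p.name, p.pciDomainID, p.pciBusID, p.pciDeviceID, node);
    }
    return 0;
}
```

On a two-socket board like this one, all GPUs reporting the same NUMA node would match the "one side of the expander" configuration that appears to be stable.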
[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/