We have a SuperMicro GPU server with:
- 2x Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
- 512GB memory
- more than enough disk space
- X10DRG-O+-CPU (BIOS version 2.0a, current)
- X9DRG-O-PCIE PCI-E expander card
- 8x GTX 1080
It is set up with Ubuntu 16.04.1 LTS, NVIDIA driver 367.57 and CUDA 8.0. When it runs, it works fine for a while. With the stock kernel (v4.4), however, it is completely unusable: the system freezes almost immediately when doing anything non-trivial on any GPU. We therefore suspected a hardware issue, but cooling is fine, and a second, almost identical machine (only the GPUs are from a different maker) shows exactly the same behaviour.
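For reference, "non-trivial" does not mean much here: a trivial load generator along the lines of the sketch below is enough to trigger the freeze. This is just an illustrative example (file name, buffer size and iteration counts are arbitrary, and it assumes the CUDA 8.0 toolchain); any real training job behaves the same way.

```
// stress_min.cu -- keep every GPU busy with FMAs until something hangs.
// Build (assumed toolchain): nvcc -o stress_min stress_min.cu
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(float *buf, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = buf[i];
    for (int k = 0; k < iters; ++k)        // burn SM cycles with FMAs
        v = v * 1.000001f + 0.000001f;
    buf[i] = v;
}

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const int N = 1 << 20;                 // 1M floats per device, arbitrary
    for (int round = 0; ; ++round) {       // loops until the machine freezes
        for (int dev = 0; dev < n; ++dev) {
            cudaSetDevice(dev);
            float *d = nullptr;
            cudaMalloc(&d, N * sizeof(float));
            cudaMemset(d, 0, N * sizeof(float));
            spin<<<N / 256, 256>>>(d, 10000);
            cudaError_t err = cudaDeviceSynchronize();
            printf("round %d, GPU %d: %s\n", round, dev, cudaGetErrorString(err));
            cudaFree(d);
        }
    }
    return 0;
}
```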
To make it run reasonably well for some time, you have to downgrade the kernel to v3.14.1-trusty (we tried almost every version before settling on that one). Even then there are still random freezes, usually with nothing in the logs. Sometimes the whole machine freezes, other times only the GPU-related processes hang.
There seem to be other [1] people [2] running into this issue, but no solution has been posted there.
Is anyone having the same experience with this type of machine?
Update: The machines seem to run stably (regardless of the software stack) if the cards are installed on only one side of the PCI-E expander, which means all cards are driven by the same CPU. Another machine, however, now seems to run stably with all 8 cards (about 4 months of uptime right now) on kernel 3.19, after months of showing the problems described above. Bizarre.
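In case anyone wants to check the same thing on their box: which CPU socket a given card hangs off can be read from sysfs via the card's PCI address. The following is only a small sketch (CUDA runtime API plus Linux sysfs; the file name is made up) that prints the NUMA node, i.e. the socket, for every GPU, so you can tell whether all cards end up on the same CPU.

```
// gpu_numa.cu -- print the NUMA node (CPU socket) each GPU is attached to.
// Build (assumed toolchain): nvcc -o gpu_numa gpu_numa.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        // Linux exposes the NUMA node of every PCI function under sysfs.
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%04x:%02x:%02x.0/numa_node",
                 p.pciDomainID, p.pciBusID, p.pciDeviceID);
        int node = -1;                       // -1 means the kernel can't tell
        FILE *f = fopen(path, "r");
        if (f) { fscanf(f, "%d", &node); fclose(f); }
        printf("GPU %d (%s, %04x:%02x:%02x.0): NUMA node %d\n",
               dev, p.name, p.pciDomainID, p.pciBusID, p.pciDeviceID, node);
    }
    return 0;
}
```

On a two-socket board like this one, all GPUs reporting the same NUMA node would match the "one side of the expander" configuration that appears to be stable.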
[1] https://devtalk.nvidia.com/default/topic/958927/gpu-job-fail-/