We use a lot of GPGPU computing (mostly CUDA, but some OpenCL). Users' code often errors out with a memory error, but only on one of our hosts, so I suspect one of the cards is faulty. Sometimes it brings down the whole system and sometimes the program just bombs out.
What are the easiest, fastest, and most thorough ways to fully test GPUs for possible failures?
I know there are tools that come with NVIDIA's CUDA SDK and driver (basic invocations sketched below):
deviceQuery
nvidia-smi
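For context, the quick checks these give look something like the following (deviceQuery has to be built from the SDK samples first, and the ECC counters only show up on cards that support ECC):

    # confirm the card enumerates and reports sane capabilities
    ./deviceQuery
    # dump per-GPU status, including temperature, memory usage, and ECC error counts
    nvidia-smi -q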
But I need something much more thorough. Suggestions? Experiences?
The de facto standard seems to be CUDA GPU Memtest. As @c2h5oh mentioned, it looks like it's based on the memtest86 test patterns, so I'm sure it does a good job. It runs relatively quickly on the high-end GPUs I'm testing (30 minutes on a Quadro 6000 and 20 minutes on a Tesla C2075). It runs inside the OS (unlike memtest86, which boots standalone), so monitoring is a bit different. You'll probably want to redirect stdout and stderr to a file to look at later, so consider running it something like the sketch below; that way, if you lose your terminal output, you can still look up what the tests found.
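A minimal way to do that, assuming the binary is named cuda_memtest and using gpu_memtest.log as a placeholder log name:

    # run the default test suite, capturing all output to a log file
    # while still printing it to the terminal
    ./cuda_memtest 2>&1 | tee gpu_memtest.log

tee keeps a persistent copy of every pass/fail message, so even if your SSH session dies you can check the log afterwards.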
You'll also want to make sure that no one else is using the system and/or cards. You can set the GPUs to exclusive mode using nvidia-smi, as shown below.
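Something along these lines should work (run as root; the exact flag spelling varies between driver versions, e.g. older nvidia-smi releases take the compute mode as a number, with 3 meaning exclusive process):

    # allow only one compute process/context at a time on GPU 0
    nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

Remember to set the mode back to DEFAULT (-c 0) once testing is finished.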
Here is some of the output from sample runs on both the Quadro and the Tesla, in case you're interested in what test info is given:
Google "Memtest + GPU": any of the first three results seems to be a valid answer. No personal experience.
http://sourceforge.net/projects/cudagpumemtest/
http://www.softpedia.com/get/Tweak/Memory-Tweak/CUDA-MemTest.shtml
https://simtk.org/home/memtest/