How do you account for GPU compute time on your HPC clusters?
I have a growing and quite heterogeneous GPU partition (SXM4 A100s, PCIe A100s, NVLink-connected V100s, PCIe V100s, T4s, AMD cards arriving soon, etc.) on an HPC cluster of mixed-hardware Debian servers running the OAR scheduler.
Traditionally, we accounted for compute time as core-seconds per job. Despite CPU and memory variability between nodes (fat nodes, high-speed nodes, standard nodes), the differences were small enough that they didn't noticeably affect accounting, especially in a small university setting.
On GPUs, things change quite a bit. The difference in performance and cost between an SXM4 A100 node and a T4 is significant, and our current model is probably not going to cut it, especially since growing university partnerships mean we will be hosting more and more private-sector projects that we will have to account for precisely.
I'm exploring how to do this accounting with our current infrastructure (a rough sketch of what I'm considering is below), but I was also wondering what methods are used by other people operating HPC GPU clusters. If you have any advice on how to do this, or on what strategies/tools you have used, I'd be very glad to hear it!
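For illustration, this is the kind of weighted GPU-seconds scheme I've been sketching. The GPU weight values and the job record fields below are made up for the example (they are not something OAR provides out of the box); in practice the weights would come from hardware cost, benchmarks, or a negotiated charging policy, and the job data from the scheduler's accounting database.

```python
from dataclasses import dataclass

# Hypothetical relative weights per GPU model, normalized so a T4 counts as 1.0.
# These numbers are placeholders, not measured or official figures.
GPU_WEIGHTS = {
    "A100-SXM4": 8.0,
    "A100-PCIe": 6.5,
    "V100-NVLink": 4.0,
    "V100-PCIe": 3.5,
    "T4": 1.0,
}

@dataclass
class JobRecord:
    """Minimal job record; real fields would be pulled from the scheduler's
    accounting data (e.g. OAR job resource assignments and timestamps)."""
    job_id: int
    gpu_model: str
    gpu_count: int
    walltime_seconds: int

def weighted_gpu_seconds(job: JobRecord) -> float:
    """Charge = GPUs allocated x wall-clock seconds x model weight."""
    weight = GPU_WEIGHTS.get(job.gpu_model, 1.0)  # unknown models fall back to 1.0
    return job.gpu_count * job.walltime_seconds * weight

if __name__ == "__main__":
    # Two example jobs of equal duration on very different hardware.
    jobs = [
        JobRecord(job_id=101, gpu_model="A100-SXM4", gpu_count=4, walltime_seconds=3600),
        JobRecord(job_id=102, gpu_model="T4", gpu_count=1, walltime_seconds=3600),
    ]
    for job in jobs:
        print(f"job {job.job_id}: {weighted_gpu_seconds(job):.0f} weighted GPU-seconds")
```

The open questions for me are how to pick defensible weights for such different cards and whether this should live in the scheduler's accounting layer or in a post-processing step over the job logs.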
Thanks!