What is the difference between Allocated GPUs and GPU quota in runAI?
I'm trying to find out if it's possible to run a Windows Server with one GPU which is shared between all RDP clients so that people could
- create a session on the server
- start some program with a UI which needs GPU acceleration
- disconnect afterwards while the program stays running and gets full acceleration
- later reconnect to the session
Maybe that's an unusual use case, because most of what I can find about Windows Server and GPUs seems to be about virtualization, e.g. here, where it's even mentioned that
if your workload runs directly on physical Windows Server hosts, then you have no need for graphics virtualization; your apps and services already have access to the GPU capabilities and APIs natively supported in Windows Server
which might indicate that it is possible.
I've read about RemoteFX and GPU partitioning, e.g. here, but again this seems to apply only to virtualization, and I don't care how fast RDP updates the remote screen as long as the running programs get full acceleration.
Am I searching for the wrong things? Is this even possible?
If it's possible, how would it impact performance when the session is connected and when it's disconnected?
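To test this empirically, my plan is to run something like the following inside the session and compare the readings while connected and after disconnecting. This is only a minimal sketch: it assumes an NVIDIA GPU with nvidia-smi on the PATH, and the log path is just a placeholder.

```python
import subprocess
import time

# Log GPU visibility and utilization once a minute so readings taken while
# the RDP session is connected can be compared with readings taken after
# disconnecting. Assumes nvidia-smi is on PATH; the log path is a placeholder.
LOG_PATH = r"C:\temp\gpu_check.log"
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,memory.used",
    "--format=csv,noheader",
]

while True:
    try:
        reading = subprocess.check_output(QUERY, text=True).strip()
    except (OSError, subprocess.CalledProcessError) as exc:
        reading = f"nvidia-smi failed: {exc}"
    with open(LOG_PATH, "a") as log:
        log.write(f"{time.ctime()}  {reading}\n")
    time.sleep(60)
```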
I've been trying to start my existing GCP VM, which has an NVIDIA T4 GPU attached, for almost a month now. It worked fine before, but now I constantly get the error message:
"The zone '***' does not have enough resources available to fulfill the request. Try a different zone, or try again later."
This indicates that no GPUs are available.
Starting any VM with a GPU in another zone does not work either, nor can I start other existing GPU VMs in other projects. Starting VMs without any GPUs attached works perfectly fine, however.
All evidence points towards GCP simply not having any GPUs available, but I find it hard to believe that this has been the case for almost a month.
Any insight into this?
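In case it's relevant, this is roughly how I've been retrying the start. It's just a sketch that shells out to gcloud; the instance and zone names are placeholders for my actual ones.

```python
import subprocess
import time

# Keep retrying the start until capacity frees up in the zone.
# Instance and zone names below are placeholders.
INSTANCE = "my-t4-vm"
ZONE = "europe-west4-a"

while True:
    result = subprocess.run(
        ["gcloud", "compute", "instances", "start", INSTANCE, f"--zone={ZONE}"],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        print("Instance started.")
        break
    # Usually fails with the resource-availability message quoted above.
    print(f"Start failed, retrying in 10 minutes:\n{result.stderr.strip()}")
    time.sleep(600)
```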
Do I need a GPU on a text- and console-only server? By no GPU I mean no iGPU and no dGPU. I'm going to be using SSH, so I don't need a display output.
I'm using Linux, but the OS shouldn't affect the answer.
We have an API in AWS running on a GPU instance that does inference. We have an auto-scaler set up with a minimum and maximum number of instances, but we aren't sure which metric (GPU/CPU usage, RAM usage, average latency, etc.) or combination of metrics should be used to decide when a new instance needs to be launched to keep up with incoming requests.
Are there best practices regarding which metrics to use in this scenario? Inference in our case is very GPU-intensive.
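One option we were considering is publishing GPU utilization as a custom CloudWatch metric and scaling on that. Below is a rough sketch of the idea; the namespace, metric name, and instance id are placeholders, and it assumes nvidia-smi is available on the instance.

```python
import subprocess
import time

import boto3

# Publish GPU utilization as a custom CloudWatch metric so the auto-scaler
# can scale on it instead of (or alongside) CPU usage.
# Namespace, metric name, and instance id are placeholders.
cloudwatch = boto3.client("cloudwatch")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder; would come from instance metadata

def gpu_utilization_percent() -> float:
    """Read current GPU utilization (%) from nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    return float(out.strip().splitlines()[0])

while True:
    cloudwatch.put_metric_data(
        Namespace="InferenceAPI",
        MetricData=[{
            "MetricName": "GPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
            "Value": gpu_utilization_percent(),
            "Unit": "Percent",
        }],
    )
    time.sleep(60)
```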