We have an API in AWS backed by a GPU instance that does inference. We have an auto-scaler set up with a minimum and maximum number of instances, but we aren't sure which metric (GPU/CPU usage, RAM usage, average latency, etc.) or combination of metrics should be used to determine when a new instance needs to be launched to keep up with incoming requests.
Are there best practices regarding which metrics should be used in this scenario? Inference in our case is very GPU-intensive.
Amazon CloudWatch Agent adds Support for NVIDIA GPU Metrics
https://aws.amazon.com/about-aws/whats-new/2022/02/amazon-cloudwatch-agent-nvidia-metrics/
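Once the agent is publishing GPU metrics, you can point your Auto Scaling group at one of them with a target-tracking policy. Here's a minimal boto3 sketch, assuming the agent publishes `nvidia_smi_utilization_gpu` into the `CWAgent` namespace and is configured to aggregate it by `AutoScalingGroupName`; the group name and the 70% target are placeholders to adapt:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target-tracking policy: keep average GPU utilization across the group near 70%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="gpu-inference-asg",  # hypothetical ASG name
    PolicyName="gpu-utilization-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            # Assumed metric name from the CloudWatch agent's NVIDIA GPU support
            "MetricName": "nvidia_smi_utilization_gpu",
            "Namespace": "CWAgent",
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": "gpu-inference-asg"}
            ],
            "Statistic": "Average",
        },
        # Scale out when average GPU utilization exceeds the target,
        # scale back in when it drops well below; tune to your latency SLO.
        "TargetValue": 70.0,
    },
)
```

Note this assumes the agent config appends and aggregates the `AutoScalingGroupName` dimension (via `append_dimensions`/`aggregation_dimensions`), so the metric exists at the group level rather than only per instance.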
So based on these metrics, you'd probably want to monitor: