I'd like to get a feel for how 'efficient' my deployments/jobs are at consuming the resources they request, i.e. if a job that only peaks at 1 CPU ends up requesting 320, I'd like a dashboard/alert/metric to chase down rogue pods that meet this criterion.
Does such a thing exist? The closest I've found is Grafana + PromQL, but ideally a ready-made dashboard or other solution would be great.
I'm running this on an on-premise Kubernetes cluster.
So there are many possible approaches here, and Grafana is one of them.
First of all, you could use resource requests and/or limits. Requests are the amount of CPU or memory reserved for the containers in a pod, while limits cap how much those containers are allowed to use; the gap between requests and actual usage is exactly what you want to surface. You can also use ResourceQuotas to constrain resource usage per namespace. Both are described in the official Kubernetes documentation on managing container resources and resource quotas.
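As a minimal sketch of setting these imperatively (the deployment name my-app and namespace team-a are just hypothetical placeholders):

    # Set CPU/memory requests and limits on a deployment (my-app is a placeholder name)
    kubectl set resources deployment my-app \
      --requests=cpu=100m,memory=128Mi \
      --limits=cpu=500m,memory=512Mi

    # Cap the total requests/limits allowed in a namespace (team-a is a placeholder)
    kubectl create quota team-a-quota -n team-a \
      --hard=requests.cpu=10,requests.memory=20Gi,limits.cpu=20,limits.memory=40Gi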
This would be for controlling the resources, which is also important; if you want to go further, there are also cluster autoscalers.
Strictly for monitoring you can use different tools: as you already mention there is Grafana, and there is also the EFK stack (Elasticsearch, Fluentd, Kibana), although that is geared more towards logs than resource metrics. In GKE there is good integration with Stackdriver for monitoring the cluster and its components, and you can achieve something similar in AWS; on an on-premise cluster you will typically run Prometheus + Grafana yourself, which can answer your requests-vs-usage question directly (see the sketch below).
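As a sketch of the kind of query you would put in a Grafana panel or alert, assuming Prometheus scrapes both the kubelet/cAdvisor metrics (container_cpu_usage_seconds_total) and kube-state-metrics (kube_pod_container_resource_requests; older kube-state-metrics versions call it kube_pod_container_resource_requests_cpu_cores), this gives the ratio of actual CPU usage to requested CPU per pod, so a pod requesting 320 cores but using 1 would show up around 0.003:

    # actual CPU usage divided by requested CPU, per pod
    sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!="", container!="POD"}[5m]))
      /
    sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})

Exact metric and label names depend on your kube-state-metrics and kubelet versions, so treat this as a starting point rather than a drop-in dashboard.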
There are also tools built into Kubernetes. For example:

    kubectl top pod --all-namespaces

will show you per-pod usage with columns like:

    NAMESPACE   NAME   CPU(cores)   MEMORY(bytes)

I also wrote a StackOverflow answer about a similar topic that may be helpful; you can find it here.
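kubectl top only shows actual usage, not requests, so to eyeball the gap you can pull the requests out separately; a small sketch, assuming metrics-server is installed (it is what backs kubectl top):

    # Actual per-pod usage (needs metrics-server)
    kubectl top pod --all-namespaces

    # Requested CPU/memory per pod, for comparison
    kubectl get pods --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'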
And there is also cAdvisor, which is built into the kubelet and exposes per-container usage metrics (it is where the container_* metrics scraped by Prometheus come from).
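If you want to peek at the raw cAdvisor data before wiring up Prometheus, you can read it through the API server's node proxy; a sketch, with <node-name> as a placeholder for one of your nodes:

    # List your nodes to pick a <node-name>
    kubectl get nodes

    # Dump that node's cAdvisor metrics (Prometheus text format)
    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" | grep container_cpu_usage_seconds_total | head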
Here is an interesting article about how to approach it. I wanted to test one more thing related to collecting these metrics; I will come back if I find something valuable.