Open
Description
From Shaun:
- A few things we need to ensure are covered:
- GPUs: utilisation, memory utilisation, errors, temperature, power consumption, memory bandwidth, clock speed, latency
- Bare-metal hosts: memory, CPU, latency, PCIe bandwidth and utilisation, network bandwidth and utilisation, errors, power consumption
- We also need to ensure that we are monitoring the metrics for the GPU Operators in k8s.
TODO:
- Most of these metrics were already added in "feat: add nvidia gpu monitoring dashboard to grafana" #257, but some are missing.
- e.g. the "Errors" metric is not found there.
- Find and add all missing metrics.
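One way to find the gaps is to diff a checklist of the categories above against the metric names the dashboard actually queries. The sketch below is only a starting point and rests on assumptions: it assumes the GPU metrics come from NVIDIA's dcgm-exporter (the `DCGM_FI_*` names are DCGM field names, but this issue does not say which exporter is in use), the category-to-metric mapping in `REQUIRED` is illustrative and incomplete, and `find_missing` is a hypothetical helper, not something from #257.

```python
import re

# ASSUMPTION: illustrative mapping from the categories listed in this issue to
# candidate metric names. The DCGM_FI_* names assume NVIDIA's dcgm-exporter;
# adjust to whatever exporter the cluster really runs.
REQUIRED = {
    "gpu utilisation": ["DCGM_FI_DEV_GPU_UTIL"],
    "gpu memory utilisation": ["DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_MEM_COPY_UTIL"],
    "gpu errors": ["DCGM_FI_DEV_XID_ERRORS"],
    "gpu temperature": ["DCGM_FI_DEV_GPU_TEMP"],
    "gpu power consumption": ["DCGM_FI_DEV_POWER_USAGE"],
    "gpu clock speed": ["DCGM_FI_DEV_SM_CLOCK", "DCGM_FI_DEV_MEM_CLOCK"],
}


def find_missing(required, dashboard_text):
    """Return the sorted categories for which no candidate metric appears.

    `dashboard_text` is the raw text of a dashboard export (for example the
    Grafana dashboard JSON), searched for metric names as whole tokens.
    """
    # Split the text into identifier-like tokens so that a metric name only
    # matches as a whole word, not as a prefix of a longer name.
    tokens = set(re.findall(r"[A-Za-z_:][A-Za-z0-9_:]*", dashboard_text))
    return sorted(
        category
        for category, names in required.items()
        if not any(name in tokens for name in names)
    )
```

To use it, read the dashboard JSON from the repo into a string and call `find_missing(REQUIRED, text)`; each category it returns has no matching query in the dashboard and needs a panel (or a corrected mapping). Bare-metal and GPU Operator categories can be added to `REQUIRED` the same way once the exporters for them are confirmed.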
Status
Todo