Metrics for AI clusters #286

@denis-ryzhkov

Description

From Shaun:

  • A few things we need to ensure are covered:
    • GPUs - utilisation, memory utilisation, errors, temperature, power consumption, memory bandwidth, clock speed, latency
    • Bare-metal hosts - memory, CPU, latency, PCIe bandwidth and utilisation, network bandwidth and utilisation, errors, power consumption
  • We also need to monitor the metrics for the GPU Operators in k8s.
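
One way to track whether the GPU list is covered is a small checklist that maps each category to candidate metric names and reports the gaps. The sketch below is only an illustration. It assumes the GPUs are NVIDIA and are scraped through DCGM, which this issue does not state. The field names are DCGM field identifiers and should be verified against the exporter configuration actually deployed. The helper `uncovered` and both constants are hypothetical names introduced here.

```python
# Required GPU metric categories, taken from the list in this issue.
REQUIRED_GPU_METRICS = [
    "utilisation", "memory utilisation", "errors", "temperature",
    "power consumption", "memory bandwidth", "clock speed", "latency",
]

# ASSUMPTION: candidate DCGM field names per category (NVIDIA + DCGM is not
# confirmed by the issue). Check each one against the deployed exporter.
CANDIDATE_FIELDS = {
    "utilisation": ["DCGM_FI_DEV_GPU_UTIL"],
    "memory utilisation": ["DCGM_FI_DEV_FB_USED", "DCGM_FI_DEV_FB_FREE"],
    "errors": ["DCGM_FI_DEV_XID_ERRORS"],
    "temperature": ["DCGM_FI_DEV_GPU_TEMP"],
    "power consumption": ["DCGM_FI_DEV_POWER_USAGE"],
    "memory bandwidth": ["DCGM_FI_DEV_MEM_COPY_UTIL"],
    "clock speed": ["DCGM_FI_DEV_SM_CLOCK", "DCGM_FI_DEV_MEM_CLOCK"],
    "latency": [],  # no candidate identified yet; left empty on purpose
}


def uncovered(required, candidates, exported):
    """Return the required categories with no candidate field in `exported`."""
    exported = set(exported)
    return [
        category
        for category in required
        if not exported.intersection(candidates.get(category, []))
    ]


if __name__ == "__main__":
    # Hypothetical set of metric names currently exported by the cluster.
    exported_now = {"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_GPU_TEMP"}
    for category in uncovered(REQUIRED_GPU_METRICS, CANDIDATE_FIELDS, exported_now):
        print(f"not covered: {category}")
```

The same structure could be repeated for the bare-metal host list and the GPU Operator metrics once candidate metric names for those are agreed.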

TODO:
