@ramessesii2
Contributor

What

This PR introduces a Grafana dashboard for monitoring NVIDIA GPU metrics via the DCGM Exporter.

How

The dashboard is based on the official NVIDIA dcgm-exporter dashboard, with a few additional template keys inherited from other existing dashboards in the repo.
The dashboard assumes that the DCGM Exporter metrics are already being scraped. This requires a ServiceMonitor resource that points to the DCGM Exporter's metrics endpoint.

Note: While the NVIDIA GPU Operator deploys the DCGM Exporter, it does not create a ServiceMonitor resource for it by default.
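
For reference, a minimal sketch of such a ServiceMonitor is shown below. It is not part of this PR. The namespace (`gpu-operator`), the Service label (`app: nvidia-dcgm-exporter`), the port name (`gpu-metrics`), and the scrape interval are assumptions; check them against the Service that actually exposes the DCGM Exporter in your cluster before applying.

```yaml
# Hypothetical ServiceMonitor for the DCGM Exporter (illustrative only).
# All names, labels, and the port name below are assumptions; verify them with
# `kubectl get svc -n <namespace> --show-labels` before use.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter      # assumed name
  namespace: gpu-operator         # assumed namespace of the exporter Service
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter   # assumed label on the exporter Service
  endpoints:
    - port: gpu-metrics           # assumed name of the Service's metrics port
      path: /metrics
      interval: 30s               # assumed scrape interval
```

Because the selector matches Services by label within the ServiceMonitor's own namespace, a mismatch in either the label or the port name means the dashboard panels stay empty even though the exporter is running.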

@ramessesii2 ramessesii2 marked this pull request as ready for review May 7, 2025 09:02
@denis-ryzhkov denis-ryzhkov changed the title add nvidia gpu monitoring dashboard to grafana feat: add nvidia gpu monitoring dashboard to grafana May 7, 2025
@denis-ryzhkov denis-ryzhkov merged commit be2d0e0 into k0rdent:main May 7, 2025
5 of 6 checks passed
@github-project-automation github-project-automation bot moved this to Done in k0rdent May 7, 2025
AndrejsPon00 added a commit to AndrejsPon00/kof that referenced this pull request May 28, 2025
* chore: kof v0.3.0 pre-release (k0rdent#236)

* fix: "Insufficient cpu" for istio-regional (k0rdent#237)

* feat: Added ServiceMonitor for alertmanager (k0rdent#239)

* fix: `Label does not exist` on `removeLabel` when multiple triage CI jobs are triggered (k0rdent#245)

* feat: Added `alertmanager` Grafana datasource (k0rdent#243)

* Added `alertmanager` Grafana datasource.
* Deleted `agg-prometheus` datasource as a duplicate of `victoriametrics` datasource.
* Fixed minor issues.
* Workaround for CI error `deployments.apps "addon-controller" not found`

* feat: add control plane monitoring with otel mtls setup (k0rdent#244)

* fix: `mapping values are not allowed in this context` in `ValuesFromHelm` (k0rdent#250)

* chore: Re-releasing kof v0.3.0 to include latest fixes (k0rdent#252)

* Migrate to v1beta1 (k0rdent#256)

* feat: add nvidia gpu monitoring dashboard to grafana (k0rdent#257)

Signed-off-by: Satyam Bhardwaj <[email protected]>

* chore: remove KCM upgrade from KOF upgrade test (k0rdent#260)

* feat: adopting kube-api-server service monitor from opentelemetry-kube-stack (k0rdent#259)

* feat: extend resources customization for grafana and vmcluster (k0rdent#263)

* chore: bump go version to upcomming upgrade (k0rdent#269)

Bump Go version to prevent CI failure due to an outdated version

* chore: bump go version for upcomming upgrade (k0rdent#271)

* fix: istio remote secret creation (k0rdent#270)

* feat: add cluster annotation to customize promxy and datasource http config (k0rdent#276)

* feat: move to victoria-log-cluster (k0rdent#274)

* chore: bump helm charts versions to v1.0.0 (k0rdent#277)

* test: Debug `"helm repo add" requires 2 arguments` in `workflows/release_charts` (k0rdent#280)

* fix: `"helm repo add" requires 2 arguments` in `workflows/release_charts` exposed by fix in yq v4.45.4 (k0rdent#282)

* fix: support modification of resources for all VM services (k0rdent#279)

closes k0rdent#261

misc: fix replicationFactor keys location as well

* chore: kof 1.0.0-rc2 using kcm 1.0.0-rc1 and kcm api v1beta1 (k0rdent#284)

* chore: kof 1.0.0 using kcm 1.0.0 (k0rdent#285)

* feat: add kof-operator internal observability UI

* Fix golang test

* Add toast library

* Remove mutex

* Remove unnecessary target fetch

* Add loading and empty data layout

* Fix console error

* Refactor PrometheusTargets provider

* Cleanup

* Add adopted cluster support

* Refactor prometheus handler

* Apply suggested improvements and fixes

* Fix linter issue

* Fix table layout

* Backend refactoring

* Frontend refactoring

* Fix linter issues

* Add linter check to workflow

* Fix bad date handling

* Add more context to logs

---------

Signed-off-by: Satyam Bhardwaj <[email protected]>
Co-authored-by: Denis Ryzhkov <[email protected]>
Co-authored-by: Aleksei Larkov <[email protected]>
Co-authored-by: Satyam Bhardwaj <[email protected]>
Co-authored-by: Vladimir Kuklin <[email protected]>