[IMPROVEMENT] Make Kubernetes Metrics Server (metrics.k8s.io) integration toggleable #13011

Description

@hookak

Is your improvement request related to a feature? Please describe (👍 if you like this request)

Several Longhorn Prometheus metrics (CPU/memory usage of nodes, instance manager pods, and manager pods) depend on the Kubernetes Metrics Server API (metrics.k8s.io). However, on clusters where metrics-server is not installed, longhorn-manager keeps issuing these API calls without any awareness that the API is unavailable.

Problems

  • On every Prometheus scrape (typically every 30s), the NodeMetricses().List() and PodMetricses().List() calls fail, and each failure produces an "Error during scrape" warning. These warnings accumulate continuously in the logs.
  • While each failure is graceful (collectors log and skip), the result is persistent log noise and unnecessary API traffic in production environments without metrics-server.
  • There is currently no user-facing way to disable this behavior.

Call sites (4 locations)

  • metrics_collector/node_collector.go: collectNodeActualCPUMemoryUsage
  • metrics_collector/instance_manager_collector.go: collectActualUsage
  • metrics_collector/manager_collector.go: Collect
  • controller/setting_controller.go: collectResourceUsage (invoked by the upgrade checker when sending cluster usage along with the version check request)

Describe the solution you'd like

  • Setting name: kubernetes-metrics-server-metrics-enabled
  • Type: Bool, Default: true (preserves existing behavior / backward compatible)
  • Category: General
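A minimal Go sketch of how the proposed setting and its default could be modeled. settingDefinition, metricsServerMetricsEnabledSetting and metricsServerMetricsEnabled are hypothetical names, not Longhorn's actual setting types. The point it illustrates is that an unset value resolves to the default (true), so upgraded clusters keep today's behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

// settingDefinition is a simplified, hypothetical stand-in for a Longhorn
// setting definition; it only carries the fields named in this proposal.
type settingDefinition struct {
	Name     string
	Type     string
	Default  string
	Category string
}

// metricsServerMetricsEnabledSetting mirrors the proposed setting:
// a Bool in the General category that defaults to true.
var metricsServerMetricsEnabledSetting = settingDefinition{
	Name:     "kubernetes-metrics-server-metrics-enabled",
	Type:     "bool",
	Default:  "true",
	Category: "General",
}

// metricsServerMetricsEnabled resolves the effective value of the setting.
// An empty value falls back to the default (true); anything that is not a
// valid boolean is returned as an error instead of being silently coerced.
func metricsServerMetricsEnabled(value string) (bool, error) {
	if value == "" {
		value = metricsServerMetricsEnabledSetting.Default
	}
	enabled, err := strconv.ParseBool(value)
	if err != nil {
		return false, fmt.Errorf("invalid value %q for setting %s: %w",
			value, metricsServerMetricsEnabledSetting.Name, err)
	}
	return enabled, nil
}
```

Rejecting invalid input (rather than treating it as true) is a design choice in this sketch; a real implementation would more likely validate the value when the setting is updated.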

When this setting is false, all four call sites above short-circuit with an early return and skip the metrics-server API call. As a result, only the following 6 metrics stop being exposed:

  • longhorn_node_cpu_usage_millicpu, longhorn_node_memory_usage_bytes
  • longhorn_instance_manager_cpu_usage_millicpu, longhorn_instance_manager_memory_usage_bytes
  • longhorn_manager_cpu_usage_millicpu, longhorn_manager_memory_usage_bytes

All other metrics that do not depend on metrics-server (capacity, resource requests, storage, status, counts, gRPC connections, etc.) continue to be exposed unchanged. This is a partial, surgical disable rather than turning off entire collectors.
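The following Go sketch illustrates the proposed early return in a node collector: metrics that do not need metrics-server are always emitted, and the usage metrics are skipped without any API call when the setting is false. It is not longhorn-manager code. nodeCollector, nodeUsage, sample, metricsServerEnabled and listNodeUsage are hypothetical stand-ins (the real collectors use the Prometheus client and the metrics.k8s.io clientset), and the capacity metric name is used for illustration.

```go
package main

import "log"

// nodeUsage is a simplified, hypothetical view of one metrics.k8s.io NodeMetrics item.
type nodeUsage struct {
	Node        string
	MilliCPU    int64
	MemoryBytes int64
}

// sample is a stand-in for a single Prometheus metric sample.
type sample struct {
	Name  string
	Node  string
	Value float64
}

// nodeCollector is a hypothetical, trimmed-down node collector.
type nodeCollector struct {
	node             string
	capacityMilliCPU int64
	// metricsServerEnabled reports the current value of the
	// kubernetes-metrics-server-metrics-enabled setting (assumed accessor).
	metricsServerEnabled func() bool
	// listNodeUsage stands in for the NodeMetricses().List() call.
	listNodeUsage func() ([]nodeUsage, error)
}

// collect always emits metrics that do not depend on metrics-server and
// appends the actual-usage metrics only when the setting allows it.
func (c *nodeCollector) collect() []sample {
	out := []sample{{
		Name:  "longhorn_node_cpu_capacity_millicpu",
		Node:  c.node,
		Value: float64(c.capacityMilliCPU),
	}}
	return append(out, c.collectActualUsage()...)
}

// collectActualUsage returns before touching metrics.k8s.io when the
// setting is false, so no API request is sent and nothing is logged.
func (c *nodeCollector) collectActualUsage() []sample {
	if !c.metricsServerEnabled() {
		return nil
	}
	usages, err := c.listNodeUsage()
	if err != nil {
		// Existing graceful behavior: log and skip the usage metrics.
		log.Printf("Error during scrape: %v", err)
		return nil
	}
	var out []sample
	for _, u := range usages {
		if u.Node != c.node {
			continue
		}
		out = append(out,
			sample{Name: "longhorn_node_cpu_usage_millicpu", Node: u.Node, Value: float64(u.MilliCPU)},
			sample{Name: "longhorn_node_memory_usage_bytes", Node: u.Node, Value: float64(u.MemoryBytes)},
		)
	}
	return out
}
```

The same guard would apply at each of the four call sites; placing it before the List() call, rather than filtering results afterwards, is what removes both the API traffic and the log noise.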

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Labels

  • area/monitoring: System (cluster, node) or volume metrics, logs, stats
  • kind/improvement: Request for improvement of existing function
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
  • require/doc: Require updating the longhorn.io documentation
  • require/important-note: Upgrade, Deprecation, Important notes

Projects

Status
Closed
