Support PSI metrics #9608
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9608      +/-   ##
==========================================
+ Coverage   64.27%   67.01%   +2.73%
==========================================
  Files         205      206       +1
  Lines       28826    28920      +94
==========================================
+ Hits        18529    19380     +851
+ Misses       8649     7883     -766
- Partials     1648     1657       +9
Walkthrough
Adds PSI (Pressure Stall Information) support: a new "pressure" metric option, PSI data types and fields in cgroup stats, six new pressure metric descriptors and their registration, a Linux generator and a non-Linux stub, server integration, a docs/config update, and tests verifying the six new metrics.
sequenceDiagram
autonumber
participant Cgroup as Cgroup subsystem
participant StatsCol as Stats Collector\n(internal/config/cgmgr)
participant PSI_Types as PSI Types\n(PSIStats/PSIData)
participant Generator as Pressure Generator\n(internal/lib/stats)
participant Server as Metrics Server\n(internal/lib/stats)
participant Registry as Metrics Registry\n(descriptors/metrics)
Cgroup->>StatsCol: emit libctr PSI stats (CPU, Memory, IO)
StatsCol->>PSI_Types: cgroupPSIStats() → populate PSI fields
PSI_Types->>Generator: containerStats with CPU/Memory/IO.PSI
Generator->>Registry: emit 6 counter metrics (stalled/waiting per resource)
Generator-->>Server: return []*Metric
Server->>Registry: include metrics in response
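The conversion step in the diagram (library PSI stats mapped onto internal PSI types) could look roughly like the sketch below. All type and function names here are hypothetical stand-ins, not the PR's actual definitions; the only behavior taken from the review is that a nil input yields a nil result.

```go
package main

import "fmt"

// libPSIData and libPSIStats are hypothetical stand-ins for the PSI types
// exposed by the cgroups library.
type libPSIData struct {
	Avg10, Avg60, Avg300 float64
	Total                uint64
}

type libPSIStats struct {
	Some, Full libPSIData
}

// PSIData and PSIStats are hypothetical internal mirrors of the library types.
type PSIData struct {
	Avg10, Avg60, Avg300 float64
	Total                uint64
}

type PSIStats struct {
	Some, Full PSIData
}

// convertPSI maps the library representation onto the internal one.
// A nil input stays nil so callers can detect "PSI not available".
func convertPSI(in *libPSIStats) *PSIStats {
	if in == nil {
		return nil
	}
	return &PSIStats{
		// The struct layouts are identical, so a plain conversion suffices.
		Some: PSIData(in.Some),
		Full: PSIData(in.Full),
	}
}

func main() {
	fmt.Println(convertPSI(nil) == nil)
	out := convertPSI(&libPSIStats{Some: libPSIData{Total: 42}})
	fmt.Println(out.Some.Total)
}
```

Returning nil rather than a zero-valued struct keeps "no PSI data" distinguishable from "PSI reports zero pressure".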
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
test/cri-metrics.bats (1)
528-563: Consider skipping test on systems without PSI support.
The test looks well-structured and follows existing patterns. However, PSI (Pressure Stall Information) requires kernel support and is typically only available on cgroup v2 with PSI enabled. The test may fail on cgroup v1 systems or systems where PSI is disabled.
Consider adding a skip condition similar to other cgroup v2-specific tests:
 @test "container pressure metrics" {
+    if ! is_cgroup_v2; then
+        skip "PSI metrics require cgroup v2"
+    fi
+    # Also check if PSI is enabled
+    if [[ ! -f /proc/pressure/cpu ]]; then
+        skip "PSI is not enabled on this system"
+    fi
+
     CONTAINER_ENABLE_METRICS="true" CONTAINER_METRICS_PORT=$(free_port) setup_crio
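The same availability probe can be written in Go. This is a minimal sketch that only mirrors the file-existence check suggested above; the helper name psiAvailable is an assumption, and a present file does not by itself guarantee that reading it succeeds on every kernel configuration.

```go
package main

import (
	"fmt"
	"os"
)

// psiAvailable is a hypothetical helper: it reports whether the given PSI
// file exists and is not a directory. It does not try to read the file.
func psiAvailable(path string) bool {
	info, err := os.Stat(path)
	if err != nil {
		return false
	}
	return !info.IsDir()
}

func main() {
	for _, resource := range []string{"cpu", "memory", "io"} {
		path := "/proc/pressure/" + resource
		fmt.Printf("%s: %v\n", path, psiAvailable(path))
	}
}
```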
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/crio.conf.5.md (1 hunks)
- internal/config/cgmgr/stats_linux.go (6 hunks)
- internal/lib/stats/descriptors.go (1 hunks)
- internal/lib/stats/metrics.go (1 hunks)
- internal/lib/stats/pressure_metrics.go (1 hunks)
- internal/lib/stats/stats_server_linux.go (1 hunks)
- pkg/config/config.go (1 hunks)
- test/cri-metrics.bats (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Edit .md source files for documentation, not generated files
Files:
docs/crio.conf.5.md
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use interface-based design and dependency injection patterns in Go code
Propagate context.Context through function calls in Go code
Use fmt.Errorf with %w for error wrapping in Go code
Use logrus with structured fields for logging in Go code
Add comments explaining 'why' not 'what' in Go code
Use platform-specific file naming: *_{linux,freebsd}.go for platform-dependent code
Files:
pkg/config/config.go
internal/lib/stats/pressure_metrics.go
internal/lib/stats/metrics.go
internal/lib/stats/stats_server_linux.go
internal/lib/stats/descriptors.go
internal/config/cgmgr/stats_linux.go
**/*.bats
📄 CodeRabbit inference engine (AGENTS.md)
Use .bats file extension for BATS integration test files
Files:
test/cri-metrics.bats
🧬 Code graph analysis (5)
internal/lib/stats/pressure_metrics.go (2)
internal/config/cgmgr/stats_linux.go (3)
CPUStats(56-71), MemoryStats(34-54), DiskIOStats(89-93)
vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go (1)
MetricType_COUNTER(679-679)
internal/lib/stats/metrics.go (1)
pkg/config/config.go (1)
PressureMetrics(90-90)
internal/lib/stats/stats_server_linux.go (1)
pkg/config/config.go (1)
PressureMetrics(90-90)
internal/config/cgmgr/stats_linux.go (1)
vendor/github.com/opencontainers/cgroups/stats.go (1)
BlkioStats(151-166)
test/cri-metrics.bats (1)
test/helpers.bash (5)
crictl(86-88), free_port(249-251), setup_crio(145-195), start_crio_no_setup(219-228), set_container_pod_cgroup_root(626-640)
🪛 GitHub Check: build-freebsd
internal/lib/stats/pressure_metrics.go
[failure] 61-61:
blkio.PSI undefined (type *cgmgr.DiskIOStats has no field or method PSI)
[failure] 52-52:
blkio.PSI undefined (type *cgmgr.DiskIOStats has no field or method PSI)
[failure] 43-43:
memory.PSI undefined (type *cgmgr.MemoryStats has no field or method PSI)
[failure] 34-34:
memory.PSI undefined (type *cgmgr.MemoryStats has no field or method PSI)
[failure] 25-25:
cpu.PSI undefined (type *cgmgr.CPUStats has no field or method PSI)
[failure] 16-16:
cpu.PSI undefined (type *cgmgr.CPUStats has no field or method PSI)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: unit / amd64 / root
- GitHub Check: build
- GitHub Check: build static / s390x
- GitHub Check: unit / amd64 / rootless
- GitHub Check: build static / arm64
- GitHub Check: security-checks
- GitHub Check: build static / amd64
- GitHub Check: build static / ppc64le
- GitHub Check: unit / arm64 / root
- GitHub Check: docs
- GitHub Check: codeql-build
- GitHub Check: lint
- GitHub Check: binaries / arm64
- GitHub Check: binaries / amd64
🔇 Additional comments (6)
pkg/config/config.go (1)
78-105: LGTM!
The new PressureMetrics constant follows the established naming pattern and is correctly added to AvailableMetrics. The validation in StatsConfig.Validate() will properly handle this new metric type.
docs/crio.conf.5.md (1)
592-593: LGTM!
Documentation correctly updated to include "pressure" in the available metrics list, consistent with the code changes.
internal/lib/stats/stats_server_linux.go (1)
313-316: LGTM!
The new PressureMetrics case follows the established pattern for metric generation. The nil check on pressureMetrics before appending is consistent with other metric cases.
internal/lib/stats/metrics.go (1)
119-126: LGTM!
The pressure metrics are correctly registered in availableMetricDescriptors following the established pattern for other metric categories.
internal/config/cgmgr/stats_linux.go (1)
189-208: LGTM!
The cgroupPSIStats helper correctly handles nil input and properly maps the cgroups PSIStats to the internal representation.
test/cri-metrics.bats (1)
48-50: LGTM!
The assertion change to container_last_seen is appropriate since this is an always-on metric, making the test more reliable regardless of which optional metrics are configured.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/config/cgmgr/stats_linux.go (1)
34-53: Fix PSI unit documentation from nanoseconds to microseconds.
The comment on PSIData.Total incorrectly documents the unit as "nanoseconds," but PSI total counters from /proc/pressure/* (per Linux kernel documentation) are reported in microseconds. Update all PSI struct field comments across the file to reflect this:
 type PSIData struct {
-    // Total is total duration for tasks in the cgroup have waited due to congestion.
-    // Unit: nanoseconds.
+    // Total is total duration tasks in the cgroup have waited due to congestion.
+    // Unit: microseconds (as reported by PSI in /proc/pressure/*).
     Total uint64
This applies to PSI field comments in MemoryStats, CPUStats, DiskIOStats, and any other PSI-related struct definitions in the file.
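To make the unit concrete, here is a small sketch that parses one line of a PSI file (format: "some avg10=... avg60=... avg300=... total=...") and treats total as microseconds. The function name parsePSITotal is hypothetical and the sample line is made-up illustration data, not output from a real system.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parsePSITotal is a hypothetical helper that extracts the total= field of a
// PSI line and interprets it as microseconds, the unit the kernel reports.
func parsePSITotal(line string) (time.Duration, error) {
	for _, field := range strings.Fields(line) {
		if !strings.HasPrefix(field, "total=") {
			continue
		}
		value := strings.TrimPrefix(field, "total=")
		usec, err := strconv.ParseUint(value, 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parse PSI total %q: %w", value, err)
		}
		return time.Duration(usec) * time.Microsecond, nil
	}
	return 0, fmt.Errorf("no total= field in %q", line)
}

func main() {
	// Illustrative line only; the numbers are invented.
	line := "some avg10=0.12 avg60=0.05 avg300=0.01 total=1500000"
	total, err := parsePSITotal(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("stall time:", total)
}
```

Carrying the value as a time.Duration makes the unit explicit at every call site instead of relying on a struct comment.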
♻️ Duplicate comments (2)
internal/lib/stats/descriptors.go (1)
298-326: Align pressure metric descriptors’ LabelKeys with the rest of the container metrics.
The CPU pressure descriptors specify LabelKeys: baseLabelKeys, but the memory and IO pressure descriptors omit LabelKeys. All other per-container descriptors in this file consistently define label keys, so these four are inconsistent and may result in missing or mismatched labels for memory/IO pressure metrics.
Recommend adding LabelKeys: baseLabelKeys to keep the label schema uniform:
 containerPressureMemoryStalledSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_memory_stalled_seconds_total",
-    Help: "Total time duration no tasks in the container could make progress due to memory congestion.",
+    Name:      "container_pressure_memory_stalled_seconds_total",
+    Help:      "Total time duration no tasks in the container could make progress due to memory congestion.",
+    LabelKeys: baseLabelKeys,
 }
 containerPressureMemoryWaitingSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_memory_waiting_seconds_total",
-    Help: "Total time duration tasks in the container have waited due to memory congestion.",
+    Name:      "container_pressure_memory_waiting_seconds_total",
+    Help:      "Total time duration tasks in the container have waited due to memory congestion.",
+    LabelKeys: baseLabelKeys,
 }
 containerPressureIOStalledSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_io_stalled_seconds_total",
-    Help: "Total time duration no tasks in the container could make progress due to IO congestion.",
+    Name:      "container_pressure_io_stalled_seconds_total",
+    Help:      "Total time duration no tasks in the container could make progress due to IO congestion.",
+    LabelKeys: baseLabelKeys,
 }
 containerPressureIOWaitingSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_io_waiting_seconds_total",
-    Help: "Total time duration tasks in the container have waited due to IO congestion.",
+    Name:      "container_pressure_io_waiting_seconds_total",
+    Help:      "Total time duration tasks in the container have waited due to IO congestion.",
+    LabelKeys: baseLabelKeys,
 }
This will make pressure metrics align with other container metrics and reduce surprises for metric consumers.
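An inconsistency like this could be caught mechanically. The sketch below assumes a simplified descriptor struct (metricDescriptor is a hypothetical stand-in for types.MetricDescriptor) and invented label values; it simply lists descriptors that carry no label keys.

```go
package main

import "fmt"

// metricDescriptor is a simplified, hypothetical stand-in for the real
// descriptor type; only the fields needed for the check are modeled.
type metricDescriptor struct {
	Name      string
	Help      string
	LabelKeys []string
}

// missingLabelKeys returns the names of descriptors without any label keys.
func missingLabelKeys(descriptors []*metricDescriptor) []string {
	var missing []string
	for _, d := range descriptors {
		if len(d.LabelKeys) == 0 {
			missing = append(missing, d.Name)
		}
	}
	return missing
}

func main() {
	// Placeholder label set; the real baseLabelKeys values are not shown in the review.
	baseLabelKeys := []string{"id", "name", "image"}
	descriptors := []*metricDescriptor{
		{Name: "container_pressure_cpu_stalled_seconds_total", LabelKeys: baseLabelKeys},
		{Name: "container_pressure_memory_stalled_seconds_total"},
	}
	fmt.Println(missingLabelKeys(descriptors))
}
```

A check of this shape would fit in a unit test over the registered descriptors, so a newly added metric without labels fails fast.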
internal/lib/stats/pressure_metrics_linux.go (1)
10-69: Guard against nil PSI data to avoid panics on systems without PSI.
generateContainerPressureMetrics dereferences cpu.PSI, memory.PSI, and blkio.PSI without checks. On kernels/cgroup configs where PSI isn’t available, those pointers can be nil, which would cause a panic when pressure metrics are enabled.
Add an early return when PSI data is missing, so we simply omit pressure metrics instead of crashing:
 func generateContainerPressureMetrics(ctr *oci.Container, cpu *cgmgr.CPUStats, memory *cgmgr.MemoryStats, blkio *cgmgr.DiskIOStats) []*types.Metric {
+    // PSI stats may be nil on systems that don't support pressure stall information.
+    if cpu == nil || cpu.PSI == nil || memory == nil || memory.PSI == nil || blkio == nil || blkio.PSI == nil {
+        return nil
+    }
+
     metrics := []*containerMetric{
This matches the existing pattern where callers already handle a nil slice of metrics gracefully.
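A per-resource variant of this guard, as opposed to the all-or-nothing early return above, might look like the following sketch. Every type and name here is a hypothetical simplification; values are kept as raw microseconds so the sketch takes no position on the unit question.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the PSI stats types.
type psiData struct{ Total uint64 }
type psiStats struct{ Some, Full psiData }
type resourceStats struct{ PSI *psiStats }

// sample is one emitted value: which resource, which kind, raw microseconds.
type sample struct {
	Resource string
	Kind     string
	Micros   uint64
}

// pressureSamples returns the stalled ("full") and waiting ("some") totals
// for one resource, or nil when the stats or their PSI data are missing.
func pressureSamples(resource string, stats *resourceStats) []sample {
	if stats == nil || stats.PSI == nil {
		return nil
	}
	return []sample{
		{Resource: resource, Kind: "stalled", Micros: stats.PSI.Full.Total},
		{Resource: resource, Kind: "waiting", Micros: stats.PSI.Some.Total},
	}
}

func main() {
	var all []sample
	all = append(all, pressureSamples("cpu", &resourceStats{PSI: &psiStats{Some: psiData{Total: 10}}})...)
	all = append(all, pressureSamples("memory", nil)...) // skipped, no panic
	fmt.Println(len(all))
}
```

Guarding each resource separately means one missing PSI source does not suppress metrics for the others.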
🧹 Nitpick comments (1)
test/cri-metrics.bats (1)
528-563: Consider skipping the pressure test when PSI is unavailable.
The new "container pressure metrics" test correctly validates that all six pressure metrics are present and parseable once "pressure" is enabled. However, on hosts without PSI support (no /proc/pressure/{cpu,memory,io} or old kernels / cgroup setups), generateContainerPressureMetrics will legitimately return no metrics and this test will fail.
You may want to guard the test with a quick PSI capability check (e.g., skip if /proc/pressure/cpu is missing) to avoid spurious failures on such environments.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/crio.conf.5.md (1 hunks)
- internal/config/cgmgr/stats_linux.go (6 hunks)
- internal/lib/stats/descriptors.go (1 hunks)
- internal/lib/stats/metrics.go (1 hunks)
- internal/lib/stats/pressure_metrics_linux.go (1 hunks)
- internal/lib/stats/pressure_metrics_unsupported.go (1 hunks)
- internal/lib/stats/stats_server_linux.go (1 hunks)
- pkg/config/config.go (1 hunks)
- test/cri-metrics.bats (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/lib/stats/metrics.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.bats
📄 CodeRabbit inference engine (AGENTS.md)
Use .bats file extension for BATS integration test files
Files:
test/cri-metrics.bats
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use interface-based design and dependency injection patterns in Go code
Propagate context.Context through function calls in Go code
Use fmt.Errorf with %w for error wrapping in Go code
Use logrus with structured fields for logging in Go code
Add comments explaining 'why' not 'what' in Go code
Use platform-specific file naming: *_{linux,freebsd}.go for platform-dependent code
Files:
internal/lib/stats/pressure_metrics_linux.go
internal/lib/stats/pressure_metrics_unsupported.go
internal/lib/stats/descriptors.go
pkg/config/config.go
internal/config/cgmgr/stats_linux.go
internal/lib/stats/stats_server_linux.go
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Edit .md source files for documentation, not generated files
Files:
docs/crio.conf.5.md
🧬 Code graph analysis (5)
test/cri-metrics.bats (1)
test/helpers.bash (1)
crictl(86-88)
internal/lib/stats/pressure_metrics_linux.go (2)
internal/config/cgmgr/stats_linux.go (3)
CPUStats(56-71), MemoryStats(34-54), DiskIOStats(89-93)
vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go (1)
MetricType_COUNTER(679-679)
internal/lib/stats/pressure_metrics_unsupported.go (1)
internal/config/cgmgr/stats_linux.go (3)
CPUStats(56-71), MemoryStats(34-54), DiskIOStats(89-93)
internal/config/cgmgr/stats_linux.go (1)
vendor/github.com/opencontainers/cgroups/stats.go (1)
BlkioStats(151-166)
internal/lib/stats/stats_server_linux.go (1)
pkg/config/config.go (1)
PressureMetrics(90-90)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
- GitHub Check: integration / conmon-rs / crun / amd64
- GitHub Check: integration / conmon / crun / arm64
- GitHub Check: integration / userns / crun / amd64
- GitHub Check: integration / conmon / crun / amd64
- GitHub Check: critest / conmon / crun / arm64
- GitHub Check: critest / conmon-rs / crun / arm64
- GitHub Check: critest / conmon-rs / crun / amd64
- GitHub Check: critest / conmon / crun / amd64
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: lint
- GitHub Check: codeql-build
- GitHub Check: docs
- GitHub Check: build static / amd64
- GitHub Check: build
- GitHub Check: security-checks
- GitHub Check: unit / amd64 / rootless
- GitHub Check: unit / amd64 / root
- GitHub Check: unit / arm64 / root
- GitHub Check: build static / s390x
- GitHub Check: build static / ppc64le
- GitHub Check: build static / arm64
🔇 Additional comments (5)
docs/crio.conf.5.md (1)
589-593: Docs and config stay in sync for included_pod_metrics.
The newly documented "pressure" value matches PressureMetrics in pkg/config/config.go and AvailableMetrics, keeping user-facing config in sync with implementation. No further changes needed here.
pkg/config/config.go (1)
78-105: Pressure metric wiring in config looks consistent.
Adding PressureMetrics = "pressure" and including it in AvailableMetrics, along with the StatsConfig.Validate check, cleanly integrates the new metric type with existing validation and matches the docs.
Also applies to: 2213-2223
internal/lib/stats/stats_server_linux.go (1)
264-320: Pressure metrics integration into container metrics loop is correct.
The new case config.PressureMetrics follows the same pattern as other metric groups, safely appending returned metrics when non-nil and leaving behavior unchanged when pressure metrics are disabled. The actual safety and availability concerns are localized to generateContainerPressureMetrics.
internal/lib/stats/pressure_metrics_unsupported.go (1)
1-15: Non-Linux stub cleanly disables pressure metrics.
The //go:build !linux stub returning nil is appropriate for platforms without PSI support and keeps non-Linux builds compiling without extra conditionals in callers.
test/cri-metrics.bats (1)
32-51: Using container_last_seen in metrics_setup is more robust.
Switching the “container metrics present” assertion to container_last_seen decouples setup from any specific optional metric group like network, which fits better now that tests selectively enable metrics.
Actionable comments posted: 0
♻️ Duplicate comments (2)
internal/config/cgmgr/stats_linux.go (1)
102-105: Update comment to reflect microseconds, not nanoseconds.
The comment on line 104 states "Unit: nanoseconds" but Linux PSI reports the total stall time in microseconds. This should be corrected to avoid confusion.
 type PSIData struct {
     // Total is total duration for tasks in the cgroup have waited due to congestion.
-    // Unit: nanoseconds.
+    // Unit: microseconds.
     Total uint64
internal/lib/stats/descriptors.go (1)
298-326: Align pressure descriptors’ labels and units with the rest of the metrics.
Two separate issues in this block:
1. Missing LabelKeys on memory/IO pressure metrics (repeat from earlier review).
All other descriptors in this file, including the CPU pressure ones, specify LabelKeys (usually baseLabelKeys). The four memory/IO descriptors omit them, which is inconsistent and may break label handling if the stats code assumes the standard label set.
Recommend adding LabelKeys: baseLabelKeys to keep them consistent:
 containerPressureMemoryStalledSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_memory_stalled_seconds_total",
-    Help: "Total time duration no tasks in the container could make progress due to memory congestion.",
+    Name:      "container_pressure_memory_stalled_seconds_total",
+    Help:      "Total time duration no tasks in the container could make progress due to memory congestion.",
+    LabelKeys: baseLabelKeys,
 }
 containerPressureMemoryWaitingSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_memory_waiting_seconds_total",
-    Help: "Total time duration tasks in the container have waited due to memory congestion.",
+    Name:      "container_pressure_memory_waiting_seconds_total",
+    Help:      "Total time duration tasks in the container have waited due to memory congestion.",
+    LabelKeys: baseLabelKeys,
 }
 containerPressureIOStalledSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_io_stalled_seconds_total",
-    Help: "Total time duration no tasks in the container could make progress due to IO congestion.",
+    Name:      "container_pressure_io_stalled_seconds_total",
+    Help:      "Total time duration no tasks in the container could make progress due to IO congestion.",
+    LabelKeys: baseLabelKeys,
 }
 containerPressureIOWaitingSecondsTotal = &types.MetricDescriptor{
-    Name: "container_pressure_io_waiting_seconds_total",
-    Help: "Total time duration tasks in the container have waited due to IO congestion.",
+    Name:      "container_pressure_io_waiting_seconds_total",
+    Help:      "Total time duration tasks in the container have waited due to IO congestion.",
+    LabelKeys: baseLabelKeys,
 }
2. Verify units vs metric names (*_seconds_total).
Linux PSI total fields are in microseconds; if generateContainerPressureMetrics is forwarding those raw values, the *_seconds_total names and “time duration … in seconds” Help text are inaccurate. Either:
- Convert to seconds in the generator (and, if needed, use a float value type), or
- Rename these metrics (and Help) to *_microseconds_total to reflect the real unit.
Given this is a new metrics surface, it’s better to get the units right now before users depend on the current naming.
🧹 Nitpick comments (1)
test/cri-metrics.bats (1)
528-563: Consider adding a guard for PSI availability.
PSI (Pressure Stall Information) requires cgroup v2 and Linux kernel 4.20+. Without a guard, this test may fail on systems with older kernels or cgroup v1-only configurations, causing false negatives in CI/CD pipelines.
Consider adding a skip condition:
 @test "container pressure metrics" {
+    if ! is_cgroup_v2; then
+        skip "PSI metrics require cgroup v2"
+    fi
+
     CONTAINER_ENABLE_METRICS="true" CONTAINER_METRICS_PORT=$(free_port) setup_crio
Alternatively, check for PSI availability directly by testing for cgroup PSI files (e.g., /sys/fs/cgroup/cpu.pressure or equivalent in the container's cgroup path).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/crio.conf.5.md (1 hunks)
- internal/config/cgmgr/stats_linux.go (6 hunks)
- internal/lib/stats/descriptors.go (1 hunks)
- internal/lib/stats/metrics.go (1 hunks)
- internal/lib/stats/pressure_metrics_linux.go (1 hunks)
- internal/lib/stats/pressure_metrics_unsupported.go (1 hunks)
- internal/lib/stats/stats_server_linux.go (1 hunks)
- pkg/config/config.go (1 hunks)
- test/cri-metrics.bats (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/lib/stats/pressure_metrics_unsupported.go
- docs/crio.conf.5.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use interface-based design and dependency injection patterns in Go code
Propagate context.Context through function calls in Go code
Use fmt.Errorf with %w for error wrapping in Go code
Use logrus with structured fields for logging in Go code
Add comments explaining 'why' not 'what' in Go code
Use platform-specific file naming: *_{linux,freebsd}.go for platform-dependent code
Files:
internal/lib/stats/metrics.go
internal/lib/stats/pressure_metrics_linux.go
internal/lib/stats/stats_server_linux.go
internal/lib/stats/descriptors.go
pkg/config/config.go
internal/config/cgmgr/stats_linux.go
**/*.bats
📄 CodeRabbit inference engine (AGENTS.md)
Use .bats file extension for BATS integration test files
Files:
test/cri-metrics.bats
🧬 Code graph analysis (5)
internal/lib/stats/metrics.go (1)
pkg/config/config.go (1)
PressureMetrics(90-90)
internal/lib/stats/pressure_metrics_linux.go (2)
internal/config/cgmgr/stats_linux.go (3)
CPUStats(56-71), MemoryStats(34-54), DiskIOStats(89-93)
vendor/k8s.io/cri-api/pkg/apis/runtime/v1/api.pb.go (1)
MetricType_COUNTER(679-679)
internal/lib/stats/stats_server_linux.go (1)
pkg/config/config.go (1)
PressureMetrics(90-90)
test/cri-metrics.bats (1)
test/helpers.bash (3)
crictl(86-88), free_port(249-251), setup_crio(145-195)
internal/config/cgmgr/stats_linux.go (1)
vendor/github.com/opencontainers/cgroups/stats.go (1)
BlkioStats(151-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (44)
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: integration / conmon / crun / amd64
- GitHub Check: integration / conmon-rs / crun / amd64
- GitHub Check: critest / conmon / crun / amd64
- GitHub Check: integration / userns / crun / amd64
- GitHub Check: critest / conmon-rs / crun / arm64
- GitHub Check: integration / conmon / crun / arm64
- GitHub Check: critest / conmon / crun / arm64
- GitHub Check: critest / conmon-rs / crun / amd64
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: docs
- GitHub Check: codeql-build
- GitHub Check: build
- GitHub Check: unit / arm64 / root
- GitHub Check: security-checks
- GitHub Check: build static / arm64
- GitHub Check: build static / ppc64le
- GitHub Check: build static / s390x
- GitHub Check: unit / amd64 / root
- GitHub Check: unit / amd64 / rootless
- GitHub Check: build static / amd64
- GitHub Check: lint
🔇 Additional comments (10)
internal/config/cgmgr/stats_linux.go (3)
52-53: LGTM! PSI fields properly added to stats structs.The
PSI *PSIStatsfields are consistently added toMemoryStats,CPUStats, andDiskIOStats, enabling pressure stall information to be propagated through the stats pipeline.Also applies to: 69-70, 91-93
189-208: LGTM! Well-structured conversion with proper nil handling.The
cgroupPSIStatshelper correctly handles nil input and maps all PSI fields from the cgroups library type to the internal representation.
180-184: LGTM! PSI data properly wired into stats construction.PSI stats are correctly integrated into the stats assembly:
statsFromLibctrMgrsetsDiskIO.PSIfromBlkioStats.PSIcgroupMemStatssetsPSIfrommemStats.PSIcgroupCPUStatssetsPSIfromcpuStats.PSIAlso applies to: 282-282, 295-295
internal/lib/stats/metrics.go (1)
119-126: LGTM! Pressure metrics descriptors properly registered.The six new pressure metrics are correctly added to
availableMetricDescriptorsfollowing the same pattern as other metric categories. This enables the metrics to be exposed whenpressureis included inincluded_pod_metrics.internal/lib/stats/stats_server_linux.go (1)
313-316: LGTM! Pressure metrics case properly integrated.The new
config.PressureMetricscase follows the established pattern of other metric handlers in the switch statement. The nil check before appending ensures graceful handling when PSI data is unavailable.internal/lib/stats/pressure_metrics_linux.go (3)
10-34: LGTM! CPU pressure metrics with proper nil handling.The CPU pressure metrics are correctly guarded with nil checks for both
cpuandcpu.PSI. The mapping ofFull.Totalto "stalled" andSome.Totalto "waiting" aligns with PSI semantics.
36-80: LGTM! Memory and IO pressure metrics properly implemented.Both memory and IO pressure metrics have appropriate nil guards and follow the same pattern as CPU metrics.
82-82: Verify the "cpu" label passed tocomputeContainerMetrics.The hardcoded
"cpu"label appears semantically inconsistent if this function generates memory and IO pressure metrics. Verify whether this label should reflect the metric type (e.g.,"memory","io") or if the current usage is intentional. Check other callers ofcomputeContainerMetricsto confirm the expected pattern for this parameter.test/cri-metrics.bats (1)
48-48: LGTM! Updated assertion uses always-on metric.Changing from
container_network_receive_bytes_totaltocontainer_last_seenis appropriate sincecontainer_last_seenis an always-on metric that will be present regardless of which metric groups are enabled.pkg/config/config.go (1)
78-105: NewPressureMetricsconfig key is wired correctly into validation.The
"pressure"metric selector and its inclusion inAvailableMetricsalign with existing patterns, andStatsConfig.Validatewill correctly accept it. No issues from the config side.
Signed-off-by: Ayato Tokubi <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
internal/lib/stats/pressure_metrics_linux.go (1)
10-12: Integer division loses sub-second precision.
The conversion from microseconds to seconds using integer division will truncate any value less than 1 second to 0. For example, 500,000 microseconds (0.5 seconds) becomes 0. This could be significant for short-lived containers or low-pressure scenarios where PSI values are small.
The PR description mentions that CRI stats accept uint64 while cAdvisor emits PSI as float64. If precision is important, consider emitting values in microseconds and documenting the unit in the metric name (e.g., container_pressure_cpu_stalled_microseconds_total), or acknowledge this limitation in the code comment.
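The truncation is easy to reproduce in isolation. The sketch below uses hypothetical helper names and only contrasts integer and floating-point division of a microsecond count; it is not the PR's conversion code.

```go
package main

import "fmt"

const microsPerSecond = 1_000_000

// wholeSeconds uses integer division, which drops any sub-second remainder.
func wholeSeconds(usec uint64) uint64 {
	return usec / microsPerSecond
}

// fractionalSeconds keeps the remainder by dividing in floating point.
func fractionalSeconds(usec uint64) float64 {
	return float64(usec) / microsPerSecond
}

func main() {
	for _, usec := range []uint64{500_000, 1_500_000} {
		fmt.Printf("%d us: integer=%d s, float=%.1f s\n",
			usec, wholeSeconds(usec), fractionalSeconds(usec))
	}
}
```

Because the counter is cumulative, the integer form still advances once a full second has accumulated; what it hides is any pressure below that one-second granularity.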
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- docs/crio.conf.5.md (1 hunks)
- internal/config/cgmgr/stats_linux.go (6 hunks)
- internal/lib/stats/descriptors.go (1 hunks)
- internal/lib/stats/metrics.go (1 hunks)
- internal/lib/stats/pressure_metrics_linux.go (1 hunks)
- internal/lib/stats/pressure_metrics_unsupported.go (1 hunks)
- internal/lib/stats/stats_server_linux.go (1 hunks)
- pkg/config/config.go (1 hunks)
- test/cri-metrics.bats (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/config/config.go
- internal/lib/stats/metrics.go
🧰 Additional context used
📓 Path-based instructions (3)
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Edit .md source files for documentation, not generated files
Files:
docs/crio.conf.5.md
**/*.bats
📄 CodeRabbit inference engine (AGENTS.md)
Use .bats file extension for BATS integration test files
Files:
test/cri-metrics.bats
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Use interface-based design and dependency injection patterns in Go code
Propagate context.Context through function calls in Go code
Use `fmt.Errorf` with `%w` for error wrapping in Go code
Use logrus with structured fields for logging in Go code
Add comments explaining 'why' not 'what' in Go code
Use platform-specific file naming: `*_{linux,freebsd}.go` for platform-dependent code
Files:
internal/lib/stats/stats_server_linux.go
internal/lib/stats/pressure_metrics_linux.go
internal/lib/stats/pressure_metrics_unsupported.go
internal/config/cgmgr/stats_linux.go
internal/lib/stats/descriptors.go
🧬 Code graph analysis (4)
test/cri-metrics.bats (1)
test/helpers.bash (5)
crictl (86-88)
free_port (249-251)
setup_crio (145-195)
start_crio_no_setup (219-228)
set_container_pod_cgroup_root (626-640)
internal/lib/stats/stats_server_linux.go (1)
pkg/config/config.go (1)
PressureMetrics (90-90)
internal/lib/stats/pressure_metrics_unsupported.go (1)
internal/config/cgmgr/stats_linux.go (3)
CPUStats (56-71)
MemoryStats (34-54)
DiskIOStats (89-93)
internal/config/cgmgr/stats_linux.go (1)
vendor/github.com/opencontainers/cgroups/stats.go (1)
BlkioStats (151-166)
🔇 Additional comments (16)
docs/crio.conf.5.md (1)
592-592: Documentation update looks good.
The "pressure" value is correctly added to the list of available `included_pod_metrics` options, consistent with the new `PressureMetrics` constant in `pkg/config/config.go`.
internal/lib/stats/stats_server_linux.go (1)
313-316: Integration follows established patterns.
The pressure metrics case correctly follows the same pattern as other metric types (CPU, Memory, DiskIO), including the nil check before appending. The parameters passed match the `PSI` fields available in the respective stat structs.
internal/lib/stats/pressure_metrics_unsupported.go (1)
1-15: Non-Linux stub correctly implemented.
The build constraint `//go:build !linux` ensures this stub is used on non-Linux platforms. The function signature matches the Linux implementation, and returning `nil` is the appropriate behavior since PSI is a Linux-only feature. This follows the coding guidelines for platform-specific file naming.
test/cri-metrics.bats (2)
48-48: Appropriate change to use always-on metric.
Using `container_last_seen` instead of `container_network_receive_bytes_total` is a better choice for the setup assertion since it's an always-on metric that doesn't depend on the `included_pod_metrics` configuration.
528-563: Test covers all six pressure metrics, but may need a skip condition for systems without PSI support.
The test correctly verifies the presence of all six pressure metrics. However, PSI (Pressure Stall Information) requires either cgroup v2, or cgroup v1 with PSI explicitly enabled (kernel parameter `psi=1`). On systems where PSI is not available, this test may fail because the metrics won't be generated (the Linux implementation returns `nil` when `PSI` is `nil`).
Consider adding a skip condition similar to the cgroup v2 checks in other tests. Verify that the helper function `is_cgroup_v2` exists and that checking for `/proc/pressure/cpu` is the appropriate method to detect PSI availability on your target systems.
internal/lib/stats/pressure_metrics_linux.go (4)
is_cgroup_v2exists and that checking for/proc/pressure/cpuis the appropriate method to detect PSI availability on your target systems.internal/lib/stats/pressure_metrics_linux.go (4)
22-43: Nil safety correctly implemented for CPU PSI metrics.The nil checks for both
cpuandcpu.PSIprevent panics on systems without PSI support. This addresses the concern from the previous review.
45-66: Memory PSI metrics properly guarded.Consistent nil safety pattern applied for memory PSI metrics.
68-89: IO PSI metrics properly guarded.Consistent nil safety pattern applied for IO PSI metrics.
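The guard pattern praised in these three comments can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `PSIData`, `PSIStats`, and `CPUStats` are simplified hypothetical stand-ins for the PR's types, and `cpuPressureTotals` is a hypothetical helper name.

```go
package main

import "fmt"

// PSIData, PSIStats and CPUStats are simplified, hypothetical stand-ins for
// the PR's types; only the fields needed to show the guard are modeled.
type PSIData struct {
	Total uint64 // cumulative stall time in microseconds
}

type PSIStats struct {
	Some, Full PSIData
}

type CPUStats struct {
	PSI *PSIStats // nil when the system reports no PSI data
}

// cpuPressureTotals returns the "some" and "full" totals. It reports
// ok=false when either the stats or the PSI pointer is nil, so callers
// never dereference a nil pointer.
func cpuPressureTotals(cpu *CPUStats) (some, full uint64, ok bool) {
	if cpu == nil || cpu.PSI == nil {
		return 0, 0, false
	}
	return cpu.PSI.Some.Total, cpu.PSI.Full.Total, true
}

func main() {
	_, _, ok := cpuPressureTotals(nil)
	fmt.Println(ok) // false: no stats at all

	_, _, ok = cpuPressureTotals(&CPUStats{})
	fmt.Println(ok) // false: stats present, PSI missing

	stats := &CPUStats{PSI: &PSIStats{
		Some: PSIData{Total: 2_000_000},
		Full: PSIData{Total: 1_000_000},
	}}
	some, full, ok := cpuPressureTotals(stats)
	fmt.Println(some, full, ok) // 2000000 1000000 true
}
```

The same two-step check would apply to the memory and IO variants, since each carries its own optional PSI pointer.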
91-91: Verify the hardcoded "cpu" label is intentional.
The third parameter to `computeContainerMetrics` is `"cpu"`, which appears to be a label applied to all pressure metrics. Since this function generates metrics for CPU, memory, and IO pressure, using `"cpu"` as the label for all of them may be confusing or incorrect.
Please verify this is the intended behavior. If each metric type should have its own label, consider restructuring to call `computeContainerMetrics` separately for each resource type.
internal/config/cgmgr/stats_linux.go (6)
52-53: LGTM!
The PSI field additions to MemoryStats, CPUStats, and DiskIOStats are consistent and appropriately use pointer types to handle the optional nature of PSI data.
Also applies to: 69-70, 92-93
95-112: LGTM!
The PSI types are well-structured with clear documentation. The unit comment correctly states microseconds.
189-208: LGTM!
The `cgroupPSIStats` helper correctly handles nil input and performs a clean mapping from the cgroups library types to the local PSI types.
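A nil-tolerant mapping of this kind might look like the sketch below. Everything here is an assumption made for illustration: `libPSIData` and `libPSIStats` stand in for the cgroups library types, the local `PSIData` and `PSIStats` field sets are guesses, and `convertPSI` is a hypothetical name rather than the PR's `cgroupPSIStats`.

```go
package main

import "fmt"

// libPSIData and libPSIStats are hypothetical stand-ins for the cgroups
// library's PSI types; the field set is assumed for this sketch.
type libPSIData struct {
	Avg10, Avg60, Avg300 float64
	Total                uint64 // microseconds
}

type libPSIStats struct {
	Some, Full libPSIData
}

// PSIData and PSIStats stand in for the local types (also assumed).
type PSIData struct {
	Avg10, Avg60, Avg300 float64
	Total                uint64 // microseconds
}

type PSIStats struct {
	Some, Full PSIData
}

// convertPSIData copies one record field by field.
func convertPSIData(in libPSIData) PSIData {
	return PSIData{
		Avg10:  in.Avg10,
		Avg60:  in.Avg60,
		Avg300: in.Avg300,
		Total:  in.Total,
	}
}

// convertPSI returns nil for nil input so that "no PSI data" propagates
// to callers instead of becoming a zero-valued struct.
func convertPSI(in *libPSIStats) *PSIStats {
	if in == nil {
		return nil
	}
	return &PSIStats{
		Some: convertPSIData(in.Some),
		Full: convertPSIData(in.Full),
	}
}

func main() {
	fmt.Println(convertPSI(nil) == nil) // true

	out := convertPSI(&libPSIStats{Some: libPSIData{Avg10: 1.5, Total: 42}})
	fmt.Println(out.Some.Avg10, out.Some.Total, out.Full.Total) // 1.5 42 0
}
```

Returning nil rather than an empty struct lets downstream code distinguish "PSI unavailable" from "PSI available with zero pressure".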
180-186: LGTM!
DiskIO PSI is correctly wired using the `cgroupPSIStats` helper.
280-283: LGTM!
Memory PSI is correctly wired, consistent with the pattern used for other PSI fields.
293-296: LGTM!
CPU PSI is correctly wired, completing the consistent PSI integration across all stat types.
internal/lib/stats/descriptors.go (1)
297-330: LGTM!
The six new pressure metric descriptors are well-defined with consistent naming patterns, appropriate help text, and proper `LabelKeys` assignments. The naming follows Prometheus conventions with the `_seconds_total` suffix for counter metrics.
@cri-o/cri-o-maintainers PTAL
	PSI *PSIStats
}

type PSIStats struct {
why are we pretty much fully duplicating here?
imo we should just use the libctr version
We should reinstate this discussion. I totally agree with you, but fixing this will include more changes not related to PSI (even if we fix only PSI, we need new _unsupported.go files in /oci).
I can follow up if it's ok, or do you want me to change this before merging this?
🙃 no this is fine, sorry I forgot
/approve LGTM @cri-o/cri-o-maintainers PTAL
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bitoku, haircommander
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@cri-o/cri-o-maintainers PTAL
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
building on #9565
Which issue(s) this PR fixes:
Special notes for your reviewer:
Apparently CRI stats support only `uint64`.
cAdvisor emits these values as `float64`, converting them from microseconds to seconds.
Is there a way to do the same as cAdvisor? This implementation converts to seconds, but the fractional part is discarded.
ref: https://github.com/google/cadvisor/pull/3649/files#diff-583dd1a38478c42e7ee4f90a9c3dfb5fd8a07b82f57d4ed24fa6a98a5951a4e7R1754
Does this PR introduce a user-facing change?
Summary by CodeRabbit
New Features
Platform
Tests
Documentation