@willianpaixao
Contributor

@willianpaixao willianpaixao commented Nov 21, 2025

What type of PR is this?

/kind feature

What this PR does / Why we need it:

This PR enables kubectl exec and crictl exec to access containers during their graceful termination period. Previously, these commands would hang indefinitely when attempting to access containers in the "Terminating" state.

The root cause was that the container stop loop held an exclusive operation lock (opLock) for the entire graceful shutdown duration, which could be minutes with high terminationGracePeriodSeconds values. The exec code path required this same lock to check container status, so every exec request blocked until the stop loop finished and released the lock.

The fix implements granular locking in the stop loop, only holding locks around individual runtime operations rather than for the entire stop duration. This allows exec operations to proceed during the graceful termination window while maintaining thread safety.

Which issue(s) this PR fixes:

Fixes #7160

Special notes for your reviewer:

Key changes:

  1. server/container_exec.go: Replaced UpdateContainerStatus() + State() check with c.Living() to avoid opLock dependency (matches ExecSync pattern)
  2. internal/oci/runtime_oci.go: Removed global opLock from StopLoopForContainer() and added granular locking around individual runtime calls (pause/resume, kill, status checks)

The change allows execs during graceful termination (between SIGTERM and kill loop) but correctly blocks them once the kill loop begins.

Does this PR introduce a user-facing change?:

Fixed kubectl exec and crictl exec commands hanging when accessing containers in the Terminating state. These commands now work correctly throughout the container shutdown period.

Testing:

Manual testing logs showing successful exec during termination:

CRI-O logs
INFO[2025-11-21T20:28:51.173942561+01:00] Pulling image: quay.io/quay/busybox:latest    id=23bdf144-dd71-43bd-9613-3712cc680b65 name=/runtime.v1.ImageService/PullImage
INFO[2025-11-21T20:28:51.17440329+01:00] Trying to access "quay.io/quay/busybox:latest" 
INFO[2025-11-21T20:28:52.949670864+01:00] Pulled image: quay.io/quay/busybox@sha256:92f3298bf80a1ba949140d77987f5de081f010337880cd771f7e7fc928f8c74d  id=23bdf144-dd71-43bd-9613-3712cc680b65 name=/runtime.v1.ImageService/PullImage
INFO[2025-11-21T20:28:52.962772532+01:00] Running pod sandbox: default/busybox-pod/POD  id=7525cb69-928b-4dee-b9fa-8e1f3fa52941 name=/runtime.v1.RuntimeService/RunPodSandbox
INFO[2025-11-21T20:28:52.962806243+01:00] Allowed annotations are specified for workload [io.containers.trace-syscall io.kubernetes.cri-o.Devices] 
INFO[2025-11-21T20:28:52.999253033+01:00] Ran pod sandbox e211f799afdc43b3b7455fed448476f818c5c857cfe3983be5a272f7bba5308f with infra container: default/busybox-pod/POD  id=7525cb69-928b-4dee-b9fa-8e1f3fa52941 name=/runtime.v1.RuntimeService/RunPodSandbox
INFO[2025-11-21T20:28:53.012440737+01:00] Creating container: //                        id=ba521307-4396-4ac3-9236-0bcf82dce774 name=/runtime.v1.RuntimeService/CreateContainer
INFO[2025-11-21T20:28:53.012501751+01:00] Allowed annotations are specified for workload [io.containers.trace-syscall io.kubernetes.cri-o.Devices] 
INFO[2025-11-21T20:28:53.014505914+01:00] Allowed annotations are specified for workload [io.containers.trace-syscall io.kubernetes.cri-o.Devices] 
INFO[2025-11-21T20:28:53.014998016+01:00] Allowed annotations are specified for workload [io.containers.trace-syscall io.kubernetes.cri-o.Devices] 
INFO[2025-11-21T20:28:53.042561101+01:00] Created container b7d01f7e2df0fe168b2bb853b3cf9e56a222ea006f00066ae721b4982df03ee5: //  id=ba521307-4396-4ac3-9236-0bcf82dce774 name=/runtime.v1.RuntimeService/CreateContainer
INFO[2025-11-21T20:28:53.053949058+01:00] Starting container: b7d01f7e2df0fe168b2bb853b3cf9e56a222ea006f00066ae721b4982df03ee5  id=4926d6b5-603a-4eec-b91c-0a5045c1832c name=/runtime.v1.RuntimeService/StartContainer
INFO[2025-11-21T20:28:53.055139909+01:00] Started container                             PID=114359 containerID=b7d01f7e2df0fe168b2bb853b3cf9e56a222ea006f00066ae721b4982df03ee5 description=// id=4926d6b5-603a-4eec-b91c-0a5045c1832c name=/runtime.v1.RuntimeService/StartContainer sandboxID=e211f799afdc43b3b7455fed448476f818c5c857cfe3983be5a272f7bba5308f
INFO[2025-11-21T20:28:55.100217021+01:00] Stopping container: b7d01f7e2df0fe168b2bb853b3cf9e56a222ea006f00066ae721b4982df03ee5 (timeout: 3600s)  id=bd041593-11a0-4524-82b6-309110da0761 name=/runtime.v1.RuntimeService/StopContainer
crictl exec output
$ crictl exec -it $CONTAINER_ID /bin/sh
DEBU[0000] Get runtime connection                       
DEBU[0000] Using runtime connection timeout: 5m0s       
DEBU[0000] Get image connection                         
DEBU[0000] ExecRequest: container_id:"8cba5e5cdf45b3b3f4322b84ab40856e81aa6f63ac27b03c30a416521738365b"  cmd:"/bin/sh"  tty:true  stdin:true  stdout:true 
DEBU[0000] ExecResponse: url:"http://127.0.0.1:38111/exec/OtbCYE8x" 
DEBU[0000] Exec URL: http://127.0.0.1:38111/exec/OtbCYE8x 
DEBU[0000] StreamOptions: {0xc000130028 0xc000130030 0xc000130038 true <nil>}
/ #

Unit tests added in internal/oci/container_test.go verify:

  • AddExecPID accepts exec processes during graceful termination
  • AddExecPID rejects exec processes once kill loop begins
  • DeleteExecPID safely removes tracked processes
  • KillExecPIDs terminates processes with appropriate signals
  • SetAsDoneStopping correctly cleans up resources

Summary by CodeRabbit

  • Bug Fixes

    • More reliable exec registration and PID tracking during container lifecycle; execs can be started during the graceful-shutdown window but are rejected once forced-kill begins.
    • Improved handling of exec attempts when containers are not alive.
  • New Features

    • PTY-backed exec startup now integrates with the improved atomic exec registration flow.
  • Tests

    • Large new test suite covering exec behavior during termination and expanded PID/lifecycle tests.


@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 21, 2025
@openshift-ci
Contributor

openshift-ci bot commented Nov 21, 2025

Hi @willianpaixao. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@willianpaixao willianpaixao marked this pull request as ready for review November 21, 2025 19:46
Copilot AI review requested due to automatic review settings November 21, 2025 19:46
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2025
@openshift-ci openshift-ci bot requested review from bitoku and littlejawa November 21, 2025 19:47

Copilot AI left a comment

Pull request overview

This PR enables exec commands to be executed in containers during their graceful termination phase. Previously, containers blocked exec operations as soon as they entered the stopping state. Now, exec is allowed during the initial graceful shutdown period (when SIGTERM is sent) and only blocked once the force-kill loop begins (when SIGKILL is used).

Key changes:

  • Modified AddExecPID() to check stopKillLoopBegun instead of stopping, allowing exec during graceful termination
  • Refactored StopLoopForContainer() lock management to use shorter-lived lock acquisitions, enabling concurrent operations during graceful termination
  • Simplified container state checking in Exec() to use the Living() method

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File and description:

  • server/container_exec.go: Simplified container state validation by replacing manual state checking with Living() method call
  • internal/oci/runtime_oci.go: Refactored lock management in StopLoopForContainer() to use shorter critical sections, allowing concurrent exec operations during graceful termination
  • internal/oci/container.go: Changed AddExecPID() to check stopKillLoopBegun instead of stopping, enabling exec registration during graceful shutdown phase


@codecov

codecov bot commented Nov 24, 2025

Codecov Report

❌ Patch coverage is 78.26087% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.96%. Comparing base (53edaa6) to head (dbbb93c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9614      +/-   ##
==========================================
+ Coverage   63.87%   63.96%   +0.09%     
==========================================
  Files         205      205              
  Lines       28699    28724      +25     
==========================================
+ Hits        18331    18374      +43     
+ Misses       8742     8725      -17     
+ Partials     1626     1625       -1     

@saschagrunert
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 24, 2025
@willianpaixao
Contributor Author

/assign sohankunkerkar

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 25, 2025
@sohankunkerkar
Member

@willianpaixao Thanks for working on this! The feature looks useful, but I have some concerns about potential race conditions that I think we should verify with integration tests before merging. Would you be okay adding one? You can refer to https://github.com/cri-o/cri-o/tree/main/test for more information.

@willianpaixao
Contributor Author

@sohankunkerkar integration tests added!

Member

@sohankunkerkar sohankunkerkar left a comment

LGTM
@cri-o/cri-o-maintainers PTAL

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 27, 2025
@sohankunkerkar sohankunkerkar removed the lgtm Indicates that a PR is ready to be merged. label Nov 27, 2025
@coderabbitai

coderabbitai bot commented Dec 11, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Introduces an atomic start-and-register flow for exec commands via a new ExecStarter interface and Container.StartExecCmd; allows execs during graceful termination until the container's kill loop begins, updates runtime and PTY paths to use the new pattern, and adds extensive tests plus a bats suite for exec-termination scenarios.

Changes

Cohort / file(s) and change summary:

  • Container core & tests (internal/oci/container.go, internal/oci/container_test.go): Replace AddExecPID with StartExecCmd(ExecStarter, shouldKill) (int, error) and add ExecStarter interface; start-and-register executed under stopLock, track execPIDs by returned PID; expand tests for StartExecCmd, DeleteExecPID, KillExecPIDs, SetAsDoneStopping, and watcher synchronization.
  • Runtime exec integration (internal/oci/runtime_oci.go): Add execCmdWrapper implementing ExecStarter; replace direct cmd.Start()/AddExecPID usage in ExecContainer/ExecSyncContainer with StartExecCmd, use returned PID for cgroup moves, and ensure PID cleanup on failures.
  • PTY / TTY path (internal/oci/oci_unix.go): Add ptyStarter implementing ExecStarter; defer PTY creation to starter, use StartExecCmd to start tty commands, obtain PTY via starter.Pty(), and remove prior manual PID management.
  • Server exec eligibility & tests (server/container_exec.go, server/container_exec_test.go): Use Living() check to permit exec when container is created/running/stopping; update tests and add mock ExecStarter scenarios for running, stopping (graceful), and stopped states.
  • Termination test suite (test/exec_termination.bats): New comprehensive bats tests exercising crictl exec behaviors during graceful termination: concurrent and sequential execs, long/short commands, TTY scenarios, SIGTERM interactions, and restart cases.
  • CI config (contrib/test/ci/vars.yml): Add test/exec_termination.bats to kata_skip_files with comment about VM timing differences.

Sequence Diagram

sequenceDiagram
    participant Client as Client / crictl/kubectl
    participant Server as Server (container_exec)
    participant Container as Container (internal/oci)
    participant ExecProc as ExecStarter (exec.Cmd / ptyStarter)
    participant PIDMap as execPIDs

    Note over Client,Server: Exec request during graceful termination
    Client->>Server: Exec request
    Server->>Container: Living() check
    Container-->>Server: true

    Server->>Container: StartExecCmd(execStarter, shouldKill)
    activate Container
    Container->>Container: acquire stopLock
    alt killLoopBegun == true
        Container-->>Server: error ("cannot start exec: container is being killed")
    else
        Container->>ExecProc: Start()
        activate ExecProc
        ExecProc-->>Container: started (pid)
        deactivate ExecProc
        Container->>PIDMap: register pid -> shouldKill
        Container-->>Server: pid (success)
    end
    Container->>Container: release stopLock
    deactivate Container

    Note over Container,PIDMap: Later, when kill loop begins
    Container->>PIDMap: iterate and signal registered pids (where shouldKill)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Pay attention to:
    • StartExecCmd locking and race conditions around killLoopBegun and stopLock.
    • Correct PID sourcing (ExecStarter.GetPid) and ensuring DeleteExecPID on all error paths.
    • cgroup move calls now using returned PID in runtime_oci.go.
    • PTY lifecycle in ptyStarter and tty resizing paths.
    • Timing-sensitive tests in test/exec_termination.bats and expanded container tests.

Poem

🐇 I hopped where execs once hit a wall,
I wrapped PIDs neat, made starts atomic and small.
Graceful windows open, kill loops now aware—
Commands may dance while I watch and care.
Hooray for clean starts! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The PR title accurately summarizes the main objective: enabling exec operations to containers during graceful termination, which directly addresses the primary change across all modified files.
  • Linked Issues check (✅ Passed): The PR successfully implements all coding requirements from issue #7160: enables exec during graceful termination, eliminates lock-induced deadlocks via granular locking, preserves termination semantics by blocking exec once kill loop begins, and adds comprehensive unit tests.
  • Out of Scope Changes check (✅ Passed): All code changes remain focused on the stated objective of enabling exec during graceful termination. No extraneous refactoring or unrelated feature implementations are observed across the modified files.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7279d3 and 4009c44.

📒 Files selected for processing (8)
  • contrib/test/ci/vars.yml (1 hunks)
  • internal/oci/container.go (1 hunks)
  • internal/oci/container_test.go (3 hunks)
  • internal/oci/oci_unix.go (3 hunks)
  • internal/oci/runtime_oci.go (4 hunks)
  • server/container_exec.go (1 hunks)
  • server/container_exec_test.go (2 hunks)
  • test/exec_termination.bats (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use interface-based design and dependency injection patterns in Go code
Propagate context.Context through function calls in Go code
Use fmt.Errorf with %w for error wrapping in Go code
Use logrus with structured fields for logging in Go code
Add comments explaining 'why' not 'what' in Go code
Use platform-specific file naming: *_{linux,freebsd}.go for platform-dependent code

Files:

  • server/container_exec.go
  • internal/oci/runtime_oci.go
  • internal/oci/oci_unix.go
  • internal/oci/container.go
  • internal/oci/container_test.go
  • server/container_exec_test.go
**/*_test.go

📄 CodeRabbit inference engine (AGENTS.md)

Use *_test.go naming convention for unit test files

Files:

  • internal/oci/container_test.go
  • server/container_exec_test.go
**/*.bats

📄 CodeRabbit inference engine (AGENTS.md)

Use .bats file extension for BATS integration test files

Files:

  • test/exec_termination.bats
🧠 Learnings (6)
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Applies to **/*.go : Propagate context.Context through function calls in Go code

Applied to files:

  • server/container_exec.go
📚 Learning: 2025-12-17T13:38:34.646Z
Learnt from: bitoku
Repo: cri-o/cri-o PR: 9667
File: server/container_create.go:1233-1236
Timestamp: 2025-12-17T13:38:34.646Z
Learning: In the cri-o/cri-o repository, protobuf-generated Get* methods for k8s.io/cri-api types are nil-safe: if the receiver is nil, GetX() returns the zero value instead of panicking. Do not add explicit nil checks before chaining calls on such getters. Apply this guidance to all Go code that uses these generated getters across the codebase.

Applied to files:

  • server/container_exec.go
  • internal/oci/runtime_oci.go
  • internal/oci/oci_unix.go
  • internal/oci/container.go
  • internal/oci/container_test.go
  • server/container_exec_test.go
📚 Learning: 2025-12-18T13:28:24.244Z
Learnt from: bitoku
Repo: cri-o/cri-o PR: 9676
File: internal/lib/stats/cgroup_stats_unsupported.go:1-7
Timestamp: 2025-12-18T13:28:24.244Z
Learning: In the cri-o/cri-o repository, for platform-specific types guarded by Go build tags (for example //go:build !linux), implement empty structs for unsupported platforms to permit compilation and clearly indicate the feature is not available rather than mirroring the Linux struct with unpopulated fields. Apply this pattern to all relevant platform-specific files across the codebase (i.e., any file under build-tagged sections that should compile on all targets but lacks full implementation for some platforms).

Applied to files:

  • server/container_exec.go
  • internal/oci/runtime_oci.go
  • internal/oci/oci_unix.go
  • internal/oci/container.go
  • internal/oci/container_test.go
  • server/container_exec_test.go
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Use relative test paths (e.g., `version.bats` not `test/version.bats`) when running integration tests

Applied to files:

  • contrib/test/ci/vars.yml
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Applies to **/*.bats : Use `.bats` file extension for BATS integration test files

Applied to files:

  • contrib/test/ci/vars.yml
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Run integration tests with `sudo -E ./test/test_runner.sh` not direct BATS execution

Applied to files:

  • contrib/test/ci/vars.yml
🧬 Code graph analysis (4)
internal/oci/runtime_oci.go (3)
pkg/config/config.go (1)
  • MonitorExecCgroupContainer (75-75)
internal/config/cgmgr/cgmgr_linux.go (1)
  • MoveProcessToContainerCgroup (174-195)
internal/config/cgmgr/cgmgr_unsupported.go (1)
  • MoveProcessToContainerCgroup (45-47)
internal/oci/container_test.go (1)
internal/oci/container.go (1)
  • Container (44-96)
server/container_exec_test.go (2)
internal/oci/container.go (1)
  • ContainerState (119-139)
internal/oci/oci.go (2)
  • ContainerStateRunning (31-31)
  • ContainerStateStopped (33-33)
test/exec_termination.bats (1)
test/helpers.bash (4)
  • setup_test (7-77)
  • cleanup_test (367-400)
  • start_crio (232-236)
  • crictl (86-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide (listed 3 times)
  • GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide (listed 3 times)
  • GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide (listed 3 times)
  • GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide (listed 3 times)
  • GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide (listed 3 times)
  • GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide (listed 3 times)
🔇 Additional comments (15)
contrib/test/ci/vars.yml (1)

58-58: LGTM! Clear reasoning for skipping this test in Kata.

The comment appropriately explains that VM-based runtimes (like Kata) handle exec during termination differently than native runc/crun, justifying the skip.

server/container_exec.go (1)

55-57: LGTM! Lock-free Living() check enables exec during graceful termination.

This change removes the dependency on UpdateContainerStatus() which required opLock, allowing exec operations to proceed during the container's graceful termination window. The Living() method is designed to be lock-free per the codebase's architecture.

Based on learnings and past review comments: Living() was designed to not race against opLock, making this approach safe.

internal/oci/oci_unix.go (2)

21-40: LGTM! Well-designed ptyStarter abstraction.

The ptyStarter type cleanly wraps PTY startup to implement the ExecStarter interface, storing the PTY file descriptor for later use. The three methods (Start, GetPid, Pty) provide a clean separation of concerns.


57-66: LGTM! Atomic exec startup integrated correctly.

The ttyCmd function now uses StartExecCmd with the ptyStarter, ensuring atomic registration of the exec PID. The deferred DeleteExecPID cleanup ensures proper lifecycle management, and accessing the PTY via starter.Pty() maintains the correct ordering of operations.

server/container_exec_test.go (2)

15-26: LGTM! Simple and effective mock implementation.

The mockExecStarter provides a minimal implementation of the ExecStarter interface suitable for testing without external dependencies.


78-129: LGTM! Comprehensive test coverage for exec lifecycle.

These three test cases effectively cover the key scenarios:

  1. Exec succeeds when container is running
  2. Exec is allowed during graceful termination (before kill loop begins)
  3. Exec fails when container is stopped

The test at lines 97-113 directly validates the core feature: allowing exec during graceful termination by using StartExecCmd while stopKillLoopBegun is still false.

internal/oci/container_test.go (4)

681-817: LGTM! Comprehensive test coverage for PID lifecycle management.

Excellent test coverage for DeleteExecPID and KillExecPIDs:

  • Tests handle edge cases (non-existent PIDs, already-dead PIDs, PID 0 safety)
  • Tests verify idempotent operations (multiple deletes, clearing the map)
  • Tests verify signal differentiation (SIGINT vs SIGKILL based on shouldKill flag)

This thorough testing ensures the PID lifecycle management is robust and safe.


819-896: LGTM! Critical tests for atomic exec startup.

These tests validate the core feature of the PR:

  1. Lines 820-839: Correctly blocks new execs when kill loop has begun
  2. Lines 841-857: Allows exec during graceful termination (before kill loop)
  3. Lines 859-874: Propagates start errors appropriately
  4. Lines 876-895: Verifies atomic PID registration on success

This ensures the window for exec during graceful termination is correctly implemented.


898-1021: LGTM! Thorough tests for watcher lifecycle.

These tests comprehensively verify the SetAsDoneStopping behavior:

  • Handles no watchers gracefully
  • Closes the timeout channel (and panics on double-close as expected)
  • Notifies all registered watchers
  • Clears the watchers slice

The context-aware timing tests ensure proper synchronization during container stop.
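The watcher behavior listed above can be illustrated with a small sketch. The `stopState` type, its fields, and the `watch` helper are hypothetical names chosen for this example and are not the CRI-O implementation; the sketch assumes watchers are notified by closing their channels.

```go
package main

import (
	"fmt"
	"sync"
)

// stopState is a hypothetical model of the stop bookkeeping.
type stopState struct {
	mu          sync.Mutex
	timeoutChan chan struct{}   // closed once stopping is done
	watchers    []chan struct{} // one channel per waiting caller
}

func newStopState() *stopState {
	return &stopState{timeoutChan: make(chan struct{})}
}

// watch registers a new watcher and returns the channel it should wait on.
func (s *stopState) watch() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	w := make(chan struct{})
	s.watchers = append(s.watchers, w)
	return w
}

// setAsDoneStopping notifies every watcher, clears the watcher slice, and
// closes the timeout channel. Calling it twice panics, because the timeout
// channel is already closed.
func (s *stopState) setAsDoneStopping() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, w := range s.watchers {
		close(w)
	}
	s.watchers = nil
	close(s.timeoutChan)
}

func main() {
	s := newStopState()
	w1, w2 := s.watch(), s.watch()
	s.setAsDoneStopping()
	<-w1 // returns immediately because the channel was closed
	<-w2
	fmt.Println("watchers left:", len(s.watchers))
}
```

Closing a channel rather than sending on it lets any number of waiters observe the event without the notifier knowing how many are listening.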


1042-1066: LGTM! Well-designed test helpers.

The mockExecStarter with configurable startFunc provides flexibility for testing various scenarios (success, failure, etc.), and the addTestExecPID helper properly encapsulates the StartExecCmd usage pattern for tests.

internal/oci/container.go (1)

869-896: LGTM! Atomic exec startup prevents race conditions.

The new StartExecCmd method addresses the critical requirement from past reviews: running exec and registering its PID atomically. Key design points:

  1. Lines 874-877: Checks stopKillLoopBegun before starting to prevent exec during kill loop
  2. Line 880: Calls cmd.Start() while holding stopLock to ensure atomicity
  3. Lines 884-886: Atomically registers PID and returns it to the caller

The ExecStarter interface (lines 891-896) provides a clean abstraction for testing without depending on exec.Cmd directly.

Based on coding guidelines: This uses interface-based design and dependency injection patterns in Go code as recommended.

test/exec_termination.bats (1)

1-248: LGTM! Comprehensive integration test suite for exec during termination.

This test suite thoroughly validates the PR's core feature, with test cases covering:

Success scenarios:

  • Lines 14-32: Basic exec during graceful termination
  • Lines 34-60: Multiple concurrent execs during graceful termination
  • Lines 84-101: ExecSync during graceful termination
  • Lines 103-126: TTY exec during graceful termination
  • Lines 143-161: Short commands completing during grace period
  • Lines 163-185: Long-running exec started before termination continuing
  • Lines 206-229: Sequential execs during graceful termination

Failure/edge scenarios:

  • Lines 62-82: Long exec killed when kill loop starts
  • Lines 128-141: Exec fails after container fully stopped
  • Lines 187-204: Exec rejected after grace period ends
  • Lines 231-247: SIGTERM handling during termination
  • Lines 249-273: Container restart rejection during exec

The tests use proper synchronization with sleep/wait and verify both successful and failed exec scenarios. This aligns with the test cases requested in the PR comments.

Based on learnings: Integration tests should be run with sudo -E ./test/test_runner.sh and use relative test paths.

internal/oci/runtime_oci.go (3)

104-115: LGTM! Clean adapter for ExecStarter interface.

The execCmdWrapper provides a minimal adapter to make exec.Cmd compatible with the ExecStarter interface, enabling the atomic start-and-register flow in StartExecCmd.


558-562: LGTM! ExecContainer integrated with atomic exec startup.

ExecContainer now uses StartExecCmd for atomic PID registration and defers cleanup via DeleteExecPID(pid). This ensures:

  1. Exec commands can't start once the kill loop begins
  2. PID is registered atomically during Start()
  3. Cleanup happens on all exit paths

722-795: LGTM! ExecSyncContainer properly integrated with cleanup on failure.

The integration of StartExecCmd in ExecSyncContainer includes proper error handling:

  1. Line 722: Atomic start and PID registration
  2. Lines 757-759: Cleanup PID registration on failure in the deferred error handler
  3. Line 769: Uses returned PID for cgroup operations
  4. Line 795: Final cleanup after Wait() completes

This ensures no PID leaks occur even when errors happen during setup or execution.




@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (9)
server/container_exec_test.go (1)

84-98: Test placement and scope mismatch.

This test directly calls testContainer.AddExecPID() rather than exercising the full StreamService.Exec path, yet it's placed within the "StreamServer: Exec" describe block. Consider either:

  1. Moving this test to a separate describe block focused on container-level exec PID management, or
  2. Changing this test to call testStreamService.Exec() during graceful termination to test the full integration.

The current test validates AddExecPID behavior but doesn't verify that the full exec flow works during graceful termination at the server level.

test/exec_termination.bats (2)

84-101: Potentially redundant with "exec during graceful termination" test.

This test ("execsync during graceful termination") appears functionally identical to the first test on lines 14-32. Both use crictl exec --sync with a 10-second stop timeout. Consider either:

  1. Removing this test as it's redundant, or
  2. Differentiating it by testing a different aspect of execsync (e.g., timeout behavior, error handling).

249-273: Test name is misleading.

The test is named "container restart rejected during exec" but it actually tests that a stopped container cannot be restarted (which is standard crictl behavior, not specific to exec). The exec is incidental to the test's assertion.

Consider renaming to something like "container stop terminates running exec" or moving the restart assertion to a separate, more focused test.

internal/oci/container_test.go (5)

680-732: AddExecPID tests cover main states but only via error presence

These cases exercise the key state combinations (normal, stopping before kill loop, kill loop begun, multiple PIDs), which is good. However, they only assert on the returned error, not on the internal exec PID tracking (count, shouldKill flag, etc.). If you want stronger guarantees around PID bookkeeping, consider adding tests that indirectly observe the map via KillExecPIDs behavior (with a stubbed kill function) rather than just the success or failure of AddExecPID.


734-791: DeleteExecPID tests don’t actually prove deletion semantics

DeleteExecPID is validated mostly by “no panic” and the ability to call AddExecPID again. Given AddExecPID likely overwrites entries in a map keyed by PID, re‑adding the same PID succeeding doesn’t confirm the previous entry was removed. If you care about the semantics here, consider:

  • Driving KillExecPIDs after selective deletes and asserting only the expected PIDs are targeted (via an injectable kill function), or
  • Exposing a small test‑only helper to inspect the exec PID map.

As written, the tests are still useful for safety (idempotent deletes, non‑existent PID handling) but weaker than the comments suggest.


793-870: KillExecPIDs tests mostly assert “no panic”, some comments oversell coverage

These cases ensure KillExecPIDs is robust for empty sets, non‑existent PIDs, and mixed shouldKill flags, which is valuable. Two nits:

  • The “skip PID 0” test never actually registers PID 0, so it just revalidates the “empty/no‑PIDs” path instead of the special‑case logic the comment describes.
  • None of the tests verify that shouldKill truly maps to different signals; they only check that the function returns without panic.

If you want higher confidence, it might be worth making the kill primitive injectable so tests can assert exactly which PIDs and signals were requested.


872-949: StartExecCmd tests don’t fully validate “atomic PID registration” claim

The coverage for kill‑loop gating and error propagation looks good. However, the “should register PID atomically on success” case only checks that:

  • StartExecCmd returns the expected PID, and
  • a subsequent DeleteExecPID(54321) does not panic.

Since DeleteExecPID is a no‑op for unknown PIDs, this doesn’t actually prove that the PID was registered. If atomic registration is critical, consider a stronger assertion path (e.g., invoking KillExecPIDs with a stubbed kill function and verifying it sees the PID).


951-1074: SetAsDoneStopping tests encode panic behavior and use timing‑based synchronization

A few concerns here:

  • “should close stop timeout channel” asserts that a second SetAsDoneStopping() call must panic. That hard‑codes non‑idempotent API semantics into tests; if you ever want to make SetAsDoneStopping safe to call multiple times, these tests will block you.
  • The watcher tests rely on time.Sleep(10 * time.Millisecond) to let goroutines register, which can be flaky on slow CI. Using explicit synchronization (e.g., a ready channel or WaitGroup) would be more robust.
  • “should clear the watchers slice” contains only comments; there’s no assertion or second call to actually validate that watchers were cleared.

None of this breaks current behavior, but you may want to tighten these tests to avoid future flakiness and to make the intended API contract around SetAsDoneStopping explicit.
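As a rough illustration of the explicit-synchronization suggestion, the sketch below replaces a sleep with a `sync.WaitGroup` that each watcher goroutine releases once it has registered. `stopNotifier`, `Watch`, `Done`, and `notifyAll` are hypothetical names, not the container's real API, and the idempotent `Done` is only one possible contract.

```go
package main

import (
	"fmt"
	"sync"
)

// stopNotifier is a hypothetical stand-in for stop-watcher bookkeeping.
type stopNotifier struct {
	mu       sync.Mutex
	watchers []chan struct{}
	done     bool
}

// Watch registers a watcher and returns a channel that is closed
// once stopping is done.
func (n *stopNotifier) Watch() <-chan struct{} {
	n.mu.Lock()
	defer n.mu.Unlock()
	ch := make(chan struct{})
	if n.done {
		close(ch)
		return ch
	}
	n.watchers = append(n.watchers, ch)
	return ch
}

// Done notifies all watchers and clears the slice. In this sketch a
// second call is a no-op instead of a panic.
func (n *stopNotifier) Done() {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.done {
		return
	}
	n.done = true
	for _, ch := range n.watchers {
		close(ch)
	}
	n.watchers = nil
}

// notifyAll starts count watchers, waits until every one has registered,
// then calls Done and reports how many watchers were notified.
func notifyAll(n *stopNotifier, count int) int {
	var ready, finished sync.WaitGroup
	var mu sync.Mutex
	notified := 0
	for i := 0; i < count; i++ {
		ready.Add(1)
		finished.Add(1)
		go func() {
			defer finished.Done()
			ch := n.Watch()
			ready.Done() // registration is complete, no sleep needed
			<-ch
			mu.Lock()
			notified++
			mu.Unlock()
		}()
	}
	ready.Wait() // every watcher is registered before Done runs
	n.Done()
	finished.Wait()
	return notified
}

func main() {
	n := &stopNotifier{}
	fmt.Println("notified:", notifyAll(n, 5))
	n.Done() // safe to repeat in this sketch
	fmt.Println("watchers left:", len(n.watchers))
}
```

Because `ready.Wait()` returns only after each goroutine has called `Watch()`, the test no longer depends on scheduler timing, and the cleared slice can be asserted directly.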

internal/oci/runtime_oci.go (1)

995-995: Extraneous standalone comment marker

The standalone // at this line in the stop loop goroutine adds no information and can be dropped to avoid noise.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6e5b12 and c7279d3.

📒 Files selected for processing (7)
  • contrib/test/ci/vars.yml (1 hunks)
  • internal/oci/container.go (1 hunks)
  • internal/oci/container_test.go (3 hunks)
  • internal/oci/runtime_oci.go (6 hunks)
  • server/container_exec.go (1 hunks)
  • server/container_exec_test.go (2 hunks)
  • test/exec_termination.bats (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Use interface-based design and dependency injection patterns in Go code
Propagate context.Context through function calls in Go code
Use fmt.Errorf with %w for error wrapping in Go code
Use logrus with structured fields for logging in Go code
Add comments explaining 'why' not 'what' in Go code
Use platform-specific file naming: *_{linux,freebsd}.go for platform-dependent code

Files:

  • server/container_exec_test.go
  • server/container_exec.go
  • internal/oci/container_test.go
  • internal/oci/container.go
  • internal/oci/runtime_oci.go
**/*_test.go

📄 CodeRabbit inference engine (AGENTS.md)

Use *_test.go naming convention for unit test files

Files:

  • server/container_exec_test.go
  • internal/oci/container_test.go
**/*.bats

📄 CodeRabbit inference engine (AGENTS.md)

Use .bats file extension for BATS integration test files

Files:

  • test/exec_termination.bats
🧠 Learnings (4)
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Applies to **/*.go : Propagate context.Context through function calls in Go code

Applied to files:

  • server/container_exec.go
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Use relative test paths (e.g., `version.bats` not `test/version.bats`) when running integration tests

Applied to files:

  • contrib/test/ci/vars.yml
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Applies to **/*.bats : Use `.bats` file extension for BATS integration test files

Applied to files:

  • contrib/test/ci/vars.yml
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Run integration tests with `sudo -E ./test/test_runner.sh` not direct BATS execution

Applied to files:

  • contrib/test/ci/vars.yml
🧬 Code graph analysis (3)
server/container_exec_test.go (1)
internal/oci/oci.go (2)
  • ContainerStateRunning (32-32)
  • ContainerStateStopped (34-34)
test/exec_termination.bats (1)
test/helpers.bash (4)
  • setup_test (7-77)
  • cleanup_test (367-400)
  • start_crio (232-236)
  • crictl (86-88)
internal/oci/runtime_oci.go (2)
internal/config/cgmgr/cgmgr_linux.go (1)
  • MoveProcessToContainerCgroup (163-184)
internal/config/cgmgr/cgmgr_unsupported.go (1)
  • MoveProcessToContainerCgroup (45-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
  • GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
🔇 Additional comments (15)
contrib/test/ci/vars.yml (1)

58-58: LGTM!

The skip entry is appropriately placed with a clear comment explaining why VM-based runtimes have different exec-during-termination behavior. This follows the existing pattern in the file.

server/container_exec.go (1)

55-57: LGTM - Living() check aligns with ExecSync pattern.

The switch from UpdateContainerStatus() + State() to Living() removes the opLock dependency that was causing exec to hang during graceful termination. This matches the pattern used in ExecSyncContainer and allows exec during the graceful termination window.

The previous review discussion about potential race conditions on Living() was addressed via integration tests as requested by the maintainer.

server/container_exec_test.go (2)

65-82: LGTM!

The test correctly validates that Living() passes for running containers. Using both SetState and SetStateAndSpoofPid ensures the container has proper PID state for the Living() check. The assertion that the error does not contain the Living() failure message is the right approach since the actual runtime exec will fail in the test environment.


100-114: LGTM!

The test correctly validates that Exec fails with the expected error message when the container is stopped. This ensures the Living() check properly rejects exec requests for non-running containers.

internal/oci/container.go (3)

876-880: Good fix for the core issue.

Changing the guard from c.stopping to c.stopKillLoopBegun is the key change that enables exec during graceful termination. The updated error message accurately reflects that execs are only blocked once SIGKILL is being sent, not during the SIGTERM grace period.


887-907: Atomic start-and-register pattern addresses the race condition.

This method correctly solves the race condition identified in past reviews where exec could run without its PID being tracked. By holding stopLock across both cmd.Start() and PID registration, the operation is atomic with respect to termination state changes.

One consideration: cmd.Start() is called while holding stopLock. This should be safe since exec.Cmd.Start() typically returns quickly after forking, but worth noting that any blocking in the ExecStarter.Start() implementation would delay other stopLock operations.


909-914: Clean interface design.

The ExecStarter interface provides good abstraction for testing and decouples the container logic from exec.Cmd specifics. The comment explains the purpose well.

test/exec_termination.bats (4)

1-12: Good test file structure.

The setup and teardown functions correctly use the standard helpers setup_test and cleanup_test from the test framework. Based on learnings, using .bats extension for integration tests is correct.


14-32: Core test case looks good.

This test validates the primary use case from issue #7160 - that crictl exec succeeds during the graceful termination window. The 10-second timeout provides sufficient grace period, and the 0.5-second sleep allows the stop to be initiated before testing exec.


187-204: Good edge case test for kill loop boundary.

This test validates that exec is rejected once the grace period ends and the kill loop starts. The 1-second timeout followed by 2-second sleep ensures the kill loop has begun. The || true on the final wait is appropriate since the stop may complete with an error status.


62-82: No action needed—the Fedora image includes bash.

The container_sleep.json references quay.io/crio/fedora-crio-ci:latest, a Fedora-based image that includes /bin/bash by default. All four test cases using /bin/bash (lines 67, 168, 236, 254) will execute correctly.

internal/oci/container_test.go (1)

1095-1109: mockExecStarter is a clean, focused test double

The mock neatly captures Start() calls and exposes a fixed PID, matching the ExecStarter contract and keeping StartExecCmd tests simple and deterministic. No issues here.

internal/oci/runtime_oci.go (3)

104-116: *execCmdWrapper correctly adapts exec.Cmd to ExecStarter

This wrapper is minimal and does the right thing by delegating Start() and reading cmd.Process.Pid in GetPid(). It implicitly relies on callers invoking Start() before GetPid(), which is enforced by StartExecCmd in this codebase, so the abstraction looks sound.


541-547: ExecContainer’s use of StartExecCmd + deferred DeleteExecPID looks correct

Switching the non‑TTY exec path to:

  • Start the command via c.StartExecCmd(&execCmdWrapper{cmd: execCmd}, true), and
  • defer c.DeleteExecPID(pid),

gives you atomic registration relative to the kill loop while ensuring cleanup happens exactly once per successful StartExecCmd call. This integrates well with the existing pipe + Wait logic and doesn’t alter the I/O semantics.


691-704: ExecSyncContainer PID handling and cgroup move are consistent with StartExecCmd

The new flow:

  • Uses c.StartExecCmd(&execCmdWrapper{cmd: cmd}, false) to atomically start and register the exec PID.
  • On start failure, returns an ExecSyncError without touching PID state (delegated to StartExecCmd).
  • On later failures inside the inner scope, kills/waits the process and calls c.DeleteExecPID(pid) in the defer.
  • On success, calls MoveProcessToContainerCgroup(containerPid, pid) using the same PID returned by StartExecCmd, then later waitErr := cmd.Wait() and finally c.DeleteExecPID(pid).

This gives a single registration per start, with one matching delete on every success/failure path, and keeps the cgroup update tied to the same PID that’s tracked for kill‑loop purposes. The behavior looks coherent.

Also applies to: 729-735, 742-743

@willianpaixao willianpaixao requested a review from bitoku December 11, 2025 20:16
@willianpaixao
Contributor Author

@coderabbitai generate docstrings

@coderabbitai

coderabbitai bot commented Dec 18, 2025

✅ Actions performed

Initiated docstring generation; will generate only if new commits exist.

@coderabbitai

coderabbitai bot commented Dec 18, 2025

Note

Docstrings generation - SUCCESS
Generated docstrings for this pull request at #9678

coderabbitai bot added a commit that referenced this pull request Dec 18, 2025
Docstrings generation was requested by @willianpaixao.

* #9614 (comment)

The following files were modified:

* `test/exec_termination.bats`
@haircommander
Member

couple of nits, great work here

Contributor

@bitoku bitoku left a comment

Thank you for your patience. Overall LGTM.

// If the PID is conmon, shouldKill should be false, as we should not call SIGKILL on conmon.
// If it is an exec session, shouldKill should be true, as we can't guarantee the exec process
// will have a SIGINT handler.
func (c *Container) AddExecPID(pid int, shouldKill bool) error {
Contributor

Is this still needed? I think we can remove this.

Comment on lines +104 to +116
// execCmdWrapper wraps exec.Cmd to implement the ExecStarter interface.
type execCmdWrapper struct {
	cmd *exec.Cmd
}

func (w *execCmdWrapper) Start() error {
	return w.cmd.Start()
}

func (w *execCmdWrapper) GetPid() int {
	return w.cmd.Process.Pid
}

Contributor

I guess this is for testing. Can we extend this interface? It should be cleaner.

// CommandRunner is an interface for executing commands.
// It gives the option to change the way commands are run server-wide.
type CommandRunner interface {
	Command(string, ...string) *exec.Cmd
	CommandContext(context.Context, string, ...string) *exec.Cmd
	CombinedOutput(string, ...string) ([]byte, error)
}

Member

+1

Contributor Author

Adding Start() and GetPid() to CommandRunner would conflate command creation with command execution, violating single responsibility. Additionally, CommandRunner is a global singleton used throughout the codebase for command creation, so changing its contract would have wider implications.

The ExecStarter interface is minimal (2 methods), focused, and only used within internal/oci/ for the specific purpose of atomic exec PID registration during container termination. I believe keeping it separate is the cleaner design.

Happy to discuss alternative approaches if you have a different design in mind.

Contributor

Fair enough. We can follow up if necessary.

This change enables kubectl exec and crictl exec to access containers
that are in the "Terminating" state during graceful shutdown periods.

Previously, exec operations would hang when a pod was terminating because
the stop loop held the container's opLock for the entire stop duration
(which could be minutes with high terminationGracePeriodSeconds). The
exec path needed this lock to call UpdateContainerStatus(), creating a
blocking condition.

Changes made:

1. server/container_exec.go:
   - Replace UpdateContainerStatus() + State() check with c.Living()
   - This matches the ExecSync pattern and avoids opLock entirely
   - Removes dependency on oci package

2. internal/oci/runtime_oci.go (StopLoopForContainer):
   - Remove global opLock held for entire function duration
   - Add granular locking around individual runtime operations:
     * Paused state checks and resume commands
     * Kill/SIGTERM commands
     * Periodic Living() checks in goroutine (RLock)
     * ProcessState() checks in blocked timer (RLock)
     * Final cleanup in defer block (Lock)
   - Lock is now only held during runtime calls, not wait periods
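
The granular-locking idea can be sketched as below. This is a simplified model, not the code from this change: `container`, `sendSignal`, `stopLoop`, and `execCanProceed` are hypothetical names, the grace period is an injected callback rather than a timer, and `TryLock` (Go 1.18+) is used only to show that the lock is free during the wait.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical model with a single operation lock.
type container struct {
	opLock  sync.Mutex
	signals []string
}

// sendSignal stands in for one runtime call; the lock is held only
// for the duration of that call.
func (c *container) sendSignal(sig string) {
	c.opLock.Lock()
	defer c.opLock.Unlock()
	c.signals = append(c.signals, sig)
}

// stopLoop sends SIGTERM, waits out the grace period, then sends SIGKILL.
// No lock is held while wait runs.
func (c *container) stopLoop(wait func()) {
	c.sendSignal("SIGTERM")
	wait() // grace period: opLock is free here
	c.sendSignal("SIGKILL")
}

// execCanProceed models an exec path that needs opLock briefly.
// It reports whether the lock could be taken right now.
func (c *container) execCanProceed() bool {
	if !c.opLock.TryLock() {
		return false
	}
	defer c.opLock.Unlock()
	return true
}

func main() {
	c := &container{}
	execOK := false
	// The callback plays the role of an exec arriving mid-termination.
	c.stopLoop(func() { execOK = c.execCanProceed() })
	fmt.Println(execOK, c.signals)
}
```

If `stopLoop` instead took `opLock` once at the top and held it until return, `execCanProceed` would report false for the whole grace period, which is the hang described above.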

Fixes: cri-o#7160

Signed-off-by: Willian Paixao <[email protected]>
Add test coverage for container exec process management, focusing on
the graceful termination window. Tests verify:
- AddExecPID accepts exec processes during graceful termination but rejects
them once the kill loop begins
- DeleteExecPID safely handles removal of tracked processes
- KillExecPIDs properly terminates exec processes with appropriate signals
- SetAsDoneStopping correctly notifies watchers and cleans up resources

These tests ensure exec processes can be started during the graceful
termination period (between SIGTERM and kill loop) while preventing execs
after the container enters the kill loop, addressing the gap in container
lifecycle management.

Signed-off-by: Willian Paixao <[email protected]>
Add integration test coverage for container exec functionality
during the graceful termination period.

The test suite validates:
- Basic exec succeeds during graceful termination window
- Multiple concurrent execs complete successfully
- Long-running execs are killed when grace period expires
- ExecSync and TTY exec modes work during graceful termination
- Sequential execs execute properly during termination
- Exec commands are rejected after container stops
- Exec commands are rejected after grace period ends
- Pre-existing exec processes continue during graceful termination
- Signal handling (SIGTERM) works correctly for exec processes
- Container restart is rejected while exec is running

Signed-off-by: Willian Paixao <[email protected]>
Addresses a race condition where an exec process could start but not be
registered in execPIDs before the kill loop begins, leaving orphan
processes.

- Add StartExecCmd() that atomically checks kill loop state, starts the
command, and registers the PID under a single lock
- Add ExecStarter interface for testability
- Update ExecContainer and ExecSyncContainer to use atomic method

Signed-off-by: Willian Paixao <[email protected]>
@willianpaixao
Contributor Author

Hey guys, thanks for the help. I've addressed everything I could find. The comments are scattered and it's getting messy at this point. So I'm sorry if I forgot to change something. Just comment again and I will tackle it. Wish you all a Merry Christmas.

@bitoku
Contributor

bitoku commented Dec 22, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 22, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit bcc6495 into cri-o:main Dec 22, 2025
26 checks passed
@willianpaixao willianpaixao deleted the terminating-pod branch December 22, 2025 16:49

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Container cannot be accessed via kubectl exec when pod is in Terminating state

5 participants