feat: allow exec to containers during graceful termination #9614
Conversation
Hi @willianpaixao. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull request overview
This PR enables exec commands to be executed in containers during their graceful termination phase. Previously, containers blocked exec operations as soon as they entered the stopping state. Now, exec is allowed during the initial graceful shutdown period (when SIGTERM is sent) and only blocked once the force-kill loop begins (when SIGKILL is used).
Key changes:
- Modified `AddExecPID()` to check `stopKillLoopBegun` instead of `stopping`, allowing exec during graceful termination
- Refactored `StopLoopForContainer()` lock management to use shorter-lived lock acquisitions, enabling concurrent operations during graceful termination
- Simplified container state checking in `Exec()` to use the `Living()` method
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| server/container_exec.go | Simplified container state validation by replacing manual state checking with Living() method call |
| internal/oci/runtime_oci.go | Refactored lock management in StopLoopForContainer() to use shorter critical sections, allowing concurrent exec operations during graceful termination |
| internal/oci/container.go | Changed AddExecPID() to check stopKillLoopBegun instead of stopping, enabling exec registration during graceful shutdown phase |
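The guard change summarized in the table above can be sketched as follows. This is a hypothetical, simplified model (the `container` struct, mutex, and map here are illustrative, not CRI-O's actual types); it only shows why checking `stopKillLoopBegun` instead of `stopping` opens the graceful-termination window to execs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// container is an illustrative stand-in for CRI-O's Container. The field
// names (stopping, stopKillLoopBegun) follow the PR description.
type container struct {
	mu                sync.Mutex
	stopping          bool // SIGTERM sent, grace period running
	stopKillLoopBegun bool // force-kill (SIGKILL) loop started
	execPIDs          map[int]bool
}

// AddExecPID registers an exec PID. After this PR, only the kill loop
// blocks new execs; the plain "stopping" state no longer does.
func (c *container) AddExecPID(pid int, shouldKill bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stopKillLoopBegun {
		return errors.New("cannot register exec: kill loop begun")
	}
	if c.execPIDs == nil {
		c.execPIDs = map[int]bool{}
	}
	c.execPIDs[pid] = shouldKill
	return nil
}

func main() {
	c := &container{stopping: true} // graceful termination: exec allowed
	fmt.Println(c.AddExecPID(100, true) == nil)
	c.stopKillLoopBegun = true // SIGKILL loop: exec rejected
	fmt.Println(c.AddExecPID(101, true) == nil)
}
```

Under the old guard (`if c.stopping { return err }`), the first call above would already fail, which is exactly the behavior this PR removes.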
Codecov Report (patch coverage and impacted files):
@@ Coverage Diff @@
## main #9614 +/- ##
==========================================
+ Coverage 63.87% 63.96% +0.09%
==========================================
Files 205 205
Lines 28699 28724 +25
==========================================
+ Hits 18331 18374 +43
+ Misses 8742 8725 -17
+ Partials 1626 1625 -1
/ok-to-test
/assign sohankunkerkar
@willianpaixao Thanks for working on this! The feature looks useful, but I have some concerns about potential race conditions that I think we should verify with integration tests before merging. Would you be okay adding one? You can refer to https://github.com/cri-o/cri-o/tree/main/test for more information.
@sohankunkerkar integration tests added!
sohankunkerkar
left a comment
LGTM
@cri-o/cri-o-maintainers PTAL
> **Note:** Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

**Walkthrough:** Introduces an atomic start-and-register flow for exec commands via a new `ExecStarter` interface and `Container.StartExecCmd`; allows execs during graceful termination until the container's kill loop begins, updates runtime and PTY paths to use the new pattern, and adds extensive tests plus a bats suite for exec-termination scenarios.
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Client as Client (crictl/kubectl)
    participant Server as Server (container_exec)
    participant Container as Container (internal/oci)
    participant ExecProc as ExecStarter (exec.Cmd / ptyStarter)
    participant PIDMap as execPIDs
    Note over Client,Server: Exec request during graceful termination
    Client->>Server: Exec request
    Server->>Container: Living() check
    Container-->>Server: true
    Server->>Container: StartExecCmd(execStarter, shouldKill)
    activate Container
    Container->>Container: acquire stopLock
    alt killLoopBegun == true
        Container-->>Server: error ("cannot start exec: container is being killed")
    else
        Container->>ExecProc: Start()
        activate ExecProc
        ExecProc-->>Container: started (pid)
        deactivate ExecProc
        Container->>PIDMap: register pid -> shouldKill
        Container-->>Server: pid (success)
    end
    Container->>Container: release stopLock
    deactivate Container
    Note over Container,PIDMap: Later, when kill loop begins
    Container->>PIDMap: iterate and signal registered pids (where shouldKill)
```
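The atomic start-and-register flow in the diagram can be sketched in Go. Names follow the review walkthrough (`ExecStarter`, `StartExecCmd`, `execPIDs`), but the types below are a simplified illustration, not CRI-O's real implementation; `fakeStarter` is a hypothetical test double.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// execStarter mirrors the PR's ExecStarter interface: something that can
// be started and then report its PID.
type execStarter interface {
	Start() error
	GetPid() int
}

type container struct {
	stopLock          sync.Mutex
	stopKillLoopBegun bool
	execPIDs          map[int]bool
}

// StartExecCmd checks the kill-loop flag, starts the command, and records
// its PID under one lock, so the kill loop can never miss a just-started exec.
func (c *container) StartExecCmd(s execStarter, shouldKill bool) (int, error) {
	c.stopLock.Lock()
	defer c.stopLock.Unlock()
	if c.stopKillLoopBegun {
		return 0, errors.New("cannot start exec: container is being killed")
	}
	if err := s.Start(); err != nil {
		return 0, err
	}
	pid := s.GetPid()
	if c.execPIDs == nil {
		c.execPIDs = map[int]bool{}
	}
	c.execPIDs[pid] = shouldKill
	return pid, nil
}

// fakeStarter stands in for the real exec.Cmd wrapper.
type fakeStarter struct{ pid int }

func (f *fakeStarter) Start() error { return nil }
func (f *fakeStarter) GetPid() int  { return f.pid }

func main() {
	c := &container{}
	pid, err := c.StartExecCmd(&fakeStarter{pid: 42}, true)
	fmt.Println(pid, err == nil, c.execPIDs[42])
}
```

Because the flag check, `Start()`, and registration all happen inside the same critical section, there is no window where a started exec process is invisible to the kill loop.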
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 0
🧹 Nitpick comments (9)
server/container_exec_test.go (1)
**84-98: Test placement and scope mismatch.** This test directly calls `testContainer.AddExecPID()` rather than exercising the full `StreamService.Exec` path, yet it's placed within the `"StreamServer: Exec"` describe block. Consider either:
- Moving this test to a separate describe block focused on container-level exec PID management, or
- Changing this test to call `testStreamService.Exec()` during graceful termination to test the full integration.

The current test validates `AddExecPID` behavior but doesn't verify that the full exec flow works during graceful termination at the server level.

test/exec_termination.bats (2)
**84-101: Potentially redundant with the "exec during graceful termination" test.** This test ("execsync during graceful termination") appears functionally identical to the first test on lines 14-32. Both use `crictl exec --sync` with a 10-second stop timeout. Consider either:
- Removing this test as redundant, or
- Differentiating it by testing a different aspect of execsync (e.g., timeout behavior, error handling).

**249-273: Test name is misleading.** The test is named "container restart rejected during exec" but it actually tests that a stopped container cannot be restarted (which is standard `crictl` behavior, not specific to exec). The exec is incidental to the test's assertion. Consider renaming it to something like "container stop terminates running exec" or moving the restart assertion to a separate, more focused test.
internal/oci/container_test.go (5)
**680-732: AddExecPID tests cover the main states but only via error presence.** These cases exercise the key state combinations (normal, stopping before kill loop, kill loop begun, multiple PIDs), which is good. However, they only assert on `error` and not on the internal exec PID tracking (count, `shouldKill` flag, etc.). For stronger guarantees around PID bookkeeping, consider adding tests that indirectly observe the map via `KillExecPIDs` behavior (with a stubbed kill function) rather than just success/failure of `AddExecPID`.
**734-791: DeleteExecPID tests don't actually prove deletion semantics.** `DeleteExecPID` is validated mostly by "no panic" and the ability to call `AddExecPID` again. Given `AddExecPID` likely overwrites entries in a map keyed by PID, re-adding the same PID successfully doesn't confirm the previous entry was removed. If the semantics matter here, consider:
- Driving `KillExecPIDs` after selective deletes and asserting that only the expected PIDs are targeted (via an injectable kill function), or
- Exposing a small test-only helper to inspect the exec PID map.

As written, the tests are still useful for safety (idempotent deletes, non-existent PID handling) but weaker than the comments suggest.
**793-870: KillExecPIDs tests mostly assert "no panic"; some comments oversell coverage.** These cases ensure `KillExecPIDs` is robust for empty sets, non-existent PIDs, and mixed `shouldKill` flags, which is valuable. Two nits:
- The "skip PID 0" test never actually registers PID 0, so it just revalidates the "empty/no-PIDs" path instead of the special-case logic the comment describes.
- None of the tests verify that `shouldKill` truly maps to different signals; they only check that the function returns without panicking.

For higher confidence, it might be worth making the kill primitive injectable so tests can assert exactly which PIDs and signals were requested.
**872-949: StartExecCmd tests don't fully validate the "atomic PID registration" claim.** The coverage for kill-loop gating and error propagation looks good. However, the "should register PID atomically on success" case only checks that:
- `StartExecCmd` returns the expected PID, and
- a subsequent `DeleteExecPID(54321)` does not panic.

Since `DeleteExecPID` is a no-op for unknown PIDs, this doesn't actually prove that the PID was registered. If atomic registration is critical, consider a stronger assertion path (e.g., invoking `KillExecPIDs` with a stubbed kill function and verifying it sees the PID).
**951-1074: SetAsDoneStopping tests encode panic behavior and use timing-based synchronization.** A few concerns here:
- "should close stop timeout channel" asserts that a second `SetAsDoneStopping()` call must panic. That hard-codes non-idempotent API semantics into tests; if you ever want to make `SetAsDoneStopping` safe to call multiple times, these tests will block you.
- The watcher tests rely on `time.Sleep(10 * time.Millisecond)` to let goroutines register, which can be flaky on slow CI. Explicit synchronization (e.g., a ready channel or WaitGroup) would be more robust.
- "should clear the watchers slice" contains only comments; there's no assertion or second call to actually validate that watchers were cleared.

None of this breaks current behavior, but you may want to tighten these tests to avoid future flakiness and to make the intended API contract around `SetAsDoneStopping` explicit.

internal/oci/runtime_oci.go (1)
**995: Extraneous standalone comment marker.** The standalone `//` at this line in the stop loop goroutine adds no information and can be dropped to avoid noise.
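The SetAsDoneStopping nitpicks above (sleep-based watcher tests, non-idempotent double-call semantics) can both be addressed with standard Go primitives. This is a sketch under assumed, simplified types: `stopper` and `watch` are illustrative, not CRI-O's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// stopper models a container's "done stopping" notification.
type stopper struct {
	once sync.Once
	done chan struct{}
}

func newStopper() *stopper { return &stopper{done: make(chan struct{})} }

// SetAsDoneStopping uses sync.Once so repeated calls are safe, instead of
// panicking on a double close of the channel.
func (s *stopper) SetAsDoneStopping() {
	s.once.Do(func() { close(s.done) })
}

// watch signals on ready once it is registered, so tests never need a
// time.Sleep to guess when the goroutine is running.
func watch(s *stopper, ready chan<- struct{}, out chan<- string) {
	ready <- struct{}{}
	<-s.done
	out <- "stopped"
}

func main() {
	s := newStopper()
	ready := make(chan struct{})
	out := make(chan string, 1)
	go watch(s, ready, out)
	<-ready // deterministic: the watcher is definitely registered
	s.SetAsDoneStopping()
	s.SetAsDoneStopping() // idempotent; no panic
	fmt.Println(<-out)
}
```

The ready channel replaces `time.Sleep(10 * time.Millisecond)` with an explicit handshake, and `sync.Once` makes the "second call panics" test unnecessary.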
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- `contrib/test/ci/vars.yml` (1 hunks)
- `internal/oci/container.go` (1 hunks)
- `internal/oci/container_test.go` (3 hunks)
- `internal/oci/runtime_oci.go` (6 hunks)
- `server/container_exec.go` (1 hunks)
- `server/container_exec_test.go` (2 hunks)
- `test/exec_termination.bats` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
`**/*.go`:
- Use interface-based design and dependency injection patterns in Go code
- Propagate context.Context through function calls in Go code
- Use `fmt.Errorf` with `%w` for error wrapping in Go code
- Use logrus with structured fields for logging in Go code
- Add comments explaining 'why' not 'what' in Go code
- Use platform-specific file naming: `*_{linux,freebsd}.go` for platform-dependent code
Files:
`server/container_exec_test.go`, `server/container_exec.go`, `internal/oci/container_test.go`, `internal/oci/container.go`, `internal/oci/runtime_oci.go`
**/*_test.go
📄 CodeRabbit inference engine (AGENTS.md)
Use the `*_test.go` naming convention for unit test files
Files:
`server/container_exec_test.go`, `internal/oci/container_test.go`
**/*.bats
📄 CodeRabbit inference engine (AGENTS.md)
Use the `.bats` file extension for BATS integration test files
Files:
test/exec_termination.bats
🧠 Learnings (4)
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Applies to **/*.go : Propagate context.Context through function calls in Go code
Applied to files:
server/container_exec.go
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Use relative test paths (e.g., `version.bats` not `test/version.bats`) when running integration tests
Applied to files:
contrib/test/ci/vars.yml
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Applies to **/*.bats : Use `.bats` file extension for BATS integration test files
Applied to files:
contrib/test/ci/vars.yml
📚 Learning: 2025-12-03T18:27:19.593Z
Learnt from: CR
Repo: cri-o/cri-o PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-03T18:27:19.593Z
Learning: Run integration tests with `sudo -E ./test/test_runner.sh` not direct BATS execution
Applied to files:
contrib/test/ci/vars.yml
🧬 Code graph analysis (3)
server/container_exec_test.go (1)
internal/oci/oci.go (2)
`ContainerStateRunning` (32-32), `ContainerStateStopped` (34-34)
test/exec_termination.bats (1)
test/helpers.bash (4)
`setup_test` (7-77), `cleanup_test` (367-400), `start_crio` (232-236), `crictl` (86-88)
internal/oci/runtime_oci.go (2)
internal/config/cgmgr/cgmgr_linux.go (1)
`MoveProcessToContainerCgroup` (163-184)

internal/config/cgmgr/cgmgr_unsupported.go (1)

`MoveProcessToContainerCgroup` (45-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-aarch64:fedora-rawhide
- GitHub Check: rpm-build:centos-stream-9-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-aarch64:fedora-rawhide
- GitHub Check: rpm-build:fedora-rawhide-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-x86_64:fedora-rawhide
- GitHub Check: rpm-build:fedora-43-aarch64:fedora-rawhide
🔇 Additional comments (15)
contrib/test/ci/vars.yml (1)
58-58: LGTM!The skip entry is appropriately placed with a clear comment explaining why VM-based runtimes have different exec-during-termination behavior. This follows the existing pattern in the file.
server/container_exec.go (1)
**55-57: LGTM - the Living() check aligns with the ExecSync pattern.** The switch from `UpdateContainerStatus()` + `State()` to `Living()` removes the `opLock` dependency that was causing exec to hang during graceful termination. This matches the pattern used in `ExecSyncContainer` and allows exec during the graceful termination window. The previous review discussion about potential race conditions on `Living()` was addressed via integration tests, as requested by the maintainer.

server/container_exec_test.go (2)
**65-82: LGTM!** The test correctly validates that `Living()` passes for running containers. Using both `SetState` and `SetStateAndSpoofPid` ensures the container has proper PID state for the `Living()` check. Asserting that the error does not contain the `Living()` failure message is the right approach, since the actual runtime exec will fail in the test environment.
**100-114: LGTM!** The test correctly validates that `Exec` fails with the expected error message when the container is stopped. This ensures the `Living()` check properly rejects exec requests for non-running containers.

internal/oci/container.go (3)
**876-880: Good fix for the core issue.** Changing the guard from `c.stopping` to `c.stopKillLoopBegun` is the key change that enables exec during graceful termination. The updated error message accurately reflects that execs are only blocked once SIGKILL is being sent, not during the SIGTERM grace period.
**887-907: The atomic start-and-register pattern addresses the race condition.** This method correctly solves the race condition identified in past reviews where an exec could run without its PID being tracked. By holding `stopLock` across both `cmd.Start()` and PID registration, the operation is atomic with respect to termination state changes.

One consideration: `cmd.Start()` is called while holding `stopLock`. This should be safe since `exec.Cmd.Start()` typically returns quickly after forking, but note that any blocking in the `ExecStarter.Start()` implementation would delay other `stopLock` operations.
**909-914: Clean interface design.** The `ExecStarter` interface provides a good abstraction for testing and decouples the container logic from `exec.Cmd` specifics. The comment explains the purpose well.

test/exec_termination.bats (4)
**1-12: Good test file structure.** The setup and teardown functions correctly use the standard helpers `setup_test` and `cleanup_test` from the test framework. Based on learnings, using the `.bats` extension for integration tests is correct.
**14-32: Core test case looks good.** This test validates the primary use case from issue #7160: that `crictl exec` succeeds during the graceful termination window. The 10-second timeout provides a sufficient grace period, and the 0.5-second sleep allows the stop to be initiated before testing exec.
**187-204: Good edge case test for the kill loop boundary.** This test validates that exec is rejected once the grace period ends and the kill loop starts. The 1-second timeout followed by a 2-second sleep ensures the kill loop has begun. The `|| true` on the final wait is appropriate, since the stop may complete with an error status.
**62-82: No action needed; the Fedora image includes bash.** The container_sleep.json references `quay.io/crio/fedora-crio-ci:latest`, a Fedora-based image that includes `/bin/bash` by default. All four test cases using `/bin/bash` (lines 67, 168, 236, 254) will execute correctly.

internal/oci/container_test.go (1)
**1095-1109: mockExecStarter is a clean, focused test double.** The mock neatly captures `Start()` calls and exposes a fixed PID, matching the `ExecStarter` contract and keeping the StartExecCmd tests simple and deterministic. No issues here.

internal/oci/runtime_oci.go (3)
**104-116: `*execCmdWrapper` correctly adapts exec.Cmd to ExecStarter.** This wrapper is minimal and does the right thing by delegating `Start()` and reading `cmd.Process.Pid` in `GetPid()`. It implicitly relies on callers invoking `Start()` before `GetPid()`, which is enforced by `StartExecCmd` in this codebase, so the abstraction looks sound.
**541-547: ExecContainer's use of StartExecCmd + deferred DeleteExecPID looks correct.** Switching the non-TTY exec path to:
- start the command via `c.StartExecCmd(&execCmdWrapper{cmd: execCmd}, true)`, and
- `defer c.DeleteExecPID(pid)`,

gives atomic registration relative to the kill loop while ensuring cleanup happens exactly once per successful `StartExecCmd` call. This integrates well with the existing pipe + Wait logic and doesn't alter the I/O semantics.
**691-704: ExecSyncContainer PID handling and the cgroup move are consistent with StartExecCmd.** The new flow:
- Uses `c.StartExecCmd(&execCmdWrapper{cmd: cmd}, false)` to atomically start and register the exec PID.
- On start failure, returns an `ExecSyncError` without touching PID state (delegated to `StartExecCmd`).
- On later failures inside the inner scope, kills/waits the process and calls `c.DeleteExecPID(pid)` in the defer.
- On success, calls `MoveProcessToContainerCgroup(containerPid, pid)` using the same PID returned by `StartExecCmd`, then later `waitErr := cmd.Wait()` and finally `c.DeleteExecPID(pid)`.

This gives a single registration per start, with one matching delete on every success/failure path, and keeps the cgroup update tied to the same PID that's tracked for kill-loop purposes. The behavior looks coherent.

Also applies to: 729-735, 742-743
@coderabbitai generate docstrings
✅ Actions performed: Initiated docstring generation; will generate only if new commits exist.
Note: Docstrings generation - SUCCESS
Docstrings generation was requested by @willianpaixao in #9614 (comment). The following files were modified:
- `test/exec_termination.bats`
Couple of nits, but great work here.
bitoku
left a comment
Thank you for the patience. Overall LGTM.
internal/oci/container.go
Outdated
```go
// If the PID is conmon, shouldKill should be false, as we should not call SIGKILL on conmon.
// If it is an exec session, shouldKill should be true, as we can't guarantee the exec process
// will have a SIGINT handler.
func (c *Container) AddExecPID(pid int, shouldKill bool) error {
```
Is this still needed? I think we can remove this.
```go
// execCmdWrapper wraps exec.Cmd to implement the ExecStarter interface.
type execCmdWrapper struct {
	cmd *exec.Cmd
}

func (w *execCmdWrapper) Start() error {
	return w.cmd.Start()
}

func (w *execCmdWrapper) GetPid() int {
	return w.cmd.Process.Pid
}
```
I guess this is for testing. Can we extend this interface? It should be cleaner.
cri-o/utils/cmdrunner/cmdrunner.go
Lines 13 to 19 in d19a9d6
```go
// CommandRunner is an interface for executing commands.
// It gives the option to change the way commands are run server-wide.
type CommandRunner interface {
	Command(string, ...string) *exec.Cmd
	CommandContext(context.Context, string, ...string) *exec.Cmd
	CombinedOutput(string, ...string) ([]byte, error)
}
```
+1
Adding Start() and GetPid() to CommandRunner would conflate command creation with command execution, violating single responsibility. Additionally, CommandRunner is a global singleton used throughout the codebase for command creation; changing its contract would have wider implications.
The ExecStarter interface is minimal (2 methods), focused, and only used within internal/oci/ for the specific purpose of atomic exec PID registration during container termination. I believe keeping it separate is the cleaner design.
Happy to discuss alternative approaches if you have a different design in mind.
Fair enough. We can follow up if necessary.
This change enables kubectl exec and crictl exec to access containers
that are in the "Terminating" state during graceful shutdown periods.
Previously, exec operations would hang when a pod was terminating because
the stop loop held the container's opLock for the entire stop duration
(which could be minutes with high terminationGracePeriodSeconds). The
exec path needed this lock to call UpdateContainerStatus(), creating a
blocking condition.
Changes made:
1. server/container_exec.go:
- Replace UpdateContainerStatus() + State() check with c.Living()
- This matches the ExecSync pattern and avoids opLock entirely
- Removes dependency on oci package
2. internal/oci/runtime_oci.go (StopLoopForContainer):
- Remove global opLock held for entire function duration
- Add granular locking around individual runtime operations:
* Paused state checks and resume commands
* Kill/SIGTERM commands
* Periodic Living() checks in goroutine (RLock)
* ProcessState() checks in blocked timer (RLock)
* Final cleanup in defer block (Lock)
- Lock is now only held during runtime calls, not wait periods
Fixes: cri-o#7160
Signed-off-by: Willian Paixao <[email protected]>
Add test coverage for container exec process management, focusing on the graceful termination window. Tests verify:

- AddExecPID accepts exec processes during graceful termination but rejects them once the kill loop begins
- DeleteExecPID safely handles removal of tracked processes
- KillExecPIDs properly terminates exec processes with appropriate signals
- SetAsDoneStopping correctly notifies watchers and cleans up resources

These tests ensure exec processes can be started during the graceful termination period (between SIGTERM and kill loop) while preventing execs after the container enters the kill loop, addressing the gap in container lifecycle management.

Signed-off-by: Willian Paixao <[email protected]>
Add integration test coverage for container exec functionality during the graceful termination period. The test suite validates:

- Basic exec succeeds during the graceful termination window
- Multiple concurrent execs complete successfully
- Long-running execs are killed when the grace period expires
- ExecSync and TTY exec modes work during graceful termination
- Sequential execs execute properly during termination
- Exec commands are rejected after the container stops
- Exec commands are rejected after the grace period ends
- Pre-existing exec processes continue during graceful termination
- Signal handling (SIGTERM) works correctly for exec processes
- Container restart is rejected while an exec is running

Signed-off-by: Willian Paixao <[email protected]>
Signed-off-by: Willian Paixao <[email protected]>
Addresses a race condition where an exec process could start but not be registered in execPIDs before the kill loop begins, leaving orphan processes.

- Add StartExecCmd() that atomically checks kill loop state, starts the command, and registers the PID under a single lock
- Add ExecStarter interface for testability
- Update ExecContainer and ExecSyncContainer to use the atomic method

Signed-off-by: Willian Paixao <[email protected]>
Signed-off-by: Willian Paixao <[email protected]>
Hey guys, thanks for the help. I've addressed everything I could find. The comments are scattered and it's getting messy at this point. So I'm sorry if I forgot to change something. Just comment again and I will tackle it. Wish you all a Merry Christmas.
/lgtm
What type of PR is this?
/kind feature
What this PR does / Why we need it:
This PR enables `kubectl exec` and `crictl exec` to access containers during their graceful termination period. Previously, these commands would hang indefinitely when attempting to access containers in the "Terminating" state.

The root cause was that the container stop loop held an exclusive operation lock (`opLock`) for the entire graceful shutdown duration, which could be minutes with high `terminationGracePeriodSeconds` values. The exec code path required this same lock to check container status, creating a deadlock scenario.

The fix implements granular locking in the stop loop, holding locks only around individual runtime operations rather than for the entire stop duration. This allows exec operations to proceed during the graceful termination window while maintaining thread safety.
Which issue(s) this PR fixes:
Fixes #7160
Special notes for your reviewer:
Key changes:
- Replaced the `UpdateContainerStatus()` + `State()` check with `c.Living()` to avoid the opLock dependency (matches the ExecSync pattern)
- Refactored `StopLoopForContainer()` and added granular locking around individual runtime calls (pause/resume, kill, status checks)

The change allows execs during graceful termination (between SIGTERM and kill loop) but correctly blocks them once the kill loop begins.
Does this PR introduce a user-facing change?:
Testing:
Manual testing logs showing successful exec during termination:
CRI-O logs
crictl exec output
Unit tests added in `internal/oci/container_test.go` verify:
- `AddExecPID` accepts exec processes during graceful termination
- `AddExecPID` rejects exec processes once the kill loop begins
- `DeleteExecPID` safely removes tracked processes
- `KillExecPIDs` terminates processes with appropriate signals
- `SetAsDoneStopping` correctly cleans up resources