Introduce InternalWipe #4767
Conversation
Force-pushed from 07a2683 to 4bf32c7.
Codecov Report
@@            Coverage Diff             @@
##           master    #4767      +/-   ##
==========================================
- Coverage   43.00%   42.92%   -0.08%
==========================================
  Files         107      107
  Lines        9809     9891      +82
==========================================
+ Hits         4218     4246      +28
- Misses       5140     5196      +56
+ Partials      451      449       -2
Force-pushed from 4bf32c7 to d1ed125.
/retest
docs/crio.conf.5.md (Outdated)
**internal_wipe**=false
  Whether CRI-O should wipe containers after a reboot on server startup.
  If set to false, 'crio wipe' will clean up the containers instead.
see prior
- func NewResourceCleaner() *ResourceCleaner {
+ func NewResourceCleaner(ctx context.Context) *ResourceCleaner {
  	return &ResourceCleaner{
  		ctx: ctx,
I don't think we're supposed to save ctx as a member variable
The main reason is to be able to use log.Errorf in Cleanup(). Do you think we should pass the context to Cleanup(ctx) as an alternative?
Yes, I think it's better to pass ctx to Cleanup directly.
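For reference, a minimal sketch of the direction agreed on here: Cleanup takes the context as a parameter instead of keeping it on the struct. The names (ResourceCleaner, cleanupFunc, Add) mirror the discussion but are illustrative, not the exact code merged in this PR, and the stdlib log call stands in for CRI-O's log.Errorf(ctx, ...).

```go
package resourcestore

import (
	"context"
	"log"
)

// cleanupFunc is a single piece of cleanup work for a sandbox or container.
type cleanupFunc func(context.Context) error

// ResourceCleaner collects cleanup steps. Note there is no ctx field:
// the context travels with the call instead.
type ResourceCleaner struct {
	funcs []cleanupFunc
}

func NewResourceCleaner() *ResourceCleaner {
	return &ResourceCleaner{}
}

// Add registers another cleanup step.
func (r *ResourceCleaner) Add(f cleanupFunc) {
	r.funcs = append(r.funcs, f)
}

// Cleanup receives the context from its caller, so request-scoped logging
// remains possible without storing the context as a member variable.
func (r *ResourceCleaner) Cleanup(ctx context.Context) error {
	for _, f := range r.funcs {
		if err := f(ctx); err != nil {
			log.Printf("cleanup step failed: %v", err) // stdlib stand-in for log.Errorf(ctx, ...)
			return err
		}
	}
	return nil
}
```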
Force-pushed from 67bb590 to df76fae.
/retest
Force-pushed from df76fae to d00c6db.
Force-pushed from 01a851f to f7a9746.
@haircommander added another commit to run the cleanup in parallel
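A rough sketch of one way such parallel cleanup can be orchestrated; errgroup is used purely for illustration, and RunParallel and its signature are assumptions rather than the code added in that commit.

```go
package cleanup

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// RunParallel runs independent cleanup functions concurrently and returns the
// first error encountered; the derived context is canceled on first failure.
func RunParallel(ctx context.Context, cleanups []func(context.Context) error) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, cleanup := range cleanups {
		cleanup := cleanup // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return cleanup(gctx)
		})
	}
	return g.Wait()
}
```

Running the steps concurrently only works if each step tolerates the others having already run, which is what the later commits below address.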
/retest
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
This logically groups two related functions: stopping the network and cleaning up namespaces. We do this to allow network stop failures to be retried, as we'd fail to clean up the network if the network namespace was removed first (a sketch of this ordering follows the commit notes below). Signed-off-by: Peter Hunt <[email protected]>
WaitContainerStopped will always fail if the ctx has expired, which it always could be when running the cleanupFuncs after the deadline is exceeded. Signed-off-by: Peter Hunt <[email protected]>
If we are cleaning up a sandbox in parallel, we may run into situations where we clean up the infra directory before we stop the sandbox's network. That should not be fatal. Signed-off-by: Peter Hunt <[email protected]>
This commit refactors a couple of things:
- rename ContainerStop to StopContainer (more consistent)
- reuse StopContainer more
- make StopContainer not query for the container again (every place it's called already has it)
- drop the test for StopContainer (it only checked that we query a container, so it was no longer testing anything relevant)
In doing so, we also fix a bug where stopping the sandbox would fail because "container is already stopped". Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
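The commit notes above describe two related constraints: the network must be stopped (the CNI DEL issued) before the network namespace is removed, and a namespace that is already gone should not be treated as fatal when cleanups run in parallel. A hedged sketch of that ordering, with illustrative interface and function names rather than the exact CRI-O symbols:

```go
package sandboxcleanup

import (
	"context"
	"errors"
	"log"
	"os"
)

// podNetwork issues the CNI DEL for the sandbox.
type podNetwork interface {
	Stop(ctx context.Context) error
}

// netNS is the sandbox's network namespace.
type netNS interface {
	Path() string
	Remove() error
}

// stopNetworkThenRemoveNS keeps the two steps grouped so a failed CNI DEL can
// be retried while the namespace (and its interfaces) still exists.
func stopNetworkThenRemoveNS(ctx context.Context, net podNetwork, ns netNS) error {
	if err := net.Stop(ctx); err != nil {
		// The namespace is still intact, so the caller may retry the CNI DEL.
		return err
	}
	if err := ns.Remove(); err != nil {
		// A parallel cleanup may already have removed it; don't make that fatal.
		if errors.Is(err, os.ErrNotExist) {
			log.Printf("network namespace %s already gone, continuing", ns.Path())
			return nil
		}
		return err
	}
	return nil
}
```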
Force-pushed from 49f9aff to 20354c7.
/retest
/test ci/prow/e2e-gcp
@haircommander: The specified target(s) for
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-gcp
/lgtm
/retest
Please review the full test history for this PR and help us cut down flakes.
3 similar comments
@haircommander: The following test failed, say
Full PR test history. Your PR dashboard.
/retest
Please review the full test history for this PR and help us cut down flakes.
1 similar comment
/cherry-pick release-1.21
@haircommander: new pull request created: #4884
What type of PR is this?
What this PR does / why we need it:
On node reboot, there are situations where CNI plugins aren't able to clean up the resources they acquire. They rely on the caller of CNI ADD to send a corresponding CNI DEL.

The problem is, CRI-O was previously not able to send the CNI DEL, because `crio wipe` removed the evidence of the container's (and pod's) existence. As such, on every reboot, some CNI plugins have the ability to leak resources.

This PR is the first step in remediating that. It moves the responsibility of wiping images and containers from `crio wipe` to server startup. This behavior is gated by a config value `internal_wipe`. It leaves the responsibility of checking whether the node shut down and was able to sync with `crio wipe`, and also keeps `crio wipe -f`, which has proven useful for wiping the state of a node.

We also adapt the tests to the new behavior (previously, we were wiping the pause container, which we no longer do; I think this is the correct behavior).
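As a rough illustration of the startup path described above (not the exact CRI-O code): the wipe only runs when the `internal_wipe` config value is set, leaving the external `crio wipe` binary responsible otherwise. `Config.InternalWipe`, `wipeIfNeeded`, and the `wiper` interface are assumed names for this sketch.

```go
package server

import "context"

// Config holds the relevant part of crio.conf.
type Config struct {
	InternalWipe bool // corresponds to internal_wipe
}

// wiper abstracts the wipe logic; ShouldWipeContainers would detect a reboot.
type wiper interface {
	ShouldWipeContainers() (bool, error)
	WipeContainers(ctx context.Context) error
}

// wipeIfNeeded is called during server startup. With InternalWipe disabled,
// the external `crio wipe` binary stays responsible for the cleanup, as before.
func wipeIfNeeded(ctx context.Context, cfg *Config, w wiper) error {
	if !cfg.InternalWipe {
		return nil
	}
	shouldWipe, err := w.ShouldWipeContainers()
	if err != nil {
		return err
	}
	if !shouldWipe {
		return nil
	}
	return w.WipeContainers(ctx)
}
```

Doing this inside the server, rather than in `crio wipe`, means the container and pod state is still available when the CNI DEL needs to be issued.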
Which issue(s) this PR fixes:
fixes half of #4727
Special notes for your reviewer:
built on top of #4763
Does this PR introduce a user-facing change?