Introduce InternalWipe #4767
Conversation
Force-pushed from 07a2683 to 4bf32c7.
Codecov Report
@@            Coverage Diff             @@
##           master    #4767      +/-   ##
==========================================
- Coverage   43.00%   42.92%   -0.08%
==========================================
  Files         107      107
  Lines        9809     9891      +82
==========================================
+ Hits         4218     4246      +28
- Misses       5140     5196      +56
+ Partials      451      449       -2
Force-pushed from 4bf32c7 to d1ed125.
/retest
docs/crio.conf.5.md (Outdated)
**internal_wipe**=false
  Whether CRI-O should wipe containers after a reboot on server startup.
  If set to false, 'crio wipe' will clean up the containers instead.
see prior
- func NewResourceCleaner() *ResourceCleaner {
+ func NewResourceCleaner(ctx context.Context) *ResourceCleaner {
  	return &ResourceCleaner{
  		ctx: ctx,
I don't think we're supposed to save ctx as a member variable
The main reason is to be able to use log.Errorf in Cleanup(). Do you think we should pass the context to Cleanup(ctx) as an alternative?
Yes, I think it's better to pass ctx to Cleanup directly.
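For reference, a minimal sketch of the direction agreed on here: Cleanup takes the context as a parameter instead of keeping it on the struct. The names (ResourceCleaner, cleanupFunc, Add) mirror the discussion but are illustrative, not the exact code merged in this PR, and the stdlib log call stands in for CRI-O's log.Errorf(ctx, ...).

```go
package resourcestore

import (
	"context"
	"log"
)

// cleanupFunc is a single piece of cleanup work for a sandbox or container.
type cleanupFunc func(context.Context) error

// ResourceCleaner collects cleanup steps. Note there is no ctx field:
// the context travels with the call instead.
type ResourceCleaner struct {
	funcs []cleanupFunc
}

func NewResourceCleaner() *ResourceCleaner {
	return &ResourceCleaner{}
}

// Add registers another cleanup step.
func (r *ResourceCleaner) Add(f cleanupFunc) {
	r.funcs = append(r.funcs, f)
}

// Cleanup receives the context from its caller, so request-scoped logging
// remains possible without storing the context as a member variable.
func (r *ResourceCleaner) Cleanup(ctx context.Context) error {
	for _, f := range r.funcs {
		if err := f(ctx); err != nil {
			log.Printf("cleanup step failed: %v", err) // stdlib stand-in for log.Errorf(ctx, ...)
			return err
		}
	}
	return nil
}
```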
Force-pushed from 67bb590 to df76fae.
/retest
Force-pushed from df76fae to d00c6db.
Force-pushed from 01a851f to f7a9746.
@haircommander added another commit to run the cleanup in parallel
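A rough sketch of one way such parallel cleanup can be orchestrated; errgroup is used purely for illustration, and RunParallel and its signature are assumptions rather than the code added in that commit.

```go
package cleanup

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// RunParallel runs independent cleanup functions concurrently and returns the
// first error encountered; the derived context is canceled on first failure.
func RunParallel(ctx context.Context, cleanups []func(context.Context) error) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, cleanup := range cleanups {
		cleanup := cleanup // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			return cleanup(gctx)
		})
	}
	return g.Wait()
}
```

Running the steps concurrently only works if each step tolerates the others having already run, which is what the later commits below address.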
/retest
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
This logically groups two related functions: stopping the network and cleaning up namespaces. We do this to allow network stop failures to be retried, as we'd fail to clean up the network if the network namespace was removed first (a sketch of this ordering follows the commit notes below). Signed-off-by: Peter Hunt <[email protected]>
WaitContainerStopped will always fail if the ctx has expired, which it always could be when running the cleanupFuncs after the deadline is exceeded. Signed-off-by: Peter Hunt <[email protected]>
If we are cleaning up a sandbox in parallel, we may run into situations where we clean up the infra directory before we stop the sandbox's network. That should not be fatal. Signed-off-by: Peter Hunt <[email protected]>
This commit refactors a couple of things:
- rename ContainerStop to StopContainer (more consistent)
- reuse StopContainer more
- make StopContainer not query for the container again (every place it's called already has it)
- drop the test for StopContainer (it only checked that we query a container, so it was no longer testing anything relevant)
In doing so, we also fix a bug where stopping the sandbox would fail because "container is already stopped". Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
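The commit notes above describe two related constraints: the network must be stopped (the CNI DEL issued) before the network namespace is removed, and a namespace that is already gone should not be treated as fatal when cleanups run in parallel. A hedged sketch of that ordering, with illustrative interface and function names rather than the exact CRI-O symbols:

```go
package sandboxcleanup

import (
	"context"
	"errors"
	"log"
	"os"
)

// podNetwork issues the CNI DEL for the sandbox.
type podNetwork interface {
	Stop(ctx context.Context) error
}

// netNS is the sandbox's network namespace.
type netNS interface {
	Path() string
	Remove() error
}

// stopNetworkThenRemoveNS keeps the two steps grouped so a failed CNI DEL can
// be retried while the namespace (and its interfaces) still exists.
func stopNetworkThenRemoveNS(ctx context.Context, net podNetwork, ns netNS) error {
	if err := net.Stop(ctx); err != nil {
		// The namespace is still intact, so the caller may retry the CNI DEL.
		return err
	}
	if err := ns.Remove(); err != nil {
		// A parallel cleanup may already have removed it; don't make that fatal.
		if errors.Is(err, os.ErrNotExist) {
			log.Printf("network namespace %s already gone, continuing", ns.Path())
			return nil
		}
		return err
	}
	return nil
}
```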
Force-pushed from 49f9aff to 20354c7.
/retest
/test ci/prow/e2e-gcp
@haircommander: The specified target(s) for
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-gcp
/lgtm
/retest
Please review the full test history for this PR and help us cut down flakes.
3 similar comments
@haircommander: The following test failed, say
Full PR test history. Your PR dashboard.
/retest
Please review the full test history for this PR and help us cut down flakes.
1 similar comment
/cherry-pick release-1.21
@haircommander: new pull request created: #4884
What type of PR is this?
What this PR does / why we need it:
On node reboot, there are situations where CNI plugins aren't able to clean up the resources they acquire. They rely on the caller of CNI ADD to send a corresponding CNI DEL.

The problem is, CRI-O was previously not able to send the CNI DEL, because `crio wipe` removed the evidence of the container's (and pod's) existence. As such, on every reboot, some CNI plugins have the ability to leak resources.

This PR is the first step in remediating that. It moves the responsibility of wiping images and containers from `crio wipe` to server startup. This behavior is gated by a config value `internal_wipe`. It leaves the responsibility of checking whether the node shut down and was able to sync with `crio wipe`, and also keeps `crio wipe -f`, which has proven useful for wiping the state of a node.

We also adapt the tests to the new behavior (previously, we were wiping the pause container, which we no longer do; I think this is the correct behavior).
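As a rough illustration of the startup path described above (not the exact CRI-O code): the wipe only runs when the `internal_wipe` config value is set, leaving the external `crio wipe` binary responsible otherwise. `Config.InternalWipe`, `wipeIfNeeded`, and the `wiper` interface are assumed names for this sketch.

```go
package server

import "context"

// Config holds the relevant part of crio.conf.
type Config struct {
	InternalWipe bool // corresponds to internal_wipe
}

// wiper abstracts the wipe logic; ShouldWipeContainers would detect a reboot.
type wiper interface {
	ShouldWipeContainers() (bool, error)
	WipeContainers(ctx context.Context) error
}

// wipeIfNeeded is called during server startup. With InternalWipe disabled,
// the external `crio wipe` binary stays responsible for the cleanup, as before.
func wipeIfNeeded(ctx context.Context, cfg *Config, w wiper) error {
	if !cfg.InternalWipe {
		return nil
	}
	shouldWipe, err := w.ShouldWipeContainers()
	if err != nil {
		return err
	}
	if !shouldWipe {
		return nil
	}
	return w.WipeContainers(ctx)
}
```

Doing this inside the server, rather than in `crio wipe`, means the container and pod state is still available when the CNI DEL needs to be issued.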
Which issue(s) this PR fixes:
fixes half of #4727
Special notes for your reviewer:
built on top of #4763
Does this PR introduce a user-facing change?