Conversation

@haircommander
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

On node reboot, there are situations where CNI plugins aren't able to clean up the resources they acquired. They rely on the caller of CNI ADD to send a corresponding CNI DEL.

The problem is that CRI-O was previously unable to send the CNI DEL, because `crio wipe` removed the evidence of the container's (and pod's) existence. As a result, some CNI plugins could leak resources on every reboot.

This PR is the first step in remediating it. It moves the responsibility of wiping images and containers from crio wipe to server startup. This behavior is gated by a config value internal_wipe.

`crio wipe` remains responsible for checking whether the node shut down cleanly and was able to sync. `crio wipe -f` also stays, since it has proven useful for wiping the state of a node.

We also adapt the tests to the new behavior. Previously, we were wiping the pause container, which we no longer do; I think this is the correct behavior.
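
The startup ordering described above can be sketched roughly as follows. This is a minimal illustration, not CRI-O's actual code: `Sandbox`, `NetworkPlugin`, `wipeOnStartup`, `fakePlugin`, and the `rebooted` flag are hypothetical names, and keeping a sandbox's storage when its network teardown fails is an assumption of this sketch.

```go
package main

import "fmt"

// Sandbox is a hypothetical stand-in for a pod sandbox restored from disk.
type Sandbox struct{ ID string }

// NetworkPlugin is a hypothetical wrapper around the CNI DEL call.
type NetworkPlugin interface {
	TearDown(sb Sandbox) error
}

// wipeOnStartup sends a CNI DEL for every restored sandbox before removing
// its storage. It does nothing unless internal_wipe is set and the node
// rebooted. It returns the IDs that were removed and any errors it hit.
func wipeOnStartup(internalWipe, rebooted bool, sbs []Sandbox,
	plugin NetworkPlugin, remove func(Sandbox) error) ([]string, []error) {
	if !internalWipe || !rebooted {
		return nil, nil
	}
	var removed []string
	var errs []error
	for _, sb := range sbs {
		if err := plugin.TearDown(sb); err != nil {
			// Assumption: keep the record so the DEL can be retried later.
			errs = append(errs, fmt.Errorf("network teardown for %s: %w", sb.ID, err))
			continue
		}
		if err := remove(sb); err != nil {
			errs = append(errs, fmt.Errorf("remove %s: %w", sb.ID, err))
			continue
		}
		removed = append(removed, sb.ID)
	}
	return removed, errs
}

// fakePlugin fails teardown for the IDs listed in fail.
type fakePlugin struct{ fail map[string]bool }

func (p fakePlugin) TearDown(sb Sandbox) error {
	if p.fail[sb.ID] {
		return fmt.Errorf("plugin unavailable")
	}
	return nil
}

func main() {
	sbs := []Sandbox{{ID: "a"}, {ID: "b"}}
	removed, errs := wipeOnStartup(true, true, sbs,
		fakePlugin{fail: map[string]bool{"b": true}},
		func(Sandbox) error { return nil })
	fmt.Println(removed, len(errs))
}
```

The point of the ordering is that the sandbox record still exists when the DEL is issued, which is what the external `crio wipe` could not guarantee.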

Which issue(s) this PR fixes:

fixes half of #4727

Special notes for your reviewer:

built on top of #4763

Does this PR introduce a user-facing change?

Add the config field `internal_wipe`, which moves the responsibility of wiping containers after a reboot and images after an upgrade from the external binary `crio wipe` to the main crio server. This has a handful of advantages, the main one being that crio is now better able to clean up CNI resources after a reboot.

@openshift-ci-robot added the release-note, kind/bug, and dco-signoff: yes labels Apr 15, 2021
@openshift-ci-robot added the approved label Apr 15, 2021
@codecov

codecov bot commented Apr 15, 2021

Codecov Report

Merging #4767 (4749f83) into master (6732636) will decrease coverage by 0.07%.
The diff coverage is 37.54%.

❗ Current head 4749f83 differs from pull request most recent head 20354c7. Consider uploading reports for the commit 20354c7 to get more accurate results

@@            Coverage Diff             @@
##           master    #4767      +/-   ##
==========================================
- Coverage   43.00%   42.92%   -0.08%     
==========================================
  Files         107      107              
  Lines        9809     9891      +82     
==========================================
+ Hits         4218     4246      +28     
- Misses       5140     5196      +56     
+ Partials      451      449       -2     

@haircommander
Member Author

/retest


**internal_wipe**=false
Whether CRI-O should wipe containers after a reboot on server startup.
If set to false, 'crio wipe' will clean up the containers instead.
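
As a rough illustration of how this option selects the component that does the wiping, here is a small sketch. Only the option name `internal_wipe` and its default of false come from the text above; the `RuntimeConfig` struct shape and the `containerWiper` helper are hypothetical.

```go
package main

import "fmt"

// RuntimeConfig is a hypothetical, trimmed-down config struct. Only the
// option name internal_wipe is taken from the documentation above.
type RuntimeConfig struct {
	InternalWipe bool `toml:"internal_wipe"`
}

// containerWiper reports which component removes containers after a reboot.
func containerWiper(c RuntimeConfig) string {
	if c.InternalWipe {
		return "crio server"
	}
	return "crio wipe"
}

func main() {
	// The zero value matches the documented default, internal_wipe=false.
	fmt.Println(containerWiper(RuntimeConfig{}))
	fmt.Println(containerWiper(RuntimeConfig{InternalWipe: true}))
}
```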
Contributor

see prior

-func NewResourceCleaner() *ResourceCleaner {
+func NewResourceCleaner(ctx context.Context) *ResourceCleaner {
 	return &ResourceCleaner{
+		ctx: ctx,
Member Author

I don't think we're supposed to save ctx as a member variable

Member

The main reason is to be able to use log.Errorf in Cleanup(). Do you think we should pass the context to Cleanup(ctx) as an alternative?

Collaborator

Yes, I think it's better to pass ctx to Cleanup directly.

@haircommander haircommander force-pushed the more-cni-del branch 3 times, most recently from 67bb590 to df76fae Compare April 16, 2021 19:36
@haircommander
Member Author

/retest

@haircommander
Member Author

@saschagrunert saschagrunert force-pushed the more-cni-del branch 6 times, most recently from 01a851f to f7a9746 Compare April 19, 2021 09:12
@saschagrunert
Member

@haircommander I added another commit to run the cleanup in parallel.

@saschagrunert
Member

/retest

This logically groups two related functions: stopping the network and cleaning up namespaces.

We do this to allow network stop failures to be retried, as we will fail to clean up the network if the network namespace was removed first.

Signed-off-by: Peter Hunt <[email protected]>

WaitContainerStopped will always fail if the ctx has expired, which may well be the case when the cleanupFuncs run because the deadline was exceeded.

Signed-off-by: Peter Hunt <[email protected]>

If we are cleaning up a sandbox in parallel, we may run into situations where
we clean up the infra directory before we stop the sandbox's network.

That should not be fatal.

Signed-off-by: Peter Hunt <[email protected]>

This commit refactors a couple of things:
- renames ContainerStop to StopContainer (more consistent)
- reuses StopContainer in more places
- makes StopContainer no longer query for the container again (every caller already has it)
- drops the test for StopContainer (it only checked that we query a container, so it no longer tests anything relevant)

In doing so, we also fix a bug where stopping the sandbox would fail because "container is already stopped".

Signed-off-by: Peter Hunt <[email protected]>
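
The shape of that refactor can be sketched as follows, assuming a simplified `Container` type: the stop function takes the container it is given instead of looking it up again, and an already-stopped container is treated as a no-op so that stopping a whole sandbox does not fail on it. The types and the `stopSandbox` helper are hypothetical, not CRI-O's real signatures.

```go
package main

import "fmt"

// Container is a hypothetical, simplified container record.
type Container struct {
	ID      string
	Running bool
}

// StopContainer stops c, which the caller already holds and must be non-nil.
// Stopping an already stopped container is not an error in this sketch.
func StopContainer(c *Container) error {
	if !c.Running {
		return nil
	}
	c.Running = false
	return nil
}

// stopSandbox reuses StopContainer for every container in the sandbox.
func stopSandbox(containers []*Container) error {
	for _, c := range containers {
		if err := StopContainer(c); err != nil {
			return fmt.Errorf("stop container %s: %w", c.ID, err)
		}
	}
	return nil
}

func main() {
	ctrs := []*Container{
		{ID: "infra", Running: true},
		{ID: "app", Running: false}, // already stopped
	}
	fmt.Println(stopSandbox(ctrs))
}
```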

@openshift-ci-robot removed the lgtm label May 6, 2021
@haircommander
Member Author

/retest

@haircommander
Member Author

/test ci/prow/e2e-gcp

@openshift-ci-robot

@haircommander: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test kata-containers
  • /test e2e-agnostic
  • /test e2e-gcp
  • /test images
  • /test ami_fedora
  • /test ami_rhel
  • /test critest_fedora
  • /test critest_rhel
  • /test e2e_crun
  • /test e2e_cgroupv2
  • /test e2e_features_fedora
  • /test e2e_features_rhel
  • /test e2e_fedora
  • /test e2e_rhel
  • /test integration_crun
  • /test integration_cgroupv2
  • /test integration_fedora
  • /test integration_rhel

Use /test all to run the following jobs:

  • kata-containers-crio-PR
  • pull-ci-cri-o-cri-o-master-e2e-agnostic
  • pull-ci-cri-o-cri-o-master-e2e-gcp
  • pull-ci-cri-o-cri-o-master-images
  • test_pull_request_crio_critest_fedora
  • test_pull_request_crio_critest_rhel
  • test_pull_request_crio_e2e_crun_fedora
  • test_pull_request_crio_e2e_crun_fedora_cgroupv2
  • test_pull_request_crio_e2e_features_fedora
  • test_pull_request_crio_e2e_features_rhel
  • test_pull_request_crio_e2e_fedora
  • test_pull_request_crio_e2e_rhel
  • test_pull_request_crio_integration_crun_fedora
  • test_pull_request_crio_integration_crun_fedora_cgroupv2
  • test_pull_request_crio_integration_fedora
  • test_pull_request_crio_integration_rhel

In response to this:

/test ci/prow/e2e-gcp

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haircommander
Member Author

/test e2e-gcp

@mrunalp
Member

mrunalp commented May 7, 2021

/lgtm

@openshift-ci-robot added the lgtm label May 7, 2021
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Contributor

openshift-ci bot commented May 8, 2021

@haircommander: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/e2e_crun_cgroupv2 | 20354c7 | link | /test e2e_cgroupv2 |

Full PR test history. Your PR dashboard.

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit b96ea25 into cri-o:master May 8, 2021
@haircommander
Member Author

/cherry-pick release-1.21

@openshift-cherrypick-robot

@haircommander: new pull request created: #4884

In response to this:

/cherry-pick release-1.21
