
Conversation

@haircommander
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

this PR picks commits from ahem #4722 #4796 #4767 #4900 #4929 #4966 #4100 #5006

It introduces the InternalWipe feature, which fixes a bug where neither CNI del nor any other container cleanup function is retried if it fails (specifically on server startup).

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add the config field `internal_wipe`, which moves the responsibility of wiping containers after a reboot and images after an upgrade from the external binary `crio wipe` to the main crio server. This has a handful of advantages, the main one being that crio is now better able to clean up CNI resources after a reboot.

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Jun 17, 2021
@openshift-ci openshift-ci bot requested review from fidencio and giuseppe June 17, 2021 20:02
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2021
@saschagrunert
Member

Integration tests are not green'ish

@haircommander
Member Author

alright I believe I've fixed the tests

@haircommander
Member Author

/test e2e-aws

Member

@saschagrunert saschagrunert left a comment


LGTM

haircommander and others added 12 commits June 24, 2021 14:21
so all tests can properly access the runtime root without hardcoding

Signed-off-by: Peter Hunt <[email protected]>
as well as fix a bug where the resource store would clean up resources out of order

Signed-off-by: Peter Hunt <[email protected]>
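To illustrate why cleanup order matters, here is a minimal sketch, not CRI-O's actual resource store. It assumes that "in order" means the reverse of registration order, so that the most recently created resource is torn down first. The `cleanupStack` type and the resource names are hypothetical.

```go
package main

import "fmt"

// cleanupStack is a hypothetical stand-in for a resource cleaner. It records
// cleanup functions in the order the resources were created.
type cleanupStack struct {
	funcs []func() error
}

// Add registers a cleanup function for a resource that was just created.
func (s *cleanupStack) Add(f func() error) {
	s.funcs = append(s.funcs, f)
}

// Cleanup runs the registered functions in reverse registration order and
// stops at the first error, returning it to the caller.
func (s *cleanupStack) Cleanup() error {
	for i := len(s.funcs) - 1; i >= 0; i-- {
		if err := s.funcs[i](); err != nil {
			return err
		}
	}
	return nil
}

// teardownOrder registers three hypothetical resources and reports the order
// in which their cleanup functions ran.
func teardownOrder() []string {
	var order []string
	s := &cleanupStack{}
	for _, name := range []string{"sandbox", "network", "container"} {
		name := name // capture a per-iteration copy for the closure
		s.Add(func() error {
			order = append(order, name)
			return nil
		})
	}
	_ = s.Cleanup()
	return order
}

func main() {
	fmt.Println(teardownOrder())
}
```

Running the functions in registration order instead would try to tear down the sandbox while the container that depends on it still exists.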
or else the cleanup will always fail when it runs after a context deadline has been exceeded

Signed-off-by: Peter Hunt <[email protected]>
We may encounter situations where sandboxes get killed. In this case,
we now try to clean up the network on removal of the sandbox to ensure
that no resources (like networks) are left stale on the machine.

Signed-off-by: Sascha Grunert <[email protected]>
as it was duplicated pretty much with DeleteContainer

Signed-off-by: Peter Hunt <[email protected]>
which allows us to not have to query for the image or pod each time we want to stop or remove it

Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
and update tests to use containers with an image

Signed-off-by: Peter Hunt <[email protected]>
this logically groups two related functions: stopping network and cleaning up namespaces

We do this to allow network stop failures to be retried, as we'll fail to clean up the network if the network namespace was removed first

Signed-off-by: Peter Hunt <[email protected]>
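The ordering constraint described in this commit can be sketched as follows. This is an illustrative model only: the `sandbox` type, its fields, and `retryCleanup` are hypothetical and do not mirror CRI-O's real types. The point is that the namespace is removed only after the network stop succeeds, so a failed attempt leaves enough state behind for a retry.

```go
package main

import (
	"errors"
	"fmt"
)

// sandbox is a hypothetical model of the state relevant to network teardown.
type sandbox struct {
	networkStopped bool
	nsRemoved      bool
	stopFailures   int // how many times the simulated network stop fails first
}

// stopNetwork simulates a network teardown that needs the namespace to exist.
func (s *sandbox) stopNetwork() error {
	if s.nsRemoved {
		return errors.New("network namespace is gone, cannot tear down network")
	}
	if s.stopFailures > 0 {
		s.stopFailures--
		return errors.New("transient network stop failure")
	}
	s.networkStopped = true
	return nil
}

// removeNamespaces simulates removing the sandbox's namespaces.
func (s *sandbox) removeNamespaces() error {
	s.nsRemoved = true
	return nil
}

// cleanupNetworkAndNamespaces groups both steps. If the network stop fails,
// it returns before touching the namespaces, so the call can be retried.
func (s *sandbox) cleanupNetworkAndNamespaces() error {
	if !s.networkStopped {
		if err := s.stopNetwork(); err != nil {
			return fmt.Errorf("stop network: %w", err)
		}
	}
	return s.removeNamespaces()
}

// retryCleanup retries the grouped cleanup and reports the attempts used.
func retryCleanup(s *sandbox, attempts int) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = s.cleanupNetworkAndNamespaces(); err == nil {
			return i, nil
		}
	}
	return attempts, err
}

func main() {
	s := &sandbox{stopFailures: 2}
	n, err := retryCleanup(s, 5)
	fmt.Println(n, err, s.networkStopped, s.nsRemoved)
}
```

Had the two steps been separate cleanup functions that could run in either order, a namespace removal that ran first would make every later network stop attempt fail.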
WaitContainerStopped will always fail if the ctx has expired, which it always could be when running the cleanupFuncs on deadline exceeded

Signed-off-by: Peter Hunt <[email protected]>
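A small sketch of the problem described here, assuming the fix is to run cleanup with a context that is not derived from the expired request context. `waitStopped` is a hypothetical stand-in for a context-aware wait such as WaitContainerStopped; the one-second cleanup timeout is an arbitrary choice for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitStopped is a hypothetical wait that honors its context: it fails as
// soon as the context is done, otherwise it succeeds after a short delay.
func waitStopped(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

// cleanup calls waitStopped twice: once with the request context, and once
// with a fresh context that has its own timeout.
func cleanup(requestCtx context.Context) (withRequestCtx, withFreshCtx error) {
	withRequestCtx = waitStopped(requestCtx)

	freshCtx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	withFreshCtx = waitStopped(freshCtx)
	return withRequestCtx, withFreshCtx
}

// runDemo simulates a request whose deadline has already been exceeded and
// then runs cleanup against it.
func runDemo() (withRequestCtx, withFreshCtx error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done() // wait until the deadline has passed
	return cleanup(ctx)
}

func main() {
	a, b := runDemo()
	fmt.Println(errors.Is(a, context.DeadlineExceeded), b)
}
```

The call that reuses the expired context can never succeed, which is exactly the situation cleanup funcs are in when they run because the deadline was exceeded.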
if we are cleaning up a sandbox in parallel, we may run into situations where
we clean up the infra directory before we stop the sandbox's network.

That should not be fatal

Signed-off-by: Peter Hunt <[email protected]>
This commit refactors a couple of things:
- rename ContainerStop to StopContainer (more consistent)
- reuses StopContainer more
- make StopContainer not query for the container again (every place it's called already has it)
- drops the test for StopContainer (it only checks that we query a container, so it's no longer testing anything relevant)

in doing so, we also fix a bug where stopping the sandbox would fail because "container is already stopped"

Signed-off-by: Peter Hunt <[email protected]>
Before, we would not actually call CNI DEL on sandboxes that failed to restore.
This was because we iterated through a map of *successfully* restored sandboxes, and checked if they failed.

Instead, we need to rework LoadSandbox a bit to return a sandbox that was loaded.
Even if the restore fails, we can pass the shell of a sandbox to the CNI plugin for best-effort cleanup

Signed-off-by: Peter Hunt <[email protected]>
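The idea of returning a partially loaded sandbox alongside the error can be sketched like this. Everything here is hypothetical (`loadSandbox`, `stored`, `restoreAll`, and the fields of `sandbox`); it only shows the pattern of returning a non-nil value together with an error so the caller can still attempt teardown.

```go
package main

import (
	"errors"
	"fmt"
)

// sandbox is a hypothetical, heavily reduced sandbox record.
type sandbox struct {
	id       string
	restored bool
}

// stored describes a sandbox found on disk at startup. metadataOK simulates
// whether its metadata can be fully restored.
type stored struct {
	id         string
	metadataOK bool
}

var errRestore = errors.New("restore failed")

// loadSandbox always returns whatever part of the sandbox it could load,
// even when it also returns an error.
func loadSandbox(item stored) (*sandbox, error) {
	sb := &sandbox{id: item.id}
	if !item.metadataOK {
		return sb, fmt.Errorf("sandbox %s: %w", item.id, errRestore)
	}
	sb.restored = true
	return sb, nil
}

// restoreAll loads every stored sandbox. For the ones that fail, it uses the
// returned shell to record a best-effort network cleanup attempt.
func restoreAll(items []stored) (restored, cleaned []string) {
	for _, item := range items {
		sb, err := loadSandbox(item)
		if err != nil {
			// A real server would call its network teardown here.
			cleaned = append(cleaned, sb.id)
			continue
		}
		restored = append(restored, sb.id)
	}
	return restored, cleaned
}

func main() {
	restored, cleaned := restoreAll([]stored{{"a", true}, {"b", false}})
	fmt.Println("restored:", restored, "cleanup attempted:", cleaned)
}
```

Iterating only over the successfully restored sandboxes, as described above, would never reach sandbox `b` in this example.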
as it should not be a fatal error, as it indicates we've already cleaned up

Signed-off-by: Peter Hunt <[email protected]>
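One common way to treat "already cleaned up" as success is to ignore not-exist errors during removal. The sketch below uses only the Go standard library; `removeIfPresent` is a hypothetical helper, and whether this commit applies the check to a file, a namespace, or another resource is not stated in the message.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// removeIfPresent deletes path and treats "does not exist" as success, since
// that means an earlier cleanup already removed it.
func removeIfPresent(path string) error {
	if err := os.Remove(path); err != nil && !errors.Is(err, fs.ErrNotExist) {
		return err
	}
	return nil
}

// demo creates a temporary file and removes it twice, returning both results.
func demo() (first, second error) {
	dir, err := os.MkdirTemp("", "cleanup-demo")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "resource")
	if err := os.WriteFile(path, []byte("x"), 0o600); err != nil {
		return err, err
	}
	first = removeIfPresent(path)
	second = removeIfPresent(path) // the file is already gone here
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```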
Before, we called CNI del sequentially, and only after seeing that a call failed did we register it as a cleanup func.

The problem is that this significantly slows down server startup. A process that used to take effectively no time now takes a couple of minutes.

Instead, we immediately register the cleanup funcs and call Cleanup() in a separate goroutine, and proceed to allow the server to start

Signed-off-by: Peter Hunt <[email protected]>
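The startup change can be sketched as follows, under the assumption that the cleanup funcs are independent of the rest of server startup. `slowTeardown` and `startup` are hypothetical names, and the 20 ms sleep stands in for a slow CNI del call.

```go
package main

import (
	"fmt"
	"time"
)

// slowTeardown is a hypothetical stand-in for a slow network teardown call.
func slowTeardown(id string) error {
	time.Sleep(20 * time.Millisecond)
	return nil
}

// startup registers one cleanup func per sandbox that needs teardown, runs
// them in a background goroutine, and returns immediately. The returned
// channel yields the number of cleanups that succeeded.
func startup(ids []string) <-chan int {
	var funcs []func() error
	for _, id := range ids {
		id := id // capture a per-iteration copy for the closure
		funcs = append(funcs, func() error { return slowTeardown(id) })
	}

	done := make(chan int, 1)
	go func() {
		succeeded := 0
		for _, f := range funcs {
			if f() == nil {
				succeeded++
			}
		}
		done <- succeeded
	}()
	return done
}

func main() {
	begin := time.Now()
	done := startup([]string{"a", "b", "c"})
	fmt.Println("server ready after", time.Since(begin).Round(time.Millisecond))
	fmt.Println("background cleanups finished:", <-done)
}
```

The server reports ready without waiting for the teardown calls; the channel is only there so the example can observe that the background work eventually completes.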
as we don't know if the server previously started with namespaces managed, and then switched to have them not be managed

Signed-off-by: Peter Hunt <[email protected]>
Member

@saschagrunert saschagrunert left a comment


/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 25, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 25, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.


3 participants