[1.20] Introduce all of internal wipe #5014

haircommander · 2021-06-17T20:02:27Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR picks commits from, ahem, #4722, #4796, #4767, #4900, #4929, #4966, #4100, and #5006.

It introduces the InternalWipe feature, which fixes a bug where neither CNI del nor any other container cleanup function is retried if it fails (specifically on server startup).

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add the config field `internal_wipe`, which moves the responsibility for wiping containers after a reboot and images after an upgrade from the external binary `crio wipe` to the main crio server. This has a handful of advantages, the main one being that crio is now better able to clean up CNI resources after a reboot.

saschagrunert · 2021-06-18T06:21:56Z

Integration tests are not green'ish

haircommander · 2021-06-22T20:01:47Z

alright I believe I've fixed the tests

haircommander · 2021-06-23T16:10:14Z

/test e2e-aws

saschagrunert

LGTM

We may encounter situations where sandboxes get killed. In this case, we now try to clean up the network on removal of the sandbox to ensure that no resources (like networks) are left stale on the machine. Signed-off-by: Sascha Grunert <[email protected]>

This logically groups two related functions: stopping the network and cleaning up namespaces. We do this to allow network stop failures to be retried, as we'll fail to clean up the network if the network namespace was removed first. Signed-off-by: Peter Hunt <[email protected]>
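The ordering constraint in this commit message can be illustrated with a small model. This is only a sketch under assumptions: `sandboxState`, `stopNetwork`, `cleanupNetworkAndNamespaces`, and `retryGroupedCleanup` are hypothetical names, not CRI-O functions, and the failure counter merely simulates a flaky network stop.

```go
package main

import "errors"

// sandboxState is a hypothetical model of the two resources involved.
type sandboxState struct {
	networkUp    bool
	netnsExists  bool
	failuresLeft int // how many times the simulated network stop still fails
}

// stopNetwork can only succeed while the network namespace still exists.
func (s *sandboxState) stopNetwork() error {
	if !s.netnsExists {
		return errors.New("network namespace already removed")
	}
	if s.failuresLeft > 0 {
		s.failuresLeft--
		return errors.New("network stop failed")
	}
	s.networkUp = false
	return nil
}

// cleanupNetworkAndNamespaces groups both steps so that a retry of the whole
// unit always attempts the network stop before touching the namespace.
func (s *sandboxState) cleanupNetworkAndNamespaces() error {
	if err := s.stopNetwork(); err != nil {
		return err // the namespace is kept so a retry can still reach the network
	}
	s.netnsExists = false
	return nil
}

// retryGroupedCleanup retries the grouped step and reports how many attempts
// were made and whether both resources ended up released.
func retryGroupedCleanup(failures, maxAttempts int) (attempts int, clean bool) {
	s := &sandboxState{networkUp: true, netnsExists: true, failuresLeft: failures}
	for attempts < maxAttempts {
		attempts++
		if err := s.cleanupNetworkAndNamespaces(); err == nil {
			break
		}
	}
	return attempts, !s.networkUp && !s.netnsExists
}
```

Because the namespace is only removed after the network stop succeeds, a failed attempt leaves the sandbox in a state that can still be retried.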

This commit refactors a couple of things:
- rename ContainerStop to StopContainer (more consistent)
- reuses StopContainer more
- make StopContainer not query for the container again (every place it's called already has it)
- drops the test for StopContainer (it only checks that we query a container, so it's no longer testing anything relevant)

in doing so, we also fix a bug where stopping the sandbox would fail because "container is already stopped"

Signed-off-by: Peter Hunt <[email protected]>

Before, we would not actually call CNI DEL on sandboxes that failed to restore. This was because we iterated through a map of *successfully* restored sandboxes, and checked if they failed. Instead, we need to rework LoadSandbox a bit to return a sandbox that was loaded. Even if the restore fails, we can pass the shell of a sandbox to the CNI plugin for best-effort cleanup. Signed-off-by: Peter Hunt <[email protected]>
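A rough sketch of the reworked pattern, assuming a loader that returns whatever it managed to load alongside the error. The types and functions here (`sandbox`, `storedSandbox`, `loadSandbox`, `restoreAll`, `demoRestore`) are hypothetical and do not mirror CRI-O's actual LoadSandbox signature.

```go
package main

import "errors"

// sandbox is a hypothetical, minimal sandbox record.
type sandbox struct {
	ID       string
	Complete bool // false when only a shell could be loaded
}

// storedSandbox describes an on-disk sandbox; Corrupt simulates a restore failure.
type storedSandbox struct {
	ID      string
	Corrupt bool
}

// loadSandbox always returns the sandbox shell, even when the restore fails,
// so that the caller still has an ID to hand to network cleanup.
func loadSandbox(id string, corrupt bool) (*sandbox, error) {
	sb := &sandbox{ID: id}
	if corrupt {
		return sb, errors.New("restoring sandbox " + id + " failed")
	}
	sb.Complete = true
	return sb, nil
}

// restoreAll separates restored sandboxes from those that need best-effort
// cleanup. The failed ones are taken from the loader's return value rather
// than from a map that only contains successes.
func restoreAll(stored []storedSandbox) (restored, needsCleanup []string) {
	for _, st := range stored {
		sb, err := loadSandbox(st.ID, st.Corrupt)
		if err != nil {
			needsCleanup = append(needsCleanup, sb.ID)
			continue
		}
		restored = append(restored, sb.ID)
	}
	return restored, needsCleanup
}

// demoRestore runs restoreAll on one healthy and one corrupt sandbox.
func demoRestore() (restored, needsCleanup []string) {
	return restoreAll([]storedSandbox{{ID: "a"}, {ID: "b", Corrupt: true}})
}
```

In this model, the corrupt sandbox "b" ends up in the cleanup list instead of being silently skipped.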

Before, we were sequentially calling CNI del, seeing that the calls failed, and then calling them as cleanup funcs. The problem is that this significantly slows down server startup: a process that used to take effectively no time takes a couple of minutes. Instead, we immediately register the cleanup funcs and call Cleanup() in a separate goroutine, and proceed to allow the server to start. Signed-off-by: Peter Hunt <[email protected]>
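The startup change described here can be sketched as follows. The names `startServer` and `startupIsNotBlocked` are illustrative assumptions, not CRI-O code; the channel only exists so the example can observe when the background cleanup has finished.

```go
package main

// startServer launches the registered cleanup functions in a separate
// goroutine and returns immediately. The returned channel is closed once all
// cleanup functions have run.
func startServer(cleanups []func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		for _, fn := range cleanups {
			fn()
		}
	}()
	return done
}

// startupIsNotBlocked reports whether startServer returned while a slow
// cleanup function was still pending.
func startupIsNotBlocked() bool {
	release := make(chan struct{})
	slowCleanup := func() { <-release } // blocks until the caller releases it

	done := startServer([]func(){slowCleanup})

	returnedFirst := false
	select {
	case <-done:
		// cleanup already finished, so startup must have waited for it
	default:
		returnedFirst = true
	}

	close(release) // let the background cleanup finish
	<-done
	return returnedFirst
}
```

Here startup returns while the slow cleanup is still pending, which is the behavior the commit message aims for.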

saschagrunert

/lgtm

openshift-ci · 2021-06-25T07:03:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

haircommander requested review from mrunalp and runcom as code owners June 17, 2021 20:02

openshift-ci bot added the release-note, kind/bug, and dco-signoff: yes labels Jun 17, 2021

openshift-ci bot requested review from fidencio and giuseppe June 17, 2021 20:02

openshift-ci bot added the approved label Jun 17, 2021

haircommander force-pushed the internal-wipe-1.20 branch from 6211abd to 31193c9 June 17, 2021 20:10

saschagrunert approved these changes Jun 24, 2021

haircommander and others added 12 commits June 24, 2021 14:21

test: add runtime() function

630f276

so all tests can properly access the runtime root without hardcoding Signed-off-by: Peter Hunt <[email protected]>

resourcestore: introduce ResourceCleaner

176cd4a

as well as fix a bug where the resource store would cleanup the resource out of order Signed-off-by: Peter Hunt <[email protected]>
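To illustrate what ordered cleanup can look like, here is a minimal sketch. It assumes that "in order" means releasing resources in the reverse order of their registration, which the commit message does not state explicitly. `Cleaner` and `cleanupOrder` are hypothetical and are not the ResourceCleaner type from this commit.

```go
package main

// Cleaner is a hypothetical stand-in for a resource cleaner; it is not
// CRI-O's actual type.
type Cleaner struct {
	funcs []func() error
}

// Add registers a cleanup function. Newer functions are placed first so that
// resources are released in the reverse order of their creation.
func (c *Cleaner) Add(fn func() error) {
	c.funcs = append([]func() error{fn}, c.funcs...)
}

// Cleanup runs every registered function and returns the first error seen.
func (c *Cleaner) Cleanup() error {
	var firstErr error
	for _, fn := range c.funcs {
		if err := fn(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

// cleanupOrder registers three steps in creation order and reports the order
// in which Cleanup ran them.
func cleanupOrder() []string {
	var ran []string
	c := &Cleaner{}
	for _, name := range []string{"namespace", "network", "container"} {
		name := name // capture the per-iteration value
		c.Add(func() error {
			ran = append(ran, name)
			return nil
		})
	}
	_ = c.Cleanup()
	return ran
}
```

With this sketch, the step registered last ("container") is cleaned up first and the step registered first ("namespace") is cleaned up last.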

server: use background context for network stop

2cb0840

or else the cleanup will always fail on cleanup from context deadline Signed-off-by: Peter Hunt <[email protected]>
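The reasoning behind this commit can be shown with the standard `context` package. In this sketch, `stopNetwork` is a hypothetical placeholder for a teardown call that honors cancellation, and the one-minute timeout is an arbitrary illustrative value.

```go
package main

import (
	"context"
	"time"
)

// stopNetwork is a hypothetical placeholder for a network teardown call that
// refuses to run once its context is done.
func stopNetwork(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

// cleanupAfterDeadline runs the teardown twice after the request context has
// expired: once with the expired context and once with a fresh context
// derived from context.Background().
func cleanupAfterDeadline() (withRequestCtx, withBackgroundCtx error) {
	reqCtx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	<-reqCtx.Done() // wait until the request deadline has passed

	withRequestCtx = stopNetwork(reqCtx)

	bgCtx, bgCancel := context.WithTimeout(context.Background(), time.Minute)
	defer bgCancel()
	withBackgroundCtx = stopNetwork(bgCtx)

	return withRequestCtx, withBackgroundCtx
}
```

Reusing the expired request context makes the cleanup fail with a deadline error before it does any work, while the background-derived context lets it proceed.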

server: reuse container removal code for infra

2e21735

Signed-off-by: Peter Hunt <[email protected]>

storage: remove RemovePodSandbox function

621d296

as it was duplicated pretty much with DeleteContainer Signed-off-by: Peter Hunt <[email protected]>

server: breakup stop/remove all functions with internal helpers

3194c02

which allows us to not have to query for the image or pod each time we want to stop or remove it Signed-off-by: Peter Hunt <[email protected]>

config: add InternalWipe

286c407

Signed-off-by: Peter Hunt <[email protected]>

crio wipe: add support for internal_wipe

5e6e0df

Signed-off-by: Peter Hunt <[email protected]>

server: add support for internal_wipe

4f481e5

Signed-off-by: Peter Hunt <[email protected]>

test: add test for internal_wipe

4674b87

and update tests to use containers with an image Signed-off-by: Peter Hunt <[email protected]>

Add resource cleaner retry functionality

d5465df

Signed-off-by: Sascha Grunert <[email protected]>
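One possible shape for retrying a cleanup function is sketched below. The attempt count, the doubling delay, and the names `retry` and `flakyCleanup` are assumptions made for illustration; the commit itself may use a different policy.

```go
package main

import (
	"errors"
	"time"
)

// retry calls fn until it succeeds or the attempts are exhausted, sleeping
// between tries with a doubling delay. It returns the last error seen.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

// flakyCleanup simulates a cleanup that fails a fixed number of times before
// succeeding, and reports how often it was called.
func flakyCleanup(failures int) (calls int, err error) {
	err = retry(5, time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errors.New("simulated cleanup failure")
		}
		return nil
	})
	return calls, err
}
```

A cleanup that fails twice succeeds on the third call, while one that keeps failing gives up after five calls and surfaces the last error.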

haircommander force-pushed the internal-wipe-1.20 branch from b6e1586 to 30fcb67 June 24, 2021 18:22

haircommander added 4 commits June 24, 2021 15:17

server: move newPodNetwork to a more logical place

df61d59

Signed-off-by: Peter Hunt <[email protected]>

server: get hooks after we've check if a sandbox is already stopped

4863486

Signed-off-by: Peter Hunt <[email protected]>

test: add test for delayed cleanup of network on restart

8af122e

Signed-off-by: Peter Hunt <[email protected]>

resourcestore: run cleanup in parallel

2048b2f

Signed-off-by: Peter Hunt <[email protected]>
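Running cleanup functions in parallel could look roughly like this, using `sync.WaitGroup` from the standard library. `runParallel` and `countFailures` are hypothetical helpers written for this sketch, not the resourcestore implementation.

```go
package main

import (
	"errors"
	"sync"
)

// runParallel runs every cleanup function in its own goroutine, waits for all
// of them to finish, and returns the errors that occurred.
func runParallel(funcs []func() error) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex // protects errs
		errs []error
	)
	for _, fn := range funcs {
		wg.Add(1)
		go func(fn func() error) {
			defer wg.Done()
			if err := fn(); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(fn)
	}
	wg.Wait()
	return errs
}

// countFailures builds total cleanup functions, of which the first failing
// ones return an error, and reports how many errors runParallel collected.
func countFailures(total, failing int) int {
	funcs := make([]func() error, 0, total)
	for i := 0; i < total; i++ {
		i := i // capture the per-iteration value
		funcs = append(funcs, func() error {
			if i < failing {
				return errors.New("simulated cleanup failure")
			}
			return nil
		})
	}
	return len(runParallel(funcs))
}
```

One failing function does not stop the others from running, which matters when independent resources are being cleaned up at the same time.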

haircommander added 16 commits June 24, 2021 15:17

server: don't unconditionally fail on sandbox cleanup

7b21821

WaitContainerStopped will always fail if the ctx has expired, which it always could be when running the cleanupFuncs on deadline exceeded Signed-off-by: Peter Hunt <[email protected]>

sandbox: fix race with cleanup

62cbb3f

If we are cleaning up a sandbox in parallel, we may run into situations where we clean up the infra directory before we stop the sandbox's network. That should not be fatal. Signed-off-by: Peter Hunt <[email protected]>

sandbox remove: unmount shm before removing infra container

2b0cac3

Signed-off-by: Peter Hunt <[email protected]>

move internal wipe to only wipe images

dacd69a

Signed-off-by: Peter Hunt <[email protected]>

ignore storage.ErrNotAContainer

a961c0d

Signed-off-by: Peter Hunt <[email protected]>

test: adapt crio wipe tests to handle new behavior

b46e733

Signed-off-by: Peter Hunt <[email protected]>

container_server: fix nsJoining

433d87a

Signed-off-by: Peter Hunt <[email protected]>

storage: succeed in DeleteContainer if container is unknown

a611724

Signed-off-by: Peter Hunt <[email protected]>

server: don't repeatedly error with no such id

d294d7e

as it should not be a fatal error, as it indicates we've already cleaned up Signed-off-by: Peter Hunt <[email protected]>
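Treating "no such id" as a sign that cleanup already happened can be sketched with `errors.Is`. The sentinel `errNoSuchID`, the `store` type, and the functions below are hypothetical; they only illustrate the idea of an idempotent removal.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSuchID is a hypothetical sentinel error for an unknown ID.
var errNoSuchID = errors.New("no such id")

// store is a hypothetical in-memory record of existing resources.
type store struct {
	items map[string]bool
}

// remove deletes the resource and fails if the ID is unknown.
func (s *store) remove(id string) error {
	if !s.items[id] {
		return fmt.Errorf("removing %q: %w", id, errNoSuchID)
	}
	delete(s.items, id)
	return nil
}

// removeIdempotent treats "no such id" as success, because it means the
// resource has already been cleaned up. Any other error is still returned.
func removeIdempotent(s *store, id string) error {
	if err := s.remove(id); err != nil && !errors.Is(err, errNoSuchID) {
		return err
	}
	return nil
}

// removeTwice removes the same sandbox twice through the idempotent wrapper
// and then once more through the raw call for comparison.
func removeTwice() (first, second, raw error) {
	s := &store{items: map[string]bool{"sandbox-1": true}}
	first = removeIdempotent(s, "sandbox-1")
	second = removeIdempotent(s, "sandbox-1")
	raw = s.remove("sandbox-1")
	return first, second, raw
}
```

The second idempotent removal succeeds even though the raw call reports an error, so a retried cleanup does not keep failing on work that is already done.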

server: reduce log verbosity on restore

b9a58ad

Signed-off-by: Peter Hunt <[email protected]>

fix lint by removing dead code

7d6a05a

Signed-off-by: Peter Hunt <[email protected]>

lib: unconditionally attempt to restore namespaces

f65e248

as we don't know if the server previously started with namespaces managed, and then switched to have them not be managed Signed-off-by: Peter Hunt <[email protected]>

haircommander force-pushed the internal-wipe-1.20 branch from 30fcb67 to f65e248 June 24, 2021 19:18

saschagrunert approved these changes Jun 25, 2021

openshift-ci bot assigned saschagrunert Jun 25, 2021

openshift-ci bot added the lgtm label Jun 25, 2021

openshift-merge-robot merged commit 0d0f863 into cri-o:release-1.20 Jun 25, 2021

openshift-ci bot mentioned this pull request Aug 30, 2021

[release-1.20] BZ#2010831 Fix missing quantile in latency_microseconds_total metrics #5265

Merged
