[1.20] Introduce *all of* internal wipe #5014
6211abd to
31193c9
Compare
Integration tests are not green'ish
alright I believe I've fixed the tests
/test e2e-aws
saschagrunert
left a comment
LGTM
so all tests can properly access the runtime root without hardcoding Signed-off-by: Peter Hunt <[email protected]>
as well as fix a bug where the resource store would cleanup the resource out of order Signed-off-by: Peter Hunt <[email protected]>
or else the cleanup will always fail on cleanup from context deadline Signed-off-by: Peter Hunt <[email protected]>
We may encounter situations where sandboxes get killed. In this case, we now try to clean up the network on removal of the sandbox to ensure that no resources (like networks) are left stale on the machine. Signed-off-by: Sascha Grunert <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
as it was duplicated pretty much with DeleteContainer Signed-off-by: Peter Hunt <[email protected]>
which allows us to not have to query for the image or pod each time we want to stop or remove it Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
and update tests to use containers with an image Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Sascha Grunert <[email protected]>
b6e1586 to
30fcb67
Compare
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
this logically groups two related functions: stopping the network and cleaning up namespaces. We do this to allow network stop failures to be retried, as we'll fail to clean up the network if the network namespace was removed first. Signed-off-by: Peter Hunt <[email protected]>
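The ordering constraint described in this commit message can be illustrated with a minimal sketch. The `sandbox` type, its fields, and `cleanupNetworkAndNamespace` are hypothetical names made up for illustration, not CRI-O's actual code. The assumption is simply that the namespace is removed only after the network stop succeeds, so a failed call can be retried later.

```go
package main

import (
	"errors"
	"fmt"
)

// sandbox is a hypothetical stand-in for a pod sandbox (not CRI-O's type).
type sandbox struct {
	networkStopped bool
	nsRemoved      bool
}

// stopNetwork simulates a CNI DEL. It cannot succeed once the network
// namespace is gone. The fail flag injects a transient failure.
func (s *sandbox) stopNetwork(fail bool) error {
	if s.networkStopped {
		return nil
	}
	if s.nsRemoved {
		return errors.New("network namespace already removed")
	}
	if fail {
		return errors.New("transient CNI failure")
	}
	s.networkStopped = true
	return nil
}

// removeNamespace simulates removing the sandbox's network namespace.
func (s *sandbox) removeNamespace() { s.nsRemoved = true }

// cleanupNetworkAndNamespace groups both steps. The namespace is removed
// only after the network stop succeeds, so a failed call stays retriable.
func cleanupNetworkAndNamespace(s *sandbox, failNetwork bool) error {
	if err := s.stopNetwork(failNetwork); err != nil {
		return fmt.Errorf("stop network: %w", err)
	}
	s.removeNamespace()
	return nil
}

func main() {
	s := &sandbox{}
	fmt.Println("first attempt:", cleanupNetworkAndNamespace(s, true))
	fmt.Println("retry:", cleanupNetworkAndNamespace(s, false))
}
```

Because the first attempt returns before touching the namespace, the retry still has a namespace to run the network teardown against.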
WaitContainerStopped will always fail if the ctx has expired, which it always could be when running the cleanupFuncs on deadline exceeded Signed-off-by: Peter Hunt <[email protected]>
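A minimal sketch of the problem described here, and of one possible way around it. `waitStopped` is a hypothetical stand-in for `WaitContainerStopped`, and deriving a fresh context for cleanup is an assumption made for illustration; the commit message does not say which fix was chosen.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitStopped is a hypothetical stand-in for WaitContainerStopped:
// it fails immediately if the context is already done.
func waitStopped(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

// expiredContext returns a context that is already canceled, mimicking
// the request context seen by cleanup funcs after a deadline was exceeded.
func expiredContext() context.Context {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return ctx
}

// cleanupWithFreshContext ignores the (possibly expired) request context
// and derives a new one with its own timeout for the cleanup work.
func cleanupWithFreshContext(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitStopped(ctx)
}

func main() {
	fmt.Println("expired ctx:", waitStopped(expiredContext()))
	fmt.Println("fresh ctx:", cleanupWithFreshContext(time.Second))
}
```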
if we are cleaning up a sandbox in parallel, we may run into situations where we cleanup the infra directory before we stop the sandbox's network. That should not be fatal Signed-off-by: Peter Hunt <[email protected]>
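One common Go pattern for treating "already gone" as non-fatal is to check for `os.ErrNotExist`. The sketch below assumes the failure surfaces as a missing directory; `statInfraDir` is a hypothetical helper, not a CRI-O function.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// statInfraDir reports whether dir exists. A missing directory is not an
// error here: it is taken to mean a parallel cleanup already removed it.
func statInfraDir(dir string) (bool, error) {
	_, err := os.Stat(dir)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	exists, err := statInfraDir("/no/such/infra/dir")
	fmt.Println(exists, err)
}
```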
This commit refactors a couple of things:
- rename ContainerStop to StopContainer (more consistent)
- reuses StopContainer more
- make StopContainer not query for the container again (every place it's called already has it)
- drops the test for StopContainer (it only checks that we query a container, so it's no longer testing anything relevant)
in doing so, we also fix a bug where stopping the sandbox would fail because "container is already stopped"
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Before, we would not actually call CNI DEL on sandboxes that failed to restore. This was because we iterated through a map of *successfully* restored sandboxes, and checked if they failed. Instead, we need to rework LoadSandbox a bit to return a sandbox that was loaded. Even if the restore fails, we can pass the shell of a sandbox to the CNI plugin for best-effort cleanup. Signed-off-by: Peter Hunt <[email protected]>
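The idea of returning a sandbox "shell" even when the restore fails can be sketched as follows. All names here (`sandbox`, `loadSandbox`, `restoreAll`, the `corrupt` set) are hypothetical and only illustrate the control flow; the real `LoadSandbox` signature may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// sandbox is a hypothetical, minimal sandbox record.
type sandbox struct {
	id       string
	restored bool
}

// loadSandbox always returns a sandbox carrying at least its ID. When the
// restore fails, the partially filled "shell" is returned with the error.
func loadSandbox(id string, corrupt bool) (*sandbox, error) {
	sb := &sandbox{id: id}
	if corrupt {
		return sb, errors.New("restore failed for sandbox " + id)
	}
	sb.restored = true
	return sb, nil
}

// restoreAll splits sandboxes into those restored successfully and those
// whose shells should be handed to network cleanup on a best-effort basis.
func restoreAll(ids []string, corrupt map[string]bool) (restored, toCleanup []*sandbox) {
	for _, id := range ids {
		sb, err := loadSandbox(id, corrupt[id])
		if err != nil {
			toCleanup = append(toCleanup, sb)
			continue
		}
		restored = append(restored, sb)
	}
	return restored, toCleanup
}

func main() {
	restored, toCleanup := restoreAll([]string{"a", "b"}, map[string]bool{"b": true})
	fmt.Println(len(restored), "restored,", len(toCleanup), "queued for cleanup")
}
```

The failed sandbox never enters the restored set, yet its ID is still available to whatever performs the network teardown.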
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
as it should not be a fatal error, as it indicates we've already cleaned up Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Before, we were sequentially calling CNI DEL before seeing that the calls failed and calling them again as cleanup funcs. The problem is that this significantly slows down server startup: a process that used to take effectively no time takes a couple of minutes. Instead, we immediately register the cleanup funcs and call Cleanup() in a separate goroutine, and proceed to allow the server to start. Signed-off-by: Peter Hunt <[email protected]>
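A minimal sketch of registering cleanup funcs up front and running them in a background goroutine so startup is not blocked. The `cleaner` type, `cleanupInBackground`, and `errDemo` are hypothetical and do not reflect CRI-O's resource store API; the done channel is an illustrative addition so a caller can observe the result.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDemo simulates a failing cleanup call.
var errDemo = errors.New("simulated CNI DEL failure")

// cleaner is a hypothetical collection of registered cleanup funcs.
type cleaner struct {
	mu    sync.Mutex
	funcs []func() error
}

// Add registers a cleanup func without running it.
func (c *cleaner) Add(f func() error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.funcs = append(c.funcs, f)
}

// Cleanup runs every registered func once, in registration order,
// and returns how many of them failed.
func (c *cleaner) Cleanup() (failed int) {
	c.mu.Lock()
	funcs := c.funcs
	c.funcs = nil
	c.mu.Unlock()
	for _, f := range funcs {
		if err := f(); err != nil {
			failed++
		}
	}
	return failed
}

// cleanupInBackground starts Cleanup in its own goroutine and returns a
// channel that delivers the failure count when it finishes.
func cleanupInBackground(c *cleaner) <-chan int {
	done := make(chan int, 1)
	go func() { done <- c.Cleanup() }()
	return done
}

func main() {
	c := &cleaner{}
	c.Add(func() error { return nil })
	c.Add(func() error { return errDemo })
	done := cleanupInBackground(c)
	fmt.Println("server starting while cleanup runs")
	fmt.Println("failed cleanups:", <-done)
}
```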
Signed-off-by: Peter Hunt <[email protected]>
as we don't know if the server previously started with namespaces managed, and then switched to have them not be managed Signed-off-by: Peter Hunt <[email protected]>
30fcb67 to
f65e248
Compare
saschagrunert
left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: haircommander, saschagrunert
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
What type of PR is this?
/kind bug
What this PR does / why we need it:
this PR picks commits from ahem #4722 #4796 #4767 #4900 #4929 #4966 #4100 #5006
It introduces the InternalWipe feature, which fixes a bug where CNI DEL (or any other container cleanup function) is not retried if it fails (specifically on server startup).
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?