[1.19] improve timeout handling #4444
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: haircommander
|
/hold Putting a hold as we want to make sure this is tested well before merging into 1.19 |
|
/hold |
Codecov Report
@@               Coverage Diff                @@
##           release-1.19    #4444      +/-   ##
================================================
+ Coverage         40.87%   40.89%   +0.02%
================================================
  Files               114      115       +1
  Lines              8744     8795      +51
================================================
+ Hits               3574     3597      +23
- Misses             4807     4834      +27
- Partials            363      364       +1
|
@haircommander: The following tests failed, say
Signed-off-by: Peter Hunt <[email protected]>
If we create the network before we have an infra container, but fail to fully create a sandbox, we attempt to clean up the network. Calling networkStop() causes CRI-O to place a file in the sandbox's infra container's directory, thus allowing us to restore the fact that the network had been stopped.
The problem is that we don't have an infra container directory, so the call segfaults.
Instead, check if the sandbox has finished creating before attempting to create the file. If it hasn't, there will be no sandbox to restore, so we don't really need the temp file.
Another option would be to wire it so that the sandbox has access to the infraContainer.Dir() without actually having an infra container. That requires another item in libsandbox.New(), which I find cumbersome. Further, I think the sandbox creation code is itching for a refactor, which can include that fix if we find it desirable. In the meantime, this workaround is sufficient.
Signed-off-by: Peter Hunt <[email protected]>
It doesn't make much sense to queue so many deferred funcs and check retErr in each one. Instead, we can check retErr once and loop through a slice of cleanupFuncs.
Signed-off-by: Peter Hunt <[email protected]>
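A minimal sketch of that pattern, assuming a creation routine with a few independent steps. The `createSandbox` function, its step names, and the reverse-order unwinding are illustrative assumptions, not taken from the CRI-O source; only `retErr` and `cleanupFuncs` are names from the commit message.

```go
package main

import "errors"

// createSandbox is a hypothetical multi-step creation routine. Each step
// that succeeds registers an undo function; one deferred function checks
// retErr a single time and, on failure, runs every registered cleanup.
func createSandbox(failAt int, log *[]string) (retErr error) {
	var cleanupFuncs []func()
	defer func() {
		if retErr == nil {
			return
		}
		// Assumption: unwind in reverse order, as stacked defers would.
		for i := len(cleanupFuncs) - 1; i >= 0; i-- {
			cleanupFuncs[i]()
		}
	}()

	steps := []string{"reserve name", "create storage", "create network"}
	for i, step := range steps {
		if i == failAt {
			return errors.New("failed at: " + step)
		}
		step := step // capture the current value for the closure
		cleanupFuncs = append(cleanupFuncs, func() {
			*log = append(*log, "undo "+step)
		})
	}
	return nil
}
```

Calling `createSandbox(2, &log)` fails at the third step and undoes the first two; calling it with `failAt` set to `-1` succeeds and runs no cleanup.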
ResourceCache is a structure that keeps track of partially created Pods and Containers. Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Previously, when a client's request for a RunPodSandbox or ContainerCreate timed out, CRI-O would clean up the resource. However, these requests usually fail when the node is under load. In these cases, it would be better to hold onto the progress rather than discard it.
This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run. When a duplicate name is detected, before erroring, the server checks the ResourceCache to see if we've already successfully created that resource. If so, we return it as if we'd just created it.
It also moves the SetCreated call to after the point where the resource is known not to have timed out.
Hopefully, this reduces the load on already overloaded nodes.
Signed-off-by: Peter Hunt <[email protected]>
Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved. This is bad UX, and we're capable of improving it.
Add a watcher idiom to the resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to wait for a resource to become available.
The key point is that if the resource becomes available while we're watching for it, *we still need to error on this request*. Otherwise, we could get the resource from the cache and remove it (meaning it won't be cleaned up), the kubelet's request could then time out, and the kubelet could try again. This would cause us to leak a resource.
This way, if we get into this situation, there need to be three requests:
- the first, which times out
- the second, which discovers the resource is ready, but still errors
- the third, which actually retrieves that resource and returns it
This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).
Signed-off-by: Peter Hunt <[email protected]>
Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow us to reuse the sandbox even after the longest CNI creation times. However, this leads to the chance that the CNI plugin runs forever, which is not ideal.
Instead, give the sandbox network creation 5 minutes (a minute more than the full request) to improve the odds that we end up with a completed sandbox that can be reused, rather than thrown away.
Signed-off-by: Peter Hunt <[email protected]>
timeout.bats is a test suite that covers different scenarios involving timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option.
Signed-off-by: Peter Hunt <[email protected]>
An older version of this code used to have a goroutine for each resource, which is no longer the case, so remove the obsolete part of the doc. How a resource becomes stale and gets removed is already described elsewhere.
Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
The 10s timeout is sometimes not enough to finish container or pod creation. Increase it to 30s to fix occasional flakes, and move the wait to a separate function, wait_crio.
While at it,
- increase the conmon sleep and the crictl create/runp cancel timeout to 3s;
- move create_conmon to setup;
- fix ID checks (we're looking for a string, not a substring);
- change a 3m timeout to 150s.
Not critical, just nits.
Signed-off-by: Kir Kolyshkin <[email protected]>
Before, it was possible to segfault when WatcherForResource was called and followed by a Get, as we didn't check that the resource had actually been put. Fix this.
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
We need to specifically register "Describe" functions, but Ginkgo doesn't allow us to register multiple ones. Wrap the different functionality in different Contexts so they all run.
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
c23f369 to c31c1fb
|
also includes #4530 |
|
/override integration_crun |
|
@haircommander: /override requires a failed status context to operate on.
Only the following contexts were expected:
|
/override ci/openshift-jenkins/integration_crun |
|
@haircommander: Overrode contexts on behalf of haircommander: ci/openshift-jenkins/integration_crun
|
/hold cancel I believe we've soaked enough in 1.20 |
|
/lgtm |
|
/retest |
|
@haircommander: The following test failed, say
What type of PR is this?
/kind bug
What this PR does / why we need it:
carries #4430 #4258 #4241 #4240
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?