Improve timeout handling #4394
Conversation
A week ago I was able to verify this fix worked.

Without this patch:
With this patch:

Note: both clusters did eventually fail (the test case specifically had to overload them to trigger this situation), but with the patches we were able to get more pods running.
Codecov Report
@@ Coverage Diff @@
## master #4394 +/- ##
==========================================
+ Coverage 40.50% 40.57% +0.07%
==========================================
Files 116 117 +1
Lines 9330 9407 +77
==========================================
+ Hits 3779 3817 +38
- Misses 5125 5164 +39
Partials 426 426
/retest
/retest

1 similar comment

/retest
saschagrunert left a comment
LGTM
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, saschagrunert

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/retest
fidencio left a comment
@haircommander, a few basic comments but looks good in general.
blocked on #4241 btw

LGTM
/retest

/cherry-pick release-1.20
@haircommander: once the present PR merges, I will cherry-pick it on top of release-1.20 in a new PR and assign it to you.
/lgtm
ResourceCache is a structure that keeps track of partially created Pods and Containers. Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>
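A minimal sketch of what such a store could look like, assuming illustrative names (`resourceStore`, `Put`, `Get`, and a TTL timer) rather than the actual CRI-O types:

```go
package store

import (
	"sync"
	"time"
)

// resource bundles a created object with the functions needed to tear it down.
type resource struct {
	value    interface{}
	cleanups []func()
	timer    *time.Timer
}

// resourceStore keeps partially created pods/containers until a retry claims
// them, garbage-collecting anything left unclaimed after a timeout.
type resourceStore struct {
	mu        sync.Mutex
	resources map[string]*resource
	ttl       time.Duration
}

func newResourceStore(ttl time.Duration) *resourceStore {
	return &resourceStore{resources: map[string]*resource{}, ttl: ttl}
}

// Put remembers a resource under its client-chosen name and schedules its
// cleanup functions to run if nothing retrieves it before the TTL expires.
func (s *resourceStore) Put(name string, value interface{}, cleanups ...func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := &resource{value: value, cleanups: cleanups}
	r.timer = time.AfterFunc(s.ttl, func() {
		s.mu.Lock()
		defer s.mu.Unlock()
		if _, ok := s.resources[name]; !ok {
			return // already claimed by a retry; nothing to clean up
		}
		delete(s.resources, name)
		for _, c := range cleanups {
			c() // garbage-collect whatever was partially created
		}
	})
	s.resources[name] = r
}

// Get returns the cached resource (if any), stops its cleanup timer, and
// hands ownership back to the caller.
func (s *resourceStore) Get(name string) interface{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.resources[name]
	if !ok {
		return nil
	}
	r.timer.Stop()
	delete(s.resources, name)
	return r.value
}
```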
Signed-off-by: Peter Hunt <[email protected]>
Before, when a client's request for a RunPodSandbox or ContainerCreate timed out, CRI-O would clean up the resource. However, these requests usually fail when the node is under load, and in those cases it would be better to hold onto the progress rather than get rid of it.

This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run. When a duplicate name is detected, before erroring, the server checks the ResourceCache to see if we've already successfully created that resource. If so, we return it as if we'd just created it. It also moves the SetCreated call to after the resource is deemed as not having timed out.

Hopefully, this reduces the load on already overloaded nodes.

Signed-off-by: Peter Hunt <[email protected]>
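Roughly, the handler flow described in this commit could look like the sketch below; the `cache` interface and the `reserveName`, `create`, and `remove` callbacks are stand-ins assumed for illustration, not CRI-O's real API:

```go
package server

import (
	"context"
	"errors"
)

// cache is a stand-in for the resource cache; assumed for illustration only.
type cache interface {
	Get(name string) (id string, ok bool)
	Put(name, id string, cleanup func())
}

var errNameReserved = errors.New("name is reserved")

// runPodSandbox sketches the flow: on a duplicate name, consult the cache
// before erroring; on a client timeout, stash the finished work for the retry
// instead of cleaning it up.
func runPodSandbox(
	ctx context.Context,
	name string,
	store cache,
	reserveName func(string) error,
	create func(context.Context) (string, error),
	remove func(string),
) (string, error) {
	if err := reserveName(name); err != nil {
		// A previous request may have completed after its client gave up.
		if id, ok := store.Get(name); ok {
			return id, nil // return it as if we had just created it
		}
		return "", errNameReserved
	}
	id, err := create(ctx)
	if err != nil {
		return "", err
	}
	if ctx.Err() != nil {
		// The client timed out: keep the progress rather than discard it.
		store.Put(name, id, func() { remove(id) })
		return "", ctx.Err()
	}
	return id, nil
}
```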
Even if we use the resource cache as-is, the user is still bombarded with messages saying the name is reserved. This is bad UX, and we're capable of improving it.

Add a watcher idiom to the resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to wait for a resource to become available.

One key point: if the resource becomes available while we're watching for it, *we still need to error on this request*. This is because we could get the resource from the cache, remove it (thus meaning it won't be cleaned up), and then the kubelet's request could time out and it could try again. This would cause us to leak a resource. This way, if we get into this situation, there need to be three requests:
- a first that times out
- a second that discovers the resource is ready, but still errors
- a third that actually retrieves that resource and returns it

This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).

Signed-off-by: Peter Hunt <[email protected]>
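A hedged sketch of the watcher idiom under the same illustrative assumptions (a `WatchResource` method returning a channel that is closed when the resource is stored); the real CRI-O code may differ:

```go
package server

import (
	"context"
	"errors"
	"sync"
)

// watchedStore extends the cache sketch with watchers: handlers can wait for
// an in-flight creation instead of repeatedly failing with "name is reserved".
type watchedStore struct {
	mu       sync.Mutex
	ready    map[string]string          // name -> resource id, once created
	watchers map[string][]chan struct{} // name -> channels closed on Put
}

func newWatchedStore() *watchedStore {
	return &watchedStore{
		ready:    map[string]string{},
		watchers: map[string][]chan struct{}{},
	}
}

// WatchResource returns a channel that is closed once a resource with this
// name has been stored.
func (s *watchedStore) WatchResource(name string) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan struct{})
	s.watchers[name] = append(s.watchers[name], ch)
	return ch
}

// Put stores the finished resource and wakes any watchers.
func (s *watchedStore) Put(name, id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ready[name] = id
	for _, ch := range s.watchers[name] {
		close(ch)
	}
	delete(s.watchers, name)
}

// waitForReserved is the handler side. Even if the resource becomes ready
// while we wait, this request still errors: returning it here could leak the
// resource if the kubelet has itself timed out and will retry anyway. The
// next request finds it in the cache and returns it.
func waitForReserved(ctx context.Context, s *watchedStore, name string) error {
	select {
	case <-s.WatchResource(name):
		return errors.New("name is reserved") // ready now; the retry will claim it
	case <-ctx.Done():
		return ctx.Err()
	}
}
```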
Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow us to reuse even the longest of CNI creations. However, that leaves the chance that the CNI plugin runs forever, which is not ideal.

Instead, give the sandbox network creation 5 minutes (a minute more than the full request) to improve the odds that we have a completed sandbox that can be reused, rather than thrown away.

Signed-off-by: Peter Hunt <[email protected]>
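As a small sketch of that bound, assuming a hypothetical `setUpPodNetwork` wrapper around the plugin call (the 5-minute figure comes from the commit message; everything else is illustrative):

```go
package server

import (
	"context"
	"time"
)

// Give CNI network setup its own 5-minute deadline, detached from the shorter
// client request deadline, so a slow-but-successful setup can still finish and
// be cached for reuse instead of being thrown away.
const podNetworkTimeout = 5 * time.Minute

// setUpPodNetwork runs the provided CNI setup function under the longer bound.
// The setUp callback is a stand-in for the real plugin call.
func setUpPodNetwork(setUp func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), podNetworkTimeout)
	defer cancel()
	return setUp(ctx)
}
```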
timeout.bats is a test suite that tests different scenarios regarding timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>
There was a flake with the tests; I tweaked it a bit, and I think this should be better.
/lgtm

/retest

1 similar comment

/retest
@haircommander: The following test failed, say
@haircommander: new pull request created: #4421
// Put takes a unique resource name (retrieved from the client request, not generated by the server)
// a newly created resource, and functions to cleanup that newly created resource.
// It adds the Resource to the ResourceStore, as well as starts a go routine that is responsible for cleaning up the
the part about the go routine is obsolete
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR changes the behavior of container and pod creation to save the progress of a creation request if we time out, and quickly return that progress if it's re-requested.
Specifically, it adds a ResourceCache that keeps track of partially created pods and containers.
This is carrying #4266, plus adding the watcher idiom.
Which issue(s) this PR fixes:
fixes #4221
Special notes for your reviewer:
Does this PR introduce a user-facing change?