[release-1.20] Improve timeout handling #4421

openshift-cherrypick-robot · 2020-12-08T21:23:25Z

This is an automated cherry-pick of #4394

/assign haircommander

Fix a bug where a timeout in a RunPodSandbox or CreateContainer request caused CRI-O to delete the newly created resource. CRI-O now keeps that resource until the kubelet re-requests it, allowing the kubelet and CRI-O to reconcile more quickly when nodes are under load.

ResourceCache is a structure that keeps track of partially created Pods and Containers. Its features include:

- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

Previously, when a client's RunPodSandbox or CreateContainer request timed out, CRI-O would clean up the resource. However, these timeouts usually happen when the node is under load, and in those cases it is better to keep the progress than to discard it.

This commit uses the previously added ResourceCache to cache the progress of a container creation or sandbox run. When a duplicate name is detected, the server checks the ResourceCache before erroring to see whether that resource has already been created successfully. If so, it returns the resource as if it had just been created.

It also moves the SetCreated call to after the resource is known not to have timed out. Hopefully, this reduces the load on already overloaded nodes.

Signed-off-by: Peter Hunt <[email protected]>
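The duplicate-name flow described in this commit message could look roughly like the sketch below. It is a simplified illustration, not CRI-O's code: the `server` type, its maps, `createContainer`, and `ErrNameReserved` are hypothetical, and the `create` callback stands in for the real (possibly slow) creation path.

```go
package main

import (
	"errors"
	"sync"
)

// ErrNameReserved is a hypothetical error for a name that is already claimed.
var ErrNameReserved = errors.New("name is reserved")

// server is a heavily simplified, hypothetical stand-in for the CRI-O server.
type server struct {
	mu       sync.Mutex
	reserved map[string]bool   // names claimed by an earlier request
	cache    map[string]string // name -> ID of a resource finished after a timeout
}

func newServer() *server {
	return &server{reserved: map[string]bool{}, cache: map[string]string{}}
}

// createContainer reserves the name and runs create. On a duplicate name it
// consults the cache before returning an error. The create callback returns the
// new ID and whether the client's request had timed out by the time it finished.
func (s *server) createContainer(name string, create func() (string, bool)) (string, error) {
	s.mu.Lock()
	if s.reserved[name] {
		id, ok := s.cache[name]
		if ok {
			delete(s.cache, name)
		}
		s.mu.Unlock()
		if ok {
			// An earlier, timed-out request already did the work: return it
			// as if it had just been created.
			return id, nil
		}
		return "", ErrNameReserved
	}
	s.reserved[name] = true
	s.mu.Unlock()

	id, timedOut := create()

	s.mu.Lock()
	defer s.mu.Unlock()
	if timedOut {
		// Keep the progress instead of deleting the resource.
		s.cache[name] = id
		return "", errors.New("request timed out")
	}
	// Only at this point would the resource be marked as created.
	return id, nil
}
```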

Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved. This is bad UX, and we're capable of improving it.

Add a watcher idiom to the resource cache, allowing a RunPodSandbox or CreateContainer handler routine to wait for a resource to become available.

The key point is that if the resource becomes available while we're watching for it, *we still need to error on this request*. Otherwise we could get the resource from the cache and remove it (meaning it won't be cleaned up), the kubelet's request could time out, and the kubelet could try again. This would cause us to leak a resource.

This way, if we get into this situation, there need to be three requests:

- the first, which times out
- the second, which discovers the resource is ready but still errors
- the third, which actually retrieves the resource and returns it

This results in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).

Signed-off-by: Peter Hunt <[email protected]>

Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would let us reuse even the slowest CNI creations. However, that opens the possibility that the CNI plugin runs forever, which is not ideal. Instead, give the sandbox network creation 5 minutes (a minute more than the full request) to improve the odds that we have a completed sandbox that can be reused rather than thrown away.

Signed-off-by: Peter Hunt <[email protected]>

timeout.bats is a test suite that covers different scenarios involving timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>

haircommander · 2020-12-08T21:26:41Z

/retest

haircommander · 2020-12-08T21:26:51Z

/lgtm

haircommander · 2020-12-08T21:27:19Z

/approve

openshift-ci-robot · 2020-12-08T21:27:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, openshift-cherrypick-robot

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

haircommander · 2020-12-08T21:57:44Z

/retest

codecov · 2020-12-08T23:03:47Z

Codecov Report

Merging #4421 (b9dda02) into release-1.20 (9cfbe39) will increase coverage by 0.07%.
The diff coverage is 44.18%.

@@               Coverage Diff                @@
##           release-1.20    #4421      +/-   ##
================================================
+ Coverage         40.50%   40.57%   +0.07%     
================================================
  Files               116      117       +1     
  Lines              9330     9407      +77     
================================================
+ Hits               3779     3817      +38     
- Misses             5125     5164      +39     
  Partials            426      426

haircommander · 2020-12-09T00:32:22Z

/retest

saschagrunert · 2020-12-09T07:19:45Z

/retest

haircommander · 2020-12-09T13:43:22Z

/retest

haircommander · 2020-12-09T15:58:29Z

/retest

haircommander · 2020-12-09T15:59:05Z

(I recognize getting this in will make the tests flaky; before we cut the final 1.20 we should get some version of #4422 in.)

haircommander · 2020-12-09T16:50:08Z

/retest

haircommander · 2020-12-09T17:21:13Z

/retest

haircommander · 2020-12-09T19:16:10Z

/retest

haircommander · 2020-12-09T21:16:40Z

/retest

openshift-merge-robot · 2020-12-09T21:46:44Z

@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/integration_rhel | `b9dda02` | link | `/test integration_rhel` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

haircommander · 2020-12-10T15:18:55Z

replaced by #4430

haircommander added 6 commits December 8, 2020 21:23

Add unit tests for ResourceCache

4f95a43

Signed-off-by: Peter Hunt <[email protected]>

test: add timeout.bats

b9dda02

timeout.bats is a test suite that covers different scenarios involving timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>

openshift-cherrypick-robot requested review from mrunalp and runcom as code owners December 8, 2020 21:23

openshift-ci-robot assigned haircommander Dec 8, 2020

openshift-cherrypick-robot mentioned this pull request Dec 8, 2020

Improve timeout handling #4394

Merged

openshift-ci-robot requested review from haircommander and nalind December 8, 2020 21:23

openshift-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Dec 8, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 8, 2020

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 8, 2020

This was referenced Dec 10, 2020

[1.20] handle timeout handling and fix flakes #4429

Closed

[1.20] improve timeout handling and fix flakes #4430

Merged

haircommander closed this Dec 10, 2020

5 participants