[1.20] improve timeout handling and fix flakes #4430

haircommander · 2020-12-10T15:18:29Z

What type of PR is this?

/kind bug


What this PR does / why we need it:

carries #4394 and #4422
replaces #4421

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug where a timeout in a RunPodSandbox or CreateContainer request caused CRI-O to delete the newly created resource. CRI-O now keeps that resource until the kubelet re-requests it, allowing the kubelet and CRI-O to reconcile more quickly when nodes are under load.

TomSweeneyRedHat · 2020-12-10T15:23:10Z

internal/resourcestore/resourcestore.go

+const sleepTimeBeforeCleanup = 1 * time.Minute
+
+// ResourceStore is a structure that saves information about a recently created resource.
+// Resources can be added and retrieved from the store. A retrieval (Get) also removes the Resource from the store.


A retrieval also removes? That's unusual, no?

I suppose, though here we need to keep a single copy of the resource, so we're basically using move semantics. Do you have a suggestion for a better method name?
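To make the "move semantics" point concrete, here is a minimal sketch of a store whose Get hands the resource over and forgets it. The type and method names (`moveStore`, `Put`, `Get`) and the use of a plain string ID are assumptions for illustration, not the code in internal/resourcestore/resourcestore.go.

```go
package main

import "sync"

// moveStore is a hypothetical, simplified stand-in for the ResourceStore
// discussed above. It maps a resource name to a resource ID.
type moveStore struct {
	mu        sync.Mutex
	resources map[string]string
}

func newMoveStore() *moveStore {
	return &moveStore{resources: make(map[string]string)}
}

// Put saves a resource ID under the given name.
func (s *moveStore) Put(name, id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.resources[name] = id
}

// Get returns the ID saved under name and removes it from the store, so
// exactly one caller ends up owning the resource. The second return value
// is false if nothing was saved under that name.
func (s *moveStore) Get(name string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	id, ok := s.resources[name]
	if ok {
		delete(s.resources, name)
	}
	return id, ok
}
```

With this shape, a second Get for the same name finds nothing, which is what keeps a single copy of the resource around.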

haircommander · 2020-12-10T15:35:51Z

/retest

codecov · 2020-12-10T16:20:08Z

Codecov Report

Merging #4430 (e3dc7e7) into release-1.20 (209938c) will increase coverage by 0.07%.
The diff coverage is 44.18%.

@@               Coverage Diff                @@
##           release-1.20    #4430      +/-   ##
================================================
+ Coverage         40.52%   40.59%   +0.07%     
================================================
  Files               116      117       +1     
  Lines              9327     9404      +77     
================================================
+ Hits               3780     3818      +38     
- Misses             5120     5159      +39     
  Partials            427      427

haircommander · 2020-12-10T17:34:15Z

/retest

haircommander · 2020-12-10T18:29:19Z

/retest

ResourceCache is a structure that keeps track of partially created Pods and Containers. Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>
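As a rough illustration of the garbage-collection feature, the sketch below uses a two-pass scheme: a pass first marks an entry stale, and the next pass cleans it up. This scheme, and the names `resourceCache`, `sweep`, and `collect`, are assumptions for illustration; only the one-minute `sleepTimeBeforeCleanup` value comes from the diff in this PR.

```go
package main

import (
	"sync"
	"time"
)

// Value taken from the diff in this PR.
const sleepTimeBeforeCleanup = 1 * time.Minute

// resource is a hypothetical cache entry.
type resource struct {
	id      string
	cleanup func() // undoes the partial creation; must not call back into the cache
	stale   bool
}

// resourceCache is a simplified, hypothetical take on the ResourceCache.
type resourceCache struct {
	mu        sync.Mutex
	resources map[string]*resource
}

func newResourceCache() *resourceCache {
	return &resourceCache{resources: make(map[string]*resource)}
}

// Put tracks a resource whose creation request timed out.
func (c *resourceCache) Put(name, id string, cleanup func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.resources[name] = &resource{id: id, cleanup: cleanup}
}

// sweep performs one garbage-collection pass and returns how many entries
// it removed. Fresh entries are only marked stale; entries already marked
// stale by an earlier pass are cleaned up and dropped.
func (c *resourceCache) sweep() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for name, r := range c.resources {
		if !r.stale {
			r.stale = true
			continue
		}
		if r.cleanup != nil {
			r.cleanup()
		}
		delete(c.resources, name)
		removed++
	}
	return removed
}

// collect runs sweep on one shared ticker until stop is closed, so no
// goroutine per resource is needed.
func (c *resourceCache) collect(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.sweep()
		case <-stop:
			return
		}
	}
}
```

A caller would start the collector once, for example with `go c.collect(sleepTimeBeforeCleanup, stop)`. Under this scheme an unclaimed entry lives for between one and two intervals before it is cleaned up.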

Signed-off-by: Peter Hunt <[email protected]>

Previously, when a client's request for a RunPodSandbox or ContainerCreate timed out, CRI-O would clean up the resource. However, these requests usually fail when the node is under load. In these cases, it is better to hold onto the progress than to get rid of it.

This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run. When a duplicate name is detected, before erroring, the server checks the ResourceCache to see if we've already successfully created that resource. If so, we return it as if we'd just created it.

It also moves the SetCreated call to after the resource is deemed not to have timed out.

Hopefully, this reduces the load on already overloaded nodes.

Signed-off-by: Peter Hunt <[email protected]>
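The duplicate-name check described in this commit message can be sketched roughly as follows. Everything here is hypothetical and heavily simplified: the `server` type, the `createContainer` signature, and the `clientTimedOut` flag (which stands in for the real request context expiring) are illustration only, and the SetCreated change is not modelled.

```go
package main

import (
	"errors"
	"sync"
)

var (
	errTimedOut     = errors.New("request timed out")
	errNameReserved = errors.New("name is reserved")
)

// server is a hypothetical stand-in for the CRI-O server state.
type server struct {
	mu       sync.Mutex
	reserved map[string]bool   // names that are already taken
	cache    map[string]string // name -> ID of resources whose request timed out
}

func newServer() *server {
	return &server{
		reserved: make(map[string]bool),
		cache:    make(map[string]string),
	}
}

// createContainer simulates one CreateContainer request. clientTimedOut
// stands in for the client giving up before the creation finished.
func (s *server) createContainer(name string, clientTimedOut bool) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if s.reserved[name] {
		// Duplicate name: before erroring, check whether an earlier,
		// timed-out request already finished creating this resource.
		if id, ok := s.cache[name]; ok {
			delete(s.cache, name)
			return id, nil // return it as if we'd just created it
		}
		return "", errNameReserved
	}

	s.reserved[name] = true
	id := "id-" + name // pretend the creation succeeded
	if clientTimedOut {
		// Keep the progress instead of cleaning it up.
		s.cache[name] = id
		return "", errTimedOut
	}
	return id, nil
}
```

In this sketch, a retry after a timed-out request gets the cached ID back, while any later request for the same name still sees "name is reserved".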

Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved. This is bad UX, and we're capable of improving it.

Add a watcher idiom to the resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to wait for a resource to be available.

Something that is key here: if the resource becomes available while we're watching for it, *we still need to error on this request*. This is because we could get the resource from the cache, remove it (meaning it won't be cleaned up), and then the kubelet's request could time out and it could try again. This would cause us to leak a resource.

This way, if we get into this situation, there need to be three requests:
- the first, which times out
- the second, which discovers the resource is ready, but still errors
- the third, which actually retrieves that resource and returns it

This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).

Signed-off-by: Peter Hunt <[email protected]>
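A minimal sketch of the watcher idiom, assuming a channel-per-watcher design: the names `watchCache`, `Watch`, `watching`, and `handleDuplicate`, and the plain `done` channel standing in for the request context, are all hypothetical. The point it tries to show is that a request which was already waiting when the resource became ready still returns an error, and only a later request takes the resource out of the cache.

```go
package main

import (
	"errors"
	"sync"
)

var (
	errResourceReady = errors.New("resource is ready, retry the request")
	errCancelled     = errors.New("request cancelled while waiting")
)

// watchCache is a hypothetical cache with watcher support.
type watchCache struct {
	mu       sync.Mutex
	ids      map[string]string
	watchers map[string][]chan struct{}
}

func newWatchCache() *watchCache {
	return &watchCache{
		ids:      make(map[string]string),
		watchers: make(map[string][]chan struct{}),
	}
}

// Put saves a resource and wakes every watcher waiting on its name.
func (c *watchCache) Put(name, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ids[name] = id
	for _, w := range c.watchers[name] {
		close(w)
	}
	delete(c.watchers, name)
}

// Get returns the resource and removes it from the cache.
func (c *watchCache) Get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	id, ok := c.ids[name]
	if ok {
		delete(c.ids, name)
	}
	return id, ok
}

// Watch returns a channel that is closed once name is available. If the
// resource is already cached, the channel is returned closed.
func (c *watchCache) Watch(name string) <-chan struct{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	w := make(chan struct{})
	if _, ok := c.ids[name]; ok {
		close(w)
		return w
	}
	c.watchers[name] = append(c.watchers[name], w)
	return w
}

// watching reports whether anyone is waiting on name (helper for tests).
func (c *watchCache) watching(name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.watchers[name]) > 0
}

// handleDuplicate is what a handler might do after detecting a duplicate
// name. A cancelled watcher is left registered here for brevity.
func handleDuplicate(done <-chan struct{}, c *watchCache, name string) (string, error) {
	if id, ok := c.Get(name); ok {
		return id, nil // the "third request" case
	}
	select {
	case <-c.Watch(name):
		// Ready, but do not take it: this request may be about to time
		// out, and taking it would exempt the resource from cleanup.
		return "", errResourceReady
	case <-done:
		return "", errCancelled
	}
}
```

Waiting inside the handler, rather than failing immediately, is what would cut down the rate of "name is reserved" errors in this sketch: the caller hears back once per wait instead of once per quick retry.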

Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow us to reuse the result of even the longest CNI creation. However, this leaves the chance that the CNI plugin runs forever, which is not ideal. Instead, give the sandbox network creation 5 minutes (a minute more than the full request) to improve the odds that we have a completed sandbox that can be reused rather than thrown away.

Signed-off-by: Peter Hunt <[email protected]>
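One way to picture the bounded network timeout is to give the network setup its own deadline instead of inheriting the request's. The sketch below assumes a 4-minute request timeout (implied by "a minute more than the full request"); `setupNetwork`, `networkPlugin`, and the two example plugins are hypothetical names, not CRI-O or CNI APIs.

```go
package main

import (
	"context"
	"errors"
	"time"
)

const (
	// Assumed from the commit message: the full request gets 4 minutes,
	// and network creation gets a minute more.
	requestTimeout = 4 * time.Minute
	networkTimeout = requestTimeout + time.Minute
)

var errNetworkTimeout = errors.New("network creation timed out")

// networkPlugin is a hypothetical stand-in for a CNI plugin invocation.
type networkPlugin func(ctx context.Context) error

// setupNetwork runs the plugin under its own deadline, detached from the
// request context, so that a client-side timeout does not cut network
// creation short while the plugin still cannot run forever.
func setupNetwork(timeout time.Duration, plugin networkPlugin) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err := plugin(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return errNetworkTimeout
	}
	return err
}

// instantPlugin succeeds immediately.
func instantPlugin(ctx context.Context) error { return nil }

// hangingPlugin blocks until its context expires.
func hangingPlugin(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}
```

A sandbox run would call `setupNetwork(networkTimeout, plugin)`. This only bounds plugins that honor the context they are given.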

timeout.bats is a test suite that tests different scenarios involving timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>

An older version of this code had a goroutine for each resource, which is no longer the case, so remove the obsolete part of the doc. How a resource becomes stale and is removed is already described elsewhere.

Signed-off-by: Kir Kolyshkin <[email protected]>

Signed-off-by: Kir Kolyshkin <[email protected]>

The 10s timeout is sometimes not enough to finish container or pod creation. Increase it to 30s to fix occasional flakes, and move the wait to a separate function, wait_crio.

While at it:
- increase the conmon sleep and the crictl create/runp cancel timeout to 3s;
- move create_conmon to setup;
- fix ID checks (we're looking for a string, not a substring);
- change a 3m timeout to 150s.

Not critical, just nits.

Signed-off-by: Kir Kolyshkin <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

openshift-ci-robot · 2020-12-10T19:02:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander,mrunalp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mrunalp · 2020-12-10T19:02:23Z

/lgtm

haircommander · 2020-12-10T22:06:43Z

/retest

haircommander requested review from mrunalp and runcom as code owners December 10, 2020 15:18

openshift-ci-robot added the release-note (denotes a PR that will be considered when it comes time to generate release notes), kind/bug (categorizes issue or PR as related to a bug), and dco-signoff: yes (indicates the PR's author has DCO signed all their commits) labels Dec 10, 2020

openshift-ci-robot requested a review from saschagrunert December 10, 2020 15:18

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Dec 10, 2020

haircommander mentioned this pull request Dec 10, 2020

[release-1.20] Improve timeout handling #4421

Closed

TomSweeneyRedHat reviewed Dec 10, 2020

haircommander and others added 10 commits December 10, 2020 13:54

Add unit tests for ResourceCache

e62cb75

Signed-off-by: Peter Hunt <[email protected]>

test: add timeout.bats

7853a1b

test/timeout.bats: fix comments

7ed9d89

Signed-off-by: Kir Kolyshkin <[email protected]>

fix docs

e3dc7e7

Signed-off-by: Peter Hunt <[email protected]>

haircommander force-pushed the handle-timeout-1.20 branch from 9968443 to e3dc7e7 Compare December 10, 2020 18:55

haircommander added this to the 1.20 milestone Dec 10, 2020

mrunalp approved these changes Dec 10, 2020

openshift-ci-robot assigned mrunalp Dec 10, 2020

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Dec 10, 2020

openshift-merge-robot merged commit 62afeec into cri-o:release-1.20 Dec 10, 2020

haircommander mentioned this pull request Dec 17, 2020

[1.19] improve timeout handling #4444

Merged

openshift-ci-robot mentioned this pull request Feb 2, 2021

[1.20] ResourceStore: fix segfault and update tests #4534

Closed

openshift-ci-robot mentioned this pull request Feb 17, 2021

wip: try to get openshift ci suite to pass #4582

Closed

openshift-ci-robot mentioned this pull request Mar 13, 2021

Backport handle irqbalance handling to 1.20 #4655

Closed

openshift-ci bot mentioned this pull request Aug 30, 2021

[release-1.20] BZ#2010831 Fix missing quantile in latency_microseconds_total metrics #5265

Merged
