[1.19] improve timeout handling #4444

haircommander · 2020-12-17T22:05:48Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

carries #4430 #4258 #4241 #4240

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug where a timeout in a RunPodSandbox or CreateContainer request caused CRI-O to delete the newly created resource. CRI-O now keeps that resource until the kubelet re-requests it, allowing the kubelet and CRI-O to reconcile more quickly when nodes are under load.

openshift-ci-robot · 2020-12-17T22:05:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mrunalp · 2020-12-17T22:20:05Z

/hold

Putting a hold as we want to make sure this is tested well before merging into 1.19

mrunalp · 2020-12-17T22:20:37Z

/hold

codecov · 2020-12-17T22:21:01Z

Codecov Report

Merging #4444 (c23f369) into release-1.19 (6377f68) will increase coverage by 0.02%.
The diff coverage is 34.28%.

@@               Coverage Diff                @@
##           release-1.19    #4444      +/-   ##
================================================
+ Coverage         40.87%   40.89%   +0.02%     
================================================
  Files               114      115       +1     
  Lines              8744     8795      +51     
================================================
+ Hits               3574     3597      +23     
- Misses             4807     4834      +27     
- Partials            363      364       +1

openshift-merge-robot · 2020-12-18T01:16:25Z

@haircommander: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/openshift-jenkins/integration_fedora	`c23f369`	link	`/test integration_fedora`
ci/prow/e2e-aws	`c23f369`	link	`/test e2e-aws`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Signed-off-by: Peter Hunt <[email protected]>

If we create the network before we have an infra container, but fail to fully create a sandbox, we attempt to clean up the network. Calling networkStop() causes CRI-O to place a file in the sandbox's infra container's directory, thus allowing us to restore the fact that the network had been stopped. The problem is, we don't have an infra container directory, so the call segfaults. Instead, check if the sandbox has finished creating before attempting to create the file. If it hasn't, there will be no sandbox to restore, so we don't really need the temp file. Another option would be to wire it so that the sandbox has access to the infraContainer.Dir() without actually having an infra container. That requires another item in libsandbox.New(), which I find cumbersome. Further, I think the sandbox creation code is itching for a refactor, which can include that fix if we find it desirable. In the meantime, this workaround is sufficient. Signed-off-by: Peter Hunt <[email protected]>
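A minimal sketch of the guard described above, for illustration only. The `sandbox` and `infraContainer` types, the `created` flag, and the `network-stopped` file name are hypothetical stand-ins, not CRI-O's actual types or paths.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// infraContainer and sandbox are hypothetical, simplified stand-ins.
type infraContainer struct{ dir string }

type sandbox struct {
	created bool            // set once sandbox creation has fully finished
	infra   *infraContainer // nil until the infra container exists
}

// networkStoppedMarker returns where a "network stopped" marker file would be
// written, or ok=false when the sandbox never finished creating. In that case
// there is nothing to restore later, so the marker can be skipped.
func networkStoppedMarker(sb *sandbox) (path string, ok bool) {
	if sb == nil || !sb.created || sb.infra == nil {
		return "", false
	}
	return filepath.Join(sb.infra.dir, "network-stopped"), true
}

func main() {
	partial := &sandbox{} // network was created, sandbox creation failed
	_, ok := networkStoppedMarker(partial)
	fmt.Println("partial sandbox writes marker:", ok)

	full := &sandbox{created: true, infra: &infraContainer{dir: "/tmp/infra"}}
	path, ok := networkStoppedMarker(full)
	fmt.Println("created sandbox writes marker:", ok, path)
}
```

The check happens before any dereference of the infra container, which is the point of the fix: a partially created sandbox simply skips the marker instead of crashing.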

It doesn't make much sense to have so many deferred funcs queued, each checking retErr. Instead, we can check retErr once and loop through a slice of cleanupFuncs. Signed-off-by: Peter Hunt <[email protected]>
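The pattern can be sketched roughly as follows. `createResource`, the step names, and the log slice are hypothetical; running the cleanup functions in reverse order is an assumption that mirrors how stacked defers would have behaved, not something the commit message states.

```go
package main

import (
	"errors"
	"fmt"
)

// createResource is a hypothetical stand-in for a multi-step creation routine.
// Each completed step appends its undo function to cleanupFuncs. A single
// deferred function checks retErr once and, on failure, runs the undo
// functions in reverse order.
func createResource(failAt int, log *[]string) (retErr error) {
	var cleanupFuncs []func()
	defer func() {
		if retErr == nil {
			return
		}
		for i := len(cleanupFuncs) - 1; i >= 0; i-- {
			cleanupFuncs[i]()
		}
	}()

	steps := []string{"reserve name", "create storage", "start network"}
	for i, step := range steps {
		if i == failAt {
			return errors.New("failed at step: " + step)
		}
		step := step // capture the current value for the closure
		cleanupFuncs = append(cleanupFuncs, func() {
			*log = append(*log, "undo "+step)
		})
	}
	return nil
}

func main() {
	var log []string
	err := createResource(2, &log)
	fmt.Println(err, log)
}
```

Compared with one `defer` per step, there is a single place that decides whether cleanup runs, and the list of pending cleanups is ordinary data that can be inspected or handed off.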

ResourceCache is a structure that keeps track of partially created Pods and Containers. Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>
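To make the two features concrete, here is a small sketch under stated assumptions. `sketchCache` and its methods are hypothetical and are not the ResourceCache API from this PR; the clock is passed in explicitly so the example is deterministic, whereas a real cache would presumably be driven by a timer.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedResource is one partially created pod or container.
type cachedResource struct {
	id      string
	expires time.Time
	cleanup func() // undoes the creation if nobody claims the resource
}

// sketchCache is a hypothetical, simplified stand-in for ResourceCache.
type sketchCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	resources map[string]cachedResource
}

func newSketchCache(ttl time.Duration) *sketchCache {
	return &sketchCache{ttl: ttl, resources: make(map[string]cachedResource)}
}

// Put records a resource whose creation request timed out.
func (c *sketchCache) Put(name, id string, cleanup func(), now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.resources[name] = cachedResource{id: id, expires: now.Add(c.ttl), cleanup: cleanup}
}

// Get hands the resource to the caller and forgets it, so it is no longer
// subject to garbage collection.
func (c *sketchCache) Get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	r, ok := c.resources[name]
	if !ok {
		return "", false
	}
	delete(c.resources, name)
	return r.id, true
}

// Collect removes every resource whose expiry is before now, runs its
// cleanup, and reports how many were collected.
func (c *sketchCache) Collect(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := 0
	for name, r := range c.resources {
		if now.After(r.expires) {
			r.cleanup()
			delete(c.resources, name)
			n++
		}
	}
	return n
}

// demo returns the ID a retried request received and the number of
// abandoned resources that were garbage collected.
func demo() (claimed string, collected int) {
	start := time.Unix(0, 0)
	c := newSketchCache(time.Minute)
	c.Put("pod-a", "id-a", func() {}, start)
	c.Put("pod-b", "id-b", func() {}, start)
	claimed, _ = c.Get("pod-a")                       // the kubelet retried pod-a
	collected = c.Collect(start.Add(2 * time.Minute)) // pod-b was never re-requested
	return claimed, collected
}

func main() {
	fmt.Println(demo())
}
```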

Signed-off-by: Peter Hunt <[email protected]>

Before, when a client's request for a RunPodSandbox or CreateContainer timed out, CRI-O would clean up the resource. However, these requests usually fail when the node is under load. In these cases, it would be better to hold onto the progress rather than get rid of it. This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run. When a duplicate name is detected, before erroring, the server checks the ResourceCache to see if we've already successfully created that resource. If so, we return it as if we'd just created it. It also moves the SetCreated call to after the resource is deemed not to have timed out. Hopefully, this reduces the load on already overloaded nodes. Signed-off-by: Peter Hunt <[email protected]>
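The duplicate-name path can be illustrated with a deliberately simplified, single-threaded model. The `server` type, its maps, and the `timedOut` flag are hypothetical; the real handlers are concurrent and use the ResourceCache rather than plain maps.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNameReserved = errors.New("name is reserved")
	errTimedOut     = errors.New("request timed out")
)

// server is a hypothetical, single-threaded model of the request handler.
type server struct {
	reserved map[string]bool   // names already claimed by an earlier request
	cache    map[string]string // name -> ID of resources whose request timed out
}

func newServer() *server {
	return &server{reserved: map[string]bool{}, cache: map[string]string{}}
}

// createContainer models one CreateContainer call. timedOut simulates the
// client giving up before the server finished the work.
func (s *server) createContainer(name string, timedOut bool) (string, error) {
	if s.reserved[name] {
		// Duplicate name: before erroring, check whether an earlier,
		// timed-out request already produced this resource.
		if id, ok := s.cache[name]; ok {
			delete(s.cache, name)
			return id, nil
		}
		return "", errNameReserved
	}
	s.reserved[name] = true
	id := "id-" + name
	if timedOut {
		// Keep the finished work instead of deleting it.
		s.cache[name] = id
		return "", errTimedOut
	}
	return id, nil
}

func main() {
	s := newServer()
	_, err := s.createContainer("ctr", true)
	fmt.Println("first request:", err)
	id, err := s.createContainer("ctr", false)
	fmt.Println("retry:", id, err)
	_, err = s.createContainer("ctr", false)
	fmt.Println("third request:", err)
}
```

Once the retry has claimed the cached ID, the cache entry is gone, so a further request with the same name falls back to the ordinary "name is reserved" error.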

Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved. This is bad UX, and we're capable of improving it. Add a watcher idiom to the resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to wait for a resource to be available. The key point is that if the resource becomes available while we're watching for it, *we still need to error on this request*. This is because we could get the resource from the cache, remove it (meaning it won't be cleaned up), and then the kubelet's request could time out and it could try again. This would cause us to leak a resource. This way, if we get into this situation, there need to be three requests:
- the first, which times out
- the second, which discovers the resource is ready, but still errors
- the third, which actually retrieves that resource and returns it

This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes). Signed-off-by: Peter Hunt <[email protected]>
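The three-request flow can be sketched as below. Only the name `WatcherForResource` comes from this PR; its signature here, the `store` type, `handleReserved`, and the `watcherCount` helper (used only to make the demo deterministic) are assumptions for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errNameReserved = errors.New("name is reserved")

// store is a hypothetical, simplified resource cache with watchers.
type store struct {
	mu       sync.Mutex
	ids      map[string]string
	watchers map[string][]chan struct{}
}

func newStore() *store {
	return &store{ids: map[string]string{}, watchers: map[string][]chan struct{}{}}
}

// WatcherForResource returns a channel that is closed once name has been Put.
func (s *store) WatcherForResource(name string) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan struct{})
	if _, ok := s.ids[name]; ok {
		close(ch)
		return ch
	}
	s.watchers[name] = append(s.watchers[name], ch)
	return ch
}

// Put stores the finished resource and wakes every watcher.
func (s *store) Put(name, id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ids[name] = id
	for _, ch := range s.watchers[name] {
		close(ch)
	}
	delete(s.watchers, name)
}

// Get removes and returns the resource ID, or "" if it was never Put.
func (s *store) Get(name string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	id := s.ids[name]
	delete(s.ids, name)
	return id
}

// watcherCount exists only so the demo can wait for a watcher to register.
func (s *store) watcherCount(name string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.watchers[name])
}

// handleReserved models a request that found its name already reserved.
func handleReserved(ctx context.Context, s *store, name string) (string, error) {
	if id := s.Get(name); id != "" {
		return id, nil // the resource was ready before this request arrived
	}
	select {
	case <-s.WatcherForResource(name):
		// Ready now, but this request may itself be about to time out on
		// the client side, so error and let the next request claim it.
		return "", errNameReserved
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demo replays the second and third requests from the description above.
func demo() (secondErr error, thirdID string) {
	s := newStore()
	done := make(chan error)
	go func() {
		_, err := handleReserved(context.Background(), s, "pod")
		done <- err
	}()
	for s.watcherCount("pod") == 0 {
		time.Sleep(time.Millisecond)
	}
	s.Put("pod", "id-pod") // the work started by the first request finishes
	secondErr = <-done
	thirdID, _ = handleReserved(context.Background(), s, "pod")
	return secondErr, thirdID
}

func main() {
	fmt.Println(demo())
}
```

Because the waiting request blocks instead of returning immediately, the client sees one error per wait rather than one per rapid retry.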

Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow us to reuse even the longest of CNI creations. However, this leaves the chance that the CNI plugin runs forever, which is not ideal. Instead, give the sandbox network creation 5 minutes (a minute more than the full request) to improve the odds we have a completed sandbox that can be reused, rather than thrown away. Signed-off-by: Peter Hunt <[email protected]>
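One way to express a bounded but request-independent deadline in Go is a context derived from `context.Background()` instead of from the request context. This is a sketch of that idea, not the code in this PR: `setupNetwork` and `fakePlugin` are hypothetical, the 4-minute request timeout is inferred from "a minute more than the full request", and the demo scales the durations down to milliseconds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	requestTimeout = 4 * time.Minute              // inferred from "a minute more than the full request"
	networkTimeout = requestTimeout + time.Minute // the 5 minutes from the commit message
)

// setupNetwork runs plugin under its own deadline rather than the request's,
// so an expired request does not cancel a nearly finished network setup.
func setupNetwork(timeout time.Duration, plugin func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return plugin(ctx)
}

// fakePlugin is a hypothetical CNI stand-in that needs d to finish unless its
// context is cancelled first.
func fakePlugin(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo reports whether a slow but finite plugin succeeded and whether a
// runaway plugin was cut off by the deadline.
func demo() (slowOK, runawayStopped bool) {
	slowOK = setupNetwork(500*time.Millisecond, fakePlugin(20*time.Millisecond)) == nil
	err := setupNetwork(20*time.Millisecond, fakePlugin(5*time.Second))
	runawayStopped = errors.Is(err, context.DeadlineExceeded)
	return slowOK, runawayStopped
}

func main() {
	fmt.Println("production timeout would be", networkTimeout)
	fmt.Println(demo())
}
```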

timeout.bats is a test suite that tests different scenarios regarding timeouts in sandbox running and container creation. It requires a crictl that knows about the -T option. Signed-off-by: Peter Hunt <[email protected]>

An older version of this code used to have a goroutine for each resource, which is no longer the case, so remove the obsolete part of the doc. How a resource becomes stale and is removed is already described elsewhere. Signed-off-by: Kir Kolyshkin <[email protected]>

Signed-off-by: Kir Kolyshkin <[email protected]>

The 10s timeout is sometimes not enough to finish container or pod creation. Increase it to 30s to fix occasional flakes, and move the wait to a separate function, wait_crio. While at it:
- increase the conmon sleep and the crictl create/runp cancel timeout to 3s;
- move create_conmon to setup;
- fix ID checks (we're looking for a string, not a substring);
- change a 3m timeout to 150s.

Not critical, just nits. Signed-off-by: Kir Kolyshkin <[email protected]>

Before, it was possible to segfault when WatcherForResource was called followed by a Get, as we didn't check that the resource had actually been put. Fix this. Signed-off-by: Peter Hunt <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

We need to specifically register "Describe" functions, but ginkgo doesn't allow us to register multiple ones. Wrap different functionality in different Contexts so they all run. Signed-off-by: Peter Hunt <[email protected]>

Signed-off-by: Peter Hunt <[email protected]>

haircommander · 2021-02-15T18:35:09Z

also includes #4530

haircommander · 2021-03-04T15:57:43Z

/override integration_crun

openshift-ci-robot · 2021-03-04T15:57:44Z

@haircommander: /override requires a failed status context to operate on.
The following unknown contexts were given:

integration_crun

Only the following contexts were expected:

ci/kata-jenkins
ci/openshift-jenkins/critest_fedora
ci/openshift-jenkins/critest_rhel
ci/openshift-jenkins/e2e_crun
ci/openshift-jenkins/e2e_crun_cgroupv2
ci/openshift-jenkins/e2e_features_fedora
ci/openshift-jenkins/e2e_features_rhel
ci/openshift-jenkins/e2e_fedora
ci/openshift-jenkins/e2e_rhel
ci/openshift-jenkins/integration_crun
ci/openshift-jenkins/integration_crun_cgroupv2
ci/openshift-jenkins/integration_fedora
ci/openshift-jenkins/integration_rhel
ci/prow/e2e-aws
ci/prow/images
tide

In response to this:

/override integration_crun

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

haircommander · 2021-03-04T15:57:58Z

/override ci/openshift-jenkins/integration_crun

openshift-ci-robot · 2021-03-04T15:59:03Z

@haircommander: Overrode contexts on behalf of haircommander: ci/openshift-jenkins/integration_crun

In response to this:

/override ci/openshift-jenkins/integration_crun

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

haircommander · 2021-03-04T17:25:25Z

/hold cancel

I believe we've soaked enough in 1.20

mrunalp · 2021-03-04T17:28:48Z

/lgtm

haircommander · 2021-03-04T18:15:35Z

/retest

openshift-ci · 2021-03-04T19:15:54Z

@haircommander: The following test failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/openshift-jenkins/e2e_crun_cgroupv2	`c31c1fb`	link	`/test e2e_cgroupv2`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

haircommander requested review from mrunalp and runcom as code owners December 17, 2020 22:05

openshift-ci-robot added labels Dec 17, 2020: release-note (denotes a PR that will be considered when it comes time to generate release notes), kind/bug (categorizes issue or PR as related to a bug), dco-signoff: yes (indicates the PR's author has DCO signed all their commits)

openshift-ci-robot requested review from giuseppe and sboeuf December 17, 2020 22:05

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Dec 17, 2020

openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Dec 17, 2020

haircommander and others added 16 commits February 15, 2021 13:34

network: create as early as possible

b9de071

Signed-off-by: Peter Hunt <[email protected]>

server: refactor handling of cleanup funcs

fb368c1

Signed-off-by: Peter Hunt <[email protected]>

Add unit tests for ResourceCache

1a3da77

Signed-off-by: Peter Hunt <[email protected]>

test: add timeout.bats

5342d3e

Signed-off-by: Peter Hunt <[email protected]>

test/timeout.bats: fix comments

74eb76d

Signed-off-by: Kir Kolyshkin <[email protected]>

ResourceStore: don't segfault

ce42fa7

Signed-off-by: Peter Hunt <[email protected]>

ResourceStore: update docs for WatcherForResource

96295e0

Signed-off-by: Peter Hunt <[email protected]>

ResourceStore: update tests to all run

e8f0575

Signed-off-by: Peter Hunt <[email protected]>

ResourceStore: extend tests to test WatcherForResource

0e18231

Signed-off-by: Peter Hunt <[email protected]>

ResourceStore: add close method

c31c1fb

Signed-off-by: Peter Hunt <[email protected]>

haircommander force-pushed the handle-timeout-1.19 branch from c23f369 to c31c1fb Compare February 15, 2021 18:35

openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Mar 4, 2021

openshift-ci-robot assigned mrunalp Mar 4, 2021

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Mar 4, 2021

openshift-merge-robot merged commit 3a10ad2 into cri-o:release-1.19 Mar 4, 2021
