@haircommander
Member

What type of PR is this?

/kind bug


What this PR does / why we need it:

carries #4394 and #4422
replaces #4421

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug where a timeout in RunPodSandbox or CreateContainer requests caused CRI-O to delete the newly created resource. Now it saves that resource until the kubelet re-requests it, allowing the kubelet and CRI-O to reconcile more quickly when nodes are under load.

umohnani8 and others added 23 commits December 7, 2020 13:55
[1.20] Point to k8s release-1.20 branch for tests
so we don't leak a goroutine

Signed-off-by: Peter Hunt <[email protected]>
…pick-4412-to-release-1.20

[release-1.20] image pull: close progress chan
Signed-off-by: Urvashi Mohnani <[email protected]>
This saves some time (a few seconds per test at least) as we avoid
running setup_test (and cleanup_test has nothing to clean up), and
removes some code duplication.

Signed-off-by: Kir Kolyshkin <[email protected]>
...and add a status check to one case where we use run, to make it
more obvious that `run` is really needed here.

Signed-off-by: Kir Kolyshkin <[email protected]>
Using /dev/loop-control is problematic since it is not supposed
to be read from or written to.

Use /dev/kmsg, and actually enable the write test.

Signed-off-by: Kir Kolyshkin <[email protected]>
…pick-4402-to-release-1.20

[release-1.20] convert shmsize annotation to handler_allowed
…pick-4369-to-release-1.20

[release-1.20] test/devices.bats: nits
[1.20] Bump version to v1.20.0-rc.1
which changed after a k8s release-notes package bump

Signed-off-by: Peter Hunt <[email protected]>
ResourceCache is a structure that keeps track of partially created Pods and Containers.
Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>
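
To illustrate the two features listed in the commit message above, here is a minimal sketch of a cache that keeps a timed-out resource until it is either claimed or goes stale. It is not the CRI-O implementation: the `Resource`, `entry`, `Put`, `Get`, `collect`, and `demoCache` names are hypothetical, and the 20ms stale timer is only there to keep the demo fast.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Resource is a hypothetical stand-in for a partially created pod or container.
type Resource struct {
	ID      string
	Cleanup func() // run if nobody claims the resource before it goes stale
}

type entry struct {
	res   *Resource
	timer *time.Timer
}

// ResourceCache holds resources whose creation request timed out.
type ResourceCache struct {
	mu         sync.Mutex
	staleAfter time.Duration
	entries    map[string]*entry
}

func NewResourceCache(staleAfter time.Duration) *ResourceCache {
	return &ResourceCache{staleAfter: staleAfter, entries: make(map[string]*entry)}
}

// Put stores r under name and arms a timer that garbage-collects it.
func (c *ResourceCache) Put(name string, r *Resource) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.entries[name]; ok {
		return fmt.Errorf("resource %q is already cached", name)
	}
	e := &entry{res: r}
	e.timer = time.AfterFunc(c.staleAfter, func() { c.collect(name, e) })
	c.entries[name] = e
	return nil
}

// collect drops a stale entry and runs its cleanup, unless Get already claimed it.
func (c *ResourceCache) collect(name string, e *entry) {
	c.mu.Lock()
	stale := c.entries[name] == e
	if stale {
		delete(c.entries, name)
	}
	c.mu.Unlock()
	if stale && e.res.Cleanup != nil {
		e.res.Cleanup()
	}
}

// Get removes and returns the cached resource and stops its timer.
func (c *ResourceCache) Get(name string) (*Resource, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[name]
	if !ok {
		return nil, false
	}
	e.timer.Stop()
	delete(c.entries, name)
	return e.res, true
}

// demoCache reports whether a prompt retry claimed its resource and whether
// an unclaimed resource was garbage-collected.
func demoCache() (claimed, collected bool) {
	c := NewResourceCache(20 * time.Millisecond)
	_ = c.Put("pod-a", &Resource{ID: "a"})
	_, claimed = c.Get("pod-a")

	done := make(chan struct{})
	_ = c.Put("pod-b", &Resource{ID: "b", Cleanup: func() { close(done) }})
	<-done // blocks until the timer fires and the cleanup runs
	_, stillThere := c.Get("pod-b")
	return claimed, !stillThere
}

func main() {
	claimed, collected := demoCache()
	fmt.Println("claimed:", claimed, "collected:", collected)
}
```

The sketch arms one timer per entry rather than starting a goroutine per resource; whether that matches the real garbage collection strategy is an assumption.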
Previously, when a client's request for a RunPodSandbox or CreateContainer timed out, CRI-O would clean up the resource.

However, these requests usually fail when the node is under load. In these cases, it would be better to hold onto the progress,
not get rid of it.

This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run.
When a duplicate name is detected, before erroring, the server checks in the ResourceCache to see if we've already
successfully created that resource. If so, we return it as if we'd just created it.

It also moves the SetCreated call to after the resource is deemed as not having timed out.

Hopefully, this reduces the load on already overloaded nodes.

Signed-off-by: Peter Hunt <[email protected]>
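
The retry flow described in the commit message above can be sketched roughly as follows. This is a simplified, single-goroutine model under stated assumptions: the `server` type, its two maps, the `timedOut` flag (standing in for the client's request context expiring), and `demoRetry` are all hypothetical, and real creation work is elided.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNameReserved = errors.New("name is reserved")
	errTimedOut     = errors.New("request timed out")
)

// server is a hypothetical, non-concurrent model of the request handler state.
type server struct {
	reserved map[string]string // name -> ID of every resource being created or created
	cache    map[string]string // name -> ID of resources whose request timed out
}

func newServer() *server {
	return &server{reserved: map[string]string{}, cache: map[string]string{}}
}

// createContainer models one client request for the given name.
func (s *server) createContainer(name, id string, timedOut bool) (string, error) {
	if _, taken := s.reserved[name]; taken {
		// Duplicate name: before erroring, check whether an earlier,
		// timed-out request already finished creating this resource.
		if cachedID, ok := s.cache[name]; ok {
			delete(s.cache, name)
			return cachedID, nil // return it as if it had just been created
		}
		return "", errNameReserved
	}
	s.reserved[name] = id

	// ... the expensive creation work would happen here ...

	if timedOut {
		// Keep the progress instead of cleaning it up.
		s.cache[name] = id
		return "", errTimedOut
	}
	// Only now, knowing the request did not time out, would the resource
	// be marked as created.
	return id, nil
}

// demoRetry runs a timed-out request followed by a retry for the same name.
func demoRetry() (id string, firstErr, secondErr error) {
	s := newServer()
	_, firstErr = s.createContainer("ctr", "id-1", true)
	id, secondErr = s.createContainer("ctr", "id-2", false)
	return id, firstErr, secondErr
}

func main() {
	id, firstErr, secondErr := demoRetry()
	fmt.Println(id, firstErr, secondErr)
}
```

In the demo the retry receives `id-1`, the resource built by the first request, rather than creating a second one.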
Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved.

This is bad UX, and we're capable of improving it.

Add watcher idiom to resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to
wait for a resource to be available.

A key point here: if the resource becomes available while we're watching for it,
*we still need to error on this request*.
This is because we could get the resource from the cache, remove it (thus meaning it won't be cleaned up),
and the kubelet's request could time out, and it could try again. This would cause us to leak a resource.

This way, if we get into this situation, there need to be three requests:
- the first, which times out
- the second, which discovers the resource is ready, but still errors
- the third, which actually retrieves that resource and returns it

This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).

Signed-off-by: Peter Hunt <[email protected]>
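
A rough sketch of the watcher idiom from the commit message above, showing the second and third requests of the three-request sequence. The names `watchCache`, `Watch`, `watching`, `handleReserved`, `errReadyOnRetry`, and `demoWatch` are hypothetical and not taken from CRI-O; the polling loop in the demo exists only to make the ordering deterministic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errReadyOnRetry = errors.New("resource is now ready; it will be returned on the next retry")

type watchCache struct {
	mu       sync.Mutex
	ready    map[string]string          // name -> ID of finished resources
	watchers map[string][]chan struct{} // name -> channels closed on Put
}

func newWatchCache() *watchCache {
	return &watchCache{ready: map[string]string{}, watchers: map[string][]chan struct{}{}}
}

// Put stores a finished resource and wakes every watcher waiting for it.
func (c *watchCache) Put(name, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ready[name] = id
	for _, w := range c.watchers[name] {
		close(w)
	}
	delete(c.watchers, name)
}

// Get removes and returns the resource, if it is ready.
func (c *watchCache) Get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	id, ok := c.ready[name]
	delete(c.ready, name)
	return id, ok
}

// Watch returns a channel that is closed once name is in the cache.
// Watchers abandoned on context cancellation stay registered until Put.
func (c *watchCache) Watch(name string) <-chan struct{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	w := make(chan struct{})
	if _, ok := c.ready[name]; ok {
		close(w)
		return w
	}
	c.watchers[name] = append(c.watchers[name], w)
	return w
}

func (c *watchCache) watching(name string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.watchers[name])
}

// handleReserved is what a handler would do after hitting a reserved name.
func handleReserved(ctx context.Context, c *watchCache, name string) (string, error) {
	if id, ok := c.Get(name); ok {
		return id, nil
	}
	select {
	case <-c.Watch(name):
		// Ready, but still an error: the resource stays in the cache so it
		// is not leaked if this request also times out on the client side.
		return "", errReadyOnRetry
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demoWatch plays the second and third requests of the sequence.
func demoWatch() (secondErr error, thirdID string) {
	c := newWatchCache()
	ctx := context.Background()
	errs := make(chan error, 1)
	go func() { // second request: arrives while creation is still running
		_, err := handleReserved(ctx, c, "pod")
		errs <- err
	}()
	for c.watching("pod") == 0 {
		time.Sleep(time.Millisecond)
	}
	c.Put("pod", "id-1") // the first request's creation finally completes
	secondErr = <-errs
	thirdID, _ = handleReserved(ctx, c, "pod") // third request
	return secondErr, thirdID
}

func main() {
	secondErr, thirdID := demoWatch()
	fmt.Println("second:", secondErr)
	fmt.Println("third:", thirdID)
}
```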
Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the
network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow
us to reuse the result of even the longest CNI creation. However, this risks the
CNI plugin running forever, which is not ideal.

Instead, give the sandbox network creation 5 minutes (a minute more than the full request),
to improve the odds we have a completed sandbox that can be reused, rather than thrown away.

Signed-off-by: Peter Hunt <[email protected]>
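
One way to picture the bounded network timeout from the commit message above is a context that is not derived from the client's request context. This is a sketch under assumptions: `setupNetwork`, `fakePlugin`, and `demoNetwork` are hypothetical names, the 4 minute request timeout is inferred from "a minute more than the full request", and the demo uses millisecond-scale values so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	requestTimeout = 4 * time.Minute              // assumed full request timeout
	networkTimeout = requestTimeout + time.Minute // 5 minutes for network creation
)

// setupNetwork runs plugin under its own deadline. The context deliberately
// starts from context.Background() rather than the request context, so a
// client-side timeout does not abort network creation.
func setupNetwork(timeout time.Duration, plugin func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return plugin(ctx)
}

// fakePlugin simulates a CNI plugin that needs d to finish but honours cancellation.
func fakePlugin(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demoNetwork reports whether a slow plugin outlived the request timeout and
// whether a hung plugin was still cut off by the network timeout.
func demoNetwork() (finished, bounded bool) {
	// The simulated request gives up after 10ms.
	reqCtx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	// The plugin needs 50ms, longer than the request, but still completes.
	err := setupNetwork(2*time.Second, fakePlugin(50*time.Millisecond))
	finished = err == nil && reqCtx.Err() != nil

	// A plugin that would run for an hour is stopped by the bound.
	err = setupNetwork(20*time.Millisecond, fakePlugin(time.Hour))
	bounded = errors.Is(err, context.DeadlineExceeded)
	return finished, bounded
}

func main() {
	finished, bounded := demoNetwork()
	fmt.Println("network timeout:", networkTimeout)
	fmt.Println("finished after request timeout:", finished, "hung plugin bounded:", bounded)
}
```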
timeout.bats is a test suite that covers different scenarios involving timeouts in
sandbox running and container creation.

It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>
An older version of this code used to have a goroutine for each resource,
which is no longer the case, so remove the obsolete part of the doc.

How a resource becomes stale and is removed is already described
elsewhere.

Signed-off-by: Kir Kolyshkin <[email protected]>
The 10s timeout is sometimes not enough to finish container or pod
creation. Increase to 30s to fix occasional flakes, and move to a
separate function wait_crio.

While at it,
 - increase conmon sleep and crictl create/runp cancel timeout to 3s;
 - move create_conmon to setup;
 - fix ID checks (we're looking for string, not substring);
 - change a 3m timeout to 150s. Not critical, just nits.

Signed-off-by: Kir Kolyshkin <[email protected]>
@openshift-ci-robot openshift-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Dec 10, 2020
@openshift-ci-robot

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2020
@haircommander
Member Author

shoot

@openshift-merge-robot
Contributor

@haircommander: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/critest_fedora | 9968443 | link | `/test critest_fedora` |

Full PR test history. Your PR dashboard.



Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes.


5 participants