
@openshift-cherrypick-robot

This is an automated cherry-pick of #4394

/assign haircommander

Fix a bug where a timeout in a RunPodSandbox or CreateContainer request caused CRI-O to delete the newly created resource. CRI-O now saves that resource until the kubelet re-requests it, allowing the kubelet and CRI-O to reconcile more quickly when nodes are under load.

ResourceCache is a structure that keeps track of partially created Pods and Containers.
Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)
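
A minimal sketch of the idea behind such a cache, assuming a much simpler design than the real one. The names `resourceCache`, `Put` and `Get` are hypothetical and are not CRI-O's actual API; the sketch only shows a resource being held after a timed-out request and garbage collected if nobody claims it before a timer fires.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is one partially created resource waiting to be claimed.
type entry struct {
	id    string
	timer *time.Timer
}

// resourceCache is a hypothetical, simplified stand-in for the
// ResourceCache described above; it is not CRI-O's implementation.
type resourceCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	resources map[string]*entry
}

func newResourceCache(ttl time.Duration) *resourceCache {
	return &resourceCache{ttl: ttl, resources: make(map[string]*entry)}
}

// Put stores a resource whose creation request timed out. If nobody
// claims it within ttl, it is dropped from the cache and cleanup runs.
func (c *resourceCache) Put(name, id string, cleanup func()) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.resources[name]; ok {
		return fmt.Errorf("resource %q is already cached", name)
	}
	e := &entry{id: id}
	e.timer = time.AfterFunc(c.ttl, func() {
		c.mu.Lock()
		cur, ok := c.resources[name]
		expired := ok && cur == e
		if expired {
			delete(c.resources, name)
		}
		c.mu.Unlock()
		if expired {
			cleanup() // garbage collect outside the lock
		}
	})
	c.resources[name] = e
	return nil
}

// Get claims a cached resource: it stops the garbage collection timer,
// removes the entry, and returns the resource ID.
func (c *resourceCache) Get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.resources[name]
	if !ok {
		return "", false
	}
	e.timer.Stop()
	delete(c.resources, name)
	return e.id, true
}

func main() {
	cache := newResourceCache(50 * time.Millisecond)

	// The retry arrives in time, so the cached sandbox is reused.
	_ = cache.Put("pod-a", "sandbox-1", func() { fmt.Println("cleaning up sandbox-1") })
	if id, ok := cache.Get("pod-a"); ok {
		fmt.Println("reusing", id)
	}

	// Nobody retries, so the timer garbage collects the sandbox.
	_ = cache.Put("pod-b", "sandbox-2", func() { fmt.Println("cleaning up sandbox-2") })
	time.Sleep(100 * time.Millisecond)
	_, ok := cache.Get("pod-b")
	fmt.Println("pod-b still cached:", ok)
}
```

Once `Get` has handed a resource out, the cache no longer owns it, so whoever claimed it is responsible for its lifetime from then on.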

Signed-off-by: Peter Hunt <[email protected]>

Previously, when a client's RunPodSandbox or CreateContainer request timed out, CRI-O would clean up the resource.

However, these requests usually fail when the node is under load. In these cases, it is better to hold onto the progress
than to discard it.

This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run.
When a duplicate name is detected, before erroring, the server checks in the ResourceCache to see if we've already
successfully created that resource. If so, we return it as if we'd just created it.

It also moves the SetCreated call to after the point where the request is known not to have timed out.

Hopefully, this reduces the load on already overloaded nodes.
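
The flow described in this commit message can be sketched roughly as follows. This is a single-goroutine model under stated assumptions, not the server's real code: the `server` type, its maps, and the `createContainer` signature are hypothetical, and `done` stands in for the request context's `Done()` channel.

```go
package main

import (
	"errors"
	"fmt"
)

var errNameReserved = errors.New("name is reserved")

// server is a hypothetical model of the bookkeeping involved.
// It is not safe for concurrent use; real code would need locking.
type server struct {
	reserved map[string]bool   // names claimed by some request
	cache    map[string]string // name -> ID finished after a timeout
	created  map[string]bool   // IDs marked as created
}

func newServer() *server {
	return &server{
		reserved: make(map[string]bool),
		cache:    make(map[string]string),
		created:  make(map[string]bool),
	}
}

// createContainer models one request. done is closed once the client
// has given up on the request.
func (s *server) createContainer(done <-chan struct{}, name string, create func() string) (string, error) {
	if s.reserved[name] {
		// Duplicate name: before erroring, check whether an earlier
		// request already finished creating this resource.
		id, ok := s.cache[name]
		if !ok {
			return "", errNameReserved
		}
		delete(s.cache, name)
		s.created[id] = true
		return id, nil
	}
	s.reserved[name] = true
	id := create()

	select {
	case <-done:
		// The client timed out: keep the work instead of deleting it.
		s.cache[name] = id
		return "", errors.New("request timed out; resource cached for retry")
	default:
	}
	// Only mark the resource created once we know we did not time out.
	s.created[id] = true
	return id, nil
}

func main() {
	s := newServer()
	timedOut := make(chan struct{})
	close(timedOut) // simulate a client that already gave up

	create := func() string { return "ctr-1" }
	_, err := s.createContainer(timedOut, "web", create)
	fmt.Println("first request:", err)

	id, err := s.createContainer(nil, "web", create)
	fmt.Println("retry:", id, err)
}
```

In this model the retry never calls `create` again, which is the point of the change: the work done under load is reused rather than repeated.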

Signed-off-by: Peter Hunt <[email protected]>

Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved.

This is bad UX, and we're capable of improving it.

Add a watcher idiom to the resource cache, allowing a RunPodSandbox or CreateContainer handler routine to
wait for a resource to become available.

The key point here is that if the resource becomes available while we're watching for it,
*we still need to error on this request*.
Otherwise, we could get the resource from the cache and remove it (meaning it won't be cleaned up),
then the kubelet's request could time out and the kubelet could try again. This would cause us to leak a resource.

This way, if we get into this situation, there need to be three requests:
- the first, which times out
- the second, which discovers the resource is ready but still errors
- the third, which actually retrieves that resource and returns it

This will result in far fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).
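
The watcher idiom and the three-request sequence can be sketched like this. It is an illustration under assumptions, not the actual change: `watchCache`, `Watch`, `getOrWait` and `errReadyOnRetry` are hypothetical names, `done` stands in for the request context's `Done()` channel, and `maxWait` is a parameter rather than a fixed value.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errReadyOnRetry = errors.New("resource is now ready and will be returned on the next retry")

// watchCache is a hypothetical cache that lets handlers wait for a name.
type watchCache struct {
	mu       sync.Mutex
	ready    map[string]string
	watchers map[string][]chan struct{}
}

func newWatchCache() *watchCache {
	return &watchCache{
		ready:    make(map[string]string),
		watchers: make(map[string][]chan struct{}),
	}
}

// Put records a finished resource and wakes every watcher of that name.
func (c *watchCache) Put(name, id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.ready[name] = id
	for _, w := range c.watchers[name] {
		close(w)
	}
	delete(c.watchers, name)
}

// Get removes and returns the resource if it is ready.
func (c *watchCache) Get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	id, ok := c.ready[name]
	delete(c.ready, name)
	return id, ok
}

// Watch returns a channel that is closed once name is ready. Watchers
// that give up stay registered until Put; a real cache would prune them.
func (c *watchCache) Watch(name string) <-chan struct{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	w := make(chan struct{})
	if _, ok := c.ready[name]; ok {
		close(w)
		return w
	}
	c.watchers[name] = append(c.watchers[name], w)
	return w
}

// getOrWait is what a handler would call after seeing a duplicate name.
// If the resource becomes ready while waiting, it still returns an error
// and leaves the resource in the cache for the next retry to claim.
func (c *watchCache) getOrWait(done <-chan struct{}, name string, maxWait time.Duration) (string, error) {
	if id, ok := c.Get(name); ok {
		return id, nil
	}
	select {
	case <-c.Watch(name):
		return "", errReadyOnRetry
	case <-time.After(maxWait):
		return "", fmt.Errorf("name %q is reserved: not ready after %s", name, maxWait)
	case <-done:
		return "", errors.New("request canceled while waiting")
	}
}

func main() {
	cache := newWatchCache()
	// The first (timed-out) request finishes shortly after the retry arrives.
	go func() {
		time.Sleep(50 * time.Millisecond)
		cache.Put("pod-a", "sandbox-1")
	}()
	_, err := cache.getOrWait(nil, "pod-a", time.Second)
	fmt.Println("second request:", err)
	id, err := cache.getOrWait(nil, "pod-a", time.Second)
	fmt.Println("third request:", id, err)
}
```

Because the waiting request blocks instead of failing immediately, the client sees one error per wait period rather than one per quick retry, which is where the drop in "name is reserved" errors comes from.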

Signed-off-by: Peter Hunt <[email protected]>

Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the
network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow
us to reuse even the slowest CNI creations. However, this leaves open the chance that the
CNI plugin runs forever, which is not ideal.

Instead, give the sandbox network creation 5 minutes (a minute more than the full request),
to improve the odds we have a completed sandbox that can be reused, rather than thrown away.
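
One way to picture this is a network setup call that carries its own deadline instead of inheriting the request's. The sketch below is an assumption about the shape of such code, not the actual change: `setUpNetwork`, `fastPlugin` and `stuckPlugin` are hypothetical, and only the 5-minute figure comes from the text above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// networkTimeout is the 5-minute budget mentioned above.
const networkTimeout = 5 * time.Minute

// setUpNetwork runs plugin under its own deadline. The context is derived
// from context.Background() rather than from the request, so a request
// that times out does not cut the network creation short.
func setUpNetwork(timeout time.Duration, plugin func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := plugin(ctx); err != nil {
		return fmt.Errorf("network creation failed: %w", err)
	}
	return nil
}

// fastPlugin stands in for a CNI plugin that finishes immediately.
func fastPlugin(_ context.Context) error { return nil }

// stuckPlugin stands in for a CNI plugin that never finishes on its own;
// it only returns once its deadline expires.
func stuckPlugin(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	fmt.Println("fast plugin:", setUpNetwork(networkTimeout, fastPlugin))
	// A short timeout keeps the demonstration quick.
	fmt.Println("stuck plugin:", setUpNetwork(100*time.Millisecond, stuckPlugin))
}
```

The bound still matters: without it, a plugin like `stuckPlugin` would block forever, which is the case the commit message wants to avoid.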

Signed-off-by: Peter Hunt <[email protected]>

timeout.bats is a test suite that tests different scenarios regarding timeouts in
sandbox running and container creation.

It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>

@openshift-ci-robot added the release-note and dco-signoff: yes labels Dec 8, 2020
@haircommander
Member

/retest

@haircommander
Member

/lgtm

@openshift-ci-robot added the lgtm label Dec 8, 2020
@haircommander
Member

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, openshift-cherrypick-robot

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label Dec 8, 2020
@haircommander
Member

/retest

@codecov

codecov bot commented Dec 8, 2020

Codecov Report

Merging #4421 (b9dda02) into release-1.20 (9cfbe39) will increase coverage by 0.07%.
The diff coverage is 44.18%.

@@               Coverage Diff                @@
##           release-1.20    #4421      +/-   ##
================================================
+ Coverage         40.50%   40.57%   +0.07%     
================================================
  Files               116      117       +1     
  Lines              9330     9407      +77     
================================================
+ Hits               3779     3817      +38     
- Misses             5125     5164      +39     
  Partials            426      426              

@haircommander
Member

/retest

@saschagrunert
Member

/retest

@haircommander
Member

/retest

@haircommander
Member

/retest

@haircommander
Member

(I recognize that getting this in will make the tests flaky; before we cut the final 1.20, we should get some version of #4422 in.)

@haircommander
Member

/retest

@haircommander
Member

/retest

@haircommander
Member

/retest

@haircommander
Member

/retest

@openshift-merge-robot
Contributor

@openshift-cherrypick-robot: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/integration_rhel | b9dda02 | link | /test integration_rhel |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@haircommander
Member

replaced by #4430


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.


5 participants