
@haircommander
Member

What type of PR is this?

/kind api-change

/kind bug

/kind ci
/kind cleanup
/kind dependency-change
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake
/kind other

What this PR does / why we need it:

carries #4430 #4258 #4241 #4240

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug where a timeout in RunPodSandbox or CreateContainer requests caused CRI-O to delete the newly created resource. Now, it saves that resource until the kubelet re-requests it, allowing the kubelet and CRI-O to reconcile more quickly when nodes are under load.

@openshift-ci-robot added the release-note, kind/bug, and dco-signoff: yes labels Dec 17, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label Dec 17, 2020
@mrunalp
Member

mrunalp commented Dec 17, 2020

/hold

Putting a hold as we want to make sure this is tested well before merging into 1.19

@openshift-ci-robot added the do-not-merge/hold label Dec 17, 2020
@mrunalp
Member

mrunalp commented Dec 17, 2020

/hold

@codecov

codecov bot commented Dec 17, 2020

Codecov Report

Merging #4444 (c23f369) into release-1.19 (6377f68) will increase coverage by 0.02%.
The diff coverage is 34.28%.

@@               Coverage Diff                @@
##           release-1.19    #4444      +/-   ##
================================================
+ Coverage         40.87%   40.89%   +0.02%     
================================================
  Files               114      115       +1     
  Lines              8744     8795      +51     
================================================
+ Hits               3574     3597      +23     
- Misses             4807     4834      +27     
- Partials            363      364       +1     

@openshift-merge-robot
Contributor

@haircommander: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/integration_fedora | c23f369 | link | /test integration_fedora |
| ci/prow/e2e-aws | c23f369 | link | /test e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

haircommander and others added 16 commits February 15, 2021 13:34
If we create the network before we have an infra container, but fail to fully create a sandbox,
we attempt to clean up the network. Calling networkStop() causes CRI-O to place a file in the
sandbox's infra container's directory, which lets us restore the fact that the network had been stopped.

The problem is, we don't have an infra container directory, so the call segfaults.

Instead, check if the sandbox has finished creating before attempting to create the file. If it hasn't, there will be
no sandbox to restore, so we don't really need the temp file.

Another option would be to wire it so that the sandbox has access to the infraContainer.Dir() without actually having an infra container.
That requires another item in libsandbox.New(), which I find cumbersome. Further, I think sandbox creation code is itching for a refactor,
which can include that fix if we find it desirable. In the meantime, this workaround is sufficient.

Signed-off-by: Peter Hunt <[email protected]>
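
The guard described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not CRI-O's code: the `sandbox` and `infraContainer` types, the `markNetworkStopped` method, and the `network-stopped` file name are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// infraContainer and sandbox are hypothetical, pared-down stand-ins.
type infraContainer struct{ dir string }

type sandbox struct {
	created bool            // set once sandbox creation has fully finished
	infra   *infraContainer // nil until the infra container exists
}

// markNetworkStopped records on disk that the network was stopped, so the
// state can be restored later. The marker file name is made up for this sketch.
func (s *sandbox) markNetworkStopped() error {
	if !s.created {
		// Creation never finished: there is no infra directory to write to
		// and no sandbox to restore, so skip the file instead of
		// dereferencing a nil infra container.
		return nil
	}
	f, err := os.Create(filepath.Join(s.infra.dir, "network-stopped"))
	if err != nil {
		return err
	}
	return f.Close()
}

// demo exercises both paths: a half-created sandbox and a finished one.
func demo() (partialErr, fullErr error, markerExists bool) {
	partialErr = (&sandbox{}).markNetworkStopped()

	dir, err := os.MkdirTemp("", "infra")
	if err != nil {
		return partialErr, err, false
	}
	defer os.RemoveAll(dir)

	full := &sandbox{created: true, infra: &infraContainer{dir: dir}}
	fullErr = full.markNetworkStopped()
	_, statErr := os.Stat(filepath.Join(dir, "network-stopped"))
	return partialErr, fullErr, statErr == nil
}

func main() {
	partialErr, fullErr, markerExists := demo()
	fmt.Println(partialErr, fullErr, markerExists)
}
```

Without the `created` check, the half-created sandbox would dereference a nil `infra` pointer, which is the kind of crash the commit message describes.
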
It doesn't make much sense to have so many deferred funcs queued, each checking retErr.

Instead, we can check retErr once and loop through a slice of cleanupFuncs.

Signed-off-by: Peter Hunt <[email protected]>
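
A sketch of the cleanup pattern from the commit message above, under stated assumptions: `createSandbox`, its step names, and the `failAt` parameter are hypothetical, and running the cleanups in reverse order is my assumption (it mirrors how stacked defers would have unwound).

```go
package main

import (
	"errors"
	"fmt"
)

// createSandbox is a hypothetical stand-in for a multi-step creation routine.
// failAt selects which step (1-3) fails; 0 means every step succeeds.
// log records the order in which steps and cleanups run.
func createSandbox(failAt int, log *[]string) (retErr error) {
	// Cleanups are appended as each step succeeds.
	var cleanupFuncs []func()

	// A single deferred function checks retErr once and then unwinds the
	// completed steps, newest first.
	defer func() {
		if retErr == nil {
			return
		}
		for i := len(cleanupFuncs) - 1; i >= 0; i-- {
			cleanupFuncs[i]()
		}
	}()

	steps := []string{"reserve name", "create network", "create infra container"}
	for i, step := range steps {
		if failAt == i+1 {
			return errors.New("failed to " + step)
		}
		*log = append(*log, "do: "+step)
		step := step // capture this iteration's value for the closure
		cleanupFuncs = append(cleanupFuncs, func() {
			*log = append(*log, "undo: "+step)
		})
	}
	return nil
}

func main() {
	var log []string
	err := createSandbox(3, &log)
	fmt.Println("error:", err)
	for _, entry := range log {
		fmt.Println(entry)
	}
}
```

Compared with one `defer` per step, the error check lives in one place, and a later change (such as keeping the resource on timeout) only has to touch that single deferred function.
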
ResourceCache is a structure that keeps track of partially created Pods and Containers.
Its features include:
- tracking pods and containers after their initial creation times out
- automatic garbage collection (after a timer)

Signed-off-by: Peter Hunt <[email protected]>
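
To make the two features listed above concrete, here is a simplified sketch of such a cache. It is not the ResourceCache from this PR: the `resourceCache` type, its methods, and the two-pass stale marking are assumptions chosen to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is one partially created resource plus the cleanup to run if no
// retried request ever claims it.
type entry struct {
	id      string
	cleanup func()
	stale   bool
}

// resourceCache is a hypothetical, simplified cache keyed by resource name.
type resourceCache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newResourceCache() *resourceCache {
	return &resourceCache{entries: make(map[string]*entry)}
}

// Put stores a resource whose creation request timed out.
func (c *resourceCache) Put(name, id string, cleanup func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[name] = &entry{id: id, cleanup: cleanup}
}

// Get hands the resource to a retried request and removes it from the cache,
// so it is no longer subject to garbage collection.
func (c *resourceCache) Get(name string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[name]
	if !ok {
		return "", false
	}
	delete(c.entries, name)
	return e.id, true
}

// sweep is one garbage-collection pass: entries already marked stale are
// cleaned up and dropped, and the remaining entries are marked stale.
func (c *resourceCache) sweep() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, e := range c.entries {
		if e.stale {
			e.cleanup()
			delete(c.entries, name)
			continue
		}
		e.stale = true
	}
}

// run sweeps on a timer until stop is closed.
func (c *resourceCache) run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.sweep()
		case <-stop:
			return
		}
	}
}

func main() {
	cache := newResourceCache()
	cache.Put("pod-a", "sandbox-1", func() { fmt.Println("cleaning up pod-a") })
	cache.sweep() // first pass only marks pod-a stale
	cache.sweep() // second pass runs the cleanup and drops the entry
	_, ok := cache.Get("pod-a")
	fmt.Println("pod-a still cached:", ok)
}
```

In a server, `run` would be started once in its own goroutine; `main` calls `sweep` directly only to keep the demonstration deterministic.
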
Previously, when a client's request for RunPodSandbox or CreateContainer timed out, CRI-O would clean up the resource.

However, these requests usually fail when the node is under load. In these cases, it would be better to hold onto the progress
rather than discard it.

This commit uses the previously created ResourceCache to cache the progress of a container creation and sandbox run.
When a duplicate name is detected, before erroring, the server checks in the ResourceCache to see if we've already
successfully created that resource. If so, we return it as if we'd just created it.

It also moves the SetCreated call to after the resource is deemed as not having timed out.

Hopefully, this reduces the load on already overloaded nodes.

Signed-off-by: Peter Hunt <[email protected]>
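
The duplicate-name path described above could look roughly like this. Everything here is illustrative: the `server` type, its maps, and the `timedOut` flag (which simulates the client giving up) are assumptions, and the SetCreated bookkeeping is left out.

```go
package main

import (
	"errors"
	"fmt"
)

var errNameReserved = errors.New("name is reserved")

// server is a hypothetical stand-in for the CRI server state.
type server struct {
	reserved map[string]bool   // names claimed by some request
	cache    map[string]string // name -> ID of resources whose request timed out
	nextID   int
}

func newServer() *server {
	return &server{reserved: map[string]bool{}, cache: map[string]string{}}
}

// createContainer creates a container under name. timedOut simulates the
// client's deadline expiring before the creation finished.
func (s *server) createContainer(name string, timedOut bool) (string, error) {
	if s.reserved[name] {
		// Duplicate name: before erroring, check whether an earlier,
		// timed-out request already finished creating this resource.
		if id, ok := s.cache[name]; ok {
			delete(s.cache, name)
			return id, nil // return it as if it had just been created
		}
		return "", errNameReserved
	}

	s.reserved[name] = true
	s.nextID++
	id := fmt.Sprintf("ctr-%d", s.nextID)

	if timedOut {
		// Keep the progress instead of tearing the container down.
		s.cache[name] = id
		return "", errors.New("context deadline exceeded")
	}
	return id, nil
}

func main() {
	s := newServer()
	_, err := s.createContainer("web", true)
	fmt.Println("first attempt:", err)
	id, err := s.createContainer("web", false)
	fmt.Println("retry:", id, err)
}
```

The retry does no creation work at all, which is where the load reduction on a busy node would come from.
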
Even if we use the resource cache as is, the user is still bombarded with messages saying the name is reserved.

This is bad UX, and we're capable of improving it.

Add a watcher idiom to the resource cache, allowing a handler routine of RunPodSandbox or CreateContainer to
wait for a resource to be available.

The key point is that if the resource becomes available while we're watching for it,
*we still need to error on this request*.
This is because we could get the resource from the cache and remove it (meaning it won't be cleaned up),
and then the kubelet's request could time out and it could try again. This would cause us to leak a resource.

This way, if we get into this situation, there need to be three requests:
- a first that times out
- a second that discovers the resource is ready, but still errors
- a third that actually retrieves that resource and returns it

This will result in many fewer "name is reserved" errors (from one every 2 seconds to one every 4 minutes).

Signed-off-by: Peter Hunt <[email protected]>
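
A rough sketch of the watcher idiom and the three-request flow described above. The `store` type, `hasWatcher`, `handleDuplicate`, and `demo` are hypothetical; only the method names `Put`, `Get`, and `WatcherForResource` echo names mentioned in this PR, and their signatures here are assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// store is a hypothetical, minimal resource cache with watchers.
type store struct {
	mu       sync.Mutex
	ready    map[string]string          // name -> ID of finished resources
	watchers map[string][]chan struct{} // name -> channels closed on Put
}

func newStore() *store {
	return &store{ready: make(map[string]string), watchers: make(map[string][]chan struct{})}
}

// Put caches a finished resource and wakes everyone watching its name.
func (s *store) Put(name, id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ready[name] = id
	for _, w := range s.watchers[name] {
		close(w)
	}
	delete(s.watchers, name)
}

// Get removes and returns a cached resource, if present.
func (s *store) Get(name string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	id, ok := s.ready[name]
	delete(s.ready, name)
	return id, ok
}

// WatcherForResource returns a channel that is closed once name is cached.
func (s *store) WatcherForResource(name string) <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	w := make(chan struct{})
	if _, ok := s.ready[name]; ok {
		close(w) // already available, nothing to wait for
		return w
	}
	s.watchers[name] = append(s.watchers[name], w)
	return w
}

// hasWatcher exists only so the demo can synchronize deterministically.
func (s *store) hasWatcher(name string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.watchers[name]) > 0
}

// handleDuplicate is what a retried request does when its name is reserved.
func handleDuplicate(ctx context.Context, s *store, name string) (string, error) {
	if id, ok := s.Get(name); ok {
		return id, nil
	}
	select {
	case <-s.WatcherForResource(name):
		// Ready now, but still error: the resource stays in the cache (and
		// eligible for cleanup) until a fresh request claims it.
		return "", errors.New("name is reserved; resource is ready, retry to claim it")
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demo plays the second and third requests; the first (timed-out) request is
// represented by the Put that completes its work.
func demo() (secondErr error, thirdID string, thirdErr error) {
	s := newStore()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	done := make(chan error, 1)
	go func() {
		_, err := handleDuplicate(ctx, s, "pod-a")
		done <- err
	}()
	for !s.hasWatcher("pod-a") { // wait until the second request is watching
		time.Sleep(time.Millisecond)
	}
	s.Put("pod-a", "sandbox-1")
	secondErr = <-done
	thirdID, thirdErr = handleDuplicate(ctx, s, "pod-a")
	return secondErr, thirdID, thirdErr
}

func main() {
	secondErr, thirdID, thirdErr := demo()
	fmt.Println("second request:", secondErr)
	fmt.Println("third request:", thirdID, thirdErr)
}
```

Blocking on the watcher instead of failing immediately is what turns a stream of rapid "name is reserved" errors into a single error per request lifetime.
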
Now that we plan on caching the results of a pod sandbox creation, we shouldn't short-circuit the
network creation. In a perfect world, we'd give the CNI plugin unbounded time, which would allow
us to reuse even the longest CNI creations. However, this leads to the chance that the
CNI plugin runs forever, which is not ideal.

Instead, give the sandbox network creation 5 minutes (a minute more than the full request),
to improve the odds we have a completed sandbox that can be reused, rather than thrown away.

Signed-off-by: Peter Hunt <[email protected]>
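
One way to bound network setup as described above is a dedicated context timeout. The sketch below assumes the bound is independent of the incoming request's context, which the commit message does not state; `setupNetwork` and `createSandboxNetwork` are hypothetical names, and the demo uses millisecond durations in place of the 5-minute limit.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// setupNetwork is a hypothetical stand-in for a CNI plugin call. It takes
// `work` to finish unless ctx is cancelled first.
func setupNetwork(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// createSandboxNetwork gives network setup its own bound (5 minutes in the
// commit message) rather than unbounded time, so a hung plugin cannot run
// forever but a slow one can still finish and be reused.
func createSandboxNetwork(networkTimeout, work time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), networkTimeout)
	defer cancel()
	return setupNetwork(ctx, work)
}

// demo runs one setup that fits inside the bound and one that does not.
func demo() (fastErr, slowErr error) {
	fastErr = createSandboxNetwork(50*time.Millisecond, 5*time.Millisecond)
	slowErr = createSandboxNetwork(10*time.Millisecond, 200*time.Millisecond)
	return fastErr, slowErr
}

func main() {
	fastErr, slowErr := demo()
	fmt.Println("within bound:", fastErr)
	fmt.Println("exceeds bound:", slowErr)
}
```
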
timeout.bats is a test suite that tests different scenarios regarding timeouts in
sandbox running and container creation.

It requires a crictl that knows about the -T option.

Signed-off-by: Peter Hunt <[email protected]>
An older version of this code used to have a goroutine for each resource,
which is no longer the case, so remove the obsolete part of the doc.

How the resource becomes stale and is removed is already described
elsewhere.

Signed-off-by: Kir Kolyshkin <[email protected]>
The 10s timeout is sometimes not enough to finish container or pod
creation. Increase it to 30s to fix occasional flakes, and move it to a
separate function, wait_crio.

While at it,
 - increase conmon sleep and crictl create/runp cancel timeout to 3s;
 - move create_conmon to setup;
 - fix ID checks (we're looking for string, not substring);
 - change a 3m timeout to 150s. Not critical, just nits.

Signed-off-by: Kir Kolyshkin <[email protected]>
Before, it was possible to segfault when WatcherForResource was called followed by a Get,
as we didn't check that the resource was actually put. Fix this.

Signed-off-by: Peter Hunt <[email protected]>
We need to specifically register "Describe" functions,
but ginkgo doesn't allow us to register multiple ones.

Wrap different functionality in different Contexts so they all run.

Signed-off-by: Peter Hunt <[email protected]>
@haircommander
Member Author

also includes #4530

@haircommander
Member Author

/override integration_crun

@openshift-ci-robot

@haircommander: /override requires a failed status context to operate on.
The following unknown contexts were given:

  • integration_crun

Only the following contexts were expected:

  • ci/kata-jenkins
  • ci/openshift-jenkins/critest_fedora
  • ci/openshift-jenkins/critest_rhel
  • ci/openshift-jenkins/e2e_crun
  • ci/openshift-jenkins/e2e_crun_cgroupv2
  • ci/openshift-jenkins/e2e_features_fedora
  • ci/openshift-jenkins/e2e_features_rhel
  • ci/openshift-jenkins/e2e_fedora
  • ci/openshift-jenkins/e2e_rhel
  • ci/openshift-jenkins/integration_crun
  • ci/openshift-jenkins/integration_crun_cgroupv2
  • ci/openshift-jenkins/integration_fedora
  • ci/openshift-jenkins/integration_rhel
  • ci/prow/e2e-aws
  • ci/prow/images
  • tide
Details

In response to this:

/override integration_crun

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haircommander
Member Author

/override ci/openshift-jenkins/integration_crun

@openshift-ci-robot

@haircommander: Overrode contexts on behalf of haircommander: ci/openshift-jenkins/integration_crun

Details

In response to this:

/override ci/openshift-jenkins/integration_crun

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haircommander
Member Author

/hold cancel

I believe we've soaked enough in 1.20

@openshift-ci-robot removed the do-not-merge/hold label Mar 4, 2021
@mrunalp
Member

mrunalp commented Mar 4, 2021

/lgtm

@openshift-ci-robot added the lgtm label Mar 4, 2021
@haircommander
Member Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2021

@haircommander: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/e2e_crun_cgroupv2 | c31c1fb | link | /test e2e_cgroupv2 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot merged commit 3a10ad2 into cri-o:release-1.19 Mar 4, 2021
