Conversation

@fidencio
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Robert Krawitz reported that some pods were not terminating when using the VM runtime type. I tried to reproduce the issue locally and failed. Robert then gave me access to one of his machines and his scripts, and it turns out the problem doesn't happen every time. When it does, it usually involves a hanging infra pod and pods containing init containers. It seems to happen because the VM is shut down, which causes ttrpc.ErrClosed to be reported, and we don't handle that error properly. This PR tries to fix that.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

@haircommander, I remember in the past we had issues with removing pods on runtime VM side that caused kubelet to get confused. Would you have any tip on how I could ensure this is not happening in this case?

Also, I'm not exactly comfortable with the patches "runtime_vm: Don't let wait() return ttrpc.ErrClosed" and "runtime_vm: StopContainers() should not fail when the VM is shutdown". Maybe we could actually return ttrpc.ErrClosed as-is (without passing it to errdefs.FromGRPC()) and check for it in the second patch. That sounds cleaner to me, but I'd like to hear feedback from the reviewers.

Does this PR introduce a user-facing change?

None

Passing ttrpc.ErrClosed to errdefs.FromGRPC() will result in a "ttrpc:
closed: unknown" error, which we can't match in any possible way.

Knowing that, let's instead return errdefs.ErrNotFound, as already done
in updateContainerStatus(), so we can properly match the error when it
occurs.

Signed-off-by: Fabiano Fidêncio <[email protected]>
In the goroutine used to monitor whether the container was terminated or
not, we should not fail in case the VM was shut down, as this is expected
to happen and will cause a ttrpc.ErrClosed.

The runtime's wait() function, however, returns errdefs.ErrNotFound
when a ttrpc.ErrClosed happens in order to avoid returning "ttrpc:
closed: unknown" (see previous commit), and that's the reason we just
check for errdefs.ErrNotFound and do not error out in that case.

Signed-off-by: Fabiano Fidêncio <[email protected]>
If the VM is down when removing the container, ttrpc.ErrClosed would be
returned and we'd propagate this error up the chain. However, if the VM
is down, so is the container, and we can simply ignore the error
reported, as already done in a few other parts of our code.

Signed-off-by: Fabiano Fidêncio <[email protected]>
@openshift-ci-robot openshift-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels Oct 14, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2020
@codecov

codecov bot commented Oct 14, 2020

Codecov Report

Merging #4263 into master will decrease coverage by 0.00%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4263      +/-   ##
==========================================
- Coverage   38.59%   38.58%   -0.01%     
==========================================
  Files         111      111              
  Lines        8893     8895       +2     
==========================================
  Hits         3432     3432              
- Misses       5077     5079       +2     
  Partials      384      384              

@openshift-ci-robot

openshift-ci-robot commented Oct 14, 2020

@fidencio: The following test failed, say /retest to rerun all failed tests:

Test name         Commit    Details   Rerun command
ci/prow/e2e-aws   5f4774e   link      /test e2e-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@haircommander
Member

I believe it was @evanfoster who found the problems with the kubelet interacting with cri-o and kata

/area vm

this LGTM, we could also introduce our own named error to be clearer about the case, but it makes no functional difference

/approve

@openshift-ci-robot openshift-ci-robot added the area/vm Runtime VM related pull requests and issues label Oct 14, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fidencio, haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [fidencio,haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@TomSweeneyRedHat
Contributor

@fidencio are the pods going into a pending state? Could this possibly address some of the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1884035 - Pods are illegally transitioning back to pending

@fidencio
Contributor Author

> @fidencio are the pods going into a pending state? Could this possibly address some of the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1884035 - Pods are illegally transitioning back to pending

@TomSweeneyRedHat, no. As far as I could see, the pods are left forever in the "Terminating" state because ttrpc.ErrClosed is not handled properly. Note, though, that this only affects pods using "vm" (as in kata-containers) as their runtime class.

By the way, although I don't think it's related, quite nice linking skills, I must admit! I'll keep that BZ reference in mind in case we face something similar in the near future.

Thanks, @TomSweeneyRedHat!

@fidencio
Contributor Author

/retest

if !errors.Is(err, ttrpc.ErrClosed) {
	return -1, errdefs.FromGRPC(err)
}
return -1, errdefs.ErrNotFound
Member

Should we define a custom error here instead?

Contributor Author

I'm really on the fence about this.

Defining a custom error seems like the cleanest way to proceed, but we need to ensure it's handled everywhere inside our own code. My hesitation is that containerd also returns errdefs.ErrNotFound when the error is ttrpc.ErrClosed, which makes me think it's some kind of unwritten convention.

Would you be okay with a follow-up PR where I'd replace ErrNotFound with our own error, carefully considering the places where we can actually do that?

Member

Sounds good 👍

@mrunalp
Member

mrunalp commented Oct 19, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020
@openshift-merge-robot openshift-merge-robot merged commit 9de5e4f into cri-o:master Oct 19, 2020
@fidencio
Contributor Author

/cherry-pick release-1.19

@openshift-cherrypick-robot

@fidencio: new pull request created: #4283

@fidencio
Contributor Author

/cherry-pick release-1.18

@openshift-cherrypick-robot

@fidencio: #4263 failed to apply on top of branch "release-1.18":

Applying: runtime_vm: Fix updateContainerStatus() logic
Using index info to reconstruct a base tree...
M	internal/oci/runtime_vm.go
Falling back to patching base and 3-way merge...
Auto-merging internal/oci/runtime_vm.go
CONFLICT (content): Merge conflict in internal/oci/runtime_vm.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 runtime_vm: Fix updateContainerStatus() logic
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/vm Runtime VM related pull requests and issues dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note.

7 participants