runtime_vm: Fix non terminating pods #4263

fidencio · 2020-10-14T08:15:46Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Robert Krawitz reported that some pods were not terminating when using the VM runtime type. I tried to reproduce the issue locally and failed. Robert then gave me access to one of his machines and his scripts, and it turns out the situation doesn't happen all the time. When it does happen, it's usually with an infra pod hanging and with pods containing init pods. It seems to happen because the VM is shut down, which causes a ttrpc.ErrClosed error that we don't handle properly. This PR tries to fix that issue.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

@haircommander, I remember in the past we had issues with removing pods on runtime VM side that caused kubelet to get confused. Would you have any tip on how I could ensure this is not happening in this case?

Also, I'm not exactly comfortable with "runtime_vm: Don't let wait() return ttrpc.ErrClosed" and "runtime_vm: StopContainers() should not fail when the VM is shutdown". Maybe we could actually return ttrpc.ErrClosed (not passing it to errdefs.FromGRPC()) and check for it in the second patch. That sounds cleaner to me, but I'd like to hear feedback from the reviewers.

Does this PR introduce a user-facing change?

None

Signed-off-by: Fabiano Fidêncio <[email protected]>

Passing ttrpc.ErrClosed to errdefs.FromGRPC() will result in a "ttrpc: closed: unknown" error, which we can't match in any possible way. Knowing that, let's instead return errdefs.ErrNotFound, as already done in updateContainerStatus(), so we can properly match the error when it occurs.

Signed-off-by: Fabiano Fidêncio <[email protected]>

In the goroutine used to monitor whether the container was terminated or not, we should not fail in case the VM was shut down, as this is expected to happen and will cause a ttrpc.ErrClosed. The runtime's wait() function, however, will return errdefs.ErrNotFound when a ttrpc.ErrClosed happens, in order to avoid returning "ttrpc: closed: unknown" (see previous commit), and that's the reason we just check for errdefs.ErrNotFound and do not error out in that case.

Signed-off-by: Fabiano Fidêncio <[email protected]>

If the VM is down when removing the container, ttrpc.ErrClosed would be returned and we'd return this error up in the chain. However, if the VM is down, so is the container, and we could simply ignore the error reported, as already done in a few other parts of our code.

Signed-off-by: Fabiano Fidêncio <[email protected]>

codecov · 2020-10-14T08:57:57Z

Codecov Report

Merging #4263 into master will decrease coverage by 0.00%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4263      +/-   ##
==========================================
- Coverage   38.59%   38.58%   -0.01%     
==========================================
  Files         111      111              
  Lines        8893     8895       +2     
==========================================
  Hits         3432     3432              
- Misses       5077     5079       +2     
  Partials      384      384

openshift-ci-robot · 2020-10-14T11:19:03Z

@fidencio: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws | `5f4774e` | link | `/test e2e-aws` |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

haircommander · 2020-10-14T13:54:15Z

I believe it was @evanfoster who found the problems with the kubelet interacting with cri-o and kata

/area vm

this LGTM, we could also introduce our own named error to be clearer about the case, but it makes no functional difference

/approve

openshift-ci-robot · 2020-10-14T13:54:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fidencio, haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [fidencio,haircommander]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

TomSweeneyRedHat · 2020-10-14T14:58:58Z

@fidencio are the pods going into a pending state? Could this possibly address some of the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1884035 - Pods are illegally transitioning back to pending

fidencio · 2020-10-14T15:05:44Z

> @fidencio are the pods going into a pending state? Could this possibly address some of the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1884035 - Pods are illegally transitioning back to pending

@TomSweeneyRedHat, no, as far as I could see the pods are left forever in the "Terminating" state because the ttrpc.ErrClosed is not being properly handled. Mind, though, that this only affects pods using "vm" (as in kata-containers) as their runtime class.

By the way, although I don't think it's related, quite nice linking skills, I must admit! I'll keep that BZ reference in mind in case we face something similar in the near future.

Thanks, @TomSweeneyRedHat!

fidencio · 2020-10-14T15:11:07Z

/retest

mrunalp · 2020-10-19T17:57:46Z

internal/oci/runtime_vm.go

+		if !errors.Is(err, ttrpc.ErrClosed) {
+			return -1, errdefs.FromGRPC(err)
+		}
+		return -1, errdefs.ErrNotFound


mrunalp (Oct 19, 2020):

Should we define a custom error here instead?

fidencio (Oct 19, 2020):

I'm really on the fence about this.

Defining a custom error seems the cleanest way to proceed, but we need to ensure it'll be handled inside our own code. The reason is that containerd also returns errdefs.ErrNotFound when the error is ttrpc.ErrClosed, which makes me think it's some kind of unwritten convention.

Would you be okay with a follow-up PR where I'd replace ErrNotFound with our own defined error, carefully taking into consideration the places where we can actually do that?

mrunalp (Oct 19, 2020):

Sounds good 👍

mrunalp · 2020-10-19T18:05:07Z

/lgtm

fidencio · 2020-10-20T06:15:08Z

/cherry-pick release-1.19

openshift-cherrypick-robot · 2020-10-20T06:15:22Z

@fidencio: new pull request created: #4283

Details

In response to this:

/cherry-pick release-1.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fidencio · 2020-10-20T06:15:23Z

/cherry-pick release-1.18

openshift-cherrypick-robot · 2020-10-20T06:15:27Z

@fidencio: #4263 failed to apply on top of branch "release-1.18":

Applying: runtime_vm: Fix updateContainerStatus() logic
Using index info to reconstruct a base tree...
M	internal/oci/runtime_vm.go
Falling back to patching base and 3-way merge...
Auto-merging internal/oci/runtime_vm.go
CONFLICT (content): Merge conflict in internal/oci/runtime_vm.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 runtime_vm: Fix updateContainerStatus() logic
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-1.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fidencio added 4 commits October 14, 2020 02:52

runtime_vm: Fix updateContainerStatus() logic

0f2a070

Signed-off-by: Fabiano Fidêncio <[email protected]>

openshift-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels Oct 14, 2020

openshift-ci-robot requested review from haircommander and mrunalp October 14, 2020 08:15

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 14, 2020

openshift-ci-robot added the area/vm Runtime VM related pull requests and issues label Oct 14, 2020

fidencio mentioned this pull request Oct 16, 2020

runtime: proper host cgroups by default kata-containers/kata-containers#972

Closed

mrunalp reviewed Oct 19, 2020

View reviewed changes

openshift-ci-robot assigned mrunalp Oct 19, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 19, 2020

openshift-merge-robot merged commit 9de5e4f into cri-o:master Oct 19, 2020

openshift-cherrypick-robot mentioned this pull request Oct 20, 2020

[release-1.19] runtime_vm: Fix non terminating pods #4283

Merged

7 participants
