runtime_vm: Fix non terminating pods #4263
Signed-off-by: Fabiano Fidêncio <[email protected]>
Passing ttrpc.ErrClosed to errdefs.FromGRPC() results in a "ttrpc: closed: unknown" error, which we cannot match against anything. Knowing that, let's instead return errdefs.ErrNotFound, as already done in updateContainerStatus(), so we can properly match the error when it occurs. Signed-off-by: Fabiano Fidêncio <[email protected]>
In the goroutine used to monitor whether the container has terminated, we should not fail if the VM was shut down, as this is expected to happen and will cause a ttrpc.ErrClosed. The runtime's wait() function, however, returns errdefs.ErrNotFound when a ttrpc.ErrClosed happens, in order to avoid returning "ttrpc: closed: unknown" (see previous commit). That is why we only check for errdefs.ErrNotFound and do not error out in that case. Signed-off-by: Fabiano Fidêncio <[email protected]>
If the VM is down when removing the container, ttrpc.ErrClosed would be returned and we'd propagate this error up the chain. However, if the VM is down, so is the container, and we can simply ignore the reported error, as already done in a few other parts of our code. Signed-off-by: Fabiano Fidêncio <[email protected]>
Codecov Report
@@            Coverage Diff             @@
##           master    #4263      +/-   ##
==========================================
- Coverage   38.59%   38.58%   -0.01%
==========================================
  Files         111      111
  Lines        8893     8895       +2
==========================================
  Hits         3432     3432
- Misses       5077     5079       +2
  Partials      384      384
|
@fidencio: The following test failed, say
|
I believe it was @evanfoster who found the problems with the kubelet interacting with cri-o and kata
/area vm
this LGTM, we could also introduce our own named error to be clearer about the case, but it makes no functional difference
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fidencio, haircommander |
|
@fidencio are the pods going into a pending state? Could this possibly address some of the issues in https://bugzilla.redhat.com/show_bug.cgi?id=1884035 - Pods are illegally transitioning back to pending |
@TomSweeneyRedHat, no, as far as I could see the pods are left forever in the "Terminating" state because ttrpc.ErrClosed is not being handled properly. Mind, though, that this only affects pods using "vm" (as in kata-containers) as their runtime class. By the way, although I don't think it's related, quite nice linking skills, I must admit! I'll keep that BZ reference in mind in case we face something similar in the near future. Thanks, @TomSweeneyRedHat! |
|
/retest |
    if !errors.Is(err, ttrpc.ErrClosed) {
        return -1, errdefs.FromGRPC(err)
    }
    return -1, errdefs.ErrNotFound
Should we define a custom error here instead?
I'm really on the fence about this.
Defining a custom error seems the cleanest possible way to proceed, but we need to ensure it'll be handled inside our own code. The reason is that containerd also returns errdefs.ErrNotFound when the error is ttrpc.ErrClosed, which makes me think that it's some kind of unwritten convention.
Would you be okay with a follow-up PR where I'd replace ErrNotFound with our own defined error, carefully taking into consideration the places where we can actually do that?
Sounds good 👍
|
/lgtm |
|
/cherry-pick release-1.19 |
|
@fidencio: new pull request created: #4283 |
|
/cherry-pick release-1.18 |
|
@fidencio: #4263 failed to apply on top of branch "release-1.18" |
What type of PR is this?
/kind bug
What this PR does / why we need it:
It's been reported by Robert Krawitz that some pods were not terminating when using the VM runtime type. I tried to reproduce the issue locally and failed. Robert then gave me access to one of his machines and his scripts, and it turns out the situation doesn't happen all the time. When it does happen, it is usually with an infra pod hanging and with pods containing init containers. It seems to happen because the VM is shut down, which causes a ttrpc.ErrClosed to be reported, and we don't handle it properly. This PR tries to fix that issue.
Which issue(s) this PR fixes:
None
Special notes for your reviewer:
@haircommander, I remember that in the past we had issues with removing pods on the runtime VM side that caused the kubelet to get confused. Would you have any tips on how I could ensure this is not happening in this case?
Also, I'm not exactly comfortable with "runtime_vm: Don't let wait() return ttrpc.ErrClosed" and "runtime_vm: StopContainers() should not fail when the VM is shutdown". Maybe we could actually return ttrpc.ErrClosed (not passing it to errdefs.FromGRPC()) and check for it in the second patch. That sounds cleaner to me, but I'd like to hear feedback from the reviewers.
Does this PR introduce a user-facing change?