Fix vm containers could not restore after CRI-O restart #5574
|
Hi @gozssky. Thanks for your PR. I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cc @haircommander |
5458124 to
61f9d86
Compare
| c.opLock.Lock() | ||
| defer c.opLock.Unlock() | ||
|
|
||
| // Lets ensure we're able to properly get construct the Options |
This reads to me more like an operation to do when creating the runtime. WDYT about moving it there?
It seems ok, but this PR doesn't change these lines.
ah nevermind then
This PR shuffles this code around, but this code was actually introduced to ensure we can configure the runtime to receive a specific configuration file.
| return errdefs.ErrNotFound | ||
| } | ||
|
|
||
| if err = r.restoreContainerIO(ctx, c, response); err != nil { |
In the case where we haven't restarted, does this get called? Should it?
Yes. If the container io already exists in the ctrs map, restoreContainerIO won't do any further operations.
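For readers following this thread, the "already in the ctrs map, so do nothing" behavior can be sketched roughly as below. This is only an illustration, not the PR's code: the `vmRuntime` and `containerIO` types, the string-keyed map, and the boolean return value are all assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// containerIO is a hypothetical stand-in for the per-container IO state.
type containerIO struct{ id string }

// vmRuntime is a hypothetical stand-in for the VM runtime; only the
// ctrs map mentioned in the review thread is modeled here.
type vmRuntime struct {
	mu   sync.Mutex
	ctrs map[string]*containerIO
}

// restoreContainerIO re-creates the IO entry for a container only when it
// is missing. It reports whether a restore actually happened.
func (r *vmRuntime) restoreContainerIO(id string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	// No restart happened (or IO was already restored): nothing to do.
	if _, ok := r.ctrs[id]; ok {
		return false
	}

	// After a restart the map is empty, so the entry is rebuilt.
	r.ctrs[id] = &containerIO{id: id}
	return true
}

func main() {
	r := &vmRuntime{ctrs: map[string]*containerIO{}}
	fmt.Println(r.restoreContainerIO("ctr1")) // prints "true": entry was missing
	fmt.Println(r.restoreContainerIO("ctr1")) // prints "false": entry already exists
}
```

Because the second call is a no-op, calling the restore step unconditionally is safe whether or not CRI-O has restarted.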
Codecov Report
@@ Coverage Diff @@
## main #5574 +/- ##
==========================================
- Coverage 43.22% 43.05% -0.18%
==========================================
Files 123 123
Lines 12214 12266 +52
==========================================
+ Hits 5280 5281 +1
- Misses 6426 6477 +51
Partials 508 508 |
f162064 to
9c21a51
Compare
|
/retitle Fix vm containers could not restore after CRI-O restart (ci doesn't handle apostrophes well...) |
|
/retest |
@gozssky, ah, sorry, just add the |
Signed-off-by: Yujie Xia <[email protected]>
Signed-off-by: Yujie Xia <[email protected]>
2e41e26 to
acde725
Compare
@fidencio Thank you for the detailed explanation! I adjusted my two patches. The first patch 0731a9b didn't change the |
@gozssky, that's the way to go! |
|
/retest |
|
@gozssky, integration tests are breaking, please take a look at those: |
fgiudici
left a comment
Nice patch, thanks @gozssky !
Cannot understand why tests fail: let me try to retest.
| } | ||
| return nil | ||
| } | ||
| _, err := r.createContainerIO(ctx, c, cio.WithFIFOs(ctrio.NewFIFOSet(cioCfg, closer))) |
I don't much like that we use the containerd/io package directly here (ctrio) while we have the cri-o/utils/io package that wraps the calls of the containerd/io one.
@fidencio, I have taken a look and it seems to me much easier to wrap the check and management of the FIFOs in the cri-o/utils/io package with something like a new "WithExistingFIFO" function rather than making explicit use of the containerd/cio package all around.
It makes sense to address this in a following PR.
@fgiudici, I partially agree.
For a long time we decided not to vendor containerd code directly, and we ended up copying and pasting code from containerd into our codebase. Like any copied and pasted code, it is not receiving bug fixes and is left there to rot.
I'd be happy to vendor the code directly and get rid of everything we copied into our repos. That would be cleaner and would ensure we actually get bug fixes for it.
WDYT?
Ah, thanks @fidencio, now I see... we have copied the github.com/containerd/cri/pkg/server/io pkg in the cri-o/utils/io!
Vendoring that package would be way better (I haven't looked at the dependency chain, btw), but it may be cleaner to keep all the fifo management code in only one package (maybe there is already something in the containerd/cri/pkg/server/io pkg).
This probably deserves a dedicated investigation and PR :-)
may be cleaner to keep all the code of fifo management in only one package
@fgiudici I couldn't agree more. I'd like to do this in the next PR.
|
/retest |
|
@gozssky: The following test failed, say
Full PR test history. Your PR dashboard. |
|
/approve I tag @fidencio for the final lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gozssky, haircommander The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
I will run the last round of tests here and most likely have it merged sooner rather than later. :-) |
|
@gozssky, yet again, thanks a whole lot for the contribution! |
|
/retest-required Please review the full test history for this PR and help us cut down flakes. |
|
/cherry-pick release-1.23 |
|
@fidencio: new pull request created: #5633 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When cri-o is running with the kata runtime, it cannot restore containers correctly after it restarts. The `containerd-shim-kata-v2` and `qemu-system-x86_64` processes are leaked every time.
This PR attempts to restore containers properly with the existing `containerd-shim-kata-v2` and `qemu-system-x86_64` processes. `containerd-shim-kata-v2` writes the gRPC socket address under the container bundle path in a file named `address`. After cri-o restarts, we can read the shim socket address from that file and connect to it. Besides, we also need to restore the container io so that we can exec into or attach to the container.
I tested these changes on our internal cluster, and everything worked well.
Which issue(s) this PR fixes:
Fixes #2112 #5569
Special notes for your reviewer:
Does this PR introduce a user-facing change?