@sleepymole
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

When CRI-O runs with the Kata runtime, it cannot restore containers correctly after a restart. The containerd-shim-kata-v2 and qemu-system-x86_64 processes are leaked every time.

This PR attempts to restore containers properly with the existing containerd-shim-kata-v2 and qemu-system-x86_64 processes. containerd-shim-kata-v2 writes the gRPC socket address to a file named address under the container bundle path. After CRI-O restarts, we can read the shim socket address from that file and connect to it. We also need to restore the container IO so that we can exec into or attach to the container.

I tested these changes on our internal cluster, and everything worked well.

Which issue(s) this PR fixes:

Fixes #2112 #5569

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix vm containers couldn't restore after cri-o restart

@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 26, 2022
@openshift-ci openshift-ci bot requested review from giuseppe and nalind January 26, 2022 13:25
@openshift-ci
Contributor

openshift-ci bot commented Jan 26, 2022

Hi @gozssky. Thanks for your PR.

I'm waiting for a cri-o member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 26, 2022
@sleepymole
Contributor Author

/cc @haircommander

@openshift-ci openshift-ci bot requested a review from haircommander January 26, 2022 13:26
@sleepymole sleepymole force-pushed the issue-2112 branch 2 times, most recently from 5458124 to 61f9d86 Compare January 26, 2022 13:34
@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels Jan 26, 2022
c.opLock.Lock()
defer c.opLock.Unlock()

// Lets ensure we're able to properly get construct the Options

Member

This reads to me more like an operation to do when creating the runtime. WDYT about moving it there?

Contributor Author

It seems OK, but this PR doesn't change these lines.

Member

ah nevermind then

Contributor

This PR shuffles this code around, but this code was actually introduced to ensure we can configure the runtime to receive a specific configuration file.

return errdefs.ErrNotFound
}

if err = r.restoreContainerIO(ctx, c, response); err != nil {

Member

in the case where we haven't restarted, does this get called? should it?

Contributor Author

Yes. If the container IO already exists in the ctrs map, restoreContainerIO does nothing further.
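The early-return behaviour described in this reply can be illustrated with a small sketch. It is not the code under review: runtimeVM, containerIO, and the boolean return value are hypothetical stand-ins, and only the idea of checking the ctrs map first is taken from the comment above.

```go
package main

import (
	"fmt"
	"sync"
)

// containerIO stands in for the per-container IO state (hypothetical type).
type containerIO struct{ id string }

// runtimeVM keeps IO state per container ID, guarded by a mutex.
type runtimeVM struct {
	mu   sync.Mutex
	ctrs map[string]*containerIO
}

// restoreContainerIO rebuilds IO state only when none is tracked yet.
// It returns true when it actually had to restore something.
func (r *runtimeVM) restoreContainerIO(id string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.ctrs[id]; ok {
		// Already tracked, for example because CRI-O never restarted.
		return false
	}
	r.ctrs[id] = &containerIO{id: id}
	return true
}

func main() {
	r := &runtimeVM{ctrs: map[string]*containerIO{}}
	fmt.Println(r.restoreContainerIO("ctr1")) // true: state was rebuilt
	fmt.Println(r.restoreContainerIO("ctr1")) // false: nothing left to do
}
```

Because the check and the insert happen under the same lock, calling the function on every status update stays cheap and safe in this sketch.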

@haircommander
Member

/ok-to-test
/area vm

Awesome @gozssky, many thanks for taking this on! I have some nits and questions, but in general I like the approach.

@fidencio @fgiudici PTAL

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. area/vm Runtime VM related pull requests and issues and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 26, 2022
@codecov

codecov bot commented Jan 26, 2022

Codecov Report

Merging #5574 (acde725) into main (025bebf) will decrease coverage by 0.17%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main    #5574      +/-   ##
==========================================
- Coverage   43.22%   43.05%   -0.18%     
==========================================
  Files         123      123              
  Lines       12214    12266      +52     
==========================================
+ Hits         5280     5281       +1     
- Misses       6426     6477      +51     
  Partials      508      508              

@haircommander
Member

/retitle Fix vm containers could not restore after CRI-O restart

(ci doesn't handle apostrophes well...)

@openshift-ci openshift-ci bot changed the title Fix vm containers couldn't restore after CRI-O restart Fix vm containers could not restore after CRI-O restart Jan 26, 2022
@haircommander
Member

/retest

@fidencio
Contributor

@gozssky, first and foremost, this PR works like a charm, thanks for your contribution!
Now, a few tips just to improve the life of the reviewer for next interactions:

  1. In 61f9d86 you shuffle CreateContainer around. I'd leave that out of that commit, as it draws our eyes to a change there, while that's not the important part of your PR.
  2. Move that code around as part of the second patch, please, as you're already moving code around there.
  3. You're pulling in ctrio as a new dependency, which I don't see as a problem. Historically, a long time ago, a bunch of containerd code was brought into CRI-O to avoid vendoring, which has proven to not be a good idea, so using it as a new dep as you did is the correct way. Still on this, could you also check whether we still need something from cio / cioutils? maybe everything we need already comes from ctrio and we could have a follow-up patch ensuring we use everything from there.

I'd appreciate it if items 1 and 2 could be addressed before getting this merged. Item 3 can come later.
Yet again, nice work @gozssky!

@fidencio Thank you very much for your review! I'm very sorry that I didn't check the code diff carefully before submitting this PR. But I'm not yet clear on what I need to do for item 2. Could you explain a little more?

@gozssky, ah, sorry, just add the CreateContainer code move as part of the patch that's already moving things around.
That's it and it should be good to go! :-)

@sleepymole
Contributor Author

@fidencio I submitted the third patch 2e41e26. I'm not sure if I understand your suggestion well. 😅

@fidencio
Contributor

@gozssky, my suggestion:

  • On the commit 61f9d86, do not change the placement of the CreateContainer function.
  • Instead, do that as part of 9c21a51
  • Drop 2e41e26, and let's have that as a different PR.

Sounds reasonable?

@sleepymole
Contributor Author

@fidencio Thank you for the detailed explanation! I adjusted my two patches. The first patch, 0731a9b, doesn't change the CreateContainer function. The second patch, acde725, reuses the createContainerIO logic in CreateContainer and moves functions after the function that calls them. Since the commit hashes changed, I force-pushed my two commits. I'm not sure if this is appropriate.

@fidencio
Contributor

@gozssky, that's the way to go!
I will re-review / re-test it in the afternoon, thanks!

@fidencio
Contributor

/retest

@fidencio
Contributor

@gozssky, integration tests are breaking, please take a look at those:

not ok 74 ctr update resources
# (from function `copyimg' in file ./helpers.bash, line 234,
#  from function `setup_img' in file ./helpers.bash, line 245,
#  from function `setup_crio' in file ./helpers.bash, line 255,
#  from function `start_crio' in file ./helpers.bash, line 337,
#  in test file ./ctr.bats, line 729)
#   `start_crio' failed
# time="2022-01-29T09:50:35Z" level=error msg="Error importing dir:/home/runner/work/cri-o/cri-o/.artifacts/redis-image: writing blob: adding layer with blob \"sha256:276d6b52cd5b0a1ec792aff4b4feb6969e60985969775a009b277104b9532396\": Error processing tar file(exit status 127): Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!\n"
# time="2022-01-29T09:50:37Z" level=fatal msg="connect: connect endpoint 'unix:///tmp/tmp.r3tXFgBgkr/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"
# time="2022-01-29T09:50:39Z" level=fatal msg="connect: connect endpoint 'unix:///tmp/tmp.r3tXFgBgkr/crio.sock', make sure you are running as root and the endpoint has been started: context deadline exceeded"

Contributor

@fgiudici fgiudici left a comment

Nice patch, thanks @gozssky !
I can't see why the tests fail; let me try a retest.

}
return nil
}
_, err := r.createContainerIO(ctx, c, cio.WithFIFOs(ctrio.NewFIFOSet(cioCfg, closer)))

Contributor

I don't much like that we use the containerd/io package directly here (ctrio) while we have the cri-o/utils/io package that wraps calls to the containerd/io one.
@fidencio, I have taken a look, and it seems much easier to me to wrap the check and management of the FIFOs in the cri-o/utils/io package, with something like a new "WithExistingFIFO" function, rather than making explicit use of the containerd/cio package all around.
It makes sense to address this in a follow-up PR.

Contributor

@fgiudici, I partially agree.

For a long time we chose not to vendor containerd code directly, and we ended up copying and pasting code from containerd into our codebase. Like any copied code, it does not receive bug fixes and is left to rot.
I'd be happy to vendor the code directly and get rid of everything we copied into our repos. That would be cleaner and would ensure we actually get bug fixes for it.

WDYT?

Contributor

Ah, thanks @fidencio, now I see... we have copied the github.com/containerd/cri/pkg/server/io pkg in the cri-o/utils/io!
Vendoring that package would be way better (I have not looked at the dependency chain, by the way), but it may be cleaner to keep all the code of fifo management in only one package (maybe there is already something in the containerd/cri/pkg/server/io pkg).
This probably deserves a dedicated investigation and PR :-)

Contributor Author

may be cleaner to keep all the code of fifo management in only one package

@fgiudici I couldn't agree more. I'd like to do this in the next PR.

@fgiudici
Contributor

fgiudici commented Feb 7, 2022

/retest

@openshift-ci
Contributor

openshift-ci bot commented Feb 9, 2022

@gozssky: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/openshift-jenkins/e2e_crun_cgroupv2 | acde725 | link | false | /test e2e_cgroupv2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@haircommander
Member

/approve

I tag @fidencio for the final lgtm

@openshift-ci
Contributor

openshift-ci bot commented Feb 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gozssky, haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2022
@fidencio
Contributor

I will run the last round of tests here and most likely have it merged sooner rather than later. :-)

@fidencio
Contributor

@gozssky, yet again, thanks a whole lot for the contribution!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 10, 2022
@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit d5f68d8 into cri-o:main Feb 11, 2022
@sleepymole sleepymole deleted the issue-2112 branch February 11, 2022 01:49
@fidencio
Contributor

/cherry-pick release-1.23

@openshift-cherrypick-robot

@fidencio: new pull request created: #5633

Details

In response to this:

/cherry-pick release-1.23

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/vm Runtime VM related pull requests and issues dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes.

Development

Successfully merging this pull request may close these issues.

Add support for crio restart for RuntimeVM (v2) implementation

7 participants