
Conversation

@haircommander
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

Previously, load could cause cri-o to panic by closing a channel twice.

This PR aims to simplify the container stop code while retaining the required behavior.

Now, the first stop request begins a registration process: it starts the container stop, and later requests
come in with new timeouts to interrupt it. There are two communication channels, and only one location where they can be closed.
This also adds a watcher mechanism so callers can wait on the container stop.

This PR also includes some cleanups:

  • replace IsAlive() with Living()
  • move around where ShouldBeStopped() is called
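
On the first cleanup: a name like `IsAlive()` reads as a boolean predicate, but the reviewed hunk calls it as `if err := c.IsAlive(); err != nil`, so it returns an error. The sketch below shows why `Living()` reads more naturally at such a call site. The `proc` type and its fields are hypothetical and do not reflect how cri-o actually checks the process.

```go
package main

import (
	"errors"
	"fmt"
)

// proc is a hypothetical stand-in for a container's initial process.
type proc struct {
	pid    int
	exited bool
}

// Living returns nil while the process is running, and an error explaining
// why it is not considered running otherwise.
func (p *proc) Living() error {
	if p.pid <= 0 {
		return errors.New("no initial pid recorded")
	}
	if p.exited {
		return fmt.Errorf("process %d has exited", p.pid)
	}
	return nil
}

func main() {
	p := &proc{pid: 42, exited: true}
	if err := p.Living(); err != nil {
		fmt.Println("not living:", err)
	}
}
```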

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a very rare panic from a double closed channel in container stop
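
For readers unfamiliar with this failure mode, here is a minimal standalone reproduction of the Go runtime behavior behind it. It is unrelated to the cri-o code itself, and the `closeTwice` helper is hypothetical: closing a channel that is already closed panics.

```go
package main

import "fmt"

// closeTwice closes the same channel twice and returns the recovered
// panic message.
func closeTwice() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	ch := make(chan struct{})
	close(ch)
	close(ch) // the second close panics
	return "no panic"
}

func main() {
	fmt.Println(closeTwice()) // prints "close of closed channel"
}
```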

@openshift-ci openshift-ci bot added the release-note and do-not-merge/work-in-progress labels Jul 12, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jul 12, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the dco-signoff: yes, kind/bug, and approved labels Jul 12, 2023
@haircommander haircommander marked this pull request as ready for review July 12, 2023 19:44
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Jul 12, 2023
@openshift-ci openshift-ci bot requested review from QiWang19 and wgahnagl July 12, 2023 19:44
@haircommander haircommander marked this pull request as draft July 12, 2023 19:51
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Jul 12, 2023
@haircommander haircommander marked this pull request as ready for review July 12, 2023 20:07
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Jul 12, 2023
@openshift-ci openshift-ci bot requested a review from sohankunkerkar July 12, 2023 20:08
@codecov

codecov bot commented Jul 12, 2023

Codecov Report

Merging #7129 (1caede0) into main (791027b) will increase coverage by 0.10%.
The diff coverage is 65.00%.

❗ Current head 1caede0 differs from pull request most recent head da69490. Consider uploading reports for the commit da69490 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7129      +/-   ##
==========================================
+ Coverage   49.16%   49.26%   +0.10%     
==========================================
  Files         135      135              
  Lines       15484    15476       -8     
==========================================
+ Hits         7612     7624      +12     
+ Misses       6970     6950      -20     
  Partials      902      902              

@haircommander
Member Author

/retest

1 similar comment
@haircommander
Member Author

/retest

@haircommander
Member Author

@cri-o/cri-o-maintainers PTAL

@haircommander
Member Author

/retest

// will do said cleanup.
func (c *Container) SetAsStopping(timeout int64) (alreadyStopping bool) {
// First, need to check if the container is already stopping
// If it was first set as stopping, it returns true.
Collaborator

I would say something like "Returns true if the container was not set as stopping before, and false otherwise (i.e. on subsequent calls)."


// The initial container process either doesn't exist, or isn't ours.
if err := c.IsAlive(); err != nil {
c.state.Finished = time.Now()
Collaborator

Do we need to set c.state.Status here?

Member Author

the state machine is kind of weird. TBH I don't think we should set either ever, and rely on UpdateContainerStatus to set them, but this is here for now.

@kolyshkin
Collaborator

Left some nits. Note that I did not review the test changes. Overall LGTM.

I think the last commit needs to be squashed into the previous one.

@kolyshkin
Collaborator

Oh, and change IsAlive to Living should be the first patch in the series.

as IsAlive implies the function will return a bool, which it does not

Signed-off-by: Peter Hunt <[email protected]>
@openshift-ci openshift-ci bot removed the lgtm label Jul 20, 2023
@haircommander
Member Author

Do we want to write an integration test for this, which triggers stop multiple times and ensures that everything is sane afterwards?

we have one!
https://github.com/cri-o/cri-o/blob/main/test/ctr.bats#L969

It's really more of an internal error (though it's exported to be tested).
also, take the state lock in ShouldBeStopped() to allow it to be called without taking
the opLock beforehand

Signed-off-by: Peter Hunt <[email protected]>
Previously, load could cause cri-o to panic by closing a channel twice.

This PR aims to simplify the container stop code while retaining the required behavior.

Now, the first stop request begins a registration process: it starts the container stop, and later requests
come in with new timeouts to interrupt it. There are two communication channels, and only one location where they can be closed.
This also adds a watcher mechanism so callers can wait on the container stop.

Signed-off-by: Peter Hunt <[email protected]>
@haircommander
Member Author

comments addressed, PTAL @saschagrunert

@haircommander
Member Author

/retest

1 similar comment
@saschagrunert
Member

/retest

@openshift-ci openshift-ci bot added the lgtm label Jul 21, 2023
@openshift-ci
Contributor

openshift-ci bot commented Jul 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kolyshkin, saschagrunert

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [haircommander,kolyshkin,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@saschagrunert
Member

/test e2e-gcp-ovn

@saschagrunert
Member

/test ci-e2e-evented-pleg

@saschagrunert
Member

/retest

@haircommander
Member Author

/override ci/prow/e2e-gcp-ovn

@openshift-ci
Contributor

openshift-ci bot commented Jul 24, 2023

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/e2e-gcp-ovn

Details

In response to this:

/override ci/prow/e2e-gcp-ovn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@haircommander
Member Author

/retest

2 similar comments
@sohankunkerkar
Member

/retest

@saschagrunert
Member

/retest

@saschagrunert
Member

/test e2e-gcp-ovn

@haircommander
Member Author

/override ci/prow/e2e-gcp-ovn

@openshift-ci
Contributor

openshift-ci bot commented Jul 25, 2023

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/e2e-gcp-ovn

Details

In response to this:

/override ci/prow/e2e-gcp-ovn

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit a298001 into cri-o:main Jul 25, 2023
@haircommander
Member Author

/cherry-pick release-1.27

@openshift-cherrypick-robot

@haircommander: #7129 failed to apply on top of branch "release-1.27":

Applying: oci: change IsAlive to Living
Using index info to reconstruct a base tree...
M	internal/oci/container_test.go
M	internal/oci/runtime_oci.go
Falling back to patching base and 3-way merge...
Auto-merging internal/oci/runtime_oci.go
Auto-merging internal/oci/container_test.go
CONFLICT (content): Merge conflict in internal/oci/container_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 oci: change IsAlive to Living
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rborundia

Hi,
Is this patch set going to be backported to 1.25? We have seen a similar issue on it.
Should I open a new issue for 1.25?

time="2023-08-01 05:07:32.685552597-07:00" level=warning msg="Stopping container 26acb2fa4e0a83f54f89f657733fad1bc418952bb8f012c9016d2bc8f05fa595 with stop signal timed out: timeout reached after 10 seconds waiting for container process to exit" id=7a29fcdb-9d74-4e31-be2d-f62586715601 name=/runtime.v1.RuntimeService/StopPodSandbox
Aug 1 05:09:32 demo1-n1 crio[2158]: panic: close of closed channel
Aug 1 05:09:32 demo1-n1 crio[2158]: goroutine 534739 [running]:
Aug 1 05:09:32 demo1-n1 crio[2158]: panic({0x557343d7b4a0, 0x5573440924f0})
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/panic.go:987 +0x3ba fp=0xc00132fb90 sp=0xc00132fad0 pc=0x557341b92d3a
Aug 1 05:09:32 demo1-n1 crio[2158]: runtime.closechan(0xc001c8ac60)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/chan.go:365 +0x3fc fp=0xc00132fbd0 sp=0xc00132fb90 pc=0x557341b5da9c
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/cri-o/cri-o/internal/oci.WaitContainerStop({0x5573440b54c8?, 0xc003bb8d80}, 0xc003bbba20, 0x1bf08eb000, 0x0)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/internal/oci/runtime_oci.go:844 +0x511 fp=0xc00132fce0 sp=0xc00132fbd0 pc=0x5573433cbc71
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).StopContainer(0xc000f14840?, {0x5573440b54c8, 0xc003bb8d80}, 0xc003bbba20, 0xa)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/internal/oci/runtime_oci.go:898 +0x4e5 fp=0xc00132fdf0 sp=0xc00132fce0 pc=0x5573433cc345
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/cri-o/cri-o/internal/oci.(*Runtime).StopContainer(0x557341bc4abd?, {0x5573440b54c8, 0xc003bb8d80}, 0x0?, 0xc00132fe80?)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/internal/oci/oci.go:272 +0x65 fp=0xc00132fe38 sp=0xc00132fdf0 pc=0x5573433c21e5
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/cri-o/cri-o/server.(*Server).stopContainer(0xc000997500, {0x5573440b54c8, 0xc003bb8d80}, 0xc003bbba20, 0x5573422da41c?)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/server/container_stop.go:54 +0x256 fp=0xc00132ff08 sp=0xc00132fe38 pc=0x557343463556
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/cri-o/cri-o/server.(*Server).stopPodSandbox.func1()
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/server/sandbox_stop_linux.go:56 +0x3b fp=0xc00132ff78 sp=0xc00132ff08 pc=0x55734347f67b
Aug 1 05:09:32 demo1-n1 crio[2158]: golang.org/x/sync/errgroup.(*Group).Go.func1()
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x64 fp=0xc00132ffe0 sp=0xc00132ff78 pc=0x5573429378e4
Aug 1 05:09:32 demo1-n1 crio[2158]: runtime.goexit()
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc00132ffe8 sp=0xc00132ffe0 pc=0x557341bc8e41
Aug 1 05:09:32 demo1-n1 crio[2158]: created by golang.org/x/sync/errgroup.(*Group).Go
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0xa5
Aug 1 05:09:32 demo1-n1 crio[2158]: goroutine 1 [select, 1414 minutes]:
Aug 1 05:09:32 demo1-n1 crio[2158]: runtime.gopark(0xc000f45b60?, 0x3?, 0xd?, 0x0?, 0xc000f45752?)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/proc.go:363 +0xd6 fp=0xc0009b7550 sp=0xc0009b7530 pc=0x557341b95d96
Aug 1 05:09:32 demo1-n1 crio[2158]: runtime.selectgo(0xc0009b7b60, 0xc000f4574c, 0xc0007359c0?, 0x0, 0x0?, 0x1)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/select.go:328 +0x7bc fp=0xc0009b7690 sp=0xc0009b7550 pc=0x557341ba631c
Aug 1 05:09:32 demo1-n1 crio[2158]: main.main.func2(0xc00072be40)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/cmd/crio/main.go:395 +0x17d5 fp=0xc0009b7be0 sp=0xc0009b7690 pc=0x55734354d895
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/urfave/cli/v2.(*App).RunContext(0xc000a0e000, {0x5573440b5458?, 0xc00012a000}, {0xc0001181e0, 0x1, 0x1})
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/vendor/github.com/urfave/cli/v2/app.go:390 +0xf48 fp=0xc0009b7eb0 sp=0xc0009b7be0 pc=0x557342b515c8
Aug 1 05:09:32 demo1-n1 crio[2158]: github.com/urfave/cli/v2.(*App).Run(...)
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/vendor/github.com/urfave/cli/v2/app.go:251
Aug 1 05:09:32 demo1-n1 crio[2158]: main.main()
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/home/abuild/rpmbuild/BUILD/cri-o-1.25.3/cmd/crio/main.go:438 +0x69a fp=0xc0009b7f80 sp=0xc0009b7eb0 pc=0x55734354c03a
Aug 1 05:09:32 demo1-n1 crio[2158]: runtime.main()
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/proc.go:250 +0x213 fp=0xc0009b7fe0 sp=0xc0009b7f80 pc=0x557341b959d3
Aug 1 05:09:32 demo1-n1 crio[2158]: runtime.goexit()
Aug 1 05:09:32 demo1-n1 crio[2158]: #11/usr/lib64/go-1.19/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc0009b7fe8 sp=0xc0009b7fe0 pc=0x557341bc8e41
Aug 1 05:09:32 demo1-n1 crio[2158]: goroutine 2 [force gc (idle), 5 minutes]:

@haircommander
Member Author

we will be backporting it. I am not sure 1.25 will get it at this point but at least 1.26 I would guess


Labels

approved, dco-signoff: yes, kind/bug, lgtm, release-note


7 participants