@kolyshkin
Collaborator

@kolyshkin kolyshkin commented Sep 18, 2020

Just trying to debug a failure that happened here:

The issue:

Similar bug:

* containerd/containerd#3667

### Hypothesis of the hour

The mount gets propagated to other namespaces, and it takes time for the unmount event to be propagated. Might have to do with an older/buggy RHEL7 kernel. Might need to be mitigated by retrying with a wee sleep.
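The "retry with a wee sleep" mitigation could look roughly like the sketch below. `remove_with_retry` is a hypothetical helper name, and the attempt count and delay are arbitrary assumptions. It only illustrates the retry loop and is not taken from cri-o.

```shell
# Hypothetical helper: retry removing a path that may be temporarily busy.
# Arguments: path, max attempts (default 5), delay in seconds (default 1).
remove_with_retry() {
    target=$1
    attempts=${2:-5}
    delay=${3:-1}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # rm -f succeeds if the file was removed or was already gone.
        if rm -f -- "$target" 2>/dev/null; then
            return 0
        fi
        # Removal failed (for example with EBUSY); wait and try again.
        sleep "$delay"
        i=$((i + 1))
    done
    echo "giving up on $target after $attempts attempts" >&2
    return 1
}
```

A retry like this would only paper over a slow unmount; it cannot help if the unlink keeps failing for a configuration reason.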

### Cause

Missing fs.may_detach_mounts = 1 sysctl setting.

Signed-off-by: Kir Kolyshkin <[email protected]>
@openshift-ci-robot

@kolyshkin: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress, dco-signoff: yes, and do-not-merge/release-note-label-needed labels Sep 18, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kolyshkin
To complete the pull request process, please assign saschagrunert
You can assign the PR to them by writing /assign @saschagrunert in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov bot commented Sep 18, 2020

Codecov Report

Merging #4210 into master will decrease coverage by 0.01%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4210      +/-   ##
==========================================
- Coverage   38.75%   38.74%   -0.02%     
==========================================
  Files         111      111              
  Lines        8698     8701       +3     
==========================================
  Hits         3371     3371              
- Misses       4963     4966       +3     
  Partials      364      364              

@kolyshkin
Collaborator Author

kolyshkin commented Sep 18, 2020

OK it's not just me. This one is from an older CI run for #4147, also integration_rhel.

time="2020-09-17 18:17:58.827593322Z" level=debug msg="Response error: Removing namespaces encountered the following errors [unlinkat /var/run/ipcns/b1e4ff7f-8f68-461d-ac5c-485a4596aec0: device or resource busy unlinkat /var/run/netns/b1e4ff7f-8f68-461d-ac5c-485a4596aec0: device or resource busy]\ngithub.com/cri-o/cri-o/internal/lib/sandbox.(*Sandbox).RemoveManagedNamespaces\n\t/go/src/github.com/cri-o/cri-o/internal/lib/sandbox/namespaces.go:202\ngithub.com/cri-o/cri-o/server.(*Server).RemovePodSandbox\n\t/go/src/github.com/cri-o/cri-o/server/sandbox_remove.go:96\nk8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_RemovePodSandbox_Handler.func1\n\t/go/src/github.com/cri-o/cri-o/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:7767\ngithub.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1\n\t/go/src/github.com/cri-o/cri-o/internal/log/interceptors.go:56\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/src/github.com/cri-o/cri-o/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25\ngithub.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1\n\t/go/src/github.com/cri-o/cri-o/server/metrics/interceptors.go:24\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/src/github.com/cri-o/cri-o/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/src/github.com/cri-o/cri-o/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34\nk8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_RemovePodSandbox_Handler\n\t/go/src/github.com/cri-o/cri-o/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:7769\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/src/github.com/cri-o/cri-o/vendor/google.golang.org/grpc/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/src/github.com/cri-o/cri-o/vendor/google.golang.org/grpc/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/src/github.com/cri-o/cri-o/vendor/google.golang.org/grpc/server.go:722\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nunable to remove managed namespaces\ngithub.com/cri-o/cri-o/server.(*Server).RemovePodSandbox\n\t/go/src/github.com/cri-o/cri-o/server/sandbox_remove.go:97\nk8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_RemovePodSandbox_Handler.func1\n\t/go/src/github.com/cri-o/cri-o/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:7767\ngithub.com/cri-o/cri-o/internal/log.UnaryInterceptor.func1\n\t/go/src/github.com/cri-o/cri-o/internal/log/interceptors.go:56\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/src/github.com/cri-o/cri-o/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25\ngithub.com/cri-o/cri-o/server/metrics.UnaryInterceptor.func1\n\t/go/src/github.com/cri-o/cri-o/server/metrics/interceptors.go:24\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n\t/go/src/github.com/cri-o/cri-o/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n\t/go/src/github.com/cri-o/cri-o/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34\nk8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_RemovePodSandbox_Handler\n\t/go/src/github.com/cri-o/cri-o/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:7769\ngoogle.golang.org/grpc.(*Server).processUnaryRPC\n\t/go/src/github.com/cri-o/cri-o/vendor/google.golang.org/grpc/server.go:1024\ngoogle.golang.org/grpc.(*Server).handleStream\n\t/go/src/github.com/cri-o/cri-o/vendor/google.golang.org/grpc/server.go:1313\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.1\n\t/go/src/github.com/cri-o/cri-o/vendor/google.golang.org/grpc/server.go:722\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374" file="go-grpc-middleware/chain.go:25" id=617ecd9d-a6d4-4ba6-8bcf-296f1cc20cd5 name=/runtime.v1alpha2.RuntimeService/RemovePodSandbox
removing the pod sandbox "69c2eaceb1f31ee3019121c45a1f53363b17e9bb16b4aa3309dfacdbc9d57194": rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [unlinkat /var/run/ipcns/b1e4ff7f-8f68-461d-ac5c-485a4596aec0: device or resource busy unlinkat /var/run/netns/b1e4ff7f-8f68-461d-ac5c-485a4596aec0: device or resource busy]

One more from #4184:

removing the pod sandbox "0d5bf5eeb0048bb70ab8ee9bca0a497e216a6c6cfa42507a8d735f61c825784d": rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [unlinkat /var/run/netns/0376543d-4917-417f-b406-7e8ab07cb847: device or resource busy]

Can probably find more looking into https://prow.ci.openshift.org/?repo=cri-o%2Fcri-o&state=failure
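When going through downloaded CI logs for more instances, something like the following could pull out the paths that failed. This is a minimal sketch: `busy_namespace_paths` is a made-up name, and the pattern assumes the exact `unlinkat <path>: device or resource busy` wording seen in the errors above.

```shell
# Hypothetical helper: print each path that failed to unlink with
# "device or resource busy", one per line, de-duplicated.
# Reads log text from the files given as arguments, or from stdin if none.
busy_namespace_paths() {
    cat -- "$@" |
        grep -oE 'unlinkat [^: ]+: device or resource busy' |
        sed -E 's/^unlinkat ([^: ]+): .*$/\1/' |
        sort -u
}
```

Run against a saved log file, it lists the affected `/var/run/netns/...` and `/var/run/ipcns/...` entries, which makes it easier to compare failures across runs.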

@haircommander
Member

here's a dump of some others #3996

@haircommander
Member

/retest

@haircommander
Member

/test integration_rhel

@openshift-ci-robot

openshift-ci-robot commented Sep 18, 2020

@kolyshkin: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/integration_fedora | 84f7703 | link | /test integration_fedora |
| ci/openshift-jenkins/integration_crun | 84f7703 | link | /test integration_crun |
| ci/openshift-jenkins/integration_crun_cgroupv2 | 84f7703 | link | /test integration_cgroupv2 |
| ci/openshift-jenkins/e2e_crun_cgroupv2 | 84f7703 | link | /test e2e_cgroupv2 |
| ci/prow/e2e-aws | 84f7703 | link | /test e2e-aws |

Full PR test history. Your PR dashboard.

@haircommander
Member

/test integration_rhel

@kolyshkin
Collaborator Author

And now, when we need it to fail, it's passing 😃

@kolyshkin
Collaborator Author

/test integration_rhel

@kolyshkin
Collaborator Author

New theory: this might be because fs.may_detach_mounts needs to be set on RHEL7, and it's not.

Kernel 3.18 added the underlying behavior, where mounts on an unlinked file or directory are lazily detached: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
This was backported to RHEL 7.4 (https://bugzilla.redhat.com/show_bug.cgi?id=1247935), but the feature is off by default. Some software enables it at runtime, or ships a sysctl file and a script that does so.

I am not sure if the flag is enabled in our CI. Let me check...

@kolyshkin
Collaborator Author

/test integration_rhel

@kolyshkin
Collaborator Author

aaaaaaand I was right!

./test/test_runner.sh 
+ uname -a
Linux ip-172-18-9-178.ec2.internal 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 11 19:12:04 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
+ tail /proc/sys/fs/may_detach_mounts
0
+ grep may_detach_mounts /etc/sysctl.conf /etc/sysctl.d/99-sysctl.conf
+ exit 0
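The check in the trace above could be wrapped in a small function so a test run can report the state up front. This is only a sketch: `may_detach_mounts_status` is a hypothetical name, and treating a missing file as "absent" rests on the assumption that kernels without the RHEL7 backport do not expose this knob.

```shell
# Hypothetical check: report the state of fs.may_detach_mounts.
# The optional argument overrides the procfs path (handy for testing).
may_detach_mounts_status() {
    sysctl_file=${1:-/proc/sys/fs/may_detach_mounts}
    if [ ! -r "$sysctl_file" ]; then
        # Assumption: the knob only exists on kernels carrying the backport.
        echo absent
        return 0
    fi
    value=$(cat "$sysctl_file")
    if [ "$value" = "1" ]; then
        echo enabled
    else
        echo disabled
    fi
}
```

On the CI host shown above, where the file contains `0`, this would print `disabled`.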

@kolyshkin
Collaborator Author

The obvious solution would be to have

echo 1 > /proc/sys/fs/may_detach_mounts

as part of tests, but first I need to make sure that some package from OC distro for RHEL7 sets it, too.
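For the test setup, a slightly more defensive variant of that one-liner might be the following. It is a sketch under assumptions: `enable_may_detach_mounts` is a hypothetical name, and skipping silently when the file is missing assumes that other kernels simply do not have the knob.

```shell
# Hypothetical helper: turn fs.may_detach_mounts on if the kernel exposes it.
# The optional argument overrides the procfs path (handy for testing).
enable_may_detach_mounts() {
    sysctl_file=${1:-/proc/sys/fs/may_detach_mounts}
    # Nothing to do where the file does not exist (assumed: non-RHEL7 kernels).
    [ -e "$sysctl_file" ] || return 0
    # Writing to the real procfs file requires root; a failed write is
    # reported through the function's exit status.
    echo 1 > "$sysctl_file"
}
```

Calling it once before the test invocations would keep the same script usable on hosts that lack the sysctl.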

@mrunalp
Member

mrunalp commented Sep 19, 2020

We set sysctls here https://github.com/cri-o/cri-o/blob/master/contrib/test/integration/system.yml, but only in the setup job, so that would mean getting a new AMI tagged. @haircommander or I could walk you through that. Another option, as you pointed out, is to just call this before the test invocations in main.yml, which is easier.

@kolyshkin
Collaborator Author

@mrunalp Fixing this for tests is easy. Whether this should be set is also a no-brainer.

My concern is whether this sysctl is always set for RHEL7 in production. If we run our tests with this sysctl set, but it's not set on real RHEL7 nodes, that's ❌ no good.

NicolasT added a commit to scality/metalk8s that referenced this pull request Mar 19, 2021
On EL7.4, a new sysctl, `fs.may_detach_mounts`, was added which should
be enabled on hosts where container runtimes are being used (it's off
by default for backwards compatibility). Docker sets this as part of
the daemon startup process, `containerd` (and `cri-o`) don't. We've
seen the netns cleanup issue resulting from this not being set in the
past in various deployments.

Adding a `sysctl.d` drop-in to set the value as part of the containerd
RPM we build should fix this.

Fixes: #3211
See: https://access.redhat.com/solutions/5430091
See: containerd/containerd#3667 (comment)
See: cri-o/cri-o#4210
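
The `sysctl.d` drop-in described in the commit message above could be generated along these lines. This is a sketch, not the metalk8s packaging: `write_may_detach_dropin` and the file name `99-may-detach-mounts.conf` are assumptions made for illustration.

```shell
# Hypothetical helper: write a sysctl.d drop-in enabling fs.may_detach_mounts.
# The directory argument defaults to /etc/sysctl.d; the file name is an assumption.
write_may_detach_dropin() {
    dropin_dir=${1:-/etc/sysctl.d}
    dropin_file=$dropin_dir/99-may-detach-mounts.conf
    printf '%s\n' 'fs.may_detach_mounts = 1' > "$dropin_file" || return 1
    # Print the path so callers can log or package it.
    echo "$dropin_file"
}
```

A drop-in only takes effect when sysctl settings are loaded, for example at boot or after running `sysctl --system` as root.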