[do not merge] debug ns file removal #4210
Signed-off-by: Kir Kolyshkin <[email protected]>
|
@kolyshkin: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kolyshkin. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files: Approvers can indicate their approval by writing |
Codecov Report
@@ Coverage Diff @@
## master #4210 +/- ##
==========================================
- Coverage 38.75% 38.74% -0.02%
==========================================
Files 111 111
Lines 8698 8701 +3
==========================================
Hits 3371 3371
- Misses 4963 4966 +3
Partials 364 364 |
|
OK it's not just me. This one is from an older CI run for #4147, also
One more from #4184:
Can probably find more looking into https://prow.ci.openshift.org/?repo=cri-o%2Fcri-o&state=failure |
|
here's a dump of some others #3996 |
|
/retest |
|
/test integration_rhel |
|
@kolyshkin: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/test integration_rhel |
|
And now, when we need it to fail, it's passing 😃 |
|
/test integration_rhel |
|
/test integration_rhel |
|
New theory: this might be because kernel 3.18 added this: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe. I am not sure if the flag is enabled in our CI. Let me check... |
Signed-off-by: Kir Kolyshkin <[email protected]>
|
/test integration_rhel |
|
aaaaaaand I was right! |
|
The obvious solution would be to have `echo 1 > /proc/sys/fs/may_detach_mounts` as part of tests, but first I need to make sure that some package from OC distro for RHEL7 sets it, too. |
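Before writing the value from the tests, it may help to record what the node currently has. A minimal sketch, assuming the knob lives at `/proc/sys/fs/may_detach_mounts` (the path from the `echo` above) and that the file is simply absent on kernels that do not carry it; the function name is hypothetical:

```python
def may_detach_mounts_enabled(path="/proc/sys/fs/may_detach_mounts"):
    """Report the state of the sysctl file at `path`.

    Returns True if the file contains "1", False for any other value,
    and None if the file does not exist (assumed here to mean the
    kernel has no such knob).
    """
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    state = may_detach_mounts_enabled()
    if state is None:
        print("fs.may_detach_mounts: not present on this kernel")
    else:
        print("fs.may_detach_mounts:", "enabled" if state else "disabled")
```

Logging this at the start of a CI run would show whether a passing run had the sysctl set or not.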
|
We set sysctls here https://github.com/cri-o/cri-o/blob/master/contrib/test/integration/system.yml, but only in the setup job, so that would mean getting a new AMI tagged. @haircommander or I could walk you through that. Another option, as you pointed out, is to just call this before the test invocations in main.yml, which is easier. |
|
@mrunalp Fixing this for tests is easy. Whether this should be set is also a no-brainer. My concern is whether this sysctl is always set for RHEL7 in production. If we run our tests with this sysctl set, but it's not set on real RHEL7 nodes, that's ❌ no good.
On EL7.4, a new sysctl, `fs.may_detach_mounts`, was added, which should be enabled on hosts where container runtimes are being used (it's off by default for backwards compatibility). Docker sets this as part of the daemon startup process; `containerd` (and `cri-o`) don't. We've seen the netns cleanup issue resulting from this not being set in the past in various deployments. Adding a `sysctl.d` drop-in to set the value as part of the containerd RPM we build should fix this.
Fixes: #3211
See: https://access.redhat.com/solutions/5430091
See: containerd/containerd#3667 (comment)
See: cri-o/cri-o#4210
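To check whether some installed package already ships such a drop-in on a given node, one could scan the standard `sysctl.d` directories for the key. This is a rough sketch, not a full implementation of `sysctl.d` semantics: it ignores file precedence and the `/`-separated key form, and the function names are made up for illustration.

```python
import glob
import os

# Directories searched by sysctl.d; precedence between them is not modeled.
SYSCTL_DIRS = ("/etc/sysctl.d", "/run/sysctl.d", "/usr/lib/sysctl.d")


def parse_sysctl_conf(text):
    """Parse `key = value` lines, skipping blanks and '#' / ';' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # A leading '-' on the key means "ignore errors when applying".
        settings[key.strip().lstrip("-").strip()] = value.strip()
    return settings


def dropins_setting(key, dirs=SYSCTL_DIRS):
    """Return [(path, value)] for every *.conf drop-in that sets `key`."""
    hits = []
    for d in dirs:
        for path in sorted(glob.glob(os.path.join(d, "*.conf"))):
            with open(path) as f:
                settings = parse_sysctl_conf(f.read())
            if key in settings:
                hits.append((path, settings[key]))
    return hits


if __name__ == "__main__":
    for path, value in dropins_setting("fs.may_detach_mounts"):
        print(f"{path}: fs.may_detach_mounts = {value}")
```

An empty result on a production RHEL7 node would suggest nothing on disk sets the value persistently, which is the concern raised above.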
Just trying to debug a failure that happened here:
The issue:
Similar bug: containerd/containerd#3667
### Hypothesis of the hour
The mount gets propagated to other namespaces, and it takes time for the unmount event to be propagated. Might have to do with an older/buggy RHEL7 kernel. Might need to be mitigated by retrying with a wee sleep.
### Cause
Missing `fs.may_detach_mounts = 1` sysctl setting.
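For comparison, the "retry with a wee sleep" mitigation from the earlier hypothesis could look roughly like the sketch below. It is a workaround only and does not address the missing sysctl. It assumes the removal failure surfaces as `EBUSY`, which this thread does not confirm; the function name and the default attempt count and delay are illustrative.

```python
import errno
import os
import time


def remove_with_retry(path, attempts=5, delay=0.1,
                      remove=os.remove, sleep=time.sleep):
    """Remove `path`, retrying while the error is EBUSY (assumed symptom).

    Returns the number of attempts it took. Any other error, or EBUSY on
    the last attempt, is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            remove(path)
            return attempt
        except OSError as e:
            if e.errno != errno.EBUSY or attempt == attempts:
                raise
            sleep(delay)  # give the unmount time to propagate
```

The `remove` and `sleep` parameters exist only so the helper can be exercised without a real busy mount.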