@kolyshkin
Collaborator

/kind failing-test

What this PR does / why we need it:

The following failure happens in CI on RHEL7 from time to time:

removing the pod sandbox "0d5bf5eeb0048bb70ab8ee9bca0a497e216a6c6cfa42507a8d735f61c825784d": rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [unlinkat /var/run/netns/0376543d-4917-417f-b406-7e8ab07cb847: device or resource busy]

(with different test cases and different namespaces).

RHEL7 kernel has upstream commit [1] backported, but the feature is
controlled by a sysctl and is off by default. For the feature to work,
one needs to set fs.may_detach_mounts = 1.

On production RHEL7 systems, this is done by runc rpm (see [2]), but
we're not using those rpms for CI, so we have to set the sysctl
manually.

Add an integration test case to check the sysctl is set.
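
A minimal sketch of such a check in plain shell (not the actual bats test added by this PR). The function name and the optional path argument are assumptions, introduced only so the logic can be exercised against a fixture file. A missing file is treated as "nothing to check", on the assumption that kernels without this knob do not gate the feature behind a sysctl.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when fs.may_detach_mounts is either
# absent (the kernel exposes no such knob) or set to 1.
# $1: path to the sysctl file (defaults to the real procfs location).
check_may_detach_mounts() {
	knob="${1:-/proc/sys/fs/may_detach_mounts}"

	# No such file: assume the kernel does not gate the feature.
	[ -e "$knob" ] || return 0

	value="$(cat "$knob")"
	if [ "$value" != "1" ]; then
		echo "fs.may_detach_mounts is $value, expected 1" >&2
		return 1
	fi
	return 0
}
```

Called with no argument, the function inspects the running kernel; a non-zero exit status is what a CI test would treat as a failure.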

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17

Which issue(s) this PR fixes:

Fixes: #3996
Some more notes at #4210

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@openshift-ci-robot added the release-note-none, kind/failing-test, and dco-signoff: yes labels Sep 21, 2020
@@ -0,0 +1,20 @@
#!/usr/bin/env bats
Member

I think more users are apt to find this out if it's in cri-o itself. We could have a log message come out of internal/config/node, similar to how we check other node-level knobs. WDYT?

Collaborator Author

This one is strictly for CI, so it would fail in case we are not running in a sane env.

Adding a log message to cri-o itself is also on my radar. Added.

Collaborator Author

Not sure if we should make it fatal or not. I'd rather make it fatal as otherwise we'll have to deal with some weird EBUSY bugs for no reason.

@codecov

codecov bot commented Sep 21, 2020

Codecov Report

Merging #4217 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4217   +/-   ##
=======================================
  Coverage   38.72%   38.72%           
=======================================
  Files         111      111           
  Lines        8702     8702           
=======================================
  Hits         3370     3370           
  Misses       4967     4967           
  Partials      365      365           

@haircommander
Member

/test ami

(I'm not sure the ami update jobs actually work anymore, but it's worth a shot)

@kolyshkin
Collaborator Author

Apparently it's working; from integration_rhel output:

ok 163 if fs.may_detach_mounts is set

@kolyshkin
Collaborator Author

A bunch of failed CI jobs don't have any logs...

/retest

@kolyshkin
Collaborator Author

A minor fix to the error message...

@mrunalp
Member

mrunalp commented Sep 22, 2020

From e2e_fedora logs:

fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to reload sysctl: net.ipv4.conf.all.route_localnet = 1\nnet.ipv4.ping_group_range = 0 2147483647\nnet.bridge.bridge-nf-call-iptables = 1\nsysctl: cannot stat /proc/sys/fs/may_detach_mounts: No such file or directory\n"}

state: present
value: 1
sysctl_set: yes
ignoreerrors: yes
Collaborator Author

ignoreerrors (boolean; choices: no (default), yes): Use this option to ignore errors about unknown keys.

Alas, this is not working for some reason

Collaborator Author

Maybe it's because of the state: present stanza. Let me try removing it.

Collaborator Author

retrying with state: absent.
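
One way to sidestep the missing key on Fedora, sketched in shell rather than as an Ansible task: only write the knob when the kernel exposes it. The function name and the path parameter are illustrative assumptions, and this is not necessarily what the PR ended up doing.

```shell
#!/bin/sh
# Hypothetical helper: enable fs.may_detach_mounts only where it exists.
# $1: path to the sysctl file (defaults to the real procfs location).
enable_may_detach_mounts() {
	knob="${1:-/proc/sys/fs/may_detach_mounts}"

	# Kernels without the knob (as in the e2e_fedora log above) have
	# nothing to set, so skip instead of failing.
	if [ ! -e "$knob" ]; then
		echo "skipping: $knob does not exist" >&2
		return 0
	fi

	# Writing to the real procfs file requires root.
	echo 1 > "$knob"
}
```

The existence test keeps a single provisioning step usable on both RHEL7 and Fedora hosts without relying on the module's error handling.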

The following failure happens in CI on RHEL7 from time to time:

> removing the pod sandbox "0d5bf5eeb0048bb70ab8ee9bca0a497e216a6c6cfa42507a8d735f61c825784d": rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [unlinkat /var/run/netns/0376543d-4917-417f-b406-7e8ab07cb847: device or resource busy]

(with different test cases and different namespaces).

RHEL7 kernel has upstream commit [1] backported, but the feature is
controlled by a sysctl and is off by default. For the feature to work,
one needs to set fs.may_detach_mounts = 1.

On production RHEL7 systems, this is done by runc rpm (see [2]), but
we're not using those rpms for CI, so we have to set the sysctl
manually.

Add an integration test case to check the sysctl is set.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17

Signed-off-by: Kir Kolyshkin <[email protected]>
init: checkFsMayDetachMounts,
err: &checkFsMayDetachMountsErr,
activated: nil,
fatal: true,
Collaborator Author

The only question is whether to make it fatal. I'm in favor of doing so: this is a gross misconfiguration that will result in various failures later, so it's better to bail out.

Collaborator Author

Basically if we fail here it means RPM packaging is broken or something like that.

... and make it fatal. See previous commit for details.

Signed-off-by: Kir Kolyshkin <[email protected]>
@openshift-ci-robot

openshift-ci-robot commented Sep 28, 2020

@kolyshkin: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/ami_fedora | 58020e4 | link | /test ami_fedora |
| ci/openshift-jenkins/ami_rhel | 58020e4 | link | /test ami_rhel |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kolyshkin
Collaborator Author

ci/openshift-jenkins/integration_crun_cgroupv2 — Jenkins job failed.

and if you click on the "Details" link, it says

Test started today at 2:36 PM is still running after 39m15s. (more info)

:(

@kolyshkin
Collaborator Author

/retest

@kolyshkin
Collaborator Author

Look ma, green CI 🎉

@mrunalp
Member

mrunalp commented Sep 29, 2020

/lgtm

@openshift-ci-robot added the lgtm label Sep 29, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kolyshkin, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label Sep 29, 2020
@openshift-merge-robot merged commit 1455ee8 into cri-o:master Sep 29, 2020
@kolyshkin
Collaborator Author

/cherry-pick release-1.19

@openshift-cherrypick-robot

@kolyshkin: new pull request created: #4228

In response to this:

/cherry-pick release-1.19

@kolyshkin
Collaborator Author

/cherry-pick release-1.18

@openshift-cherrypick-robot

@kolyshkin: #4217 failed to apply on top of branch "release-1.18":

Applying: Fix bogus CI test failures
Applying: internal/config/node: add checkFsMayDetachMounts
Using index info to reconstruct a base tree...
A	internal/config/node/node.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): internal/config/node/node.go deleted in HEAD and modified in internal/config/node: add checkFsMayDetachMounts. Version internal/config/node: add checkFsMayDetachMounts of internal/config/node/node.go left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 internal/config/node: add checkFsMayDetachMounts
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.18

Development

Successfully merging this pull request may close these issues.

rhel 7 and managed ns are not playing together well

6 participants