Fix bogus CI test failures #4217
```
@@ -0,0 +1,20 @@
#!/usr/bin/env bats
```
I think more users are apt to find this out if it's in cri-o itself. We could have a log message come out of internal/config/node, similar to how we check other node-level knobs. WDYT?
This one is strictly for CI, so that it fails if we are not running in a sane environment.
Adding a log message to cri-o itself is also on my radar. Added.
Not sure if we should make it fatal or not. I'd rather make it fatal as otherwise we'll have to deal with some weird EBUSY bugs for no reason.
Codecov Report
```diff
@@           Coverage Diff           @@
##           master    #4217   +/-   ##
=======================================
  Coverage   38.72%   38.72%
=======================================
  Files         111      111
  Lines        8702     8702
=======================================
  Hits         3370     3370
  Misses       4967     4967
  Partials      365      365
```
/test ami (I'm not sure the ami update jobs actually work anymore, but it's worth a shot)

Apparently it's working; from integration_rhel output:
(branch updated: 58020e4 to 9819474)
A bunch of failed CI jobs don't have any logs...
/retest
(branch updated: 06dc9ee to 721ed24)
A minor fix to the error message...

From e2e_fedora logs:
```yaml
state: present
value: 1
sysctl_set: yes
ignoreerrors: yes
```
> ignoreerrors (boolean)
> Choices: no (default), yes
> Use this option to ignore errors about unknown keys.

Alas, this is not working for some reason.
Maybe it's because of the `state: present` setting. Let me try removing it.
Retrying with `state: absent`.
The following failure happens in CI on RHEL7 from time to time:

> removing the pod sandbox "0d5bf5eeb0048bb70ab8ee9bca0a497e216a6c6cfa42507a8d735f61c825784d": rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [unlinkat /var/run/netns/0376543d-4917-417f-b406-7e8ab07cb847: device or resource busy]

(with different test cases and different namespaces).

RHEL7 kernel has upstream commit [1] backported, but the feature is controlled by a sysctl and is off by default. For the feature to work, one needs to set fs.may_detach_mounts = 1.

On production RHEL7 systems, this is done by runc rpm (see [2]), but we're not using those rpms for CI, so we have to set the sysctl manually.

Add an integration test case to check the sysctl is set.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17

Signed-off-by: Kir Kolyshkin <[email protected]>
(branch updated: 721ed24 to 25b8c02)
```go
init:      checkFsMayDetachMounts,
err:       &checkFsMayDetachMountsErr,
activated: nil,
fatal:     true,
```
The only question is whether to make it fatal. I'm in favor of doing so: this is a gross misconfiguration that will result in various failures later, so it's better to bail out.
Basically if we fail here it means RPM packaging is broken or something like that.
... and make it fatal.

See previous commit for details.

Signed-off-by: Kir Kolyshkin <[email protected]>
(branch updated: 25b8c02 to bd3aa81)
@kolyshkin: The following tests failed, say
And if you click on the "Details" link, it says
:(

/retest
Look ma, green CI 🎉

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kolyshkin, mrunalp
/cherry-pick release-1.19

@kolyshkin: new pull request created: #4228

/cherry-pick release-1.18

@kolyshkin: #4217 failed to apply on top of branch "release-1.18":
/kind failing-test
What this PR does / why we need it:
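The following failure happens in CI on RHEL7 from time to time:

> removing the pod sandbox "0d5bf5eeb0048bb70ab8ee9bca0a497e216a6c6cfa42507a8d735f61c825784d": rpc error: code = Unknown desc = unable to remove managed namespaces: Removing namespaces encountered the following errors [unlinkat /var/run/netns/0376543d-4917-417f-b406-7e8ab07cb847: device or resource busy]

(with different test cases and different namespaces).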
The RHEL7 kernel has upstream commit [1] backported, but the feature is
controlled by a sysctl that is off by default. For the feature to work,
one needs to set fs.may_detach_mounts = 1.
On production RHEL7 systems, this is done by the runc rpm (see [2]), but
we're not using those rpms in CI, so we have to set the sysctl
manually.
Add an integration test case to check that the sysctl is set.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17
Which issue(s) this PR fixes:
Fixes: #3996
Some more notes at #4210
Special notes for your reviewer:
Does this PR introduce a user-facing change?