Use timeout for conmon cgroup move #5535
We now use a timeout when moving conmon to its cgroup, which unblocks acquisition of the sandbox mutex in a container creation vs. sandbox stop race.
Signed-off-by: Sascha Grunert <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
Codecov Report
@@            Coverage Diff             @@
##             main    #5535      +/-   ##
==========================================
- Coverage   43.11%   43.10%   -0.02%
==========================================
  Files         121      121
  Lines       12148    12150       +2
==========================================
- Hits         5238     5237       -1
- Misses       6407     6410       +3
  Partials      503      503
Change LGTM, but 6 minutes seems a bit long to wait.
mrunalp left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: haircommander, mrunalp
/retest-required
Please review the full test history for this PR and help us cut down flakes.
@haircommander: The following tests failed, say
/cherry-pick release-1.23
@haircommander: new pull request created: #5540
/cherry-pick release-1.22
@haircommander: new pull request created: #5545
 	if err := mgr.RetryOnDisconnect(func(c *systemdDbus.Conn) error {
 		_, err = c.StartTransientUnitContext(ctx, unitName, "replace", properties, ch)
-		return err
+		return errors.Wrap(err, "start transient unit")
 	}); err != nil {
 		return err
 	}

 	// Block until job is started
-	<-ch
-	close(ch)
+	select {
+	case <-ch:
+		close(ch)
+	case <-time.After(time.Minute * 6):
+		// This case is a work around to catch situations where the dbus library sends the
+		// request but it unexpectedly disappears. We set the timeout very high to make sure
+		// we wait as long as possible to catch situations where dbus is overwhelmed.
+		// We also don't use the native context cancelling behavior of the dbus library,
+		// because experience has shown that it does not help.
+		// TODO: Find cause of the request being dropped in the dbus library and fix it.
+		return errors.Errorf("timed out moving conmon with pid %d to cgroup", pid)
+	}
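The wait-with-timeout pattern in the hunk above can be tried in isolation. The following is a minimal sketch, not CRI-O code: `waitForJob` is a hypothetical helper, the channel is only a stand-in for the job channel passed to `StartTransientUnitContext`, and `time.NewTimer` is swapped in for the PR's `time.After` so the timer can be stopped once a result arrives.

```go
package main

import (
	"fmt"
	"time"
)

// waitForJob blocks until a result arrives on ch or the timeout elapses.
// Hypothetical helper for illustration; it is not part of CRI-O or go-systemd.
func waitForJob(ch <-chan string, timeout time.Duration) (string, error) {
	timer := time.NewTimer(timeout)
	defer timer.Stop() // release the timer if the result arrives first

	select {
	case result := <-ch:
		return result, nil
	case <-timer.C:
		return "", fmt.Errorf("timed out after %s waiting for job result", timeout)
	}
}

func main() {
	// Assumption of this sketch: a buffer of one lets the sender finish
	// even if the receiver has already given up waiting.
	ch := make(chan string, 1)
	go func() {
		time.Sleep(10 * time.Millisecond)
		ch <- "done"
	}()
	res, err := waitForJob(ch, time.Second)
	fmt.Println(res, err)

	// Nothing ever sends on this channel, so the wait ends with a timeout error.
	_, err = waitForJob(make(chan string), 20*time.Millisecond)
	fmt.Println(err)
}
```

The caller gets an error instead of blocking forever, which is what lets a lock held around the call be released.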
Two ideas come to mind:
- What if we reduce the timeout to 1 minute and retry multiple times with an exponential backoff?
- What if we check for the error "Message recipient disconnected from message bus without replying" and restart dbus in that case?
How would we be able to get the error in this case?
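The first idea in this thread (a shorter timeout, retried with exponential backoff) could look roughly like the sketch below. It is illustrative only: `retryWithBackoff` and `errTimeout` are hypothetical names, the attempt function stands in for "start the transient unit and wait up to one minute", and the thread does not establish whether re-issuing the request is safe.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is a hypothetical sentinel for a single attempt that timed out.
var errTimeout = errors.New("timed out waiting for job result")

// retryWithBackoff calls attempt up to maxAttempts times and doubles the
// pause between attempts. Hypothetical helper, not part of CRI-O.
func retryWithBackoff(maxAttempts int, initialDelay time.Duration, attempt func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	delay := initialDelay
	var err error
	for i := 1; i <= maxAttempts; i++ {
		if err = attempt(); err == nil {
			return nil
		}
		if i < maxAttempts {
			time.Sleep(delay)
			delay *= 2 // exponential backoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(4, 5*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTimeout // simulate two attempts that time out
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Because the final error wraps the last attempt's error with `%w`, a caller can still check for the timeout with `errors.Is`.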
/cherry-pick release-1.20
@haircommander: #5535 failed to apply on top of branch "release-1.20"
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?