@kolyshkin
Collaborator

As seen in [1], sometimes the coreos/go-systemd/dbus package deadlocks:
jobComplete is stuck trying to send the job result string to the channel
while holding the jobListener lock, while startJob (called by
StartTransientUnit) waits for the same lock.

Alas, it is not clear why the channel is not being read, nor was I able
to reproduce it locally.

Make the job result channel buffered, so jobListener won't block on
channel send and thus StartTransientUnit won't be stuck either.
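
To illustrate the locking pattern described above, here is a minimal sketch. It is not the go-systemd code: the `listener` type, its methods, and the job paths are hypothetical stand-ins. It only shows why a channel with capacity 1 lets the sender release the lock even when nobody is receiving yet.

```go
package main

import (
	"fmt"
	"sync"
)

// listener is a hypothetical stand-in for the job listener described above:
// a mutex-protected map from job path to the channel awaiting its result.
type listener struct {
	sync.Mutex
	jobs map[string]chan<- string
}

// startJob registers a result channel for a job. It needs the lock,
// so it blocks for as long as complete holds it.
func (l *listener) startJob(path string, ch chan<- string) {
	l.Lock()
	defer l.Unlock()
	l.jobs[path] = ch
}

// complete delivers the result while holding the lock. With an unbuffered
// channel and no reader, the send would never return and the lock would
// never be released.
func (l *listener) complete(path, result string) {
	l.Lock()
	defer l.Unlock()
	if ch, ok := l.jobs[path]; ok {
		ch <- result
		delete(l.jobs, path)
	}
}

func main() {
	l := &listener{jobs: make(map[string]chan<- string)}

	// Capacity 1 lets complete finish even though nobody is receiving yet.
	ch := make(chan string, 1)
	l.startJob("/job/1", ch)
	l.complete("/job/1", "done")

	// The lock is free again, so a second startJob does not hang.
	l.startJob("/job/2", make(chan string, 1))

	fmt.Println(<-ch)
}
```

Replacing `make(chan string, 1)` with `make(chan string)` in this sketch would make `complete` block forever, which mirrors the stuck state reported in the bug.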

While at it,

  • move the error wrapping out of mgr.RetryOnDisconnect function,
    and use fmt.Errorf with %w instead of obsoleted errors.Wrap;

  • improve error messages, printing the systemd unit name (so we can
    check it in systemd log);

  • do check the job result string -- in case it is not "done",
    return an error back to the caller, which should help avoid other
    issues down the line.
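
The error-handling points in the list above could look roughly like the sketch below. The helper names, the `errDisconnected` sentinel, and the unit name are assumptions made for illustration; they are not taken from the actual patch.

```go
package main

import (
	"errors"
	"fmt"
)

// errDisconnected is a hypothetical sentinel used only for this illustration.
var errDisconnected = errors.New("dbus connection closed")

// wrapStartError adds the unit name to the message. The %w verb keeps the
// original error reachable through errors.Is and errors.As.
func wrapStartError(unitName string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("starting systemd unit %s: %w", unitName, err)
}

// checkJobResult returns an error unless the job result string is "done".
func checkJobResult(unitName, result string) error {
	if result != "done" {
		return fmt.Errorf("systemd unit %s: job result is %q, want \"done\"", unitName, result)
	}
	return nil
}

// isDisconnected reports whether err wraps errDisconnected.
func isDisconnected(err error) bool {
	return errors.Is(err, errDisconnected)
}

func main() {
	err := wrapStartError("example.scope", errDisconnected)
	fmt.Println(err)
	fmt.Println(isDisconnected(err))
	fmt.Println(checkJobResult("example.scope", "failed"))
}
```

Including the unit name in both messages means a failure seen in the caller's log can be matched against the same unit in the systemd journal.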

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2082344

Avoid a deadlock while moving conmon to a different system scope (RHBZ#2082344)

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin requested review from mrunalp and runcom as code owners May 27, 2022 01:55
@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels May 27, 2022
@openshift-ci openshift-ci bot requested a review from fidencio May 27, 2022 01:56
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2022
@codecov

codecov bot commented May 27, 2022

Codecov Report

Merging #5914 (343bcdd) into release-1.21 (8ed37a9) will decrease coverage by 0.00%.
The diff coverage is 28.57%.

@@               Coverage Diff                @@
##           release-1.21    #5914      +/-   ##
================================================
- Coverage         44.97%   44.97%   -0.01%     
================================================
  Files               109      109              
  Lines             11063    11065       +2     
================================================
  Hits               4976     4976              
- Misses             5608     5610       +2     
  Partials            479      479              

	if s != "done" {
		return fmt.Errorf("error moving conmon with pid %d to systemd unit %s: got %s", pid, unitName, s)
	}
case <-time.After(time.Minute * 6):
Member

I find myself wondering if we still need this case. Ideally, the buffered channel would fix the deadlock long term. From the sounds of it, having the request time out just caused a less obvious deadlock.

Collaborator Author

I believe we do. The buffered channel fixes the deadlock, but if we haven't received a reply from systemd we should still say so and return an error.
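
A minimal sketch of the pattern under discussion, assuming a buffered result channel: wait for the job result, treat anything other than "done" as an error, and give up with an explicit error if no reply arrives in time. The `waitForJob` helper, the `shortTimeout` constant, and the unit name are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"time"
)

// shortTimeout is a hypothetical value chosen so the example finishes quickly.
const shortTimeout = 10 * time.Millisecond

// waitForJob waits for a job result on ch for at most timeout.
// A result other than "done" and a missing reply are both reported as errors.
func waitForJob(ch <-chan string, unitName string, timeout time.Duration) error {
	select {
	case s := <-ch:
		if s != "done" {
			return fmt.Errorf("systemd unit %s: got %s", unitName, s)
		}
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("timed out waiting for systemd to start unit %s", unitName)
	}
}

func main() {
	// A reply is already buffered, so this returns immediately.
	ready := make(chan string, 1)
	ready <- "done"
	fmt.Println(waitForJob(ready, "example.scope", time.Second))

	// No reply ever arrives, so the timeout branch reports it.
	silent := make(chan string, 1)
	fmt.Println(waitForJob(silent, "example.scope", shortTimeout))
}
```

The timeout branch does not prevent the deadlock; it only ensures the caller learns that systemd never answered.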

@kolyshkin
Collaborator Author

lint job fails since it uses an old version of golangci-lint but a new version of golang (IOW release-1.21 branch needs some love).

@haircommander
Member

/retest
/approve

LGTM, @cri-o/cri-o-maintainers PTAL

@openshift-ci
Contributor

openshift-ci bot commented May 31, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, kolyshkin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [haircommander,kolyshkin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kolyshkin
Collaborator Author

lint job fails since it uses an old version of golangci-lint but a new version of golang (IOW release-1.21 branch needs some love).

Being fixed in #5918

@rphillips
Contributor

@kolyshkin I think we should get this into 'main' first.

Can we give the customer a test build to try out?

@kolyshkin
Collaborator Author

@kolyshkin I think we should get this into 'main' first.

Does it matter if we do the backport or forward port?

Can we give the customer a test build to try out?

Yes, that would be a good thing to do, despite the lack of clear repro.

@haircommander
Member

Does it matter if we do the backport or forward port?

let's start with main and pull back sequentially, so we order when they merge by inverse relative stability (1.21 should be very stable, 1.24 should be stable)

@kolyshkin
Collaborator Author

Forward-port to main branch: #5922

@kolyshkin
Collaborator Author

let's start with main and pull back sequentially, so we order when they merge by inverse relative stability (1.21 should be very stable, 1.24 should be stable)

Meaning I can still do a forward-port, but the order of merging is important, right?

Here's one for main: #5922

@TomSweeneyRedHat
Contributor

LGTM
But lint doesn't look happy

@nee1esh

nee1esh commented Jun 8, 2022

/retest

@haircommander haircommander changed the title utils/RunUnderSystemdScope: fix wrt channel deadlock [1.21] utils/RunUnderSystemdScope: fix wrt channel deadlock Jun 8, 2022
@github-actions

github-actions bot commented Jul 9, 2022

A friendly reminder that this PR had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2022
@EmmanuelKasper

/retest

@haircommander
Member

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2022
@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@haircommander
Member

@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

25 similar comments

@haircommander
Member

/override ci/openshift-jenkins/e2e_rhel

@openshift-ci
Contributor

openshift-ci bot commented Jul 24, 2022

@haircommander: Overrode contexts on behalf of haircommander: ci/openshift-jenkins/e2e_rhel

Details

In response to this:

/override ci/openshift-jenkins/e2e_rhel

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
