Conversation

@kolyshkin
Collaborator

@kolyshkin kolyshkin commented May 31, 2022

As seen in [1], the coreos/go-systemd/dbus package sometimes deadlocks:
jobComplete is stuck trying to send the job result string to the channel
while holding the jobListener lock, and startJob (called by
StartTransientUnit) waits for the same lock.
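
The lock-plus-channel pattern described above can be reproduced in isolation. The following is only a minimal sketch, not the go-systemd code: the `listener` type, its `complete` and `register` methods, and the `locked` hook are hypothetical stand-ins for the job listener, the completion handler, and the start path.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// listener is a hypothetical stand-in for the job listener:
// a mutex guarding a map from job name to result channel.
type listener struct {
	sync.Mutex
	jobs map[string]chan<- string
}

// complete mimics the completion handler. It sends the result while
// holding the lock, so an unread unbuffered channel blocks it forever.
// The locked channel is a hook for this sketch only.
func (l *listener) complete(job, result string, locked chan<- struct{}) {
	l.Lock()
	defer l.Unlock()
	close(locked) // tell the caller the lock is now held
	if ch, ok := l.jobs[job]; ok {
		ch <- result // blocks until somebody receives
		delete(l.jobs, job)
	}
}

// register mimics the start path, which needs the same lock.
func (l *listener) register(job string, ch chan<- string) {
	l.Lock()
	defer l.Unlock()
	l.jobs[job] = ch
}

// startStalls reports whether registering a second job is still blocked
// after a short timeout while the first job's result is never read.
func startStalls() bool {
	l := &listener{jobs: make(map[string]chan<- string)}
	l.register("job-1", make(chan string)) // unbuffered, never read

	locked := make(chan struct{})
	go l.complete("job-1", "done", locked)
	<-locked // the completion handler now holds the lock

	registered := make(chan struct{})
	go func() {
		l.register("job-2", make(chan string))
		close(registered)
	}()

	select {
	case <-registered:
		return false
	case <-time.After(100 * time.Millisecond):
		return true
	}
}

func main() {
	fmt.Println("start path stalled:", startStalls())
}
```

The sketch does not explain why the real channel goes unread; it only shows that once it does, every later caller of the start path is stuck behind the lock.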

Alas, it is not clear why the channel is not being read, nor was I able
to reproduce it locally.

Make the job result channel buffered, so jobListener won't block on
the channel send and thus StartTransientUnit won't be stuck either.
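
Why a buffer helps can be shown with a non-blocking send. This is a small illustration under the assumption of a buffer size of 1; `trySend` is a hypothetical helper for the demonstration and is not part of this PR or of go-systemd.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
// A plain `ch <- v` would block in exactly the cases where this returns false.
func trySend(ch chan<- string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan string)
	buffered := make(chan string, 1)

	// No receiver is waiting, so the unbuffered send cannot proceed.
	fmt.Println("unbuffered send ok:", trySend(unbuffered, "done"))

	// The buffer has room, so the send completes without a receiver.
	fmt.Println("buffered send ok:", trySend(buffered, "done"))

	// The result stays in the buffer until the caller reads it.
	fmt.Println("result:", <-buffered)
}
```

With one result per job, a one-element buffer is enough for the sender to release its lock even if the receiver is late or never arrives.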

While at it,

  • move the error wrapping out of mgr.RetryOnDisconnect function,
    and use fmt.Errorf with %w instead of obsoleted errors.Wrap;

  • improve error messages, printing the systemd unit name (so we can
    check it in systemd log);

  • do check the job result string -- in case it is not "done",
    return an error back to the caller, which should help avoid other
    issues down the line.
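
The error-handling changes in the list above could look roughly like the following. It is a sketch only: `checkJob`, `errStart`, `wrapsStartErr`, the message wording, and the unit name are all hypothetical and are not taken from the actual diff.

```go
package main

import (
	"errors"
	"fmt"
)

// errStart is a hypothetical underlying error used only for this sketch.
var errStart = errors.New("connection closed")

// checkJob is a hypothetical helper. It wraps a start error with the
// unit name using %w, and treats any job result other than "done"
// as a failure reported back to the caller.
func checkJob(unit, result string, startErr error) error {
	if startErr != nil {
		return fmt.Errorf("start transient unit %q: %w", unit, startErr)
	}
	if result != "done" {
		return fmt.Errorf("start transient unit %q: job result %q, want \"done\"", unit, result)
	}
	return nil
}

// wrapsStartErr reports whether err still unwraps to errStart,
// which is what wrapping with %w preserves.
func wrapsStartErr(err error) bool {
	return errors.Is(err, errStart)
}

func main() {
	const unit = "example.scope" // hypothetical unit name

	fmt.Println(checkJob(unit, "done", nil))
	fmt.Println(checkJob(unit, "failed", nil))

	err := checkJob(unit, "", errStart)
	fmt.Println(err, wrapsStartErr(err))
}
```

Because the unit name is part of every message, a failure can be matched against the corresponding entries in the systemd log.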

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2082344

/kind bug

Fix a rare deadlock while communicating with systemd (RHBZ 2082344)

This is a forward-port of #5914 to main branch.

@kolyshkin kolyshkin requested review from mrunalp and runcom as code owners May 31, 2022 23:38
@openshift-ci openshift-ci bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. labels May 31, 2022
@openshift-ci openshift-ci bot requested review from klihub and wgahnagl May 31, 2022 23:38
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2022
Member

@saschagrunert saschagrunert left a comment

LGTM

@haircommander
Member

/retest

@nee1esh

nee1esh commented Jun 1, 2022

/retest-required

@kolyshkin
Collaborator Author

Test failures seem to be unrelated.

@haircommander
Member

e2e_fedora and e2e_crun* are unrelated (being investigated in #5924). however, agnostic and gcp may not be

@nee1esh

nee1esh commented Jun 2, 2022

/retest-required

As seen in [1], the coreos/go-systemd/dbus package sometimes deadlocks:
jobComplete is stuck trying to send the job result string to the channel
while holding the jobListener lock, and startJob (called by
StartTransientUnit) waits for the same lock.

Alas, it is not clear why the channel is not being read, nor was I able
to reproduce it locally.

Make the job result channel buffered, so jobListener won't block on
the channel send and thus StartTransientUnit won't be stuck either.

While at it,

 - move the error wrapping out of mgr.RetryOnDisconnect function,
   and use fmt.Errorf with %w instead of obsoleted errors.Wrap;

 - improve error messages, printing the systemd unit name (so we can
   check it in systemd log);

 - do check the job result string -- in case it is not "done",
   return an error back to the caller, which should help avoid other
   issues down the line.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2082344

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin
Collaborator Author

rebased

@kolyshkin
Collaborator Author

Failures in e2e-gcp and e2e-agnostic are the same as in #5506 which makes me think they are unrelated to this PR (or #5506).

Failures in e2e-cgroupv2 and e2e-fedora:

Kubernetes e2e suite: [sig-network] Services should be rejected for evicted pods (no endpoints exist)
Kubernetes e2e suite: [sig-node] Pods Extended Pod Container lifecycle evicted pods should be terminal

are the same as in e.g. and #5917, so also seem unrelated.

@kolyshkin
Collaborator Author

/retest

@nee1esh

nee1esh commented Jun 8, 2022

/retest

@haircommander
Member

/override ci/prow/e2e-gcp
/override ci/prow/e2e-agnostic
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 8, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jun 8, 2022

@haircommander: Overrode contexts on behalf of haircommander: ci/prow/e2e-agnostic, ci/prow/e2e-gcp

@haircommander
Member

/cherry-pick release-1.24

@openshift-cherrypick-robot

@haircommander: once the present PR merges, I will cherry-pick it on top of release-1.24 in a new PR and assign it to you.

@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@saschagrunert
Member

/override ci/prow/e2e-gcp

@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2022

@saschagrunert: Overrode contexts on behalf of saschagrunert: ci/prow/e2e-gcp

@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kolyshkin, saschagrunert

Needs approval from an approver in each of these files:
  • OWNERS [kolyshkin,saschagrunert]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@saschagrunert
Member

/override ci/prow/e2e-agnostic

@openshift-ci
Contributor

openshift-ci bot commented Jun 9, 2022

@saschagrunert: Overrode contexts on behalf of saschagrunert: ci/prow/e2e-agnostic

@openshift-bot

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-cherrypick-robot

@haircommander: new pull request created: #5954

Comment on lines 91 to 92
case <-ch:
close(ch)
Collaborator Author

Oops, this should not be there.

Collaborator Author

Opened #6124 to address it.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes.
