kgo: do not requeue offset loads after the session is canceled #1350
Closed
Segflow wants to merge 1 commit into
When listOrEpoch reschedules a failed load, the reload was requeued from an unconditional defer. On session cancellation (group leave during Close) the backoff select returns immediately but the defer still requeued the load, and the requeued listOrEpoch ran with the same canceled context and requeued again with no backoff. That busy loop kept a session worker alive, so stopSession never saw workers == 0 and Close hung. Only requeue when the backoff timer fires; return without requeuing when the session context is done. Fixes twmb#1349.
Owner
Thanks. Coincidentally, I had Claude w/ Fable (during the few days it was available) run audits over the codebase and it caught the same thing; the fix is here: 34b16df. It bails out early instead of spin-looping, but importantly it preserves the offset loads still waiting to be issued, whereas this PR drops them. Dropping them introduces a different bug: on a session stop that keeps the partition (cooperative revoke, SetOffsets on other partitions, topic purge), the dropped reload becomes a silently stuck cursor that never resumes. I'll close this in favor of #1348, which I aim to get out soon.
Fixes #1349.

(*consumerSession).listOrEpoch requeued a failed offset load from an unconditional defer. When the session context is canceled (group leave during Close), the select returns at once but the deferred loadWithSession still requeues the load. The requeued listOrEpoch runs with the same canceled context, fails, and requeues again with no backoff. The busy loop keeps a session worker alive, so stopSession never sees workers == 0 and Close hangs.

This regressed in 6c7aab5, which removed the session-context early return from the drain loop. Before that, a canceled session returned before reloads was populated.

The fix only requeues when the backoff timer fires and returns without requeuing once the session context is done. Added TestIssue1349 in kfake, which hangs without the fix.