kgo: do not requeue offset loads after the session is canceled#1350

Closed
Segflow wants to merge 1 commit into twmb:master from Segflow:fix-listorepoch-reload-on-close

Conversation

Segflow commented Jun 19, 2026

Fixes #1349.

(*consumerSession).listOrEpoch requeued a failed offset load from an unconditional defer:

defer s.decWorker()
defer reloads.loadWithSession(s, "reload offsets from load failure")
after := time.NewTimer(time.Second)
defer after.Stop()
select {
case <-after.C:
case <-s.ctx.Done():
    return
}

When the session context is canceled (group leave during Close), the select returns at once but the deferred loadWithSession still requeues the load. The requeued listOrEpoch runs with the same canceled context, fails, and requeues again with no backoff. The busy loop keeps a session worker alive, so stopSession never sees workers == 0 and Close hangs.

This regressed in 6c7aab5, which removed the session-context early return from the drain loop. Before that, a canceled session returned before reloads was populated.

The fix only requeues when the backoff timer fires and returns without requeuing once the session context is done. Added TestIssue1349 in kfake, which hangs without the fix.

When listOrEpoch rescheduled a failed load, the reload was requeued from an
unconditional defer. On session cancellation (group leave during Close) the
backoff select returns immediately but the defer still requeued the load, and
the requeued listOrEpoch ran with the same canceled context and requeued again
with no backoff. That busy loop kept a session worker alive, so stopSession
never saw workers == 0 and Close hung.

Only requeue when the backoff timer fires; return without requeuing when the
session context is done. Fixes twmb#1349.

twmb (Owner) commented Jun 23, 2026

Thanks. Coincidentally, I had Claude w/ Fable (during the few days it was available) run audits over the codebase and it caught the same thing — fix here: 34b16df. It bails out early instead of spin-looping, but importantly it preserves the offset loads still waiting to be issued, whereas this PR drops them — which itself introduces a different bug: on a session stop that keeps the partition (cooperative revoke, SetOffsets on other partitions, topic purge), the dropped reload becomes a silently stuck cursor that never resumes.

I'll close this in favor of #1348, which I aim to get out soon.
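The drop-versus-preserve distinction described above can be illustrated with a toy model. This is only a sketch under assumptions and does not reflect the code in 34b16df: `pendingLoads`, `stopDropping`, and `stopPreserving` are hypothetical names, and the real client tracks far more state per partition.

```go
package main

import "fmt"

// pendingLoads is a hypothetical stand-in for the set of partitions whose
// offset loads have failed and still need to be issued.
type pendingLoads map[string]struct{}

// stopDropping models a bail-out that discards the failed reloads. If the
// partition stays assigned after the session stops, nothing remains to
// restart its cursor.
func stopDropping(failed pendingLoads) pendingLoads {
	return pendingLoads{}
}

// stopPreserving models a bail-out that carries the failed reloads forward
// so that a later session can issue them again.
func stopPreserving(failed pendingLoads) pendingLoads {
	carried := make(pendingLoads, len(failed))
	for partition := range failed {
		carried[partition] = struct{}{}
	}
	return carried
}

func main() {
	failed := pendingLoads{"topic-0": {}}
	fmt.Println(len(stopDropping(failed)), len(stopPreserving(failed))) // prints "0 1"
}
```

In this model both variants end the spin loop; only the second leaves a record that the partition still needs its offsets loaded.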

@twmb twmb closed this Jun 23, 2026
Development

Successfully merging this pull request may close these issues.

kgo: Close hangs on a group consumer when an offset load keeps failing during group leave
