Audit fixes #1348

Merged
twmb merged 139 commits into master from audit-fixes on Jun 23, 2026

Conversation

@twmb twmb commented Jun 15, 2026

Owner

The most substantive runs were during the Fable week, with follow-up rounds covering areas and scenarios of less importance. Anyway, time to review commit by commit.

twmb and others added 30 commits June 10, 2026 23:32

checkUnknownFailLimit counted only UNKNOWN_TOPIC_OR_PARTITION and reset
the count on any other error. Two problems:

* UNKNOWN_TOPIC_ID reset the count. Produce v13+ addresses topics by
  ID (KIP-516); when a topic is deleted and recreated, the client
  deliberately keeps the old ID, so every produce after recreation
  returns UNKNOWN_TOPIC_ID. The error is retriable, and RecordRetries
  and RecordDeliveryTimeout are unbounded by default, so the unknown
  topic limit is the only bound -- and code 100 actively defeated it.
  Records buffered and retried silently forever with no signal to the
  user, while the consumer side of the same scenario surfaces an error
  to every poll after five fails.

* Any error alternation defeated the limit: a broker alternating
  UNKNOWN_TOPIC_OR_PARTITION with any other retriable error kept the
  count below the limit forever.

Now both unknown-topic errors bump the count, only a successful
produce resets it, and other errors leave it unchanged. This matches
waitUnknownTopic's existing semantics for unloaded topics: count only
unknown errors, never reset on other errors.
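
The counting rule described above can be sketched in a few lines. This is a minimal illustration, not the client's code: the `unknownFails` type, the sentinel errors, and the "trip once the count exceeds the limit" comparison are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the broker error codes discussed above.
var (
	errUnknownTopicOrPartition = errors.New("UNKNOWN_TOPIC_OR_PARTITION")
	errUnknownTopicID          = errors.New("UNKNOWN_TOPIC_ID")
	errNotLeader               = errors.New("NOT_LEADER_OR_FOLLOWER")
)

// unknownFails counts unknown-topic produce failures for one partition.
type unknownFails struct {
	count int
	limit int // assumed meaning: the limit trips once count exceeds it
}

// observe records one produce outcome and reports whether the limit tripped.
func (u *unknownFails) observe(err error) bool {
	switch {
	case err == nil:
		u.count = 0 // only a successful produce resets the count
	case errors.Is(err, errUnknownTopicOrPartition), errors.Is(err, errUnknownTopicID):
		u.count++ // both unknown-topic errors bump the count
	default:
		// any other error leaves the count unchanged
	}
	return u.count > u.limit
}

func main() {
	u := &unknownFails{limit: 4}
	outcomes := []error{
		errUnknownTopicID, errNotLeader, errUnknownTopicOrPartition,
		errNotLeader, errUnknownTopicID, errUnknownTopicID, errUnknownTopicID,
	}
	for _, err := range outcomes {
		fmt.Printf("%-28v tripped=%v\n", err, u.observe(err))
	}
}
```

Interleaving NOT_LEADER_OR_FOLLOWER with the unknown-topic errors no longer keeps the count below the limit, which is the alternation case the commit closes.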
The missing-partition comment in mergeTopicPartitions said purging
"can happen automatically for consumers if the user opted into
ConsumeRecreatedTopics". No such option exists; it was planned in the
issue #908 era but never implemented. Describe what actually clears
partitions: a manual PurgeTopicsFromClient, or the automatic regex
consumer purge after a topic has been missing from metadata for longer
than ConsiderMissingTopicDeletedAfter.

The heartbeat loop's err variable is fed both by the heartbeat itself
and by fetchOffsets via fetchErrCh. The 848 in-place retry arm
classifies by error value alone, so a retryable error returned by
fetchOffsets - e.g. the coordinator moved and OffsetFetch exhausted its
internal retries - was retried as if a heartbeat had blipped: the next
heartbeat succeeded, the retry counter reset, and the session lived on.
But the fetch goroutine was already gone, the fetched-for partitions
were never handed to assignPartitions (that only happens after a
successful offset fetch), and nothing inside a live session re-runs the
fetch. The member then acked partitions it never started consuming on
every heartbeat, silently, until an external rebalance bounced the
session.

Track whether the error came from fetchErrCh and exclude fetch errors
from the in-place retry arm. They now propagate to manage848, whose
transient arm restarts the session; the restart re-fetches outstanding
partitions via g.fetching. Heartbeat errors keep the in-place retry
that 90bcc2b introduced.

Co-Authored-By: Claude Fable 5 <[email protected]>

STALE_MEMBER_EPOCH reaches the manage loop via OffsetFetch (the
heartbeat itself fences with FENCED_MEMBER_EPOCH), meaning the broker
still has this member. Rejoining with a fresh member id - previously
shared with the UnknownMemberID arm - stranded the old member
server-side until the session timeout, and because a live member's
partitions stay in the target assignment, the fresh incarnation
received an empty assignment and consumed nothing at all until the
eviction.

Rejoining at epoch 0 with the same member id is KIP-848's lost-response
recovery: the broker re-admits the member in place and re-delivers its
assignment. Move StaleMemberEpoch into the FencedMemberEpoch arm, which
already keeps the member id; this also matches the share group
consumer, which keeps its UUID on fence/unknown.

Co-Authored-By: Claude Fable 5 <[email protected]>

ConsumerGroupHeartbeat is deliberately never retried by the client's
coordinator request wrapper: heartbeats carry reconciliation state and
the heartbeat loop owns retries with full state knowledge. The leave
(MemberEpoch -1, or -2 for static members) inherited that path
incidentally, but a leave carries no reconcilable state and is
idempotent - and the coordinator moving is precisely when a leave gets
issued against a stale cached coordinator. Firing it exactly once meant
a single NOT_COORDINATOR lost the leave and the member ghosted until
the session timeout, whereas classic LeaveGroup retries through the
wrapper. The same applied to the share group leave.

Route MemberEpoch<0 heartbeats (consumer and share) through
handleCoordinatorReqSimple and teach parseRetryErr about
ShareGroupHeartbeatResponse so coordinator errors evict the cached
coordinator and retry. The wrapper bounds retries by the session
timeout, which is the natural cap for a leave: past it, the broker has
expired the member's session anyway.

Because the leave is now retried, a retry can find the member already
gone (a prior attempt succeeded but its response was lost, or the
session expired first). Map UNKNOWN_MEMBER_ID on a leave to success in
both leave paths: the member being out of the group is the goal state
of leaving.

Co-Authored-By: Claude Fable 5 <[email protected]>

416e826 stopped canceling prior in-flight commits in favor of waiting
for them (canceling kills the connection, and the broker could then
process a replacement commit issued on a new connection before the
original), but the CommitOffsets and CommitOffsetsSync doc comments
still described the old canceling behavior.

Co-Authored-By: Claude Fable 5 <[email protected]>

Repros from the broker-dies/leader-moves-mid-rebalance audit
(rebalance-churn.md), written to fail before the prior commits: 848
offset-fetch failures must not silently stall the assignment,
STALE_MEMBER_EPOCH resets must keep the member id, and 848/share
leaves must survive a transient NOT_COORDINATOR (with UNKNOWN_MEMBER_ID
on a leave reporting success). The classic-protocol siblings pin the
behavior the fixes restore parity with.

Co-Authored-By: Claude Fable 5 <[email protected]>

Audit round 4 (txn-churn.md). Six fixes, each independently traced:

InitProducerID now retries CONCURRENT_TRANSACTIONS in place. The
doWithConcurrentTransactions wrapper added in 2451c59 never functioned:
the inner fn returned nil before checking the response ErrorCode, and the
coordinator wrapper does not convert CONCURRENT_TRANSACTIONS into a
request error. Taking over a crashed incarnation's ongoing transaction
ALWAYS receives CONCURRENT_TRANSACTIONS at least once (the broker fences
and aborts the old transaction and tells us to retry), so routine
recovery surfaced as a 'producer ID has a fatal, unrecoverable error'
BeginTransaction failure.

maybeRecoverProducerID treats retriable broker codes stored as the
producer-id load failure (COORDINATOR_LOAD_IN_PROGRESS and friends that
outlived their internal retries) as reload-recoverable, matching the
existing transport-error arm instead of reporting them fatal.

maybeRecoverProducerID gates the narrowed KIP-890p2 recoverable set on
tx890p2 (the mode our transactions actually ran under) instead of
supportsKeyVersion(EndTxn, 5): brokers advertise EndTxn v5 regardless of
the finalized transaction.version, so the old gate disabled the
KIP-360/KIP-588 recovery on 4.0+ clusters still operating TV1 semantics,
killing the producer on routine transaction-timeout aborts.

EndTransaction under KIP-890p2 now issues an EndTxn abort when produces
were attempted but none succeeded. Produce requests implicitly register
their partition with the transaction coordinator durably BEFORE the data
append, and the client marks addedToTxn only on produce success, so a
transaction whose every produce failed skipped EndTxn entirely: the
broker-side transaction stayed ongoing until the transaction timeout and
the next transaction's produces (same epoch, no bump) silently joined it.
Aborting an empty TV2 transaction is legal and bumps the epoch.

supportsKIP890p2 no longer opts in when the user's MaxVersions cap is
below the 890p2 wire versions (produce v12, EndTxn v5, TxnOffsetCommit
v5). Opting in skipped AddPartitionsToTxn while version negotiation kept
produce at v11 or lower, so brokers rejected every transactional batch
with INVALID_TXN_STATE.

doWithConcurrentTransactions orders its backoff escalation longest-first;
the switch cases were ordered so the 500ms and 1s arms were unreachable.

The produce-path TransactionAbortable arm no longer continues into
producing: doTxnReq's error path already removed every batch from the
transaction and decremented its inflight count, so producing them anyway
double-decremented the uint8 inflight counter (wrapping to 255 and
permanently wedging the recBuf drain gate) and could re-drain in-flight
batches. The error now takes the fail-producer-id arm, which delivers the
same error to the records and remains recoverable via EndTransaction.
(Unreachable from spec brokers today: the client pins AddPartitionsToTxn
to v3, which cannot carry TRANSACTION_ABORTABLE.)

Also refreshes stale docs (RequireStableFetchOffsets is a permanent
no-op; the pre-KIP-447 sleep is 500ms not 200ms; KIP-890 recoverability
phrasing) and drops commitTxn's unused cancel plumbing, whose comment
described group-context cancellation that never existed: the request
deliberately rides the caller's context.

Six tests from audit round 4 (txn-churn.md). Four reproduce the bugs
fixed in the previous commit and assert the correct behavior:
InitProducerID survives one injected CONCURRENT_TRANSACTIONS,
BeginTransaction survives a retriable InitProducerID load failure
(NOT_ENOUGH_REPLICAS surfaces via the reload path rather than as a
fatal producer state), EndTransaction aborts a KIP-890p2 transaction
whose every produce failed, and a MaxVersions-pinned client produces
transactionally against a transaction.version=2 cluster via the
explicit AddPartitionsToTxn path. Two controls pin behavior that
already held: the TV1 failed-produce abort, and the coordinator
wrapper retrying EndTxn through one NOT_COORDINATOR.

A metadata response can omit a partition we already know about: brokers
serve metadata from their own caches, which lag the controller, so
after a CreatePartitions a refresh that lands on a not-yet-caught-up
broker reports the old partition count. The merge keeps the partition
around for exactly this reason (metadata.go, "we are keeping the
partition around for safety"), but bumpRepeatedLoadErr treated the
errMissingMetadataPartition it bumps with as terminally non-retryable:
willFail reduced to canFail, so for the default idempotent producer
every buffered-but-never-sent record was failed on the FIRST stale
refresh, and a transactional producer was forced into a spurious abort.
The failure is purely client-manufactured: the broker still has the
partition and a drain moments later would have succeeded, yet one
lagging broker response failed records that the infinite-retry defaults
(recordRetries=MaxInt64, recordTimeout=0) promise to keep retrying.
The Java client parks such batches and only delivery.timeout.ms
expires them.

Fix: treat errMissingMetadataPartition as retryable, bounded by the
same unknown-topic fail limit as UNKNOWN_TOPIC_OR_PARTITION and
UNKNOWN_TOPIC_ID - it is the metadata-side twin of those broker
errors. Transient omissions (the CreatePartitions propagation window)
now heal: the partition is restored on a later refresh, the same
recBuf resumes with sequences intact, and the records deliver. The
permanent case (a topic recreated with fewer partitions) still fails
records once the limit (UnknownTopicRetries, default 4) trips via the
metadata retry loop, so nothing hangs forever.

Repros in pkg/kfake/partition_count_test.go:
TestProduceTransientMissingPartitionKeepsRecords fails before this
commit (record failed during the stale window) and passes after;
TestProducePersistentMissingPartitionStillFails guards the bound.

A share coordinator assigns by topic ID + partition index and can hand
out newly added partitions before the client's metadata has seen the
CreatePartitions (metadata is served from per-broker caches that lag
the controller). assignPartitions skipped such partitions - correct in
the moment - but nothing ever retried them, in two compounding ways:

1. The skip was recorded as success: nowAssigned stored the full
   broker assignment including the skipped partition, and the add loop
   skipped anything already in nowAssigned, so even a broker re-send
   of the identical assignment would not activate it.
2. The member epoch is acked on every heartbeat response regardless,
   so the broker considers the assignment delivered and never re-sends
   it; on nil-assignment heartbeats, handleHeartbeatResp only
   re-resolved unresolved topic IDs - an unactivatable partition index
   of a resolved topic was retried by nothing.

Net effect: the new partition was silently never consumed (no error,
nothing above debug logs) until some unrelated assignment change. The
regular 848 consumer does not have this hole: its assigned-but-unknown
partitions funnel into the offset-load machinery, whose loads are
retried on metadata updates until they apply.

Fix: the add loop now skips based on the cursor's own activation state
(assigned.Swap) rather than on nowAssigned, and sets pendingAssigns
when it cannot activate an assigned partition; while pendingAssigns is
set, handleHeartbeatResp re-returns the current assignment on
nil-assignment heartbeats so activation is retried (each pass
re-triggers a metadata refresh, mirroring the existing
unresolved-topic-ID retry). Topics absent from tps entirely
(purged/unsubscribed) deliberately do not set the flag: they will
never appear in tps, and the broker is guaranteed to send a new
assignment in response to the subscription change.
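
The activation rule can be sketched as follows. This is a simplified model under stated assumptions: the `cursor` type and `activate` function are hypothetical, the slice of cursors stands in for the partitions the client's metadata currently knows, and the returned `pending` flag plays the role of pendingAssigns.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursor is a hypothetical per-partition consume cursor.
type cursor struct {
	assigned atomic.Bool
}

// activate turns on every assigned partition that metadata already knows and
// reports whether any assigned partition could not be activated yet.
func activate(assignment []int32, cursors []*cursor) (activated []int32, pending bool) {
	for _, p := range assignment {
		if p < 0 {
			continue // defensive: a negative index can never become valid
		}
		if int(p) >= len(cursors) {
			pending = true // metadata has not caught up; retry on a later heartbeat
			continue
		}
		// Skip based on the cursor's own state, not on a record of what the
		// broker sent: Swap returns the previous value.
		if !cursors[p].assigned.Swap(true) {
			activated = append(activated, p)
		}
	}
	return activated, pending
}

func main() {
	cursors := []*cursor{{}, {}}
	assignment := []int32{0, 1, 2}

	got, pending := activate(assignment, cursors)
	fmt.Println(got, pending)

	// Metadata catches up with the added partition; replaying the same
	// assignment activates only the partition that was skipped.
	cursors = append(cursors, &cursor{})
	got, pending = activate(assignment, cursors)
	fmt.Println(got, pending)
}
```

Because the skip decision lives on the cursor, replaying an identical assignment is safe: already-active partitions are left alone and only the previously unactivatable one is picked up.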

Repro in pkg/kfake/partition_count_test.go:
TestShareAssignedNewPartitionStaleMetadata fails before this commit
(the new partition is never consumed) and passes after;
Test848AssignedNewPartitionStaleMetadata guards the 848 sibling chain.

Real brokers reject offset commits for partitions their metadata does
not know with UNKNOWN_TOPIC_OR_PARTITION at the API layer, before the
group coordinator sees the request (KafkaApis.handleOffsetCommitRequest
checks metadataCache.getLeaderAndIsr per partition); the coordinator
itself stores offsets blindly. kfake accepted and persisted commits for
any topic/partition, so client paths that handle per-partition commit
rejection were untestable against it.

Validate against a topic metadata snapshot taken when the request is
dispatched to the group goroutine (in the cluster goroutine, where
c.data is safe to read - the same pattern ConsumerGroupHeartbeat uses
for creq.topicMeta), and commit only the partitions that pass. Auth is
still checked first, matching the real broker's ordering. Both the
classic and 848 commit handlers funnel through fillOffsetCommitWithACL,
so one check covers both. TxnOffsetCommit is deliberately untouched:
its broker-side validation has not been verified.

Also adds the partition-count audit regression tests: producer
transient/persistent missing-partition behavior (stale metadata views
replayed via a Metadata ControlKey), share and 848 new-partition
assignment under stale client metadata, and the commit validation
above.

…; back off top-level errors

The share consumer's only leader-move heal was the CurrentLeader hint in
ShareFetch/ShareAcknowledge error responses, and the broker populates
that hint only for NOT_LEADER_OR_FOLLOWER / FENCED_LEADER_EPOCH and only
when it already knows the new leader (KafkaApis processShareFetchResponse).
Every other error shape - a leaderless failover window, a dead broker
(no response at all), UNKNOWN_TOPIC_ID propagation, storage errors - had
no heal: nothing in the share fetch path ever triggered a metadata
update, so an affected partition sat erroring until the periodic refresh
(MetadataMaxAge, default 5 minutes). The classic fetch path heals both
shapes: per-partition response errors collect into updateWhy and trigger
an immediate update, and the transport-failure backoff opportunistically
triggers one (this is how classic cursors escape a dead broker).

On top of that, the share path surfaced every hint-less per-partition
error straight into the polled fetch. Classic strips retriable errors
unless KeepRetryableFetchErrors is set, and gives UNKNOWN_TOPIC_ID a
5-strike grace via cursor.unknownIDFails before surfacing (a just
created topic transiently returns it while brokers sync; persistent
means recreation, which deliberately stalls loudly). The Java share
consumer likewise swallows all retriable share fetch errors and requests
a metadata update (ShareFetchCollector.handleInitializeErrors), throwing
only auth/corruption. So under a routine leader move, kgo share users
polled kerr.NotLeaderForPartition errors that the Fetches.Errors docs
describe as restart-worthy, while consumption silently stalled.

Finally, a top-level ShareFetch error reset the session and returned
straight into the next fetch: no buffered fetch to pace on, no backoff.
A persistent top-level error - e.g. GROUP_AUTHORIZATION_FAILED after an
ACL revocation mid-run, which the broker answers top-level - hot-looped
at round-trip pace (5301 requests in 2s in the in-process repro).
Transport errors and all-errors-stripped responses already backed off;
top-level errors now take the same backoff.

This commit makes the share fetch path mirror classic:

  - per-partition errors without a leader hint are classified: retriable
    errors are stripped (KeepRetryableFetchErrors restores surfacing),
    UNKNOWN_TOPIC_ID gets the same 5-strike grace counter on the share
    cursor, non-retriable errors still surface
  - collected errors trigger a metadata update with classic's exact
    split: pure unknown-topic reasons ride the debounced trigger,
    anything else triggers immediately
  - the share backoff opportunistically triggers a (debounced) metadata
    update, healing dead-broker cursors
  - top-level response errors back off before refetching

Stripped-empty responses flow into the existing allErrsStripped backoff,
which paces hint-less error retries. The hinted-move arm stays first and
is unchanged. Share-only blast radius; classic/848/direct consumers and
producers are untouched.
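
The per-partition classification can be sketched like this. It is an illustration only: the `shareCursor` type, the sentinel errors, surfacing on the fifth consecutive strike, resetting the strike count on success, and the way the keep-retryable flag interacts with UNKNOWN_TOPIC_ID are all assumptions, not the client's confirmed behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for per-partition share fetch errors.
var (
	errNotLeader      = errors.New("NOT_LEADER_OR_FOLLOWER")    // retriable
	errUnknownTopicID = errors.New("UNKNOWN_TOPIC_ID")          // retriable, with grace
	errTopicAuth      = errors.New("TOPIC_AUTHORIZATION_FAILED") // not retriable
)

// unknownIDGrace is the assumed strike count at which UNKNOWN_TOPIC_ID surfaces.
const unknownIDGrace = 5

type shareCursor struct {
	unknownIDFails int
}

// surface reports whether a hint-less partition error should reach the
// polling caller; keepRetryable models an opt-in to see retriable errors.
func (c *shareCursor) surface(err error, keepRetryable bool) bool {
	switch {
	case err == nil:
		c.unknownIDFails = 0 // assumed: a clean response clears the strikes
		return false
	case errors.Is(err, errUnknownTopicID):
		c.unknownIDFails++
		return keepRetryable || c.unknownIDFails >= unknownIDGrace
	case errors.Is(err, errNotLeader):
		return keepRetryable // stripped unless the user opted in
	default:
		return true // non-retriable errors always surface
	}
}

func main() {
	c := &shareCursor{}
	fmt.Println("not leader:", c.surface(errNotLeader, false))
	fmt.Println("topic auth:", c.surface(errTopicAuth, false))
	for i := 1; i <= unknownIDGrace; i++ {
		fmt.Println("unknown id strike", i, c.surface(errUnknownTopicID, false))
	}
}
```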

Repros (fail pre-fix, pkg/kfake/share_churn_test.go):
TestShareFetchLeaderMoveNoHintHeals (error surfaced + 15s stall, now
heals in ~0.3s), TestShareFetchTransportErrorTriggersMetadata (18s
stall, now ~0.9s), TestShareFetchTopLevelErrorBackoff (5301 requests/2s,
now backoff-paced); TestShareFetchLeaderMoveHintHeals guards the hint
arm on both sides.

…ints

The share assignment and cursor-move paths bounds-checked partition
numbers from broker responses only on the upper side:
assignPartitions's revoke and add loops and applyMoves all did
"if int(p) >= len(td.partitions) { continue }" before indexing
td.partitions[p]. A negative partition number passes that check and
panics with index out of range, crashing the process - the revoke/add
loops run on the share manage goroutine, applyMoves on the metadata
loop.

The values come straight from the wire: assignment partitions from
ShareGroupHeartbeat responses (stored wholesale into nowAssigned, so
the revoke loop replays them on the next assignment change too), and
move targets from ShareFetch/ShareAcknowledge response partitions that
carry errors with CurrentLeader hints. A sane broker never sends a
negative partition, but a buggy or hostile one can, and the classic/848
assignment funnel already routes exactly this to safety
(consumer.go assignPartitions: offset.at >= 0 && partition >= 0 &&
partition < len bounds check). The share path now skips negative
indexes too: silently in the revoke loop and applyMoves (matching the
too-large skip), with a warn in the add loop. The add-loop skip
deliberately does not set pendingAssigns: unlike a too-large index,
which heals once metadata catches up to grown partitions, a negative
index can never become valid, so retrying activation each heartbeat
would spin forever.
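
The gap in an upper-bound-only check is small enough to show directly. The helper names below are hypothetical; `upperOnly` mirrors the quoted pre-fix condition and `inBounds` adds the lower bound.

```go
package main

import "fmt"

// upperOnly mirrors the pre-fix check: it only rejects indexes that are too large.
func upperOnly(p int32, n int) bool {
	return int(p) < n
}

// inBounds also rejects negative indexes before they are used to index a slice.
func inBounds(p int32, n int) bool {
	return p >= 0 && int(p) < n
}

// usable filters a wire-provided assignment down to indexes that are safe to
// use against a slice of length n.
func usable(assignment []int32, n int) []int32 {
	var out []int32
	for _, p := range assignment {
		if inBounds(p, n) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	const known = 3
	fmt.Println("upperOnly(-1):", upperOnly(-1, known)) // passes the old check
	fmt.Println("inBounds(-1): ", inBounds(-1, known))
	fmt.Println("usable:", usable([]int32{0, -1, 7}, known))
}
```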

Repro (panics pre-fix with "index out of range [-1]" in
assignPartitions): TestShareAssignmentNegativePartitionNoPanic in
pkg/kfake/share_churn_test.go, which hijacks one post-join heartbeat to
deliver an assignment of partitions [0, -1] and asserts partition 0
keeps consuming.

…roker

ShareFetch and ShareAcknowledge response partitions carry an inline
(untagged) CurrentLeader struct that is always serialized. The Java
schema default is leaderId=-1/leaderEpoch=-1, and a real broker
serializes that whenever it does not populate a hint - KafkaApis fills
CurrentLeader only for NOT_LEADER_OR_FOLLOWER and FENCED_LEADER_EPOCH,
and only when it knows the new leader. kfake's donep() response
builders left the Go zero value 0/0 on every other error partition
(UnknownTopicID, UnknownTopicOrPartition, TopicAuthorizationFailed, ack
errors), and kmsg's generated Default() does not default these fields.

Clients treat LeaderID >= 0 && LeaderEpoch >= 0 as a valid KIP-951-style
move hint, and kfake node IDs start at 0 - so every kfake hint-less
error partition silently told clients "the leader is node 0, epoch 0",
making them migrate share cursors to broker 0 instead of exercising the
no-hint error path that real brokers produce. Set -1/-1 in both donep()
builders; the NotLeaderForPartition arms still overwrite with the real
leader.
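
Why the Go zero value matters here can be shown with a small model. The `currentLeader` struct and helper names are hypothetical; the validity condition is the one quoted above.

```go
package main

import "fmt"

// currentLeader models the inline leader hint carried on a response partition.
type currentLeader struct {
	LeaderID    int32
	LeaderEpoch int32
}

// isMoveHint applies the client-side validity condition quoted above.
func (c currentLeader) isMoveHint() bool {
	return c.LeaderID >= 0 && c.LeaderEpoch >= 0
}

// noHint returns the explicit "no hint" value (-1/-1).
func noHint() currentLeader {
	return currentLeader{LeaderID: -1, LeaderEpoch: -1}
}

func main() {
	var zero currentLeader // what an unset struct serializes as: 0/0
	fmt.Println("zero value is a hint:", zero.isMoveHint())
	fmt.Println("-1/-1 is a hint:     ", noHint().isMoveHint())
}
```

With node IDs starting at 0, the zero value is indistinguishable from "leader is node 0, epoch 0", so a response builder has to set -1/-1 explicitly.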

Found during the round-9 share-churn audit: the no-hint repros
(share_churn_test.go) model exactly the responses real brokers send for
leaderless windows and propagation lag, which kfake could not produce.

Round-9 audit repros for the share consumer under broker churn
(share-churn.md). Five tests, all -race:

  - TestShareFetchLeaderMoveNoHintHeals: a leader move whose NOT_LEADER
    responses carry no CurrentLeader hint (the leaderless-window shape,
    and the only shape for error codes the broker never hints for) must
    not surface retriable errors to poll and must heal via a triggered
    metadata refresh. Pre-fix: NotLeaderForPartition surfaced on every
    poll and the partition stalled for the full window.
  - TestShareFetchLeaderMoveHintHeals: control; the CurrentLeader hint
    path migrates without metadata, passing pre- and post-fix.
  - TestShareFetchTransportErrorTriggersMetadata: every ShareFetch to
    the old leader has its connection killed (a dead broker, no
    response, no hint possible); the fetch backoff must trigger a
    metadata refresh like the classic source backoff. Pre-fix: stalled.
  - TestShareFetchTopLevelErrorBackoff: persistent top-level errors
    (group auth revoked mid-run) must back off; pre-fix the fetch loop
    was round-trip paced (5301 requests in the 2s window).
  - TestShareAssignmentNegativePartitionNoPanic: an assignment of
    partitions [0, -1] must be skipped, not crash; pre-fix it panicked
    the manage goroutine with index out of range [-1].

…tionsToTxn

On a pre-KIP-890p2 cluster, txnReqBuilder.add puts a partition in the
AddPartitionsToTxn request only the first time it joins the transaction;
later batches for that partition ride produce requests with no add. When
an AddPartitionsToTxn failed, doTxnReq's deferred cleanup ran
removeFromTxn over EVERY batch in the produce request, clearing
addedToTxn for partitions whose membership was a broker-acked fact from
an earlier add of the same transaction.

EndTransaction's anyAdded walk then saw no added partitions and returned
nil without issuing EndTxn -- before ever consulting the failed producer
ID (its own comment, 'anyAdded is true if the producer ID was failed',
documents the invariant this broke). The broker still had an ongoing
transaction holding previously appended batches, which the transaction
timeout eventually aborts: records whose produce promises succeeded, in
a commit that returned nil, are silently discarded. The same shape
reaches commit-nil via transport-failed adds when user-configured record
retries/timeouts fail the requeued batches before the coordinator heals.

The Java client only reverts pending adds on AddPartitionsToTxn failure
(pendingPartitionsInTransaction); acked adds and the sticky
transactionStarted flag are untouched, so its EndTxn is never skipped
after a partial add failure.

Scope the un-marking to partitions actually in the failed txnReq; every
batch in the produce request is still requeued (drain index reset +
inflight decrement). Also refuse to strip or fatal on response
partitions that were not in the request, so a buggy broker reply cannot
clobber acked membership either, and document why per-partition
TransactionAbortable is deliberately left in the request.
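
The difference between the two un-marking scopes can be sketched with plain maps. The `unmark` and `anyAdded` helpers and the partition numbers are hypothetical; the map stands in for per-partition addedToTxn flags.

```go
package main

import "fmt"

// unmark clears the added-to-transaction flag for the given partitions.
func unmark(added map[int32]bool, parts []int32) {
	for _, p := range parts {
		added[p] = false
	}
}

// anyAdded reports whether any partition is still marked as in the transaction.
func anyAdded(added map[int32]bool) bool {
	for _, a := range added {
		if a {
			return true
		}
	}
	return false
}

func main() {
	produceReq := []int32{0, 1} // batches for both partitions ride this request
	failedTxnReq := []int32{1}  // only partition 1 was in the failed add

	// Pre-fix shape: un-mark everything in the produce request, including
	// partition 0, whose add the broker acked earlier.
	broad := map[int32]bool{0: true}
	unmark(broad, produceReq)

	// Post-fix shape: un-mark only what the failed add actually contained.
	scoped := map[int32]bool{0: true}
	unmark(scoped, failedTxnReq)

	fmt.Println("broad anyAdded: ", anyAdded(broad))
	fmt.Println("scoped anyAdded:", anyAdded(scoped))
}
```

In the broad case nothing remains marked, so an end-of-transaction walk that depends on these flags would see nothing to end.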

Repro: TestAuditTxnV1AddPartitionsFatalKeepsEarlierAdds (fails pre-fix)
and TestAuditTxnV1AddPartitionsRetriableStripControl in pkg/kfake.

EndTransaction documents that a commit attempted while the producer ID
has an error returns kerr.OperationNotAttempted and that the caller
should then retry with TryAbort. The commit attempt, however, had
already consumed the transaction state before reaching the producer-ID
check: inTxn was cleared on entry and the anyAdded walk swapped every
recBuf's addedToTxn (and the group's offsetsAddedToTxn) to false. The
documented TryAbort retry then hit the !inTxn early return and reported
success without sending anything.

GroupTransactSession.End does this retry internally ('end transaction
with commit not attempted; retrying as abort'), so the flagship session
API logged an abort that never reached the broker. The broker-side
transaction stayed ongoing -- stalling read_committed consumers on the
LSO -- until the transaction timeout aborted it, or until a later
InitProducerID re-init happened to clear it for recoverable producer-ID
errors. The !inTxn gate predates the OperationNotAttempted contract
(dbd8d35, 2020); nothing documents the no-op as intended, and the Java
client never skips the EndTxn abort once a partition was added
(TransactionManager gates EndTxn only on its sticky transactionStarted).

When the commit is not attempted, restore what the call consumed (inTxn,
each swapped addedToTxn, offsetsAddedToTxn) before returning
OperationNotAttempted so the abort retry still sees the transaction and
issues EndTxn. producingTxn deliberately stays false: produces between
the failed commit and the abort retry fail fast with
errNotInTransaction rather than buffering against a failed producer ID.

Repro: TestAuditTxnAbortRetryAfterOperationNotAttempted in pkg/kfake
(fails pre-fix).

recBuf.inflight was a uint8, but the number of concurrent requests
holding a batch of one recBuf is bounded by the sink's inflight
semaphore, and with idempotency disabled that is the user's
MaxProduceRequestsInflightPerBroker -- which config validation does not
bound above. A value over 255 with 256+ buffered batches could wrap the
counter: createReq's 'inflight != 0 && !okOnSink' gate and decInflight's
zero check then fire at the wrong times, in the worst case clearing
inflightOnSink while requests are still in flight and letting a
migrated recBuf drain on a new sink concurrently with old-sink requests
(the cross-sink reordering inflightOnSink exists to prevent).

No runnable repro: the wrap needs more than 255 physically concurrent
in-flight produce requests, which is not feasible as a fast kfake unit
test. int32 makes the wrap unreachable (2^31 concurrent requests each
holding memory).
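
The wrap itself is ordinary Go unsigned arithmetic and can be demonstrated without a broker. The helper names are hypothetical; the counters are plain local variables, not the client's fields.

```go
package main

import "fmt"

// bumpUint8 increments a uint8 counter n times, as n concurrent requests would.
func bumpUint8(n int) uint8 {
	var c uint8
	for i := 0; i < n; i++ {
		c++ // wraps silently from 255 to 0
	}
	return c
}

// bumpInt32 does the same with an int32 counter.
func bumpInt32(n int) int32 {
	var c int32
	for i := 0; i < n; i++ {
		c++
	}
	return c
}

// decUint8 decrements once; decrementing zero wraps to 255.
func decUint8(c uint8) uint8 {
	return c - 1
}

func main() {
	fmt.Println("uint8 after 256 increments:", bumpUint8(256))
	fmt.Println("int32 after 256 increments:", bumpInt32(256))
	fmt.Println("uint8 zero decremented:    ", decUint8(0))
}
```

A counter that reads zero while 256 requests are outstanding is exactly the condition under which a zero check fires at the wrong time.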
…duplicated

handleReqResp deleted req.metrics entries when a produce response named
a topic (or topic ID, or partition) that was not in the request. For a
genuinely invented entry the delete was a no-op: metrics entries only
exist for batches AppendTo actually serialized. But a DUPLICATED reply
entry takes the same arm -- processing the first occurrence empties that
topic/partition out of req.batches -- and there the delete erased the
metrics of a batch the first occurrence legitimately finished, so the
deferred metrics hook silently skipped OnProduceBatchWritten for a
successfully produced batch.

Drop the deletes; the one that does real work (clearing metrics for
batches whose response carried an error, i.e. !didProduce) stays.

Repro: TestAuditProduceDuplicateResponseEntryKeepsHook in pkg/kfake
(fails pre-fix).

- sink.seqResps referenced a 'seqRespsMu' that has not existed since the
  field became a ring with an internal mutex.
- doSequenced and produceRequest.firstCancelingCtx claimed request
  cancellation applies 'if and only if' idempotency is disabled;
  AllowIdempotentProduceCancellation (0338467) is a second opt-in
  (mutually exclusive with transactions via config validation).
- Produce's promise documentation said promises 'should be relatively
  fast'; make the real contract explicit: promises are called serially
  and must not block on the client. finishRecordPromise calls the
  promise before decrementing buffered counts, and the cond broadcast
  that wakes blocked producers and Flush is deferred until the promise
  worker drains its queue, so a promise blocking on Produce-at-limit or
  Flush waits on a wakeup only its own return can deliver.
  AbortingFirstErrPromise already dodges this by spawning a goroutine.

Round 3 of the audit program: subsystem sweep of pkg/kgo/sink.go.

- TestAuditTxnV1AddPartitionsFatalKeepsEarlierAdds: a fatal
  AddPartitionsToTxn for a new partition must not un-mark partitions
  that joined the transaction via an earlier acked add; pre-fix,
  EndTransaction(TryCommit) returned nil without issuing EndTxn while
  the broker held an ongoing transaction with an appended batch
  (timeout-aborted later = silent loss after a nil commit).
- TestAuditTxnV1AddPartitionsRetriableStripControl: a retriable
  per-partition add error strips, requeues, re-adds, and produces
  within one Flush; held pre-fix and guards the txnReq membership
  check.
- TestAuditTxnAbortRetryAfterOperationNotAttempted: the documented
  TryAbort retry after a not-attempted commit must issue EndTxn;
  pre-fix the commit attempt consumed inTxn/addedToTxn and the retry
  silently no-opped (GroupTransactSession.End's internal retry-as-abort
  included).
- TestAuditProduceDuplicateResponseEntryKeepsHook: a duplicated produce
  response topic entry must not erase the finished batch's metrics;
  pre-fix OnProduceBatchWritten never fired for the produced batch.

All three BUG REPRODUCED tests verified failing against pre-fix kgo
(51f4fcb) and passing with the sink-sweep fixes, -race.

…oker

A fetch response can redirect a partition to a preferred read replica
(KIP-392 follower fetching) whose broker the client has not yet learned
from metadata - e.g. a freshly added replica that the fetched-from broker
already knows about before the client's periodic refresh catches up.

source.fetch deletes such a cursor from the request's used offsets and
calls cursorOffsetPreferred.move() to migrate it. When move() found no
source for the preferred broker it triggered a metadata update and
returned WITHOUT re-enabling the cursor. The cursor was use()'d (made
unusable) when the request was built, and nothing else makes it usable
again:

  - the defer in fetch() that would finishUsing()/allowUsable() it skips
    it, because move()'s caller already deleted it from the used offsets;
  - the leader is unchanged, so the triggered metadata refresh performs no
    cursor migration (migrateCursorTo only runs on a leader/epoch change),
    and merely learning the new broker never touches cursor usability.

The partition is then silently never consumed again until an unrelated
session restart (rebalance, assign, or a real leader change). No error
surfaces - a silent stall.

Re-enable the cursor on its current (leader) source in the !exists arm via
allowUsable(). We keep consuming from the leader, and the forced metadata
update lets a later fetch's preferred replica be honored once the broker
is known. triggerUpdateMetadataNow coalesces (non-blocking send), so the
re-fetch loop cannot spam metadata.
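
The coalescing behavior relied on above can be sketched as a non-blocking send on a one-slot channel. This is an illustration only: `updater`, `newUpdater`, and `trigger` are hypothetical names, not the client's actual types.

```go
package main

import "fmt"

// updater is a hypothetical stand-in for a metadata loop's trigger channel.
type updater struct {
	now chan struct{} // capacity 1: at most one pending "update now" signal
}

func newUpdater() *updater { return &updater{now: make(chan struct{}, 1)} }

// trigger requests an immediate update without blocking. If a request is
// already pending, the new one is dropped, so repeated calls coalesce.
func (u *updater) trigger() bool {
	select {
	case u.now <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	u := newUpdater()
	for i := 0; i < 3; i++ {
		fmt.Println(u.trigger()) // true, then false, false
	}
	fmt.Println("pending:", len(u.now)) // pending: 1
}
```

However often a re-fetch loop calls `trigger`, at most one update is queued until the consumer of the channel drains it.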

Regression: TestAuditPreferredReplicaUnknownBrokerNoStrand in
pkg/kfake/source_sweep_test.go consumes 0/5 records pre-fix (stranded) and
5/5 post-fix.
processRecordBatch read batch.NumRecords straight off the wire and passed
it to ensureLen, which sizes a slice with s[:n]. A buggy or malicious
broker (the CRC only guards accidental in-transit corruption, and the
exported ProcessFetchPartition accepts arbitrary caller input) sending a
record batch with NumRecords < 0 panicked the fetch goroutine with "slice
bounds out of range [:-1]", crashing the client. A hostile huge NumRecords
drove an unbounded up-front allocation of NumRecords * sizeof(kmsg.Record).

Guard both: reject a negative count as a corrupt batch (matching the Java
client's DefaultRecordBatch "Found invalid record count"
InvalidRecordException), and clamp the count to the available record bytes
- every record needs at least one byte, so a count exceeding the byte
count is impossible for a well-formed batch. The true decodable count is
still recomputed by readRawRecordsInto, and the KAFKA-5443 truncation
defer leaves the offset unadvanced whenever it disagrees with numRecords,
so valid batches are unaffected (their count is always below the byte
count).
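
A minimal sketch of the two guards, assuming a hypothetical helper `boundedRecordCount` (the real check lives inside processRecordBatch and may be shaped differently):

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidRecordCount = errors.New("invalid record count")

// boundedRecordCount is a hypothetical helper: it rejects a negative wire
// count and clamps the count to the number of record bytes available,
// since every record occupies at least one byte.
func boundedRecordCount(wireCount int32, recordBytes int) (int, error) {
	if wireCount < 0 {
		return 0, errInvalidRecordCount
	}
	n := int(wireCount)
	if n > recordBytes {
		n = recordBytes
	}
	return n, nil
}

func main() {
	fmt.Println(boundedRecordCount(-1, 100))      // 0 invalid record count
	fmt.Println(boundedRecordCount(1<<30, 100))   // 100 <nil>
	fmt.Println(boundedRecordCount(3, 100))       // 3 <nil>
}
```

The clamp only bounds the up-front allocation; the number of records actually decoded is still whatever the parser finds in the bytes.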

Regression: TestAuditFetchNegativeRecordCountNoPanic (panics pre-fix) and
TestAuditFetchHugeRecordCountBounded in pkg/kfake/source_sweep_test.go.
Round 5 of the FRANZ_AUDIT.md program (subsystem sweep of pkg/kgo
source.go + record_and_fetch.go: the fetch path). Regression tests for the
two fixes landed this round:

  - TestAuditPreferredReplicaUnknownBrokerNoStrand: a fetch hijacked into a
    preferred-replica move to an unknown broker must not strand the cursor
    (0/5 records pre-fix, 5/5 post-fix).
  - TestAuditFetchNegativeRecordCountNoPanic: a record batch with a
    negative record count must error, not panic in ensureLen.
  - TestAuditFetchHugeRecordCountBounded: an oversized record count must
    not drive a giant up-front allocation.

buildRecordBatchBytes hand-builds a v2 record batch with a valid
Castagnoli CRC, so the corrupt-count batches still pass the client's
default CRC check - the realistic malicious/buggy-broker case, since CRC
only defends against accidental in-transit corruption.
handleReqResp looked up each response partition in req.usedOffsets and
processed it with no guard against the same partition (or topic) appearing
more than once in one response. A correct broker never duplicates, but a
buggy or hostile one that does was processed against the same
*cursorOffsetNext twice:

  - the partition's error/records surfaced to the user twice (two
    FetchPartition entries for one partition);
  - a duplicated preferred-replica redirect enqueued two move() calls for
    one cursor - the second reads/writes cursor.source after the first
    made the cursor eligible on its new source, the exact concurrent-
    source hazard that #1167 guards against.

Track the *cursorOffsetNext pointers already handled in a per-response
seen set and skip (with a warning) any repeat. We deliberately do NOT
dedup by deleting from req.usedOffsets: that map is what re-enables each
cursor after the response, so removing an entry would strand the
legitimately-processed cursor. Keying by the stable per-request
*cursorOffsetNext also collapses a duplicated topic (same partition
pointer) while leaving a topic legitimately split across response entries
(distinct partitions) fully processed.
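
The pointer-keyed seen set can be sketched as follows. The `cursor` and `respPartition` types and the `dedupe` function are hypothetical simplifications; the point is only that keying by pointer skips repeats without touching the map that later re-enables cursors.

```go
package main

import "fmt"

// cursor is a hypothetical stand-in for the per-request cursor offset.
type cursor struct {
	topic     string
	partition int32
}

// respPartition is one partition entry in a (possibly hostile) response.
type respPartition struct {
	cur *cursor
	err string
}

// dedupe keeps the first entry for each cursor pointer and skips repeats.
func dedupe(resp []respPartition) []respPartition {
	seen := make(map[*cursor]struct{}, len(resp))
	var out []respPartition
	for _, p := range resp {
		if _, dup := seen[p.cur]; dup {
			continue // a real client would also log a warning here
		}
		seen[p.cur] = struct{}{}
		out = append(out, p)
	}
	return out
}

func main() {
	a := &cursor{"foo", 0}
	b := &cursor{"foo", 1}
	resp := []respPartition{{a, "boom"}, {a, "boom"}, {b, ""}}
	fmt.Println(len(dedupe(resp))) // 2
}
```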

Regression: TestAuditFetchDuplicatePartitionDeduped in
pkg/kfake/source_sweep_test.go (error surfaces twice pre-fix, once post-fix).
TestAuditFetchDuplicatePartitionDeduped injects a fetch response listing
one partition twice with a non-retryable error and asserts the client
surfaces it exactly once - exercising the handleReqResp dedup that guards
against a buggy/hostile broker duplicating a partition (or topic) in a
single response.
A heartbeat response can assign topic IDs the client cannot yet map to
names (newly created topic, metadata not refreshed). Those IDs park in
g848.unresolvedAssigned and are folded into every heartbeat's owned
Topics so the server sees them acknowledged.

That state was never cleared on a member reset. After any rejoin-class
event while a topic was still unresolved (UNKNOWN_MEMBER_ID after an
outage, FENCED/STALE_MEMBER_EPOCH, or any manage-loop error), the next
initialJoin's request carried the old member's unresolved topics in
Topics at MemberEpoch 0. The broker rejects every (re)join whose
owned-partitions list is non-empty with INVALID_REQUEST
("TopicPartitions must be empty when (re-)joining"), and the only
other thing that clears unresolvedAssigned is a successful
assignment-carrying response - which a rejected join never produces.
The consumer was permanently wedged: every join retry failed
identically until process restart.

Clear unresolvedAssigned in initialJoin next to the other per-member
resets. The join response always re-delivers the member's full
assignment, so nothing is lost. The Java client clears its equivalent
unresolved-IDs cache the same way on every transition to joining, and
the in-code invariant comment ("our initialJoin should *always* have
an empty Topics in the request") already promised this.

Repro: TestAudit848StaleUnresolvedJoin in pkg/kfake (fails pre-fix
with the join loop erroring INVALID_REQUEST forever; kfake now
mirrors the broker's join validation).
… notification

The 848 manage loop silently restarts the heartbeat session on
retryable broker/dial/coordinator errors and counts the restarts in
consecutiveTransientRestarts; every cfg.retries consecutive restarts
it injects a fake fetch error (ErrGroupSession "heartbeat has been
failing for N consecutive attempts") so the user learns the group is
unreachable - that injection is the path's only user-visible signal.

The transient arm sets err = nil to keep the session loop going, and
then fell through to the shared 'if err == nil reset' tail, zeroing
the counter on the same iteration that incremented it. The counter
could never pass 1, so with the default 20 retries the notification
was unreachable: a permanently unreachable coordinator was a silent
infinite restart loop.

Continue the loop directly from the transient arm instead. The reset
now only runs for exits whose nil err means a processed response
(rebalance / reassignment), which is what the counter's comment
("without any successful heartbeat in between ... Reset on success")
always intended.

Repro: TestAudit848TransientRestartNotification in pkg/kfake (fails
pre-fix: NOT_COORDINATOR on every heartbeat never surfaces any error
to poll).
should848 consulted broker-advertised versions only. A user pinning
MaxVersions to a version set whose ConsumerGroupHeartbeat max is 0
(e.g. pinning 3.7/3.8 behavior against a 4.x cluster) got v0 on the
wire while the manage loop ran v1 semantics. v0 has no
SubscribedTopicRegex field, so kmsg silently dropped the regex from
the join: a regex consumer joined with no subscription at all and
silently consumed nothing, forever. The classic fallback - the right
outcome for a sub-v1 cap - never happened because the broker itself
advertises v1.

Gate supportsKIP848v1 on the MaxVersions cap as well, mirroring
supportsKIP890p2 (which gained the same guard for the KIP-890 wire
versions).

Repro: TestAudit848MaxVersionsV0FallsBackToClassic in pkg/kfake
(fails pre-fix: nothing is ever consumed; post-fix the client falls
back to the classic protocol and consumes).
twmb added 21 commits June 14, 2026 10:32
…ERROR

GroupTransactSession.End retries EndTransaction on a handful of error
codes. The OperationNotAttempted and TransactionAbortable arms both set
willTryCommit=false ("retry as abort") before the goto, because
EndTransaction consumes inTxn on its erroring call and a re-call no-ops
at its `if !inTxn { return nil }` guard. The UNKNOWN_SERVER_ERROR arm,
added in 3ecaff2 when the OperationNotAttempted `if` became a `switch`,
omitted that downgrade.

The consequence is silent EOS data loss. Trace:

  1. End(TryCommit): offsets commit OK, heartbeat OK => willTryCommit=true.
  2. EndTransaction(commit=true) sets inTxn=false, issues EndTxn, the
     broker answers UNKNOWN_SERVER_ERROR. That code is uniquely excluded
     from failProducerID, so the producer ID stays good and inTxn stays
     false (only the OperationNotAttempted arm restores it).
  3. End's UNKNOWN_SERVER_ERROR (USE) arm goto-retries WITHOUT willTryCommit=false.
  4. The retried EndTransaction(commit=true) returns nil at its !inTxn
     guard, sending no EndTxn.
  5. endTxnErr==nil and willTryCommit==true, so End's success tail reports
     a committed transaction and setOffsets(postcommit) advances the
     consumer offsets - even though the broker's UNKNOWN_SERVER_ERROR left
     the commit unconfirmed and the transaction may have aborted. The
     consumer's input offsets move past records whose output may never
     become visible: silent loss.

Some brokers return UNKNOWN_SERVER_ERROR from EndTxn (the surrounding
comments cite Redpanda in certain versions); conformant Kafka does not
emit it there, so happy-path Kafka/kfake testing never hit this.

Fix: set willTryCommit=false in the USE arm, mirroring its two sibling
arms. The no-op retry then reports not-committed and End rewinds to the
last committed offsets, so the caller reprocesses (at-least-once) instead
of advancing past an unconfirmed commit. The dead backoff timer (it only
ever gated the no-op retry, never a real re-issue) is dropped so the arm
matches the others; `time` stays in use elsewhere in the file.
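
The false-commit mechanism can be reproduced in a toy model. Everything here (`txn`, `end`, `endSession`, the `downgrade` flag) is hypothetical and far simpler than the real session logic; it only shows why a nil from a no-op retry must not be read as a commit.

```go
package main

import (
	"errors"
	"fmt"
)

var errUnknownServer = errors.New("UNKNOWN_SERVER_ERROR")

// txn is a toy model: ending a transaction consumes inTxn even when the
// broker answers with an error, so a second call is a no-op returning nil.
type txn struct{ inTxn bool }

func (t *txn) end(commit bool, brokerErr error) error {
	if !t.inTxn {
		return nil
	}
	t.inTxn = false
	return brokerErr
}

// endSession retries once after an error. With downgrade set, the retry is
// treated as an abort, so the nil from the no-op is not reported as a commit.
func endSession(t *txn, downgrade bool) (committed bool) {
	willTryCommit := true
	err := t.end(willTryCommit, errUnknownServer)
	if err != nil {
		if downgrade {
			willTryCommit = false
		}
		err = t.end(willTryCommit, errUnknownServer)
	}
	return err == nil && willTryCommit
}

func main() {
	fmt.Println(endSession(&txn{inTxn: true}, false)) // true: models the pre-fix arm
	fmt.Println(endSession(&txn{inTxn: true}, true))  // false: models the post-fix arm
}
```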

This extends f25fe06's discipline (restore/abort the consumed txn state
so the documented EndTransaction retry behaves) from the
OperationNotAttempted arm to its UNKNOWN_SERVER_ERROR sibling
(FRANZ_AUDIT pattern 17, plus pattern 22: a manufactured nil treated as
success by an err==nil tail).

Repro: TestAuditTxnEndTxnUnknownServerErrorNotFalseCommit in
pkg/kfake/txn_resweep_test.go injects UNKNOWN_SERVER_ERROR on every
EndTxn during a GroupTransactSession commit; End(TryCommit) returns
committed=true pre-fix and committed=false post-fix (-race).
TestAuditTxnEndTxnUnknownServerErrorNotFalseCommit drives a
GroupTransactSession through a consume-then-commit and injects
UNKNOWN_SERVER_ERROR on every EndTxn via a keep-forever control. Pre-fix
End(TryCommit) returns committed=true (the consumer offsets were advanced
for a transaction the broker may have aborted); post-fix it returns
committed=false and rewinds. Companion to the kgo fix in the prior commit.
PurgeTopicsFromConsuming (consumer.purgeTopics) told the manage loop the
subscription shrank by calling g.rejoin unconditionally. For a classic group
that is correct: it bounces the heartbeat session so the member re-joins and
the group rebalances. For a next-gen (848) group it is wrong, and it violates
the invariant that ForceRebalance's 848 redirect documents as its own safety
justification: "nothing feeds rejoinCh in 848 mode."

In 848 the shared heartbeat loop receives the rejoinCh signal and converts it
to RebalanceInProgress - NOT errReassigned848 - so stopHeartbeating stays
false and heartbeats keep running while the session-end revoke
(revokeThisSession) executes g.nowAssigned.write(), a clone-modify-store
read-modify-write. A concurrent heartbeat's handleResp does
g.nowAssigned.store() with the server's latest assignment. If the revoke
clones before the heartbeat stores but stores after, the server's assignment
is lost from nowAssigned: the client briefly under- or over-claims partitions
(delayed pickup of newly assigned partitions, or a brief over-claim of
reassigned ones). 848 servers re-send the target assignment until the
member's owned set matches, so it self-heals within a heartbeat - but the
unnecessary session bounce and the lost-update race are exactly what
ForceRebalance's 848 redirect was added to prevent. purgeTopics was the lone
subscription-change feeder that forgot the 848 guard its two siblings carry
(findNewAssignments forces a heartbeat for new topics; ForceRebalance
redirects).

Fix: centralize the "subscription changed, reconcile per protocol" dispatch
in groupConsumer.signalSubscriptionChange (classic => rejoin; 848 => a
best-effort forced heartbeat whose next request re-reports the live
subscription, the server then driving the revoke through normal
reconciliation). Route all three feeders - findNewAssignments,
ForceRebalance, and purgeTopics - through it so the 848-vs-classic difference
cannot be re-derived and forgotten at a future call site, which is the root
cause here. Classic behavior and the two already-correct sites are unchanged.

The lost-update race is not deterministically reproducible, so
TestSignalSubscriptionChange848 asserts the fix's mechanism: in 848 mode the
dispatch forces a heartbeat and never feeds rejoinCh; in classic mode it
feeds rejoinCh. It fails pre-fix (the 848 arm fed rejoinCh).
…ies(0)

The KIP-848 manage loop silently restarts the heartbeat session on retryable
broker/dial/coordinator errors and, every cfg.retries consecutive restarts,
injects a fake fetch error (ErrGroupSession "heartbeat has been failing for N
consecutive attempts") so the user learns the group is unreachable - that
injection is the path's only user-visible signal.

The "every cfg.retries-th restart" gate was
`restarts >= cfg.retries && restarts % cfg.retries == 0`. cfg.retries has no
floor (RequestRetries(n) stores n verbatim; default 20), so a user who sets
RequestRetries(0) makes the modulo `restarts % 0` - an integer divide-by-zero
panic. The panic fires on the very first transient heartbeat error (a broker
restart, EOF, NOT_COORDINATOR, a refused dial - all routine) on the manage848
goroutine, which is unrecovered and so crashes the whole client process. The
sibling in-session gate (heartbeat(), `hbBrokerRetries < cfg.retries`) uses
`<` and is panic-safe; only this modulo arm divides.

retries=0 also disables the in-session heartbeat retry, so heartbeat()
propagates the first transient error immediately and every transient error is
its own restart. Dropping the notification for retries=0 would re-introduce
the silent-infinite-restart-loop that aae08a3 fixed for the general case, so
the fix must still fire it - on every restart. shouldNotify848Restart treats
retries < 1 as "notify on every restart," preserving the exact
every-Nth-restart behavior for retries >= 1.
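
A sketch of the gate as described, using a hypothetical `shouldNotifyRestart` with plain int arguments (the real shouldNotify848Restart may take different parameters):

```go
package main

import "fmt"

// shouldNotifyRestart reports whether the user should be told about the
// given consecutive restart. retries < 1 means "notify on every restart",
// which also avoids the modulo-by-zero of the original gate.
func shouldNotifyRestart(restarts, retries int) bool {
	if retries < 1 {
		return true
	}
	return restarts >= retries && restarts%retries == 0
}

func main() {
	fmt.Println(shouldNotifyRestart(1, 0)) // true
	for r := 1; r <= 4; r++ {
		fmt.Print(shouldNotifyRestart(r, 2), " ") // false true false true
	}
	fmt.Println()
}
```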

TestShouldNotify848Restart covers retries=0 (recovers the pre-fix panic into a
clean failure, since the real panic is on a background goroutine) and the
retries=2 cadence.
TestAudit848PurgeReconcilesViaHeartbeat is the end-to-end guard for the kgo
fix "do not feed rejoinCh from a next-gen (KIP-848) topic purge": an 848 group
consuming two topics must keep consuming the kept topic after
PurgeTopicsFromConsuming drops the other, with the dropped topic not
reappearing. The purge now reconciles through a forced heartbeat rather than a
rejoinCh session bounce. The mechanism repro is TestSignalSubscriptionChange848
in pkg/kgo; this exercises the real consumer/purge path against kfake.
writeTxnMarkersSharder.shard re-buckets each marker by the leader of its
partitions, grouping markers under a pidEpochCommit key and rebuilding a
WriteTxnMarkersRequestMarker from that key. The key (and the rebuilt
marker) carried ProducerID, ProducerEpoch, Committed, and TransactionVersion
but never CoordinatorEpoch, so every sharded marker went out with
CoordinatorEpoch=0 regardless of the user's value, and two markers that
differ only in CoordinatorEpoch collapsed into one bucket.

CoordinatorEpoch is a v0+ field the broker uses to detect fenced writers: a
real broker passes it to the group coordinator's completeTransaction for
__consumer_offsets markers and embeds it verbatim into the
EndTransactionMarker control record written to every other partition's log.
Dropping it to 0 stamps the wrong coordinator epoch into the on-disk marker
and feeds epoch 0 to the group-offset fencing path. The bug is reachable via
kadm.WriteTxnMarkers (which sets rm.CoordinatorEpoch and then RequestShards)
and any raw RequestSharded of a WriteTxnMarkersRequest.
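
The collapse is easiest to see with a map key. In this sketch `marker`, `bucketKey`, and `bucket` are hypothetical, trimmed-down stand-ins for the sharder's types; the only point is that a field missing from the key is both lost and merged.

```go
package main

import "fmt"

// marker is a hypothetical, trimmed-down stand-in for a transaction marker.
type marker struct {
	ProducerID       int64
	ProducerEpoch    int16
	Committed        bool
	CoordinatorEpoch int32
	Partitions       []int32
}

// bucketKey lists every field that must survive re-bucketing. Leaving
// CoordinatorEpoch out would merge markers that differ only in that field.
type bucketKey struct {
	ProducerID       int64
	ProducerEpoch    int16
	Committed        bool
	CoordinatorEpoch int32
}

func bucket(markers []marker) map[bucketKey][]int32 {
	out := make(map[bucketKey][]int32)
	for _, m := range markers {
		k := bucketKey{m.ProducerID, m.ProducerEpoch, m.Committed, m.CoordinatorEpoch}
		out[k] = append(out[k], m.Partitions...)
	}
	return out
}

func main() {
	got := bucket([]marker{
		{ProducerID: 1, CoordinatorEpoch: 5, Partitions: []int32{0}},
		{ProducerID: 1, CoordinatorEpoch: 6, Partitions: []int32{1}},
	})
	fmt.Println(len(got)) // 2: the epochs differ, so the markers stay apart
}
```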

CoordinatorEpoch has been dropped since the sharder was added (83f0dbe);
58e9695 later added txnVersion to the same key but missed it. This is the
WriteTxnMarkers sibling of the round-10 AddPartitionsToTxn VerifyOnly fix
(1e9b852): a txn sharder silently losing a per-request fencing field.

TestAuditWriteTxnMarkersPreservesCoordinatorEpoch drives the real sharder
with a marker carrying CoordinatorEpoch=5 and asserts it survives into the
issued shards; it reports 0 pre-fix.
…tionIsOpen

EnsureProduceConnectionIsOpen issues a forceOpenReq to each target broker.
handleReq opens the connection via loadConnection (dial + ApiVersions +
SASL), then -- for a forceOpenReq -- rewrites the request to a throwaway
ApiVersions and sends it through waitResp to probe the connection end to
end.

On an acks=0 produce connection that probe is unsafe. loadConnection
routes the forceOpenReq (it embeds a ProduceRequest, key 0) to cxnProduce,
and init spawns the discard goroutine (hasDiscard) that owns ALL reads on
the connection -- no promisedResp is ever pushed for acks=0 produce. The
rewritten ApiVersions is response-expecting, so the isNoResp switch does
not match it and waitResp starts a handleResps reader on the same socket
the discard goroutine is already reading: two concurrent io.ReadFulls
split one byte stream. The reply is consumed (in whole or part) by
whichever reader the kernel wakes, the other desyncs, the connection dies,
and EnsureProduceConnectionIsOpen -- meant to REDUCE produce latency by
pre-warming the connection -- instead hangs on the stranded read until its
timeout and returns a spurious error, having killed the very connection it
tried to warm. This affects every broker (the discard goroutine runs for
any acks=0 produce connection, not only EventHubs).

This is the same concurrent-reader hazard the in-place SASL reauth path
already avoids (the expiry arm gates on !hasDiscard; loadConnection
recreates an expired discard connection rather than reauthing it in
place). The force-open arm simply predated the hasDiscard mechanism. The
round-11 broker sweep asserted "discard connections never push
promisedResps" -- true for the produce/reauth paths it traced, but the
force-open arm is a third sibling that does.

Fix: a force-open request on a hasDiscard connection reports success
without writing or reading anything. The connection is already fully open
(init proved it works end to end); there is nothing more to probe, and we
must not start a second reader. Non-discard force-opens are unchanged
(handleResps is their sole reader).

Repro: TestAuditEnsureProduceConnectionAcks0NoConcurrentRead (pkg/kfake)
wraps each dialed conn in a concurrent-read detector and delays the
ApiVersions response so the discard read and the force-open read are
reliably both blocked at once; it observes a second concurrent reader and
an EnsureProduceConnectionIsOpen failure pre-fix, neither post-fix
(-race).
…ssion test

TestAuditEnsureProduceConnectionAcks0NoConcurrentRead reproduces the
force-open-vs-discard concurrent-reader race fixed in the prior commit.

An acks=0 produce connection runs the discard goroutine, which owns all
reads. Pre-fix, EnsureProduceConnectionIsOpen's force-open request was
rewritten to a response-expecting ApiVersions and sent through waitResp,
starting a handleResps reader that raced the discard goroutine on the same
socket.

The test wraps every dialed connection in a concurrent-read detector
(records whether two goroutines were ever inside Read on one connection at
once) and installs a SleepControl on ApiVersions so the discard read and
the force-open read are deterministically both blocked during the delay;
the race is otherwise dependent on which blocked reader the kernel wakes
first. Pre-fix the detector fires and EnsureProduceConnectionIsOpen
returns an error; post-fix neither happens. Runs under -race.
A Produce that hits the max-buffered limit (auto-flush mode) parks in a
goroutine waiting for space; Flush and BufferedProduceRecords account for it
via the bufferedRecords + blocked sum. On the SUCCESS path the parked
goroutine decrements blocked and the caller increments bufferedRecords under
one continuous lock hold, so the sum is unchanged and no waiter needs waking.

On the CANCEL path (record / produce / client context done -> drainBuffered)
the record is failed, not buffered: blocked is decremented with no
compensating bufferedRecords++, so the sum drops by one. The only broadcast
on that path is the one drainBuffered issues to wake the parked goroutine,
and it fires BEFORE the decrement. A Flush waiting on the sum that re-checks
its predicate in that window observes the stale pre-decrement value and goes
back to waiting; when the decrement then brings the sum to zero nothing
broadcasts, so Flush(context.Background()) hangs forever (a Flush with a
cancelable context degrades to a context-timeout). The classic trigger is a
graceful shutdown: the last buffered record completes - waking Flush, which
re-waits because a concurrently-blocked produce is still counted - just as
that blocked produce's context is canceled.

Fix: drainBuffered broadcasts after releasing the lock, i.e. after the
blocked decrement is visible, mirroring how every other sum-changing site
notifies the cond. The success path is unchanged (it conserves the sum and
needs no broadcast).

The exact lost wakeup is a scheduler race - whether the waiting Flush
re-acquires the lock before the parked goroutine decrements - so the repro
TestAuditFlushWokenByBlockedProduceCancel drives the scenario in a loop:
pre-fix an iteration hangs (the watchdog fires on iteration 0 in practice),
post-fix all 200 iterations complete because the cancel path now broadcasts
after the decrement (-race).
…rrier

migrateShareCursorTo relocates a share cursor between sources when a metadata
refresh observes a leader change for a share partition (mergeTopicPartitions's
partitionKindShare arm, on the metadata loop). It does the same
removeShareCursor/addShareCursor as the CurrentLeader-hint sibling
applyMovesBlocking, but it was the one share-cursor relocation that did not
register with the share consumer's worker barrier.

The share consumer drains acks per source on leave: it waits for every share
worker to exit (for sc.workers > 0), then calls closeShareSession on a snapshot
of the source list, draining each source's cursors and setting
shareCursor.closed so post-drain user acks have their callbacks invoked rather than being stranded.
A cursor that escapes every drain keeps its pending acks with no drainer:
sc.pendingAcks never returns to 0 (FlushAcks hangs) and the held records
release only via the broker's acquisition-lock timeout.

applyMovesBlocking already guards this by registering its migration via
sc.incWorker/decWorker, so leave's barrier waits for an in-flight move (or, if
already dying, incWorker bails and the cursor stays on its current source for
closeShareSession to drain). migrateShareCursorTo ran on the metadata loop with
no such registration, so a metadata-driven leader-change migration racing
LeaveGroup/Close could remove the cursor from its old source before that source
drained and add it to an already-drained source (or one created after leave's
snapshot), stranding the cursor's pending acks. The metadata loop stays alive
throughout leave -- cl.ctx is canceled only after Close waits on c.s.left -- and
c.s.tps still holds the share partitions, so the merge can migrate during the
leave window.

Fix: wrap migrateShareCursorTo's source swap in sc.incWorker/decWorker,
mirroring applyMovesBlocking. new.shareCursor is assigned before the incWorker
check so the stored partition data is valid on the dying-skip path. This
extends commit 1006894 (which supervised the applyMoves migration) to its
metadata-merge sibling.

The end-to-end strand is a shutdown race that is not deterministically reproducible, so
TestAuditShareMetadataMigrationWaitsForLeave asserts the mechanism: it parks the
migration after its incWorker by holding sinksAndSourcesMu (the first lock the
swap takes) and verifies the migration registers in sc.workers, exactly what
leave's barrier waits on. Pre-fix the migration takes sinksAndSourcesMu first
with no incWorker, so it never registers and the poll times out.
A topic's partitions can be led by different brokers, so the same topic can
be returned in separate Fetch entries (one Fetch per broker response).
EachTopic groups those entries by topic name into a single FetchTopic.
When TopicID was added to FetchTopic (4bfb0c6), the len(fs)==1 fast path
was updated to pass the whole FetchTopic through (preserving the ID), but
the multi-fetch grouping path rebuilt FetchTopic from a name=>partitions
map with a hard-coded zero TopicID.

The result: EachTopic returned a zero TopicID whenever more than one
broker replied. A topic's partitions are spread across brokers and arrive
in 2+ Fetch entries, so this is the normal case in a real multi-broker
cluster; only a single-broker poll hits the len(fs)==1 path that
preserves the ID. The bug is therefore invisible in single-broker tests
but live in production -- a "works in test, broken in prod" trap. Any
caller building a TopicID-keyed structure from EachTopic saw every topic
collapse onto the zero ID. No data loss / duplicate / stall (TopicID is
informational, Kafka 3.1+), but the field silently contradicts both its
own documented contract and the sibling len(fs)==1 path.

Carry the TopicID across the grouped Fetch entries: the broker returns
the same ID in every fetch response for a topic, so the first non-zero
copy is authoritative; topics with no ID (pre-3.1 brokers, share fetches,
which do not set it) stay zero. The classic/direct consume path populates
the ID (source.go fetchTopic build), so the fix is observable there.
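
A minimal sketch of the grouping rule, with hypothetical `topicID`, `fetchTopic`, and `groupByTopic` definitions standing in for the real Fetches types:

```go
package main

import "fmt"

type topicID [16]byte

// fetchTopic is a hypothetical, trimmed-down stand-in for a per-fetch topic.
type fetchTopic struct {
	Topic      string
	TopicID    topicID
	Partitions []int32
}

// groupByTopic merges per-broker entries by topic name. The first non-zero
// TopicID seen for a topic is kept; topics that never carry an ID stay zero.
func groupByTopic(fetches [][]fetchTopic) map[string]fetchTopic {
	var zero topicID
	out := make(map[string]fetchTopic)
	for _, f := range fetches {
		for _, t := range f {
			g := out[t.Topic]
			g.Topic = t.Topic
			if g.TopicID == zero {
				g.TopicID = t.TopicID
			}
			g.Partitions = append(g.Partitions, t.Partitions...)
			out[t.Topic] = g
		}
	}
	return out
}

func main() {
	id := topicID{1}
	got := groupByTopic([][]fetchTopic{
		{{Topic: "foo", TopicID: id, Partitions: []int32{0}}},
		{{Topic: "foo", TopicID: id, Partitions: []int32{1}}},
	})
	fmt.Println(got["foo"].TopicID == id, len(got["foo"].Partitions)) // true 2
}
```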

TestEachTopicPreservesTopicID covers the multi-fetch case (fails pre-fix:
the grouped TopicID is zero), plus the single-fetch and no-ID cases, all
under -race.
…r when a stale pin survives the no-selection fallback

The adaptive UniformBytesPartitioner pins a partition and re-picks by
weighted-random selection only when the pin is no longer usable. The
re-pick block is entered whenever the pinned p.onPart is invalid: either
the byte-window reset cleared it to the -1 sentinel, or - the case this
fixes - the pinned index is now >= n because the writable partition count
shrank under us. writablePartitions holds only partitions with no load
error (metadata.go:588-597), so a routine leader election or rolling
restart that briefly leaves some partitions leaderless drops them from the
writable set while the full partition count is preserved; a partitioner
pinned to a now-dropped partition (and not crossing its byte threshold this
call, so onPart is not reset to -1) enters the re-pick with p.onPart >= n.

The weighted-selection loop normally assigns a fresh in-range index, but it
can select nothing: floating-point rounding can leave pick just above 0
after subtracting every weight (the case the original "if p.onPart == -1"
fallback was guarding). That fallback only fired for the -1 sentinel, so
when the loop selected nothing AND the re-pick was entered with a stale
p.onPart >= n, the stale index was returned unchanged. doPartition then
rejects it as an out-of-range partitioning choice and FAILS the record
("invalid record partitioning choice of %d from %d available") rather than
producing it.

The sibling non-adaptive branch re-picks unconditionally via Intn(n) and so
never had this hole. Track whether the loop selected anything and fall back
to the last partition when it did not, regardless of the prior p.onPart
value, matching the non-adaptive branch's guarantee that a re-pick always
yields an in-range index. Behavior on the -1 sentinel path and the normal
pick path is unchanged.
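
The "a re-pick always yields an in-range index" guarantee can be sketched as below. `pickIndex` is a hypothetical function that takes the random draw as an argument so the fallthrough is deterministic; it assumes at least one weight.

```go
package main

import "fmt"

// pickIndex walks the weights, subtracting each from pick, and returns the
// first index at which pick drops to zero or below. If the loop selects
// nothing (for example, rounding leaves pick just above zero), it falls
// back to the last index regardless of any previously pinned value.
func pickIndex(weights []float64, pick float64) int {
	for i, w := range weights {
		pick -= w
		if pick <= 0 {
			return i
		}
	}
	return len(weights) - 1
}

func main() {
	w := []float64{1, 1, 1}
	fmt.Println(pickIndex(w, 0.5))       // 0
	fmt.Println(pickIndex(w, 1.5))       // 1
	fmt.Println(pickIndex(w, 3.0000001)) // 2: nothing selected, fallback
}
```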

The production trigger needs both a writable-shrink window and a ~1/2^53
float-rounding draw, and the failure is loud (the record's promise gets the
error) and self-healing (the next non-edge record re-picks a valid
partition), so the severity is low - but the fix is a one-line robustness
change aligning the adaptive branch with its non-adaptive sibling. The
regression test forces the "loop selects nothing" path deterministically
with a -1 backup (weight 1/0 = +Inf, so pick - Inf is never <= 0), a
Go-version-independent stand-in for the rounding fallthrough, and asserts
the re-pick stays in range after a writable-partition shrink: it returns
the stale partition 5 (for n=3) pre-fix, an in-range partition post-fix.
…n, not broker

The adaptive arm's doc and internal comment described the weighting as
per-broker ("chooses a broker based on the inverse of the backlog ... for
that broker"), but the partitioner weights strictly per-partition: the
TopicBackupIter yields per-partition buffered record counts and the
selection table is built per partition - there is no broker aggregation at
the partitioner layer. The KIP-794 partitioner this mirrors also weights
per-partition (its load stats are per-partition queue sizes), so the
"broker" wording was inaccurate against both this code and the reference.

Reword the two mechanism descriptions to say partition. The user-facing
intent sentence (favoring less-loaded brokers) is left as-is: favoring
fast-draining, low-backlog partitions does effectively steer more produce
to responsive brokers, which is the documented goal.
…outs

Two RecordReader bugs, both reachable from valid input, fixed under one
cohesive robustness change. Extends the R16 hardening (ed19927), which
guarded readSize's allocation against a hostile size but left these two
consume-side gaps.

1. A truncated fixed-size read panics the reader (Medium).

   next()'s io.EOF handling falls through to fn.parse(r.buf, rec) with an
   empty r.buf. readSize reports plain io.EOF (not io.ErrUnexpectedEOF) only
   when it read zero bytes; a partial read is already io.ErrUnexpectedEOF and
   returns early. When such a zero-byte read is the LAST fn after an earlier
   real read, it is not the clean record boundary (the boundary check
   requires i==0 or a preceding noread), so it reaches parse with an empty
   buffer. The fixed-width number parsers index r.buf at constant offsets
   (binary.BigEndian.Uint64's b[7], ..., the byte reader's b[0]) and panic on
   a short slice.

   Trace: a binary layout ending in a fixed-width field is ordinary, e.g.
   "%p{big32}%o{big64}" (4-byte partition + 8-byte offset per record). A
   stream/file holding N whole records plus a partial final record whose
   leading field(s) are present but whose trailing fixed-size field is cut
   reaches the trailing field with zero bytes -> io.EOF -> empty-buffer
   fall-through -> panic, crashing any consumer of the API (e.g. kcl). This
   also violates ReadRecord's documented contract, which promises
   io.ErrUnexpectedEOF for a mid-record EOF.

   Fix: before parsing, treat a fixed-size read (read.size > 0) whose buffer
   is short as the truncation it is and return io.ErrUnexpectedEOF. Text and
   value reads (sizefn, delim, regexp, json) are unaffected: their parsers
   tolerate an empty buffer, so an empty trailing value stays valid.

2. A read-nothing layout loops forever (Low).

   A layout of only fixed-number verbs (e.g. "%p{3}") builds an all-noread
   fns list. next() then never performs a read, never hits EOF, and never
   sets r.done, so ReadRecord returns identical records forever -- an
   unbounded produce loop in kcl. Reject such a layout at construction
   (reads == 0), matching the parse-time rejection R16 added for other
   malformed layouts.
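
The first fix amounts to a length check before a fixed-width parse. A minimal sketch, assuming a hypothetical `parseBig64` helper rather than the reader's actual parse functions:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
)

// parseBig64 is a hypothetical fixed-width parser. binary.BigEndian.Uint64
// panics on a slice shorter than 8 bytes, so a short buffer is reported as
// a truncated record instead.
func parseBig64(buf []byte) (uint64, error) {
	if len(buf) < 8 {
		return 0, io.ErrUnexpectedEOF
	}
	return binary.BigEndian.Uint64(buf), nil
}

func main() {
	v, err := parseBig64([]byte{0, 0, 0, 0, 0, 0, 0, 42})
	fmt.Println(v, err) // 42 <nil>
	_, err = parseBig64(nil)
	fmt.Println(err) // unexpected EOF
}
```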

Repros in record_formatter_test.go, both fail pre-fix:
  - TestRecordReaderTruncatedFixedSizeNoPanic: five truncated binary layouts
    (big/little 64/32/16, byte) panic pre-fix ("index out of range, length
    0"), return io.ErrUnexpectedEOF post-fix.
  - TestNewRecordReaderRejectsBadLayouts: "%p{3}", "%T{3}", "%p{3}%o{4}"
    returned nil error pre-fix (would loop), error post-fix.

NewRecordFormatter declared and incremented a loop counter `i` that is never
read (its reader-side sibling parseReadLayout has no such counter); remove it.
Also fix "undersands" -> "understands" in the AppendPartitionRecord doc.

No behavior change.
…re field

cfg.validate enforced only a LOWER bound (>= 100ms) on SessionTimeout,
RebalanceTimeout, and ProduceRequestTimeout, and did not validate
TransactionTimeout at all. All four are time.Duration (int64 nanoseconds)
config values that are later cast with int32(d.Milliseconds()) into int32
wire fields:

  - JoinGroup SessionTimeoutMs / RebalanceTimeoutMs (consumer_group.go:1415-1416,
    consumer_group_848.go:703)
  - ProduceRequest TimeoutMs (broker.go:568, sink.go:96)
  - InitProducerId TransactionTimeoutMs (producer.go:1098)

A Duration whose millisecond value exceeds math.MaxInt32 (~24.8 days)
silently overflows that cast. A 30-day SessionTimeout becomes
SessionTimeoutMillis = -1702967296 (negative garbage the broker rejects or
mishandles), and a ~50-day one wraps to a small positive value (~7h) that the
broker quietly accepts as a completely different timeout: silent corruption of
a user-supplied value with no error. Java cannot reach this because
session.timeout.ms and friends are int32-millisecond typed at the config
source (ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG is Type.INT); kgo accepts a
Duration, so the bound must be enforced client-side.

Add an upper-bound validation row for each of the four, capping at
math.MaxInt32 milliseconds - the wire field's capacity, the only principled
cap. EXTENDS b3cef0a (R16's heartbeat-units fix) in the same validation
table: same policy (the table is the only place these can be guarded before
the wire cast), broader coverage (overflow rather than unit mismatch).
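
The wrap and the bound can be reproduced in isolation. The sketch below is illustrative only: wireMillis, checkTimeout, and maxWireTimeout are hypothetical names, and the real checks are rows in cfg.validate's table rather than a standalone function.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const day = 24 * time.Hour

// maxWireTimeout is the largest duration whose millisecond value fits in
// an int32 wire field.
const maxWireTimeout = math.MaxInt32 * time.Millisecond

// wireMillis mimics the unchecked cast applied before writing the request.
func wireMillis(d time.Duration) int32 {
	return int32(d.Milliseconds())
}

// checkTimeout is a hypothetical validator enforcing both bounds.
func checkTimeout(name string, d time.Duration) error {
	if d < 100*time.Millisecond {
		return fmt.Errorf("%s %v is below the 100ms minimum", name, d)
	}
	if d > maxWireTimeout {
		return fmt.Errorf("%s %v overflows an int32 of milliseconds", name, d)
	}
	return nil
}

func main() {
	fmt.Println(wireMillis(30 * day)) // -1702967296
	fmt.Println(wireMillis(50 * day)) // 25032704 (just under 7h)

	fmt.Println(checkTimeout("SessionTimeout", 20*day) == nil) // true
	fmt.Println(checkTimeout("SessionTimeout", 30*day) == nil) // false
}
```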

Repro TestConfigRejectsInt32MillisOverflow: 30-day (negative-wrapping) and
50-day (positive-wrapping) values for each of the four are rejected post-fix
and accepted pre-fix; large-but-fitting (20-day) values still validate.
…races

The test asserted the per-batch mid-drain broadcast in finishPromises by
racing a flush-observer goroutine against a 20000-link self-feeding promise
chain: success required the woken Flush goroutine to be scheduled before the
chain reached its 20000-link cap. Under CPU saturation in the full -race suite
the woken goroutine lost that race and the test failed spuriously (cap hit, not
the 30s timeout), even though the code is correct.

Assert the real property at its source instead. Add a test-only
producer.onBatchPromiseBroadcast hook, invoked at the per-batch
p.c.Broadcast() with moreQueued reporting whether further ring elements are
still queued (l > 1, the current element not yet dropPeek'd). The test
observes a moreQueued==true broadcast directly on the producer's broadcast
path - no goroutine has to win a scheduler race - which is exactly the
mid-drain broadcast the fix guarantees. A real Flush goroutine remains as a
corroborating end-to-end wake check. chainLinks/capHit are now read only after
the chain has fully stopped (chainStopped), so there is no data race.
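
The shape of the hook can be sketched with a toy type. Everything here is an assumption for illustration: this producer, its queue, and observeDrain are hypothetical and far simpler than kgo's ring and finishPromises; only the idea (observe moreQueued on the broadcast path itself) is taken from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// producer is a toy stand-in: a queue drained under a mutex, with a
// broadcast after each element. It is not kgo's producer.
type producer struct {
	mu    sync.Mutex
	c     *sync.Cond
	queue []int

	// onBroadcast is a test-only hook, nil in production. moreQueued
	// reports whether elements remain queued behind the current one.
	onBroadcast func(moreQueued bool)
}

func newProducer(n int) *producer {
	p := &producer{queue: make([]int, n)}
	p.c = sync.NewCond(&p.mu)
	return p
}

// drain finishes every queued element, broadcasting per element rather
// than once at exit.
func (p *producer) drain() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.queue) > 0 {
		moreQueued := len(p.queue) > 1 // current element not yet dropped
		p.c.Broadcast()
		if p.onBroadcast != nil {
			p.onBroadcast(moreQueued)
		}
		p.queue = p.queue[1:]
	}
}

// observeDrain records the moreQueued value of every broadcast.
func observeDrain(n int) []bool {
	p := newProducer(n)
	var seen []bool
	p.onBroadcast = func(more bool) { seen = append(seen, more) }
	p.drain()
	return seen
}

func main() {
	fmt.Println(observeDrain(3)) // [true true false]
}
```

Because the hook runs synchronously on the draining goroutine, the assertion does not depend on when any waiter gets scheduled.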

Verified: target test green at -count=50 under -race (and at -count=30 under
3 background CPU spinners); Producer|Flush|Promise green at -count=5 under
-race. Confirmed the test still fails deterministically (fast, no timeout)
when the broadcast is moved back to ring-exit-only.

appendTo's appendSum closure built an otelSum for every ".total" metric
(producer/consumer connection.creation.total and user MetricTypeSum metrics)
but never set isMonotonic on the constructed struct. otelSum.appendTo was
already wired to serialize the field (Field 3, bool) and the otelSum struct's
own comment promises "We always set isMonotonic to true", but with the field
left false the proto3 default elided it entirely, so every counter went out as
a non-monotonic Sum.

OTLP Sum.is_monotonic declares the counter semantic: a monotonic sum is a true
cumulative counter, which downstream OTLP collectors / backends use to
distinguish counters from up-down gauges and to compute rates. franz-go's
.total sums ARE monotonic cumulative counters, and user MetricTypeSum metrics
are enforced non-decreasing at append time (lastTot > um.ValueInt skips), so
every sum emitted via appendSum is monotonic. The Java client sets
setIsMonotonic(monotonic) for the same counters
(SinglePointMetric.sum/deltaSum -> setIsMonotonic; KafkaMetricsCollector treats
Total/Sum/CumulativeCount as monotonically increasing). Monotonicity is
independent of temporality, so both the delta and cumulative arms set it.
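
Why a false field vanishes from the wire can be shown with a hand-rolled encoder. This is a sketch assuming standard proto3 encoding rules; appendBoolField is a hypothetical helper, not franz-go's otelSum.appendTo.

```go
package main

import "fmt"

// appendBoolField appends a proto3 bool field as a tag byte plus a varint
// value. proto3 omits fields that hold their default, so false emits
// nothing at all.
func appendBoolField(dst []byte, fieldNum int, v bool) []byte {
	if !v {
		return dst
	}
	// Tag = field number << 3 | wire type; wire type 0 is varint.
	// A single tag byte is enough while fieldNum < 16.
	tag := byte(fieldNum << 3)
	return append(dst, tag, 1)
}

func main() {
	fmt.Printf("%q\n", appendBoolField(nil, 3, false)) // ""
	fmt.Printf("%x\n", appendBoolField(nil, 3, true))  // 1801
}
```

A decoder that sees no field 3 reads is_monotonic as false, which is how an unset struct field turned every counter into a non-monotonic Sum.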

Not data loss/dup/stall - a serialization-fidelity gap that mis-declares the
metric's semantic type to the broker's telemetry pipeline. Pattern 51 (a
build/rebuild omitting a semantically-significant field) on the serialization
axis; pattern 6 (a comment naming behavior the code does not perform).

Repro TestAppendSumIsMonotonic (pkg/kgo/metrics_714_test.go) drives the real
appendSum via m.appendTo for both delta and cumulative temporality, walks the
serialized OTLP protobuf down to the .total Sum, and asserts is_monotonic==true
(absent pre-fix, present post-fix; -race).

pushMetrics' no-requested-metrics arm computed its re-get wait as
time.Duration(gresp.PushIntervalMillis) * time.Millisecond with no floor,
unlike the push loop, which already does max(..., time.Second). The field is an
int32 the broker fully controls, so a hostile or buggy broker (or an alt-broker
divergence) can return an empty RequestedMetrics list (a valid, common "no
metrics subscribed right now" state) together with a non-positive
PushIntervalMillis. That makes the wait <= 0, so time.NewTimer fires
immediately and the loop re-issues GetTelemetrySubscriptions at round-trip pace
forever, with a debug log per iteration. The push loop's max(..., time.Second)
floor masked that the GET-path sibling had none (pattern 3).

The Java client guards this at the source: ClientTelemetryUtils.validateIntervalMs
substitutes DEFAULT_PUSH_INTERVAL_MS (5m) for any interval <= 0. We do the same
once, right after a successful GetTelemetrySubscriptions, so BOTH the
re-get arm and the push loop pace on a sane value (the push loop's existing
floor becomes belt-and-suspenders). A non-positive interval is invalid per the
protocol intent, so the correct repair is to substitute the documented default
rather than to honor the value or merely clamp it to a lower bound.
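
A minimal sketch of the substitution follows. The function name comes from the repro below, but its signature, the constant name defaultPushIntervalMillis, and the regetWait helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// defaultPushIntervalMillis mirrors the 5-minute default that the Java
// client substitutes; the Go constant name is hypothetical.
const defaultPushIntervalMillis int32 = 5 * 60 * 1000

// validatePushIntervalMillis replaces a non-positive broker-advertised
// interval with the default and leaves positive values unchanged.
func validatePushIntervalMillis(ms int32) int32 {
	if ms <= 0 {
		return defaultPushIntervalMillis
	}
	return ms
}

// regetWait is the timer duration derived from the validated interval.
func regetWait(ms int32) time.Duration {
	return time.Duration(validatePushIntervalMillis(ms)) * time.Millisecond
}

func main() {
	fmt.Println(regetWait(0))             // 5m0s
	fmt.Println(regetWait(math.MinInt32)) // 5m0s
	fmt.Println(regetWait(30000))         // 30s
}
```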

Pattern 31 (a server-advised retry/interval parameter adopted without a
progress/floor bound) - the share-churn 3.3 / source-resweep B2 hot-loop
sibling on the telemetry-interval axis.

Repro TestValidatePushIntervalMillis (pkg/kgo/metrics_714_test.go) asserts the
extracted validatePushIntervalMillis substitutes the default for non-positive
advertised intervals (incl. MinInt32) and leaves positive ones unchanged, and
that the substituted interval yields a positive (non-immediate) re-get timer;
the end-to-end hot-loop needs a broker injecting an empty-metrics / non-positive
response, which kfake's telemetry handler does not expose - the mechanism test
is the deterministic guard (R23 non-deterministic-repro precedent).

Several deliberate behaviors that the franz-go audit catalogued as "do not
re-file as a bug" lacked an in-code marker stating WHY they are intentional,
so a future audit (or a well-meaning patch) could re-flag them. Migrate the
load-bearing rationales to concise, sited comments:

  - cursor.topicID (source.go): a recreated topic's new ID is deliberately
    never adopted; the consumer stalls loudly and the user purges+re-adds.
    Cites issue #908 / PR #391/#377 (OffsetForLeaderEpoch has no TopicID, so
    an adopted ID cannot be validated against truncation).

  - fetchOffsets UNSTABLE_OFFSET_COMMIT (consumer_group.go): the unbounded 1s
    retry is protocol-mandated (require_stable hides pending txnal offsets);
    a retry cap would convert a mandated wait into a spurious error.

  - groupExternal.updateLatest (consumer_group.go): rejoining on a one-response
    stale partition-count shrink is intentional self-healing churn, matching
    Java's leader exposure — not a bug to silence with a shrink filter.

  - updateBrokers empty-list wipe (client.go): an empty Brokers list falling
    back to seeds is the KIP-1102 REBOOTSTRAP_REQUIRED semantic, called
    explicitly by the rebootstrap path; not a hostile-input gap.

  - default autocommit head lag (consumer_group.go): the one-poll dirty->head
    lag is what makes default autocommit at-least-once; committing dirty at
    revoke would open a loss window (user decision 2026-04-24).

  - broker throttle (broker.go): a throttle is honored in full with no cap
    (KIP-219), matching Java; the wait is Close-interruptible and holds no
    lock. Capping it would break the quota mechanism.

No behavior change; comments only.

The txn-churn and rebalance-churn audits each established a set of invariants
that any future change to coordinator/leader-churn recovery must preserve.
There is no open issue that fits either, so anchor them as doc comments at the
function that owns each recovery loop; that way the constraints outlive the
audit notes:

  - manage848 (consumer_group_848.go): the rebalance-churn invariants —
    heartbeat errors retry in place while fetch errors restart the session
    via g.fetching; member-identity resets are the minimum the error implies
    (fresh UUID only for UnknownMemberID); leaves are idempotent and exempt
    from the CGHB no-retry rule.

  - doWithConcurrentTransactions (txn.go): the txn-churn invariants — the
    wrapper/CT-loop division, anyAdded TV1 gating + TV2-only forced abort,
    producer-fenced-means-dead — plus the two design-sized items left not
    taken (commit-after-failed-produce; TV2 mid-session downgrade).

The silent.md zero-loss topic-recreation design constraints were posted to
issue #908 (the canonical recreation issue) rather than duplicated in code.

No behavior change; comments only.
Comment-only trims/reframings:
  - drop the over-explanatory tails on UnknownTopicRetries (config.go) and
    checkUnknownFailLimit (sink.go), keeping the reset/bump rule and the three
    errors that count
  - delete the redundant RequireStableFetchOffsets no-op paragraph (txn.go)
  - reframe updateBrokers' empty-list rationale as the long-standing seed
    fallback it is, not a KIP-1102 artifact (client.go)
  - note producedInTxn is set at buffer time and the worst case is an
    always-legal empty EndTxn abort (producer.go)
  - drop "activating" from the share pendingAssigns comment (consumer_share.go)

Behavior/refactor:
  - NewConsumerBalancer dedups members in one map pass instead of an O(n^2)
    scan-then-rebuild (group_balancer.go)
  - fetchOffsets no longer re-fetches partitions it already surfaced a
    non-retryable error for and dropped; injected partitions are filtered out
    of the request on goto-start retries (consumer_group.go)
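
The single-pass dedup mentioned for NewConsumerBalancer can be sketched on plain strings. This is an assumption-laden illustration: dedupMembers is a hypothetical helper, the real code works on group member structs, and whether it keeps the first or last duplicate is not stated here (this sketch keeps the first).

```go
package main

import "fmt"

// dedupMembers keeps the first occurrence of each member ID and preserves
// input order, using one map pass instead of a nested scan.
func dedupMembers(ids []string) []string {
	seen := make(map[string]struct{}, len(ids))
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, dup := seen[id]; dup {
			continue
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	return out
}

func main() {
	fmt.Println(dedupMembers([]string{"a", "b", "a", "c", "b"})) // [a b c]
}
```
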
twmb added 2 commits June 23, 2026 13:22
TestAuditPendingReloadSurvivesPartitionRevoke pins the preservation half
of the dying-session reload fix: a partition caught mid-reload on a
retriable error must be carried into the next session when the session
is stopped by a revoke that keeps that partition. It drives the
cooperative-rebalance shape (assignInvalidateMatching) deterministically
via RemoveConsumePartitions on a direct consumer with explicit
partitions, so a dropped load cannot self-heal -- nothing re-lists a
pinned partition that has no cursor.
@twmb twmb merged commit 5b5fa28 into master Jun 23, 2026
12 checks passed
@twmb twmb deleted the audit-fixes branch June 23, 2026 20:18