Audit fixes #1348
Merged
checkUnknownFailLimit counted only UNKNOWN_TOPIC_OR_PARTITION and reset the count on any other error. Two problems:
* UNKNOWN_TOPIC_ID reset the count. Produce v13+ addresses topics by ID (KIP-516); when a topic is deleted and recreated, the client deliberately keeps the old ID, so every produce after recreation returns UNKNOWN_TOPIC_ID. The error is retriable, and RecordRetries and RecordDeliveryTimeout are unbounded by default, so the unknown topic limit is the only bound -- and UNKNOWN_TOPIC_ID (error code 100) actively defeated it. Records buffered and retried silently forever with no signal to the user, while the consumer side of the same scenario surfaces an error to every poll after five failures.
* Any error alternation defeated the limit: a broker alternating UNKNOWN_TOPIC_OR_PARTITION with any other retriable error kept the count below the limit forever.
Now both unknown-topic errors bump the count, only a successful produce resets it, and other errors leave it unchanged. This matches waitUnknownTopic's existing semantics for unloaded topics: count only unknown errors, never reset on other errors.
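The counting rule described above can be sketched in isolation. This is a minimal illustration, not the client's code: the `result` kinds, the `failCounter` type, and the "trip once the count exceeds the limit" comparison are all assumptions made for the example.

```go
package main

import "fmt"

// result is a hypothetical classification of one produce response.
type result int

const (
	ok result = iota
	unknownTopicOrPartition
	unknownTopicID
	otherRetriable
)

// failCounter is an illustrative stand-in for the state behind
// checkUnknownFailLimit; it is not the client's real type.
type failCounter struct {
	fails, limit int
}

// observe applies one result and reports whether the limit has tripped.
func (c *failCounter) observe(r result) bool {
	switch r {
	case ok:
		c.fails = 0 // only a successful produce resets the count
	case unknownTopicOrPartition, unknownTopicID:
		c.fails++ // both unknown-topic errors bump the count
	}
	// Any other error leaves the count unchanged.
	return c.fails > c.limit
}

func main() {
	c := &failCounter{limit: 4} // assumed limit for the example
	for i := 1; i <= 5; i++ {
		c.observe(otherRetriable) // alternation no longer hides the failures
		fmt.Println(i, c.observe(unknownTopicID))
	}
}
```

With the old behavior, the interleaved `otherRetriable` results would have reset the count each time and the limit would never trip.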
The missing-partition comment in mergeTopicPartitions said purging "can happen automatically for consumers if the user opted into ConsumeRecreatedTopics". No such option exists; it was planned in the issue #908 era but never implemented. Describe what actually clears partitions: a manual PurgeTopicsFromClient, or the automatic regex consumer purge after a topic has been missing from metadata for longer than ConsiderMissingTopicDeletedAfter.
The heartbeat loop's err variable is fed both by the heartbeat itself and by fetchOffsets via fetchErrCh. The 848 in-place retry arm classifies by error value alone, so a retryable error returned by fetchOffsets - e.g. the coordinator moved and OffsetFetch exhausted its internal retries - was retried as if a heartbeat had blipped: the next heartbeat succeeded, the retry counter reset, and the session lived on. But the fetch goroutine was already gone, the fetched-for partitions were never handed to assignPartitions (that only happens after a successful offset fetch), and nothing inside a live session re-runs the fetch. The member then acked partitions it never started consuming on every heartbeat, silently, until an external rebalance bounced the session. Track whether the error came from fetchErrCh and exclude fetch errors from the in-place retry arm. They now propagate to manage848, whose transient arm restarts the session; the restart re-fetches outstanding partitions via g.fetching. Heartbeat errors keep the in-place retry that 90bcc2b introduced. Co-Authored-By: Claude Fable 5 <[email protected]>
STALE_MEMBER_EPOCH reaches the manage loop via OffsetFetch (the heartbeat itself fences with FENCED_MEMBER_EPOCH), meaning the broker still has this member. Rejoining with a fresh member id - previously shared with the UnknownMemberID arm - stranded the old member server-side until the session timeout, and because a live member's partitions stay in the target assignment, the fresh incarnation received an empty assignment and consumed nothing at all until the eviction. Rejoining at epoch 0 with the same member id is KIP-848's lost-response recovery: the broker re-admits the member in place and re-delivers its assignment. Move StaleMemberEpoch into the FencedMemberEpoch arm, which already keeps the member id; this also matches the share group consumer, which keeps its UUID on fence/unknown. Co-Authored-By: Claude Fable 5 <[email protected]>
ConsumerGroupHeartbeat is deliberately never retried by the client's coordinator request wrapper: heartbeats carry reconciliation state and the heartbeat loop owns retries with full state knowledge. The leave (MemberEpoch -1, or -2 for static members) inherited that path incidentally, but a leave carries no reconcilable state and is idempotent - and the coordinator moving is precisely when a leave gets issued against a stale cached coordinator. Firing it exactly once meant a single NOT_COORDINATOR lost the leave and the member ghosted until the session timeout, whereas classic LeaveGroup retries through the wrapper. The same applied to the share group leave.
Route MemberEpoch<0 heartbeats (consumer and share) through handleCoordinatorReqSimple and teach parseRetryErr about ShareGroupHeartbeatResponse so coordinator errors evict the cached coordinator and retry. The wrapper bounds retries by the session timeout, which is the natural cap for a leave: past it, the broker has expired the member's session anyway.
Because the leave is now retried, a retry can find the member already gone (a prior attempt succeeded but its response was lost, or the session expired first). Map UNKNOWN_MEMBER_ID on a leave to success in both leave paths: the member being out of the group is the goal state of leaving.
Co-Authored-By: Claude Fable 5 <[email protected]>
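A rough sketch of the leave semantics described above, under simplifying assumptions: `leaveWithRetry` and its `sendLeave` callback are hypothetical names, the errors are plain sentinel values rather than kerr errors, and an attempt cap stands in for the session-timeout bound that the real wrapper uses.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel errors standing in for broker error codes (illustrative only).
var (
	errNotCoordinator  = errors.New("NOT_COORDINATOR")
	errUnknownMemberID = errors.New("UNKNOWN_MEMBER_ID")
)

// leaveWithRetry issues a leave until it succeeds, hits a non-retriable
// error, or runs out of attempts.
func leaveWithRetry(sendLeave func() error, maxAttempts int) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		err = sendLeave()
		switch {
		case err == nil:
			return nil
		case errors.Is(err, errUnknownMemberID):
			// The member is already out of the group, which is the goal
			// state of leaving, so report success.
			return nil
		case errors.Is(err, errNotCoordinator):
			continue // the coordinator moved; retry against the new one
		default:
			return err
		}
	}
	return err
}

func main() {
	// First attempt hits a moved coordinator; the retry finds the member
	// already gone.
	responses := []error{errNotCoordinator, errUnknownMemberID}
	calls := 0
	err := leaveWithRetry(func() error {
		e := responses[calls]
		calls++
		return e
	}, 5)
	fmt.Println(calls, err)
}
```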
416e826 stopped canceling prior in-flight commits in favor of waiting for them (canceling kills the connection, and the broker could then process a replacement commit issued on a new connection before the original), but the CommitOffsets and CommitOffsetsSync doc comments still described the old canceling behavior. Co-Authored-By: Claude Fable 5 <[email protected]>
Repros from the broker-dies/leader-moves-mid-rebalance audit (rebalance-churn.md), written to fail before the prior commits: 848 offset-fetch failures must not silently stall the assignment, STALE_MEMBER_EPOCH resets must keep the member id, and 848/share leaves must survive a transient NOT_COORDINATOR (with UNKNOWN_MEMBER_ID on a leave reporting success). The classic-protocol siblings pin the behavior the fixes restore parity with. Co-Authored-By: Claude Fable 5 <[email protected]>
Audit round 4 (txn-churn.md). Six fixes, each independently traced:
InitProducerID now retries CONCURRENT_TRANSACTIONS in place. The doWithConcurrentTransactions wrapper added in 2451c59 never functioned: the inner fn returned nil before checking the response ErrorCode, and the coordinator wrapper does not convert CONCURRENT_TRANSACTIONS into a request error. Taking over a crashed incarnation's ongoing transaction ALWAYS receives CONCURRENT_TRANSACTIONS at least once (the broker fences and aborts the old transaction and tells us to retry), so routine recovery surfaced as a 'producer ID has a fatal, unrecoverable error' BeginTransaction failure.
maybeRecoverProducerID treats retriable broker codes stored as the producer-id load failure (COORDINATOR_LOAD_IN_PROGRESS and friends that outlived their internal retries) as reload-recoverable, matching the existing transport-error arm instead of reporting them fatal.
maybeRecoverProducerID gates the narrowed KIP-890p2 recoverable set on tx890p2 (the mode our transactions actually ran under) instead of supportsKeyVersion(EndTxn, 5): brokers advertise EndTxn v5 regardless of the finalized transaction.version, so the old gate disabled the KIP-360/KIP-588 recovery on 4.0+ clusters still operating TV1 semantics, killing the producer on routine transaction-timeout aborts.
EndTransaction under KIP-890p2 now issues an EndTxn abort when produces were attempted but none succeeded. Produce requests implicitly register their partition with the transaction coordinator durably BEFORE the data append, and the client marks addedToTxn only on produce success, so a transaction whose every produce failed skipped EndTxn entirely: the broker-side transaction stayed ongoing until the transaction timeout and the next transaction's produces (same epoch, no bump) silently joined it. Aborting an empty TV2 transaction is legal and bumps the epoch.
supportsKIP890p2 no longer opts in when the user's MaxVersions cap is below the 890p2 wire versions (produce v12, EndTxn v5, TxnOffsetCommit v5). Opting in skipped AddPartitionsToTxn while version negotiation kept produce at v11 or lower, so brokers rejected every transactional batch with INVALID_TXN_STATE.
doWithConcurrentTransactions orders its backoff escalation longest-first; the switch cases were ordered so the 500ms and 1s arms were unreachable.
The produce-path TransactionAbortable arm no longer continues into producing: doTxnReq's error path already removed every batch from the transaction and decremented its inflight count, so producing them anyway double-decremented the uint8 inflight counter (wrapping to 255 and permanently wedging the recBuf drain gate) and could re-drain in-flight batches. The error now takes the fail-producer-id arm, which delivers the same error to the records and remains recoverable via EndTransaction. (Unreachable from spec brokers today: the client pins AddPartitionsToTxn to v3, which cannot carry TRANSACTION_ABORTABLE.)
Also refreshes stale docs (RequireStableFetchOffsets is a permanent no-op; the pre-KIP-447 sleep is 500ms not 200ms; KIP-890 recoverability phrasing) and drops commitTxn's unused cancel plumbing, whose comment described group-context cancellation that never existed: the request deliberately rides the caller's context.
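The backoff-ordering bug is a general property of Go's expressionless `switch`: the first matching case wins, so thresholds must be tested longest-first. The sketch below shows the shape only; the function names, the use of a try count, the thresholds, and the 100ms and 20ms values are assumptions for illustration (only the 500ms and 1s arms are named in the commit message).

```go
package main

import (
	"fmt"
	"time"
)

// backoffShortestFirst mirrors the broken ordering: the first case matches
// every positive try count, so the later arms can never run.
func backoffShortestFirst(tries int) time.Duration {
	switch {
	case tries > 0:
		return 100 * time.Millisecond
	case tries > 5: // unreachable: tries > 0 already matched
		return 500 * time.Millisecond
	case tries > 10: // unreachable for the same reason
		return time.Second
	default:
		return 20 * time.Millisecond
	}
}

// backoffLongestFirst tests the largest threshold first so every arm is
// reachable.
func backoffLongestFirst(tries int) time.Duration {
	switch {
	case tries > 10:
		return time.Second
	case tries > 5:
		return 500 * time.Millisecond
	case tries > 0:
		return 100 * time.Millisecond
	default:
		return 20 * time.Millisecond
	}
}

func main() {
	for _, tries := range []int{1, 6, 12} {
		fmt.Println(tries, backoffShortestFirst(tries), backoffLongestFirst(tries))
	}
}
```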
Six tests from audit round 4 (txn-churn.md). Four reproduce the bugs fixed in the previous commit and assert the correct behavior: InitProducerID survives one injected CONCURRENT_TRANSACTIONS, BeginTransaction survives a retriable InitProducerID load failure (NOT_ENOUGH_REPLICAS surfaces via the reload path rather than as a fatal producer state), EndTransaction aborts a KIP-890p2 transaction whose every produce failed, and a MaxVersions-pinned client produces transactionally against a transaction.version=2 cluster via the explicit AddPartitionsToTxn path. Two controls pin behavior that already held: the TV1 failed-produce abort, and the coordinator wrapper retrying EndTxn through one NOT_COORDINATOR.
A metadata response can omit a partition we already know about: brokers serve metadata from their own caches, which lag the controller, so after a CreatePartitions a refresh that lands on a not-yet-caught-up broker reports the old partition count. The merge keeps the partition around for exactly this reason (metadata.go, "we are keeping the partition around for safety"), but bumpRepeatedLoadErr treated the errMissingMetadataPartition it bumps with as terminally non-retryable: willFail reduced to canFail, so for the default idempotent producer every buffered-but-never-sent record was failed on the FIRST stale refresh, and a transactional producer was forced into a spurious abort. The failure is purely client-manufactured: the broker still has the partition and a drain moments later would have succeeded, yet one lagging broker response failed records that the infinite-retry defaults (recordRetries=MaxInt64, recordTimeout=0) promise to keep retrying. The Java client parks such batches and only delivery.timeout.ms expires them. Fix: treat errMissingMetadataPartition as retryable, bounded by the same unknown-topic fail limit as UNKNOWN_TOPIC_OR_PARTITION and UNKNOWN_TOPIC_ID - it is the metadata-side twin of those broker errors. Transient omissions (the CreatePartitions propagation window) now heal: the partition is restored on a later refresh, the same recBuf resumes with sequences intact, and the records deliver. The permanent case (a topic recreated with fewer partitions) still fails records once the limit (UnknownTopicRetries, default 4) trips via the metadata retry loop, so nothing hangs forever. Repros in pkg/kfake/partition_count_test.go: TestProduceTransientMissingPartitionKeepsRecords fails before this commit (record failed during the stale window) and passes after; TestProducePersistentMissingPartitionStillFails guards the bound.
A share coordinator assigns by topic ID + partition index and can hand out newly added partitions before the client's metadata has seen the CreatePartitions (metadata is served from per-broker caches that lag the controller). assignPartitions skipped such partitions - correct in the moment - but nothing ever retried them, in two compounding ways:
1. The skip was recorded as success: nowAssigned stored the full broker assignment including the skipped partition, and the add loop skipped anything already in nowAssigned, so even a broker re-send of the identical assignment would not activate it.
2. The member epoch is acked on every heartbeat response regardless, so the broker considers the assignment delivered and never re-sends it; on nil-assignment heartbeats, handleHeartbeatResp only re-resolved unresolved topic IDs - an unactivatable partition index of a resolved topic was retried by nothing.
Net effect: the new partition was silently never consumed (no error, nothing above debug logs) until some unrelated assignment change. The regular 848 consumer does not have this hole: its assigned-but-unknown partitions funnel into the offset-load machinery, whose loads are retried on metadata updates until they apply.
Fix: the add loop now skips based on the cursor's own activation state (assigned.Swap) rather than on nowAssigned, and sets pendingAssigns when it cannot activate an assigned partition; while pendingAssigns is set, handleHeartbeatResp re-returns the current assignment on nil-assignment heartbeats so activation is retried (each pass re-triggers a metadata refresh, mirroring the existing unresolved-topic-ID retry). Topics absent from tps entirely (purged/unsubscribed) deliberately do not set the flag: they will never appear in tps, and the broker is guaranteed to send a new assignment in response to the subscription change.
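The activation rule in the fix can be sketched as follows. This is a simplified model, not the client's code: `cursor`, `shareState`, and `assign` are hypothetical stand-ins, a single topic is assumed, and the metadata refresh is left out.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursor is a hypothetical stand-in for a share cursor.
type cursor struct{ assigned atomic.Bool }

// shareState models one topic's partitions as known to client metadata.
type shareState struct {
	partitions     []*cursor
	pendingAssigns bool
}

// assign walks the broker's assignment and activates what it can. It keys
// off each cursor's own activation state, so replaying the same assignment
// later activates partitions that were skipped earlier.
func (s *shareState) assign(assignment []int32) (activated []int32) {
	s.pendingAssigns = false
	for _, p := range assignment {
		if p < 0 {
			continue // can never become valid; do not retry
		}
		if int(p) >= len(s.partitions) {
			s.pendingAssigns = true // metadata lags; retry on a later heartbeat
			continue
		}
		if s.partitions[p].assigned.Swap(true) {
			continue // already active
		}
		activated = append(activated, p)
	}
	return activated
}

func main() {
	s := &shareState{partitions: []*cursor{{}}}
	fmt.Println(s.assign([]int32{0, 1}), s.pendingAssigns)

	// Metadata catches up to the new partition; the replay activates it.
	s.partitions = append(s.partitions, &cursor{})
	fmt.Println(s.assign([]int32{0, 1}), s.pendingAssigns)
}
```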
Repro in pkg/kfake/partition_count_test.go: TestShareAssignedNewPartitionStaleMetadata fails before this commit (the new partition is never consumed) and passes after; Test848AssignedNewPartitionStaleMetadata guards the 848 sibling chain.
Real brokers reject offset commits for partitions their metadata does not know with UNKNOWN_TOPIC_OR_PARTITION at the API layer, before the group coordinator sees the request (KafkaApis.handleOffsetCommitRequest checks metadataCache.getLeaderAndIsr per partition); the coordinator itself stores offsets blindly. kfake accepted and persisted commits for any topic/partition, so client paths that handle per-partition commit rejection were untestable against it. Validate against a topic metadata snapshot taken when the request is dispatched to the group goroutine (in the cluster goroutine, where c.data is safe to read - the same pattern ConsumerGroupHeartbeat uses for creq.topicMeta), and commit only the partitions that pass. Auth is still checked first, matching the real broker's ordering. Both the classic and 848 commit handlers funnel through fillOffsetCommitWithACL, so one check covers both. TxnOffsetCommit is deliberately untouched: its broker-side validation has not been verified. Also adds the partition-count audit regression tests: producer transient/persistent missing-partition behavior (stale metadata views replayed via a Metadata ControlKey), share and 848 new-partition assignment under stale client metadata, and the commit validation above.
…; back off top-level errors
The share consumer's only leader-move heal was the CurrentLeader hint in
ShareFetch/ShareAcknowledge error responses, and the broker populates
that hint only for NOT_LEADER_OR_FOLLOWER / FENCED_LEADER_EPOCH and only
when it already knows the new leader (KafkaApis processShareFetchResponse).
Every other error shape - a leaderless failover window, a dead broker
(no response at all), UNKNOWN_TOPIC_ID propagation, storage errors - had
no heal: nothing in the share fetch path ever triggered a metadata
update, so an affected partition sat erroring until the periodic refresh
(MetadataMaxAge, default 5 minutes). The classic fetch path heals both
shapes: per-partition response errors collect into updateWhy and trigger
an immediate update, and the transport-failure backoff opportunistically
triggers one (this is how classic cursors escape a dead broker).
On top of that, the share path surfaced every hint-less per-partition
error straight into the polled fetch. Classic strips retriable errors
unless KeepRetryableFetchErrors is set, and gives UNKNOWN_TOPIC_ID a
5-strike grace via cursor.unknownIDFails before surfacing (a just
created topic transiently returns it while brokers sync; persistent
means recreation, which deliberately stalls loudly). The Java share
consumer likewise swallows all retriable share fetch errors and requests
a metadata update (ShareFetchCollector.handleInitializeErrors), throwing
only auth/corruption. So under a routine leader move, kgo share users
polled kerr.NotLeaderForPartition errors that the Fetches.Errors docs
describe as restart-worthy, while consumption silently stalled.
Finally, a top-level ShareFetch error reset the session and returned
straight into the next fetch: no buffered fetch to pace on, no backoff.
A persistent top-level error - e.g. GROUP_AUTHORIZATION_FAILED after an
ACL revocation mid-run, which the broker answers top-level - hot-looped
at round-trip pace (5301 requests in 2s in the in-process repro).
Transport errors and all-errors-stripped responses already backed off;
top-level errors now take the same backoff.
This commit makes the share fetch path mirror classic:
- per-partition errors without a leader hint are classified: retriable
errors are stripped (KeepRetryableFetchErrors restores surfacing),
UNKNOWN_TOPIC_ID gets the same 5-strike grace counter on the share
cursor, non-retriable errors still surface
- collected errors trigger a metadata update with classic's exact
split: pure unknown-topic reasons ride the debounced trigger,
anything else triggers immediately
- the share backoff opportunistically triggers a (debounced) metadata
update, healing dead-broker cursors
- top-level response errors back off before refetching
Stripped-empty responses flow into the existing allErrsStripped backoff,
which paces hint-less error retries. The hinted-move arm stays first and
is unchanged. Share-only blast radius; classic/848/direct consumers and
producers are untouched.
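A hedged sketch of the per-partition classification listed above. The `errKind` values, `classify`, and the exact strike comparison are assumptions for the example; the real code works on kerr errors and also handles counter resets and the hinted-move arm, which are omitted here.

```go
package main

import "fmt"

// errKind is a hypothetical classification of a hint-less partition error.
type errKind int

const (
	kindRetriable errKind = iota
	kindUnknownTopicID
	kindFatal
)

// unknownIDStrikes is the assumed strike count before UNKNOWN_TOPIC_ID
// surfaces to the user.
const unknownIDStrikes = 5

type shareCursor struct{ unknownIDFails int }

// classify reports whether the error should surface to the polled fetch and
// whether it warrants an immediate metadata update (false means the
// debounced trigger is enough).
func classify(c *shareCursor, k errKind, keepRetryable bool) (surface, immediate bool) {
	switch k {
	case kindUnknownTopicID:
		c.unknownIDFails++
		return c.unknownIDFails >= unknownIDStrikes, false
	case kindRetriable:
		// Stripped unless the user opted into keeping retryable errors.
		return keepRetryable, true
	default:
		return true, true
	}
}

func main() {
	c := &shareCursor{}
	for i := 1; i <= 5; i++ {
		surface, immediate := classify(c, kindUnknownTopicID, false)
		fmt.Println(i, surface, immediate)
	}
	fmt.Println(classify(c, kindRetriable, false))
	fmt.Println(classify(c, kindFatal, false))
}
```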
Repros (fail pre-fix, pkg/kfake/share_churn_test.go):
TestShareFetchLeaderMoveNoHintHeals (error surfaced + 15s stall, now
heals in ~0.3s), TestShareFetchTransportErrorTriggersMetadata (18s
stall, now ~0.9s), TestShareFetchTopLevelErrorBackoff (5301 requests/2s,
now backoff-paced); TestShareFetchLeaderMoveHintHeals guards the hint
arm on both sides.
…ints
The share assignment and cursor-move paths bounds-checked partition
numbers from broker responses only on the upper side:
assignPartitions's revoke and add loops and applyMoves all did
"if int(p) >= len(td.partitions) { continue }" before indexing
td.partitions[p]. A negative partition number passes that check and
panics with index out of range, crashing the process - the revoke/add
loops run on the share manage goroutine, applyMoves on the metadata
loop.
The values come straight from the wire: assignment partitions from
ShareGroupHeartbeat responses (stored wholesale into nowAssigned, so
the revoke loop replays them on the next assignment change too), and
move targets from ShareFetch/ShareAcknowledge response partitions that
carry errors with CurrentLeader hints. A sane broker never sends a
negative partition, but a buggy or hostile one can, and the classic/848
assignment funnel already routes exactly this to safety
(consumer.go assignPartitions: offset.at >= 0 && partition >= 0 &&
partition < len bounds check). The share path now skips negative
indexes too: silently in the revoke loop and applyMoves (matching the
too-large skip), with a warn in the add loop. The add-loop skip
deliberately does not set pendingAssigns: unlike a too-large index,
which heals once metadata catches up to grown partitions, a negative
index can never become valid, so retrying activation each heartbeat
would spin forever.
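The two-sided bounds check for the add loop might look like the sketch below. `addLoop` and its return values are hypothetical; the point is that a negative index is dropped with a warning and never marks the assignment pending, while a too-large index does.

```go
package main

import "fmt"

// addLoop decides, per assigned partition index, whether to activate it,
// retry it later, or drop it. numPartitions is the count known to client
// metadata.
func addLoop(assignment []int32, numPartitions int) (activate []int32, pendingAssigns bool) {
	for _, p := range assignment {
		switch {
		case p < 0:
			// A negative index can never become valid: warn and skip
			// without requesting a retry.
			fmt.Printf("warn: skipping negative partition %d\n", p)
		case int(p) >= numPartitions:
			// Metadata may still catch up to grown partitions.
			pendingAssigns = true
		default:
			activate = append(activate, p)
		}
	}
	return activate, pendingAssigns
}

func main() {
	fmt.Println(addLoop([]int32{0, -1}, 1))
	fmt.Println(addLoop([]int32{0, 3}, 1))
}
```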
Repro (panics pre-fix with "index out of range [-1]" in
assignPartitions): TestShareAssignmentNegativePartitionNoPanic in
pkg/kfake/share_churn_test.go, which hijacks one post-join heartbeat to
deliver an assignment of partitions [0, -1] and asserts partition 0
keeps consuming.
…roker
ShareFetch and ShareAcknowledge response partitions carry an inline (untagged) CurrentLeader struct that is always serialized. The Java schema default is leaderId=-1/leaderEpoch=-1, and a real broker serializes that whenever it does not populate a hint - KafkaApis fills CurrentLeader only for NOT_LEADER_OR_FOLLOWER and FENCED_LEADER_EPOCH, and only when it knows the new leader.
kfake's donep() response builders left the Go zero value 0/0 on every other error partition (UnknownTopicID, UnknownTopicOrPartition, TopicAuthorizationFailed, ack errors), and kmsg's generated Default() does not default these fields. Clients treat LeaderID >= 0 && LeaderEpoch >= 0 as a valid KIP-951-style move hint, and kfake node IDs start at 0 - so every kfake hint-less error partition silently told clients "the leader is node 0, epoch 0", making them migrate share cursors to broker 0 instead of exercising the no-hint error path that real brokers produce.
Set -1/-1 in both donep() builders; the NotLeaderForPartition arms still overwrite with the real leader.
Found during the round-9 share-churn audit: the no-hint repros (share_churn_test.go) model exactly the responses real brokers send for leaderless windows and propagation lag, which kfake could not produce.
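The zero-value trap is easy to see in a few lines. The `currentLeader` type and the `hasHint` and `noHint` helpers below are illustrative stand-ins, not the kmsg or kfake types; only the `>= 0` validity test and the -1/-1 default come from the text above.

```go
package main

import "fmt"

// currentLeader is a stand-in for the inline CurrentLeader struct.
type currentLeader struct {
	LeaderID    int32
	LeaderEpoch int32
}

// hasHint is the client-side validity test described above.
func hasHint(cl currentLeader) bool {
	return cl.LeaderID >= 0 && cl.LeaderEpoch >= 0
}

// noHint returns the value a real broker serializes when it has no hint.
func noHint() currentLeader {
	return currentLeader{LeaderID: -1, LeaderEpoch: -1}
}

func main() {
	// The Go zero value reads as "leader is node 0, epoch 0".
	fmt.Println(hasHint(currentLeader{}))
	// The explicit default is correctly treated as no hint.
	fmt.Println(hasHint(noHint()))
}
```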
Round-9 audit repros for the share consumer under broker churn
(share-churn.md). Five tests, all -race:
- TestShareFetchLeaderMoveNoHintHeals: a leader move whose NOT_LEADER
responses carry no CurrentLeader hint (the leaderless-window shape,
and the only shape for error codes the broker never hints for) must
not surface retriable errors to poll and must heal via a triggered
metadata refresh. Pre-fix: NotLeaderForPartition surfaced on every
poll and the partition stalled for the full window.
- TestShareFetchLeaderMoveHintHeals: control; the CurrentLeader hint
path migrates without metadata, passing pre- and post-fix.
- TestShareFetchTransportErrorTriggersMetadata: every ShareFetch to
the old leader has its connection killed (a dead broker, no
response, no hint possible); the fetch backoff must trigger a
metadata refresh like the classic source backoff. Pre-fix: stalled.
- TestShareFetchTopLevelErrorBackoff: persistent top-level errors
(group auth revoked mid-run) must back off; pre-fix the fetch loop
was round-trip paced (5301 requests in the 2s window).
- TestShareAssignmentNegativePartitionNoPanic: an assignment of
partitions [0, -1] must be skipped, not crash; pre-fix it panicked
the manage goroutine with index out of range [-1].
…tionsToTxn
On a pre-KIP-890p2 cluster, txnReqBuilder.add puts a partition in the AddPartitionsToTxn request only the first time it joins the transaction; later batches for that partition ride produce requests with no add. When an AddPartitionsToTxn failed, doTxnReq's deferred cleanup ran removeFromTxn over EVERY batch in the produce request, clearing addedToTxn for partitions whose membership was a broker-acked fact from an earlier add of the same transaction.
EndTransaction's anyAdded walk then saw no added partitions and returned nil without issuing EndTxn -- before ever consulting the failed producer ID (its own comment, 'anyAdded is true if the producer ID was failed', documents the invariant this broke). The broker still had an ongoing transaction holding previously appended batches, which the transaction timeout eventually aborts: records whose produce promises succeeded, in a commit that returned nil, are silently discarded. The same shape reaches commit-nil via transport-failed adds when user-configured record retries/timeouts fail the requeued batches before the coordinator heals.
The Java client only reverts pending adds on AddPartitionsToTxn failure (pendingPartitionsInTransaction); acked adds and the sticky transactionStarted flag are untouched, so its EndTxn is never skipped after a partial add failure.
Scope the un-marking to partitions actually in the failed txnReq; every batch in the produce request is still requeued (drain index reset + inflight decrement). Also refuse to strip or fatal on response partitions that were not in the request, so a buggy broker reply cannot clobber acked membership either, and document why per-partition TransactionAbortable is deliberately left in the request.
Repro: TestAuditTxnV1AddPartitionsFatalKeepsEarlierAdds (fails pre-fix) and TestAuditTxnV1AddPartitionsRetriableStripControl in pkg/kfake.
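The scoping change can be modelled with a small map. Everything here is hypothetical (the `tp` key, `unmarkFailedAdd`, `anyAdded` as a free function); it only illustrates why un-marking must be limited to the partitions that were in the failed request.

```go
package main

import "fmt"

// tp identifies a topic partition (illustrative key type).
type tp struct {
	topic     string
	partition int32
}

// unmarkFailedAdd clears the added-to-transaction mark only for partitions
// that were part of the failed AddPartitionsToTxn request. Partitions whose
// earlier add was acked by the broker keep their mark.
func unmarkFailedAdd(addedToTxn map[tp]bool, failedReq []tp) {
	for _, p := range failedReq {
		addedToTxn[p] = false
	}
}

// anyAdded reports whether any partition is still part of the transaction,
// which decides whether EndTxn must be issued.
func anyAdded(addedToTxn map[tp]bool) bool {
	for _, added := range addedToTxn {
		if added {
			return true
		}
	}
	return false
}

func main() {
	a0, b0 := tp{"a", 0}, tp{"b", 0}
	// a-0 joined via an earlier acked add; b-0 is in the add that failed.
	added := map[tp]bool{a0: true, b0: true}
	unmarkFailedAdd(added, []tp{b0})
	fmt.Println(added[a0], added[b0], anyAdded(added))
}
```

Because `a-0` stays marked, the `anyAdded` check still reports an ongoing transaction and EndTxn is not skipped.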
EndTransaction documents that a commit attempted while the producer ID
has an error returns kerr.OperationNotAttempted and that the caller
should then retry with TryAbort. The commit attempt, however, had
already consumed the transaction state before reaching the producer-ID
check: inTxn was cleared on entry and the anyAdded walk swapped every
recBuf's addedToTxn (and the group's offsetsAddedToTxn) to false. The
documented TryAbort retry then hit the !inTxn early return and reported
success without sending anything.
GroupTransactSession.End does this retry internally ('end transaction
with commit not attempted; retrying as abort'), so the flagship session
API logged an abort that never reached the broker. The broker-side
transaction stayed ongoing -- stalling read_committed consumers on the
LSO -- until the transaction timeout aborted it, or until a later
InitProducerID re-init happened to clear it for recoverable producer-ID
errors. The !inTxn gate predates the OperationNotAttempted contract
(dbd8d35, 2020); nothing documents the no-op as intended, and the Java
client never skips the EndTxn abort once a partition was added
(TransactionManager gates EndTxn only on its sticky transactionStarted).
When the commit is not attempted, restore what the call consumed (inTxn,
each swapped addedToTxn, offsetsAddedToTxn) before returning
OperationNotAttempted so the abort retry still sees the transaction and
issues EndTxn. producingTxn deliberately stays false: produces between
the failed commit and the abort retry fail fast with
errNotInTransaction rather than buffering against a failed producer ID.
Repro: TestAuditTxnAbortRetryAfterOperationNotAttempted in pkg/kfake
(fails pre-fix).
recBuf.inflight was a uint8, but the number of concurrent requests holding a batch of one recBuf is bounded by the sink's inflight semaphore, and with idempotency disabled that is the user's MaxProduceRequestsInflightPerBroker -- which config validation does not bound above. A value over 255 with 256+ buffered batches could wrap the counter: createReq's 'inflight != 0 && !okOnSink' gate and decInflight's zero check then fire at the wrong times, in the worst case clearing inflightOnSink while requests are still in flight and letting a migrated recBuf drain on a new sink concurrently with old-sink requests (the cross-sink reordering inflightOnSink exists to prevent). No runnable repro: the wrap needs more than 255 physically concurrent in-flight produce requests, which is not feasible as a fast kfake unit test. int32 makes the wrap unreachable (2^31 concurrent requests each holding memory).
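The wrap itself is plain Go integer arithmetic and can be shown directly. The helper names below are made up for the example; the counters are unrelated to the client's real fields.

```go
package main

import "fmt"

// countUint8 increments a uint8 n times, as the old inflight counter would.
func countUint8(n int) uint8 {
	var c uint8
	for i := 0; i < n; i++ {
		c++ // wraps from 255 back to 0
	}
	return c
}

// countInt32 increments an int32 n times, as the widened counter does.
func countInt32(n int) int32 {
	var c int32
	for i := 0; i < n; i++ {
		c++
	}
	return c
}

func main() {
	// 256 concurrent holders look like "nothing in flight" to a uint8.
	fmt.Println(countUint8(256), countInt32(256)) // prints "0 256"
}
```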
…duplicated
handleReqResp deleted req.metrics entries when a produce response named a topic (or topic ID, or partition) that was not in the request. For a genuinely invented entry the delete was a no-op: metrics entries only exist for batches AppendTo actually serialized. But a DUPLICATED reply entry takes the same arm -- processing the first occurrence empties that topic/partition out of req.batches -- and there the delete erased the metrics of a batch the first occurrence legitimately finished, so the deferred metrics hook silently skipped OnProduceBatchWritten for a successfully produced batch.
Drop the deletes; the one that does real work (clearing metrics for batches whose response carried an error, i.e. !didProduce) stays.
Repro: TestAuditProduceDuplicateResponseEntryKeepsHook in pkg/kfake (fails pre-fix).
- sink.seqResps referenced a 'seqRespsMu' that has not existed since the field became a ring with an internal mutex.
- doSequenced and produceRequest.firstCancelingCtx claimed request cancellation applies 'if and only if' idempotency is disabled; AllowIdempotentProduceCancellation (0338467) is a second opt-in (mutually exclusive with transactions via config validation).
- Produce's promise documentation said promises 'should be relatively fast'; make the real contract explicit: promises are called serially and must not block on the client. finishRecordPromise calls the promise before decrementing buffered counts, and the cond broadcast that wakes blocked producers and Flush is deferred until the promise worker drains its queue, so a promise blocking on Produce-at-limit or Flush waits on a wakeup only its own return can deliver. AbortingFirstErrPromise already dodges this by spawning a goroutine.
Round 3 of the audit program: subsystem sweep of pkg/kgo/sink.go.
- TestAuditTxnV1AddPartitionsFatalKeepsEarlierAdds: a fatal AddPartitionsToTxn for a new partition must not un-mark partitions that joined the transaction via an earlier acked add; pre-fix, EndTransaction(TryCommit) returned nil without issuing EndTxn while the broker held an ongoing transaction with an appended batch (timeout-aborted later = silent loss after a nil commit).
- TestAuditTxnV1AddPartitionsRetriableStripControl: a retriable per-partition add error strips, requeues, re-adds, and produces within one Flush; held pre-fix and guards the txnReq membership check.
- TestAuditTxnAbortRetryAfterOperationNotAttempted: the documented TryAbort retry after a not-attempted commit must issue EndTxn; pre-fix the commit attempt consumed inTxn/addedToTxn and the retry silently no-opped (GroupTransactSession.End's internal retry-as-abort included).
- TestAuditProduceDuplicateResponseEntryKeepsHook: a duplicated produce response topic entry must not erase the finished batch's metrics; pre-fix OnProduceBatchWritten never fired for the produced batch.
All three BUG REPRODUCED tests verified failing against pre-fix kgo (51f4fcb) and passing with the sink-sweep fixes, -race.
…oker
A fetch response can redirect a partition to a preferred read replica
(KIP-392 follower fetching) whose broker the client has not yet learned
from metadata - e.g. a freshly added replica that the fetched-from broker
already knows about before the client's periodic refresh catches up.
source.fetch deletes such a cursor from the request's used offsets and
calls cursorOffsetPreferred.move() to migrate it. When move() found no
source for the preferred broker it triggered a metadata update and
returned WITHOUT re-enabling the cursor. The cursor was use()'d (made
unusable) when the request was built, and nothing else makes it usable
again:
- the defer in fetch() that would finishUsing()/allowUsable() it skips
it, because move()'s caller already deleted it from the used offsets;
- the leader is unchanged, so the triggered metadata refresh performs no
cursor migration (migrateCursorTo only runs on a leader/epoch change),
and merely learning the new broker never touches cursor usability.
The partition is then silently never consumed again until an unrelated
session restart (rebalance, assign, or a real leader change). No error
surfaces - a silent stall.
Re-enable the cursor on its current (leader) source in the !exists arm via
allowUsable(). We keep consuming from the leader, and the forced metadata
update lets a later fetch's preferred replica be honored once the broker
is known. triggerUpdateMetadataNow coalesces (non-blocking send), so the
re-fetch loop cannot spam metadata.
Regression: TestAuditPreferredReplicaUnknownBrokerNoStrand in
pkg/kfake/source_sweep_test.go consumes 0/5 records pre-fix (stranded) and
5/5 post-fix.
processRecordBatch read batch.NumRecords straight off the wire and passed it to ensureLen, which sizes a slice with s[:n]. A buggy or malicious broker (the CRC only guards accidental in-transit corruption, and the exported ProcessFetchPartition accepts arbitrary caller input) sending a record batch with NumRecords < 0 panicked the fetch goroutine with "slice bounds out of range [:-1]", crashing the client. A hostile huge NumRecords drove an unbounded up-front allocation of NumRecords * sizeof(kmsg.Record). Guard both: reject a negative count as a corrupt batch (matching the Java client's DefaultRecordBatch "Found invalid record count" InvalidRecordException), and clamp the count to the available record bytes - every record needs at least one byte, so a count exceeding the byte count is impossible for a well-formed batch. The true decodable count is still recomputed by readRawRecordsInto, and the KAFKA-5443 truncation defer leaves the offset unadvanced whenever it disagrees with numRecords, so valid batches are unaffected (their count is always below the byte count). Regression: TestAuditFetchNegativeRecordCountNoPanic (panics pre-fix) and TestAuditFetchHugeRecordCountBounded in pkg/kfake/source_sweep_test.go.
Round 5 of the FRANZ_AUDIT.md program (subsystem sweep of pkg/kgo
source.go + record_and_fetch.go: the fetch path). Regression tests for the
two fixes landed this round:
- TestAuditPreferredReplicaUnknownBrokerNoStrand: a fetch hijacked into a
preferred-replica move to an unknown broker must not strand the cursor
(0/5 records pre-fix, 5/5 post-fix).
- TestAuditFetchNegativeRecordCountNoPanic: a record batch with a
negative record count must error, not panic in ensureLen.
- TestAuditFetchHugeRecordCountBounded: an oversized record count must
not drive a giant up-front allocation.
buildRecordBatchBytes hand-builds a v2 record batch with a valid
Castagnoli CRC, so the corrupt-count batches still pass the client's
default CRC check - the realistic malicious/buggy-broker case, since CRC
only defends against accidental in-transit corruption.
handleReqResp looked up each response partition in req.usedOffsets and
processed it with no guard against the same partition (or topic) appearing
more than once in one response. A correct broker never duplicates, but
when a buggy or hostile one did, the partition was processed against the
same *cursorOffsetNext twice:
- the partition's error/records surfaced to the user twice (two
FetchPartition entries for one partition);
- a duplicated preferred-replica redirect enqueued two move() calls for
one cursor - the second reads/writes cursor.source after the first
made the cursor eligible on its new source, the exact concurrent-
source hazard that #1167 guards against.
Track the *cursorOffsetNext pointers already handled in a per-response
seen set and skip (with a warning) any repeat. We deliberately do NOT
dedup by deleting from req.usedOffsets: that map is what re-enables each
cursor after the response, so removing an entry would strand the
legitimately-processed cursor. Keying by the stable per-request
*cursorOffsetNext also collapses a duplicated topic (same partition
pointer) while leaving a topic legitimately split across response entries
(distinct partitions) fully processed.
Regression: TestAuditFetchDuplicatePartitionDeduped in
pkg/kfake/source_sweep_test.go (error surfaces twice pre-fix, once post-fix).
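The per-response seen set can be sketched as follows. This is an illustration under assumed types: cursorOffset, respPartition, and handleOnce are hypothetical stand-ins, not the pkg/kgo definitions.

```go
package main

import "fmt"

// cursorOffset stands in for the stable per-request *cursorOffsetNext.
type cursorOffset struct {
	topic     string
	partition int32
}

// respPartition identifies one partition entry in a fetch response.
type respPartition struct {
	topic     string
	partition int32
}

// handleOnce walks the response entries and processes each cursor at most
// once. It deliberately never deletes from used: in the real client that map
// is what re-enables cursors after the response.
func handleOnce(used map[respPartition]*cursorOffset, resp []respPartition) (handled []*cursorOffset, skipped int) {
	seen := make(map[*cursorOffset]struct{}, len(resp))
	for _, rp := range resp {
		c, ok := used[rp]
		if !ok {
			continue // partition we did not ask for
		}
		if _, dup := seen[c]; dup {
			skipped++ // the real code logs a warning here
			continue
		}
		seen[c] = struct{}{}
		handled = append(handled, c)
	}
	return handled, skipped
}

func main() {
	used := map[respPartition]*cursorOffset{
		{"t", 0}: {"t", 0},
		{"t", 1}: {"t", 1},
	}
	resp := []respPartition{{"t", 0}, {"t", 0}, {"t", 1}}
	handled, skipped := handleOnce(used, resp)
	fmt.Println(len(handled), skipped, len(used))
}
```

Keying the set by pointer rather than by name is what lets a duplicated topic collapse while distinct partitions of the same topic are still processed.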
TestAuditFetchDuplicatePartitionDeduped injects a fetch response listing one partition twice with a non-retryable error and asserts the client surfaces it exactly once - exercising the handleReqResp dedup that guards against a buggy/hostile broker duplicating a partition (or topic) in a single response.
A heartbeat response can assign topic IDs the client cannot yet map to
names (newly created topic, metadata not refreshed). Those IDs park in
g848.unresolvedAssigned and are folded into every heartbeat's owned
Topics so the server sees them acknowledged.
That state was never cleared on a member reset. After any rejoin-class
event while a topic was still unresolved (UNKNOWN_MEMBER_ID after an
outage, FENCED/STALE_MEMBER_EPOCH, or any manage-loop error), the next
initialJoin's request carried the old member's unresolved topics in
Topics at MemberEpoch 0. The broker rejects every (re)join whose
owned-partitions list is non-empty with INVALID_REQUEST
("TopicPartitions must be empty when (re-)joining"), and the only
other thing that clears unresolvedAssigned is a successful
assignment-carrying response - which a rejected join never produces.
The consumer was permanently wedged: every join retry failed
identically until process restart.
Clear unresolvedAssigned in initialJoin next to the other per-member
resets. The join response always re-delivers the member's full
assignment, so nothing is lost. The Java client clears its equivalent
unresolved-IDs cache the same way on every transition to joining, and
the in-code invariant comment ("our initialJoin should *always* have
an empty Topics in the request") already promised this.
Repro: TestAudit848StaleUnresolvedJoin in pkg/kfake (fails pre-fix
with the join loop erroring INVALID_REQUEST forever; kfake now
mirrors the broker's join validation).
… notification
The 848 manage loop silently restarts the heartbeat session on
retryable broker/dial/coordinator errors and counts the restarts in
consecutiveTransientRestarts; every cfg.retries consecutive restarts
it injects a fake fetch error (ErrGroupSession "heartbeat has been
failing for N consecutive attempts") so the user learns the group is
unreachable - that injection is the path's only user-visible signal.
The transient arm set err = nil to keep the session loop going, but
then fell through to the shared 'if err == nil reset' tail, zeroing
the counter on the same iteration that incremented it. The counter
could never pass 1, so with the default 20 retries the notification
was unreachable: a permanently unreachable coordinator was a silent
infinite restart loop.
Continue the loop directly from the transient arm instead. The reset
now only runs for exits whose nil err means a processed response
(rebalance / reassignment), which is what the counter's comment
("without any successful heartbeat in between ... Reset on success")
always intended.
Repro: TestAudit848TransientRestartNotification in pkg/kfake (fails
pre-fix: NOT_COORDINATOR on every heartbeat never surfaces any error
to poll).
should848 consulted broker-advertised versions only. A user pinning
MaxVersions to a version set whose ConsumerGroupHeartbeat max is 0 (e.g.
pinning 3.7/3.8 behavior against a 4.x cluster) got v0 on the wire while
the manage loop ran v1 semantics. v0 has no SubscribedTopicRegex field, so
kmsg silently dropped the regex from the join: a regex consumer joined with
no subscription at all and silently consumed nothing, forever. The classic
fallback - the right outcome for a sub-v1 cap - never happened because the
broker itself advertises v1.
Gate supportsKIP848v1 on the MaxVersions cap as well, mirroring
supportsKIP890p2 (which gained the same guard for the KIP-890 wire
versions).
Repro: TestAudit848MaxVersionsV0FallsBackToClassic in pkg/kfake (fails
pre-fix: nothing is ever consumed; post-fix the client falls back to the
classic protocol and consumes).
…ERROR
GroupTransactSession.End retries EndTransaction on a handful of error
codes. The OperationNotAttempted and TransactionAbortable arms both set
willTryCommit=false ("retry as abort") before the goto, because
EndTransaction consumes inTxn on its erroring call and a re-call no-ops
at its `if !inTxn { return nil }` guard. The UNKNOWN_SERVER_ERROR arm,
added in 3ecaff2 when the OperationNotAttempted `if` became a `switch`,
omitted that downgrade.
The consequence is silent EOS data loss. Trace:
1. End(TryCommit): offsets commit OK, heartbeat OK => willTryCommit=true.
2. EndTransaction(commit=true) sets inTxn=false, issues EndTxn, the
broker answers UNKNOWN_SERVER_ERROR. That code is uniquely excluded
from failProducerID, so the producer ID stays good and inTxn stays
false (only the OperationNotAttempted arm restores it).
3. End's USE arm goto-retries WITHOUT willTryCommit=false.
4. The retried EndTransaction(commit=true) returns nil at its !inTxn
guard, sending no EndTxn.
5. endTxnErr==nil and willTryCommit==true, so End's success tail reports
a committed transaction and setOffsets(postcommit) advances the
consumer offsets - even though the broker's UNKNOWN_SERVER_ERROR left
the commit unconfirmed and the transaction may have aborted. The
consumer's input offsets move past records whose output may never
become visible: silent loss.
Some brokers return UNKNOWN_SERVER_ERROR from EndTxn (the surrounding
comments cite Redpanda in certain versions); conformant Kafka does not
emit it there, so happy-path Kafka/kfake testing never hit this.
Fix: set willTryCommit=false in the USE arm, mirroring its two sibling
arms. The no-op retry then reports not-committed and End rewinds to the
last committed offsets, so the caller reprocesses (at-least-once) instead
of advancing past an unconfirmed commit. The dead backoff timer (it only
ever gated the no-op retry, never a real re-issue) is dropped so the arm
matches the others; `time` stays in use elsewhere in the file.
This extends f25fe06's discipline (restore/abort the consumed txn state
so the documented EndTransaction retry behaves) from the
OperationNotAttempted arm to its UNKNOWN_SERVER_ERROR sibling
(FRANZ_AUDIT pattern 17, plus pattern 22: a manufactured nil treated as
success by an err==nil tail).
Repro: TestAuditTxnEndTxnUnknownServerErrorNotFalseCommit in
pkg/kfake/txn_resweep_test.go injects UNKNOWN_SERVER_ERROR on every
EndTxn during a GroupTransactSession commit; End(TryCommit) returns
committed=true pre-fix and committed=false post-fix (-race).
TestAuditTxnEndTxnUnknownServerErrorNotFalseCommit drives a GroupTransactSession through a consume-then-commit and injects UNKNOWN_SERVER_ERROR on every EndTxn via a keep-forever control. Pre-fix End(TryCommit) returns committed=true (the consumer offsets were advanced for a transaction the broker may have aborted); post-fix it returns committed=false and rewinds. Companion to the kgo fix in the prior commit.
PurgeTopicsFromConsuming (consumer.purgeTopics) told the manage loop the subscription shrank by calling g.rejoin unconditionally. For a classic group that is correct: it bounces the heartbeat session so the member re-joins and the group rebalances. For a next-gen (848) group it is wrong, and it violates the invariant that ForceRebalance's 848 redirect documents as its own safety justification: "nothing feeds rejoinCh in 848 mode." In 848 the shared heartbeat loop receives the rejoinCh signal and converts it to RebalanceInProgress - NOT errReassigned848 - so stopHeartbeating stays false and heartbeats keep running while the session-end revoke (revokeThisSession) executes g.nowAssigned.write(), a clone-modify-store read-modify-write. A concurrent heartbeat's handleResp does g.nowAssigned.store() with the server's latest assignment. If the revoke clones before the heartbeat stores but stores after, the server's assignment is lost from nowAssigned: the client briefly under- or over-claims partitions (delayed pickup of newly assigned partitions, or a brief over-claim of reassigned ones). 848 servers re-send the target assignment until the member's owned set matches, so it self-heals within a heartbeat - but the unnecessary session bounce and the lost-update race are exactly what ForceRebalance's 848 redirect was added to prevent. purgeTopics was the lone subscription-change feeder that forgot the 848 guard its two siblings carry (findNewAssignments forces a heartbeat for new topics; ForceRebalance redirects). Fix: centralize the "subscription changed, reconcile per protocol" dispatch in groupConsumer.signalSubscriptionChange (classic => rejoin; 848 => a best-effort forced heartbeat whose next request re-reports the live subscription, the server then driving the revoke through normal reconciliation). 
Route all three feeders - findNewAssignments, ForceRebalance, and purgeTopics - through it so the 848-vs-classic difference cannot be re-derived and forgotten at a future call site, which is the root cause here. Classic behavior and the two already-correct sites are unchanged. The lost-update race is not deterministically reproducible, so TestSignalSubscriptionChange848 asserts the fix's mechanism: in 848 mode the dispatch forces a heartbeat and never feeds rejoinCh; in classic mode it feeds rejoinCh. It fails pre-fix (the 848 arm fed rejoinCh).
…ies(0) The KIP-848 manage loop silently restarts the heartbeat session on retryable broker/dial/coordinator errors and, every cfg.retries consecutive restarts, injects a fake fetch error (ErrGroupSession "heartbeat has been failing for N consecutive attempts") so the user learns the group is unreachable - that injection is the path's only user-visible signal. The "every cfg.retries-th restart" gate was `restarts >= cfg.retries && restarts % cfg.retries == 0`. cfg.retries has no floor (RequestRetries(n) stores n verbatim; default 20), so a user who sets RequestRetries(0) makes the modulo `restarts % 0` - an integer divide-by-zero panic. The panic fires on the very first transient heartbeat error (a broker restart, EOF, NOT_COORDINATOR, a refused dial - all routine) on the manage848 goroutine, which is unrecovered and so crashes the whole client process. The sibling in-session gate (heartbeat(), `hbBrokerRetries < cfg.retries`) uses `<` and is panic-safe; only this modulo arm divides. retries=0 also disables the in-session heartbeat retry, so heartbeat() propagates the first transient error immediately and every transient error is its own restart. Dropping the notification for retries=0 would re-introduce the silent-infinite-restart-loop that aae08a3 fixed for the general case, so the fix must still fire it - on every restart. shouldNotify848Restart treats retries < 1 as "notify on every restart," preserving the exact every-Nth-restart behavior for retries >= 1. TestShouldNotify848Restart covers retries=0 (recovers the pre-fix panic into a clean failure, since the real panic is on a background goroutine) and the retries=2 cadence.
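A sketch of the gate as the message describes it, assuming a plain function over two ints. The name shouldNotifyRestart is illustrative; the real helper is shouldNotify848Restart and its exact signature is not shown in this message.

```go
package main

import "fmt"

// shouldNotifyRestart reports whether a user-visible error should be injected
// after the given number of consecutive transient restarts.
//
// retries < 1 means "notify on every restart": it avoids the restarts % 0
// divide-by-zero and still keeps the loop from failing silently.
func shouldNotifyRestart(restarts, retries int) bool {
	if retries < 1 {
		return true
	}
	return restarts >= retries && restarts%retries == 0
}

func main() {
	for restarts := 1; restarts <= 4; restarts++ {
		fmt.Println(restarts,
			shouldNotifyRestart(restarts, 0), // every restart
			shouldNotifyRestart(restarts, 2)) // every 2nd restart
	}
}
```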
TestAudit848PurgeReconcilesViaHeartbeat is the end-to-end guard for the kgo fix "do not feed rejoinCh from a next-gen (KIP-848) topic purge": an 848 group consuming two topics must keep consuming the kept topic after PurgeTopicsFromConsuming drops the other, with the dropped topic not reappearing. The purge now reconciles through a forced heartbeat rather than a rejoinCh session bounce. The mechanism repro is TestSignalSubscriptionChange848 in pkg/kgo; this exercises the real consumer/purge path against kfake.
writeTxnMarkersSharder.shard re-buckets each marker by the leader of its partitions, grouping markers under a pidEpochCommit key and rebuilding a WriteTxnMarkersRequestMarker from that key. The key (and the rebuilt marker) carried ProducerID, ProducerEpoch, Committed, and TransactionVersion but never CoordinatorEpoch, so every sharded marker went out with CoordinatorEpoch=0 regardless of the user's value, and two markers that differ only in CoordinatorEpoch collapsed into one bucket. CoordinatorEpoch is a v0+ field the broker uses to detect fenced writers: a real broker passes it to the group coordinator's completeTransaction for __consumer_offsets markers and embeds it verbatim into the EndTransactionMarker control record written to every other partition's log. Dropping it to 0 stamps the wrong coordinator epoch into the on-disk marker and feeds epoch 0 to the group-offset fencing path. The bug is reachable via kadm.WriteTxnMarkers (which sets rm.CoordinatorEpoch and then RequestShards) and any raw RequestSharded of a WriteTxnMarkersRequest. CoordinatorEpoch has been dropped since the sharder was added (83f0dbe); 58e9695 later added txnVersion to the same key but missed it. This is the WriteTxnMarkers sibling of the round-10 AddPartitionsToTxn VerifyOnly fix (1e9b852): a txn sharder silently losing a per-request fencing field. TestAuditWriteTxnMarkersPreservesCoordinatorEpoch drives the real sharder with a marker carrying CoordinatorEpoch=5 and asserts it survives into the issued shards; it reports 0 pre-fix.
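The failure mode is a general one for map-keyed regrouping: any field left out of the key is both lost on rebuild and ignored when deciding whether two entries are the same. A small sketch, assuming simplified hypothetical types (markerKey, marker, bucket) rather than the real pidEpochCommit and kmsg structures:

```go
package main

import "fmt"

// markerKey is a hypothetical bucket key. Every field that must survive the
// rebuild, including coordinatorEpoch, has to be part of it.
type markerKey struct {
	producerID       int64
	producerEpoch    int16
	committed        bool
	coordinatorEpoch int32
	txnVersion       int8
}

type marker struct {
	key        markerKey
	partitions []int32
}

// bucket regroups partitions under their full key, so markers that differ
// only in coordinatorEpoch stay in separate buckets.
func bucket(ms []marker) map[markerKey][]int32 {
	out := make(map[markerKey][]int32)
	for _, m := range ms {
		out[m.key] = append(out[m.key], m.partitions...)
	}
	return out
}

func main() {
	a := markerKey{producerID: 1, producerEpoch: 2, committed: true, coordinatorEpoch: 5}
	b := a
	b.coordinatorEpoch = 6
	got := bucket([]marker{{a, []int32{0}}, {b, []int32{1}}})
	fmt.Println(len(got), got[a], got[b])
}
```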
…tionIsOpen EnsureProduceConnectionIsOpen issues a forceOpenReq to each target broker. handleReq opens the connection via loadConnection (dial + ApiVersions + SASL), then -- for a forceOpenReq -- rewrites the request to a throwaway ApiVersions and sends it through waitResp to probe the connection end to end. On an acks=0 produce connection that probe is unsafe. loadConnection routes the forceOpenReq (it embeds a ProduceRequest, key 0) to cxnProduce, and init spawns the discard goroutine (hasDiscard) that owns ALL reads on the connection -- no promisedResp is ever pushed for acks=0 produce. The rewritten ApiVersions is response-expecting, so the isNoResp switch does not match it and waitResp starts a handleResps reader on the same socket the discard goroutine is already reading: two concurrent io.ReadFulls split one byte stream. The reply is consumed (in whole or part) by whichever reader the kernel wakes, the other desyncs, the connection dies, and EnsureProduceConnectionIsOpen -- meant to REDUCE produce latency by pre-warming the connection -- instead hangs on the stranded read until its timeout and returns a spurious error, having killed the very connection it tried to warm. This affects every broker (the discard goroutine runs for any acks=0 produce connection, not only EventHubs). This is the same concurrent-reader hazard the in-place SASL reauth path already avoids (the expiry arm gates on !hasDiscard; loadConnection recreates an expired discard connection rather than reauthing it in place). The force-open arm simply predated the hasDiscard mechanism. The round-11 broker sweep asserted "discard connections never push promisedResps" -- true for the produce/reauth paths it traced, but the force-open arm is a third sibling that does. Fix: a force-open request on a hasDiscard connection reports success without writing or reading anything. 
The connection is already fully open (init proved it works end to end); there is nothing more to probe, and we must not start a second reader. Non-discard force-opens are unchanged (handleResps is their sole reader). Repro: TestAuditEnsureProduceConnectionAcks0NoConcurrentRead (pkg/kfake) wraps each dialed conn in a concurrent-read detector and delays the ApiVersions response so the discard read and the force-open read are reliably both blocked at once; it observes a second concurrent reader and an EnsureProduceConnectionIsOpen failure pre-fix, neither post-fix (-race).
…ssion test TestAuditEnsureProduceConnectionAcks0NoConcurrentRead reproduces the force-open-vs-discard concurrent-reader race fixed in the prior commit. An acks=0 produce connection runs the discard goroutine, which owns all reads. Pre-fix, EnsureProduceConnectionIsOpen's force-open request was rewritten to a response-expecting ApiVersions and sent through waitResp, starting a handleResps reader that raced the discard goroutine on the same socket. The test wraps every dialed connection in a concurrent-read detector (records whether two goroutines were ever inside Read on one connection at once) and installs a SleepControl on ApiVersions so the discard read and the force-open read are deterministically both blocked during the delay; the race is otherwise dependent on which blocked reader the kernel wakes first. Pre-fix the detector fires and EnsureProduceConnectionIsOpen returns an error; post-fix neither happens. Runs under -race.
A Produce that hits the max-buffered limit (auto-flush mode) parks in a goroutine waiting for space; Flush and BufferedProduceRecords account for it via the bufferedRecords + blocked sum. On the SUCCESS path the parked goroutine decrements blocked and the caller increments bufferedRecords under one continuous lock hold, so the sum is unchanged and no waiter needs waking. On the CANCEL path (record / produce / client context done -> drainBuffered) the record is failed, not buffered: blocked is decremented with no compensating bufferedRecords++, so the sum drops by one. The only broadcast on that path is the one drainBuffered issues to wake the parked goroutine, and it fires BEFORE the decrement. A Flush waiting on the sum that re-checks its predicate in that window observes the stale pre-decrement value and goes back to waiting; when the decrement then brings the sum to zero nothing broadcasts, so Flush(context.Background()) hangs forever (a Flush with a cancelable context degrades to a context-timeout). The classic trigger is a graceful shutdown: the last buffered record completes - waking Flush, which re-waits because a concurrently-blocked produce is still counted - just as that blocked produce's context is canceled. Fix: drainBuffered broadcasts after releasing the lock, i.e. after the blocked decrement is visible, mirroring how every other sum-changing site notifies the cond. The success path is unchanged (it conserves the sum and needs no broadcast). The exact lost wakeup is a scheduler race - whether the waiting Flush re-acquires the lock before the parked goroutine decrements - so the repro TestAuditFlushWokenByBlockedProduceCancel drives the scenario in a loop: pre-fix an iteration hangs (the watchdog fires on iteration 0 in practice), post-fix all 200 iterations complete because the cancel path now broadcasts after the decrement (-race).
…rrier migrateShareCursorTo relocates a share cursor between sources when a metadata refresh observes a leader change for a share partition (mergeTopicPartitions's partitionKindShare arm, on the metadata loop). It does the same removeShareCursor/addShareCursor as the CurrentLeader-hint sibling applyMovesBlocking, but it was the one share-cursor relocation that did not register with the share consumer's worker barrier. The share consumer drains acks per source on leave: it waits for every share worker to exit (for sc.workers > 0), then calls closeShareSession on a snapshot of the source list, draining each source's cursors and setting shareCursor.closed so post-drain user acks are callback'd rather than stranded. A cursor that escapes every drain keeps its pending acks with no drainer: sc.pendingAcks never returns to 0 (FlushAcks hangs) and the held records release only via the broker's acquisition-lock timeout. applyMovesBlocking already guards this by registering its migration via sc.incWorker/decWorker, so leave's barrier waits for an in-flight move (or, if already dying, incWorker bails and the cursor stays on its current source for closeShareSession to drain). migrateShareCursorTo ran on the metadata loop with no such registration, so a metadata-driven leader-change migration racing LeaveGroup/Close could remove the cursor from its old source before that source drained and add it to an already-drained source (or one created after leave's snapshot), stranding the cursor's pending acks. The metadata loop stays alive throughout leave -- cl.ctx is canceled only after Close waits on c.s.left -- and c.s.tps still holds the share partitions, so the merge can migrate during the leave window. Fix: wrap migrateShareCursorTo's source swap in sc.incWorker/decWorker, mirroring applyMovesBlocking. new.shareCursor is assigned before the incWorker check so the stored partition data is valid on the dying-skip path. 
This extends commit 1006894 (which supervised the applyMoves migration) to its metadata-merge sibling. The end-to-end strand is a shutdown race not deterministically reproducible, so TestAuditShareMetadataMigrationWaitsForLeave asserts the mechanism: it parks the migration after its incWorker by holding sinksAndSourcesMu (the first lock the swap takes) and verifies the migration registers in sc.workers, exactly what leave's barrier waits on. Pre-fix the migration takes sinksAndSourcesMu first with no incWorker, so it never registers and the poll times out.
A topic's partitions are led by different brokers, so the same topic is returned in separate Fetch entries (one Fetch per broker response). EachTopic groups those entries by topic name into a single FetchTopic. When TopicID was added to FetchTopic (4bfb0c6), the len(fs)==1 fast path was updated to pass the whole FetchTopic through (preserving the ID), but the multi-fetch grouping path rebuilt FetchTopic from a name=>partitions map with a hard-coded zero TopicID. The result: EachTopic returned a zero TopicID whenever more than one broker replied. A topic's partitions are spread across brokers and arrive in 2+ Fetch entries, so this is the normal case in a real multi-broker cluster; only a single-broker poll hits the len(fs)==1 path that preserves the ID. The bug is therefore invisible in single-broker tests but live in production -- a "works in test, broken in prod" trap. Any caller building a TopicID-keyed structure from EachTopic saw every topic collapse onto the zero ID. No data loss / duplicate / stall (TopicID is informational, Kafka 3.1+), but the field silently contradicts both its own documented contract and the sibling len(fs)==1 path. Carry the TopicID across the grouped Fetch entries: the broker returns the same ID in every fetch response for a topic, so the first non-zero copy is authoritative; topics with no ID (pre-3.1 brokers, share fetches, which do not set it) stay zero. The classic/direct consume path populates the ID (source.go fetchTopic build), so the fix is observable there. TestEachTopicPreservesTopicID covers the multi-fetch case (fails pre-fix: the grouped TopicID is zero), plus the single-fetch and no-ID cases, all under -race.
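A sketch of the grouping rule (first non-zero ID wins), assuming simplified hypothetical types. topicID, fetchTopic, and groupTopics are stand-ins and do not reproduce the kgo FetchTopic layout:

```go
package main

import "fmt"

// topicID is a 16-byte ID; the zero value means "no ID known".
type topicID [16]byte

type fetchTopic struct {
	Topic      string
	TopicID    topicID
	Partitions []int32
}

// groupTopics merges per-broker fetch entries by topic name. The ID of the
// merged entry is the first non-zero ID seen for that topic; topics that
// never carry an ID stay zero.
func groupTopics(fetches [][]fetchTopic) []fetchTopic {
	idx := make(map[string]int)
	var out []fetchTopic
	for _, f := range fetches {
		for _, t := range f {
			i, ok := idx[t.Topic]
			if !ok {
				idx[t.Topic] = len(out)
				out = append(out, fetchTopic{
					Topic:      t.Topic,
					TopicID:    t.TopicID,
					Partitions: append([]int32(nil), t.Partitions...),
				})
				continue
			}
			if out[i].TopicID == (topicID{}) {
				out[i].TopicID = t.TopicID
			}
			out[i].Partitions = append(out[i].Partitions, t.Partitions...)
		}
	}
	return out
}

func main() {
	id := topicID{1}
	got := groupTopics([][]fetchTopic{
		{{Topic: "t", Partitions: []int32{0}}},              // broker 1, no ID
		{{Topic: "t", TopicID: id, Partitions: []int32{1}}}, // broker 2
	})
	fmt.Println(len(got), got[0].TopicID == id, got[0].Partitions)
}
```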
…r when a stale pin survives the no-selection fallback
The adaptive UniformBytesPartitioner pins a partition and re-picks by
weighted-random selection only when the pin is no longer usable. The
re-pick block is entered whenever the pinned p.onPart is invalid: either
the byte-window reset cleared it to the -1 sentinel, or - the case this
fixes - the pinned index is now >= n because the writable partition count
shrank under us. writablePartitions holds only partitions with no load
error (metadata.go:588-597), so a routine leader election or rolling
restart that briefly leaves some partitions leaderless drops them from the
writable set while the full partition count is preserved; a partitioner
pinned to a now-dropped partition (and not crossing its byte threshold this
call, so onPart is not reset to -1) enters the re-pick with p.onPart >= n.
The weighted-selection loop normally assigns a fresh in-range index, but it
can select nothing: floating-point rounding can leave pick just above 0
after subtracting every weight (the case the original "if p.onPart == -1"
fallback was guarding). That fallback only fired for the -1 sentinel, so
when the loop selected nothing AND the re-pick was entered with a stale
p.onPart >= n, the stale index was returned unchanged. doPartition then
rejects it as an out-of-range partitioning choice and FAILS the record
("invalid record partitioning choice of %d from %d available") rather than
producing it.
The sibling non-adaptive branch re-picks unconditionally via Intn(n) and so
never had this hole. Track whether the loop selected anything and fall back
to the last partition when it did not, regardless of the prior p.onPart
value, matching the non-adaptive branch's guarantee that a re-pick always
yields an in-range index. Behavior on the -1 sentinel path and the normal
pick path is unchanged.
The production trigger needs both a writable-shrink window and a ~1/2^53
float-rounding draw, and the failure is loud (the record's promise gets the
error) and self-healing (the next non-edge record re-picks a valid
partition), so the severity is low - but the fix is a one-line robustness
change aligning the adaptive branch with its non-adaptive sibling. The
regression test forces the "loop selects nothing" path deterministically
with a -1 backup (weight 1/0 = +Inf, so pick - Inf is never <= 0), a
Go-version-independent stand-in for the rounding fallthrough, and asserts
the re-pick stays in range after a writable-partition shrink: it returns
the stale partition 5 (for n=3) pre-fix, an in-range partition post-fix.
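The fallback can be sketched as below. repick is a hypothetical function, the weights are plain floats rather than the partitioner's real selection table, and the "loop selects nothing" case is simulated by passing a pick slightly above the weight total:

```go
package main

import "fmt"

// repick walks the weights and returns the first index at which the running
// remainder drops to zero or below. If nothing is selected, it falls back to
// the last index regardless of the previous pin, so the result is always in
// range. It assumes len(weights) > 0.
func repick(weights []float64, pick float64, stalePin int) int {
	onPart := stalePin
	picked := false
	for i, w := range weights {
		pick -= w
		if pick <= 0 {
			onPart = i
			picked = true
			break
		}
	}
	if !picked {
		onPart = len(weights) - 1
	}
	return onPart
}

func main() {
	weights := []float64{1, 1, 1}
	fmt.Println(repick(weights, 1.5, 5))  // normal pick
	fmt.Println(repick(weights, 3.25, 5)) // nothing selected: fall back
}
```

Checking a picked flag instead of comparing against the -1 sentinel is what covers the stale pin >= n case.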
…n, not broker
The adaptive arm's doc and internal comment described the weighting as
per-broker ("chooses a broker based on the inverse of the backlog ... for
that broker"), but the partitioner weights strictly per-partition: the
TopicBackupIter yields per-partition buffered record counts and the
selection table is built per partition - there is no broker aggregation at
the partitioner layer. The KIP-794 partitioner this mirrors also weights
per-partition (its load stats are per-partition queue sizes), so the
"broker" wording was inaccurate against both this code and the reference.
Reword the two mechanism descriptions to say partition. The user-facing
intent sentence (favoring less-loaded brokers) is left as-is: favoring
fast-draining, low-backlog partitions does effectively steer more produce
to responsive brokers, which is the documented goal.
…outs
Two RecordReader bugs, both reachable from valid input, fixed under one cohesive robustness change. Extends the R16 hardening (ed19927), which guarded readSize's allocation against a hostile size but left these two consume-side gaps.
1. A truncated fixed-size read panics the reader (Medium). next()'s io.EOF handling falls through to fn.parse(r.buf, rec) with an empty r.buf. readSize reports plain io.EOF (not io.ErrUnexpectedEOF) only when it read zero bytes; a partial read is already io.ErrUnexpectedEOF and returns early. When such a zero-byte read is the LAST fn after an earlier real read, it is not the clean record boundary (the boundary check requires i==0 or a preceding noread), so it reaches parse with an empty buffer. The fixed-width number parsers index r.buf at constant offsets (binary.BigEndian.Uint64's b[7], ..., the byte reader's b[0]) and panic on a short slice.
Trace: a binary layout ending in a fixed-width field is ordinary, e.g. "%p{big32}%o{big64}" (4-byte partition + 8-byte offset per record). A stream/file holding N whole records plus a partial final record whose leading field(s) are present but whose trailing fixed-size field is cut reaches the trailing field with zero bytes -> io.EOF -> empty-buffer fall-through -> panic, crashing any consumer of the API (e.g. kcl). This also violates ReadRecord's documented contract, which promises io.ErrUnexpectedEOF for a mid-record EOF.
Fix: before parsing, treat a fixed-size read (read.size > 0) whose buffer is short as the truncation it is and return io.ErrUnexpectedEOF. Text and value reads (sizefn, delim, regexp, json) are unaffected: their parsers tolerate an empty buffer, so an empty trailing value stays valid.
2. A read-nothing layout loops forever (Low). A layout of only fixed-number verbs (e.g. "%p{3}") builds an all-noread fns list. next() then never performs a read, never hits EOF, and never sets r.done, so ReadRecord returns identical records forever -- an unbounded produce loop in kcl. Reject such a layout at construction (reads == 0), matching the parse-time rejection R16 added for other malformed layouts.
Repros in record_formatter_test.go, both fail pre-fix:
- TestRecordReaderTruncatedFixedSizeNoPanic: five truncated binary layouts (big/little 64/32/16, byte) panic pre-fix ("index out of range, length 0"), return io.ErrUnexpectedEOF post-fix.
- TestNewRecordReaderRejectsBadLayouts: "%p{3}", "%T{3}", "%p{3}%o{4}" returned nil error pre-fix (would loop), error post-fix.
NewRecordFormatter declared and incremented a loop counter `i` that is never read (its reader-side sibling parseReadLayout has no such counter); remove it. Also fix "undersands" -> "understands" in the AppendPartitionRecord doc. No behavior change.
…re field
cfg.validate enforced only a LOWER bound (>= 100ms) on SessionTimeout,
RebalanceTimeout, and ProduceRequestTimeout, and did not validate
TransactionTimeout at all. All four are time.Duration (int64 nanoseconds)
config values that are later cast with int32(d.Milliseconds()) into int32
wire fields:
- JoinGroup SessionTimeoutMs / RebalanceTimeoutMs (consumer_group.go:1415-1416,
consumer_group_848.go:703)
- ProduceRequest TimeoutMs (broker.go:568, sink.go:96)
- InitProducerId TransactionTimeoutMs (producer.go:1098)
A Duration whose millisecond value exceeds math.MaxInt32 (~24.8 days)
silently overflows that cast: a 30 day SessionTimeout becomes
SessionTimeoutMillis = -1702967296 (negative garbage the broker rejects or
mishandles), and a ~50 day one wraps to a small positive value (e.g. ~7h)
that the broker quietly accepts as a completely different timeout - silent
corruption of a user-supplied value with no error. Java cannot reach this
because session.timeout.ms and friends are int32-millisecond typed at the
config source (ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG is Type.INT); kgo
accepts a Duration, so the bound must be enforced client-side.
Add an upper-bound validation row for each of the four, capping at
math.MaxInt32 milliseconds - the wire field's capacity, the only principled
cap. EXTENDS b3cef0a (R16's heartbeat-units fix) in the same validation
table: same policy (the table is the only place these can be guarded before
the wire cast), broader coverage (overflow rather than unit mismatch).
Repro TestConfigRejectsInt32MillisOverflow: 30-day (negative-wrapping) and
50-day (positive-wrapping) values for each of the four are rejected post-fix
and accepted pre-fix; large-but-fitting (20-day) values still validate.
…races
The test asserted the per-batch mid-drain broadcast in finishPromises by racing a flush-observer goroutine against a 20000-link self-feeding promise chain: success required the woken Flush goroutine to be scheduled before the chain drained the cap. Under CPU saturation in the full -race suite the woken goroutine lost that race and the test false-failed (cap hit, not the 30s timeout), even though the code is correct.
Assert the real property at its source instead. Add a test-only producer.onBatchPromiseBroadcast hook, invoked at the per-batch p.c.Broadcast() with moreQueued reporting whether further ring elements are still queued (l > 1, the current element not yet dropPeek'd). The test observes a moreQueued==true broadcast directly on the producer's broadcast path - no goroutine has to win a scheduler race - which is exactly the mid-drain broadcast the fix guarantees. A real Flush goroutine remains as a corroborating end-to-end wake check. chainLinks/capHit are now read only after the chain has fully stopped (chainStopped), so there is no data race.
Verified: target test green at -count=50 under -race (and at -count=30 under 3 background CPU spinners); Producer|Flush|Promise green at -count=5 under -race. Confirmed the test still fails deterministically (fast, no timeout) when the broadcast is moved back to ring-exit-only.
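The hook-based assertion can be illustrated with a much smaller model. This is a sketch under stated assumptions, not the producer's code: drainer, newDrainer, and onBroadcast are hypothetical stand-ins for the promise ring and the test-only hook, and the queue is a plain slice rather than a ring. It only shows why observing the broadcast at its source needs no scheduler race.

```go
package main

import (
	"fmt"
	"sync"
)

// drainer is a hypothetical stand-in for the producer's promise ring.
type drainer struct {
	mu    sync.Mutex
	c     *sync.Cond
	queue []int // pending batches; the contents are irrelevant here

	// onBroadcast is a test-only hook; it stays nil outside tests.
	onBroadcast func(moreQueued bool)
}

func newDrainer(batches int) *drainer {
	d := &drainer{queue: make([]int, batches)}
	d.c = sync.NewCond(&d.mu)
	return d
}

// drain finishes every queued batch and broadcasts once per batch, rather
// than only once when the queue is empty.
func (d *drainer) drain() {
	for {
		d.mu.Lock()
		l := len(d.queue)
		if l == 0 {
			d.mu.Unlock()
			return
		}
		if d.onBroadcast != nil {
			// The current batch has not been dropped yet, so l > 1
			// means more work is still queued behind it.
			d.onBroadcast(l > 1)
		}
		d.c.Broadcast()
		d.queue = d.queue[1:]
		d.mu.Unlock()
	}
}

func main() {
	d := newDrainer(3)
	var midDrain, total int
	d.onBroadcast = func(moreQueued bool) {
		total++
		if moreQueued {
			midDrain++
		}
	}
	d.drain()
	fmt.Println("mid-drain broadcasts:", midDrain, "of", total)
}
```

Because the hook runs synchronously on the draining path, the count of mid-drain broadcasts does not depend on when any waiting goroutine is scheduled.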
appendTo's appendSum closure builds an otelSum for every ".total" metric (producer/consumer connection.creation.total and user MetricTypeSum metrics) but never set isMonotonic on the constructed struct. otelSum.appendTo was already wired to serialize the field (Field 3, bool) and the otelSum struct's own comment promises "We always set isMonotonic to true", but with the field left false the proto3 default elided it entirely - every counter went out as a non-monotonic Sum.
OTLP Sum.is_monotonic declares the counter semantic: a monotonic sum is a true cumulative counter, which downstream OTLP collectors / backends use to distinguish counters from up-down gauges and to compute rates. franz-go's .total sums ARE monotonic cumulative counters, and user MetricTypeSum metrics are enforced non-decreasing at append time (lastTot > um.ValueInt skips), so every sum emitted via appendSum is monotonic. The Java client sets setIsMonotonic(monotonic) for the same counters (SinglePointMetric.sum/deltaSum -> setIsMonotonic; KafkaMetricsCollector treats Total/Sum/CumulativeCount as monotonically increasing). Monotonicity is independent of temporality, so both the delta and cumulative arms set it.
Not data loss/dup/stall - a serialization-fidelity gap that mis-declares the metric's semantic type to the broker's telemetry pipeline. Pattern 51 (a build/rebuild omitting a semantically-significant field) on the serialization axis; pattern 6 (a comment naming behavior the code does not perform).
Repro TestAppendSumIsMonotonic (pkg/kgo/metrics_714_test.go) drives the real appendSum via m.appendTo for both delta and cumulative temporality, walks the serialized OTLP protobuf down to the .total Sum, and asserts is_monotonic==true (absent pre-fix, present post-fix; -race).
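The elision is a property of proto3 encoding rather than of the metrics code. The sketch below is not otelSum.appendTo; appendBoolField is a hypothetical helper that encodes a single proto3 bool field by hand, assuming a field number below 16 so the tag fits in one byte.

```go
package main

import "fmt"

// appendBoolField appends a proto3 bool field to dst. A false value is the
// proto3 default and is left off the wire entirely, so a reader cannot tell
// "explicitly false" from "never set".
// Hypothetical helper; only valid for field numbers 1 through 15.
func appendBoolField(dst []byte, fieldNum byte, v bool) []byte {
	if !v {
		return dst
	}
	// The tag byte is (field number << 3) | wire type; varints use wire type 0.
	return append(dst, fieldNum<<3|0, 1)
}

func main() {
	// Field 3 is is_monotonic in the OTLP Sum message.
	fmt.Printf("false: %x\n", appendBoolField(nil, 3, false))
	fmt.Printf("true:  %x\n", appendBoolField(nil, 3, true))
}
```

With the flag left false nothing is appended, which matches the pre-fix observation that the field was absent from the serialized Sum; with it set, two bytes (0x18 0x01) carry the field.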
pushMetrics' no-requested-metrics arm computed its re-get wait as time.Duration(gresp.PushIntervalMillis) * time.Millisecond with no floor, unlike the push loop which already does max(..., time.Second). A broker that returns an empty RequestedMetrics list (a valid, common "no metrics subscribed right now" state) together with a non-positive PushIntervalMillis - a hostile or buggy broker, or an alt-broker divergence; the field is an int32 the broker fully controls - makes that wait <= 0, so time.NewTimer fires immediately and the loop re-issues GetTelemetrySubscriptions at round-trip pace forever, with a debug log per iteration. The push loop's max(..., time.Second) floor masked that the GET-path sibling had none (pattern 3).
The Java client guards this at the source: ClientTelemetryUtils.validateIntervalMs substitutes DEFAULT_PUSH_INTERVAL_MS (5m) for any interval <= 0. We do the same once, right after a successful GetTelemetrySubscriptions, so BOTH the re-get arm and the push loop pace on a sane value (the push loop's existing floor becomes belt-and-suspenders). A non-positive interval is invalid per the protocol intent, so substituting the documented default rather than honoring it is the correct repair, not merely a lower bound. Pattern 31 (a server-advised retry/interval parameter adopted without a progress/floor bound) - the share-churn 3.3 / source-resweep B2 hot-loop sibling on the telemetry-interval axis.
Repro TestValidatePushIntervalMillis (pkg/kgo/metrics_714_test.go) asserts the extracted validatePushIntervalMillis substitutes the default for non-positive advertised intervals (incl. MinInt32) and leaves positive ones unchanged, and that the substituted interval yields a positive (non-immediate) re-get timer; the end-to-end hot-loop needs a broker injecting an empty-metrics / non-positive response, which kfake's telemetry handler does not expose - the mechanism test is the deterministic guard (R23 non-deterministic-repro precedent).
Several deliberate behaviors that the franz-go audit catalogued as "do not
re-file as a bug" lacked an in-code marker stating WHY they are intentional,
so a future audit (or a well-meaning patch) could re-flag them. Migrate the
load-bearing rationales to concise, sited comments:
- cursor.topicID (source.go): a recreated topic's new ID is deliberately
never adopted; the consumer stalls loudly and the user purges+re-adds.
Cites issue #908 / PR #391/#377 (OffsetForLeaderEpoch has no TopicID, so
an adopted ID cannot be validated against truncation).
- fetchOffsets UNSTABLE_OFFSET_COMMIT (consumer_group.go): the unbounded 1s
retry is protocol-mandated (require_stable hides pending txnal offsets);
a retry cap would convert a mandated wait into a spurious error.
- groupExternal.updateLatest (consumer_group.go): rejoining on a one-response
stale partition-count shrink is intentional self-healing churn, matching
Java's leader exposure — not a bug to silence with a shrink filter.
- updateBrokers empty-list wipe (client.go): an empty Brokers list falling
back to seeds is the KIP-1102 REBOOTSTRAP_REQUIRED semantic, called
explicitly by the rebootstrap path; not a hostile-input gap.
- default autocommit head lag (consumer_group.go): the one-poll dirty->head
lag is what makes default autocommit at-least-once; committing dirty at
revoke would open a loss window (user decision 2026-04-24).
- broker throttle (broker.go): a throttle is honored in full with no cap
(KIP-219), matching Java; the wait is Close-interruptible and holds no
lock. Capping it would break the quota mechanism.
No behavior change; comments only.
The txn-churn and rebalance-churn audits each established a set of invariants
that any future change to coordinator/leader-churn recovery must preserve.
There is no open issue that fits either, so anchor them as doc comments at the
function that owns each recovery loop, so the constraints outlive the audit
notes:
- manage848 (consumer_group_848.go): the rebalance-churn invariants —
heartbeat errors retry in place while fetch errors restart the session
via g.fetching; member-identity resets are the minimum the error implies
(fresh UUID only for UnknownMemberID); leaves are idempotent and exempt
from the CGHB no-retry rule.
- doWithConcurrentTransactions (txn.go): the txn-churn invariants — the
wrapper/CT-loop division, anyAdded TV1 gating + TV2-only forced abort,
producer-fenced-means-dead — plus the two design-sized items left not
taken (commit-after-failed-produce; TV2 mid-session downgrade).
The silent.md zero-loss topic-recreation design constraints were posted to
issue #908 (the canonical recreation issue) rather than duplicated in code.
No behavior change; comments only.
twmb commented Jun 22, 2026
Comment-only trims/reframings:
- drop the over-explanatory tails on UnknownTopicRetries (config.go) and
checkUnknownFailLimit (sink.go), keeping the reset/bump rule and the three
errors that count
- delete the redundant RequireStableFetchOffsets no-op paragraph (txn.go)
- reframe updateBrokers' empty-list rationale as the long-standing seed
fallback it is, not a KIP-1102 artifact (client.go)
- note producedInTxn is set at buffer time and the worst case is an
always-legal empty EndTxn abort (producer.go)
- drop "activating" from the share pendingAssigns comment (consumer_share.go)
Behavior/refactor:
- NewConsumerBalancer dedups members in one map pass instead of an O(n^2)
scan-then-rebuild (group_balancer.go)
- fetchOffsets no longer re-fetches partitions it already surfaced a
non-retryable error for and dropped; injected partitions are filtered out
of the request on goto-start retries (consumer_group.go)
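The NewConsumerBalancer item above replaces a quadratic scan with a single map pass. A minimal sketch of that general technique follows; the member type, the dedupMembers name, and the choice to keep the first occurrence of a duplicate ID are all assumptions for illustration and are not taken from group_balancer.go.

```go
package main

import "fmt"

// member is a hypothetical stand-in for a group member entry.
type member struct {
	ID     string
	Topics []string
}

// dedupMembers returns the members with duplicate IDs removed in one pass,
// preserving input order. Keeping the first occurrence is an assumption.
func dedupMembers(in []member) []member {
	seen := make(map[string]struct{}, len(in))
	out := make([]member, 0, len(in))
	for _, m := range in {
		if _, dup := seen[m.ID]; dup {
			continue
		}
		seen[m.ID] = struct{}{}
		out = append(out, m)
	}
	return out
}

func main() {
	in := []member{
		{ID: "a", Topics: []string{"t1"}},
		{ID: "b", Topics: []string{"t2"}},
		{ID: "a", Topics: []string{"t3"}},
	}
	for _, m := range dedupMembers(in) {
		fmt.Println(m.ID, m.Topics)
	}
}
```

Each member costs one map lookup and at most one insert, so the work grows linearly with the member count instead of quadratically.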
TestAuditPendingReloadSurvivesPartitionRevoke pins the preservation half of the dying-session reload fix: a partition caught mid-reload on a retriable error must be carried into the next session when the session is stopped by a revoke that keeps that partition. It drives the cooperative-rebalance shape (assignInvalidateMatching) deterministically via RemoveConsumePartitions on a direct consumer with explicit partitions, so a dropped load cannot self-heal -- nothing re-lists a pinned partition that has no cursor.
The most substantive runs were during the Fable week, with follow-up rounds running through areas and scenarios of less importance. Anyway, time to review commit by commit.