core/mvcc: fix pager_commit_lock leak when the commit SM is abandoned #6905
Draft
penberg wants to merge 2 commits into
0439bcc to 5ba268d
7580dbd ("concurrent-simulator: bound reopen drain loop with max_steps") bounded the reopen drain by the shared `max_steps` budget. That is overly strict for legitimate IO-heavy statements: `PRAGMA integrity_check` yields once per page read, and in MVCC mode those reads are cache-cold after any commit advances WAL state, since `mvcc_refresh_if_db_changed` nukes the cache on every snapshot diff. A reopen that triggers late in a run has only a few hundred main-loop steps left, far short of the ~3000 yields the checker needs, and bails with a misleading "leaked lock" message.

Split the budget. Drain iterations no longer count against `max_steps`; they have a dedicated `max_drain_steps` cap (default 1_000_000, exposed as `--max-drain-steps`) that is large enough to absorb legitimate IO-heavy finalization but still catches real engine-side infinite loops (unresolvable IO yield, leaked lock with no other fiber able to make progress).

Hitting `max_steps` during drain is no longer a panic; it just exits the drain and falls through to the existing connection-close path, which rolls back any in-flight transactions through `rollback_tx`.
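The budget split described above can be illustrated with a minimal, standalone sketch. It is not the simulator's code: `Step`, `DrainOutcome`, and `drain` are hypothetical names, and the statement being drained is modeled as a closure. The only point it shows is that the drain loop counts against its own cap rather than the main-loop budget.

```rust
/// Hypothetical result of stepping a statement once during drain.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    /// The statement yielded (for example on IO) and must be stepped again.
    Yield,
    /// The statement reached a terminal state.
    Done,
}

/// Hypothetical outcome of a drain pass.
#[derive(Debug, PartialEq, Eq)]
enum DrainOutcome {
    /// The statement finished after this many drain iterations.
    Finished { drain_steps: u64 },
    /// The dedicated drain cap was reached without finishing.
    BudgetExhausted,
}

/// Steps until `Done` or until `max_drain_steps` iterations have run.
/// The shared main-loop budget is deliberately not a parameter here:
/// drain iterations are counted only against their own cap.
fn drain(mut step: impl FnMut() -> Step, max_drain_steps: u64) -> DrainOutcome {
    for i in 0..max_drain_steps {
        if step() == Step::Done {
            return DrainOutcome::Finished { drain_steps: i + 1 };
        }
    }
    DrainOutcome::BudgetExhausted
}

fn main() {
    // A statement that yields 3000 times before finishing, similar in
    // scale to the cache-cold integrity check mentioned above.
    let mut remaining = 3000u32;
    let outcome = drain(
        || {
            if remaining == 0 {
                Step::Done
            } else {
                remaining -= 1;
                Step::Yield
            }
        },
        1_000_000,
    );
    println!("{:?}", outcome);

    // A statement that never finishes is still caught by the cap.
    println!("{:?}", drain(|| Step::Yield, 10_000));
}
```

How the caller reacts to `BudgetExhausted` (panic, or fall through to connection close) is left open in this sketch.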
The CommitStateMachine acquires pager_commit_lock in BeginCommitLogicalLog and releases it in CommitEnd via the per-tx pager_commit_lock_held flag. Between the lock-acquire and the flag-set sat a fallible txs.get() lookup; the same shape existed in begin_exclusive_tx. If anything in between took the error path, or the wrapping statement was reset/dropped before CommitEnd, the lock leaked and the next committer spun forever inside pager_commit_lock.write().

Reproducer: seed 3642894517192925405 in the in-process-mvcc concurrent simulator job, which hit the workflow's 40-minute timeout in CI.

- Hoist the txs.get() lookups before pager_commit_lock.write() in both BeginCommitLogicalLog and begin_exclusive_tx so the per-tx flag is set atomically with lock acquisition.
- Track pager_commit_lock_held on the CommitStateMachine itself and add a Drop impl that releases the lock via unlock_commit_lock_if_held when the SM is dropped mid-commit. The per-tx flag's swap is the synchronization point with rollback_tx, so the dual cleanup paths cannot double-unlock.
- Bound Whopper::reopen's drain loop. A COMMIT yielding on pager_commit_lock held by a sibling fiber whose BEGIN already returned Done would otherwise spin forever, since the sibling has no statement for reopen to step. After 1024 iterations of zero terminal progress, fall through to the existing Connection::close path, which rolls back in-flight txs.
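The first two fixes can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the engine's code: `CommitLock`, `Tx`, `CommitSm`, and `CommitSm::begin` are hypothetical stand-ins, the lock is modeled as a try-lock over an `AtomicBool`, and only the names `pager_commit_lock_held` and `unlock_commit_lock_if_held` are taken from the description above.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the commit lock: try-lock plus explicit unlock.
#[derive(Default)]
struct CommitLock {
    locked: AtomicBool,
}

impl CommitLock {
    fn try_write(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }
    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
    fn is_locked(&self) -> bool {
        self.locked.load(Ordering::Acquire)
    }
}

/// Hypothetical per-transaction state carrying the "lock held" flag.
#[derive(Default)]
struct Tx {
    pager_commit_lock_held: AtomicBool,
}

/// Unlocks only if this caller wins the swap on the per-tx flag, so two
/// cleanup paths (rollback and Drop) cannot both release the lock.
fn unlock_commit_lock_if_held(lock: &CommitLock, tx: &Tx) -> bool {
    if tx.pager_commit_lock_held.swap(false, Ordering::AcqRel) {
        lock.unlock();
        true
    } else {
        false
    }
}

/// Hypothetical, heavily reduced commit state machine.
struct CommitSm {
    lock: Arc<CommitLock>,
    tx: Arc<Tx>,
}

impl CommitSm {
    fn begin(
        lock: Arc<CommitLock>,
        txs: &HashMap<u64, Arc<Tx>>,
        tx_id: u64,
    ) -> Result<CommitSm, String> {
        // Fallible lookup first: an error here leaves the lock untouched.
        let tx = txs
            .get(&tx_id)
            .cloned()
            .ok_or_else(|| format!("no such tx {tx_id}"))?;
        if !lock.try_write() {
            return Err("commit lock busy".to_string());
        }
        // No fallible step sits between acquiring the lock and setting the flag.
        tx.pager_commit_lock_held.store(true, Ordering::Release);
        Ok(CommitSm { lock, tx })
    }
}

impl Drop for CommitSm {
    // If the state machine is abandoned mid-commit, release the lock here.
    fn drop(&mut self) {
        unlock_commit_lock_if_held(&self.lock, &self.tx);
    }
}

fn main() {
    let lock = Arc::new(CommitLock::default());
    let mut txs = HashMap::new();
    txs.insert(1u64, Arc::new(Tx::default()));

    // A failed lookup never acquires the lock.
    assert!(CommitSm::begin(lock.clone(), &txs, 99).is_err());
    assert!(!lock.is_locked());

    // Dropping the state machine mid-commit releases the lock.
    let sm = CommitSm::begin(lock.clone(), &txs, 1).unwrap();
    assert!(lock.is_locked());
    drop(sm);
    assert!(!lock.is_locked());
}
```

The swap in `unlock_commit_lock_if_held` is what makes the two cleanup paths safe to run in either order: whichever path observes `true` performs the unlock, and the other sees `false` and does nothing.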
5ba268d to 33cb380