core/mvcc: fix pager_commit_lock leak when the commit SM is abandoned#6905

Draft: penberg wants to merge 2 commits into main from mvcc-pager-lock-leak

penberg commented May 12, 2026

The CommitStateMachine acquires pager_commit_lock in BeginCommitLogicalLog
and releases it in CommitEnd via the per-tx pager_commit_lock_held flag.
Between the lock acquire and the flag set sat a fallible txs.get() lookup;
the same shape existed in begin_exclusive_tx. If anything in between took
the error path, or the wrapping statement was reset/dropped before
CommitEnd, the lock leaked and the next committer spun forever inside
pager_commit_lock.write().
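
For illustration only, here is a minimal sketch of the ordering hazard described above. It is not the actual patch: `Store`, `Tx`, the `AtomicBool` standing in for pager_commit_lock, and both function names are hypothetical, and the real commit path yields on a busy lock rather than returning an error.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical per-transaction state; only the "lock held" flag matters here.
pub struct Tx {
    pub pager_commit_lock_held: AtomicBool,
}

/// Hypothetical store: an AtomicBool stands in for pager_commit_lock.
pub struct Store {
    pub commit_lock: AtomicBool,
    pub txs: HashMap<u64, Tx>,
}

impl Store {
    /// Try to take the commit lock; returns true on success.
    fn try_lock(&self) -> bool {
        self.commit_lock
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Leaky shape: the lock is taken first, then a fallible lookup runs.
    /// If the lookup fails, `?` returns early and nobody releases the lock.
    pub fn begin_commit_leaky(&self, tx_id: u64) -> Result<(), String> {
        if !self.try_lock() {
            return Err("busy".to_string());
        }
        let tx = self.txs.get(&tx_id).ok_or_else(|| "no such tx".to_string())?;
        tx.pager_commit_lock_held.store(true, Ordering::Release);
        Ok(())
    }

    /// Hoisted shape: the fallible lookup runs before the lock is taken,
    /// so no error path exists between acquiring the lock and setting the flag.
    pub fn begin_commit_hoisted(&self, tx_id: u64) -> Result<(), String> {
        let tx = self.txs.get(&tx_id).ok_or_else(|| "no such tx".to_string())?;
        if !self.try_lock() {
            return Err("busy".to_string());
        }
        tx.pager_commit_lock_held.store(true, Ordering::Release);
        Ok(())
    }
}
```

With an empty `txs` table, the leaky variant returns an error while leaving `commit_lock` set, whereas the hoisted variant returns the same error with the lock untouched.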

Reproducer: seed 3642894517192925405 in the in-process-mvcc concurrent
simulator job, which hit the workflow's 40-minute timeout in CI.

  • Hoist the txs.get() lookups before pager_commit_lock.write() in both
    BeginCommitLogicalLog and begin_exclusive_tx so the per-tx flag is set
    atomically with lock acquisition.
  • Track pager_commit_lock_held on the CommitStateMachine itself and add a
    Drop impl that releases the lock via unlock_commit_lock_if_held when the
    SM is dropped mid-commit. The per-tx flag's swap is the synchronization
    point with rollback_tx, so the dual cleanup paths cannot double-unlock.
  • Bound Whopper::reopen's drain loop. A COMMIT yielding on pager_commit_lock
    held by a sibling fiber whose BEGIN already returned Done would otherwise
    spin forever, since the sibling has no statement for reopen to step.
    After 1024 iterations of zero terminal progress, fall through to the
    existing Connection::close path which rolls back in-flight txs.
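
The Drop-based cleanup in the second bullet can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `CommitLock`, its `unlock_count` counter, the `holds_lock` field, and the signature given to `unlock_commit_lock_if_held` are all hypothetical. The point it illustrates is that an atomic `swap(false)` on the per-tx flag lets exactly one of the two cleanup paths perform the unlock.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for pager_commit_lock, instrumented to count unlocks.
pub struct CommitLock {
    locked: AtomicBool,
    pub unlock_count: AtomicUsize,
}

impl CommitLock {
    pub fn new() -> Self {
        CommitLock { locked: AtomicBool::new(false), unlock_count: AtomicUsize::new(0) }
    }
    pub fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
        self.unlock_count.fetch_add(1, Ordering::Relaxed);
    }
    pub fn is_locked(&self) -> bool {
        self.locked.load(Ordering::Acquire)
    }
}

/// Hypothetical per-transaction state carrying the shared flag.
pub struct Tx {
    pub pager_commit_lock_held: AtomicBool,
}

/// Release the lock only if the per-tx flag was still set.
/// `swap` returns the previous value, so only one caller ever sees `true`.
pub fn unlock_commit_lock_if_held(lock: &CommitLock, tx: &Tx) {
    if tx.pager_commit_lock_held.swap(false, Ordering::AcqRel) {
        lock.unlock();
    }
}

/// Hypothetical, heavily reduced commit state machine.
pub struct CommitStateMachine {
    lock: Arc<CommitLock>,
    tx: Arc<Tx>,
    holds_lock: bool, // SM-local record that this SM acquired the lock
}

impl CommitStateMachine {
    pub fn new(lock: Arc<CommitLock>, tx: Arc<Tx>) -> Self {
        CommitStateMachine { lock, tx, holds_lock: false }
    }

    /// Acquire the lock and set both flags; returns whether the lock is held.
    pub fn begin_commit(&mut self) -> bool {
        if self.lock.try_lock() {
            self.tx.pager_commit_lock_held.store(true, Ordering::Release);
            self.holds_lock = true;
        }
        self.holds_lock
    }
}

impl Drop for CommitStateMachine {
    /// If the SM is abandoned mid-commit, release the lock unless a
    /// rollback path already did so via the per-tx flag.
    fn drop(&mut self) {
        if self.holds_lock {
            unlock_commit_lock_if_held(&self.lock, &self.tx);
        }
    }
}
```

In this model, calling `unlock_commit_lock_if_held` from a rollback path and then dropping the state machine unlocks once; dropping the state machine without any rollback also unlocks once.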

penberg force-pushed the mvcc-pager-lock-leak branch 2 times, most recently from 0439bcc to 5ba268d Compare May 12, 2026 08:48
penberg marked this pull request as draft May 12, 2026 09:23
penberg added 2 commits May 13, 2026 10:25
7580dbd ("concurrent-simulator: bound reopen drain loop with max_steps")
bounded reopen drain by the shared `max_steps` budget. That's overly
strict for legitimate IO-heavy statements: `PRAGMA integrity_check`
yields once per page read (cache-cold in MVCC mode after any commit
advances WAL state, since `mvcc_refresh_if_db_changed` nukes the cache
on every snapshot diff). A reopen that triggers late in a run has only
a few hundred main-loop steps left, far short of the ~3000 yields the
checker needs, and bails with a misleading "leaked lock" message.

Split the budget. Drain iterations no longer count against `max_steps`;
they have a dedicated `max_drain_steps` cap (default 1_000_000, exposed
as `--max-drain-steps`) that's large enough to absorb legitimate
IO-heavy finalization but still catches real engine-side infinite loops
(unresolvable IO yield, leaked lock with no other fiber able to make
progress).

Hitting `max_steps` during drain is no longer a panic — it just exits
the drain and falls through to the existing connection-close path,
which rolls back any in-flight transactions through `rollback_tx`.
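
As a rough sketch of the split budget described in this commit message, the drain can be modelled as a loop with its own step cap that reports exhaustion instead of panicking. The `drain` function, the `DrainOutcome` enum, and the `step` closure are hypothetical names for illustration; only `max_drain_steps` and its default come from the text above.

```rust
/// Hypothetical result of a drain attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum DrainOutcome {
    /// All in-flight statements finished after `steps` iterations.
    Drained { steps: u64 },
    /// The dedicated cap was reached; the caller falls through to the
    /// connection-close path, which rolls back in-flight transactions.
    BudgetExhausted,
}

/// Step until `step` reports completion or the drain-only cap is reached.
/// `step` is a hypothetical closure returning true once nothing is left to drain.
/// Iterations here are counted separately from the main loop's step budget.
pub fn drain(max_drain_steps: u64, mut step: impl FnMut() -> bool) -> DrainOutcome {
    for steps in 1..=max_drain_steps {
        if step() {
            return DrainOutcome::Drained { steps };
        }
    }
    DrainOutcome::BudgetExhausted
}
```

A statement that needs about 3000 yields completes comfortably under a cap of 1_000_000, while a closure that never makes progress returns `BudgetExhausted` rather than looping forever.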
penberg force-pushed the mvcc-pager-lock-leak branch from 5ba268d to 33cb380 Compare May 13, 2026 07:30