@abhizer
Contributor

@abhizer abhizer commented Jan 20, 2026

Previously:

  • During S3 sync, we would prevent the GC of existing local checkpoints.
  • Only one checkpoint would be GCed at a time.

This commit updates the gc_checkpoint method so that all old checkpoints (i.e., checkpoints beyond the retention threshold, which currently keeps the 2 most recent) are GCed, except for any checkpoint in the except list or newer.
This except list is populated from currently active requests for checkpoint syncs.
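
A minimal sketch of the selection rule described above, assuming checkpoints are ordered oldest to newest and identified by plain integers rather than UUIDs. The names `gc_candidates`, `CheckpointId`, and `RETAINED` are hypothetical and are not taken from the PR; the real `gc_checkpoint` also has to handle batch files and storage errors.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a checkpoint identifier (the PR uses UUIDs).
type CheckpointId = u32;

/// Number of most recent checkpoints that are always retained
/// (2, per the description above).
const RETAINED: usize = 2;

/// Returns the checkpoints eligible for GC.
/// `checkpoints` is assumed to be ordered from oldest to newest.
fn gc_candidates(
    checkpoints: &[CheckpointId],
    except: &HashSet<CheckpointId>,
) -> Vec<CheckpointId> {
    // Everything before this index is older than the retention threshold.
    let old_end = checkpoints.len().saturating_sub(RETAINED);
    // Index of the oldest checkpoint named in `except`; that checkpoint
    // and everything newer is preserved.
    let first_protected = checkpoints
        .iter()
        .position(|c| except.contains(c))
        .unwrap_or(checkpoints.len());
    checkpoints[..old_end.min(first_protected)].to_vec()
}
```

With five checkpoints `[1, 2, 3, 4, 5]` and an empty except list, this sketch selects `[1, 2, 3]`. If checkpoint 2 is being synced, only `[1]` is selected.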

Copilot AI review requested due to automatic review settings January 20, 2026 15:00
Contributor

Copilot AI left a comment

Pull request overview

This PR enhances the checkpoint garbage collection mechanism to better handle concurrent S3 checkpoint syncs. Previously, checkpoint GC was limited to removing one checkpoint at a time and was completely blocked during S3 sync operations. The updated implementation allows GC to remove multiple old checkpoints in a single pass while protecting only those checkpoints that are actively being synced to S3.

Changes:

  • Modified gc_checkpoint to accept an exception list of checkpoint UUIDs that should be preserved
  • Changed GC behavior from single-checkpoint removal to bulk removal of all eligible old checkpoints
  • Updated the main call site to pass UUIDs of checkpoints currently being synced to S3 as exceptions

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
crates/dbsp/src/circuit/dbsp_handle.rs Updated gc_checkpoint signature to accept exception list parameter and updated documentation
crates/dbsp/src/circuit/checkpointer.rs Implemented bulk checkpoint removal with exception list support, added informative logging
crates/adapters/src/controller.rs Updated gc_checkpoint call to pass active sync checkpoint UUIDs as exceptions, added Itertools import for deduplication

@abhizer abhizer requested a review from ryzhyk January 20, 2026 17:13
@ryzhyk ryzhyk requested a review from blp January 20, 2026 17:32
@ryzhyk
Contributor

ryzhyk commented Jan 20, 2026

This needs @blp 's review.

Is there any risk of getting stuck in local GC for too long because of a large number of accumulated local checkpoints?

@abhizer
Contributor Author

abhizer commented Jan 20, 2026

In most cases, there should be just one checkpoint to GC. But when syncing an older checkpoint has blocked GC, so that many checkpoints have accumulated, we want to clean up all the older ones.
Typically it shouldn't be super expensive to clean them up. There may be other improvements that we can make, like reading the dependencies.json file (we introduced this a few months ago) instead of going through all the batch files to find the reference files.

@lalithsuresh
Contributor

@abhizer reminder to add state machine tests for this.

@blp
Member

blp commented Jan 21, 2026

If we're syncing an old checkpoint, I believe that this keeps all the checkpoints newer than that. Why not just keep MIN_CHECKPOINT_THRESHOLD newer ones? Then we won't accumulate many new checkpoints if it takes a long time to sync an old one.

Comment on lines 311 to 317
// Find the first checkpoint in checkpoint list that is not in `except`.
self.checkpoint_list
    .iter()
    .filter(|c| except.contains(&c.uuid))
    .take(1)
    .filter_map(|c| self.backend.gather_batches_for_checkpoint(c).ok())
    .for_each(|batches| {
        for batch in batches {
            batch_files_to_keep.insert(batch);
        }
    });
Member

It looks like we're only keeping batches from the first checkpoint in the list? I don't understand why.

Contributor Author

The assumption is that the checkpoints are incremental and that:

  • checkpoint n + 1 depends only on checkpoint n's batch files plus new batch files
  • so, when deleting the batch files of checkpoints n - 1 and older, keeping all the batch files of n implicitly keeps every older batch file that n + 1 needs

Basically, checkpoint n + 1 cannot depend on a batch file that is in n - 1 but not in n.

This should be the same behavior as the current implementation of GC.
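
The argument above can be sketched under the same incremental assumption. Here each checkpoint is modeled as a set of batch file names, and a file from a removed checkpoint is deleted only if the oldest kept checkpoint does not reference it. The function `deletable_files` and the integer checkpoint ids are hypothetical, for illustration only; the actual code gathers batches through `gather_batches_for_checkpoint`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Returns the batch files that can be deleted when the checkpoints in
/// `removed` are GCed and `oldest_kept` is the oldest surviving checkpoint.
/// Assumes checkpoint n + 1 only references files of checkpoint n plus
/// new files, so protecting the files of `oldest_kept` is sufficient.
fn deletable_files(
    files_of: &HashMap<u32, BTreeSet<String>>,
    removed: &[u32],
    oldest_kept: u32,
) -> BTreeSet<String> {
    // Be conservative: if the kept checkpoint's files are unknown,
    // delete nothing.
    let Some(keep) = files_of.get(&oldest_kept) else {
        return BTreeSet::new();
    };
    removed
        .iter()
        .filter_map(|c| files_of.get(c)) // file sets of removed checkpoints
        .flatten()
        .filter(|f| !keep.contains(*f)) // skip files still referenced
        .cloned()
        .collect()
}
```

For example, if checkpoint 1 references files `a` and `b`, and checkpoint 2 references `b` and `c`, removing checkpoint 1 while keeping checkpoint 2 deletes only `a`.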

Member

I think that's true. Thanks.


@abhizer abhizer added this pull request to the merge queue Jan 22, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to invalid changes in the merge commit Jan 22, 2026
@abhizer abhizer added this pull request to the merge queue Jan 22, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to invalid changes in the merge commit Jan 22, 2026
@abhizer abhizer force-pushed the better-gc-during-sync branch from ea626df to aba1b3c Compare January 23, 2026 16:48
@abhizer abhizer enabled auto-merge January 23, 2026 16:52
@abhizer abhizer disabled auto-merge January 23, 2026 16:56
@abhizer abhizer force-pushed the better-gc-during-sync branch from aba1b3c to 4a078ac Compare January 23, 2026 16:59
@abhizer abhizer enabled auto-merge January 23, 2026 16:59
@abhizer abhizer added this pull request to the merge queue Jan 23, 2026
@abhizer abhizer removed this pull request from the merge queue due to a manual request Jan 23, 2026
@abhizer abhizer force-pushed the better-gc-during-sync branch from 4a078ac to ef800bf Compare January 23, 2026 18:15
@abhizer abhizer added this pull request to the merge queue Jan 23, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 23, 2026
@abhizer abhizer force-pushed the better-gc-during-sync branch from ef800bf to 6b7a49d Compare January 23, 2026 22:19
@abhizer abhizer enabled auto-merge January 23, 2026 22:19
@abhizer abhizer added this pull request to the merge queue Jan 23, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 24, 2026
Previously:
- During S3 sync, we would prevent the GC of existing local
checkpoints.
- Only one checkpoint would be GCed at a time.

This commit updates the `gc_checkpoint` method so that all *old*
checkpoints (i.e., checkpoints beyond the retention threshold, which
currently keeps the 2 most recent) are GCed, except for any checkpoint
in the `except` list or newer.
This `except` list is populated from currently active requests for
checkpoint syncs.

Signed-off-by: Abhinav Gyawali <[email protected]>

checkpointer: only preserve checkpoints in except list

This commit updates the checkpointer to only preserve the checkpoints in
the except list, instead of preserving any checkpoint that is newer.
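
A minimal sketch of this revised rule, assuming checkpoints are ordered oldest to newest and identified by integers instead of UUIDs. The names `split_for_gc` and `RETAINED` are hypothetical and do not come from the commit. Only the most recent checkpoints and those named in the `except` list survive; a checkpoint that is merely newer than an excepted one is no longer protected.

```rust
use std::collections::HashSet;

/// Number of most recent checkpoints that are always retained
/// (assumed to be 2, as stated in the PR description).
const RETAINED: usize = 2;

/// Splits `checkpoints` (ordered oldest to newest) into `(kept, removed)`.
fn split_for_gc(checkpoints: &[u32], except: &HashSet<u32>) -> (Vec<u32>, Vec<u32>) {
    // Checkpoints before `old_end` are old enough to be GCed.
    let old_end = checkpoints.len().saturating_sub(RETAINED);
    let (old, recent) = checkpoints.split_at(old_end);
    // Among the old ones, keep exactly those named in `except`.
    let (mut kept, removed): (Vec<u32>, Vec<u32>) =
        old.iter().copied().partition(|c| except.contains(c));
    // The most recent checkpoints are always kept.
    kept.extend_from_slice(recent);
    (kept, removed)
}
```

With checkpoints `[1, 2, 3, 4, 5]` and checkpoint 2 being synced, this sketch keeps `[2, 4, 5]` and removes `[1, 3]`.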

Also adds tests for the Checkpointer to ensure that it works correctly.

Signed-off-by: Abhinav Gyawali <[email protected]>

py: add tests with sync GC count: 1, age: 0

Tests for potential regressions where we only want to keep 1 checkpoint
in object store.

Previously, this introduced a bug by also cleaning up the local checkpoint
directory for this checkpoint, but the pipeline still expects it to be
available.

Signed-off-by: Abhinav Gyawali <[email protected]>
@abhizer abhizer force-pushed the better-gc-during-sync branch from 6b7a49d to 3da078e Compare January 24, 2026 09:12
@abhizer abhizer enabled auto-merge January 24, 2026 09:12
@abhizer abhizer added this pull request to the merge queue Jan 24, 2026
Merged via the queue into main with commit be6f768 Jan 24, 2026
1 check passed
@abhizer abhizer deleted the better-gc-during-sync branch January 24, 2026 10:22