Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

@jackye1995
Copy link
Contributor

Based on feedback in #5582

Simplified implementation that just keep the first value in case of duplicated on rows in source during merge insert. Users are expected to sort the source properly before using merge-insert.

@jackye1995 jackye1995 requested a review from wjones127 December 31, 2025 06:09
@github-actions github-actions bot added the enhancement New feature or request label Dec 31, 2025
@github-actions
Copy link
Contributor

Code Review

Summary: This PR adds a DedupeBehavior option to merge_insert that allows keeping the first-seen value when duplicate source rows match the same target row, instead of failing.

P1: Performance Issue in Duplicate Filtering

In merge_insert.rs around line 2075-2082, the duplicate filtering uses:

let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_indices.contains(&i)))
    .collect();

This is O(n*m) where n is the batch size and m is the number of duplicates. For batches with many duplicates, this becomes expensive. Consider using a HashSet for O(1) lookups:

let duplicate_set: std::collections::HashSet<_> = duplicate_indices.iter().collect();
let keep_mask: BooleanArray = (0..matched.num_rows())
    .map(|i| Some(!duplicate_set.contains(&i)))
    .collect();

Other Observations

  1. Test Coverage: Good parameterized test covering multiple configurations (full schema, stable row IDs, file versions).

  2. API Design: The builder pattern with dedupe_behavior() follows existing conventions well.

  3. Stats Tracking: The num_skipped_duplicates metric is properly propagated through both code paths (full-schema and non-full-schema merge insert).

Overall this is a straightforward and well-tested feature addition. The main actionable item is the performance concern above.

@codecov
Copy link

codecov bot commented Dec 31, 2025

Codecov Report

❌ Patch coverage is 88.88889% with 20 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/write/merge_insert.rs 84.00% 18 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant