feat: support indexed v2 merge insert #5592
base: main
Conversation
Code Review: feat: support indexed v2 merge insert

Summary

This PR adds support for indexed merge insert in the v2 path, which uses scalar indices to efficiently look up target rows that match source keys instead of performing full table scans. The implementation handles both fully indexed and partially indexed datasets (where some fragments have index coverage and others don't).

P0/P1 Issues

P1: Unbounded memory usage in ReplayExec

```rust
// rust/lance/src/dataset/write/merge_insert.rs
let source_replay = Arc::new(ReplayExec::new(Capacity::Unbounded, source_exec));
```

Consider adding documentation about this memory characteristic and/or adding a warning when source data exceeds a threshold.
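A minimal sketch of the kind of guard that suggestion points at; the threshold constant and the byte-accounting hook are illustrative and not part of the PR:

```rust
// Hypothetical guard: warn when the replayed source grows past a fixed
// byte budget, since ReplayExec with Capacity::Unbounded buffers all of
// the source data in memory. The threshold is arbitrary for illustration.
const REPLAY_WARN_BYTES: usize = 1 << 30; // 1 GiB

fn maybe_warn_replay_size(buffered_bytes: usize) {
    if buffered_bytes > REPLAY_WARN_BYTES {
        log::warn!(
            "merge insert source ({} bytes) is fully buffered for replay; memory usage may be high",
            buffered_bytes
        );
    }
}
```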
P1: UpdateIf condition not properly applied

```rust
WhenMatched::UpdateIf(condition_str) => {
    // ... parsing code ...
    if planner.create_physical_expr(&condition).is_ok() {
        // Use the matched condition combined with the parsed condition
        cases.push((matched, assign_action::Action::UpdateAll.as_literal_expr()));
    }
    // ...
}
```
The parsed condition should be combined with the matched condition when building the case; as written it is only validated via `create_physical_expr` and then dropped.
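A small sketch of the intended semantics using plain DataFusion `Expr` values; `matched` and `update_condition` stand in for the PR's internal variables and are not reproduced from the actual code:

```rust
use datafusion_expr::Expr;

// Illustrative only: the case predicate for UpdateIf should be the
// conjunction of the join-match predicate and the parsed user condition,
// not the match predicate alone.
fn update_if_predicate(matched: Expr, update_condition: Expr) -> Expr {
    matched.and(update_condition)
}
```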
P1: Single-column key limitation may be unexpected

```rust
if self.params.on.len() != 1 {
    return Err(Error::invalid_input(
        "Indexed merge insert only supports single-column join keys",
        location!(),
    ));
}
```

Consider either:
Minor Observations (Non-blocking)

Overall

The implementation is solid, with good architectural separation between the indexed and non-indexed paths. The main concerns are the memory implications of unbounded replay and the UpdateIf semantics issue.
```rust
/// - Source columns
/// - Target columns (including `_rowid`, `_rowaddr`)
#[derive(Debug)]
pub struct IndexedLookupExec {
```
I ended up wrapping this pipeline of source -> project -> MapIndexExec -> AddRowAddrExec -> TakeExec -> project as this execution node, since otherwise the DataFusion optimizer keeps adding additional repartitioning steps in between. Not sure if there is any better way.
> otherwise the DataFusion optimizer keeps adding additional repartitioning steps in between
I think the DataFusion optimizer does this based on `ExecutionPlan::benefits_from_input_partitioning` and `ExecutionPlan::required_input_distribution`. It compares those settings with the child nodes' partitioning to decide whether to add repartitioning, so it is also worth making sure you are stating the partitioning correctly.
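A rough sketch of those two hooks on a custom node; the other required `ExecutionPlan` methods are elided, and exact signatures and import paths vary across DataFusion versions, so treat this as an outline rather than the PR's actual implementation:

```rust
use datafusion::physical_plan::{Distribution, ExecutionPlan};

impl ExecutionPlan for IndexedLookupExec {
    // ...schema/properties/children/execute/etc. elided for brevity...

    fn benefits_from_input_partitioning(&self) -> Vec<bool> {
        // One entry per child; `false` tells the optimizer not to insert
        // RepartitionExec above a child just to increase parallelism.
        vec![false; self.children().len()]
    }

    fn required_input_distribution(&self) -> Vec<Distribution> {
        // Declare that each child should arrive as a single partition so
        // no hash repartitioning is forced between the internal stages.
        vec![Distribution::SinglePartition; self.children().len()]
    }
}
```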
```rust
///
/// The source_replay must be a ReplayExec created externally. This ensures
/// the same Arc is used for both the exposed child and internal DAG.
fn build_join_pipeline_with_replay(
```
For the fully indexed case, I ended up doing the join within this exec node so that it can more easily pass around the replay node.
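A purely illustrative, self-contained sketch of the Arc-sharing invariant the doc comment above describes; the types are placeholders, not Lance's:

```rust
use std::sync::Arc;

// Placeholder standing in for the ReplayExec node.
struct Replay;

struct IndexedJoinParts {
    exposed_child: Arc<Replay>,
    internal_input: Arc<Replay>,
}

fn build(source_replay: Arc<Replay>) -> IndexedJoinParts {
    // Clone the Arc handle, not the node: the exposed child and the
    // internally built join DAG must share the same replay buffer.
    IndexedJoinParts {
        exposed_child: source_replay.clone(),
        internal_input: source_replay,
    }
}

fn main() {
    let parts = build(Arc::new(Replay));
    // Invariant: both handles reference the exact same node.
    assert!(Arc::ptr_eq(&parts.exposed_child, &parts.internal_input));
}
```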
Support leveraging an index for v2 merge insert. The index optimization is added in `MergeInsertPlanner` to transform a normal hash join into an optimized join. With this change, we can properly display the query plan. A few cases (a rough sketch of the coverage-based choice follows below):

- `on` column fully covered by an index in the target.
- Index only covers some fragments in the target: run a hybrid plan that unions the results and then hash joins.
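A minimal, hypothetical sketch of that coverage-based decision; the names below are illustrative and not Lance's actual planner API:

```rust
/// Which physical strategy the merge-insert planner could pick, based on
/// how many target fragments the scalar index on the `on` column covers.
enum MergeInsertStrategy {
    /// Every fragment covered: resolve matches through the index only.
    FullyIndexed,
    /// Partial coverage: union the indexed lookup over covered fragments
    /// with a scan of uncovered fragments, then hash join with the source.
    Hybrid { uncovered_fragments: Vec<u64> },
    /// No usable index: scan the target and hash join.
    FullScan,
}

fn choose_strategy(covered: &[u64], all_fragments: &[u64]) -> MergeInsertStrategy {
    let uncovered: Vec<u64> = all_fragments
        .iter()
        .copied()
        .filter(|f| !covered.contains(f))
        .collect();
    if covered.is_empty() {
        MergeInsertStrategy::FullScan
    } else if uncovered.is_empty() {
        MergeInsertStrategy::FullyIndexed
    } else {
        MergeInsertStrategy::Hybrid {
            uncovered_fragments: uncovered,
        }
    }
}
```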