feat: support indexed v2 merge insert #5592

jackye1995 · 2025-12-29T20:11:17Z

Support leveraging an index for v2 merge insert. The index optimization is added in MergeInsertPlanner and transforms a normal hash join into an optimized join. With this change, the query plan is displayed properly. A few cases:

On column fully covered by the index in the target:

MergeInsert: on=[id], when_matched=UpdateAll, when_not_matched=InsertAll, when_not_matched_by_source=Keep...
  CoalescePartitionsExec...
    IndexedLookup: key=id, index=id_idx
      Replay...
        StreamingTableExec: partition_sizes=1, projection=[id, value]

The index only covers some fragments in the target, so run a hybrid plan that unions the results and then hash joins:

MergeInsert: on=[id], when_matched=UpdateAll, when_not_matched=InsertAll, when_not_matched_by_source=Keep...
  CoalescePartitionsExec...
    HashJoinExec...join_type=Left...
      Replay...
        StreamingTableExec: partition_sizes=1, projection=[id, value]
      ...UnionExec...
        IndexedLookup: key=id, index=id_idx...
          Replay...
            StreamingTableExec: partition_sizes=1, projection=[id, value]
        ...LanceScan...range=None
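
For readers skimming the plan, here is a toy, std-only model of the hybrid case. It is not Lance code: `Fragment`, `hybrid_lookup`, and the `HashMap` standing in for the scalar index are hypothetical, and it assumes keys are unique in the target. Indexed fragments are probed through the index, unindexed fragments are scanned in full, the two branches are unioned, and the source is left-joined against the result.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a target fragment: (row address, key) pairs.
struct Fragment {
    rows: Vec<(u64, i64)>,
    indexed: bool,
}

/// For each source key, returns the matching target row address, if any.
fn hybrid_lookup(source_keys: &[i64], fragments: &[Fragment]) -> Vec<(i64, Option<u64>)> {
    // Toy "index": key -> row address, built only from the indexed fragments.
    let index: HashMap<i64, u64> = fragments
        .iter()
        .filter(|f| f.indexed)
        .flat_map(|f| f.rows.iter().map(|&(addr, key)| (key, addr)))
        .collect();

    // Branch 1 (IndexedLookup in the plan): probe the index with source keys.
    let mut target: HashMap<i64, u64> = source_keys
        .iter()
        .filter_map(|k| index.get(k).map(|&addr| (*k, addr)))
        .collect();

    // Branch 2 (LanceScan in the plan): read unindexed fragments in full,
    // then union them with the indexed-lookup results.
    for f in fragments.iter().filter(|f| !f.indexed) {
        for &(addr, key) in &f.rows {
            target.insert(key, addr);
        }
    }

    // Left join: every source key is kept; unmatched keys map to None.
    source_keys.iter().map(|k| (*k, target.get(k).copied())).collect()
}
```

With one indexed fragment holding keys 1 and 2 and one unindexed fragment holding key 3, source keys 2 and 3 both resolve to a row address while key 4 comes back as `None` (an insert).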

github-actions · 2025-12-29T20:13:54Z

Code Review: feat: support indexed v2 merge insert

Summary

This PR adds support for indexed merge insert in the v2 path, which uses scalar indices to efficiently look up target rows that match source keys instead of performing full table scans. The implementation handles both fully indexed and partially indexed datasets (where some fragments have index coverage and others don't).

P0/P1 Issues

P1: Unbounded memory usage in ReplayExec
In build_indexed_merge_physical_plan, the code uses Capacity::Unbounded for ReplayExec to avoid deadlocks with HashJoin's CollectLeft mode. While the comment explains the reasoning, this means the entire source dataset could be buffered in memory. For very large source datasets, this could cause OOM issues.

// rust/lance/src/dataset/write/merge_insert.rs
let source_replay = Arc::new(ReplayExec::new(Capacity::Unbounded, source_exec));

Consider adding documentation about this memory characteristic and/or adding a warning when source data exceeds a threshold.
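
One possible shape for such a warning, as a std-only sketch. `ReplayBudget`, its fields, and the threshold are hypothetical and do not exist in Lance; the idea is only to count buffered bytes and emit a single warning the first time a soft limit is crossed.

```rust
/// Hypothetical helper: tracks bytes buffered by an unbounded replay and
/// reports the first time a soft threshold is crossed.
struct ReplayBudget {
    buffered_bytes: usize,
    warn_threshold: usize,
    warned: bool,
}

impl ReplayBudget {
    fn new(warn_threshold: usize) -> Self {
        Self { buffered_bytes: 0, warn_threshold, warned: false }
    }

    /// Records one buffered batch. Returns true exactly once: on the batch
    /// that first pushes the running total above the threshold.
    fn record(&mut self, batch_bytes: usize) -> bool {
        self.buffered_bytes = self.buffered_bytes.saturating_add(batch_bytes);
        if !self.warned && self.buffered_bytes > self.warn_threshold {
            self.warned = true;
            eprintln!(
                "merge insert source replay has buffered {} bytes (soft limit {})",
                self.buffered_bytes, self.warn_threshold
            );
            return true;
        }
        false
    }
}
```

This does not bound memory; it only makes the unbounded buffering visible to the user.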

P1: UpdateIf condition not properly applied
In create_indexed_action_expr, the UpdateIf case parses the condition but then ignores it:

WhenMatched::UpdateIf(condition_str) => {
    // ... parsing code ...
    if planner.create_physical_expr(&condition).is_ok() {
        // Use the matched condition combined with the parsed condition
        cases.push((matched, assign_action::Action::UpdateAll.as_literal_expr()));
    }
    // ...
}

The parsed condition should be combined with matched (e.g., matched.and(condition)) rather than being discarded. This means UpdateIf behaves the same as UpdateAll in the indexed path.
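
The intended semantics can be stated with a small std-only model. Everything here is hypothetical (`Row`, `Action`, `assign_action`, and the sample predicate `source_is_newer`), and it assumes when_not_matched=InsertAll; it only illustrates that an UpdateIf row must be both matched and pass the condition before it is updated.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Action {
    Nothing,
    UpdateAll,
    InsertAll,
}

/// Hypothetical joined row: whether the key matched, plus one value per side.
struct Row {
    matched: bool,
    source_value: i64,
    target_value: i64,
}

enum WhenMatched {
    UpdateAll,
    UpdateIf(fn(&Row) -> bool),
    DoNothing,
}

/// Sample condition used for illustration only.
fn source_is_newer(row: &Row) -> bool {
    row.source_value > row.target_value
}

fn assign_action(row: &Row, when_matched: &WhenMatched) -> Action {
    match (row.matched, when_matched) {
        // Assumption: unmatched source rows are inserted.
        (false, _) => Action::InsertAll,
        (true, WhenMatched::UpdateAll) => Action::UpdateAll,
        // The equivalent of matched.and(condition): both must hold.
        (true, WhenMatched::UpdateIf(cond)) if cond(row) => Action::UpdateAll,
        // Matched, but the condition failed or the policy is DoNothing.
        (true, _) => Action::Nothing,
    }
}
```

A matched row whose source value is not newer yields `Action::Nothing` under `UpdateIf(source_is_newer)`, whereas the same row under `UpdateAll` is updated.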

P1: Single-column key limitation may be unexpected
The indexed merge insert only supports single-column join keys, but this limitation is only enforced at runtime:

if self.params.on.len() != 1 {
    return Err(Error::invalid_input(
        "Indexed merge insert only supports single-column join keys",
        location!(),
    ));
}

Consider either:

Adding this check earlier in join_key_as_scalar_index so multi-column keys fall back to the non-indexed path silently, OR
Documenting this limitation in the public API
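
A minimal sketch of the first option, assuming the strategy is chosen up front. `JoinStrategy` and `choose_join_strategy` are hypothetical names, and the map from column name to index name is a stand-in for whatever join_key_as_scalar_index consults.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum JoinStrategy {
    IndexedLookup { index_name: String },
    HashJoin,
}

/// Picks the indexed path only for a single join key that has an index;
/// anything else falls back to the regular hash join instead of erroring.
fn choose_join_strategy(on: &[&str], indexed_columns: &HashMap<&str, &str>) -> JoinStrategy {
    match on {
        [column] => match indexed_columns.get(column) {
            Some(index_name) => JoinStrategy::IndexedLookup {
                index_name: index_name.to_string(),
            },
            None => JoinStrategy::HashJoin,
        },
        // Zero or multiple key columns: not supported by the indexed path.
        _ => JoinStrategy::HashJoin,
    }
}
```

With this shape the runtime error becomes unreachable for multi-column keys, at the cost of the fallback being silent.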

Minor Observations (Non-blocking)

The new batch_size parameter is added to MergeInsertParams but not exposed in MergeInsertBuilder's public API documentation.
Good test coverage for full, partial, and no index scenarios.
The hybrid execution tests cover the important edge cases well.

Overall

The implementation is solid with good architectural separation between indexed and non-indexed paths. The main concerns are the memory implications of unbounded replay and the UpdateIf semantic issue.

codecov · 2025-12-30T01:46:42Z

Codecov Report

❌ Patch coverage is 76.74944% with 412 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...nce/src/dataset/write/merge_insert/logical_plan.rs	55.73%	229 Missing and 14 partials ⚠️
rust/lance/src/dataset/write/merge_insert.rs	89.33%	87 Missing and 13 partials ⚠️
...rc/dataset/write/merge_insert/exec/indexed_join.rs	75.78%	48 Missing and 21 partials ⚠️


jackye1995 · 2025-12-30T22:52:56Z

rust/lance/src/dataset/write/merge_insert/exec/indexed_lookup.rs

+/// - Source columns
+/// - Target columns (including `_rowid`, `_rowaddr`)
+#[derive(Debug)]
+pub struct IndexedLookupExec {


I ended up wrapping this pipeline of source -> project -> MapIndexExec -> AddRowAddrExec -> TakeExec -> project as this execution node, since otherwise DF optimizer keeps adding additional repartitioning steps in between. Not sure if there is any better way.

wjones127 · 2025-12-31

"otherwise DF optimizer keeps adding additional repartitioning steps in between"

I think the DataFusion optimizer does this based on ExecutionPlan::benefits_from_input_partitioning and ExecutionPlan::required_input_distribution. It compares those settings with the children nodes input partitioning to decide whether to add repartitioning. So also worth making sure you are stating the partitioning correctly.
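
To make that decision concrete, here is a deliberately simplified, std-only model. It is not DataFusion's code: the real ExecutionPlan methods return one entry per child and the real distribution rule handles many more cases. `PlanNode`, `Node`, and `needs_exchange` are hypothetical, and this toy assumes a single child.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Distribution {
    Unspecified,
    SinglePartition,
    HashPartitioned(Vec<String>),
}

#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    UnknownPartitioning(usize),
    Hash(Vec<String>, usize),
}

/// Toy version of the two settings a node declares about its input.
trait PlanNode {
    fn required_input_distribution(&self) -> Distribution;
    fn benefits_from_input_partitioning(&self) -> bool;
}

struct Node {
    required: Distribution,
    benefits: bool,
}

impl PlanNode for Node {
    fn required_input_distribution(&self) -> Distribution {
        self.required.clone()
    }
    fn benefits_from_input_partitioning(&self) -> bool {
        self.benefits
    }
}

/// True if the optimizer would place a repartition or coalesce step between
/// the child and the parent in this simplified model.
fn needs_exchange(parent: &dyn PlanNode, child_output: &Partitioning, target_partitions: usize) -> bool {
    let child_count = match child_output {
        Partitioning::UnknownPartitioning(n) | Partitioning::Hash(_, n) => *n,
    };
    match parent.required_input_distribution() {
        Distribution::SinglePartition => child_count != 1,
        Distribution::HashPartitioned(keys) => {
            !matches!(child_output, Partitioning::Hash(k, _) if *k == keys)
        }
        // No hard requirement: only repartition if the parent says it helps.
        Distribution::Unspecified => {
            parent.benefits_from_input_partitioning() && child_count < target_partitions
        }
    }
}
```

In this model, a node with no required distribution that reports no benefit from input partitioning gets no extra step over a single-partition child, while the same node reporting a benefit does.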

jackye1995 · 2025-12-30T22:55:59Z

rust/lance/src/dataset/write/merge_insert/exec/indexed_lookup.rs

+    ///
+    /// The source_replay must be a ReplayExec created externally. This ensures
+    /// the same Arc is used for both the exposed child and internal DAG.
+    fn build_join_pipeline_with_replay(


For the fully indexed case, I ended up doing the join within this exec node so that it can more easily pass around the replay node.

github-actions bot added the enhancement and java labels Dec 29, 2025

jackye1995 force-pushed the merge-insert-indexed branch 2 times, most recently from aa2ad6d to 45a32bf December 30, 2025 00:59

jackye1995 force-pushed the merge-insert-indexed branch from 45a32bf to 72c45a0 December 30, 2025 05:59

jackye1995 marked this pull request as draft December 30, 2025 08:26

jackye1995 force-pushed the merge-insert-indexed branch 2 times, most recently from 97d2dc2 to c537499 December 30, 2025 18:08

use hash join instead of inner loop

84b55ad

jackye1995 force-pushed the merge-insert-indexed branch from c537499 to 84b55ad December 30, 2025 22:21

minor rename

c5516e9

jackye1995 commented Dec 30, 2025

jackye1995 marked this pull request as ready for review December 30, 2025 22:53

jackye1995 commented Dec 30, 2025

jackye1995 requested a review from wjones127 December 30, 2025 22:57

wjones127 Dec 31, 2025