Upgrade DataFusion fork to 53 #56

Merged: xudong963 merged 6 commits into branch-53 from massive-upgrade-53-atlas on May 20, 2026

@xudong963 xudong963 commented May 17, 2026

Summary

  • Upgrade the Massive DataFusion fork to the DataFusion 53 line.
  • Carry forward the branch-52 fork-specific fixes in the 53-compatible source tree.
  • Fix the previous DF53 upgrade CI failures in force_hash_collisions, clippy, and order.slt plan expectations.

@xudong963 force-pushed the massive-upgrade-53-atlas branch from a79aff5 to 5f59c7b on May 17, 2026 17:10
@github-actions (bot) removed the documentation, sql, substrait, ffi, proto, execution, and spark labels on May 17, 2026
@xudong963 changed the title from "Upgrade Massive DataFusion fork to 53" to "Upgrade DataFusion fork to 53" on May 18, 2026

// RecordBatch::try_new_with_options checks that if the schema is NOT NULL
// the array cannot contain nulls, amongst other checks.
let (_stream_schema, arrays, num_rows) = b.into_parts();
//

0a0302b Restore DF 51 SchemaAdapter cast behaviour in ParquetOpener (#45)

Ok(None)
}

/// Returns `Ok(None)` when the file is not inside a valid partition path

b4dbb6a Skip files outside partition structure in hive-partitioned listing tables (#51)
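
The skip behaviour described in that doc comment can be illustrated with a small standalone sketch. This is not the fork's code: `partition_values`, its parameters, and the plain string paths are hypothetical, and it returns a bare `Option` where the real function returns `Ok(None)` inside a `Result`. It only shows the idea of yielding "no partition values" for a file that does not sit under `col=value/` directories, so the caller can skip it instead of failing.

```rust
/// Hypothetical helper, not DataFusion's API.
///
/// Returns `Some(values)` when `file_path` lies under `table_root` inside one
/// `col=value/` directory per expected partition column, in order.
/// Returns `None` otherwise, so a listing loop can skip the file.
fn partition_values(
    table_root: &str,
    file_path: &str,
    partition_cols: &[&str],
) -> Option<Vec<String>> {
    // Files that are not under the table root are never part of the table.
    let rel = file_path.strip_prefix(table_root)?;
    let rel = rel.trim_start_matches('/');

    let mut segments = rel.split('/');
    let mut values = Vec::with_capacity(partition_cols.len());

    for col in partition_cols {
        // Each expected column must appear as a `name=value` directory.
        let segment = segments.next()?;
        let (name, value) = segment.split_once('=')?;
        if name != *col {
            return None;
        }
        values.push(value.to_string());
    }

    // At least one more segment (the file name) must remain.
    segments.next()?;
    Some(values)
}
```

With partition columns `year` and `month`, a file at `<root>/year=2024/month=05/a.parquet` yields both values, while `<root>/stray.parquet` yields `None` and would be skipped by the caller.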

{
// Use SortMergeJoin if hash join is not preferred
let join_on_len = join_on.len();
// Derive sort options from the left input's existing ordering

8ca2242 feat: derive SMJ sort options from left child during plan creation (#43)

);
check_if_same_properties!(self, children);
Ok(Arc::new(InterleaveExec::try_new(children)?))
// Optimizer rewrites can change child partitioning after InterleaveExec

c7ba34f Fix: InterleaveExec fallback to UnionExec when children partitioning diverges
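
A rough sketch of the fallback named in that commit title, under stated assumptions: the `Partitioning` and `CombineNode` enums and both functions below are simplified stand-ins written for illustration, not DataFusion types. The sketch assumes interleaving is only valid when every child reports the same hash partitioning, and that a plain union is the safe choice once an optimizer rewrite has made the children diverge.

```rust
/// Hypothetical, simplified partitioning descriptor (not DataFusion's type).
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Hash { keys: Vec<String>, partitions: usize },
    RoundRobin(usize),
}

/// Which node to build when combining the children.
#[derive(Debug, PartialEq)]
enum CombineNode {
    Interleave,
    Union,
}

/// Assumption: interleaving requires at least one child, and all children
/// must share one identical hash partitioning.
fn can_interleave(children: &[Partitioning]) -> bool {
    match children.first() {
        Some(first) => {
            matches!(first, Partitioning::Hash { .. })
                && children.iter().all(|c| c == first)
        }
        None => false,
    }
}

/// Re-check the children on rebuild instead of assuming they still match;
/// fall back to a union when they no longer do.
fn rebuild_with_children(children: &[Partitioning]) -> CombineNode {
    if can_interleave(children) {
        CombineNode::Interleave
    } else {
        CombineNode::Union
    }
}
```

The design point is that the check runs every time the node is rebuilt with new children, so a rewrite that changes one child's partitioning degrades to a union rather than producing an error.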

74772cb Fix memory reservation starvation in sort-merge
8dcb444 Cherry-pick: Fix sort merge interleave overflow
795aa28 Cherry pick sort merge fixes 52

@zhuqi-lucas zhuqi-lucas left a comment

Three branch-52 unit tests aren't in the PR head:

  • test_done_drains_buffered_rows: the sort-merge happy-path drain. The drain_in_progress_on_done field is preserved, but only the error path is tested.
  • test_no_extra_spm_from_output_requirement_single_partition: SPM idempotency ("don't re-add SPM when one already exists").
  • test_sort_pushdown_adds_spm_for_single_partition_requirement: the new output_requirement_adds_merge_after_partition_preserving_sort covers a similar scenario, but starts from UnionExec rather than SPM + Sort(preserve=true) + Repartition.

Should we add these tests back to avoid regressions?

@zhuqi-lucas zhuqi-lucas left a comment

Commit 05a6c45 (apache#21947), "Skip unnecessary plan rebuild in adjust_input_keys_ordering for non-join plans", is already in branch-52 and should be landed here as well.

zhuqi-lucas and others added 3 commits May 20, 2026 13:58
…oin plans (apache#21947)

Closes apache#21946

`adjust_input_keys_ordering` returns `Transformed::yes` unconditionally
in the default else branch, even when `requirements.data` is empty and
no changes were made. This triggers unnecessary `with_new_children`
rebuilds on every node in the plan tree for non-join/non-aggregate
queries.

For plans with custom `ExecutionPlan` nodes whose `with_new_children` is
expensive (e.g. nodes that re-evaluate cost functions on rebuild), this
causes significant overhead.

Add an early return with `Transformed::no` when
`requirements.data.is_empty()` in the default else branch of
`adjust_input_keys_ordering`. This skips the unnecessary plan tree
rebuild for simple scan/filter/limit plans that have no join key
reordering requirements.

Tested: yes, two unit tests added:
- `adjust_input_keys_ordering_no_transform_for_scan` — verifies a bare
parquet scan returns `Transformed::no`
- `adjust_input_keys_ordering_no_transform_for_filter_scan` — verifies a
filter→scan tree returns `Transformed::no` via `transform_down`

User-facing changes: none. This is a performance optimization that does not
change query results or plan structure.

(cherry picked from commit 05a6c45)
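
The early return described in the commit message above can be sketched in isolation. Everything here is a simplified model written for illustration: `Transformed` mimics only the `yes`/`no` constructors and the changed flag, while `Node` and `adjust_default_branch` are hypothetical names. The sketch also assumes the default branch pushes the parent's requirements down to its children. It is not the DataFusion implementation.

```rust
/// Simplified stand-in for a "was this node changed?" wrapper.
#[derive(Debug, PartialEq)]
struct Transformed<T> {
    data: T,
    transformed: bool,
}

impl<T> Transformed<T> {
    fn yes(data: T) -> Self {
        Self { data, transformed: true }
    }
    fn no(data: T) -> Self {
        Self { data, transformed: false }
    }
}

/// Hypothetical plan node: `data` holds the join-key requirements that the
/// parent wants to push down.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    data: Vec<String>,
    children: Vec<Node>,
}

/// Models only the default (non-join, non-aggregate) branch.
fn adjust_default_branch(mut node: Node) -> Transformed<Node> {
    if node.data.is_empty() {
        // Nothing to push down: report "unchanged" so the tree walker does
        // not rebuild this node through an expensive `with_new_children`.
        return Transformed::no(node);
    }
    // Assumed default behaviour: copy the parent's requirements to each child.
    for child in node.children.iter_mut() {
        child.data = node.data.clone();
    }
    Transformed::yes(node)
}
```

In this model a bare scan with no requirements comes back flagged as unchanged, which mirrors what the two new unit tests in the commit check for the scan and filter→scan cases.
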
@xudong963

@zhuqi-lucas Nice, I added them back

@zhuqi-lucas

> @zhuqi-lucas Nice, I added them back

LGTM

@xudong963 xudong963 merged commit d66824f into branch-53 May 20, 2026
63 checks passed