[SPARK-XXXXX][SQL] Add CrossJoinArrayContainsToInnerJoin optimizer rule #54140
yaooqinn wants to merge 5 commits into apache:master
### What changes were proposed in this pull request?

This PR adds a new optimizer rule that converts a cross join filtered by `array_contains` into an inner equi-join over the exploded array. It transforms:

```sql
SELECT * FROM orders, items WHERE array_contains(orders.item_ids, items.id)
```

from an O(N×M) cross join + filter into an inner join that is roughly O(N+M) when the arrays are short:

```sql
SELECT * FROM (SELECT *, explode(array_distinct(item_ids)) as _unnested FROM orders) t, items WHERE t._unnested = items.id
```

### Why are the changes needed?

Cross joins with `array_contains` are a common pattern that can be very expensive on large datasets. This optimization provides a ~10-15X speedup in the microbenchmark listed below.

### Does this PR introduce _any_ user-facing change?

No, this is an internal optimizer improvement.

### How was this patch tested?

- Unit tests: 15 tests covering basic transformation, type support, and edge cases
- Microbenchmark: 10K×1K dataset showing ~10-15X speedup

### Was this patch authored or co-authored using generative AI tooling?

Yes, GitHub Copilot CLI assisted with implementation.
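
To make the rewrite in this description concrete, here is a minimal plain-Python model of the two query shapes. It is only a sketch: it does not use Spark, the function names and sample rows are hypothetical, and SQL NULL semantics are not modeled. It illustrates why the rewritten query wraps the array in `array_distinct`: without it, a repeated element would produce a duplicate joined row that the original `array_contains` filter never emits.

```python
def cross_join_filter(orders, items):
    """Baseline: test every (order, item) pair, like CROSS JOIN + array_contains."""
    return [
        (o["order_id"], i["id"])
        for o in orders
        for i in items
        if i["id"] in o["item_ids"]
    ]


def explode_hash_join(orders, items, dedupe=True):
    """Rewritten shape: explode the array, then probe a hash table built on items.id."""
    # Build side: group items by id, as a hash join would.
    by_id = {}
    for i in items:
        by_id.setdefault(i["id"], []).append(i)

    out = []
    for o in orders:
        ids = o["item_ids"]
        # dict.fromkeys drops duplicates while keeping first-seen order,
        # standing in for array_distinct.
        elems = list(dict.fromkeys(ids)) if dedupe else ids
        for e in elems:  # explode: one probe per array element
            for i in by_id.get(e, []):
                out.append((o["order_id"], i["id"]))
    return out


# Hypothetical sample data: order 1 repeats item 10, order 3 has an empty array.
orders = [
    {"order_id": 1, "item_ids": [10, 20, 10]},
    {"order_id": 2, "item_ids": [30]},
    {"order_id": 3, "item_ids": []},
]
items = [{"id": 10}, {"id": 20}, {"id": 40}]

baseline = cross_join_filter(orders, items)
rewritten = explode_hash_join(orders, items)
no_dedupe = explode_hash_join(orders, items, dedupe=False)
```

On this data `baseline` and `rewritten` contain the same two rows, while `no_dedupe` contains an extra copy of the `(1, 10)` row.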
…oinArrayContainsToInnerJoinBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…oinArrayContainsToInnerJoinBenchmark (JDK 17, Scala 2.13, split 1 of 1)

    val rightOut = right.outputSet

    // Find first valid array_contains predicate
    predicates.collectFirst {

Thanks for all the work, @yaooqinn. Quick question: do we want to consider cases where there are multiple `array_contains` predicates, potentially on the same array?
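
One way to reason about the multiple-predicate case, sketched in plain Python and not taken from the PR: explode on one `array_contains` predicate and keep the others as a residual filter after the join. Whether the rule actually does this is an assumption; the snippet under review only shows that it picks the first valid predicate. The function names, column names, and rows below are hypothetical.

```python
def cross_join_two_predicates(left, right):
    """Baseline: array_contains(l.arr, r.x) AND array_contains(l.arr, r.y)."""
    return [
        (l["id"], r["x"], r["y"])
        for l in left
        for r in right
        if r["x"] in l["arr"] and r["y"] in l["arr"]
    ]


def explode_first_filter_rest(left, right):
    """Join on the first predicate via explode, re-check the second afterwards."""
    # Hash the right side on the column used by the first predicate.
    by_x = {}
    for r in right:
        by_x.setdefault(r["x"], []).append(r)

    out = []
    for l in left:
        for e in dict.fromkeys(l["arr"]):  # distinct elements, first-seen order
            for r in by_x.get(e, []):
                # Residual filter: the second predicate still reads the
                # original, unexploded array.
                if r["y"] in l["arr"]:
                    out.append((l["id"], r["x"], r["y"]))
    return out


left = [{"id": 1, "arr": [1, 2, 2]}, {"id": 2, "arr": [3]}]
right = [{"x": 1, "y": 2}, {"x": 2, "y": 5}, {"x": 3, "y": 3}]

expected = cross_join_two_predicates(left, right)
actual = explode_first_filter_rest(left, right)
```

In this model both functions return the same rows, which suggests that a second predicate on the same array can stay as an ordinary filter as long as the original array column is still available after the explode.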
Do we also want to test the explode performance with large arrays and small right tables?
Addresses the review comment about testing explosion performance with large arrays and small right tables. The new benchmark helps identify cases where the optimization might be counterproductive.

Cost comparison:
- Cross join + filter: O(left_rows * right_rows)
- Explode + join: O(left_rows * array_size)

When array_size >> right_table_size, the cross join may be faster.
- Use the same cross-product size (10M) as the standard case
- numOrders=10000, numItems=100, arraySize=1000
- This makes the regression case more comparable
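
As a rough worked example of the cost comparison, the snippet below plugs the regression-case parameters into the two row-count formulas from the commit message. It is a counting model only: constant factors, hash-table build cost, and the effect of `array_distinct` are ignored, and the helper names are hypothetical.

```python
def cross_join_pairs(left_rows: int, right_rows: int) -> int:
    """Pairs evaluated by cross join + filter: left_rows * right_rows."""
    return left_rows * right_rows


def exploded_rows(left_rows: int, array_size: int) -> int:
    """Rows produced by explode before the join: left_rows * array_size."""
    return left_rows * array_size


# Regression-case parameters from the benchmark commit.
num_orders, num_items, array_size = 10_000, 100, 1_000

pairs = cross_join_pairs(num_orders, num_items)  # 10_000 * 100
rows = exploded_rows(num_orders, array_size)     # 10_000 * 1_000
ratio = rows / pairs
```

Here `pairs` is 1,000,000 and `rows` is 10,000,000, so under this model the explode plan handles ten times as many rows as the cross join evaluates pairs. The 10,000,000 exploded rows also equal the 10K×1K cross product of the standard benchmark case.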
Thank you @cboumalh, your concern makes sense. Please give me some more time; this is still a work in progress and I will keep improving it.