yaooqinn · 2026-02-04T14:12:12Z

What changes were proposed in this pull request?

This PR adds a new optimizer rule that converts a cross join with an array_contains filter into a more efficient inner join using explode. It transforms:

SELECT * FROM orders, items WHERE array_contains(orders.item_ids, items.id)

From O(N×M) cross join + filter to O(N+M) inner join:

SELECT * FROM (SELECT *, explode(array_distinct(item_ids)) as _unnested FROM orders) t, items
WHERE t._unnested = items.id
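To make the equivalence concrete, here is a minimal pure-Python sketch (not Spark code, and not the rule's implementation) that evaluates both query shapes on tiny in-memory tables. The sample rows and the helper names `cross_join_filter` and `explode_join` are hypothetical, and the sketch assumes there are no NULLs in the arrays or ids.

```python
from collections import defaultdict

# Hypothetical sample data: (order_id, item_ids) and (item_id, name).
orders = [(1, [10, 20, 10]), (2, [30]), (3, [])]
items = [(10, "apple"), (20, "pear"), (40, "plum")]


def cross_join_filter(orders, items):
    """Original shape: every (order, item) pair, filtered by array membership."""
    return [
        (order_id, item_id)
        for order_id, item_ids in orders
        for item_id, _name in items
        if item_id in item_ids
    ]


def explode_join(orders, items):
    """Rewritten shape: explode the de-duplicated array, then join on equality."""
    # Build side: index the items table by id, as a hash join would.
    by_id = defaultdict(list)
    for item_id, _name in items:
        by_id[item_id].append(item_id)

    result = []
    for order_id, item_ids in orders:
        # dict.fromkeys de-duplicates while keeping first-seen order,
        # standing in for array_distinct; the loop stands in for explode.
        for unnested in dict.fromkeys(item_ids):
            for matched in by_id.get(unnested, []):
                result.append((order_id, matched))
    return result


same = sorted(cross_join_filter(orders, items)) == sorted(explode_join(orders, items))
print(same)
```

Order 1 lists item 10 twice. Without the de-duplication step the exploded join would emit that match twice, while the original filter emits it once, which is presumably why the rewritten query wraps the array in array_distinct.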

Why are the changes needed?

Cross joins with array_contains are a common pattern that can be very expensive on large datasets. In the included microbenchmark (10K×1K dataset), this optimization gives a roughly 10-15X speedup.

Does this PR introduce any user-facing change?

No, this is an internal optimizer improvement.

How was this patch tested?

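Unit tests: 15 tests covering basic transformation, type support, and edge cases
Microbenchmark: 10K×1K dataset showing ~10-15X speedup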

Was this patch authored or co-authored using generative AI tooling?

Yes, GitHub Copilot CLI assisted with implementation.

github-actions · 2026-02-04T14:12:21Z

⚠️ Pull Request Title Validation

This pull request title does not contain a JIRA issue ID.

Please update the title to either:

Include a JIRA ID: [SPARK-12345] Your description
Mark as minor change: [MINOR] Your description

For minor changes that don't require a JIRA ticket (e.g., typo fixes), please prefix the title with [MINOR].

This comment was automatically generated by GitHub Actions

### What changes were proposed in this pull request?

This PR adds a new optimizer rule that converts cross joins with `array_contains` filter into efficient inner joins using explode. This transforms:

```sql
SELECT * FROM orders, items WHERE array_contains(orders.item_ids, items.id)
```

From O(N×M) cross join + filter to O(N+M) inner join:

```sql
SELECT * FROM (SELECT *, explode(array_distinct(item_ids)) as _unnested FROM orders) t, items
WHERE t._unnested = items.id
```

### Why are the changes needed?

Cross joins with `array_contains` are common patterns that can be very expensive for large datasets. This optimization provides 10-15X speedup on realistic workloads.

### Does this PR introduce _any_ user-facing change?

No, this is an internal optimizer improvement.

### How was this patch tested?

- Unit tests: 15 tests covering basic transformation, type support, and edge cases
- Microbenchmark: 10K×1K dataset showing ~10-15X speedup

### Was this patch authored or co-authored using generative AI tooling?

Yes, GitHub Copilot CLI assisted with implementation.

cboumalh · 2026-02-04T17:38:13Z

...c/main/scala/org/apache/spark/sql/catalyst/optimizer/CrossJoinArrayContainsToInnerJoin.scala

+    val rightOut = right.outputSet
+
+    // Find first valid array_contains predicate
+    predicates.collectFirst {


Thanks for all the work, @yaooqinn. Quick question: do we want to consider cases with multiple array_contains predicates, potentially on the same array?
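One way to picture the multi-predicate case is sketched below in plain Python. This is an illustration of one possible strategy, not a description of what the PR does: the quoted diff only shows that the rule picks the first valid predicate via `collectFirst`, and how any remaining predicates are treated is an assumption here. The sketch turns the first array_contains into the join key and keeps the second as a residual filter; the column name `promo_ids`, the sample rows, and the function names are all hypothetical.

```python
# Hypothetical rows: (order_id, item_ids, promo_ids); item ids are assumed unique.
orders = [(1, [10, 20], [10]), (2, [10], [30])]
items = [10, 20, 30]


def reference(orders, items):
    """Cross join filtered by both array_contains predicates."""
    return [
        (order_id, item_id)
        for order_id, item_ids, promo_ids in orders
        for item_id in items
        if item_id in item_ids and item_id in promo_ids
    ]


def rewrite_first_predicate(orders, items):
    """Explode the first array as the join key; keep the second predicate as a filter."""
    item_set = set(items)
    out = []
    for order_id, item_ids, promo_ids in orders:
        for unnested in dict.fromkeys(item_ids):  # array_distinct + explode
            # Equality join on the exploded value, then the residual predicate.
            if unnested in item_set and unnested in promo_ids:
                out.append((order_id, unnested))
    return out


print(sorted(reference(orders, items)) == sorted(rewrite_first_predicate(orders, items)))
```

Under these assumptions the residual predicate is evaluated only on rows that survive the equality join, so correctness does not depend on which of the two predicates is chosen as the join key.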

cboumalh · 2026-02-04T17:53:48Z

...la/org/apache/spark/sql/execution/benchmark/CrossJoinArrayContainsToInnerJoinBenchmark.scala

Do we want to consider testing the explosion performance with large arrays and small right tables?

Addresses review comment about testing explosion performance with large arrays and small right tables. The new benchmark helps identify cases where the optimization might be counterproductive.

Cost comparison:
- Cross join + filter: O(left_rows * right_rows)
- Explode + join: O(left_rows * array_size)

When array_size >> right_table_size, cross join may be faster.
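The cost comparison in this commit message can be worked through numerically. The sketch below simply plugs row counts into the two formulas; it ignores hash-table build cost, shuffles, and everything else a real optimizer would weigh, and the function names are hypothetical. The "regression" inputs are the benchmark parameters quoted in this thread (numOrders=10000, numItems=100, arraySize=1000); the array size of 10 in the "standard" case is an assumption, since the thread gives only the 10K×1K table sizes.

```python
def estimated_rows(left_rows: int, right_rows: int, array_size: int) -> dict:
    """Rows each plan has to examine, per the two formulas in the commit message."""
    return {
        "cross_join_filter": left_rows * right_rows,
        "explode_join": left_rows * array_size,
    }


def cheaper_plan(left_rows: int, right_rows: int, array_size: int) -> str:
    """Name of the plan with the smaller row estimate."""
    costs = estimated_rows(left_rows, right_rows, array_size)
    return min(costs, key=costs.get)


# Standard case: 10K x 1K tables; array_size=10 is an assumed value.
print(estimated_rows(10_000, 1_000, 10), cheaper_plan(10_000, 1_000, 10))

# Large-array case: parameters taken from the benchmark commit in this thread.
print(estimated_rows(10_000, 100, 1_000), cheaper_plan(10_000, 100, 1_000))
```

With the large-array parameters the exploded side is ten times larger than the cross product, which matches the commit's point that the rewrite can be counterproductive when array_size is much larger than the right table.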

- Use same cross-product size (10M) as standard case
- numOrders=10000, numItems=100, arraySize=1000
- This makes the regression case more comparable

yaooqinn · 2026-02-06T06:37:17Z

Thank you, @cboumalh, your concerns make sense. Please give me some more time; this is still a work in progress and I will keep improving it.

yaooqinn marked this pull request as draft February 4, 2026 14:12

github-actions bot added the SQL label Feb 4, 2026

yaooqinn force-pushed the feature/crossjoin-array-contains-benchmark branch from dd2b1f1 to c0494d6 February 4, 2026 14:56

yaooqinn added 2 commits February 4, 2026 15:10

Benchmark results for org.apache.spark.sql.execution.benchmark.CrossJoinArrayContainsToInnerJoinBenchmark (JDK 21, Scala 2.13, split 1 of 1)

83afcea

Benchmark results for org.apache.spark.sql.execution.benchmark.CrossJoinArrayContainsToInnerJoinBenchmark (JDK 17, Scala 2.13, split 1 of 1)

2333d6e

cboumalh reviewed Feb 4, 2026


yaooqinn added 2 commits February 5, 2026 05:30

Update large array benchmark with fair comparison

21cd919


[SPARK-XXXXX][SQL] Add CrossJoinArrayContainsToInnerJoin optimizer rule #54140

yaooqinn wants to merge 5 commits into apache:master from yaooqinn:feature/crossjoin-array-contains-benchmark
