[SPARK-XXXXX][SQL] Add CrossJoinArrayContainsToInnerJoin optimizer rule #54140

Draft
yaooqinn wants to merge 5 commits into apache:master from yaooqinn:feature/crossjoin-array-contains-benchmark

@yaooqinn yaooqinn commented Feb 4, 2026

@yaooqinn yaooqinn marked this pull request as draft February 4, 2026 14:12

github-actions bot commented Feb 4, 2026

⚠️ Pull Request Title Validation

This pull request title does not contain a JIRA issue ID.

Please update the title to either:

  • Include a JIRA ID: [SPARK-12345] Your description
  • Mark as minor change: [MINOR] Your description

For minor changes that don't require a JIRA ticket (e.g., typo fixes), please prefix the title with [MINOR].


This comment was automatically generated by GitHub Actions

@github-actions github-actions bot added the SQL label Feb 4, 2026
### What changes were proposed in this pull request?

This PR adds a new optimizer rule that converts a cross join filtered by `array_contains` into an inner equi-join over the exploded array. The rule applies to queries such as:

```sql
SELECT * FROM orders, items WHERE array_contains(orders.item_ids, items.id)
```

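This turns an O(N×M) cross join plus filter into an inner equi-join that costs roughly O(N×K + M) with a hash join, where K is the average number of distinct elements per array: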

```sql
SELECT * FROM (SELECT *, explode(array_distinct(item_ids)) as _unnested FROM orders) t, items
WHERE t._unnested = items.id
```
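
The equivalence of the two queries can be checked on a toy dataset. The following is a minimal Python sketch, not the Spark rule itself: the `orders` and `items` rows and the helper names `cross_join_filter` and `explode_hash_join` are made up for illustration, a dict stands in for whatever join strategy Spark would pick, and NULL handling is ignored. It also illustrates why `array_distinct` appears in the rewrite: without it, a repeated element in `item_ids` would emit the matching pair more than once.

```python
from collections import defaultdict

# Hypothetical toy tables: (order_id, item_ids) and (item_id, name).
orders = [(1, [10, 20, 20]), (2, [30]), (3, [])]
items = [(10, "pen"), (20, "ink"), (40, "cap")]


def cross_join_filter(orders, items):
    """Original plan: compare every order with every item."""
    return [
        (order_id, item_id)
        for order_id, item_ids in orders
        for item_id, _name in items
        if item_id in item_ids  # models array_contains(item_ids, id)
    ]


def explode_hash_join(orders, items, distinct=True):
    """Rewritten plan: explode the array, then equi-join on the element."""
    # Build side: index items by id.
    by_id = defaultdict(list)
    for item_id, _name in items:
        by_id[item_id].append(item_id)

    out = []
    for order_id, item_ids in orders:
        # dict.fromkeys de-duplicates while keeping order (array_distinct).
        elements = list(dict.fromkeys(item_ids)) if distinct else item_ids
        for element in elements:  # models explode(...)
            for item_id in by_id.get(element, []):
                out.append((order_id, item_id))
    return out


baseline = cross_join_filter(orders, items)
rewritten = explode_hash_join(orders, items)
without_distinct = explode_hash_join(orders, items, distinct=False)

print(sorted(baseline) == sorted(rewritten))  # do both plans return the same rows?
print(len(without_distinct) - len(baseline))  # extra rows when distinct is skipped
```

Orders with an empty array or with no matching item drop out of both plans, which is why an inner join (rather than an outer join) preserves the original result in this sketch.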

### Why are the changes needed?

Cross joins filtered by `array_contains` are a common pattern and become very expensive on large datasets, because every pair of rows must be evaluated. In the included microbenchmark (10K×1K dataset), the rewrite is roughly 10-15X faster.

### Does this PR introduce _any_ user-facing change?

No, this is an internal optimizer improvement.

### How was this patch tested?

- Unit tests: 15 tests covering basic transformation, type support, and edge cases
- Microbenchmark: 10K×1K dataset showing ~10-15X speedup

### Was this patch authored or co-authored using generative AI tooling?

Yes, GitHub Copilot CLI assisted with implementation.
@yaooqinn yaooqinn force-pushed the feature/crossjoin-array-contains-benchmark branch from dd2b1f1 to c0494d6 Compare February 4, 2026 14:56
…oinArrayContainsToInnerJoinBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…oinArrayContainsToInnerJoinBenchmark (JDK 17, Scala 2.13, split 1 of 1)
val rightOut = right.outputSet

// Find first valid array_contains predicate
predicates.collectFirst {

@cboumalh cboumalh Feb 4, 2026

Thanks for all the work, @yaooqinn. Quick question: do we want to consider cases with multiple array_contains predicates, potentially on the same array?
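
To make the question concrete, here is a small Python sketch of one possible handling, which the PR does not confirm: only the first `array_contains` predicate is turned into the join key (mirroring the `collectFirst` call in the snippet above), and any further predicate stays behind as an ordinary filter. The data, the function names, and the assumption that item ids are unique are all hypothetical.

```python
# Hypothetical rows: each order carries two arrays, (order_id, a, b).
orders = [
    (1, [10, 20], [20, 30]),
    (2, [10], [10]),
]
items = [10, 20, 30]  # assumed unique item ids


def reference(orders, items):
    """Cross join with both array_contains predicates applied as filters."""
    return [
        (oid, item)
        for oid, a, b in orders
        for item in items
        if item in a and item in b
    ]


def first_predicate_as_join(orders, items):
    """Explode only the first array; keep the second predicate as a filter."""
    item_set = set(items)  # stands in for the join's build side
    return [
        (oid, element)
        for oid, a, b in orders
        for element in dict.fromkeys(a)  # explode(array_distinct(a))
        if element in item_set           # equi-join on the element
        and element in b                 # residual array_contains filter
    ]


print(sorted(reference(orders, items)) == sorted(first_predicate_as_join(orders, items)))
```

Under these assumptions the residual filter keeps the result identical to the unoptimized query; whether the rule should instead pick the more selective predicate is left open here.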

@cboumalh cboumalh Feb 4, 2026

Do we want to consider testing the explosion performance with large arrays and small right tables?

Addresses review comment about testing explosion performance with large arrays
and small right tables. The new benchmark helps identify cases where the
optimization might be counterproductive.

Cost comparison:
- Cross join + filter: O(left_rows * right_rows)
- Explode + join: O(left_rows * array_size)

When array_size >> right_table_size, cross join may be faster.
- Use same cross-product size (10M) as standard case
- numOrders=10000, numItems=100, arraySize=1000
- This makes the regression case more comparable
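
The cost comparison above can be turned into a back-of-the-envelope calculation. This Python sketch only counts rows according to the two formulas in the commit message; it ignores hashing, shuffles, and code generation, and the function names are hypothetical. The inputs are the `numOrders`, `numItems`, and `arraySize` values listed above.

```python
def cross_join_cost(left_rows: int, right_rows: int) -> int:
    """Pairs evaluated by cross join + filter: left_rows * right_rows."""
    return left_rows * right_rows


def explode_join_cost(left_rows: int, array_size: int) -> int:
    """Rows produced by explode before the join: left_rows * array_size."""
    return left_rows * array_size


# Figures from the benchmark configuration listed above.
num_orders, num_items, array_size = 10_000, 100, 1_000

cross = cross_join_cost(num_orders, num_items)
exploded = explode_join_cost(num_orders, array_size)

print(f"cross join pairs: {cross:,}")
print(f"exploded rows:    {exploded:,}")
print(f"explode / cross:  {exploded / cross:.1f}x")
```

With these inputs the model reports 1,000,000 compared pairs for the cross join and 10,000,000 exploded rows, so the 10M figure corresponds to the exploded side. More generally, under this simple model the rewrite only reduces work while `array_size` is smaller than the number of rows in the right table.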

yaooqinn commented Feb 6, 2026

Thank you, @cboumalh, your concern makes sense. Please give me some more time; this is still a work in progress and I will improve it.
