
feat: add array_exists with lambda expression support #3611

Open
andygrove wants to merge 3 commits into apache:main from andygrove:feat/array-exists-lambda

Conversation

@andygrove (Member) commented Feb 27, 2026

Closes #3149

Summary

  • Add native support for array_exists(arr, x -> predicate(x)) in SQL and DataFrame API
  • First general-purpose lambda expression infrastructure, extensible to array_filter, array_transform, array_forall
  • Vectorized lambda evaluation: flattens list elements, evaluates lambda body once over expanded batch, reduces per row with SQL three-valued logic
  • Unsupported lambda bodies (e.g. containing UDFs) fall back to Spark correctly
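The per-row reduction mentioned above follows SQL three-valued logic: a row is TRUE if any element's lambda result is true, NULL if no element is true but at least one result is NULL, and FALSE otherwise (including empty arrays). A minimal stdlib-only sketch (not the actual `ArrayExistsExpr` code), modeling nullable booleans as `Option<bool>`:

```rust
// Hypothetical sketch of EXISTS reduction under SQL three-valued logic.
// `element_results` holds the lambda result for each element of one row.
fn exists_reduce(element_results: &[Option<bool>]) -> Option<bool> {
    let mut saw_null = false;
    for r in element_results {
        match r {
            Some(true) => return Some(true), // any TRUE wins immediately
            None => saw_null = true,         // remember a NULL result
            Some(false) => {}                // FALSE contributes nothing
        }
    }
    // No TRUE seen: NULL if any result was NULL, else FALSE.
    if saw_null { None } else { Some(false) }
}

fn main() {
    assert_eq!(exists_reduce(&[Some(false), Some(true)]), Some(true));
    assert_eq!(exists_reduce(&[Some(false), None]), None);
    assert_eq!(exists_reduce(&[Some(false)]), Some(false));
    assert_eq!(exists_reduce(&[]), Some(false)); // empty array -> FALSE
    println!("ok");
}
```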

Add native support for `array_exists(arr, x -> predicate(x))` in SQL
and DataFrame API. This is the first general-purpose lambda expression
infrastructure, which can later be extended to support `array_filter`,
`array_transform`, and `array_forall`.

The lambda body is serialized as a regular expression tree where
`NamedLambdaVariable` leaf nodes are serialized as `LambdaVariable`
proto messages. On the Rust side, `ArrayExistsExpr` evaluates the
lambda body vectorized over all elements in a single pass: it flattens
list values, expands the batch with repeat indices, appends elements
as a `__comet_lambda_var` column, evaluates once, and reduces per row
with SQL three-valued logic semantics.
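The "expands the batch with repeat indices" step can be illustrated with a small stdlib-only sketch (a simplification, assuming Arrow-style list offsets; the real code uses the `take` kernel on Arrow arrays): each flattened element is mapped back to its parent row index, and taking every top-level column with those indices aligns the columns with the flattened elements.

```rust
// Hypothetical sketch: build repeat indices from ListArray-style offsets.
// offsets[row]..offsets[row + 1] spans row's elements in the flat values.
fn repeat_indices(offsets: &[usize]) -> Vec<usize> {
    let mut indices = Vec::with_capacity(*offsets.last().unwrap_or(&0));
    for row in 0..offsets.len().saturating_sub(1) {
        for _ in offsets[row]..offsets[row + 1] {
            indices.push(row); // one parent-row index per element
        }
    }
    indices
}

fn main() {
    // Two rows: [10, 20] and [30, 40, 50] -> offsets [0, 2, 5].
    let idx = repeat_indices(&[0, 2, 5]);
    assert_eq!(idx, vec![0, 0, 1, 1, 1]);

    // Simulated `take`: expand a parent column to element granularity,
    // so the lambda body can reference it alongside the element column.
    let col = ["a", "b"];
    let expanded: Vec<&str> = idx.iter().map(|&i| col[i]).collect();
    assert_eq!(expanded, vec!["a", "a", "b", "b", "b"]);
    println!("ok");
}
```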

Unsupported lambda bodies (e.g. containing UDFs) fall back to Spark.

Closes apache#3149
- Remove unused element_type proto field from ArrayExists
- Add LargeListArray support via decompose_list helper
- Use column index instead of name for lambda variable lookup
- Add TimestampNTZType to supported element types
- Restore CometNamedLambdaVariable as standalone serde object
- Remove SQL-based Scala tests (covered by SQL file tests)
- Add DataFrame tests for decimal and date element types
- Add negative test for unsupported element type fallback
- Add multi-column batch Rust unit test
Comment on lines +159 to +163
for (i, col) in batch.columns().iter().enumerate() {
    let expanded = take(col.as_ref(), &repeat_indices_array, None)?;
    expanded_columns.push(expanded);
    expanded_fields.push(Arc::new(batch.schema().field(i).clone()));
}
Contributor


non-blocking: I believe this will also expand uncaptured columns (those not referenced in the lambda body).
To avoid that costly expansion, is it possible to:

  1. Use a NullArray, since its creation is O(1) regardless of length,
  2. Include in the batch only the captured columns and the lambda variable, and rewrite the lambda body to adjust column indices, as done in http://github.com/apache/datafusion/pull/18329/changes#diff-ac23ff0fe78acd71875341026dd5907736e3e3f49e2c398a69e6b33cb6394ae8R92-R139



Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark expression: array_exists

2 participants