[VL][Delta] Add JVM Delta DV scan handoff#12198

Open
malinjawi wants to merge 2 commits into apache:main from malinjawi:split/delta-dv-java-scan-handoff-pr-clean

@malinjawi
Contributor

@malinjawi malinjawi commented May 31, 2026

What changes are proposed in this pull request?

This PR is the next split in the Delta deletion-vector scan stack after #12197, which has now landed.

It adds the JVM/Substrait/Velox handoff that consumes the essential Delta DV scan info extracted by #12197, materializes serialized DV payloads on the JVM side, and passes them to native scan execution.

Main changes:

  • add a Delta DV preprocessing rule for the Velox Delta component without replacing Delta's PrepareDeltaScan
  • reuse DeltaDeletionVectorScanInfo from #12197 ([VL][Delta] Add DV scan info extraction utility) to extract per-file DV metadata and serialized DV bytes from Delta-prepared scan files
  • add Delta local files Substrait nodes/builders carrying DeltaReadOptions
  • embed the serialized DV payload in DeltaReadOptions, instead of passing essential DV data through generic metadata columns
  • add a native DeltaSplitInfo path for Delta-specific split metadata
  • wire the handoff through VeloxIteratorApi, VeloxPlanConverter, WholeStageResultIterator, and SubstraitToVeloxPlan
  • strip Spark's synthetic DV predicate/internal columns only after the native split has the payload, so Velox applies the DV filter natively and avoids double filtering
  • add Spark 3.5 and Spark 4.0 focused handoff coverage
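
The ordering constraint in the list above (attach the DV payload to the split first, strip Spark's synthetic predicate second) can be sketched roughly as follows. This is a minimal illustration and not the PR's code: `ReadOptionsSketch`, `SplitSketch`, `ScanSketch`, and `DvHandoffSketch` are hypothetical stand-ins, the row-index filter type is omitted, and the one-byte-per-row-index encoding is a toy substitute for Delta's real serialized DV format.

```scala
// Hypothetical stand-ins for illustration only; these are not Gluten or Delta classes.
final case class ReadOptionsSketch(
    hasDeletionVector: Boolean,
    cardinality: Long,
    serializedDv: Array[Byte])

final case class SplitSketch(path: String, options: Option[ReadOptionsSketch])

final case class ScanSketch(splits: Seq[SplitSketch], hasSyntheticDvPredicate: Boolean)

object DvHandoffSketch {
  // Toy encoding (assumption): one byte per deleted row index, so indices must be below 128.
  def encode(deleted: Set[Int]): Array[Byte] =
    deleted.toSeq.sorted.map(_.toByte).toArray

  // Step 1: embed the payload in the split's read options.
  def attach(split: SplitSketch, deleted: Set[Int]): SplitSketch = {
    val options = ReadOptionsSketch(deleted.nonEmpty, deleted.size.toLong, encode(deleted))
    split.copy(options = Some(options))
  }

  // Step 2: drop the synthetic predicate only if every split that needs a DV carries one.
  def stripSyntheticPredicate(scan: ScanSketch, pathsNeedingDv: Set[String]): ScanSketch = {
    val covered = pathsNeedingDv.forall { path =>
      scan.splits.exists(s => s.path == path && s.options.exists(_.hasDeletionVector))
    }
    if (covered) scan.copy(hasSyntheticDvPredicate = false) else scan
  }

  // Simulated native read: rows are filtered exactly once, from the embedded payload.
  def nativeRead(rowCount: Int, split: SplitSketch): Seq[Int] = {
    val deleted = split.options
      .filter(_.hasDeletionVector)
      .map(_.serializedDv.map(_.toInt).toSet)
      .getOrElse(Set.empty[Int])
    (0 until rowCount).filterNot(deleted.contains)
  }
}
```

In this toy model, a scan whose split has no payload keeps its synthetic predicate, so the rows are still filtered on the Spark side; once the payload is attached, the predicate is removed and only the simulated native read filters.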

This PR is intentionally handoff-only:

  • the DV scan info extraction utility was reviewed and landed in #12197 ([VL][Delta] Add DV scan info extraction utility)
  • performance/benchmark iteration remains a follow-up after the correctness handoff shape is reviewed
  • DELETE/UPDATE/MERGE DV DML support remains in later split work

Issue: #11901

How was this patch tested?

Validation after rebasing on current upstream/main with #12197 included:

  • git diff --check upstream/main..HEAD
  • container ./dev/format-scala-code.sh check
  • container clang-format --dry-run -Werror on touched C++ files
  • container ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • container ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

@github-actions github-actions Bot added the CORE, VELOX, and DATA_LAKE labels May 31, 2026
@github-actions

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from ebb6e2d to bdec52e Compare June 4, 2026 10:25
@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from bdec52e to 55365da Compare June 4, 2026 10:28
@malinjawi malinjawi marked this pull request as ready for review June 4, 2026 10:29
@malinjawi
Contributor Author

Hey @zhztheplayer @zhouyuan, this PR is ready for review. This should close the DV scan migration from the initial POC work.

Member

@zhztheplayer zhztheplayer left a comment

Thanks. I have some initial broad questions about the use of reflections.

Comment on lines +62 to +66
    Class
      .forName(deltaDvPreprocessRuleClassName)
      .getConstructor(classOf[SparkSession])
      .newInstance(spark)
      .asInstanceOf[Rule[LogicalPlan]]

Member

Is there any particular reason for using reflection here? Given it's against Gluten's own class?

Contributor Author

Thanks @zhztheplayer, you are right: since this rule is Gluten-owned, reflection was not needed here.

I updated this path to construct PreprocessDeltaScanWithDeletionVectors directly. To keep the direct construction compiling across the existing Spark/Delta profile matrix, I also added no-op compatibility definitions for Delta 2.3/2.4. The actual DV preprocessing behavior remains in the Delta 3.3/4.x profiles.
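
As a rough illustration of the no-op compatibility idea described above, the sketch below shows a rule that keeps the same name and constructor shape but returns the plan unchanged, so a caller can construct it directly in every profile. All names here are hypothetical: `RuleSketch` stands in for Spark's `Rule[LogicalPlan]`, `PlanSketch` for a logical plan, and a plain `String` for the `SparkSession` argument.

```scala
// Stand-in for Spark's Rule[LogicalPlan] (assumption; the real rule would extend
// org.apache.spark.sql.catalyst.rules.Rule).
trait RuleSketch[P] {
  def apply(plan: P): P
}

// Hypothetical plan type used only to keep the example self-contained.
final case class PlanSketch(nodes: List[String])

// Variant for profiles without DV support: same constructor shape, no behavior.
class NoOpPreprocessSketch(session: String) extends RuleSketch[PlanSketch] {
  override def apply(plan: PlanSketch): PlanSketch = plan
}

object InjectorSketch {
  // Direct construction is checked by the compiler, unlike a Class.forName lookup.
  def buildRule(session: String): RuleSketch[PlanSketch] =
    new NoOpPreprocessSketch(session)
}
```

The trade-off in this sketch is one extra small definition per unsupported profile in exchange for a call site that fails at compile time rather than at runtime if the class is renamed.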

Comment on lines +217 to +285
    private def normalizeDeltaSplitMetadata(
        partitionColumnCount: Int,
        partitionFiles: Seq[PartitionedFile]): Option[NormalizedDeltaSplitMetadata] = {
      try {
        // scalastyle:off classforname
        val moduleClass = Class.forName(deltaScanInfoClassName)
        // scalastyle:on classforname
        val module = moduleClass.getField("MODULE$").get(null)
        val extractAllMethod = moduleClass.getMethod(
          "extractAllFromJava",
          classOf[SparkSession],
          classOf[Int],
          classOf[java.util.List[_]])
        val scanInfos = extractAllMethod
          .invoke(module, activeSparkSession, Int.box(partitionColumnCount), partitionFiles.asJava)
          .asInstanceOf[java.util.List[_]]
          .asScala
          .toSeq
        val splitMetadata = scanInfos.map(toDeltaSplitMetadata)
        if (splitMetadata.exists(_._2.hasDeletionVector())) {
          Some((splitMetadata.map(_._1), splitMetadata.map(_._2)))
        } else {
          None
        }
      } catch {
        case _: ClassNotFoundException | _: NoSuchMethodException =>
          None
      }
    }

    private def toDeltaSplitMetadata(
        scanInfo: Any): (java.util.Map[String, Object], DeltaFileReadOptions) = {
      val metadata = scanInfo
        .getClass
        .getMethod("normalizedOtherMetadataColumns")
        .invoke(scanInfo)
        .asInstanceOf[scala.collection.Map[String, Object]]
        .asJava
      val deletionVectorInfo = scanInfo.getClass.getMethod("deletionVectorInfo").invoke(scanInfo)
      val rowIndexFilterType = deletionVectorInfo
        .getClass
        .getMethod("rowIndexFilterType")
        .invoke(deletionVectorInfo)
        .toString
      val hasDeletionVector = deletionVectorInfo
        .getClass
        .getMethod("hasDeletionVector")
        .invoke(deletionVectorInfo)
        .asInstanceOf[Boolean]
      val cardinality = deletionVectorInfo
        .getClass
        .getMethod("cardinality")
        .invoke(deletionVectorInfo)
        .asInstanceOf[JLong]
        .longValue()
      val serializedDeletionVector = deletionVectorInfo
        .getClass
        .getMethod("serializedDeletionVector")
        .invoke(deletionVectorInfo)
        .asInstanceOf[Array[Byte]]

      (
        metadata,
        new DeltaFileReadOptions(
          toDeltaRowIndexFilterType(rowIndexFilterType),
          hasDeletionVector,
          cardinality,
          serializedDeletionVector))
    }

Member

If reflection is inevitable, can we encapsulate the reflection code in a utility class?

Contributor Author

@zhztheplayer I moved this reflection out of VeloxIteratorApi into DeltaSplitMetadataExtractor.

I think some reflection is still needed here because VeloxIteratorApi is in the common Velox source set, while DeltaDeletionVectorScanInfo is only compiled in Delta-specific source profiles. The extractor now isolates that bridge and caches the class/method lookups, so the iterator path only calls the utility and no longer carries the reflection block inline.

Let me know if you know of a workaround.
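
For context, the "isolate the bridge and cache the lookups" pattern can be sketched as below. This is only an assumption-laden illustration, not the contents of DeltaSplitMetadataExtractor: `ReflectiveBridgeSketch` is a hypothetical name, and to stay self-contained it targets a static JDK method instead of resolving a Scala object through `MODULE$` as the quoted code does. A missing class or method is cached as `None`, so callers fall back without repeating the failed lookup.

```scala
import java.lang.reflect.Method
import java.util.concurrent.ConcurrentHashMap

// Hypothetical utility: keeps all reflection in one place and caches lookups.
object ReflectiveBridgeSketch {
  // Keyed by "class#method"; a cached None records that the lookup already failed.
  private val cache = new ConcurrentHashMap[String, Option[Method]]()

  private def resolve(className: String, methodName: String): Option[Method] = {
    val key = s"$className#$methodName"
    val cached = cache.get(key)
    if (cached != null) {
      cached
    } else {
      val resolved: Option[Method] =
        try {
          Some(Class.forName(className).getMethod(methodName, classOf[String]))
        } catch {
          // Same fallback conditions as the quoted code: optional class or method is absent.
          case _: ClassNotFoundException | _: NoSuchMethodException => None
        }
      cache.putIfAbsent(key, resolved)
      cache.get(key)
    }
  }

  // Invokes a static single-String-argument method if it exists; None otherwise.
  // Exceptions thrown by the target itself are not caught here.
  def invokeStatic(className: String, methodName: String, arg: String): Option[AnyRef] =
    resolve(className, methodName).map(_.invoke(null, arg))
}
```

A caller in the common source set would then depend only on this utility, for example `ReflectiveBridgeSketch.invokeStatic("java.lang.Integer", "parseInt", "42")`, and treat `None` as "the optional module is not on the classpath".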

@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 74293de to 64e1a4b Compare June 4, 2026 21:42
@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 64e1a4b to 9bdc10c Compare June 4, 2026 22:01
@malinjawi malinjawi requested a review from zhztheplayer June 5, 2026 11:20