[VL][Delta] Add JVM Delta DV scan handoff#12198

Closed
malinjawi wants to merge 0 commits into apache:main from malinjawi:split/delta-dv-java-scan-handoff-pr-clean
Conversation

@malinjawi

@malinjawi malinjawi commented May 31, 2026

Contributor

What changes are proposed in this pull request?

This PR is the next split in the Delta deletion-vector scan stack after #12197, which has now landed.

It adds the JVM/Substrait/Velox handoff that consumes the essential Delta DV scan info extracted by #12197, materializes serialized DV payloads on the JVM side, and passes them to native scan execution.

Main changes:

  • add a Delta DV preprocessing rule for the Velox Delta component without replacing Delta's PrepareDeltaScan
  • reuse DeltaDeletionVectorScanInfo from [VL][Delta] Add DV scan info extraction utility #12197 to extract per-file DV metadata and serialized DV bytes from Delta-prepared scan files
  • add Delta local files Substrait nodes/builders carrying DeltaReadOptions
  • embed the serialized DV payload in DeltaReadOptions, instead of passing essential DV data through generic metadata columns
  • add a native DeltaSplitInfo path for Delta-specific split metadata
  • wire the handoff through VeloxIteratorApi, VeloxPlanConverter, WholeStageResultIterator, and SubstraitToVeloxPlan
  • strip Spark's synthetic DV predicate/internal columns only after the native split has the payload, so Velox applies the DV filter natively and avoids double filtering
  • add Spark 3.5 and Spark 4.0 focused handoff coverage
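
To make the "Velox applies the DV filter natively" item concrete, here is a minimal sketch of what a row-index deletion-vector filter does to the rows of one file. It is not Gluten or Velox code: `DvFilterSketch`, `applyDv`, and the two filter-type objects are hypothetical names, the deletion vector is modelled as a plain `Set[Long]` rather than a serialized bitmap, and the two filter modes are an assumption about what `rowIndexFilterType` distinguishes.

```scala
// Hypothetical stand-ins for illustration only; not Gluten's, Delta's, or Velox's types.
object DvFilterSketch {
  sealed trait RowIndexFilterType
  // Assumed mode: drop rows whose index appears in the deletion vector.
  case object IfContained extends RowIndexFilterType
  // Assumed mode: drop rows whose index does NOT appear in the deletion vector.
  case object IfNotContained extends RowIndexFilterType

  // True when the row at `idx` survives the filter.
  private def keep(idx: Long, marked: Set[Long], filterType: RowIndexFilterType): Boolean =
    filterType match {
      case IfContained => !marked.contains(idx)
      case IfNotContained => marked.contains(idx)
    }

  /** Returns the rows of one file that survive the deletion-vector filter. */
  def applyDv[A](rows: Seq[A], marked: Set[Long], filterType: RowIndexFilterType): Seq[A] =
    rows.zipWithIndex.collect {
      case (row, idx) if keep(idx.toLong, marked, filterType) => row
    }
}
```

Because the filter is keyed on the row's position within the file, it needs only the file's own payload, which is why the payload can travel with the split instead of with the scan output.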

This PR is intentionally handoff-only:

  • [VL][Delta] Add DV scan info extraction utility #12197 reviewed and landed the DV scan info extraction utility
  • performance/benchmark iteration remains a follow-up after the correctness handoff shape is reviewed
  • DELETE/UPDATE/MERGE DV DML support remains in later split work

Issue: #11901

How was this patch tested?

Validation after rebasing on current upstream/main with #12197 included:

  • git diff --check upstream/main..HEAD
  • container ./dev/format-scala-code.sh check
  • container clang-format --dry-run -Werror on touched C++ files
  • container ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
  • container ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

@github-actions github-actions Bot added CORE works for Gluten Core VELOX DATA_LAKE labels May 31, 2026
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

github-actions Bot commented Jun 2, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch 2 times, most recently from bdec52e to 55365da Compare June 4, 2026 10:28
@malinjawi malinjawi marked this pull request as ready for review June 4, 2026 10:29
@malinjawi

Contributor Author

Hey @zhztheplayer @zhouyuan, this PR is ready for review. This should close the DV scan migration from the initial POC work.

@github-actions

github-actions Bot commented Jun 4, 2026

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer left a comment

Member

Thanks. I have some initial broad questions about the use of reflections.

Comment on lines +62 to +66
Class
  .forName(deltaDvPreprocessRuleClassName)
  .getConstructor(classOf[SparkSession])
  .newInstance(spark)
  .asInstanceOf[Rule[LogicalPlan]]

Member

Is there any particular reason for using reflection here? Given it's against Gluten's own class?

Contributor Author

Thanks @zhztheplayer, you are right: since this rule is Gluten-owned, reflection was not needed here.

I updated this path to construct PreprocessDeltaScanWithDeletionVectors directly. To keep the direct construction compiling across the existing Spark/Delta profile matrix, I also added no-op compatibility definitions for Delta 2.3/2.4. The actual DV preprocessing behavior remains in the Delta 3.3/4.x profiles.
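
A minimal sketch of the "direct construction plus no-op compatibility definition" shape described above, assuming nothing about the real classes beyond what this comment says. `Plan`, `PlanRule`, `Session`, and both rule classes are hypothetical stand-ins for Spark's `LogicalPlan`, `Rule[LogicalPlan]`, `SparkSession`, and the profile-specific `PreprocessDeltaScanWithDeletionVectors` definitions.

```scala
// Hypothetical stand-ins; none of these are Spark, Delta, or Gluten types.
object DirectRuleSketch {
  final case class Plan(description: String)
  trait PlanRule { def apply(plan: Plan): Plan }
  final class Session

  // Shape of the definition in a profile that has real DV preprocessing.
  final class PreprocessWithDv(session: Session) extends PlanRule {
    def apply(plan: Plan): Plan = Plan(plan.description + " [dv-preprocessed]")
  }

  // Shape of the no-op compatibility definition for a profile without DV support.
  // In a real build both definitions would share one class name and live in
  // different profile-specific source directories; they have distinct names here
  // only so the sketch compiles as a single file.
  final class PreprocessNoOp(session: Session) extends PlanRule {
    def apply(plan: Plan): Plan = plan
  }

  // Direct construction: the compiler checks the constructor signature,
  // so there is no Class.forName call and no cast.
  def dvRule(session: Session): PlanRule = new PreprocessWithDv(session)
  def compatRule(session: Session): PlanRule = new PreprocessNoOp(session)
}
```

The cost of this approach is that every profile must supply a definition with the same constructor, which is what the no-op variants are for.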

Comment on lines +217 to +285
private def normalizeDeltaSplitMetadata(
    partitionColumnCount: Int,
    partitionFiles: Seq[PartitionedFile]): Option[NormalizedDeltaSplitMetadata] = {
  try {
    // scalastyle:off classforname
    val moduleClass = Class.forName(deltaScanInfoClassName)
    // scalastyle:on classforname
    val module = moduleClass.getField("MODULE$").get(null)
    val extractAllMethod = moduleClass.getMethod(
      "extractAllFromJava",
      classOf[SparkSession],
      classOf[Int],
      classOf[java.util.List[_]])
    val scanInfos = extractAllMethod
      .invoke(module, activeSparkSession, Int.box(partitionColumnCount), partitionFiles.asJava)
      .asInstanceOf[java.util.List[_]]
      .asScala
      .toSeq
    val splitMetadata = scanInfos.map(toDeltaSplitMetadata)
    if (splitMetadata.exists(_._2.hasDeletionVector())) {
      Some((splitMetadata.map(_._1), splitMetadata.map(_._2)))
    } else {
      None
    }
  } catch {
    case _: ClassNotFoundException | _: NoSuchMethodException =>
      None
  }
}

private def toDeltaSplitMetadata(
    scanInfo: Any): (java.util.Map[String, Object], DeltaFileReadOptions) = {
  val metadata = scanInfo
    .getClass
    .getMethod("normalizedOtherMetadataColumns")
    .invoke(scanInfo)
    .asInstanceOf[scala.collection.Map[String, Object]]
    .asJava
  val deletionVectorInfo = scanInfo.getClass.getMethod("deletionVectorInfo").invoke(scanInfo)
  val rowIndexFilterType = deletionVectorInfo
    .getClass
    .getMethod("rowIndexFilterType")
    .invoke(deletionVectorInfo)
    .toString
  val hasDeletionVector = deletionVectorInfo
    .getClass
    .getMethod("hasDeletionVector")
    .invoke(deletionVectorInfo)
    .asInstanceOf[Boolean]
  val cardinality = deletionVectorInfo
    .getClass
    .getMethod("cardinality")
    .invoke(deletionVectorInfo)
    .asInstanceOf[JLong]
    .longValue()
  val serializedDeletionVector = deletionVectorInfo
    .getClass
    .getMethod("serializedDeletionVector")
    .invoke(deletionVectorInfo)
    .asInstanceOf[Array[Byte]]

  (
    metadata,
    new DeltaFileReadOptions(
      toDeltaRowIndexFilterType(rowIndexFilterType),
      hasDeletionVector,
      cardinality,
      serializedDeletionVector))
}

Member

If reflection is inevitable, can we encapsulate the reflection code in a utility class?

Contributor Author

@zhztheplayer I moved this reflection out of VeloxIteratorApi into DeltaSplitMetadataExtractor.

I think some reflection is still needed here, because VeloxIteratorApi is in the common Velox source set while DeltaDeletionVectorScanInfo is only compiled in Delta-specific source profiles. The extractor now isolates that bridge and caches the class/method lookups, so the iterator path only calls the utility and no longer carries the reflection block inline.

Let me know if you know of a workaround.
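
As a rough illustration of "isolate the bridge and cache the class/method lookups", the sketch below resolves a Scala singleton and one of its methods once and reuses them on every call. It is not the actual DeltaSplitMetadataExtractor: `ReflectionBridgeSketch`, `Bridge`, and `OptionalComponent` are hypothetical, and the target is a toy object defined in the same file so the example is self-contained.

```scala
import java.lang.reflect.Method

// Hypothetical sketch; not the real extractor or its API.
object ReflectionBridgeSketch {
  // Stands in for a class that exists only in some build profiles.
  object OptionalComponent {
    def describe(n: Int): String = s"files=$n"
  }

  /** Looks up a Scala singleton and one method reflectively, at most once. */
  final class Bridge(moduleClassName: String, methodName: String) {
    // lazy val: resolved on first use, then cached (Scala initializes it thread-safely).
    private lazy val handle: Option[(AnyRef, Method)] =
      try {
        val cls = Class.forName(moduleClassName)
        val module = cls.getField("MODULE$").get(null) // the singleton instance
        Some((module, cls.getMethod(methodName, classOf[Int])))
      } catch {
        // An absent class, field, or method means the optional component is unavailable.
        case _: ClassNotFoundException | _: NoSuchFieldException | _: NoSuchMethodException =>
          None
      }

    /** Returns None when the optional component is not on the classpath. */
    def call(n: Int): Option[String] =
      handle.map { case (module, method) =>
        method.invoke(module, Int.box(n)).asInstanceOf[String]
      }
  }
}
```

A caller would build one `Bridge` per target and keep it around, for example `new Bridge(OptionalComponent.getClass.getName, "describe")`; the class name is taken from the object here only to keep the sketch independent of packaging, whereas a real bridge would use a fixed string precisely because the class may not be compiled in.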

@github-actions

github-actions Bot commented Jun 4, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 74293de to 64e1a4b Compare June 4, 2026 21:42
@github-actions

github-actions Bot commented Jun 4, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 64e1a4b to 9bdc10c Compare June 4, 2026 22:01
@github-actions

github-actions Bot commented Jun 4, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi requested a review from zhztheplayer June 5, 2026 11:20
Comment on lines +23 to +26
/**
* Delta 4 prepares DV scan metadata before Gluten offload, so this compatibility rule does not need
* to add backend-visible DV columns.
*/

Member

@malinjawi does this mean we need to add new columns to table scan for Delta 3.3? if yes, can you help me understand the design here?

Contributor Author

> @malinjawi does this mean we need to add new columns to table scan for Delta 3.3? if yes, can you help me understand the design here?

@zhztheplayer

No, we do NOT add new columns to the table scan output for Delta 3.3. The key point is that we pass DV data as file-level metadata, not as scan output columns.
The design works differently:

What Delta 3.3's PreprocessTableWithDVs does:

  • Prepares Delta's internal scan metadata structures
  • Makes DV information accessible via Delta's file APIs
  • Does NOT inject __delta_internal_* columns into the scan output

What our code does:

  • Extracts DV metadata from Delta's prepared file structures
  • Passes it to Velox as per-file metadata through DeltaLocalFilesNode → Substrait protobuf
  • Velox receives: serializedDeletionVector, cardinality, rowIndexFilterType for each file
  • No columns added to scan output - tests verify the plan contains no __delta_internal_* columns

Delta 4.0 difference:

  • Delta 4.0 prepares this metadata earlier in its own pipeline
  • We just extract and forward it the same way
  • The no-op rule exists because Delta 4.0 doesn't need the explicit preprocessTablesWithDVs() call
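
The "file-level metadata, not scan output columns" idea can be sketched as follows. The field names mirror the ones listed above (`serializedDeletionVector`, `cardinality`, `rowIndexFilterType`) and the all-or-nothing `Option` mirrors the quoted `normalizeDeltaSplitMetadata` excerpt, but `SplitMetadataSketch`, `FileDvOptions`, `ScanFile`, and `dvOptionsIfNeeded` are hypothetical names, and the `"IF_CONTAINED"` string is an assumed example value.

```scala
// Hypothetical model of per-file DV metadata; not Gluten's DeltaFileReadOptions.
object SplitMetadataSketch {
  final case class FileDvOptions(
      rowIndexFilterType: String,
      hasDeletionVector: Boolean,
      cardinality: Long,
      serializedDeletionVector: Array[Byte])

  // One scanned data file together with its DV side-channel data.
  final case class ScanFile(path: String, dv: FileDvOptions)

  /**
   * Returns per-file DV options only when at least one file in the split
   * carries a deletion vector; otherwise the regular scan path can be used.
   * The scan's output schema is never touched: the options sit next to the files.
   */
  def dvOptionsIfNeeded(files: Seq[ScanFile]): Option[Seq[FileDvOptions]] = {
    val options = files.map(_.dv)
    if (options.exists(_.hasDeletionVector)) Some(options) else None
  }
}
```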

Member

@malinjawi thanks for the explanation. I may be misunderstanding, but PreprocessTableWithDVs is shadowing a Delta class, and we should be extremely careful when doing so. What's the main difference between this class and the one from vanilla Delta?

Contributor Author

@zhztheplayer Good catch on the shadowing concern.

The shadowing PreprocessTableWithDVs is necessary because:

  1. Mixed format handling: Delta's optimizer creates scans with DeltaParquetFileFormat, but Gluten's write path produces tables with GlutenDeltaParquetFileFormat. The DV preprocessing rule must handle BOTH formats.

  2. Package-private API access: We need direct access to Delta's package-private APIs (TahoeFileIndex, deletionVectorsReadable(), DeltaSQLConf, etc.) without reflection overhead. Being in the org.apache.spark.sql.delta package provides this.

We could eliminate shadowing by making GlutenDeltaParquetFileFormat extend DeltaParquetFileFormat instead of GlutenParquetFileFormat, then override only write-specific methods. This would let Delta's original PreprocessTableWithDVs handle both formats naturally.

But the trade-off is that this breaks the clean format hierarchy separation:

  • Current: GlutenParquetFileFormat → GlutenDeltaParquetFileFormat (Gluten hierarchy)
  • Alternative: DeltaParquetFileFormat → GlutenDeltaParquetFileFormat (mixed hierarchy)

Right now I would say the code maintains format hierarchy separation, avoids reflection for package-private API access, and handles both formats in DV preprocessing in a type-safe way. But this requires shadowing Delta's trait, which is a maintenance burden if Delta changes its internals.
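
To illustrate why separate hierarchies force the rule to know about both formats, here is a small sketch. The four classes are hypothetical stand-ins named after the formats discussed above; their real superclasses and members are not modelled, and `supportsDvPreprocessing` is an invented helper.

```scala
// Hypothetical stand-ins for the format classes; real inheritance details are assumptions.
object FormatHierarchySketch {
  class ParquetFormat
  class DeltaParquetFormat extends ParquetFormat             // stands in for Delta's format
  class GlutenParquetFormat extends ParquetFormat            // stands in for Gluten's base format
  class GlutenDeltaParquetFormat extends GlutenParquetFormat // current (separate) hierarchy

  // With separate hierarchies, a DV preprocessing rule has to match both
  // Delta-capable formats explicitly. If GlutenDeltaParquetFormat extended
  // DeltaParquetFormat instead, the first pattern alone would cover it.
  def supportsDvPreprocessing(format: ParquetFormat): Boolean = format match {
    case _: DeltaParquetFormat | _: GlutenDeltaParquetFormat => true
    case _ => false
  }
}
```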

Would you prefer we explore the alternative approach, or is the current shadowing acceptable given the current state?

@malinjawi malinjawi requested a review from zhztheplayer June 9, 2026 07:48
@github-actions

github-actions Bot commented Jun 9, 2026

Run Gluten Clickhouse CI on x86

@malinjawi malinjawi closed this Jun 9, 2026
@malinjawi malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 29dd1fa to bdaad2a Compare June 9, 2026 13:12