malinjawi · 2026-05-31T09:22:55Z

What changes are proposed in this pull request?

This PR is the next split in the Delta deletion-vector scan stack after #12197, which has now landed.

It adds the JVM/Substrait/Velox handoff that consumes the essential Delta DV scan info extracted by #12197, materializes serialized DV payloads on the JVM side, and passes them to native scan execution.

Main changes:

add a Delta DV preprocessing rule for the Velox Delta component without replacing Delta's PrepareDeltaScan
reuse DeltaDeletionVectorScanInfo from #12197 to extract per-file DV metadata and serialized DV bytes from Delta-prepared scan files
add Delta local files Substrait nodes/builders carrying DeltaReadOptions
embed the serialized DV payload in DeltaReadOptions, instead of passing essential DV data through generic metadata columns
add a native DeltaSplitInfo path for Delta-specific split metadata
wire the handoff through VeloxIteratorApi, VeloxPlanConverter, WholeStageResultIterator, and SubstraitToVeloxPlan
strip Spark's synthetic DV predicate/internal columns only after the native split has the payload, so Velox applies the DV filter natively and avoids double filtering
add Spark 3.5 and Spark 4.0 focused handoff coverage

This PR is intentionally handoff-only:

the DV scan info extraction utility was reviewed and landed in #12197
performance/benchmark iteration remains a follow-up after the correctness handoff shape is reviewed
DELETE/UPDATE/MERGE DV DML support remains in later split work

Issue: #11901

How was this patch tested?

Validation after rebasing on current upstream/main with #12197 included:

git diff --check upstream/main..HEAD
container ./dev/format-scala-code.sh check
container clang-format --dry-run -Werror on touched C++ files
container ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
container ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests

Was this patch authored or co-authored using generative AI tooling?

Generated-by: IBM BOB

github-actions · 2026-05-31T09:23:23Z

Run Gluten Clickhouse CI on x86

malinjawi · 2026-06-04T10:58:31Z

Hey @zhztheplayer @zhouyuan, this PR is ready for review. This should close the DV scan migration from the initial POC work.

zhztheplayer

Thanks. I have some initial broad questions about the use of reflection.

zhztheplayer · 2026-06-04T15:23:07Z

+    Class
+      .forName(deltaDvPreprocessRuleClassName)
+      .getConstructor(classOf[SparkSession])
+      .newInstance(spark)
+      .asInstanceOf[Rule[LogicalPlan]]


Is there any particular reason for using reflection here, given that it targets Gluten's own class?

Thanks @zhztheplayer, you are right: since this rule is Gluten-owned, reflection was not needed here.

I updated this path to construct PreprocessDeltaScanWithDeletionVectors directly. To keep the direct construction compiling across the existing Spark/Delta profile matrix, I also added no-op compatibility definitions for Delta 2.3/2.4. The actual DV preprocessing behavior remains in the Delta 3.3/4.x profiles.

zhztheplayer · 2026-06-04T15:25:36Z

+  private def normalizeDeltaSplitMetadata(
+      partitionColumnCount: Int,
+      partitionFiles: Seq[PartitionedFile]): Option[NormalizedDeltaSplitMetadata] = {
+    try {
+      // scalastyle:off classforname
+      val moduleClass = Class.forName(deltaScanInfoClassName)
+      // scalastyle:on classforname
+      val module = moduleClass.getField("MODULE$").get(null)
+      val extractAllMethod = moduleClass.getMethod(
+        "extractAllFromJava",
+        classOf[SparkSession],
+        classOf[Int],
+        classOf[java.util.List[_]])
+      val scanInfos = extractAllMethod
+        .invoke(module, activeSparkSession, Int.box(partitionColumnCount), partitionFiles.asJava)
+        .asInstanceOf[java.util.List[_]]
+        .asScala
+        .toSeq
+      val splitMetadata = scanInfos.map(toDeltaSplitMetadata)
+      if (splitMetadata.exists(_._2.hasDeletionVector())) {
+        Some((splitMetadata.map(_._1), splitMetadata.map(_._2)))
+      } else {
+        None
+      }
+    } catch {
+      case _: ClassNotFoundException | _: NoSuchMethodException =>
+        None
+    }
+  }
+
+  private def toDeltaSplitMetadata(
+      scanInfo: Any): (java.util.Map[String, Object], DeltaFileReadOptions) = {
+    val metadata = scanInfo
+      .getClass
+      .getMethod("normalizedOtherMetadataColumns")
+      .invoke(scanInfo)
+      .asInstanceOf[scala.collection.Map[String, Object]]
+      .asJava
+    val deletionVectorInfo = scanInfo.getClass.getMethod("deletionVectorInfo").invoke(scanInfo)
+    val rowIndexFilterType = deletionVectorInfo
+      .getClass
+      .getMethod("rowIndexFilterType")
+      .invoke(deletionVectorInfo)
+      .toString
+    val hasDeletionVector = deletionVectorInfo
+      .getClass
+      .getMethod("hasDeletionVector")
+      .invoke(deletionVectorInfo)
+      .asInstanceOf[Boolean]
+    val cardinality = deletionVectorInfo
+      .getClass
+      .getMethod("cardinality")
+      .invoke(deletionVectorInfo)
+      .asInstanceOf[JLong]
+      .longValue()
+    val serializedDeletionVector = deletionVectorInfo
+      .getClass
+      .getMethod("serializedDeletionVector")
+      .invoke(deletionVectorInfo)
+      .asInstanceOf[Array[Byte]]
+
+    (
+      metadata,
+      new DeltaFileReadOptions(
+        toDeltaRowIndexFilterType(rowIndexFilterType),
+        hasDeletionVector,
+        cardinality,
+        serializedDeletionVector))
+  }


If reflection is inevitable, can we encapsulate the reflection code in a utility class?

@zhztheplayer I moved this reflection out of VeloxIteratorApi into DeltaSplitMetadataExtractor.

I think some reflection is still needed here because VeloxIteratorApi is in the common Velox source set, while DeltaDeletionVectorScanInfo is only compiled in Delta-specific source profiles. The extractor now isolates that bridge and caches the class/method lookups, so the iterator path only calls the utility and does not carry the reflection block inline.

Let me know if you know of a workaround.
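
To illustrate the kind of isolation described above, here is a minimal sketch of a reflection bridge that resolves a Scala object's method once and reuses the lookup. It is not the PR's DeltaSplitMetadataExtractor: the class name ReflectiveObjectBridge and its methods are hypothetical, and the demo targets the standard library's `scala.math` package object purely so the example runs without Delta on the classpath.

```scala
import java.lang.reflect.Method

// Hypothetical helper (not part of Gluten): looks up a method on a Scala
// singleton object by name, caches the lookup, and degrades to None when the
// optional class is absent from the classpath.
final class ReflectiveObjectBridge(
    objectClassName: String,
    methodName: String,
    paramTypes: Class[_]*) {

  // Resolved at most once, on first use.
  private lazy val target: Option[(AnyRef, Method)] =
    try {
      val cls = Class.forName(objectClassName)
      // A Scala `object` exposes its singleton instance via the static MODULE$ field.
      val module = cls.getField("MODULE$").get(null)
      Some((module, cls.getMethod(methodName, paramTypes: _*)))
    } catch {
      case _: ClassNotFoundException | _: NoSuchFieldException | _: NoSuchMethodException =>
        None
    }

  def isAvailable: Boolean = target.isDefined

  // Invokes the cached method; returns None if the target could not be resolved.
  def call(args: AnyRef*): Option[AnyRef] =
    target.map { case (module, method) => method.invoke(module, args: _*) }
}

object BridgeDemo {
  // Stand-in target: scala.math.max(Int, Int) from the standard library.
  val maxBridge =
    new ReflectiveObjectBridge("scala.math.package$", "max", classOf[Int], classOf[Int])

  // A class name assumed NOT to exist, mimicking a profile without Delta sources.
  val missingBridge =
    new ReflectiveObjectBridge("org.example.NotOnClasspath$", "extractAll")

  val maxResult: Option[AnyRef] = maxBridge.call(Int.box(3), Int.box(7))
  val missingResult: Option[AnyRef] = missingBridge.call()
}
```

With this shape, the caller only sees an Option, so a common-source-set class does not need to reference the profile-specific types directly.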

zhztheplayer · 2026-06-08T10:38:58Z

+/**
+ * Delta 4 prepares DV scan metadata before Gluten offload, so this compatibility rule does not need
+ * to add backend-visible DV columns.
+ */


@malinjawi does this mean we need to add new columns to the table scan for Delta 3.3? If yes, can you help me understand the design here?

@zhztheplayer

No, we do NOT add new columns to the table scan output for Delta 3.3. The key point is that we pass DV data as file-level metadata, not as scan output columns.
The design works differently:

What Delta 3.3's PreprocessTableWithDVs does:

Prepares Delta's internal scan metadata structures

Makes DV information accessible via Delta's file APIs

Does NOT inject __delta_internal_* columns into the scan output

What our code does:

Extracts DV metadata from Delta's prepared file structures

Passes it to Velox as per-file metadata through DeltaLocalFilesNode → Substrait protobuf

Velox receives: serializedDeletionVector, cardinality, rowIndexFilterType for each file

No columns added to scan output - tests verify the plan contains no __delta_internal_* columns

Delta 4.0 difference:

Delta 4.0 prepares this metadata earlier in its own pipeline

We just extract and forward it the same way

The no-op rule exists because Delta 4.0 doesn't need the explicit preprocessTablesWithDVs() call
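
The distinction between file-level metadata and scan output columns can be sketched as follows. This is a simplified model under stated assumptions, not the PR's code: FileDvOptions, ScanFile, and DvHandoffSketch are hypothetical stand-ins whose fields mirror the values named above (serializedDeletionVector, cardinality, rowIndexFilterType), and the column names and filter type strings in the demo are illustrative only.

```scala
// Hypothetical, simplified stand-in for the per-file DV payload.
final case class FileDvOptions(
    rowIndexFilterType: String,
    hasDeletionVector: Boolean,
    cardinality: Long,
    serializedDeletionVector: Array[Byte])

// Hypothetical stand-in for one scanned data file plus its DV payload.
final case class ScanFile(path: String, dv: FileDvOptions)

object DvHandoffSketch {
  val InternalPrefix = "__delta_internal_"

  // The scan output keeps only user-visible columns; DV data is never a column.
  def visibleColumns(columns: Seq[String]): Seq[String] =
    columns.filterNot(_.startsWith(InternalPrefix))

  // Per-file options are produced only when at least one file carries a DV,
  // mirroring the exists(...hasDeletionVector) check in the quoted diff.
  def splitOptions(files: Seq[ScanFile]): Option[Seq[FileDvOptions]] =
    if (files.exists(_.dv.hasDeletionVector)) Some(files.map(_.dv)) else None
}

object DvHandoffDemo {
  private val noDv = FileDvOptions("KEEP_ALL", false, 0L, Array.emptyByteArray)
  private val withDv = FileDvOptions("IF_CONTAINED", true, 2L, Array[Byte](1, 2, 3))

  val columns: Seq[String] =
    DvHandoffSketch.visibleColumns(Seq("id", "__delta_internal_is_row_deleted", "value"))

  val cleanTable: Option[Seq[FileDvOptions]] =
    DvHandoffSketch.splitOptions(Seq(ScanFile("a.parquet", noDv)))

  val dvTable: Option[Seq[FileDvOptions]] =
    DvHandoffSketch.splitOptions(Seq(ScanFile("a.parquet", noDv), ScanFile("b.parquet", withDv)))
}
```

In this model the DV bytes travel next to each file entry, while the projected column list stays free of internal columns.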

@malinjawi thanks for the explanation. I may be misunderstanding, but PreprocessTableWithDVs is shadowing a Delta class, and we should be extremely careful when doing so. What's the main difference between this class and the one from vanilla Delta?

@zhztheplayer Good catch on the shadowing concern.

The shadowing PreprocessTableWithDVs is necessary because:

Mixed format handling: Delta's optimizer creates scans with DeltaParquetFileFormat, but Gluten's write path produces tables with GlutenDeltaParquetFileFormat. The DV preprocessing rule must handle BOTH formats.

Package-private API access: We need direct access to Delta's package-private APIs (TahoeFileIndex, deletionVectorsReadable(), DeltaSQLConf, etc.) without reflection overhead. Being in org.apache.spark.sql.delta package provides this.

We could eliminate shadowing by making GlutenDeltaParquetFileFormat extend DeltaParquetFileFormat instead of GlutenParquetFileFormat, then override only write-specific methods. This would let Delta's original PreprocessTableWithDVs handle both formats naturally.

The tradeoff is that this breaks the clean format hierarchy separation:

Current: GlutenParquetFileFormat → GlutenDeltaParquetFileFormat (Gluten hierarchy)

Alternative: DeltaParquetFileFormat → GlutenDeltaParquetFileFormat (mixed hierarchy)

Right now, I would say the code maintains format hierarchy separation, avoids reflection for package-private API access, and handles both formats in a type-safe way in DV preprocessing. But it requires shadowing Delta's trait, which is a maintenance burden if Delta changes its internals.

Would you prefer we explore the alternative approach, or is the current shadowing acceptable given the current state?
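
The two hierarchies can be compared with a toy model. The classes below reuse the names from this discussion but are empty stand-ins defined only for illustration; the real Delta and Gluten classes are far richer, and handledByDvRule is a hypothetical helper, not an existing API.

```scala
// Current shape: the Gluten Delta format does not descend from Delta's format,
// so a DV rule has to name both types explicitly.
object CurrentHierarchy {
  class DeltaParquetFileFormat
  class GlutenParquetFileFormat
  class GlutenDeltaParquetFileFormat extends GlutenParquetFileFormat

  def handledByDvRule(format: AnyRef): Boolean = format match {
    case _: DeltaParquetFileFormat | _: GlutenDeltaParquetFileFormat => true
    case _ => false
  }
}

// Alternative shape: the Gluten Delta format extends Delta's format, so a rule
// written against the Delta type alone also matches the Gluten subclass.
object AlternativeHierarchy {
  class DeltaParquetFileFormat
  class GlutenDeltaParquetFileFormat extends DeltaParquetFileFormat

  def handledByDvRule(format: AnyRef): Boolean = format match {
    case _: DeltaParquetFileFormat => true
    case _ => false
  }
}

object HierarchyDemo {
  val currentDelta: Boolean =
    CurrentHierarchy.handledByDvRule(new CurrentHierarchy.DeltaParquetFileFormat)
  val currentGluten: Boolean =
    CurrentHierarchy.handledByDvRule(new CurrentHierarchy.GlutenDeltaParquetFileFormat)
  val currentPlainGluten: Boolean =
    CurrentHierarchy.handledByDvRule(new CurrentHierarchy.GlutenParquetFileFormat)
  val alternativeGluten: Boolean =
    AlternativeHierarchy.handledByDvRule(new AlternativeHierarchy.GlutenDeltaParquetFileFormat)
}
```

The first shape keeps the two hierarchies separate at the cost of a rule that knows about both; the second removes that need but ties the Gluten class to Delta's base type.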

github-actions Bot added CORE works for Gluten Core VELOX DATA_LAKE labels May 31, 2026

This was referenced Jun 1, 2026

[VL][Delta] Add JVM Delta DV scan handoff #12131

Open

[VL][Delta] Guard DV DML row-index scans #12215

Draft

[VL][Delta] Add persistent DV DELETE correctness path #12216

Draft

[VL][Delta] Add DELETE DV diagnostics benchmark #12217

Draft

malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from c1fe399 to ebb6e2d Compare June 2, 2026 19:23

malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch 2 times, most recently from bdec52e to 55365da Compare June 4, 2026 10:28

malinjawi marked this pull request as ready for review June 4, 2026 10:29

zhztheplayer reviewed Jun 4, 2026

malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 74293de to 64e1a4b Compare June 4, 2026 21:42

malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 64e1a4b to 9bdc10c Compare June 4, 2026 22:01

malinjawi requested a review from zhztheplayer June 5, 2026 11:20

zhztheplayer reviewed Jun 8, 2026

malinjawi requested a review from zhztheplayer June 9, 2026 07:48

malinjawi closed this Jun 9, 2026

malinjawi force-pushed the split/delta-dv-java-scan-handoff-pr-clean branch from 29dd1fa to bdaad2a Compare June 9, 2026 13:12

malinjawi mentioned this pull request Jun 9, 2026

[VL][Delta] Add JVM Delta DV scan handoff #12269

Open
