[VL][Delta] Add JVM Delta DV scan handoff#12198
Hey @zhztheplayer @zhouyuan, this PR is ready for review. It should close out the DV scan migration from the initial POC work.
zhztheplayer
left a comment
Thanks. I have some initial broad questions about the use of reflection.
    Class
      .forName(deltaDvPreprocessRuleClassName)
      .getConstructor(classOf[SparkSession])
      .newInstance(spark)
      .asInstanceOf[Rule[LogicalPlan]]
Is there a particular reason for using reflection here, given that the target is Gluten's own class?
Thanks @zhztheplayer, you are right: since this rule is Gluten-owned, reflection was not needed here.
I updated this path to construct PreprocessDeltaScanWithDeletionVectors directly. To keep the direct construction compiling across the existing Spark/Delta profile matrix, I also added no-op compatibility definitions for Delta 2.3/2.4. The actual DV preprocessing behavior remains in the Delta 3.3/4.x profiles.
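A minimal sketch of the no-op compatibility idea, using only the Scala standard library. Every name here (`PlanRule`, `NoOpPreprocessRule`, `TaggingPreprocessRule`) is a hypothetical stand-in, not Gluten's code: the real rule extends Spark's `Rule[LogicalPlan]`, and in the actual build the no-op and the real variant would presumably share one class name and live in different profile source sets, so the call site can construct the rule directly without `Class.forName`.

```scala
// Hypothetical stand-in for Spark's Rule[LogicalPlan]; P plays the role of the plan type.
trait PlanRule[P] {
  def apply(plan: P): P
}

// Variant for profiles without DV preprocessing: returns the plan unchanged.
final class NoOpPreprocessRule[P] extends PlanRule[P] {
  override def apply(plan: P): P = plan
}

// Variant for profiles with DV preprocessing. The "work" here is a placeholder
// that tags a string plan so the difference is observable.
final class TaggingPreprocessRule extends PlanRule[String] {
  override def apply(plan: String): String = s"preprocessed($plan)"
}

object RuleWiringSketch {
  // Direct construction: the compiler checks that the class and constructor exist,
  // which reflection would only discover at runtime.
  def buildRule(dvProfile: Boolean): PlanRule[String] =
    if (dvProfile) new TaggingPreprocessRule else new NoOpPreprocessRule[String]
}
```

With a single source file the choice has to be a runtime flag; with per-profile source sets the same selection happens at compile time, which is what lets the call site stay reflection-free.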
    private def normalizeDeltaSplitMetadata(
        partitionColumnCount: Int,
        partitionFiles: Seq[PartitionedFile]): Option[NormalizedDeltaSplitMetadata] = {
      try {
        // scalastyle:off classforname
        val moduleClass = Class.forName(deltaScanInfoClassName)
        // scalastyle:on classforname
        val module = moduleClass.getField("MODULE$").get(null)
        val extractAllMethod = moduleClass.getMethod(
          "extractAllFromJava",
          classOf[SparkSession],
          classOf[Int],
          classOf[java.util.List[_]])
        val scanInfos = extractAllMethod
          .invoke(module, activeSparkSession, Int.box(partitionColumnCount), partitionFiles.asJava)
          .asInstanceOf[java.util.List[_]]
          .asScala
          .toSeq
        val splitMetadata = scanInfos.map(toDeltaSplitMetadata)
        if (splitMetadata.exists(_._2.hasDeletionVector())) {
          Some((splitMetadata.map(_._1), splitMetadata.map(_._2)))
        } else {
          None
        }
      } catch {
        case _: ClassNotFoundException | _: NoSuchMethodException =>
          None
      }
    }

    private def toDeltaSplitMetadata(
        scanInfo: Any): (java.util.Map[String, Object], DeltaFileReadOptions) = {
      val metadata = scanInfo
        .getClass
        .getMethod("normalizedOtherMetadataColumns")
        .invoke(scanInfo)
        .asInstanceOf[scala.collection.Map[String, Object]]
        .asJava
      val deletionVectorInfo = scanInfo.getClass.getMethod("deletionVectorInfo").invoke(scanInfo)
      val rowIndexFilterType = deletionVectorInfo
        .getClass
        .getMethod("rowIndexFilterType")
        .invoke(deletionVectorInfo)
        .toString
      val hasDeletionVector = deletionVectorInfo
        .getClass
        .getMethod("hasDeletionVector")
        .invoke(deletionVectorInfo)
        .asInstanceOf[Boolean]
      val cardinality = deletionVectorInfo
        .getClass
        .getMethod("cardinality")
        .invoke(deletionVectorInfo)
        .asInstanceOf[JLong]
        .longValue()
      val serializedDeletionVector = deletionVectorInfo
        .getClass
        .getMethod("serializedDeletionVector")
        .invoke(deletionVectorInfo)
        .asInstanceOf[Array[Byte]]

      (
        metadata,
        new DeltaFileReadOptions(
          toDeltaRowIndexFilterType(rowIndexFilterType),
          hasDeletionVector,
          cardinality,
          serializedDeletionVector))
    }
If reflection is inevitable, can we encapsulate the reflection code in a utility class?
@zhztheplayer I moved this reflection out of VeloxIteratorApi into DeltaSplitMetadataExtractor.
I think some reflection is still needed here because VeloxIteratorApi is in the common Velox source set, while DeltaDeletionVectorScanInfo is only compiled in Delta-specific source profiles. The extractor now isolates that bridge and caches the class/method lookups, so the iterator path only calls the utility and does not carry the reflection block inline.
Let me know if you know of a workaround.
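The "isolate the bridge and cache the lookups" pattern can be sketched roughly as follows. This is not the actual DeltaSplitMetadataExtractor; `CachedStaticMethod` is a hypothetical helper that resolves a static method by name once and reports absence as `None`, mirroring the `ClassNotFoundException | NoSuchMethodException` handling in the snippet under review. `java.lang.Integer.parseInt` stands in for the optional Delta-specific class.

```scala
import java.lang.reflect.Method

// Hypothetical helper: resolves a public static method by name at most once.
final class CachedStaticMethod(
    className: String,
    methodName: String,
    paramTypes: Class[_]*) {

  // lazy val caches the lookup result, including the "not on the classpath" case.
  private lazy val method: Option[Method] =
    try {
      Some(Class.forName(className).getMethod(methodName, paramTypes: _*))
    } catch {
      case _: ClassNotFoundException | _: NoSuchMethodException => None
    }

  def isAvailable: Boolean = method.isDefined

  // Invokes the static method (null receiver) when available; None otherwise.
  def invoke(args: AnyRef*): Option[AnyRef] =
    method.map(_.invoke(null, args: _*))
}

object ReflectionBridgeSketch {
  // Stand-in for an optional class: present on every JVM.
  val parseInt = new CachedStaticMethod("java.lang.Integer", "parseInt", classOf[String])

  // Stand-in for a class that is missing in the current build profile.
  val missing = new CachedStaticMethod("no.such.profile.OnlyClass", "extractAll")
}
```

Callers only see `Option` results, so the common source set compiles without the optional class while the per-call cost of `Class.forName` and `getMethod` is paid once.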
    /**
     * Delta 4 prepares DV scan metadata before Gluten offload, so this compatibility rule does not need
     * to add backend-visible DV columns.
     */
@malinjawi does this mean we need to add new columns to the table scan for Delta 3.3? If yes, can you help me understand the design here?
> @malinjawi does this mean we need to add new columns to the table scan for Delta 3.3? If yes, can you help me understand the design here?

No, we do NOT add new columns to the table scan output for Delta 3.3. The key point is that we pass DV data as file-level metadata, not as scan output columns.
The design works differently:
What Delta 3.3's PreprocessTableWithDVs does:
- Prepares Delta's internal scan metadata structures
- Makes DV information accessible via Delta's file APIs
- Does NOT inject `__delta_internal_*` columns into the scan output
What our code does:
- Extracts DV metadata from Delta's prepared file structures
- Passes it to Velox as per-file metadata through `DeltaLocalFilesNode` → Substrait protobuf
- Velox receives `serializedDeletionVector`, `cardinality`, and `rowIndexFilterType` for each file
- No columns are added to the scan output; tests verify the plan contains no `__delta_internal_*` columns
Delta 4.0 difference:
- Delta 4.0 prepares this metadata earlier in its own pipeline
- We just extract and forward it the same way
- The no-op rule exists because Delta 4.0 doesn't need the explicit `preprocessTablesWithDVs()` call
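The file-level handoff described above can be sketched with plain case classes. The types below (`DvReadOptions`, `FileSplit`, `DvHandoffSketch`) are hypothetical stand-ins for illustration; only the field names `rowIndexFilterType`, `hasDeletionVector`, `cardinality`, and `serializedDeletionVector` and the `__delta_internal_` prefix come from the thread. The filter type is kept as an opaque string here because its concrete values are not stated in this discussion.

```scala
// Hypothetical per-file DV payload; mirrors the fields named in the thread.
final case class DvReadOptions(
    rowIndexFilterType: String,
    hasDeletionVector: Boolean,
    cardinality: Long,
    serializedDeletionVector: Array[Byte])

// Hypothetical split: the DV payload travels next to the file path,
// not as an extra column in the scan output.
final case class FileSplit(path: String, dv: DvReadOptions)

object DvHandoffSketch {
  // Attach DV metadata only when at least one file actually has a deletion vector,
  // following the exists(...) check in the reviewed snippet.
  def toSplits(files: Seq[(String, DvReadOptions)]): Option[Seq[FileSplit]] =
    if (files.exists(_._2.hasDeletionVector)) {
      Some(files.map { case (path, dv) => FileSplit(path, dv) })
    } else {
      None
    }

  // The kind of check a plan test could make: no Delta-internal helper columns in the output.
  def hasNoInternalColumns(outputColumns: Seq[String]): Boolean =
    !outputColumns.exists(_.startsWith("__delta_internal_"))
}
```

Keeping the payload on the split means the scan schema seen by downstream operators is unchanged, which is why a schema-level assertion is enough to confirm that nothing was injected.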
@malinjawi thanks for the explanation. I may be misunderstanding, but PreprocessTableWithDVs is shadowing a Delta class, and we should be extremely careful when doing so. What is the main difference between this class and the one from vanilla Delta?
@zhztheplayer Good catch on the shadowing concern.
The shadowing PreprocessTableWithDVs is necessary because:
- Mixed format handling: Delta's optimizer creates scans with `DeltaParquetFileFormat`, but Gluten's write path produces tables with `GlutenDeltaParquetFileFormat`. The DV preprocessing rule must handle BOTH formats.
- Package-private API access: We need direct access to Delta's package-private APIs (`TahoeFileIndex`, `deletionVectorsReadable()`, `DeltaSQLConf`, etc.) without reflection overhead. Being in the `org.apache.spark.sql.delta` package provides this.
We could eliminate shadowing by making GlutenDeltaParquetFileFormat extend DeltaParquetFileFormat instead of GlutenParquetFileFormat, then override only write-specific methods. This would let Delta's original PreprocessTableWithDVs handle both formats naturally.
But the tradeoff is that this breaks the clean format hierarchy separation:
- Current: `GlutenParquetFileFormat` → `GlutenDeltaParquetFileFormat` (Gluten hierarchy)
- Alternative: `DeltaParquetFileFormat` → `GlutenDeltaParquetFileFormat` (mixed hierarchy)
Right now I would say the code maintains format hierarchy separation, avoids reflection for package-private API access, and handles both formats type-safely in DV preprocessing. But it requires shadowing Delta's trait (a maintenance burden if Delta changes internals).
Would you prefer we explore the alternative approach, or is the current shadowing acceptable given the current state?
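To make the hierarchy tradeoff concrete, here is a small sketch with hypothetical stand-in classes (none of these are the real Spark, Delta, or Gluten types, and the assumption that both hierarchies share a common Parquet base is mine). In the "current" layout the Gluten Delta format is not a subtype of the Delta format, so a rule that matches only the Delta type misses it and has to list both cases.

```scala
// Hypothetical stand-ins for the file-format classes named in the thread.
class ParquetFormat
class DeltaFormat extends ParquetFormat             // role of DeltaParquetFileFormat
class GlutenParquetFormat extends ParquetFormat     // role of GlutenParquetFileFormat
class GlutenDeltaFormat extends GlutenParquetFormat // current: stays in the Gluten hierarchy

object FormatMatchSketch {
  // What a rule written only against the Delta type would match.
  def matchedByDeltaOnlyRule(format: ParquetFormat): Boolean = format match {
    case _: DeltaFormat => true
    case _ => false
  }

  // What a rule aware of both hierarchies matches.
  def matchedByMixedFormatRule(format: ParquetFormat): Boolean = format match {
    case _: DeltaFormat | _: GlutenDeltaFormat => true
    case _ => false
  }
}
```

Under the alternative layout, where the Gluten Delta format extends the Delta format, the first matcher alone would cover both cases, at the cost of mixing the two hierarchies.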
What changes are proposed in this pull request?
This PR is the next split in the Delta deletion-vector scan stack after #12197, which has now landed.
It adds the JVM/Substrait/Velox handoff that consumes the essential Delta DV scan info extracted by #12197, materializes serialized DV payloads on the JVM side, and passes them to native scan execution.
Main changes:
- `PrepareDeltaScan` `DeltaDeletionVectorScanInfo` from [VL][Delta] Add DV scan info extraction utility #12197 to extract per-file DV metadata and serialized DV bytes from Delta-prepared scan files
- `DeltaReadOptions` `DeltaReadOptions`, instead of passing essential DV data through generic metadata columns
- `DeltaSplitInfo` path for Delta-specific split metadata
- `VeloxIteratorApi`, `VeloxPlanConverter`, `WholeStageResultIterator`, and `SubstraitToVeloxPlan`
This PR is intentionally handoff-only:
Issue: #11901
How was this patch tested?
Validation after rebasing on current `upstream/main` with #12197 included:
- `git diff --check upstream/main..HEAD`
- `./dev/format-scala-code.sh check`
- `clang-format --dry-run -Werror` on touched C++ files
- `./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests`
- `./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests`
Was this patch authored or co-authored using generative AI tooling?
Generated-by: IBM BOB