[VL][Delta] Add JVM Delta DV scan handoff #12198
Run Gluten Clickhouse CI on x86
Hey @zhztheplayer @zhouyuan, this PR is ready for review. This should close out the DV scan migration from the initial POC work.
zhztheplayer left a comment:
Thanks. I have some initial broad questions about the use of reflection.
    Class
      .forName(deltaDvPreprocessRuleClassName)
      .getConstructor(classOf[SparkSession])
      .newInstance(spark)
      .asInstanceOf[Rule[LogicalPlan]]
Is there any particular reason for using reflection here, given that the target is Gluten's own class?
Thanks @zhztheplayer, you are right: since this rule is Gluten-owned, reflection was not needed here.
I updated this path to construct PreprocessDeltaScanWithDeletionVectors directly. To keep the direct construction compiling across the existing Spark/Delta profile matrix, I also added no-op compatibility definitions for Delta 2.3/2.4. The actual DV preprocessing behavior remains in the Delta 3.3/4.x profiles.
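As a rough illustration of the change described above, the sketch below contrasts reflective and direct construction of a rule, plus a no-op definition of the kind a profile without DV support might carry. Every type here (`Session`, `Plan`, `Rule`, `PreprocessRule`, `NoOpPreprocessRule`, `RuleConstruction`) is a hypothetical stand-in so the snippet runs without Spark or Delta on the classpath; it is not the actual Gluten code.

```scala
// Hypothetical stand-ins for SparkSession, LogicalPlan and Rule[LogicalPlan].
final case class Session(name: String)
final case class Plan(nodes: List[String])

trait Rule[T] {
  def apply(plan: T): T
}

// Stand-in for a profile where DV preprocessing does real work.
class PreprocessRule(session: Session) extends Rule[Plan] {
  override def apply(plan: Plan): Plan = Plan("dv-preprocessed" :: plan.nodes)
}

// Stand-in for a no-op compatibility definition in a profile without DV support.
class NoOpPreprocessRule(session: Session) extends Rule[Plan] {
  override def apply(plan: Plan): Plan = plan
}

object RuleConstruction {
  // Reflective construction: the class name and constructor are only checked at runtime.
  def viaReflection(className: String, session: Session): Rule[Plan] =
    Class
      .forName(className)
      .getConstructor(classOf[Session])
      .newInstance(session)
      .asInstanceOf[Rule[Plan]]

  // Direct construction: the compiler verifies that the class and constructor exist.
  def direct(session: Session): Rule[Plan] = new PreprocessRule(session)
}
```

With direct construction, a missing or renamed class is a compile error in every profile, which is why each profile needs some definition of the class, even if it does nothing.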
    private def normalizeDeltaSplitMetadata(
        partitionColumnCount: Int,
        partitionFiles: Seq[PartitionedFile]): Option[NormalizedDeltaSplitMetadata] = {
      try {
        // scalastyle:off classforname
        val moduleClass = Class.forName(deltaScanInfoClassName)
        // scalastyle:on classforname
        val module = moduleClass.getField("MODULE$").get(null)
        val extractAllMethod = moduleClass.getMethod(
          "extractAllFromJava",
          classOf[SparkSession],
          classOf[Int],
          classOf[java.util.List[_]])
        val scanInfos = extractAllMethod
          .invoke(module, activeSparkSession, Int.box(partitionColumnCount), partitionFiles.asJava)
          .asInstanceOf[java.util.List[_]]
          .asScala
          .toSeq
        val splitMetadata = scanInfos.map(toDeltaSplitMetadata)
        if (splitMetadata.exists(_._2.hasDeletionVector())) {
          Some((splitMetadata.map(_._1), splitMetadata.map(_._2)))
        } else {
          None
        }
      } catch {
        case _: ClassNotFoundException | _: NoSuchMethodException =>
          None
      }
    }

    private def toDeltaSplitMetadata(
        scanInfo: Any): (java.util.Map[String, Object], DeltaFileReadOptions) = {
      val metadata = scanInfo
        .getClass
        .getMethod("normalizedOtherMetadataColumns")
        .invoke(scanInfo)
        .asInstanceOf[scala.collection.Map[String, Object]]
        .asJava
      val deletionVectorInfo = scanInfo.getClass.getMethod("deletionVectorInfo").invoke(scanInfo)
      val rowIndexFilterType = deletionVectorInfo
        .getClass
        .getMethod("rowIndexFilterType")
        .invoke(deletionVectorInfo)
        .toString
      val hasDeletionVector = deletionVectorInfo
        .getClass
        .getMethod("hasDeletionVector")
        .invoke(deletionVectorInfo)
        .asInstanceOf[Boolean]
      val cardinality = deletionVectorInfo
        .getClass
        .getMethod("cardinality")
        .invoke(deletionVectorInfo)
        .asInstanceOf[JLong]
        .longValue()
      val serializedDeletionVector = deletionVectorInfo
        .getClass
        .getMethod("serializedDeletionVector")
        .invoke(deletionVectorInfo)
        .asInstanceOf[Array[Byte]]

      (
        metadata,
        new DeltaFileReadOptions(
          toDeltaRowIndexFilterType(rowIndexFilterType),
          hasDeletionVector,
          cardinality,
          serializedDeletionVector))
    }
If reflection is inevitable, can we encapsulate the reflection code in a utility class?
@zhztheplayer I moved this reflection out of VeloxIteratorApi into DeltaSplitMetadataExtractor.
I think some reflection is still needed here, because VeloxIteratorApi is in the common Velox source set, while DeltaDeletionVectorScanInfo is only compiled in Delta-specific source profiles. The extractor now isolates that bridge and caches the class and method lookups, so the iterator path only calls the utility and no longer carries the reflection block inline.
Let me know if you know of a workaround.
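For readers unfamiliar with the pattern, here is a minimal sketch of a reflective bridge that resolves an optional Scala object once and caches the result. It uses only JDK reflection; `OptionalScanInfo`, `ReflectiveBridge` and `cardinalityOf` are hypothetical names invented for this example and do not reflect the real DeltaSplitMetadataExtractor API.

```scala
import java.lang.reflect.Method

// Hypothetical stand-in for an object that is only compiled in some build profiles.
object OptionalScanInfo {
  def cardinalityOf(path: String): Long = path.length.toLong
}

// Hypothetical bridge: resolves the module instance and method once, then reuses them.
final class ReflectiveBridge(moduleClassName: String) {
  // The lazy val caches the lookup; None means the optional class or method is absent.
  private lazy val target: Option[(AnyRef, Method)] =
    try {
      val moduleClass = Class.forName(moduleClassName)
      val module = moduleClass.getField("MODULE$").get(null)
      val method = moduleClass.getMethod("cardinalityOf", classOf[String])
      Some((module, method))
    } catch {
      case _: ClassNotFoundException | _: NoSuchMethodException | _: NoSuchFieldException =>
        None
    }

  def isAvailable: Boolean = target.isDefined

  // Callers get a typed Option and never touch java.lang.reflect directly.
  def cardinalityOf(path: String): Option[Long] =
    target.map { case (module, method) =>
      method.invoke(module, path).asInstanceOf[java.lang.Long].longValue()
    }
}
```

In this sketch the caller would pass the module class name as a string constant (the demo below derives it from the stand-in object only to stay self-contained), and a missing class simply yields `None` instead of an exception on the hot path.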
What changes are proposed in this pull request?
This PR is the next split in the Delta deletion-vector scan stack after #12197, which has now landed.
It adds the JVM/Substrait/Velox handoff that consumes the essential Delta DV scan info extracted by #12197, materializes serialized DV payloads on the JVM side, and passes them to native scan execution.
Main changes:
- PrepareDeltaScan
- DeltaDeletionVectorScanInfo from [VL][Delta] Add DV scan info extraction utility #12197 to extract per-file DV metadata and serialized DV bytes from Delta-prepared scan files
- DeltaReadOptions
- DeltaReadOptions, instead of passing essential DV data through generic metadata columns
- DeltaSplitInfo path for Delta-specific split metadata
- VeloxIteratorApi, VeloxPlanConverter, WholeStageResultIterator, and SubstraitToVeloxPlan
This PR is intentionally handoff-only:
Issue: #11901
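The idea of carrying DV data in a typed options object rather than in generic metadata columns can be sketched roughly as follows. `ReadOptions` and `SplitMetadata` are hypothetical names; the fields only loosely mirror the arguments passed to DeltaFileReadOptions in the reviewed snippet, and the filter-type strings are placeholders.

```scala
// Hypothetical typed carrier for per-file DV data (not the real DeltaFileReadOptions).
final case class ReadOptions(
    rowIndexFilterType: String,
    hasDeletionVector: Boolean,
    cardinality: Long,
    serializedDeletionVector: Array[Byte])

object SplitMetadata {
  // One entry per file: remaining generic metadata columns plus the typed DV options.
  type PerFile = (Map[String, String], ReadOptions)

  // Returns None when no file carries a deletion vector, so callers can skip
  // the Delta-specific split path entirely; otherwise splits the pairs into two sequences.
  def normalize(perFile: Seq[PerFile]): Option[(Seq[Map[String, String]], Seq[ReadOptions])] =
    if (perFile.exists(_._2.hasDeletionVector)) {
      Some((perFile.map(_._1), perFile.map(_._2)))
    } else {
      None
    }
}
```

Keeping the serialized bytes in a dedicated field means the native side does not have to parse them back out of string-valued metadata entries.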
How was this patch tested?
Validation after rebasing on current upstream/main with #12197 included:
- git diff --check upstream/main..HEAD
- ./dev/format-scala-code.sh check
- clang-format --dry-run -Werror on touched C++ files
- ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-3.5,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
- ./build/mvn test-compile -pl backends-velox -am -Pjava-17,spark-4.0,scala-2.13,backends-velox,hadoop-3.3,spark-ut,delta -DskipTests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: IBM BOB