Spark 4.0: Row Lineage support #13310
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java
7dcb5a2
to
be78ef5
Compare
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java
Outdated
54145af
to
3a6d7fa
Compare
...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java
Outdated
...ons/src/test/java/org/apache/iceberg/spark/extensions/TestRowLevelOperationsWithLineage.java
Outdated
some early comments
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
.../v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ExtractRowLineageFromMetadata.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkMetadataColumn.java
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
Outdated
...sions/src/test/java/org/apache/iceberg/spark/extensions/SparkRowLevelOperationsTestBase.java
Outdated
public void beforeEach() {
  assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
  // ToDo: Remove these as row lineage inheritance gets implemented in the other readers
  assumeThat(fileFormat).isEqualTo(FileFormat.PARQUET);
maybe worth overriding `parameters()` in `TestRowLevelOperationsWithLineage` and defining a smaller test matrix, wdyt?
Yup, agreed! I need to rebase and incorporate the latest test changes I made, which define a smaller test matrix (and will also remove the changes I made to SparkRowLevelOperationsTestBase).
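To make the "smaller test matrix" idea concrete, here is a minimal sketch in plain Java. It does not use Iceberg's or JUnit's parameterized-test machinery; the class names `BaseOperationsMatrix` and `LineageOperationsMatrix` and the matrix contents are hypothetical stand-ins. The point is only that a subclass can declare its own static `parameters()` that keeps the combinations the lineage tests can actually run, rather than generating every combination and skipping most of them with assumptions in `beforeEach()`.

```java
import java.util.Arrays;

// Hypothetical stand-in for a base test class with a broad parameter matrix.
// Each row is {fileFormat, formatVersion}; the values are illustrative only.
class BaseOperationsMatrix {
  static Object[][] parameters() {
    return new Object[][] {
      {"parquet", 2}, {"orc", 2}, {"avro", 2},
      {"parquet", 3}, {"orc", 3}, {"avro", 3},
    };
  }
}

// Hypothetical lineage test class. Its static parameters() hides the base one
// and keeps only the rows that the assumptions would otherwise let through.
class LineageOperationsMatrix extends BaseOperationsMatrix {
  static Object[][] parameters() {
    return Arrays.stream(BaseOperationsMatrix.parameters())
        .filter(row -> "parquet".equals(row[0]) && (Integer) row[1] >= 3)
        .toArray(Object[][]::new);
  }
}
```

With a matrix like this, the `assumeThat` calls become redundant for the lineage tests, and the test report no longer fills up with skipped combinations.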
...ons/src/test/java/org/apache/iceberg/spark/extensions/TestRowLevelOperationsWithLineage.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkMetadataColumn.java
public NamedReference[] requiredMetadataAttributes() {
  NamedReference specId = Expressions.column(MetadataColumns.SPEC_ID.name());
  NamedReference partition = Expressions.column(MetadataColumns.PARTITION_COLUMN_NAME);
  if (TableUtil.supportsRowLineage(table)) {
nit: I'm fine either way, but I think it would be good to align how this is done here (stores named references in an array) with how it is done in `SparkCopyOnWriteOperation` (which stores named references in a list).
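One way to align the two styles is to build the references in a list and convert to an array only at the return, so the conditional branch just appends. The sketch below is an illustration under assumptions, not the code in this PR: plain strings stand in for `NamedReference`, the class name `RequiredMetadataAttributes` is hypothetical, and the column names (including which columns the row lineage branch adds) are assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: strings replace Spark's NamedReference so the
// example compiles without Spark or Iceberg on the classpath.
class RequiredMetadataAttributes {
  static String[] requiredMetadataAttributes(boolean supportsRowLineage) {
    List<String> attributes = new ArrayList<>();
    attributes.add("_spec_id");   // assumed name of the spec ID metadata column
    attributes.add("_partition"); // assumed name of the partition metadata column
    if (supportsRowLineage) {
      // Assumed row lineage columns; the branch only appends to the list.
      attributes.add("_row_id");
      attributes.add("_last_updated_sequence_number");
    }
    // Convert once at the boundary where an array is required.
    return attributes.toArray(new String[0]);
  }
}
```

The list form avoids writing out two separate array literals for the lineage and non-lineage cases.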
    .writeProperties(writeProperties)
    .build();

Function<InternalRow, InternalRow> extractRowLineage =
nit: maybe `rowLineageExtractor` or something along those lines? I only mention this because `extractRowLineage` sounds like a boolean flag
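For illustration, a noun-style name reads naturally at the call site when the variable holds a function. The sketch below is hypothetical throughout: `Object[]` stands in for Spark's `InternalRow`, and the class `RowLineageNaming`, its factory method, and the field positions are invented for the example rather than taken from this PR.

```java
import java.util.function.Function;

// Hypothetical stand-in: a "metadata row" is an Object[] and the lineage
// fields sit at caller-supplied positions.
class RowLineageNaming {
  // Returns a function that projects the two lineage fields out of a metadata row.
  static Function<Object[], Object[]> rowLineageExtractor(int rowIdPos, int lastUpdatedSeqPos) {
    return metadataRow -> new Object[] {metadataRow[rowIdPos], metadataRow[lastUpdatedSeqPos]};
  }
}
```

At the call site, `Function<Object[], Object[]> rowLineageExtractor = RowLineageNaming.rowLineageExtractor(2, 3);` followed by `rowLineageExtractor.apply(metadataRow)` reads as "apply the extractor", whereas a variable named `extractRowLineage` could be mistaken for a flag that controls whether extraction happens.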
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWrite.java
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java
Outdated
b5c5dd6
to
0d51fb5
Compare
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkTable.java
spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkAvroReader.java
dad59a1
to
31cfce2
Compare
…stently for surfacing metadata columns, and include test refactorings that were done in 3.4/3.5
…isting metadata row
31cfce2
to
2eb84ca
Compare
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java
Outdated
…mMetadata to RowLineageExtractor
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ExtractRowLineage.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ExtractRowLineage.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ExtractRowLineage.java
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/ExtractRowLineage.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkPositionDeltaWrite.java
Outdated
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java
…ow lineage decoration
This change implements Spark 4.0 support for Iceberg v3's row lineage feature. The approach uses the new conditional nullification mechanism introduced in Spark 4.0 instead of the custom rules that we implemented for Spark 3.5.