Conversation

amogh-jahagirdar
Contributor

This change implements Iceberg v3's row lineage feature for Spark 4.0. The approach uses the new conditional nullification mechanism introduced in Spark 4.0 instead of the custom rules that we implemented for 3.5.

@github-actions github-actions bot added the spark label Jun 14, 2025
@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-4.0-row-lineage branch 2 times, most recently from 7dcb5a2 to be78ef5 Compare June 15, 2025 23:05
@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-4.0-row-lineage branch 2 times, most recently from 54145af to 3a6d7fa Compare June 17, 2025 19:21
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review June 30, 2025 19:54
@amogh-jahagirdar amogh-jahagirdar added this to the Iceberg 1.10.0 milestone Jul 7, 2025
Contributor

@stevenzwu stevenzwu left a comment

some early comments

public void beforeEach() {
  assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
  // ToDo: Remove these as row lineage inheritance gets implemented in the other readers
  assumeThat(fileFormat).isEqualTo(FileFormat.PARQUET);
Contributor

maybe worth overriding parameters() in TestRowLevelOperationsWithLineage and defining a smaller test matrix, wdyt?

Contributor Author

Yup agreed! I need to rebase and incorporate the latest test changes I made, which define a smaller test matrix (and will also remove the changes I made to SparkRowLevelOperationsTestBase)
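
For readers following the thread, here is a rough sketch of what a smaller test matrix could mean in practice. It is not the code from this PR: the `LineageTestMatrix` class, the `Params` holder, and the two-dimensional matrix (file format crossed with format version) are hypothetical stand-ins, and the real test parameters in the Iceberg Spark tests carry more dimensions. The idea is only that the parameter source lists the combinations the lineage tests can run, rather than generating every combination and skipping most of them through the assumeThat guards quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class LineageTestMatrix {
  // Hypothetical stand-in for the file formats a test suite might cover.
  enum FileFormat { PARQUET, ORC, AVRO }

  // Hypothetical parameter holder; the real tests use more fields than this.
  static final class Params {
    final FileFormat fileFormat;
    final int formatVersion;

    Params(FileFormat fileFormat, int formatVersion) {
      this.fileFormat = fileFormat;
      this.formatVersion = formatVersion;
    }
  }

  // Full matrix: every file format crossed with format versions 1 through 3.
  static List<Params> fullMatrix() {
    List<Params> all = new ArrayList<>();
    for (FileFormat format : FileFormat.values()) {
      for (int version = 1; version <= 3; version++) {
        all.add(new Params(format, version));
      }
    }
    return all;
  }

  // Reduced matrix: keep only combinations that would pass the two guards
  // in beforeEach() (format version >= 3 and Parquet files).
  static List<Params> lineageMatrix() {
    return fullMatrix().stream()
        .filter(p -> p.formatVersion >= 3)
        .filter(p -> p.fileFormat == FileFormat.PARQUET)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println("full: " + fullMatrix().size());
    System.out.println("lineage: " + lineageMatrix().size());
  }
}
```

With a reduced parameter source like this, the guards in beforeEach() become redundant for the lineage tests, and skipped combinations no longer show up in the test report.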

public NamedReference[] requiredMetadataAttributes() {
  NamedReference specId = Expressions.column(MetadataColumns.SPEC_ID.name());
  NamedReference partition = Expressions.column(MetadataColumns.PARTITION_COLUMN_NAME);
  if (TableUtil.supportsRowLineage(table)) {
Contributor

nit: I'm fine either way, but I think it would be good to align how this is done here (stores named references in an array) with how it is done in SparkCopyOnWriteOperation (which stores named references in a list)
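
One possible way to align the two styles, sketched here under assumptions rather than taken from the PR: collect the references in a list everywhere and convert to an array only at the method boundary. Plain strings stand in for Spark's NamedReference so the example compiles without dependencies, the `MetadataAttributeSketch` class is hypothetical, and the column names are illustrative guesses at which metadata columns the lineage branch would add.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MetadataAttributeSketch {
  // Strings stand in for NamedReference; the names below are illustrative only.
  static String[] requiredMetadataAttributes(boolean supportsRowLineage) {
    // Start from the attributes that are always required.
    List<String> attrs = new ArrayList<>(Arrays.asList("_spec_id", "_partition"));

    // Append the lineage attributes only when the table supports row lineage.
    if (supportsRowLineage) {
      attrs.add("_row_id");
      attrs.add("_last_updated_sequence_number");
    }

    // Convert once, at the boundary, to the array type the interface expects.
    return attrs.toArray(new String[0]);
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(requiredMetadataAttributes(true)));
  }
}
```

Building a list and converting at the end keeps the conditional branch short and avoids maintaining two hand-written array literals that differ only in their trailing elements.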

.writeProperties(writeProperties)
.build();

Function<InternalRow, InternalRow> extractRowLineage =
Contributor

nit: maybe rowLineageExtractor or something along those lines? I only mention this because extractRowLineage sounds like a boolean flag

@amogh-jahagirdar amogh-jahagirdar force-pushed the spark-4.0-row-lineage branch 2 times, most recently from b5c5dd6 to 0d51fb5 Compare July 8, 2025 20:18
…stently for surfacing metadata columns, and include test refactorings that were done in 3.4/3.5
@stevenzwu stevenzwu merged commit d94b036 into apache:main Jul 14, 2025
27 checks passed

4 participants