
Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader #12928


Open
wants to merge 5 commits into main from vectorized-parquet-row-lineage

Conversation


@amogh-jahagirdar amogh-jahagirdar commented Apr 29, 2025

This change adds support for row lineage when performing operations on tables with the default Spark vectorized reader.

@amogh-jahagirdar amogh-jahagirdar changed the title Vectorized parquet row lineage Row lineage Vectorized Parquet Reader Apr 29, 2025
@amogh-jahagirdar amogh-jahagirdar changed the title Row lineage Vectorized Parquet Reader Spark, Arrow: Support for Row lineage when doing Vectorized Parquet reads Apr 29, 2025
@amogh-jahagirdar amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from 5982dc7 to b5e6d2e Compare April 29, 2025 06:01
@amogh-jahagirdar
Contributor Author

I've separated out a large chunk of the test refactoring (using Iceberg Records in writeAndValidate instead of Avro Records) into #12925. I think we should get that in first; then I can rebase and this becomes a more focused change.

@amogh-jahagirdar amogh-jahagirdar changed the title Spark, Arrow: Support for Row lineage when doing Vectorized Parquet reads Spark 3.5, Arrow: Support for Row lineage when doing Vectorized Parquet reads Apr 29, 2025
@amogh-jahagirdar amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch 7 times, most recently from dc9b308 to 6d432c4 Compare April 29, 2025 17:06
@amogh-jahagirdar amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from 8cc09bd to 896a93b Compare April 30, 2025 21:14
@amogh-jahagirdar amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from e23a906 to 110c80f Compare April 30, 2025 22:29
@amogh-jahagirdar amogh-jahagirdar marked this pull request as ready for review April 30, 2025 22:30
@amogh-jahagirdar amogh-jahagirdar changed the title Spark 3.5, Arrow: Support for Row lineage when doing Vectorized Parquet reads Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader Apr 30, 2025
@amogh-jahagirdar amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from 110c80f to 2c4f57f Compare April 30, 2025 22:32
Comment on lines +464 to +506
  public static VectorizedArrowReader rowIds(Long baseRowId, VectorizedArrowReader idReader) {
    if (baseRowId != null) {
      return new RowIdVectorReader(baseRowId, idReader);
    } else {
      return nulls();
    }
  }

  public static VectorizedArrowReader lastUpdated(
      Long baseRowId, Long fileLastUpdated, VectorizedArrowReader seqReader) {
    if (fileLastUpdated != null && baseRowId != null) {
      return new LastUpdatedSeqVectorReader(fileLastUpdated, seqReader);
    } else {
      return nulls();
    }
  }

  public static VectorizedReader<?> replaceWithMetadataReader(
      Types.NestedField icebergField,
      VectorizedReader<?> reader,
      Map<Integer, ?> idToConstant,
      boolean setArrowValidityVector) {
    int id = icebergField.fieldId();
    if (id == MetadataColumns.ROW_ID.fieldId()) {
      Long baseRowId = (Long) idToConstant.get(id);
      return rowIds(baseRowId, (VectorizedArrowReader) reader);
    } else if (id == MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId()) {
      // Look up the base row id under its own field id; fetching both constants
      // with the sequence-number id would hand the sequence number to baseRowId
      Long baseRowId = (Long) idToConstant.get(MetadataColumns.ROW_ID.fieldId());
      Long fileSeqNumber = (Long) idToConstant.get(id);
      return VectorizedArrowReader.lastUpdated(
          baseRowId, fileSeqNumber, (VectorizedArrowReader) reader);
    } else if (idToConstant.containsKey(id)) {
      // containsKey is used because the constant may be null
      return new ConstantVectorReader<>(icebergField, idToConstant.get(id));
    } else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
      if (setArrowValidityVector) {
        return positionsWithSetArrowValidityVector();
      } else {
        return VectorizedArrowReader.positions();
      }
    } else if (id == MetadataColumns.IS_DELETED.fieldId()) {
      return new DeletedVectorReader();
    }

    // Fall through: no metadata column matched, keep the original reader
    return reader;
  }
Contributor Author

This is all pretty similar to what was done with the ParquetValueReader
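To make the readers above concrete, here is a minimal sketch of the row-id inheritance rule they implement (per the Iceberg v3 spec): when a row has no materialized `_row_id`, its id is the file's first assigned row id plus the row's position. The `rowId` helper below is a hypothetical illustration of the per-row computation, not the actual API of `RowIdVectorReader`.

```java
public class Main {
  // Hypothetical helper mirroring the per-row computation; `materialized`
  // holds the value read from the file, null meaning "inherit".
  static long rowId(long baseRowId, long pos, Long materialized) {
    return materialized != null ? materialized : baseRowId + pos;
  }

  public static void main(String[] args) {
    long base = 1000L;
    // Rows at positions 0 and 2 inherit; position 1 carries an explicit id.
    Long[] stored = {null, 42L, null};
    long[] expected = {1000L, 42L, 1002L};
    for (int pos = 0; pos < stored.length; pos++) {
      long id = rowId(base, pos, stored[pos]);
      if (id != expected[pos]) {
        throw new AssertionError("pos " + pos + " -> " + id);
      }
      System.out.println("pos " + pos + " -> row id " + id);
    }
  }
}
```

The vectorized reader applies the same rule batch-wise: positions are produced by the position reader and added to the base row id wherever the id vector's validity bit is unset.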

Comment on lines +102 to +105
      VectorizedReader<?> reader =
          VectorizedArrowReader.replaceWithMetadataReader(
              field, readersById.get(field.fieldId()), idToConstant, setArrowValidityVector);
      reorderedFields.add(defaultReader(field, reader));
Contributor Author

Same as the refactoring done in https://github.com/apache/iceberg/pull/12836/files

@@ -113,33 +113,6 @@ public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row
}
}

public static void assertEqualsBatch(
Contributor Author

No longer needed. It's public, but it lives in Spark test code, so I think we're OK to remove it directly; I don't think removing it breaks any compatibility guarantees the project provides.

@@ -91,7 +91,6 @@ public void beforeEach() {
assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
// ToDo: Remove these as row lineage inheritance gets implemented in the other readers
assumeThat(fileFormat).isEqualTo(FileFormat.PARQUET);
assumeThat(vectorized).isFalse();
Contributor Author

I should probably add a test that creates enough records to span more than a single batch in a vectorized read, performs a modification on the records with even IDs, and asserts the resulting row lineage state. Right now this test effectively covers only a single batch. In theory everything should just work, but an explicit test would be better.
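The lineage state such a test would assert can be sketched as follows, under the spec-level rule that an update preserves a row's `_row_id` while bumping its `_last_updated_sequence_number` to the committing snapshot's sequence number. The `Row` holder, `expectedSeq` helper, and the concrete sequence numbers are illustrative, not part of the PR's code.

```java
import java.util.ArrayList;
import java.util.List;

public class Main {
  // Hypothetical row holder for illustration only
  static final class Row {
    final long id;
    final long rowId;
    long lastUpdatedSeq;

    Row(long id, long rowId, long seq) {
      this.id = id;
      this.rowId = rowId;
      this.lastUpdatedSeq = seq;
    }
  }

  // Sequence number a row should carry after even-id rows are updated
  static long expectedSeq(long id, long writeSeq, long updateSeq) {
    return id % 2 == 0 ? updateSeq : writeSeq;
  }

  public static void main(String[] args) {
    long writeSeq = 1L;   // sequence number of the initial write
    long updateSeq = 2L;  // sequence number of the update
    List<Row> rows = new ArrayList<>();
    for (long i = 0; i < 10_000; i++) {
      rows.add(new Row(i, i, writeSeq)); // row id inherited as base + position
    }
    // Simulate updating rows with even ids: row id preserved,
    // last-updated sequence number bumped to the new snapshot's number
    for (Row r : rows) {
      if (r.id % 2 == 0) {
        r.lastUpdatedSeq = updateSeq;
      }
    }
    for (Row r : rows) {
      if (r.rowId != r.id) {
        throw new AssertionError("row id changed for id " + r.id);
      }
      if (r.lastUpdatedSeq != expectedSeq(r.id, writeSeq, updateSeq)) {
        throw new AssertionError("wrong sequence number for id " + r.id);
      }
    }
    System.out.println("lineage verified for " + rows.size() + " rows");
  }
}
```

Using 10,000 rows is enough to span several vectorized batches at typical batch sizes, so the assertion loop would also catch any discontinuity at batch boundaries.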
