
Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader #12928


Merged

Conversation

@amogh-jahagirdar (Contributor) commented on Apr 29, 2025:

This change adds support for row lineage when performing operations on tables with the default Spark vectorized reader.
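
For context, row lineage gives each row a _row_id and a _last_updated_sequence_number. When a data file does not materialize these columns, their values are inherited from file-level metadata. Below is a minimal sketch of the inheritance rule; the helper names are hypothetical, and the PR implements this inside Arrow vector readers rather than as scalar helpers.

    // RowLineageSketch.java -- illustrative only; helper names are hypothetical
    public class RowLineageSketch {
      // _row_id: a value materialized in the data file wins; otherwise it is
      // inherited as the file's first_row_id plus the row's position in the file
      static long rowId(Long materialized, long firstRowId, long pos) {
        return materialized != null ? materialized : firstRowId + pos;
      }

      // _last_updated_sequence_number: falls back to the file's data sequence
      // number when the column is not materialized in the file
      static long lastUpdatedSeq(Long materialized, long fileSeqNumber) {
        return materialized != null ? materialized : fileSeqNumber;
      }
    }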

@amogh-jahagirdar changed the title from "Vectorized parquet row lineage" to "Row lineage Vectorized Parquet Reader" on Apr 29, 2025
@amogh-jahagirdar changed the title from "Row lineage Vectorized Parquet Reader" to "Spark, Arrow: Support for Row lineage when doing Vectorized Parquet reads" on Apr 29, 2025
@amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from 5982dc7 to b5e6d2e on April 29, 2025 at 06:01
@amogh-jahagirdar (Contributor, Author):

I've separated the large chunk of test refactoring (using Iceberg Records in writeAndValidate instead of Avro Records) into #12925. I think we should get that in first; then I can rebase and this becomes a more focused change.

@amogh-jahagirdar changed the title from "Spark, Arrow: Support for Row lineage when doing Vectorized Parquet reads" to "Spark 3.5, Arrow: Support for Row lineage when doing Vectorized Parquet reads" on Apr 29, 2025
@amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch 7 times, most recently from dc9b308 to 6d432c4 on April 29, 2025 at 17:06
@amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from 8cc09bd to 896a93b on April 30, 2025 at 21:14
@amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from e23a906 to 110c80f on April 30, 2025 at 22:29
@amogh-jahagirdar marked this pull request as ready for review on April 30, 2025 at 22:30
@amogh-jahagirdar changed the title from "Spark 3.5, Arrow: Support for Row lineage when doing Vectorized Parquet reads" to "Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader" on Apr 30, 2025
@amogh-jahagirdar force-pushed the vectorized-parquet-row-lineage branch from 110c80f to 2c4f57f on April 30, 2025 at 22:32
Comment on lines 464 to 506
public static VectorizedArrowReader rowIds(Long baseRowId, VectorizedArrowReader idReader) {
  if (baseRowId != null) {
    return new RowIdVectorReader(baseRowId, idReader);
  } else {
    return nulls();
  }
}

public static VectorizedArrowReader lastUpdated(
    Long baseRowId, Long fileLastUpdated, VectorizedArrowReader seqReader) {
  if (fileLastUpdated != null && baseRowId != null) {
    return new LastUpdatedSeqVectorReader(fileLastUpdated, seqReader);
  } else {
    return nulls();
  }
}

public static VectorizedReader<?> replaceWithMetadataReader(
    Types.NestedField icebergField,
    VectorizedReader<?> reader,
    Map<Integer, ?> idToConstant,
    boolean setArrowValidityVector) {
  int id = icebergField.fieldId();
  if (id == MetadataColumns.ROW_ID.fieldId()) {
    Long baseRowId = (Long) idToConstant.get(id);
    return rowIds(baseRowId, (VectorizedArrowReader) reader);
  } else if (id == MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId()) {
    // the base row ID constant is keyed by the ROW_ID field, not this field
    Long baseRowId = (Long) idToConstant.get(MetadataColumns.ROW_ID.fieldId());
    Long fileSeqNumber = (Long) idToConstant.get(id);
    return VectorizedArrowReader.lastUpdated(
        baseRowId, fileSeqNumber, (VectorizedArrowReader) reader);
  } else if (idToConstant.containsKey(id)) {
    // containsKey is used because the constant may be null
    return new ConstantVectorReader<>(icebergField, idToConstant.get(id));
  } else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
    if (setArrowValidityVector) {
      return positionsWithSetArrowValidityVector();
    } else {
      return VectorizedArrowReader.positions();
    }
  } else if (id == MetadataColumns.IS_DELETED.fieldId()) {
    return new DeletedVectorReader();
  }

  // no metadata substitution applies; use the projected field's reader as-is
  return reader;
}
@amogh-jahagirdar (Contributor, Author):

This is all pretty similar to what was done with the ParquetValueReader
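
To make the parallel concrete, here is a rough sketch of what the vectorized row-id path produces for one Arrow batch. This is a simplification with assumed inputs: the PR's RowIdVectorReader obtains materialized IDs and positions through nested vectorized readers rather than taking arrays.

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.BigIntVector;

    class RowIdBatchSketch {
      // builds the _row_id vector for one batch: materialized IDs win, nulls
      // are inherited as firstRowId + position; the result vector is non-null
      static BigIntVector rowIdBatch(
          BufferAllocator allocator, long firstRowId, long startPos, Long[] materialized) {
        BigIntVector rowIds = new BigIntVector("_row_id", allocator);
        rowIds.allocateNew(materialized.length);
        for (int i = 0; i < materialized.length; i += 1) {
          long id = materialized[i] != null ? materialized[i] : firstRowId + startPos + i;
          rowIds.set(i, id);
        }
        rowIds.setValueCount(materialized.length);
        return rowIds; // the caller owns the vector and is responsible for closing it
      }
    }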

Comment on lines +102 to +105
VectorizedReader<?> reader =
    VectorizedArrowReader.replaceWithMetadataReader(
        field, readersById.get(field.fieldId()), idToConstant, setArrowValidityVector);
reorderedFields.add(defaultReader(field, reader));
@amogh-jahagirdar (Contributor, Author):

Same as the refactoring done in https://github.com/apache/iceberg/pull/12836/files


BigIntVector rowIds = allocateBigIntVector(ROW_ID_ARROW_FIELD, numValsToRead);
ArrowBuf dataBuffer = rowIds.getDataBuffer();
for (int i = 0; i < numValsToRead; i += 1) {
Reviewer (Contributor):

Suggested change:
-  for (int i = 0; i < numValsToRead; i += 1) {
+  for (int i = 0; i < numValsToRead; i++) {

@amogh-jahagirdar (Contributor, Author):

Happy to change it if you feel strongly about it, but I mostly just followed the increment pattern of i += 1 that's already in this class (and this package, it looks like). If we do change it, I'd change the other instances in this class as well, just to keep things consistent.

Reviewer (Contributor):

I actually didn't realize that we have so many places that do i += 1 in for loops. It's not a big deal and I don't feel strongly about it, but it would be great to fix this throughout the codebase in a separate PR.


@Override
public void close() {
  // don't close vectors as they are not owned by readers
Reviewer (Contributor):

It appears that the vectors are being closed in the read() method.

@amogh-jahagirdar (Contributor, Author):

I made this comment a bit more specific to say "result vectors", since I previously copy/pasted it from the other reader, but I think the intent is that it only applies to the result vectors; let me know what you think.

I believe we can and should safely close the intermediate vectors used for calculations on the inheritance path. For instance, the position vectors used for calculating row IDs (when the materialized row ID is null) and the vectors of the underlying materialized row-id reader can safely be closed after reading a batch, since they are scoped to the read. After that, they aren't needed externally, so it makes sense to close them as soon as we know they're no longer needed.

I think the part that cannot be closed is the vectors containing the results of the reader (e.g. the BigIntVector rowIds = allocateBigIntVector(...)), since then we'd be freeing the contents before external readers could use them.
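
A minimal sketch of the ownership rule described above, with hypothetical names: intermediate vectors scoped to a batch are closed eagerly (here via try-with-resources), while the result vector is handed to the caller, who is responsible for closing it.

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.BigIntVector;

    class VectorOwnershipSketch implements AutoCloseable {
      private final BufferAllocator allocator = new RootAllocator();

      // reads one batch of row IDs; the positions vector is an intermediate
      // used only for the inheritance computation, so it is closed before returning
      BigIntVector read(long firstRowId, long startPos, int numVals) {
        BigIntVector result = new BigIntVector("_row_id", allocator);
        result.allocateNew(numVals);
        try (BigIntVector positions = positions(startPos, numVals)) {
          for (int i = 0; i < numVals; i += 1) {
            result.set(i, firstRowId + positions.get(i));
          }
        }
        result.setValueCount(numVals);
        return result; // the result vector is NOT closed here; the caller owns it
      }

      private BigIntVector positions(long startPos, int numVals) {
        BigIntVector positions = new BigIntVector("_pos", allocator);
        positions.allocateNew(numVals);
        for (int i = 0; i < numVals; i += 1) {
          positions.set(i, startPos + i);
        }
        positions.setValueCount(numVals);
        return positions;
      }

      @Override
      public void close() {
        // frees the allocator; callers must have closed result vectors first
        allocator.close();
      }
    }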

@amogh-jahagirdar requested a review from nastra on May 26, 2025 at 15:43
@amogh-jahagirdar (Contributor, Author):

Thanks @nastra, I will go ahead and merge.

@amogh-jahagirdar merged commit 73b179c into apache:main on Jun 5, 2025
39 checks passed