Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader #12928
Conversation
I've separated out a large chunk of the test refactoring to use Iceberg Records.
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/GenericsHelpers.java
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
public static VectorizedArrowReader rowIds(Long baseRowId, VectorizedArrowReader idReader) {
  if (baseRowId != null) {
    return new RowIdVectorReader(baseRowId, idReader);
  } else {
    return nulls();
  }
}

public static VectorizedArrowReader lastUpdated(
    Long baseRowId, Long fileLastUpdated, VectorizedArrowReader seqReader) {
  if (fileLastUpdated != null && baseRowId != null) {
    return new LastUpdatedSeqVectorReader(fileLastUpdated, seqReader);
  } else {
    return nulls();
  }
}

public static VectorizedReader<?> replaceWithMetadataReader(
    Types.NestedField icebergField,
    VectorizedReader<?> reader,
    Map<Integer, ?> idToConstant,
    boolean setArrowValidityVector) {
  int id = icebergField.fieldId();
  if (id == MetadataColumns.ROW_ID.fieldId()) {
    Long baseRowId = (Long) idToConstant.get(id);
    return rowIds(baseRowId, (VectorizedArrowReader) reader);
  } else if (id == MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId()) {
    // the base row id constant is keyed by the ROW_ID field, not by this field's id
    Long baseRowId = (Long) idToConstant.get(MetadataColumns.ROW_ID.fieldId());
    Long fileSeqNumber = (Long) idToConstant.get(id);
    return VectorizedArrowReader.lastUpdated(
        baseRowId, fileSeqNumber, (VectorizedArrowReader) reader);
  } else if (idToConstant.containsKey(id)) {
    // containsKey is used because the constant may be null
    return new ConstantVectorReader<>(icebergField, idToConstant.get(id));
  } else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
    if (setArrowValidityVector) {
      return positionsWithSetArrowValidityVector();
    } else {
      return VectorizedArrowReader.positions();
    }
  } else if (id == MetadataColumns.IS_DELETED.fieldId()) {
    return new DeletedVectorReader();
  }
This is all pretty similar to what was done with the ParquetValueReader
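For context, a minimal, hypothetical sketch of the constants map this dispatch consumes. The field IDs come from MetadataColumns as in the diff above; how idToConstant is actually populated on the Spark side is not shown in this excerpt, so firstRowId and dataSequenceNumber below are placeholder parameters, not names from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.MetadataColumns;

class LineageConstantsSketch {
  // Hypothetical helper: build the idToConstant map consumed by replaceWithMetadataReader.
  // firstRowId may be null (e.g. no first row id assigned for the file); the rowIds(...)
  // factory above then falls back to nulls().
  static Map<Integer, Object> lineageConstants(Long firstRowId, Long dataSequenceNumber) {
    Map<Integer, Object> idToConstant = new HashMap<>();
    idToConstant.put(MetadataColumns.ROW_ID.fieldId(), firstRowId);
    idToConstant.put(MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId(), dataSequenceNumber);
    return idToConstant;
  }
}
```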
VectorizedReader<?> reader =
    VectorizedArrowReader.replaceWithMetadataReader(
        field, readersById.get(field.fieldId()), idToConstant, setArrowValidityVector);
reorderedFields.add(defaultReader(field, reader));
Same as the refactoring done in https://github.com/apache/iceberg/pull/12836/files
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
...ons/src/test/java/org/apache/iceberg/spark/extensions/TestRowLevelOperationsWithLineage.java
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
BigIntVector rowIds = allocateBigIntVector(ROW_ID_ARROW_FIELD, numValsToRead);
ArrowBuf dataBuffer = rowIds.getDataBuffer();
for (int i = 0; i < numValsToRead; i += 1) {
Suggested change:
for (int i = 0; i < numValsToRead; i += 1) {
for (int i = 0; i < numValsToRead; i++) {
Happy to change it if you feel strongly about it, but I mostly just followed the increment pattern of i += 1 already in this class (and this package it looks like). If we do change it, I'd change it for the other instances in this class just to keep things consistent.
I actually didn't realize that we have so many places that do i += 1 in for loops. It's not a big deal and I don't feel strongly about it, but it would be great to fix this throughout the codebase in a separate PR.
@Override
public void close() {
  // don't close vectors as they are not owned by readers
It appears that the vectors are being closed in the read() method.
I made this comment a bit more specific to say the "result vectors", since I previously copy/pasted it from the other reader, but I think the intent is just the result vectors; let me know what you think.

I believe we can and should safely close the intermediate vectors used for calculations on the inheritance path. For instance, the position vectors used for calculating row IDs (when the materialized row ID is null), and the vectors from the underlying materialized row-id reader, can safely be closed after reading a batch since they are scoped to the read(). After that, they aren't used externally, so it makes sense to close them as soon as we know they're no longer needed.

The part that cannot be closed is the vectors containing the results of the reader (e.g. the BigIntVector rowIds = allocateBigIntVector(...)), since then we'd be freeing contents before external consumers could use them.
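To make that ownership split concrete, here is a rough, hypothetical sketch of what a per-batch read on the row-id inheritance path could look like; readPositions, readMaterializedRowIds, readRowIds, and baseRowId are placeholder names, not the exact code in this PR. Intermediate vectors are closed once the batch is assembled, while the result vector is returned to the caller unclosed.

```java
// Hypothetical sketch only: illustrates which vectors are intermediate vs. results.
private BigIntVector readRowIds(int numValsToRead) {
  // Result vector: returned to the caller, so the reader must not close it.
  BigIntVector rowIds = allocateBigIntVector(ROW_ID_ARROW_FIELD, numValsToRead);

  // Intermediate vectors: scoped to this read, safe to close once the batch is built.
  try (BigIntVector positions = readPositions(numValsToRead);
      BigIntVector materializedIds = readMaterializedRowIds(numValsToRead)) {
    for (int i = 0; i < numValsToRead; i += 1) {
      if (materializedIds.isNull(i)) {
        // Inherit the row id: the file's first row id plus the row's position.
        rowIds.set(i, baseRowId + positions.get(i));
      } else {
        rowIds.set(i, materializedIds.get(i));
      }
    }
  }

  rowIds.setValueCount(numValsToRead);
  return rowIds;
}
```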
...c/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java
Thanks @nastra, I will go ahead and merge.
This change adds support for row lineage when performing operations on tables with the default Spark vectorized reader.
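As a usage sketch (not part of this PR): with row lineage enabled on a v3 table, the lineage metadata columns can be projected in Spark, and with this change they are also produced through the vectorized Parquet read path. The catalog and table names below are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowLineageReadSketch {
  public static void main(String[] args) {
    // Placeholder session; assumes an Iceberg catalog named "demo" is configured.
    SparkSession spark = SparkSession.builder().appName("row-lineage-read").getOrCreate();

    // Project the row lineage metadata columns alongside the data columns.
    Dataset<Row> rows =
        spark.sql("SELECT _row_id, _last_updated_sequence_number, * FROM demo.db.tbl");
    rows.show(false);
  }
}
```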