Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader #12928
Conversation
I've separated out a large chunk of the test refactoring to use Iceberg Records in
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/GenericsHelpers.java
spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/data/TestHelpers.java
```java
public static VectorizedArrowReader rowIds(Long baseRowId, VectorizedArrowReader idReader) {
  if (baseRowId != null) {
    return new RowIdVectorReader(baseRowId, idReader);
  } else {
    return nulls();
  }
}

public static VectorizedArrowReader lastUpdated(
    Long baseRowId, Long fileLastUpdated, VectorizedArrowReader seqReader) {
  if (fileLastUpdated != null && baseRowId != null) {
    return new LastUpdatedSeqVectorReader(fileLastUpdated, seqReader);
  } else {
    return nulls();
  }
}

public static VectorizedReader<?> replaceWithMetadataReader(
    Types.NestedField icebergField,
    VectorizedReader<?> reader,
    Map<Integer, ?> idToConstant,
    boolean setArrowValidityVector) {
  int id = icebergField.fieldId();
  if (id == MetadataColumns.ROW_ID.fieldId()) {
    Long baseRowId = (Long) idToConstant.get(id);
    return rowIds(baseRowId, (VectorizedArrowReader) reader);
  } else if (id == MetadataColumns.LAST_UPDATED_SEQUENCE_NUMBER.fieldId()) {
    // the base row ID is keyed by the _row_id field ID, not this field's ID
    Long baseRowId = (Long) idToConstant.get(MetadataColumns.ROW_ID.fieldId());
    Long fileSeqNumber = (Long) idToConstant.get(id);
    return VectorizedArrowReader.lastUpdated(
        baseRowId, fileSeqNumber, (VectorizedArrowReader) reader);
  } else if (idToConstant.containsKey(id)) {
    // containsKey is used because the constant may be null
    return new ConstantVectorReader<>(icebergField, idToConstant.get(id));
  } else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
    if (setArrowValidityVector) {
      return positionsWithSetArrowValidityVector();
    } else {
      return VectorizedArrowReader.positions();
    }
  } else if (id == MetadataColumns.IS_DELETED.fieldId()) {
    return new DeletedVectorReader();
  }

  return reader;
}
```
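For background on what these readers compute: under the v3 spec, a row whose `_row_id` is not materialized in the data file inherits the file's `first_row_id` plus the row's position in the file, and `_last_updated_sequence_number` similarly falls back to the file's data sequence number. A minimal sketch of that inheritance rule, assuming a batch of possibly-null materialized IDs (illustrative only, not the actual `RowIdVectorReader` implementation):

```java
// Illustrative sketch of v3 row-ID inheritance; not the PR's implementation.
class RowIdInheritanceSketch {
  // baseRowId: the file's first_row_id; batchOffset: position of the first
  // row of this batch within the file; materializedIds: values read from the
  // _row_id column, null where the ID must be inherited.
  static long[] inheritRowIds(long baseRowId, long batchOffset, Long[] materializedIds) {
    long[] rowIds = new long[materializedIds.length];
    for (int i = 0; i < materializedIds.length; i++) {
      Long materialized = materializedIds[i];
      rowIds[i] = materialized != null ? materialized : baseRowId + batchOffset + i;
    }
    return rowIds;
  }
}
```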
This is all pretty similar to what was done with the ParquetValueReader.
```java
VectorizedReader<?> reader =
    VectorizedArrowReader.replaceWithMetadataReader(
        field, readersById.get(field.fieldId()), idToConstant, setArrowValidityVector);
reorderedFields.add(defaultReader(field, reader));
```
Same as the refactoring done in https://github.com/apache/iceberg/pull/12836/files
```diff
@@ -113,33 +113,6 @@ public static void assertEqualsSafe(Types.StructType struct, Record rec, Row row
     }
   }

-  public static void assertEqualsBatch(
```
No longer needed. It's public, but it lives in Spark's test code, so I think we're OK to remove it directly; I don't think removing it breaks any compatibility guarantees the project provides.
```diff
@@ -91,7 +91,6 @@ public void beforeEach() {
   assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
   // ToDo: Remove these as row lineage inheritance gets implemented in the other readers
   assumeThat(fileFormat).isEqualTo(FileFormat.PARQUET);
-  assumeThat(vectorized).isFalse();
```
I should probably add a test that creates a lot of records (more than a single batch in a vectorized read), modifies the records with even IDs, and asserts the row lineage state; right now this test effectively covers only a single batch. Theoretically everything should just work, but an explicit test would be better. A rough sketch follows below.
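A rough sketch of what such a test could look like. The table name, row count, and SQL are hypothetical placeholders, not from the PR; `_row_id` and `_last_updated_sequence_number` are the spec's lineage metadata columns, and the sketch assumes the table already holds more rows than one vectorized batch:

```java
import java.util.List;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical multi-batch row lineage test; names and sizes are illustrative.
public class RowLineageBatchTestSketch {
  private final SparkSession spark = SparkSession.active();

  public void testRowLineageAcrossBatches() {
    // Assumes db.tbl is a v3 table with (id BIGINT, data STRING) holding more
    // rows than one vectorized batch, so the scan spans several batches.
    List<Row> before =
        spark.sql("SELECT id, _row_id, _last_updated_sequence_number FROM db.tbl ORDER BY id")
            .collectAsList();

    // Modify every row with an even ID.
    spark.sql("UPDATE db.tbl SET data = 'updated' WHERE id % 2 = 0");

    List<Row> after =
        spark.sql("SELECT id, _row_id, _last_updated_sequence_number FROM db.tbl ORDER BY id")
            .collectAsList();

    // Expected: _row_id is stable for every row, and only even-ID rows have
    // an advanced _last_updated_sequence_number.
  }
}
```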
This change adds support for row lineage when performing operations on tables with the default Spark vectorized reader.
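For example, with this change the lineage metadata columns can be projected directly through the vectorized Parquet path. A hypothetical read session (`db.tbl` is a placeholder; `vectorization-enabled` is Iceberg's Spark read option, set explicitly here only to make the code path obvious since Parquet vectorization is on by default):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowLineageReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // Project the v3 lineage metadata columns through the vectorized reader.
    Dataset<Row> lineage =
        spark.read()
            .option("vectorization-enabled", "true")
            .table("db.tbl") // hypothetical table
            .selectExpr("id", "_row_id", "_last_updated_sequence_number");
    lineage.show();
  }
}
```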