Spark 3.5, Arrow: Support for Row lineage when using the Parquet Vectorized reader #12928
Merged: amogh-jahagirdar merged 6 commits into apache:main from amogh-jahagirdar:vectorized-parquet-row-lineage on Jun 5, 2025
Commits (6):
908c14f Spark, Arrow: Support for Row lineage when doing Vectorized Parquet r… (amogh-jahagirdar)
896a93b fixes (amogh-jahagirdar)
11dac99 remove unused methods (amogh-jahagirdar)
2c4f57f bit more cleanup (amogh-jahagirdar)
e63f33e make sure we're closing intermediate batches while reading (amogh-jahagirdar)
544302b Add a test which tests many records, cleanup inline comments (amogh-jahagirdar)
@@ -24,11 +24,9 @@
 import java.util.function.Function;
 import java.util.stream.IntStream;
 import org.apache.arrow.memory.BufferAllocator;
-import org.apache.iceberg.MetadataColumns;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.arrow.ArrowAllocation;
 import org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.ConstantVectorReader;
-import org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.DeletedVectorReader;
 import org.apache.iceberg.parquet.TypeWithSchemaVisitor;
 import org.apache.iceberg.parquet.VectorizedReader;
 import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
@@ -101,33 +99,26 @@ public VectorizedReader<?> message(
         Lists.newArrayListWithExpectedSize(icebergFields.size());

     for (Types.NestedField field : icebergFields) {
-      int id = field.fieldId();
-      VectorizedReader<?> reader = readersById.get(id);
-      if (idToConstant.containsKey(id)) {
-        reorderedFields.add(constantReader(field, idToConstant.get(id)));
-      } else if (id == MetadataColumns.ROW_POSITION.fieldId()) {
-        if (setArrowValidityVector) {
-          reorderedFields.add(VectorizedArrowReader.positionsWithSetArrowValidityVector());
-        } else {
-          reorderedFields.add(VectorizedArrowReader.positions());
-        }
-      } else if (id == MetadataColumns.IS_DELETED.fieldId()) {
-        reorderedFields.add(new DeletedVectorReader());
-      } else if (reader != null) {
-        reorderedFields.add(reader);
-      } else if (field.initialDefault() != null) {
-        reorderedFields.add(
-            constantReader(field, convert.apply(field.type(), field.initialDefault())));
-      } else if (field.isOptional()) {
-        reorderedFields.add(VectorizedArrowReader.nulls());
-      } else {
-        throw new IllegalArgumentException(
-            String.format("Missing required field: %s", field.name()));
-      }
+      VectorizedReader<?> reader =
+          VectorizedArrowReader.replaceWithMetadataReader(
+              field, readersById.get(field.fieldId()), idToConstant, setArrowValidityVector);
+      reorderedFields.add(defaultReader(field, reader));
     }
     return vectorizedReader(reorderedFields);
   }

+  private VectorizedReader<?> defaultReader(Types.NestedField field, VectorizedReader<?> reader) {
+    if (reader != null) {
+      return reader;
+    } else if (field.initialDefault() != null) {
+      return constantReader(field, convert.apply(field.type(), field.initialDefault()));
+    } else if (field.isOptional()) {
+      return VectorizedArrowReader.nulls();
+    }
+
+    throw new IllegalArgumentException(String.format("Missing required field: %s", field.name()));
+  }
+
   private <T> ConstantVectorReader<T> constantReader(Types.NestedField field, T constant) {
     return new ConstantVectorReader<>(field, constant);
   }

Review comment on lines +102 to +105:
Same as the refactoring done in https://github.com/apache/iceberg/pull/12836/files
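The fallback order in defaultReader from the diff above can be tried in isolation. The following is a minimal sketch, not the Iceberg code: the class DefaultReaderSketch, its Field type, and chooseReader are hypothetical stand-ins, and plain strings take the place of VectorizedReader instances so the example runs without Iceberg or Arrow on the classpath. It only mirrors the order of checks: an existing reader wins, then an initial default becomes a constant, then an optional field reads as nulls, and a missing required field is an error.

```java
public class DefaultReaderSketch {
  /** Hypothetical stand-in for Types.NestedField, holding only what the fallback inspects. */
  public static final class Field {
    final String name;
    final boolean optional;
    final Object initialDefault;

    public Field(String name, boolean optional, Object initialDefault) {
      this.name = name;
      this.optional = optional;
      this.initialDefault = initialDefault;
    }
  }

  /** Follows the same order of checks as defaultReader, returning a label instead of a reader. */
  public static String chooseReader(Field field, String reader) {
    if (reader != null) {
      return reader; // a reader was already resolved for this field
    } else if (field.initialDefault != null) {
      return "constant(" + field.initialDefault + ")"; // column absent from the file, default known
    } else if (field.optional) {
      return "nulls"; // column absent from the file, optional field
    }

    // column absent from the file, required, and no default to fall back on
    throw new IllegalArgumentException(String.format("Missing required field: %s", field.name));
  }

  public static void main(String[] args) {
    System.out.println(chooseReader(new Field("id", false, null), "column-reader"));
    System.out.println(chooseReader(new Field("flag", false, true), null));
    System.out.println(chooseReader(new Field("note", true, null), null));
    try {
      chooseReader(new Field("ts", false, null), null);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Keeping the chain in one helper means the loop body only has to resolve a candidate reader and hand it over, which is the shape the refactored loop in the diff takes.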
Happy to change it if you feel strongly about it, but I mostly just followed the increment pattern of i += 1 already in this class (and this package it looks like). If we do change it, I'd change it for the other instances in this class just to keep things consistent.
I actually didn't realize that we have so many places that do i += 1 in for loops. It's not a big deal and I don't feel strongly about it, but it would be great to fix this throughout the codebase in a separate PR.