Parquet variant array write #12847
parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantWriters.java
Looks good to me, maybe put the toString method for VariantArray in another PR as Ryan said?
Thanks for reviewing. I need this for debugging while working on this PR, so I'm thinking of keeping it here.
    }

    @Override
    public void write(int parentRepetition, VariantValue value) {
This looks correct to me.
    Variant.of(EMPTY_METADATA, EMPTY_ARRAY),
    Variant.of(EMPTY_METADATA, TEST_ARRAY),
    Variant.of(EMPTY_METADATA, TEST_NESTED_ARRAY),
    Variant.of(TEST_METADATA, TEST_OBJECT_IN_ARRAY),
This test uses ParquetVariantUtil.ParquetSchemaProducer to produce a shredding schema, but that hasn't been updated to support arrays, so the shredded case for arrays is not testing the new shredded writer. You can check by running the tests with coverage.
I think it would be a good idea to update ParquetSchemaProducer to shred an array if it has a consistent element type across all elements. Then you'd need to ensure that the test cases here exercise that path by making some of the arrays have one element or a consistent type across elements.
Thinking about this a little more, I think you'll probably want at least one test where the array is shredded, but values are not (there is no consistent type for elements). Then the mixed test will test whether existing shredding works for arrays with partial shredding.
Here's what I'm thinking for test cases:
- An array of "string", "iceberg"
- An array of "string", "iceberg", 34
- An array of objects with a consistent schema that is the shredding schema
- An array of objects with the previous case's schema, along with numbers and strings
- An array of arrays like you have, but with a consistent inner array element type
- An object with arrays
The idea is to use the schemas produced from one test (like list) to test other cases in the mixed test (like list).
Hi @rdblue. Friendly reminder: I also noticed the lack of array support in ParquetVariantUtil.ParquetSchemaProducer. However, I also observed that ParquetSchemaProducer is duplicated across TestVariantReaders and ParquetVariantUtil.
To address this, I moved the ParquetSchemaProducer class out of the ParquetVariantUtil class to make it easier to reuse and modify in the future.
I submitted a PR for this: #12916. Thanks.
Thanks, @XBaith. I think that we need to include the changes to ParquetSchemaProducer here. Otherwise the changes aren't exercised, and we need those changes to properly test this PR.
Would it be alright with you if your changes were picked into this PR and you were listed as a co-author?
Sorry, I was making the changes in TestVariantWriter before ParquetSchemaProducer was refactored, and the rebase lost that change.
I was using the most common type rather than requiring a single type across all the array elements, since that should be closer to what engines will do. We will also cover the cases where some elements are shredded while others are not. Let me know your thoughts.
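To make the heuristic described here concrete, the following is a minimal sketch of picking the most common element type. It is not the PR's code: `MostCommonType` and `mostCommon` are hypothetical names, element types are modeled as plain strings rather than Parquet `Type` objects, and the input is assumed to contain no nulls. Returning an `Optional` avoids the need for a default value when the list is empty.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Iceberg.
// Element types are modeled as strings (assumption), and nulls are not supported.
final class MostCommonType {
  private MostCommonType() {}

  // Returns the most frequent element type, or empty when there are no elements.
  static Optional<String> mostCommon(List<String> elementTypes) {
    // Count how many times each type occurs, keeping first-seen order.
    Map<String, Long> counts =
        elementTypes.stream()
            .collect(
                Collectors.groupingBy(
                    Function.identity(), LinkedHashMap::new, Collectors.counting()));

    // Pick the entry with the highest count; empty only if the input was empty.
    return counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
  }
}
```

With element types `string`, `string`, `int`, this sketch selects `string`; under such a heuristic the `int` element would presumably be written unshredded, which is the partially shredded case the tests are meant to cover.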
I added more test coverage, with one difference that we are shredding to the most common type in an array. Let me know your thoughts on that.
I agree with implementing the array method for ParquetSchemaProducer in this PR.
However, I intentionally did not make changes to this part in my PR, since #12916 mainly focuses on refactoring.
    public Type array(VariantArray array, List<Type> elementResults) {
      return null;
      if (elementResults.isEmpty()) {
        return null;
Does this not support shredding an array of encoded variants? This could be a list with only an inner value column?
      }

      // Choose most common type as shredding type and build 3-level list
      Type defaultTYpe = elementResults.get(0);
Typo: defaultTYpe
              .stream()
              .max(Map.Entry.comparingByValue())
              .map(Map.Entry::getKey)
              .orElse(defaultTYpe);
If the list is not empty, then this should never be used. Why default it?
      // Choose most common type as shredding type and build 3-level list
      Type defaultTYpe = elementResults.get(0);
      Type shredType =
          elementResults.stream()
This is okay, but it seems strange to me to rely on equals and hashCode for types and counting. It would be simpler to only shred values if there is a uniform type, which is when we would benefit from shredding.
    private static final ByteBuffer NESTED_ARRAY_BUFFER =
        VariantTestUtil.createArray(
            array(Variants.of("string"), Variants.of("iceberg")),
            array(Variants.of("string"), Variants.of("iceberg")));
Can you use different string values for the second array?
Looks good to me.
I had a few minor comments, but mostly they are concerned with how we determine the shredded type of an array. Everything else is correct so I'll merge this and we can think about changing the heuristic in future PRs.
Implement Variant array write for Parquet.
#12512 implemented the Variant readers for arrays. This PR implements the Variant writer for arrays, covering both non-shredded and shredded arrays.