
Conversation

@aihuaxu (Contributor) commented Apr 19, 2025

Implement Variant array write for Parquet.

#12512 implemented the Variant readers for arrays. This PR implements the Variant writer for arrays, adding support for writing both non-shredded and shredded arrays.

@aihuaxu force-pushed the parquet-variant-array-write branch 3 times, most recently from 06a398f to 678ef3c, on April 25, 2025 23:37
@aihuaxu marked this pull request as ready for review on April 25, 2025 23:38
@aihuaxu force-pushed the parquet-variant-array-write branch from 678ef3c to 5234050 on April 25, 2025 23:41
@xxubai (Contributor) left a comment

Looks good to me, maybe put the toString method for VariantArray in another PR as Ryan said?

@aihuaxu (Contributor, Author) commented Apr 27, 2025

> Looks good to me, maybe put the toString method for VariantArray in another PR as Ryan said?

Thanks for reviewing. I needed this for debugging while working on this PR, so I'm thinking of keeping it here.

}

@Override
public void write(int parentRepetition, VariantValue value) {
Contributor:

This looks correct to me.

Variant.of(EMPTY_METADATA, EMPTY_ARRAY),
Variant.of(EMPTY_METADATA, TEST_ARRAY),
Variant.of(EMPTY_METADATA, TEST_NESTED_ARRAY),
Variant.of(TEST_METADATA, TEST_OBJECT_IN_ARRAY),
Contributor:

This test uses ParquetVariantUtil.ParquetSchemaProducer to produce a shredding schema, but that hasn't been updated to support arrays so the shredded case for arrays is not testing the new shredded writer. You can check by running the tests with coverage.

I think it would be a good idea to update ParquetSchemaProducer to shred an array if it has a consistent element type across all elements. Then you'd need to ensure that the test cases here exercise that path by making some of the arrays have one element or a consistent type across elements.
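
A minimal sketch of the consistency check being suggested here, not this PR's actual code: only the array(VariantArray array, List<Type> elementResults) shape and org.apache.parquet.schema.Type come from the snippets in this thread; the class and method names below are illustrative, and building the 3-level list itself is left out.

    import java.util.List;
    import java.util.Objects;
    import org.apache.parquet.schema.Type;

    class ShreddingHeuristicSketch {
      /**
       * Returns the single shared element type if every shredded element result
       * matches it, or null when element types are mixed and the array should
       * fall back to an unshredded (value-only) list.
       */
      static Type consistentElementType(List<Type> elementResults) {
        Type candidate = null;
        for (Type result : elementResults) {
          if (result == null) {
            continue; // element that could not be shredded on its own
          }
          if (candidate == null) {
            candidate = result;
          } else if (!Objects.equals(candidate, result)) {
            return null; // mixed types: do not shred the elements
          }
        }
        return candidate;
      }
    }

With a check like this, ParquetSchemaProducer.array could shred the element when a single type comes back and otherwise produce a list whose elements carry only the value column.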

Contributor:

Thinking about this a little more, I think you'll probably want at least one test where the array is shredded, but values are not (there is no consistent type for elements). Then the mixed test will test whether existing shredding works for arrays with partial shredding.

Here's what I'm thinking for test cases:

  • An array of "string", "iceberg"
  • An array of "string", "iceberg", 34
  • An array of objects with a consistent schema that is the shredding schema
  • An array of objects with the previous case's schema, along with numbers and strings
  • An array of arrays like you have, but with a consistent inner array element type
  • An object with arrays

The idea is to reuse the schemas produced for one case (like the list case) to test other cases in the mixed test; a couple of these arrays are sketched below.
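
A rough sketch of how two of those cases could be built, reusing the test helpers that appear elsewhere in this diff (VariantTestUtil.createArray, Variants.of, and the array(...) helper). The exact helper signatures, the Variants.of(34) overload, the field names, and the "apache"/"parquet" values are assumptions for illustration, not verified against the PR.

    // Case: "string", "iceberg", 34 -- the array can be shredded as a list, but
    // there is no single element type, so elements stay in the value column.
    private static final ByteBuffer MIXED_ELEMENT_ARRAY_BUFFER =
        VariantTestUtil.createArray(
            Variants.of("string"), Variants.of("iceberg"), Variants.of(34));

    // Case: nested arrays with a consistent inner element type (string), using
    // different values in the second inner array so assertions can tell them apart.
    private static final ByteBuffer CONSISTENT_NESTED_ARRAY_BUFFER =
        VariantTestUtil.createArray(
            array(Variants.of("string"), Variants.of("iceberg")),
            array(Variants.of("apache"), Variants.of("parquet")));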

Contributor:

Hi @rdblue. Friendly reminder: I also noticed the lack of array support in ParquetVariantUtil.ParquetSchemaProducer. However, I also observed code duplication in ParquetSchemaProducer across TestVariantReaders and ParquetVariantUtil.

To address this, I moved the ParquetSchemaProducer class outside of the ParquetVariantUtil class to facilitate code reuse and ease of future modifications.

I submitted a PR for this: #12916. Thanks.

Contributor:

Thanks, @XBaith. I think that we need to include the changes to ParquetSchemaProducer here. Otherwise the changes aren't exercised and we need those changes to properly test this PR.

Would it be alright with you if your changes were picked into this PR and you were listed as a co-author?

@aihuaxu (Contributor, Author) commented Apr 28, 2025

Sorry, I was making the changes in TestVariantWriter before ParquetSchemaProducer was refactored, and the rebase lost that change.

I was using the most common type rather than requiring a single type across all the array elements, since that should be closer to what engines will do. And we will cover the cases where some elements are shredded while some are not. Let me know your thoughts.
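
A self-contained sketch of the "most common type" selection described above, roughly the shape of the stream pipeline visible in the diff further down this thread; the class and method names are illustrative, and (as a later comment points out) the counting leans on Type#equals/hashCode.

    import java.util.List;
    import java.util.Map;
    import java.util.Objects;
    import java.util.stream.Collectors;
    import org.apache.parquet.schema.Type;

    class MostCommonTypeSketch {
      /** Picks the element type that occurs most often, or null if nothing was shredded. */
      static Type mostCommonType(List<Type> elementResults) {
        Map<Type, Long> counts =
            elementResults.stream()
                .filter(Objects::nonNull) // elements that could not be shredded contribute no type
                .collect(Collectors.groupingBy(type -> type, Collectors.counting()));
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElse(null);
      }
    }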

Contributor (Author):

@rdblue I checked #12916. Actually, it can be separated out since it is a refactoring, and TestVariantWriter and TestVariantReader should provide the same coverage.

Contributor (Author):

I added more test coverage, with one difference: we are shredding to the most common type in an array. Let me know your thoughts on that.

Contributor:

I agree with implementing the array method for ParquetSchemaProducer in this PR.

However, I intentionally did not make changes to this part in my PR since #12916 mainly focuses on refactoring.

  public Type array(VariantArray array, List<Type> elementResults) {
-   return null;
+   if (elementResults.isEmpty()) {
+     return null;
Contributor:

Does this not support shredding an array of encoded variants? This could be a list with only an inner value column?
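
For context on the question, a sketch of what "a list with only an inner value column" could look like, written as a Parquet schema string. MessageTypeParser is the real Parquet parser, but the field layout follows the variant shredding layout discussed in this thread and is an assumption, not this PR's generated schema.

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    class ValueOnlyListSchemaSketch {
      // The array itself is shredded (typed_value is a 3-level list), but each
      // element keeps only the binary value column, so arbitrary encoded
      // variants can still be stored as elements.
      static final MessageType VALUE_ONLY_LIST =
          MessageTypeParser.parseMessageType(
              "message table {\n"
                  + "  optional group var {\n"
                  + "    required binary metadata;\n"
                  + "    optional binary value;\n"
                  + "    optional group typed_value (LIST) {\n"
                  + "      repeated group list {\n"
                  + "        required group element {\n"
                  + "          optional binary value;\n"
                  + "        }\n"
                  + "      }\n"
                  + "    }\n"
                  + "  }\n"
                  + "}");
    }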

}

// Choose most common type as shredding type and build 3-level list
Type defaultTYpe = elementResults.get(0);
Contributor:

Typo: defaultTYpe

.stream()
.max(Map.Entry.comparingByValue())
.map(Map.Entry::getKey)
.orElse(defaultTYpe);
Contributor:

If the list is not empty, then this should never be used. Why default it?

// Choose most common type as shredding type and build 3-level list
Type defaultTYpe = elementResults.get(0);
Type shredType =
elementResults.stream()
Contributor:

This is okay, but it seems strange to me to rely on equals and hashCode for types and counting. It would be simpler to only shred values if there is a uniform type, which is when we would benefit from shredding.

private static final ByteBuffer NESTED_ARRAY_BUFFER =
VariantTestUtil.createArray(
array(Variants.of("string"), Variants.of("iceberg")),
array(Variants.of("string"), Variants.of("iceberg")));
Contributor:

Can you use different string values for the second array?

@rdblue (Contributor) left a comment

Looks good to me.

@rdblue (Contributor) commented Apr 29, 2025

I had a few minor comments, but mostly they are concerned with how we determine the shredded type of an array. Everything else is correct, so I'll merge this and we can think about changing the heuristic in future PRs.

@rdblue merged commit 242717c into apache:main on Apr 29, 2025 (43 checks passed)