-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Build: Bump Parquet-Java to 1.16.0 #13941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Looked into teh failed test, and it looks like it splits a single partition into two files: ![]() Turns out there is a small difference in the size due to:
|
66bd095
to
6309f60
Compare
List<FileScanTask> files = | ||
StreamSupport.stream(table.newScan().planFiles().spliterator(), false) | ||
.collect(Collectors.toList()); | ||
assertThat(files.size()).as("Did not have the expected number of files").isEqualTo(numExpected); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I thought below is the preferred assertion style with better error msg
assertThat(files).as("Did not have the expected number of files").hasSize(numExpected);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this to easily set a breakpoint and inspect files. I left it like that since it might be helpful in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. I will leave it up to you.
changing the assertion doesn't prevent the capability of setting a breakpoint and inspecting files, since the files are collected before the assertion, as new lines 2171-2173 stay as they are in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List<String> list = Arrays.asList("a", "b", "c");
assertThat(list).hasSize(4);
Above code will fail with the following helpful error msg
Expected size: 4 but was: 3 in:
["a", "b", "c"]
parquet/src/main/java/org/apache/iceberg/parquet/TypeToMessageType.java
Outdated
Show resolved
Hide resolved
1.16.0 not here yet https://repo.maven.apache.org/maven2/org/apache/parquet/parquet-avro/ 1.16.0RC2 already has 3 binding votes, just not officially released yet |
β¦fd-test-parquet-1-16
spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/ParquetWithSparkSchemaVisitor.java
Outdated
Show resolved
Hide resolved
// added to Parquet | ||
// Preconditions.checkArgument( | ||
// sType instanceof VariantType, "Invalid variant: %s is not a VariantType", sType); | ||
} else if (sType instanceof VariantType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to allow both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we don't need both, but if there is any data that has been written without the annotation, then we can fallback to the Iceberg schema. The important part is that we set the annotations in TypeToMessageType.java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if there is any data that has been written without the annotation, then we can fallback to the Iceberg schema.
Do we want to support this scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally not :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is for us supporting any Variants written by Spark in 4.0, otherwise we couldn't possibly import them right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't think of that use-case. Spark 4 was released before the annotation, so I think you're right there π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe in the Iceberg Schema / Spark Type first and if the annotation is missing, I think we should still just read and be happy :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko can we also add some comment to explain why we are checking both? other non-primitive types (like list and map) only checks the annotation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that we need to handle both cases (missing annotation but sType is variant and the logical type is variant) to support the existing data.
Also should we switch to fallback the old way as
LogicalTypeAnnotation.variantType(Variant.VARIANT_SPEC_VERSION).equals(annotation) || sType instanceof VariantType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and added the comments π
f960d54
to
7a2bf7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. let' see how CI goes once Parquet binaries are released.
push an empty commit to trigger ci, it takes a while :) |
// added to Parquet | ||
// Preconditions.checkArgument( | ||
// sType instanceof VariantType, "Invalid variant: %s is not a VariantType", sType); | ||
} else if (sType instanceof VariantType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that we need to handle both cases (missing annotation but sType is variant and the logical type is variant) to support the existing data.
Also should we switch to fallback the old way as
LogicalTypeAnnotation.variantType(Variant.VARIANT_SPEC_VERSION).equals(annotation) || sType instanceof VariantType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
thanks @Fokko for the contribution, and @aihuaxu @RussellSpitzer @nastra @kevinjqliu for the reviews |
Note that you can "re-run" failed tests from the github workloads ui too (if you are on the PMC) |
I think it should also be available for committers. |
Parquet 1.16.0 is out: https://lists.apache.org/thread/hs3tkc3k9vtcogq08yk7zl2p855voyvc