
Conversation

Contributor

@Fokko Fokko commented Aug 28, 2025

@Fokko
Contributor Author

Fokko commented Sep 2, 2025

Looked into the failed test, and it looks like it splits a single partition into two files:

image

Turns out there is a small difference in the size due to:

➜  parquet-java git:(d5f86d7c) ✗ git bisect bad
d5f86d7c0e9894510e8af6dfd37444843e6d1bc4 is the first bad commit
commit d5f86d7c0e9894510e8af6dfd37444843e6d1bc4
Author: Gang Wu <[email protected]>
Date:   Tue Jan 21 16:18:19 2025 +0800

    GH-3133: Fix SizeStatistics to handle omitted histogram (#3134)

 .../apache/parquet/column/statistics/SizeStatistics.java |  6 ++++--
 .../parquet/column/statistics/TestSizeStatistics.java    | 16 ++++++++++++++++
 .../format/converter/ParquetMetadataConverter.java       | 10 ++++++++--

@Fokko Fokko force-pushed the fd-test-parquet-1-16 branch from 66bd095 to 6309f60 Compare September 2, 2025 09:15
@Fokko Fokko marked this pull request as ready for review September 2, 2025 16:01
List<FileScanTask> files =
StreamSupport.stream(table.newScan().planFiles().spliterator(), false)
.collect(Collectors.toList());
assertThat(files.size()).as("Did not have the expected number of files").isEqualTo(numExpected);
Contributor

@stevenzwu stevenzwu Sep 2, 2025

Nit: I thought the style below is the preferred assertion style, since it gives a better error message:

assertThat(files).as("Did not have the expected number of files").hasSize(numExpected);

Contributor Author

I changed this to easily set a breakpoint and inspect files. I left it like that since it might be helpful in the future.

Contributor

ok. I will leave it up to you.

Changing the assertion doesn't prevent setting a breakpoint and inspecting files, since the files are collected before the assertion; new lines 2171-2173 stay as they are in this PR.
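
As a rough illustration of the point above, here is a minimal sketch of collecting an `Iterable` into a `List` before asserting on it, so the whole list is available at a breakpoint whichever assertion style follows. The class name `CollectBeforeAssert`, the helper `toList`, and the file names are hypothetical; the plain `Iterable<String>` is only a stand-in for `table.newScan().planFiles()`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical demo class; not part of the PR.
class CollectBeforeAssert {
  // Materialize any Iterable into a List so it can be inspected in a debugger.
  static <T> List<T> toList(Iterable<T> iterable) {
    return StreamSupport.stream(iterable.spliterator(), false).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Assumption: a plain list stands in for the planned file scan tasks.
    Iterable<String> planned = Arrays.asList("file-a.parquet", "file-b.parquet");

    // The list is fully collected here, before any assertion runs.
    List<String> files = toList(planned);

    // A breakpoint on this line sees every element of `files`.
    System.out.println(files.size() + " files: " + files);
  }
}
```

Because the collection happens on its own line, either `assertThat(files.size())...isEqualTo(n)` or `assertThat(files)...hasSize(n)` can follow it without affecting what is visible in the debugger.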

Contributor

    List<String> list = Arrays.asList("a", "b", "c");
    assertThat(list).hasSize(4);

The above code will fail with the following helpful error message:

Expected size: 4 but was: 3 in:
["a", "b", "c"]

@kevinjqliu
Contributor

kevinjqliu commented Sep 2, 2025

Could not find org.apache.parquet:parquet-avro:1.16.0.
Searched in the following locations:
- https://repo.maven.apache.org/maven2/org/apache/parquet/parquet-avro/1.16.0/parquet-avro-1.16.0.pom
- file:/home/runner/.m2/repository/org/apache/parquet/parquet-avro/1.16.0/parquet-avro-1.16.0.pom

1.16.0 not here yet https://repo.maven.apache.org/maven2/org/apache/parquet/parquet-avro/

1.16.0RC2 already has 3 binding votes, just not officially released yet
https://lists.apache.org/thread/rb0gorvx1lysch6yxks72h94kqhsp719

@github-actions github-actions bot added the API label Sep 2, 2025
@Fokko Fokko changed the title Parquet: Test out the Parquet-Java 1.16.0 release Build: Bump Parquet-Java to 1.16.0 Sep 2, 2025
// added to Parquet
// Preconditions.checkArgument(
// sType instanceof VariantType, "Invalid variant: %s is not a VariantType", sType);
} else if (sType instanceof VariantType
Contributor

why do we need to allow both?

Contributor Author

Ideally we don't need both, but if there is any data that has been written without the annotation, then we can fall back to the Iceberg schema. The important part is that we set the annotations in TypeToMessageType.java.

Contributor

but if there is any data that has been written without the annotation, then we can fallback to the Iceberg schema.

Do we want to support this scenario?

Contributor Author

Ideally not :)

Member

I assume this is so that we can support any Variants written by Spark 4.0; otherwise we couldn't possibly import them, right?

Contributor Author

Ah, I didn't think of that use-case. Spark 4 was released before the annotation, so I think you're right there 👍

Member

I believe in trusting the Iceberg schema / Spark type first; if the annotation is missing, I think we should still just read and be happy :)

Contributor

@Fokko can we also add a comment to explain why we are checking both? Other non-primitive types (like list and map) only check the annotation.

Contributor

Agreed that we need to handle both cases (the annotation is missing but sType is a variant, and the logical type is a variant) to support existing data.

Also, should we switch the order so that the old way is the fallback, as in:

LogicalTypeAnnotation.variantType(Variant.VARIANT_SPEC_VERSION).equals(annotation) || sType instanceof VariantType.

Contributor Author

Updated and added the comments 👍
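
For readers following the thread, the check discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: `IcebergType`, `VariantType`, `StringType`, the string constant `VARIANT_ANNOTATION`, and `isVariant` are hypothetical stand-ins defined here, not the real Parquet or Iceberg classes, and the merged code may differ.

```java
// Hypothetical sketch; none of these types are the real Parquet/Iceberg classes.
class VariantCheckSketch {
  // Stand-in for the Iceberg (or Spark) side of the schema.
  interface IcebergType {}

  static final class VariantType implements IcebergType {}

  static final class StringType implements IcebergType {}

  // Stand-in for LogicalTypeAnnotation.variantType(Variant.VARIANT_SPEC_VERSION).
  static final String VARIANT_ANNOTATION = "VARIANT(1)";

  // Prefer the Parquet annotation; fall back to the expected schema type so
  // files written before the annotation existed can still be read.
  static boolean isVariant(String annotation, IcebergType sType) {
    // Calling equals on the constant keeps the check safe when annotation is null.
    return VARIANT_ANNOTATION.equals(annotation) || sType instanceof VariantType;
  }

  public static void main(String[] args) {
    // Annotated file.
    System.out.println(isVariant(VARIANT_ANNOTATION, new VariantType()));
    // File written without the annotation, schema says variant.
    System.out.println(isVariant(null, new VariantType()));
    // Neither annotation nor schema says variant.
    System.out.println(isVariant(null, new StringType()));
  }
}
```

Putting the annotation check first matches the ordering suggested in the review, while the `instanceof` fallback covers data written before the annotation was available.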

@Fokko Fokko force-pushed the fd-test-parquet-1-16 branch from f960d54 to 7a2bf7d Compare September 2, 2025 19:18
Contributor

@stevenzwu stevenzwu left a comment

LGTM. Let's see how CI goes once the Parquet binaries are released.

@kevinjqliu
Contributor

push an empty commit to trigger ci, it takes a while :)

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM!

@stevenzwu stevenzwu merged commit 12ab7fc into apache:main Sep 3, 2025
43 checks passed
@stevenzwu
Contributor

thanks @Fokko for the contribution, and @aihuaxu @RussellSpitzer @nastra @kevinjqliu for the reviews

@RussellSpitzer
Member

push an empty commit to trigger ci, it takes a while :)

Note that you can "re-run" failed tests from the GitHub workflows UI too (if you are on the PMC)

@Fokko
Contributor Author

Fokko commented Sep 3, 2025

(if you are on the PMC)

I think it should also be available for committers.

@Fokko Fokko deleted the fd-test-parquet-1-16 branch September 3, 2025 15:41