
Conversation

amogh-jahagirdar
Contributor

This change updates the Spark Parquet vectorized read tests to write and validate against Iceberg generic Records instead of Avro generic records. The Iceberg generic Record is the interface we should be testing against, since it avoids the intricacies of Avro data types, which currently bubble through when we build expectations. Refer to ab92d6e#diff-e649f357cf9965d322086f95bc78451fa1e61e4612f3d8af2114d93a1bd657aa for similar changes made for the non-vectorized Parquet reader.

This refactoring prepares for the row lineage changes required for the vectorized reader. Since there are quite a few changes involved, I've separated out the test portion first so the row lineage changes are easier to review.

Comment on lines +320 to +337
      case LONG:
        assertThat(actual).as("Should be a long").isInstanceOf(Long.class);
        if (expected instanceof Integer) {
          assertThat(actual).as("Values didn't match").isEqualTo(((Number) expected).longValue());
        } else {
          assertThat(actual).as("Primitive value should be equal to expected").isEqualTo(expected);
        }
        break;
      case DOUBLE:
        assertThat(actual).as("Should be a double").isInstanceOf(Double.class);
        if (expected instanceof Float) {
          assertThat(Double.doubleToLongBits((double) actual))
              .as("Values didn't match")
              .isEqualTo(Double.doubleToLongBits(((Number) expected).doubleValue()));
        } else {
          assertThat(actual).as("Primitive value should be equal to expected").isEqualTo(expected);
        }
        break;

This is needed because of the int-to-long and float-to-double type promotion tests. It is the same as what exists in TestHelpers#assertEqualsUnsafe.
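As a standalone illustration (not the Iceberg test code itself), the sketch below shows why promoted values need widening before comparison. The class name and values are hypothetical, but the `Number` widening and the `Double.doubleToLongBits` bit-pattern comparison mirror the assertions above.

```java
public class PromotionCheck {
    public static void main(String[] args) {
        Object expected = 42;  // original int value as written
        Object actual = 42L;   // value read back after int -> long column promotion

        // Direct equals() fails across boxed types: Integer(42) is not Long(42)
        System.out.println(expected.equals(actual));

        // Widening the expected value through Number makes the comparison meaningful
        System.out.println(((Number) expected).longValue() == (Long) actual);

        // For float -> double promotion, comparing bit patterns also makes
        // NaN compare equal to NaN, which plain == would not
        Object expectedFloat = Float.NaN;
        Object actualDouble = Double.valueOf(Float.NaN);
        System.out.println(
            Double.doubleToLongBits(((Number) expectedFloat).doubleValue())
                == Double.doubleToLongBits((Double) actualDouble));
    }
}
```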


I should double check why this wasn't failing before.

@amogh-jahagirdar Apr 29, 2025

Disregard, my first comment still stands.

I confused myself: we only run testReadsForTypePromotedColumns for the vectorized Parquet reader test, so it makes sense that this wasn't failing before. Previously we were calling TestHelpers#assertEqualsUnsafe, which has this same logic to ensure the right assertions in the context of type-promoted columns.

@amogh-jahagirdar amogh-jahagirdar force-pushed the refactor-vectorized-tests-generic-data branch from c2c5827 to a0e2ce1 Compare April 29, 2025 01:25
Comment on lines 267 to 269
RecordComparator comparator = new RecordComparator();
List<Record> records = Lists.newArrayList(IcebergGenerics.read(table).build());

@amogh-jahagirdar Apr 29, 2025

cc @rdblue I changed this logic because some duplicate records are now generated (the random generation of null fields is ever so slightly different). This surfaced an issue in how we assert here: expected is a list of all the records we wrote, but the current implementation puts the actual results into a set, which dedupes them, so the size assertion below would fail since the set has fewer elements.

For this test we should just compare the exact lists; to do that, I sort the records and do an element-wise comparison. It's only 1000 elements, so the sort and element-wise check are inconsequential in terms of time.
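A minimal sketch of the dedup problem and the sort-then-compare fix, using plain strings in place of Iceberg Records (the class name and data here are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateAwareCompare {
    public static void main(String[] args) {
        // Stand-ins for written vs. read-back records; the duplicate "b" matters
        List<String> expected = Arrays.asList("a", "b", "b", "c");
        List<String> actual = Arrays.asList("b", "a", "c", "b");

        // A set-based check silently dedupes, so the sizes no longer line up
        Set<String> actualSet = new HashSet<>(actual);
        System.out.println(actualSet.size() == expected.size());

        // Sorting both sides and comparing element-wise preserves multiplicity
        List<String> sortedExpected = new ArrayList<>(expected);
        List<String> sortedActual = new ArrayList<>(actual);
        Collections.sort(sortedExpected);
        Collections.sort(sortedActual);
        System.out.println(sortedExpected.equals(sortedActual));
    }
}
```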

@amogh-jahagirdar amogh-jahagirdar force-pushed the refactor-vectorized-tests-generic-data branch from a0e2ce1 to 2fc136e Compare April 29, 2025 01:42
@amogh-jahagirdar amogh-jahagirdar force-pushed the refactor-vectorized-tests-generic-data branch 6 times, most recently from 881c84f to 3321c03 Compare April 29, 2025 13:39
Comment on lines +267 to +274
    Comparator<Record> recordComparator =
        Comparator.comparing((Record r) -> r.get(0, Long.class))
            .thenComparing(
                (Record r) -> r.get(1, String.class), Comparator.nullsFirst(String::compareTo));
    List<Record> records = Lists.newArrayList(IcebergGenerics.read(table).build());

    expected.sort(recordComparator);
    records.sort(recordComparator);

See #12925 (comment) for why I changed this

Iterable<GenericData.Record> nonDictionaryData =
RandomData.generate(schema, 10000, 0L, RandomData.DEFAULT_NULL_PERCENTAGE);
try (FileAppender<GenericData.Record> writer = getParquetWriter(schema, plainEncodingFile)) {
Iterable<Record> nonDictionaryData = RandomGenericData.generate(schema, 10000, 0L);

This code used to pass an explicit null percentage of 5% to the generation function. Since 5% is already the default when it's not specified, I changed the call to omit the explicit null percentage.
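A hedged sketch of the overload-with-default pattern at play here (the class, method, and constant names are hypothetical, not Iceberg's API): passing the default value explicitly is equivalent to omitting it.

```java
import java.util.Random;

public class NullPercentageDefault {
    // Hypothetical stand-in for a DEFAULT_NULL_PERCENTAGE constant
    static final float DEFAULT_NULL_PERCENTAGE = 0.05f;

    // The shorter overload delegates to the full one, so callers that pass
    // the default explicitly can drop the argument without changing behavior
    static long countNulls(int n, long seed) {
        return countNulls(n, seed, DEFAULT_NULL_PERCENTAGE);
    }

    static long countNulls(int n, long seed, float nullPct) {
        Random rnd = new Random(seed);
        long nulls = 0;
        for (int i = 0; i < n; i++) {
            if (rnd.nextFloat() < nullPct) {
                nulls++;
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        // Same seed, same default: both calls produce identical results
        System.out.println(countNulls(10000, 0L) == countNulls(10000, 0L, 0.05f));
    }
}
```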

@amogh-jahagirdar

Thanks @nastra for reviewing! I'll go ahead and merge, and rebase the vectorized reader changes for row lineage.

@amogh-jahagirdar amogh-jahagirdar merged commit d0cf7f5 into apache:main Apr 30, 2025
42 checks passed
anuragmantri added a commit to anuragmantri/iceberg that referenced this pull request Jul 25, 2025
…ord instead of Avro GenericRecord (apache#12925) (apache#1580)

Co-authored-by: Amogh Jahagirdar <[email protected]>