Update Spark Parquet vectorized read tests to use Iceberg Record instead of Avro GenericRecord #12925
Conversation
case LONG:
  assertThat(actual).as("Should be a long").isInstanceOf(Long.class);
  if (expected instanceof Integer) {
    assertThat(actual).as("Values didn't match").isEqualTo(((Number) expected).longValue());
  } else {
    assertThat(actual).as("Primitive value should be equal to expected").isEqualTo(expected);
  }
  break;
case DOUBLE:
  assertThat(actual).as("Should be a double").isInstanceOf(Double.class);
  if (expected instanceof Float) {
    assertThat(Double.doubleToLongBits((double) actual))
        .as("Values didn't match")
        .isEqualTo(Double.doubleToLongBits(((Number) expected).doubleValue()));
  } else {
    assertThat(actual).as("Primitive value should be equal to expected").isEqualTo(expected);
  }
  break;
This is needed for the int-to-long and float-to-double type promotion tests. It is the same logic that exists in TestHelpers#assertEqualsUnsafe.
I should double check why this wasn't failing before.
Disregard, my first comment still stands. I confused myself: we only run testReadsForTypePromotedColumns for the vectorized Parquet reader test, so it makes sense that this wasn't failing before. Previously we were calling TestHelpers#assertEqualsUnsafe, which has this same logic to ensure the right assertions in the context of type-promoted columns.
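A minimal standalone sketch (not the Iceberg test code itself) of why the widening branch above is needed: after int-to-long promotion the reader returns a `Long`, but the expected value was written as an `Integer`, and boxed equality across those types is always false. Widening the expected value through `Number` first, as the assertion does, makes the comparison meaningful.

```java
public class PromotionCheck {
  public static void main(String[] args) {
    Object actual = 42L; // value read back as a LONG after type promotion
    Object expected = 42; // value originally written as an Integer

    // Direct equality fails: Integer(42) never equals Long(42L) in Java.
    assert !expected.equals(actual);

    // Widening expected to long, as the test assertion does, succeeds.
    assert ((Number) expected).longValue() == (Long) actual;

    System.out.println("ok");
  }
}
```

Run with `java -ea PromotionCheck` so the assertions are enabled.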
RecordComparator comparator = new RecordComparator();
List<Record> records = Lists.newArrayList(IcebergGenerics.read(table).build());
cc @rdblue I changed this logic because some duplicate records will now be generated (now that the random generation of null fields is ever so slightly different). This surfaced an issue in how we assert here: expected is a list of all the records we wrote, but the current implementation takes the actual results and puts them in a set, which dedupes them, so the size assertion below would fail since the set has fewer elements.
For this test we should just compare the exact lists. To do that I sort the records and do an element-wise comparison. It's only 1000 elements, so the sort and element-wise check are inconsequential in terms of time.
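The failure mode described above can be reproduced with a tiny standalone sketch (plain strings stand in for Iceberg Records here): collecting rows into a `HashSet` silently drops duplicates, so a size check against the full expected list fails, while sorting both lists and comparing them element-wise keeps duplicates intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupeSketch {
  public static void main(String[] args) {
    List<String> expected = Arrays.asList("a", "b", "b", "c");

    // Set-based collection dedupes the repeated "b", so sizes diverge.
    Set<String> actualAsSet = new HashSet<>(expected);
    assert actualAsSet.size() != expected.size(); // 3 vs 4: size check fails

    // Sorted-list comparison keeps both "b" rows and still ignores order.
    List<String> actualSorted = new ArrayList<>(expected);
    Collections.sort(actualSorted);
    List<String> expectedSorted = new ArrayList<>(expected);
    Collections.sort(expectedSorted);
    assert expectedSorted.equals(actualSorted);

    System.out.println("ok");
  }
}
```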
Comparator<Record> recordComparator =
    Comparator.comparing((Record r) -> r.get(0, Long.class))
        .thenComparing(
            (Record r) -> r.get(1, String.class), Comparator.nullsFirst(String::compareTo));
List<Record> records = Lists.newArrayList(IcebergGenerics.read(table).build());

expected.sort(recordComparator);
records.sort(recordComparator);
See #12925 (comment) for why I changed this
Iterable<GenericData.Record> nonDictionaryData =
    RandomData.generate(schema, 10000, 0L, RandomData.DEFAULT_NULL_PERCENTAGE);
try (FileAppender<GenericData.Record> writer = getParquetWriter(schema, plainEncodingFile)) {
Iterable<Record> nonDictionaryData = RandomGenericData.generate(schema, 10000, 0L);
This code used to pass an explicit null percentage of 5% to the generation function. Since 5% is already the default when it's not specified, I changed it to call the utility without an explicit null percentage.
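The refactor relies on the common Java overload pattern sketched below, where a convenience method forwards to the full version with a default argument, so passing the default explicitly is redundant. The names here are illustrative stand-ins, not the actual Iceberg API.

```java
public class DefaultsSketch {
  // Hypothetical default mirroring the 5% null percentage described above.
  static final float DEFAULT_NULL_PERCENTAGE = 0.05f;

  // Convenience overload: forwards to the full version with the default.
  static String generate(int count, long seed) {
    return generate(count, seed, DEFAULT_NULL_PERCENTAGE);
  }

  static String generate(int count, long seed, float nullPercentage) {
    return "records(count=" + count + ", seed=" + seed + ", nullPct=" + nullPercentage + ")";
  }

  public static void main(String[] args) {
    // Both calls are equivalent; the explicit default adds no information.
    assert generate(10000, 0L).equals(generate(10000, 0L, DEFAULT_NULL_PERCENTAGE));
    System.out.println("ok");
  }
}
```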
Thanks @nastra for reviewing! I'll go ahead and merge and rebase the vectorized reader changes for lineage.
…ord instead of Avro GenericRecord (apache#12925) (apache#1580) Co-authored-by: Amogh Jahagirdar <[email protected]>
This change updates the Spark Parquet vectorized read tests to write and validate against Iceberg Records instead of Avro GenericRecords. Iceberg's generic Record is the interface we should be testing against, since it avoids the intricacies of Avro data types, which currently bubble up when we build expectations. Refer to ab92d6e#diff-e649f357cf9965d322086f95bc78451fa1e61e4612f3d8af2114d93a1bd657aa for similar changes made for the non-vectorized Parquet reader.
This refactoring prepares for the row lineage changes required for the vectorized reader. Since there are a few more changes involved here, I've separated out the test part first so the row lineage changes are easier to review.