Update Spark Parquet vectorized read tests to use Iceberg Record instead of Avro GenericRecord #12925
Conversation
case LONG:
  assertThat(actual).as("Should be a long").isInstanceOf(Long.class);
  if (expected instanceof Integer) {
    assertThat(actual).as("Values didn't match").isEqualTo(((Number) expected).longValue());
  } else {
    assertThat(actual).as("Primitive value should be equal to expected").isEqualTo(expected);
  }
  break;
case DOUBLE:
  assertThat(actual).as("Should be a double").isInstanceOf(Double.class);
  if (expected instanceof Float) {
    assertThat(Double.doubleToLongBits((double) actual))
        .as("Values didn't match")
        .isEqualTo(Double.doubleToLongBits(((Number) expected).doubleValue()));
  } else {
    assertThat(actual).as("Primitive value should be equal to expected").isEqualTo(expected);
  }
  break;
This is needed for the int-to-long and float-to-double type promotion tests. It is the same logic that exists in TestHelpers#assertEqualsUnsafe.
I should double check why this wasn't failing before.
Disregard, my first comment still stands. I confused myself: we only run testReadsForTypePromotedColumns for the vectorized Parquet reader test, so it makes sense that this wasn't failing before. Previously we were calling TestHelpers#assertEqualsUnsafe, which has this same logic to ensure the right assertions in the context of type-promoted columns.
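A minimal standalone sketch (not the Iceberg test code itself) of why the widening branch above is needed: after int-to-long promotion the reader returns a `Long`, but the expected value was written as an `Integer`, and boxed equality across those types is always false. Widening the expected value through `Number` first, as the assertion does, makes the comparison meaningful.

```java
public class PromotionCheck {
  public static void main(String[] args) {
    Object actual = 42L; // value read back as a LONG after type promotion
    Object expected = 42; // value originally written as an Integer

    // Direct equality fails: Integer(42) never equals Long(42L) in Java.
    assert !expected.equals(actual);

    // Widening expected to long, as the test assertion does, succeeds.
    assert ((Number) expected).longValue() == (Long) actual;

    System.out.println("ok");
  }
}
```

Run with `java -ea PromotionCheck` so the assertions are enabled.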
RecordComparator comparator = new RecordComparator();
List<Record> records = Lists.newArrayList(IcebergGenerics.read(table).build());
cc @rdblue I changed this logic because some duplicate records will now be generated (now that the random generation of null fields is ever so slightly different). This surfaced an issue in how we assert here: expected is a list of all the records we wrote, but the current implementation takes the actual results and puts them in a set, which dedupes them, so the size assertion below would fail since the set has fewer elements.
For this test we should just compare the exact lists. To do that I sort the records and do an element-wise comparison. It's only 1000 elements, so the sort and element-wise check are inconsequential in terms of time.
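The failure mode described above can be reproduced with a tiny standalone sketch (plain strings stand in for Iceberg Records here): collecting rows into a `HashSet` silently drops duplicates, so a size check against the full expected list fails, while sorting both lists and comparing them element-wise keeps duplicates intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupeSketch {
  public static void main(String[] args) {
    List<String> expected = Arrays.asList("a", "b", "b", "c");

    // Set-based collection dedupes the repeated "b", so sizes diverge.
    Set<String> actualAsSet = new HashSet<>(expected);
    assert actualAsSet.size() != expected.size(); // 3 vs 4: size check fails

    // Sorted-list comparison keeps both "b" rows and still ignores order.
    List<String> actualSorted = new ArrayList<>(expected);
    Collections.sort(actualSorted);
    List<String> expectedSorted = new ArrayList<>(expected);
    Collections.sort(expectedSorted);
    assert expectedSorted.equals(actualSorted);

    System.out.println("ok");
  }
}
```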
Comparator<Record> recordComparator =
    Comparator.comparing((Record r) -> r.get(0, Long.class))
        .thenComparing(
            (Record r) -> r.get(1, String.class), Comparator.nullsFirst(String::compareTo));
List<Record> records = Lists.newArrayList(IcebergGenerics.read(table).build());

expected.sort(recordComparator);
records.sort(recordComparator);
See #12925 (comment) for why I changed this
Iterable<GenericData.Record> nonDictionaryData =
    RandomData.generate(schema, 10000, 0L, RandomData.DEFAULT_NULL_PERCENTAGE);
try (FileAppender<GenericData.Record> writer = getParquetWriter(schema, plainEncodingFile)) {
Iterable<Record> nonDictionaryData = RandomGenericData.generate(schema, 10000, 0L);
This code used to pass an explicit null percentage of 5% to the generation function. Since 5% is already the default when it's not specified, I changed it to call the utility without an explicit null percentage.
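The refactor relies on the common Java overload pattern sketched below, where a convenience method forwards to the full version with a default argument, so passing the default explicitly is redundant. The names here are illustrative stand-ins, not the actual Iceberg API.

```java
public class DefaultsSketch {
  // Hypothetical default mirroring the 5% null percentage described above.
  static final float DEFAULT_NULL_PERCENTAGE = 0.05f;

  // Convenience overload: forwards to the full version with the default.
  static String generate(int count, long seed) {
    return generate(count, seed, DEFAULT_NULL_PERCENTAGE);
  }

  static String generate(int count, long seed, float nullPercentage) {
    return "records(count=" + count + ", seed=" + seed + ", nullPct=" + nullPercentage + ")";
  }

  public static void main(String[] args) {
    // Both calls are equivalent; the explicit default adds no information.
    assert generate(10000, 0L).equals(generate(10000, 0L, DEFAULT_NULL_PERCENTAGE));
    System.out.println("ok");
  }
}
```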
Thanks @nastra for reviewing! I'll go ahead and merge and rebase the vectorized reader changes for lineage.
…ord instead of Avro GenericRecord (apache#12925) (apache#1580) Co-authored-by: Amogh Jahagirdar <[email protected]>
This change updates the Spark Parquet vectorized read tests to write and validate against Iceberg Records instead of Avro GenericRecords. Iceberg's generic Record is the interface we should be testing against, since it avoids the intricacies of Avro data types, which currently bubble up when we build expectations. Refer to ab92d6e#diff-e649f357cf9965d322086f95bc78451fa1e61e4612f3d8af2114d93a1bd657aa for similar changes made for the non-vectorized Parquet reader.
This refactoring prepares for the row lineage changes required for the vectorized reader. Since there are a few more changes involved here, I've separated out the test part first so the row lineage changes are easier to review.