Spark: Support Parquet dictionary encoded UUIDs #13324
Conversation
While fixing some issues on the PyIceberg end to fully support UUIDs (apache/iceberg-python#2007), I noticed this issue. I was surprised, since UUIDs used to work with Spark, but it turns out that dictionary-encoded UUIDs were not implemented yet. PyIceberg only generates small amounts of data, so this wasn't caught previously.

Closes #4581
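For context, a Parquet dictionary page stores each distinct UUID once as a 16-byte fixed-length binary value, and the data pages then carry only small integer ids pointing into that dictionary. A minimal sketch of the id-to-UUID lookup involved, using only the standard library (the class and helper names are illustrative, not from this PR):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidDictionaryDemo {

  // Decode a 16-byte big-endian value (Parquet FIXED_LEN_BYTE_ARRAY) into a UUID.
  static UUID uuidFromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer is big-endian by default
    return new UUID(buf.getLong(), buf.getLong()); // most-significant bits first
  }

  // Encode a UUID into the 16-byte layout a Parquet dictionary would hold.
  static byte[] uuidToBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  public static void main(String[] args) {
    // A stand-in "dictionary" with two distinct values, plus per-row ids,
    // mirroring how a dictionary-encoded column references its values.
    byte[][] dictionary = {uuidToBytes(UUID.randomUUID()), uuidToBytes(UUID.randomUUID())};
    int[] dictionaryIds = {0, 1, 0, 0, 1};

    for (int id : dictionaryIds) {
      System.out.println(uuidFromBytes(dictionary[id]));
    }
  }
}
```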
Generally LGTM.

Is there a way to test this? Can we add a dictionary-encoded UUID like this?
@Fokko TIA.

Do we have any tests for UUID partition reads?
@dingo4dev Yes, we have tests for plain-encoded UUIDs, let me add one for dictionary-encoded UUIDs as well 👍
Just found out that the test above is not testing this code path, since Spark projects a UUID into a String. A test has been added, and I checked with breakpoints that it hits the newly added lines 👍
```java
Schema schema = new Schema(optional(100, "uuid", Types.UUIDType.get()));

File dataFile = File.createTempFile("junit", null, temp.toFile());
assertThat(dataFile.delete()).as("Delete should succeed").isTrue();
```
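Answering the earlier question about exercising the dictionary-encoded path: writing many rows that share only a few distinct UUID values pushes the Parquet writer into dictionary encoding. A minimal sketch using Iceberg's generic Parquet writer (the helper name, row count, and writer wiring are assumptions for illustration, not the exact test added in this PR):

```java
import static org.apache.iceberg.types.Types.NestedField.optional;

import java.io.File;
import java.util.List;
import java.util.UUID;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.parquet.Parquet;
import org.apache.iceberg.types.Types;

public class DictionaryUuidFixture {

  static void writeDictionaryEncodedUuids(File dataFile) throws Exception {
    Schema schema = new Schema(optional(100, "uuid", Types.UUIDType.get()));

    // Only a handful of distinct values, repeated many times: the writer
    // stores them once in a dictionary page and the data pages carry small
    // integer ids, which is exactly the code path this PR fixes.
    List<UUID> distinct = List.of(UUID.randomUUID(), UUID.randomUUID(), UUID.randomUUID());

    try (FileAppender<GenericRecord> appender =
        Parquet.write(Files.localOutput(dataFile))
            .schema(schema)
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .build()) {
      for (int i = 0; i < 10_000; i++) {
        GenericRecord record = GenericRecord.create(schema);
        record.setField("uuid", distinct.get(i % distinct.size()));
        appender.add(record);
      }
    }
  }
}
```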
Why do we do this? The test looks fine, just wondering why we create and then delete.
There are multiple tests that use the same directory. Removing this line causes:
```
> Task :iceberg-spark:iceberg-spark-3.5_2.12:test FAILED

TestParquetVectorizedReads > testUuidReads() FAILED
    org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /var/folders/q_/gj3w0fts15n9l1dz58630jg40000gp/T/junit-17282529623949094801/junit6226812078387717274.tmp
        at app//org.apache.iceberg.Files$LocalOutputFile.create(Files.java:56)
```
I've created an issue to fix this for all the tests: #13506
Took a look at the issue. `@TempDir` makes sure each test gets a unique directory so multiple tests won't affect each other. However, the reason these tests need the manual delete is that `@TempDir` always creates the directory before the test starts, and `getParquetWriter` throws an exception if the target file already exists. What we actually need here is just a randomly generated path without creating anything at that location. Unfortunately, `@TempDir` doesn't currently support that level of control.
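One way to get such a path, sketched under the assumption that the writer only needs a target that does not exist yet (test and variable names are illustrative):

```java
import java.io.File;
import java.nio.file.Path;
import java.util.UUID;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class NonExistentTargetTest {

  @TempDir
  Path temp; // JUnit creates (and later cleans up) this directory

  @Test
  void writerGetsFreshPath() {
    // Resolve a unique child path without touching the filesystem, so a
    // writer that refuses to overwrite an existing file accepts it.
    File dataFile = temp.resolve("junit-" + UUID.randomUUID() + ".parquet").toFile();
    // dataFile does not exist yet; hand it to getParquetWriter(...) here.
  }
}
```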
I'm not a JUnit expert @liamzwbao, but this blog post suggests that you can also recreate the directory after each test.
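The blog link isn't preserved here, so this is only a guess at the pattern it describes: wipe and recreate a shared directory in an `@AfterEach` hook so the next test starts clean. All names below are illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import org.junit.jupiter.api.AfterEach;

class SharedTempDirSupport {

  // A fixed location shared across tests (illustrative; not JUnit's @TempDir).
  private static final Path SHARED =
      Path.of(System.getProperty("java.io.tmpdir"), "iceberg-tests");

  @AfterEach
  void recreateSharedDir() throws IOException {
    if (Files.exists(SHARED)) {
      // Delete children before parents by walking the tree in reverse order.
      try (var paths = Files.walk(SHARED)) {
        paths.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
      }
    }
    Files.createDirectories(SHARED); // empty directory for the next test
  }
}
```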
Thanks @RussellSpitzer, @kevinjqliu and @dingo4dev for the review 🚀

@Fokko are you planning to port this over to Spark 4? I think it would be good to get this out with 1.10

and perhaps Spark 3.4 as well