Conversation

@Fokko Fokko (Contributor) commented Jun 16, 2025

While fixing some issues on the PyIceberg end to fully support UUIDs (apache/iceberg-python#2007), I noticed this issue and was surprised, since UUID used to work with Spark. It turns out that the dictionary-encoded UUID read path was not implemented yet.

PyIceberg only generates small amounts of data, so this wasn't caught previously.

Closes #4581
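For context, dictionary encoding stores each distinct value once in a dictionary and replaces the column values with integer indexes into it; decoding is a plain lookup, which is the read path that was missing for UUIDs. A minimal, library-free sketch of the idea (not Parquet's actual on-disk encoding; class and method names are illustrative):

```java
import java.util.*;

public class DictionaryEncodingSketch {
  // Encode values as (dictionary, indexes): each distinct value is stored once
  // and the column becomes integer indexes into the dictionary.
  static Map.Entry<List<UUID>, int[]> encode(List<UUID> values) {
    List<UUID> dict = new ArrayList<>();
    Map<UUID, Integer> lookup = new HashMap<>();
    int[] indexes = new int[values.size()];
    for (int i = 0; i < values.size(); i++) {
      UUID v = values.get(i);
      Integer idx = lookup.get(v);
      if (idx == null) {
        idx = dict.size();
        dict.add(v);
        lookup.put(v, idx);
      }
      indexes[i] = idx;
    }
    return Map.entry(dict, indexes);
  }

  // Decoding is a plain dictionary lookup; a reader missing this path cannot
  // materialize the column even though the stored values themselves are fine.
  static List<UUID> decode(List<UUID> dict, int[] indexes) {
    List<UUID> out = new ArrayList<>(indexes.length);
    for (int idx : indexes) {
      out.add(dict.get(idx));
    }
    return out;
  }

  public static void main(String[] args) {
    UUID a = UUID.randomUUID();
    UUID b = UUID.randomUUID();
    List<UUID> values = List.of(a, b, a, a, b);
    var encoded = encode(values);
    System.out.println(encoded.getKey().size()); // 2: only distinct values stored
    System.out.println(decode(encoded.getKey(), encoded.getValue()).equals(values)); // true
  }
}
```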

@kevinjqliu kevinjqliu (Contributor) left a comment

Generally LGTM

Is there a way to test this? Can we add a dictionary-encoded UUID like this?

@dingo4dev dingo4dev left a comment

@Fokko TIA.

Do we have any tests for the UUID partition read?

@Fokko Fokko (Contributor, Author) commented Jun 18, 2025

@dingo4dev Yes, we have tests for plain-encoded UUIDs; let me add one for dictionary-encoded UUIDs as well 👍

@Fokko Fokko (Contributor, Author) commented Jun 18, 2025

> Is there a way to test this? Can we add a dictionary-encoded UUID like this?

Just found out that the test above is not testing this code path, since Spark projects a UUID into a String.

A test has been added, and I checked with breakpoints that it hits the newly added lines 👍

// Schema with a single optional UUID column
Schema schema = new Schema(optional(100, "uuid", Types.UUIDType.get()));

// Create a unique temp file, then delete it so the Parquet writer can
// create the file itself (it fails if the target already exists)
File dataFile = File.createTempFile("junit", null, temp.toFile());
assertThat(dataFile.delete()).as("Delete should succeed").isTrue();
A Member left a comment

Why do we do this? The test looks fine; just wondering why we create and then delete the file.

Contributor Author

There are multiple tests that use the same directory. Removing this line causes:

> Task :iceberg-spark:iceberg-spark-3.5_2.12:test FAILED
TestParquetVectorizedReads > testUuidReads() FAILED
    org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /var/folders/q_/gj3w0fts15n9l1dz58630jg40000gp/T/junit-17282529623949094801/junit6226812078387717274.tmp
        at app//org.apache.iceberg.Files$LocalOutputFile.create(Files.java:56)

Contributor Author

I've created an issue to fix this for all the tests: #13506

@liamzwbao liamzwbao (Contributor) commented Jul 9, 2025

Took a look at the issue. @TempDir makes sure each test gets a unique directory, so multiple tests won't affect each other. However, the reason these tests need to manually delete the directory is that @TempDir always creates the directory before the test starts, and getParquetWriter throws an exception if the target folder already exists.

What we actually need here is just a randomly generated path without creating anything at that location. Unfortunately, @TempDir doesn’t currently support that level of control.
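Under that constraint, what the tests effectively need can be sketched with the stdlib alone: derive a random child path inside the injected directory without creating anything on disk (freshPath and the class name are hypothetical, not a JUnit or Iceberg API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class FreshPathSketch {
  // Returns a path under dir that does not yet exist on disk, so a writer
  // that refuses to overwrite existing files can create it itself.
  static Path freshPath(Path dir, String suffix) {
    return dir.resolve("junit-" + UUID.randomUUID() + suffix);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("sketch");
    Path target = freshPath(dir, ".parquet");
    System.out.println(Files.exists(target)); // false: nothing was created at target
  }
}
```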

Contributor Author

I'm not a JUnit expert, @liamzwbao, but this blog suggests that you can also recreate the directory after each test.
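If one went the recreate-after-each-test route, a stdlib-only sketch might look like this (resetDir and the class name are hypothetical helpers; the JUnit @AfterEach wiring is omitted):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class ResetDirSketch {
  // Recursively delete dir (if present), then recreate it empty,
  // mimicking "recreate the directory after each test".
  static void resetDir(Path dir) throws IOException {
    if (Files.exists(dir)) {
      try (Stream<Path> walk = Files.walk(dir)) {
        // reverse order deletes children before their parent directories
        walk.sorted(Comparator.reverseOrder()).forEach(p -> {
          try {
            Files.delete(p);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
      }
    }
    Files.createDirectory(dir);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("sketch");
    Files.createFile(dir.resolve("leftover.tmp")); // simulate a stale file
    resetDir(dir);
    try (Stream<Path> s = Files.list(dir)) {
      System.out.println(s.count()); // 0: directory exists again, now empty
    }
  }
}
```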

@Fokko Fokko merged commit 09140e5 into apache:main Jul 9, 2025
39 checks passed
@Fokko Fokko (Contributor, Author) commented Jul 9, 2025

Thanks @RussellSpitzer, @kevinjqliu and @dingo4dev for the review 🚀

@Fokko Fokko deleted the fd-uuid branch July 9, 2025 22:29
@nastra nastra (Contributor) commented Jul 10, 2025

@Fokko are you planning to port this over to Spark 4? I think it would be good to get this out with 1.10


Fokko added a commit to Fokko/iceberg that referenced this pull request Aug 20, 2025
huaxingao pushed a commit that referenced this pull request Aug 20, 2025
Successfully merging this pull request may close these issues.

Spark: Cannot read or write UUID columns
6 participants