Spark: Support Parquet dictionary encoded UUIDs #13324
Conversation
While fixing some issues on the PyIceberg end to fully support UUIDs (apache/iceberg-python#2007), I noticed this issue. I was surprised, since UUIDs used to work with Spark, but it turns out that dictionary-encoded UUIDs were not implemented yet. PyIceberg only generates small amounts of data, so this wasn't caught previously.

Closes #4581
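For context, a Parquet dictionary page stores each distinct UUID once as a 16-byte fixed-length binary value, and the data pages then carry only small integer ids pointing into that dictionary. A minimal sketch of the id-to-UUID lookup involved, using only the standard library (the class and helper names are illustrative, not from this PR):

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidDictionaryDemo {

  // Decode a 16-byte big-endian value (Parquet FIXED_LEN_BYTE_ARRAY) into a UUID.
  static UUID uuidFromBytes(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer is big-endian by default
    return new UUID(buf.getLong(), buf.getLong()); // most-significant bits first
  }

  // Encode a UUID into the 16-byte layout a Parquet dictionary would hold.
  static byte[] uuidToBytes(UUID uuid) {
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  public static void main(String[] args) {
    // A stand-in "dictionary" with two distinct values, plus per-row ids,
    // mirroring how a dictionary-encoded column references its values.
    byte[][] dictionary = {uuidToBytes(UUID.randomUUID()), uuidToBytes(UUID.randomUUID())};
    int[] dictionaryIds = {0, 1, 0, 0, 1};

    for (int id : dictionaryIds) {
      System.out.println(uuidFromBytes(dictionary[id]));
    }
  }
}
```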
Generally LGTM.

Is there a way to test this? Can we add a dictionary-encoded UUID like this?
@Fokko TIA.

Do we have any tests for UUID partition reads?
@dingo4dev Yes, we have tests for plain-encoded UUIDs, let me add one for dictionary-encoded UUIDs as well 👍
Just found out that the test above is not testing this code path, since Spark projects a UUID into a String. A test has been added, and I checked with breakpoints that it hits the newly added lines 👍
```java
Schema schema = new Schema(optional(100, "uuid", Types.UUIDType.get()));

File dataFile = File.createTempFile("junit", null, temp.toFile());
assertThat(dataFile.delete()).as("Delete should succeed").isTrue();
```
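Answering the earlier question about exercising the dictionary-encoded path: writing many rows that share only a few distinct UUID values pushes the Parquet writer into dictionary encoding. A minimal sketch using Iceberg's generic Parquet writer (the helper name, row count, and writer wiring are assumptions for illustration, not the exact test added in this PR):

```java
import static org.apache.iceberg.types.Types.NestedField.optional;

import java.io.File;
import java.util.List;
import java.util.UUID;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.GenericRecord;
import org.apache.iceberg.data.parquet.GenericParquetWriter;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.parquet.Parquet;
import org.apache.iceberg.types.Types;

public class DictionaryUuidFixture {

  static void writeDictionaryEncodedUuids(File dataFile) throws Exception {
    Schema schema = new Schema(optional(100, "uuid", Types.UUIDType.get()));

    // Only a handful of distinct values, repeated many times: the writer
    // stores them once in a dictionary page and the data pages carry small
    // integer ids, which is exactly the code path this PR fixes.
    List<UUID> distinct = List.of(UUID.randomUUID(), UUID.randomUUID(), UUID.randomUUID());

    try (FileAppender<GenericRecord> appender =
        Parquet.write(Files.localOutput(dataFile))
            .schema(schema)
            .createWriterFunc(GenericParquetWriter::buildWriter)
            .build()) {
      for (int i = 0; i < 10_000; i++) {
        GenericRecord record = GenericRecord.create(schema);
        record.setField("uuid", distinct.get(i % distinct.size()));
        appender.add(record);
      }
    }
  }
}
```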
Why do we do this? The test looks fine, just wondering why we create and then delete.
There are multiple tests that use the same directory. Removing this line causes:
```
> Task :iceberg-spark:iceberg-spark-3.5_2.12:test FAILED

TestParquetVectorizedReads > testUuidReads() FAILED
    org.apache.iceberg.exceptions.AlreadyExistsException: File already exists: /var/folders/q_/gj3w0fts15n9l1dz58630jg40000gp/T/junit-17282529623949094801/junit6226812078387717274.tmp
        at app//org.apache.iceberg.Files$LocalOutputFile.create(Files.java:56)
```
I've created an issue to fix this for all the tests: #13506
Took a look at the issue. `@TempDir` makes sure each test gets a unique directory so multiple tests won't affect each other. However, the reason these tests need the manual delete is that `@TempDir` always creates the directory before the test starts, and `getParquetWriter` throws an exception if the target file already exists. What we actually need here is just a randomly generated path without creating anything at that location. Unfortunately, `@TempDir` doesn't currently support that level of control.
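One way to get such a path, sketched under the assumption that the writer only needs a target that does not exist yet (test and variable names are illustrative):

```java
import java.io.File;
import java.nio.file.Path;
import java.util.UUID;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class NonExistentTargetTest {

  @TempDir
  Path temp; // JUnit creates (and later cleans up) this directory

  @Test
  void writerGetsFreshPath() {
    // Resolve a unique child path without touching the filesystem, so a
    // writer that refuses to overwrite an existing file accepts it.
    File dataFile = temp.resolve("junit-" + UUID.randomUUID() + ".parquet").toFile();
    // dataFile does not exist yet; hand it to getParquetWriter(...) here.
  }
}
```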
I'm not a JUnit expert @liamzwbao, but this blog post suggests that you can also recreate the directory after each test.
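The blog link isn't preserved here, so this is only a guess at the pattern it describes: wipe and recreate a shared directory in an `@AfterEach` hook so the next test starts clean. All names below are illustrative:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import org.junit.jupiter.api.AfterEach;

class SharedTempDirSupport {

  // A fixed location shared across tests (illustrative; not JUnit's @TempDir).
  private static final Path SHARED =
      Path.of(System.getProperty("java.io.tmpdir"), "iceberg-tests");

  @AfterEach
  void recreateSharedDir() throws IOException {
    if (Files.exists(SHARED)) {
      // Delete children before parents by walking the tree in reverse order.
      try (var paths = Files.walk(SHARED)) {
        paths.sorted(Comparator.reverseOrder()).map(Path::toFile).forEach(File::delete);
      }
    }
    Files.createDirectories(SHARED); // empty directory for the next test
  }
}
```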
Thanks @RussellSpitzer, @kevinjqliu and @dingo4dev for the review 🚀

@Fokko are you planning to port this over to Spark 4? I think it would be good to get this out with 1.10

and perhaps Spark 3.4 as well