Add support for DELTA_BINARY_PACKED Parquet encoding #13391
…rd/iceberg into parquet-v2-delta
...ain/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDeltaEncodedValuesReader.java
@RussellSpitzer absolutely, within Besides this, we lack support for some Spark features like skipping or Spark's Beyond
Still going through the changes, thank you @eric-maynard
Thanks for taking a look @amogh-jahagirdar, your comments should be addressed in the latest commit.
Sorry for the late review, @eric-maynard. I checked this out locally and stepped through it; I just had a minor comment, but overall it looks good to me. I'll hold in case @RussellSpitzer, @rdblue, @nastra, or any others have comments.
...java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java
On further review, had a question on a code path that I want to confirm
I'm on board as well; I was just hoping we could get someone who is more familiar with the Spark code to take a pass too. This looks correct to me, and all the changes from the Spark version of the code look logical.
Thanks @amogh-jahagirdar & @RussellSpitzer for the reviews! The PR should be updated to reflect the latest round of comments. If this merges, I will update #13450 to reflect support for the new encoding type and take that PR out of draft.
LGTM. Thanks @eric-maynard for the PR!
Thank you @eric-maynard for continuing this work and thanks @amogh-jahagirdar and @huaxingao for reviewing.
Hey @eric-maynard, could you backport the Spark 4.0 changes to Spark 3.5? We want to keep the two Spark versions aligned in the upcoming 1.10 release. Here's some more context: https://lists.apache.org/thread/8xzbg1wqft2grv8v1f13vb86vd8f7rjd I'm happy to help with the backport too.
Hey @kevinjqliu, absolutely -- please see #13859
This adds support for the DELTA_BINARY_PACKED Parquet encoding.
The logic is taken from Spark's VectorizedDeltaBinaryPackedReader with adjustments made for compatibility with our existing Parquet reader.
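For readers unfamiliar with the encoding, here is a minimal, non-vectorized sketch of how a DELTA_BINARY_PACKED page can be decoded, based on the layout in the Parquet format specification. It is not the code in this PR or in Spark: the class name `DeltaBinaryPackedSketch` and its methods are hypothetical, and it ignores Arrow vectors, skipping, and INT32 handling.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Iceberg's or Spark's reader.
final class DeltaBinaryPackedSketch {
  private final byte[] buf;
  private int pos = 0;

  private DeltaBinaryPackedSketch(byte[] buf) {
    this.buf = buf;
  }

  // ULEB128: 7 payload bits per byte, high bit set means "more bytes follow".
  private long readUnsignedVarLong() {
    long result = 0;
    int shift = 0;
    while (true) {
      int b = buf[pos++] & 0xFF;
      result |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
    }
  }

  // Zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...; this undoes the mapping.
  private long readZigZagVarLong() {
    long raw = readUnsignedVarLong();
    return (raw >>> 1) ^ -(raw & 1);
  }

  // Reads `width` bits starting at an absolute bit offset, least significant bit first.
  private long readBits(long bitOffset, int width) {
    long value = 0;
    for (int i = 0; i < width; i++) {
      long bit = bitOffset + i;
      int b = (buf[(int) (bit >>> 3)] >>> (int) (bit & 7)) & 1;
      value |= (long) b << i;
    }
    return value;
  }

  static long[] decode(byte[] data) {
    DeltaBinaryPackedSketch r = new DeltaBinaryPackedSketch(data);
    // Header: block size, miniblocks per block, total value count, first value.
    int blockSize = (int) r.readUnsignedVarLong();
    int miniBlocks = (int) r.readUnsignedVarLong();
    int total = (int) r.readUnsignedVarLong();
    long[] out = new long[total];
    if (total == 0) {
      return out;
    }
    out[0] = r.readZigZagVarLong();
    int valuesPerMiniBlock = blockSize / miniBlocks;
    int n = 1;
    while (n < total) {
      // Block: min delta, one bit-width byte per miniblock, then the miniblocks.
      long minDelta = r.readZigZagVarLong();
      int[] widths = new int[miniBlocks];
      for (int i = 0; i < miniBlocks; i++) {
        widths[i] = r.buf[r.pos++] & 0xFF;
      }
      // Miniblocks past the last value are not stored, so stop once n reaches total.
      for (int m = 0; m < miniBlocks && n < total; m++) {
        long bitBase = (long) r.pos * 8;
        for (int i = 0; i < valuesPerMiniBlock && n < total; i++) {
          long packed = r.readBits(bitBase + (long) i * widths[m], widths[m]);
          // Each stored number is (delta - minDelta); add it back to the previous value.
          out[n] = out[n - 1] + minDelta + packed;
          n++;
        }
        // A stored miniblock is always padded to its full size.
        r.pos += valuesPerMiniBlock * widths[m] / 8;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Hand-encoded page: block size 128, 4 miniblocks, 8 values, first value 7,
    // min delta -2, bit widths {2, 0, 0, 0}, then one 8-byte miniblock.
    byte[] page = {
      (byte) 0x80, 0x01, 0x04, 0x08, 0x0E, 0x03, 0x02, 0, 0, 0,
      (byte) 0xC0, 0x3F, 0, 0, 0, 0, 0, 0
    };
    System.out.println(Arrays.toString(decode(page)));
  }
}
```

The sample page in `main` encodes the sequence 7, 5, 3, 1, 2, 3, 4, 5. A production reader would decode whole miniblocks in bulk rather than bit by bit, which is the part a vectorized implementation optimizes.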