eric-maynard
Contributor

This adds support for the DELTA_BINARY_PACKED Parquet encoding.

The logic is taken from Spark's VectorizedDeltaBinaryPackedReader with adjustments made for compatibility with our existing Parquet reader.
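For readers unfamiliar with the encoding, here is a minimal scalar sketch of how a DELTA_BINARY_PACKED page is laid out and decoded, following the Parquet format description: a header of varints (block size, miniblocks per block, total value count, zigzag first value), then blocks that each carry a zigzag minimum delta, one bit-width byte per miniblock, and the bit-packed `delta - minDelta` values. This is not the code in this PR or Spark's vectorized reader; the class name `DeltaBinaryPackedSketch` and its helpers are hypothetical, it always decodes to `long`, and it does no bounds or overflow validation.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; not part of Iceberg or Spark.
final class DeltaBinaryPackedSketch {
  private DeltaBinaryPackedSketch() {}

  // Reads an unsigned LEB128 varint (7 payload bits per byte, low bits first).
  static long readUnsignedVarLong(ByteBuffer in) {
    long value = 0;
    int shift = 0;
    while (true) {
      int b = in.get() & 0xFF;
      value |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
      shift += 7;
    }
  }

  // Reads a zigzag-encoded signed varint.
  static long readZigZagVarLong(ByteBuffer in) {
    long raw = readUnsignedVarLong(in);
    return (raw >>> 1) ^ -(raw & 1);
  }

  // Extracts the index-th value of `width` bits from an LSB-first bit-packed buffer.
  static long unpack(byte[] packed, int index, int width) {
    long value = 0;
    long bitOffset = (long) index * width;
    for (int bit = 0; bit < width; bit++) {
      long pos = bitOffset + bit;
      int shift = (int) (pos & 7);
      long b = (packed[(int) (pos >>> 3)] >>> shift) & 1;
      value |= b << bit;
    }
    return value;
  }

  // Decodes a whole page into a long[]; assumes well-formed input.
  static long[] decode(byte[] page) {
    ByteBuffer in = ByteBuffer.wrap(page);
    int blockSize = (int) readUnsignedVarLong(in);
    int miniBlocksPerBlock = (int) readUnsignedVarLong(in);
    int totalValues = (int) readUnsignedVarLong(in);
    long firstValue = readZigZagVarLong(in);
    int valuesPerMiniBlock = blockSize / miniBlocksPerBlock;

    long[] out = new long[totalValues];
    if (totalValues == 0) {
      return out;
    }
    out[0] = firstValue;
    int produced = 1;
    long previous = firstValue;

    while (produced < totalValues) {
      // Block header: minimum delta, then one bit-width byte per miniblock.
      long minDelta = readZigZagVarLong(in);
      byte[] bitWidths = new byte[miniBlocksPerBlock];
      in.get(bitWidths);

      for (int m = 0; m < miniBlocksPerBlock && produced < totalValues; m++) {
        int width = bitWidths[m] & 0xFF;
        // A miniblock is always stored at full length, padded if necessary.
        byte[] packed = new byte[valuesPerMiniBlock * width / 8];
        in.get(packed);
        for (int i = 0; i < valuesPerMiniBlock && produced < totalValues; i++) {
          previous += minDelta + unpack(packed, i, width);
          out[produced++] = previous;
        }
      }
    }
    return out;
  }
}
```

As a hand-built check, `DeltaBinaryPackedSketch.decode(new byte[] {0x08, 0x01, 0x08, 0x0E, 0x03, 0x02, (byte) 0xC0, 0x3F})` yields `7, 5, 3, 1, 2, 3, 4, 5`. The block size of 8 is only for illustration; real writers use a block size that is a multiple of 128 and a miniblock size that is a multiple of 32.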

@github-actions github-actions bot added the spark label Jun 27, 2025
@github-actions github-actions bot added the data label Jul 1, 2025
@eric-maynard eric-maynard marked this pull request as ready for review July 1, 2025 20:57
@eric-maynard
Contributor Author

@RussellSpitzer absolutely. Within VectorizedDeltaEncodedValuesReader, most of the divergences should come from the different ways that Iceberg and Spark handle the decoded values. Compare the following pairs of code pointers:

  1. Iceberg / Spark
  2. Iceberg / Spark

Besides this, we lack support for some Spark features like skipping or Spark's readIntegersWithRebase, so lots of code is removed.

Beyond VectorizedDeltaEncodedValuesReader itself, only small changes are needed to actually plug VectorizedDeltaEncodedValuesReader into our reader stack, which already diverges from Spark's.

@amogh-jahagirdar amogh-jahagirdar self-requested a review July 11, 2025 21:27
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Still going through the changes, thank you @eric-maynard

@eric-maynard
Contributor Author

Thanks for taking a look @amogh-jahagirdar, your comments should be addressed in the latest commit.

Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Sorry for the late review, @eric-maynard. I checked this out locally and stepped through it; I just had a minor comment, but overall it looks good to me. I'll hold off in case @RussellSpitzer, @rdblue, @nastra, or anyone else has comments.

@amogh-jahagirdar amogh-jahagirdar self-requested a review July 21, 2025 21:23
@amogh-jahagirdar amogh-jahagirdar dismissed their stale review July 21, 2025 21:24

On further review, had a question on a code path that I want to confirm

Member

@RussellSpitzer RussellSpitzer left a comment

I'm on board as well; I was just hoping we could get someone who is more familiar with the Spark code to take a pass too. This looks correct to me, and all the changes from the Spark version of the code look logical.

@eric-maynard
Contributor Author

eric-maynard commented Jul 21, 2025

Thanks @amogh-jahagirdar & @RussellSpitzer for the reviews! The PR should be updated to reflect the latest round of comments.

If this merges, I will update #13450 to reflect support for the new encoding type and take that PR out of draft.

Contributor

@huaxingao huaxingao left a comment

LGTM. Thanks @eric-maynard for the PR!

@RussellSpitzer RussellSpitzer merged commit c3d50e1 into apache:main Jul 30, 2025
41 of 42 checks passed
@RussellSpitzer
Member

Thank you @eric-maynard for continuing this work, and thanks @amogh-jahagirdar and @huaxingao for reviewing.

@kevinjqliu
Contributor

Hey @eric-maynard, could you backport the Spark 4.0 changes to Spark 3.5? We want to keep the two Spark versions aligned in the upcoming 1.10 release. Here's some more context: https://lists.apache.org/thread/8xzbg1wqft2grv8v1f13vb86vd8f7rjd

I'm happy to help with the backport too.

@eric-maynard
Contributor Author

Hey @kevinjqliu, absolutely -- please see #13859

6 participants