Improvements for parquet writing performance (25%-44%) #7824
Conversation
Force-pushed from 3a807a5 to 11e76a3
Is this ready for review @jhorstmann?
@Dandandan I think this should be ready for review. I haven't gotten around to re-running all the parquet writing benchmarks yet; maybe @alamb can also do his benchmark magic :)
Benchmark results from a local run with
Very nice improvements for primitives, minor improvements once bloom filters or string types are involved. I think the last two benchmarks are named incorrectly; they actually write 3 columns of types int32, bool and utf8.
@Dandandan wait a second before merging, I think I found another hotspot regarding the bloom filters.
Sure thing!
Those are some pretty impressive numbers @jhorstmann
Updated results for the bloom filter benchmarks. There is a very weird effect on my machine where running only these benchmarks is much faster than running them interleaved with the non-bloom benchmarks.
🤖
🤖: Benchmark completed
😮
Thank you so much @jhorstmann -- that is amazing
```diff
 if self.repeat_count >= 8 {
     // The current RLE run has ended and we've gathered enough. Flush first.
-    assert_eq!(self.bit_packed_count, 0);
+    debug_assert_eq!(self.bit_packed_count, 0);
```
It is a good call to remove the runtime overhead in release mode.
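For context, a minimal sketch (not the actual parquet `BitWriter` code; the struct and fields here are illustrative) of why this helps: `debug_assert_eq!` expands to nothing in release builds, so the comparison and panic path disappear from the hot loop, whereas `assert_eq!` keeps them.

```rust
// Minimal sketch only; names and fields do not match the real parquet BitWriter.
struct RleEncoder {
    repeat_count: usize,
    bit_packed_count: usize,
}

impl RleEncoder {
    fn maybe_flush(&mut self) {
        if self.repeat_count >= 8 {
            // Checked in debug builds only; in release builds debug_assert_eq!
            // compiles away, so no comparison or panic code is emitted on this
            // hot path.
            debug_assert_eq!(self.bit_packed_count, 0);
            // ... flush the repeated run here ...
            self.repeat_count = 0;
        }
    }
}

fn main() {
    let mut enc = RleEncoder { repeat_count: 8, bit_packed_count: 0 };
    enc.maybe_flush();
    assert_eq!(enc.repeat_count, 0);
}
```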
```diff
     /// The arrow array
     array: ArrayRef,

+    /// cached logical nulls of the array.
```
This is also a good idea to avoid having to clone (even though it is just a few Arcs) each time.
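A rough sketch of the caching pattern being discussed, using the `arrow-array`/`arrow-buffer` crates; the struct and field names here are illustrative, not the PR's actual types. The idea is to call `logical_nulls()` once when the struct is built and reuse the result, instead of re-deriving (and cloning) it on every access.

```rust
use std::sync::Arc;

use arrow_array::{Array, ArrayRef, Int32Array};
use arrow_buffer::NullBuffer;

/// Illustrative only; not the actual struct used in the parquet writer.
struct ArrayLeaf {
    /// The arrow array
    array: ArrayRef,
    /// Cached logical nulls of the array, computed once up front.
    logical_nulls: Option<NullBuffer>,
}

impl ArrayLeaf {
    fn new(array: ArrayRef) -> Self {
        // logical_nulls() may have to compute a buffer (e.g. for dictionaries
        // or unions), so doing it once here avoids repeating that work.
        let logical_nulls = array.logical_nulls();
        Self { array, logical_nulls }
    }

    fn is_valid(&self, idx: usize) -> bool {
        self.logical_nulls
            .as_ref()
            .map(|nulls| nulls.is_valid(idx))
            .unwrap_or(true)
    }
}

fn main() {
    let array: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));
    let leaf = ArrayLeaf::new(array);
    assert!(leaf.is_valid(0));
    assert!(!leaf.is_valid(1));
    assert_eq!(leaf.array.len(), 3);
}
```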
BTW @jhorstmann -- Twitter noticed that the initial results in the description don't reflect what I just measured https://x.com/qrilka/status/1940751432560660823 Perhaps the description needs to be updated with your latest results?
The PR description needs an update now that bloom filters are also improved. I also ran my initial benchmarks with
@alamb I can't reproduce these benchmark numbers. The branch results look plausible, but the numbers for main look much slower than what I'm getting locally. Can you verify that the main results include the changes from PR #7823 / commit a9f316b? It might also be that the benchmark names ending in
My scripts compare the branch against
Force-pushed from 9d6a5f5 to e7f32e9
🤖
Ah, that makes sense; I think I created both branches in parallel. This PR is rebased now, which should give a better comparison.
I restarted the benchmark
🤖: Benchmark completed
Ah, 25% - 44% -- will adjust
🚀
Thanks @jhorstmann!
🚀 indeed
🚀 🚀 🚀
This method was removed in apache#7824, which introduced an optimized code path for writing bloom filters on little-endian architectures. The method was, however, still used in the big-endian code path. Due to the use of `#[cfg(target_endian)]` this went unnoticed in CI. Fixes apache#8207
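For readers who haven't hit this pitfall before, here is a hedged sketch (not the actual `Sbbf` implementation; the types and function are illustrative) of the pattern involved: the split-block bloom filter stores blocks of eight `u32` words, and on little-endian targets their in-memory bytes already match the on-disk layout, so all blocks can be written as one slice, while the big-endian branch still swaps bytes per word. Because only one `#[cfg(target_endian = ...)]` branch is ever compiled, a little-endian-only CI never builds the other one.

```rust
use std::io::{self, Write};

/// One split-block bloom filter block: eight 32-bit words.
/// (Sketch only; not the parquet crate's actual types.)
type Block = [u32; 8];

/// Write all blocks in the little-endian on-disk layout.
fn write_blocks<W: Write>(blocks: &[Block], writer: &mut W) -> io::Result<()> {
    #[cfg(target_endian = "little")]
    {
        // On little-endian targets the in-memory bytes already match the
        // on-disk layout, so the whole filter is written as a single slice
        // with no per-block copying or byte swapping.
        let bytes: &[u8] = unsafe {
            std::slice::from_raw_parts(
                blocks.as_ptr() as *const u8,
                blocks.len() * std::mem::size_of::<Block>(),
            )
        };
        writer.write_all(bytes)?;
    }
    #[cfg(target_endian = "big")]
    {
        // This branch only exists in big-endian builds, which is exactly why
        // a bug here can slip past a little-endian-only CI.
        for block in blocks {
            for word in block {
                writer.write_all(&word.to_le_bytes())?;
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let blocks = vec![[0u32; 8]; 4];
    let mut out = Vec::new();
    write_blocks(&blocks, &mut out)?;
    assert_eq!(out.len(), 4 * 32);
    Ok(())
}
```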
Which issue does this PR close?
Rationale for this change
The changes in this PR improve parquet writing performance for primitives by up to 45%.
What changes are included in this PR?
There was not a single bottleneck to fix; instead, several small improvements contributed to the performance increase:

- Change the asserts in `BitWriter::put_value` to `debug_assert`, since these should never be triggered by users of the code and are not required for soundness.
- Use an iterator that implements `TrustedLen` and so avoids multiple capacity checks; `BitIndexIterator` is more optimized for collecting non-null indices (see the sketch after this list).
- For the bloom filters, avoid `memset` and write all blocks as a single slice on little endian targets.
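To illustrate the `BitIndexIterator` point (a sketch using the `arrow-buffer` crate, not the PR's actual code path): `NullBuffer::valid_indices()` scans the validity bitmap a word at a time and yields only the set-bit positions, instead of probing every index through `is_valid`. The `TrustedLen` part is separate: collecting from an iterator with a trusted length lets `Vec` reserve its capacity once instead of checking it repeatedly.

```rust
use arrow_buffer::{BooleanBufferBuilder, NullBuffer};

fn main() {
    // Build a validity bitmap: true = valid (non-null).
    let mut builder = BooleanBufferBuilder::new(6);
    builder.append_slice(&[true, false, true, true, false, true]);
    let nulls = NullBuffer::new(builder.finish());

    // Naive approach: test every position individually.
    let naive: Vec<usize> = (0..nulls.len()).filter(|i| nulls.is_valid(*i)).collect();

    // valid_indices() returns a BitIndexIterator that walks the bitmap word
    // by word and yields only the positions of set bits.
    let fast: Vec<usize> = nulls.valid_indices().collect();

    assert_eq!(naive, fast);
    assert_eq!(fast, vec![0, 2, 3, 5]);
}
```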
Are these changes tested?
Logic should be covered by existing tests.
Are there any user-facing changes?
No, all changes are to implementation details and do not affect public APIs.