-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Add fast huf_dec with generic C and tuned aarch64 assembly #3155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The is C version of the fast decompression algorithm implemented in huf_decompress_amd64.S. Signed-off-by: Jun He <[email protected]> Change-Id: I964b109f4fd7fc9ca256b280e9add37c84f2e597
This is based on the fast HUF_4x1 decoding firstly introduced by Nick Terrell. It is manually tuned to balance performance across various Arm micro-architectures including N1/A72/A57. Signed-off-by: Jun He <[email protected]> Change-Id: I2de7afd44a4b43cfbedc80747aef4a36c6ae35eb
|
Whats the difference between the C version and the ASM version? It looks like the gains for the aarch64 version are smaller than the gains for the x86-64 version, which could make sense because aarch64 has way less register pressure, and that was the main constraint on x86-64. For those smaller gains, I'd be a bit hesitant about merging an ASM implementation. But would be more open to a generic C version. |
|
Thanks for the comments, @terrelln . I understand the maintenance effort for an assembly implementation. Following is the change comparison between C and ASM for silesia at L8. Pls check.
In 5 of 12 cases, ASm version achieves 1%+ better performance (~2% for sao on both N1 and A72). |
|
Is the C version the current zstd code, or the fast decode C you've written? If it is the latter, what is the difference between zstd's code and the tuned C? |
|
The C version is a kind of "rewritten asm in C" of huf_decompress_amd64.S. I started the porting on Arm with C, then hand tuned asm version for both 4x1 and 4x2. |
|
Hi @terrelln, anything I need to follow for this PR? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @JunHe77, and sorry for the delay in review!
I'm working on a PR for C decoders, similar to what you added here. I was able to eek out a bit more speed (on x86-64, will measure aarch64 as well), and clean up the code a little bit.
I will put my PR up next week, at which point please feel free to benchmark your C / aarch64 ASM implementations against the PR. If there are meaningful gains, we can look into merging.
|
Well received, @terrelln . Will do the benchmark once your PR is ready. Thanks. |
|
Hi @JunHe77, I've just merged PR #3449. I've found that on my M1 chip those loops perform at least as well as your C versions. But I would be happy to accept patches to the fast C loops if you can find a faster variant. If the assembly implementation is still significantly faster than the fast C variant, I would be willing to accept the aarch64 assembly implementation. But it would have to be disabled by default, and require the definition of a macro in the build process The reason it would have to be disabled by default, is that we have no way to continuously fuzz the code. All of our fuzzers run on x86-64 and i386. We want to minimize the amount of code that isn't fuzzed in zstd. I trust that your assembly is correct, but I don't feel comfortable including any non-trivial code in zstd that isn't fuzzed. If we merge it disabled by default, and later oss-fuzz adds support for aarch64, we could start fuzzing that code and switch it to enabled by default. |
This includes implementations of a generic C version of fast decode and a tuned 4x1 assembly version for Arm.
For silesia, observed 3.9% for sao, ~2% for mozilla/ooffice/osdb/x-ray.
As the author of the original algorithm, could you pls help to review this, @terrelln ? Thanks a lot.