Conversation

@JunHe77
Copy link
Contributor

@JunHe77 JunHe77 commented Jun 7, 2022

This includes a generic C implementation of the fast decode loop and a tuned 4x1 assembly version for Arm.
For silesia, I observed a 3.9% gain for sao and ~2% for mozilla/ooffice/osdb/x-ray.
As the author of the original algorithm, could you please help review this, @terrelln? Thanks a lot.

JunHe77 added 2 commits June 8, 2022 23:23
This is a C version of the fast decompression algorithm
implemented in huf_decompress_amd64.S.

Signed-off-by: Jun He <[email protected]>
Change-Id: I964b109f4fd7fc9ca256b280e9add37c84f2e597
This is based on the fast HUF_4x1 decoding first
introduced by Nick Terrell.
It is manually tuned to balance performance across
various Arm micro-architectures, including N1/A72/A57.

Signed-off-by: Jun He <[email protected]>
Change-Id: I2de7afd44a4b43cfbedc80747aef4a36c6ae35eb
@terrelln
Copy link
Contributor

What's the difference between the C version and the ASM version?

It looks like the gains for the aarch64 version are smaller than the gains for the x86-64 version. That could make sense, because aarch64 has much less register pressure, and register pressure was the main constraint on x86-64.

For those smaller gains, I'd be a bit hesitant about merging an ASM implementation. But would be more open to a generic C version.

@JunHe77
Copy link
Contributor Author

JunHe77 commented Aug 8, 2022

Thanks for the comments, @terrelln. I understand the maintenance effort for an assembly implementation. Below is the performance comparison between the C and ASM versions for silesia at level 8. Please check.

| data file | ASM/C on N1 | ASM/C on A72 |
| --- | --- | --- |
| dickens | -0.19% | 1.22% |
| mozilla | 1.17% | 0.33% |
| mr | 0.21% | -1.38% |
| nci | 0.29% | 0.76% |
| ooffice | 1.01% | -0.12% |
| osdb | 0.92% | 0.54% |
| reymont | 0.11% | 1.27% |
| samba | 0.19% | 1.51% |
| sao | 2.32% | 1.87% |
| webster | 0.17% | 0.42% |
| x-ray | 0.97% | 1.42% |
| xml | -0.21% | 0.16% |

In 5 of 12 cases, the ASM version achieves 1%+ better performance (~2% for sao on both N1 and A72).

@terrelln
Copy link
Contributor

terrelln commented Aug 16, 2022

Is the C version the current zstd code, or the fast decode C you've written? If it is the latter, what is the difference between zstd's code and the tuned C?

@JunHe77
Copy link
Contributor Author

JunHe77 commented Aug 17, 2022

The C version is essentially huf_decompress_amd64.S rewritten in C. I started the port on Arm in C, then hand-tuned asm versions for both 4x1 and 4x2.
In C, both 4x1 and 4x2 showed an uplift on Arm. The 4x1 asm version brought further gains on certain test data, but the 4x2 asm version showed no obvious improvement. That's why only the 4x1 asm version is in this PR. 😄

@JunHe77
Copy link
Contributor Author

JunHe77 commented Jan 11, 2023

Hi @terrelln, is there anything I need to follow up on for this PR? Thanks.

Copy link
Contributor

@terrelln terrelln left a comment


Thanks for the PR @JunHe77, and sorry for the delay in review!

I'm working on a PR for C decoders, similar to what you added here. I was able to eke out a bit more speed (on x86-64; I will measure aarch64 as well) and clean up the code a little.

I will put my PR up next week, at which point please feel free to benchmark your C / aarch64 ASM implementations against the PR. If there are meaningful gains, we can look into merging.

@JunHe77
Copy link
Contributor Author

JunHe77 commented Jan 16, 2023

Well received, @terrelln . Will do the benchmark once your PR is ready. Thanks.

@terrelln
Copy link
Contributor

Hi @JunHe77, I've just merged PR #3449.

I've found that on my M1 chip those loops perform at least as well as your C versions. But I would be happy to accept patches to the fast C loops if you can find a faster variant.

If the assembly implementation is still significantly faster than the fast C variant, I would be willing to accept the aarch64 assembly implementation. But it would have to be disabled by default, and require defining a macro, ZSTD_ENABLE_AARCH64_ASM, in the build process to enable it.

The reason it would have to be disabled by default is that we have no way to continuously fuzz the code. All of our fuzzers run on x86-64 and i386, and we want to minimize the amount of unfuzzed code in zstd. I trust that your assembly is correct, but I don't feel comfortable including any non-trivial code in zstd that isn't fuzzed.

If we merge it disabled by default, and later oss-fuzz adds support for aarch64, we could start fuzzing that code and switch it to enabled by default.
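A hypothetical opt-in build under that scheme might look like the following; passing the macro via `CPPFLAGS` assumes a make-based build that forwards preprocessor flags, so the exact plumbing is an assumption here:

```shell
# Opt in to the aarch64 assembly loops at build time (disabled by default).
# ZSTD_ENABLE_AARCH64_ASM is the gating macro proposed above.
make CPPFLAGS="-DZSTD_ENABLE_AARCH64_ASM"
```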

