Add fast huf_dec with generic C and tuned aarch64 assembly #3155

JunHe77 · 2022-06-07T07:52:05Z

This includes implementations of a generic C version of fast decode and a tuned 4x1 assembly version for Arm.
On silesia, I observed a 3.9% gain for sao and ~2% gains for mozilla/ooffice/osdb/x-ray.
As the author of the original algorithm, could you please help review this, @terrelln? Thanks a lot.

This is a C version of the fast decompression algorithm implemented in huf_decompress_amd64.S.

Signed-off-by: Jun He <[email protected]>
Change-Id: I964b109f4fd7fc9ca256b280e9add37c84f2e597

This is based on the fast HUF_4x1 decoding first introduced by Nick Terrell. It is manually tuned to balance performance across various Arm micro-architectures, including N1/A72/A57.

Signed-off-by: Jun He <[email protected]>
Change-Id: I2de7afd44a4b43cfbedc80747aef4a36c6ae35eb

terrelln · 2022-07-29T17:30:45Z

What's the difference between the C version and the ASM version?

It looks like the gains for the aarch64 version are smaller than the gains for the x86-64 version, which could make sense because aarch64 has way less register pressure, and that was the main constraint on x86-64.

For those smaller gains, I'd be a bit hesitant about merging an ASM implementation. But I would be more open to a generic C version.

JunHe77 · 2022-08-08T10:17:56Z

Thanks for the comments, @terrelln. I understand the maintenance effort for an assembly implementation. The following table compares ASM against C for silesia at L8. Please check.

| data file | ASM/C on N1 | ASM/C on A72 |
| --- | --- | --- |
| dickens | -0.19% | 1.22% |
| mozilla | 1.17% | 0.33% |
| mr | 0.21% | -1.38% |
| nci | 0.29% | 0.76% |
| ooffice | 1.01% | -0.12% |
| osdb | 0.92% | 0.54% |
| reymont | 0.11% | 1.27% |
| samba | 0.19% | 1.51% |
| sao | 2.32% | 1.87% |
| webster | 0.17% | 0.42% |
| x-ray | 0.97% | 1.42% |
| xml | -0.21% | 0.16% |

In 5 of 12 cases, the ASM version achieves 1%+ better performance (~2% for sao on both N1 and A72).
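The 1% threshold can be checked against the table with a small tally. This is only a sketch: the `Toy*` names are hypothetical, and the numbers are copied from the table above. With these figures the tally comes out to 3 files on N1 and 5 files on A72.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table above: ASM/C change in percent on each core. */
typedef struct { const char *file; double n1; double a72; } ToyRow;

static const ToyRow toy_rows[] = {
    {"dickens", -0.19,  1.22}, {"mozilla", 1.17,  0.33},
    {"mr",       0.21, -1.38}, {"nci",     0.29,  0.76},
    {"ooffice",  1.01, -0.12}, {"osdb",    0.92,  0.54},
    {"reymont",  0.11,  1.27}, {"samba",   0.19,  1.51},
    {"sao",      2.32,  1.87}, {"webster", 0.17,  0.42},
    {"x-ray",    0.97,  1.42}, {"xml",    -0.21,  0.16},
};

/* Count the files whose change is at least `threshold` percent.
 * use_a72 == 0 reads the N1 column, anything else reads the A72 column. */
static size_t toy_count_at_least(double threshold, int use_a72)
{
    size_t count = 0;
    size_t i;
    for (i = 0; i < sizeof(toy_rows) / sizeof(toy_rows[0]); ++i) {
        double v = use_a72 ? toy_rows[i].a72 : toy_rows[i].n1;
        if (v >= threshold) ++count;
    }
    return count;
}

/* Check the tally stated in the lead-in. */
static void toy_check_tally(void)
{
    assert(toy_count_at_least(1.0, 0) == 3); /* N1: mozilla, ooffice, sao */
    assert(toy_count_at_least(1.0, 1) == 5); /* A72: dickens, reymont, samba, sao, x-ray */
}
```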

terrelln · 2022-08-16T17:12:34Z

Is the C version the current zstd code, or the fast decode C you've written? If it is the latter, what is the difference between zstd's code and the tuned C?

JunHe77 · 2022-08-17T01:58:47Z

The C version is a kind of "rewritten asm in C" of huf_decompress_amd64.S. I started the port to Arm in C, then hand-tuned asm versions for both 4x1 and 4x2.
For the C version, both 4x1 and 4x2 showed an uplift on Arm. While further boosts are observed on certain test data with 4x1_asm, no obvious boost is found with 4x2_asm. That's why this PR contains only a 4x1 asm version. 😄
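For readers unfamiliar with the 4x1 shape, here is a toy sketch of the idea: four independent bitstreams are decoded in an interleaved loop, and each table lookup yields one symbol. This is not the code in this PR or in zstd. The table, the three-symbol code, the left-aligned 64-bit containers, and all `toy_*` names are assumptions made for illustration, and the sketch omits bitstream refills and bounds checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_TABLE_LOG 2 /* hypothetical: the longest code is 2 bits */

typedef struct { uint8_t symbol; uint8_t nbBits; } ToyEntry;

/* Toy prefix code: A = 0, B = 10, C = 11. Indexed by the next 2 bits. */
static const ToyEntry toy_table[1 << TOY_TABLE_LOG] = {
    {'A', 1}, {'A', 1}, {'B', 2}, {'C', 2}
};

/* Decode perStream symbols from each of 4 left-aligned bit containers.
 * Stream s writes to out[s * perStream ...]. There is no refill, so each
 * stream must fit in 64 bits. */
static void toy_decode_4x1(const uint64_t in[4], char *out, size_t perStream)
{
    uint64_t bits[4] = { in[0], in[1], in[2], in[3] };
    size_t i;
    int s;
    for (i = 0; i < perStream; ++i) {
        /* The 4 streams are independent, so a CPU can overlap their lookups. */
        for (s = 0; s < 4; ++s) {
            ToyEntry e = toy_table[bits[s] >> (64 - TOY_TABLE_LOG)];
            out[(size_t)s * perStream + i] = (char)e.symbol;
            bits[s] <<= e.nbBits; /* consume the bits of the decoded code */
        }
    }
}

/* Usage: four hand-encoded streams of four symbols each. */
static void toy_demo(void)
{
    const uint64_t in[4] = {
        (uint64_t)0x16 << 58, /* 0 10 11 0   -> "ABCA" */
        0,                    /* 0 0 0 0     -> "AAAA" */
        (uint64_t)0xFA << 56, /* 11 11 10 10 -> "CCBB" */
        (uint64_t)0x26 << 58  /* 10 0 11 0   -> "BACA" */
    };
    char out[16];
    toy_decode_4x1(in, out, 4);
    assert(memcmp(out, "ABCAAAAACCBBBACA", 16) == 0);
}
```

A 4x2 variant would differ in the table: each lookup could emit up to two symbols, at the cost of a larger table.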

JunHe77 · 2023-01-11T06:55:57Z

Hi @terrelln, anything I need to follow for this PR? Thanks.

terrelln

Thanks for the PR @JunHe77, and sorry for the delay in review!

I'm working on a PR for C decoders, similar to what you added here. I was able to eke out a bit more speed (on x86-64, will measure aarch64 as well), and clean up the code a little bit.

I will put my PR up next week, at which point please feel free to benchmark your C / aarch64 ASM implementations against the PR. If there are meaningful gains, we can look into merging.

JunHe77 · 2023-01-16T06:38:45Z

Well received, @terrelln . Will do the benchmark once your PR is ready. Thanks.

terrelln · 2023-01-25T21:59:54Z

Hi @JunHe77, I've just merged PR #3449.

I've found that on my M1 chip those loops perform at least as well as your C versions. But I would be happy to accept patches to the fast C loops if you can find a faster variant.

If the assembly implementation is still significantly faster than the fast C variant, I would be willing to accept the aarch64 assembly implementation. But it would have to be disabled by default, and enabling it would require defining a macro, ZSTD_ENABLE_AARCH64_ASM, in the build process.

The reason it would have to be disabled by default is that we have no way to continuously fuzz the code. All of our fuzzers run on x86-64 and i386. We want to minimize the amount of code that isn't fuzzed in zstd. I trust that your assembly is correct, but I don't feel comfortable including any non-trivial code in zstd that isn't fuzzed.

If we merge it disabled by default, and later oss-fuzz adds support for aarch64, we could start fuzzing that code and switch it to enabled by default.
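A disabled-by-default gate of this kind could look roughly like the sketch below. Only the macro name ZSTD_ENABLE_AARCH64_ASM comes from the comment above. The `TOY_*` and `toy_*` names are hypothetical, and this is not how zstd's build actually wires its assembly.

```c
#include <assert.h>

/* Opt-in gate: the assembly path is compiled in only when the builder
 * defines ZSTD_ENABLE_AARCH64_ASM and the target is aarch64.
 * TOY_USE_AARCH64_ASM is a hypothetical internal switch. */
#if defined(ZSTD_ENABLE_AARCH64_ASM) && defined(__aarch64__)
#  define TOY_USE_AARCH64_ASM 1
#else
#  define TOY_USE_AARCH64_ASM 0
#endif

typedef enum { TOY_LOOP_C = 0, TOY_LOOP_AARCH64_ASM = 1 } ToyLoopKind;

/* Report which fast loop this build would use. */
static ToyLoopKind toy_select_loop(void)
{
#if TOY_USE_AARCH64_ASM
    return TOY_LOOP_AARCH64_ASM;
#else
    return TOY_LOOP_C; /* default: the fuzzed, generic C loop */
#endif
}

/* Sanity check: the selected loop always matches the compile-time gate. */
static void toy_check_gate(void)
{
    assert((int)toy_select_loop() == TOY_USE_AARCH64_ASM);
}
```

Under this sketch, a build without `-DZSTD_ENABLE_AARCH64_ASM` keeps the C loop on every target, so the unfuzzed code stays out of default builds.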

facebook-github-bot added the CLA Signed label Jun 7, 2022

JunHe77 added 2 commits June 8, 2022 23:23

decomp: add generic fast huf4x* decoding

0110045

This is a C version of the fast decompression algorithm implemented in huf_decompress_amd64.S.

Signed-off-by: Jun He <[email protected]>
Change-Id: I964b109f4fd7fc9ca256b280e9add37c84f2e597

JunHe77 force-pushed the huf_dec branch from 672e3ce to 9a7d7da Compare June 8, 2022 15:24

embg assigned terrelln Jun 8, 2022

JunHe77 mentioned this pull request Jun 24, 2022

Update Copyright Comments #3173

Merged

terrelln mentioned this pull request Jan 13, 2023

Add faster Huffman decoding in generic C #3425

Closed

terrelln reviewed Jan 14, 2023

iksaif mentioned this pull request Sep 18, 2023

Possible performance regressions on some CPUs after #3449 (C fast loops) #3762

Closed

terrelln closed this Mar 4, 2024
