rvv: add base64 encoding #716

Draft · wants to merge 1 commit into master
Conversation

@WojciechMula (Collaborator) commented Mar 15, 2025

Addresses #380

@lemire (Member) commented Mar 15, 2025

❤️

@camel-cdr (Contributor) commented Mar 17, 2025

I don't think the global LMUL choice of perform should depend on VLEN.

It's important to avoid LMUL>1 vrgather when possible, as it executes in LMUL^2 cycles.
If we know that the range of elements accessed by an LMUL>1 vrgather fits into a single LMUL=1 register, then we should use multiple LMUL=1 vrgathers instead.
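For illustration, here is a minimal sketch (not taken from the linked code; the helper name is mine) of what splitting an LMUL=4 vrgather into four LMUL=1 vrgathers looks like with the RVV C intrinsics, assuming the index vector already holds within-register indices (0..VLEN/8-1), so no gather crosses an LMUL=1 boundary:

```c
#include <riscv_vector.h>

/* Sketch: one LMUL=4 gather replaced by four LMUL=1 gathers.
   Precondition: idx contains per-register indices, i.e. each destination
   register only gathers from the corresponding LMUL=1 source register. */
static inline vuint8m4_t gather_m4_as_four_m1(vuint8m4_t data, vuint8m4_t idx, size_t vl1)
{
    vuint8m4_t res = __riscv_vundefined_u8m4();
    res = __riscv_vset_v_u8m1_u8m4(res, 0, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(data, 0), __riscv_vget_v_u8m4_u8m1(idx, 0), vl1));
    res = __riscv_vset_v_u8m1_u8m4(res, 1, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(data, 1), __riscv_vget_v_u8m4_u8m1(idx, 1), vl1));
    res = __riscv_vset_v_u8m1_u8m4(res, 2, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(data, 2), __riscv_vget_v_u8m4_u8m1(idx, 2), vl1));
    res = __riscv_vset_v_u8m1_u8m4(res, 3, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(data, 3), __riscv_vget_v_u8m4_u8m1(idx, 3), vl1));
    return res;
}
```

On hardware that decomposes vrgather into LMUL^2 uops, this is 4 uops instead of 16.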

Here is how I would implement base64-encode: https://github.com/camel-cdr/rvv-playground/blob/main/base64_encode.c

I decided to go with LMUL=4 as this maximises LMUL without spilling.

b64_encode_rvv uses four overlapping LMUL=1 reads and subsequently four LMUL=1 vrgathers, as the element rearranging can now happen within LMUL=1 boundaries.
Then the bits are combined as usual at LMUL=4, and the 64-element LUT is done with vrgathers of the minimal possible LMUL for the current VLEN.
For the minimum VLEN of 128, this means one LMUL=4 vrgather, for VLEN=256 two LMUL=2 vrgathers, and for VLEN>=512 four LMUL=1 vrgathers. This should perform best across implementations and vector lengths.
Note that I didn't include the offset-based LUT, because if we go with the LMUL^2 shorthand, the LUT vrgather comes out at 4^2=16, while the offset-based variant needs 5*4=20 at LMUL=4.

The inline VLEN detection has very little overhead on clang-20, but for simdutf it's probably better to do that upfront.
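For reference, a minimal sketch of how such a VLEN-dependent choice could be expressed with the intrinsics (the helper name is illustrative, not from the linked code); VLEN/8 is simply the VLMAX for e8/m1:

```c
#include <riscv_vector.h>

/* Sketch: pick the gather shape for the 64-byte LUT based on VLEN. */
static inline int b64_lut_gathers(void)
{
    size_t vlenb = __riscv_vsetvlmax_e8m1();  /* bytes per LMUL=1 register = VLEN/8 */
    if (vlenb >= 64) return 4;  /* VLEN >= 512: LUT fits in one register, four LMUL=1 vrgathers */
    if (vlenb >= 32) return 2;  /* VLEN == 256: two LMUL=2 vrgathers */
    return 1;                   /* VLEN == 128: one LMUL=4 vrgather */
}
```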

b64_encode_rvvseg is a variant that uses segmented loads/stores. While it's much more compact, I wouldn't recommend it as a default, because it hasn't been established how high-performance OoO cores will implement segmented loads/stores.
Maybe we could add a micro-benchmark-based dispatch to simdutf, to select between such alternative implementations.
That may also help with things like pdep/pext, which perform horribly on some platforms.

@WojciechMula (Collaborator, Author) commented

> It's important to avoid LMUL>1 vrgather when possible, as it executes in LMUL^2 cycles.

Is this a rule imposed by the RVV standard? Sounds strange, TBH.

@WojciechMula (Collaborator, Author) commented

> Maybe we could add a micro-benchmark-based dispatch to simdutf, to select between such alternative implementations.
> That may also help with things like pdep/pext, which perform horribly on some platforms.

I have exactly the same idea. :)
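For concreteness, a hypothetical sketch of what such a dispatch could look like; none of these names are simdutf API, and a real dispatcher would warm the caches and repeat the measurement rather than time a single run:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef size_t (*b64_encode_fn)(char *dst, const char *src, size_t len);

/* candidate kernels (hypothetical names) */
size_t b64_encode_rvv(char *dst, const char *src, size_t len);    /* gather-based  */
size_t b64_encode_rvvseg(char *dst, const char *src, size_t len); /* segment-based */

static uint64_t run_once_ns(b64_encode_fn fn, char *dst, const char *src, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (uint64_t)(t1.tv_sec - t0.tv_sec) * 1000000000u
         + (uint64_t)(t1.tv_nsec - t0.tv_nsec);
}

/* time each candidate on a small buffer and cache the winner */
b64_encode_fn select_b64_encode(void)
{
    static char src[3 * 1024];
    static char dst[4 * 1024];
    uint64_t t_gather  = run_once_ns(b64_encode_rvv,    dst, src, sizeof(src));
    uint64_t t_segment = run_once_ns(b64_encode_rvvseg, dst, src, sizeof(src));
    return t_gather <= t_segment ? b64_encode_rvv : b64_encode_rvvseg;
}
```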

@WojciechMula (Collaborator, Author) commented

> b64_encode_rvvseg is a variant that uses segmented loads/stores. While it's much more compact, I wouldn't recommend it as a default, because it hasn't been established how high-performance OoO cores will implement segmented loads/stores.

IMHO this is the only sane version of encoding. The shuffling approach limits the input size to 64 bytes, as the index is a byte. This approach is VL/LMUL-agnostic.
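To make the VL/LMUL-agnostic structure concrete, here is a hedged sketch of the segment-based skeleton (my own, at LMUL=1 for brevity, using the v1.0 RVV C intrinsics): vlseg3e8 deinterleaves each 3-byte group into three registers, the sextets are extracted with shifts and masks, and vsseg4e8 interleaves four registers back into 4-byte groups. The 6-bit to ASCII translation is left out, so the output here still holds raw sextet values 0..63.

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: split 3-byte groups into four sextets using segmented load/store.
   The sextet -> ASCII step (LUT or offset-based) is intentionally omitted. */
void b64_split_sextets(uint8_t *dst, const uint8_t *src, size_t groups)
{
    while (groups > 0) {
        size_t vl = __riscv_vsetvl_e8m1(groups);          /* vl = number of 3-byte groups */
        vuint8m1x3_t in = __riscv_vlseg3e8_v_u8m1x3(src, vl);
        vuint8m1_t a = __riscv_vget_v_u8m1x3_u8m1(in, 0);
        vuint8m1_t b = __riscv_vget_v_u8m1x3_u8m1(in, 1);
        vuint8m1_t c = __riscv_vget_v_u8m1x3_u8m1(in, 2);

        vuint8m1_t s0 = __riscv_vsrl_vx_u8m1(a, 2, vl);                      /* a >> 2           */
        vuint8m1_t s1 = __riscv_vor_vv_u8m1(                                 /* (a&3)<<4 | b>>4  */
            __riscv_vsll_vx_u8m1(__riscv_vand_vx_u8m1(a, 0x03, vl), 4, vl),
            __riscv_vsrl_vx_u8m1(b, 4, vl), vl);
        vuint8m1_t s2 = __riscv_vor_vv_u8m1(                                 /* (b&15)<<2 | c>>6 */
            __riscv_vsll_vx_u8m1(__riscv_vand_vx_u8m1(b, 0x0f, vl), 2, vl),
            __riscv_vsrl_vx_u8m1(c, 6, vl), vl);
        vuint8m1_t s3 = __riscv_vand_vx_u8m1(c, 0x3f, vl);                   /* c & 63           */

        __riscv_vsseg4e8_v_u8m1x4(dst, __riscv_vcreate_v_u8m1x4(s0, s1, s2, s3), vl);

        src += 3 * vl;
        dst += 4 * vl;
        groups -= vl;
    }
}
```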

@camel-cdr (Contributor) commented Mar 17, 2025

> Is this a rule imposed by the RVV standard? Sounds strange, TBH.

No, but it's an inherent consequence of LMUL.

With regular instructions you can just independently subdivide the work, executing an instruction as LMUL uops (or a similar mechanism) with an LMUL=1 datapath width.
For vrgather this can't be scaled up linearly, because you can permute from any element to any element.
An LMUL=8 vrgather is a fundamentally more powerful operation than eight LMUL=1 vrgather operations, so we should only use it when necessary.
Most implementations have an LMUL=1-wide vrgather primitive, which is applied LMUL^2 times to do the full permute.
This is similar to NEON's TBLn instructions: an LMUL=4 vrgather is equivalent to four TBL4 instructions.

There are implementations that are slightly different:

  • for reference, LMUL^2 gives 1/4/16/64 uops for LMUL=1/2/4/8
  • Ventana, afaik, has a vpermi2/TBL2 primitive, which gives an LMUL scaling of LMUL^2/2, so 1/2/4/32
  • Tenstorrent Ascalon has LMUL*log2(LMUL) scaling, so 1/2/4/24
  • SiFive is quadratic on most cores, but afaik some have an optimization where uops are skipped if the permute stays within a lane. This would give linear scaling for the LUT cases, if the LUT fits in a single LMUL=1 register.

All of this is to say that you should do multiple vrgathers at LMUL=1 instead of one at LMUL>1 when possible. This has linear scaling, so 1/2/4/8, on all hardware implementations.

@WojciechMula (Collaborator, Author) commented

> All of this is to say that you should do multiple vrgathers at LMUL=1 instead of one at LMUL>1 when possible. This has linear scaling, so 1/2/4/8, on all hardware implementations.

Thank you for the in-depth explanation; TBH I didn't pay much attention to possible implementations.

This makes me wonder whether methods that don't use a lookup, but plain comparisons, wouldn't be better. I mean some variant of the "naive methods" from http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html#branchless-code-for-lookup-table.
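For reference, one possible RVV translation of that comparison-based ("naive"/offset) idea, as a hedged sketch at LMUL=1 (not taken from the article or the PR): pick a per-range offset with compares and merges, then add it to the sextet.

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Sketch: branchless 6-bit value -> base64 ASCII via comparisons + merges,
   instead of a vrgather LUT. */
static inline vuint8m1_t sextet_to_ascii(vuint8m1_t v, size_t vl)
{
    vuint8m1_t off = __riscv_vmv_v_x_u8m1(65, vl);                 /* 0..25:  'A' - 0   */
    off = __riscv_vmerge_vxm_u8m1(off, 71,                         /* 26..51: 'a' - 26  */
              __riscv_vmsgtu_vx_u8m1_b8(v, 25, vl), vl);
    off = __riscv_vmerge_vxm_u8m1(off, (uint8_t)-4,                /* 52..61: '0' - 52  */
              __riscv_vmsgtu_vx_u8m1_b8(v, 51, vl), vl);
    off = __riscv_vmerge_vxm_u8m1(off, (uint8_t)-19,               /* 62:     '+' - 62  */
              __riscv_vmseq_vx_u8m1_b8(v, 62, vl), vl);
    off = __riscv_vmerge_vxm_u8m1(off, (uint8_t)-16,               /* 63:     '/' - 63  */
              __riscv_vmseq_vx_u8m1_b8(v, 63, vl), vl);
    return __riscv_vadd_vv_u8m1(v, off, vl);
}
```

At LMUL=4 this is roughly ten simple instructions, which is the kind of instruction count that gets weighed against the vrgather cost in the reply below.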

@camel-cdr (Contributor) commented Mar 17, 2025

> IMHO this is the only sane version of encoding. The shuffling approach limits the input size to 64 bytes, as the index is a byte. This approach is VL/LMUL-agnostic.

You can just do two versions, one with 8-bit indices for VL<=256 and one with 16-bit indices for VL>256.
But yes, the other one is neater.
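For reference, the 16-bit index form is just vrgatherei16 instead of vrgather; a minimal sketch with the intrinsics (illustrative only):

```c
#include <riscv_vector.h>

/* 8-bit indices: enough while at most 256 elements need to be addressed. */
static inline vuint8m1_t gather_idx8(vuint8m1_t data, vuint8m1_t idx, size_t vl)
{
    return __riscv_vrgather_vv_u8m1(data, idx, vl);
}

/* 16-bit indices (EEW=16 index vector, hence vuint16m2_t for u8m1 data). */
static inline vuint8m1_t gather_idx16(vuint8m1_t data, vuint16m2_t idx, size_t vl)
{
    return __riscv_vrgatherei16_vv_u8m1(data, idx, vl);
}
```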

The problem is that, worst case, segmented loads/stores are implemented at one element per cycle.
A sane/good implementation would use LMUL*nf cycles (nf is the segment count), or more realistically LMUL*nf+nf cycles. LMUL*nf+nf would be a design that does the load into a buffer and a separate writeback to the register file for every destination register.

Here are some benchmarks of a four-segment store:

Note, however, that none of the above are high-performance cores, so we'll have to see how the good cores perform. We're supposed to get a Tenstorrent Ascalon devkit next year, which is an 8-wide OoO core. Fingers crossed.

@camel-cdr (Contributor) commented Mar 17, 2025

> This makes me wonder whether methods that don't use a lookup, but plain comparisons, wouldn't be better.

I think this will only be faster for VLEN=128 on some systems with slower permute.

I expect LMUL=1 vrgather to perform like vperm on x86 on all relevant application processors, with relatively low latency and good throughput.

So for VLEN>=256, where we can use two LMUL=2 vrgathers, an alternative implementation would have to use 2-4 instructions per LMUL=1 vector register to be faster than the LMUL=2 vrgathers.
We can expect the two LMUL=2 vrgathers to roughly take (2^2)*2=8 cycles, while a separate implementation would take 4*n cycles, the 4x because of LMUL=4. It's likely that an implementation has a higher throughput on plain comparisons, let's say 2x, so consequently you would need to get the same result in 4 or fewer LMUL=4 instructions to be faster. (4*(4/2)=8)

That means it's only really interesting for VLEN=128, where we have an expected 4^2=16 cycles from the LMUL=4 vrgather. So an implementation could use fewer than 4-8 LMUL=4 instructions to beat the gather implementation. (4-8 for 1x-2x throughput of comparisons compared to permutations)

Btw, I say "cycles" here for simplicity, but I mean it as a relative performance measure.

The segmented load/store implementation is a lot cheaper, except for the loads/stores themselves, because the deinterleaving of bits can happen for the four sextets separately in four LMUL=1 registers, so very little work is duplicated.

@camel-cdr (Contributor) commented Mar 18, 2025

OK, I've implemented the offset-based LUT variant and ran some benchmarks:

On the X60 the 64-element LUT is about 10% faster, and on the C908 the 16-element one is about 20% faster.

This may sound contradictory to what I wrote above, but the cores have a weird architecture due to targeting a very low-power design point:

They have a datapath width of VLEN/2, and two VLEN/2-wide ALUs for some instructions.
These instructions don't include vrgather, so an LMUL=4 vrgather takes (LMUL*2)^2=64 cycles, while an LMUL=4 vadd only takes 4.
Now we are caught on the bad side of a quadratic.
With an otherwise identical architecture, but an ALU width equal to VLEN, we would have had 16 vs 2 instead, which is a 2x better ratio than 64 vs 4.

Contrast the above implementations with the dual-issue, somewhat-OoO C910 core, which can instead execute two LMUL=1 vrgathers per cycle, the same dispatch rate as its vector addition.
There we have an 8 vs 2 ratio.
I suspect that future cores will be more similar to the C910 than the others above.

You can check the instruction throughputs on my rvv-bench website: https://camel-cdr.github.io/rvv-bench-results/index.html

I wasn't yet able to run the benchmark on the C910, because it implements an incompatible draft version of RVV, and I have to backport the codegen for it.

@camel-cdr (Contributor) commented

I managed to run the C910 benchmark now: https://camel-cdr.github.io/rvv-bench-results/milkv_pioneer/base64-encode.html
As expected, the 64-entry LUT without segmented loads/stores performs best, because the C910 has very slow segmented loads/stores, and a dual-issue, single-cycle LMUL=1 vrgather implementation.
