rvv: add base64 encoding #716
base: master
Conversation
I don't think the global LMUL choice here is optimal. It's important to avoid LMUL>1 vrgather when possible, as it executes in LMUL^2 cycles. Here is how I would implement base64-encode: https://github.com/camel-cdr/rvv-playground/blob/main/base64_encode.c I decided to go with LMUL=4 as this maximises LMUL without spilling.
The inline VLEN detection has very little overhead on clang-20, but for simdutf it's probably better to do that upfront.
|
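For reference, a minimal sketch of what upfront VLEN detection could look like with the RVV 1.0 C intrinsics (the helper is illustrative, not taken from the linked code):

```c
#include <riscv_vector.h>
#include <stddef.h>

/* Query the vector length once, up front: __riscv_vsetvlmax_e8m1() returns
 * VLEN/8, i.e. the number of 8-bit elements one register holds at LMUL=1. */
static size_t rvv_vlen_bytes(void)
{
    return __riscv_vsetvlmax_e8m1();
}
```

The result could be cached once at startup to pick a kernel, instead of re-deriving it inside the encode loop.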
Is that a rule imposed by the RVV standard? Sounds strange, TBH. |
I have exactly the same idea. :) |
IMHO this is the only sane version of the encoding. The shuffling approach limits the input size to 64 bytes, as the index is a byte. This approach is VL/LMUL-agnostic. |
No, but it's an inherent consequence of LMUL. With regular instructions you can just independently subdivide the work, executing an instruction with LMUL uops (or a similar mechanism) on an LMUL=1 datapath width. There are implementations that are slightly different:
All of this is to say that you should be doing multiple vrgathers at LMUL=1 instead of one at LMUL>1 when possible. This will have linear scaling, so 1/2/4/8, on all hardware implementations. |
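As an aside, a minimal sketch of that pattern with the RVV 1.0 C intrinsics, assuming the whole lookup table fits in a single LMUL=1 register so each half of an LMUL=2 index group can be gathered independently (the helper name is made up for illustration):

```c
#include <riscv_vector.h>

/* Sketch: replace one LMUL=2 vrgather with two LMUL=1 vrgathers over the
 * two halves of the register group.  Valid only when every index fits in a
 * single LMUL=1 register (index < VLEN/8), e.g. a small offset table. */
static inline vuint8m2_t gather_split_m1(vuint8m1_t lut, vuint8m2_t idx)
{
    size_t vl1 = __riscv_vsetvlmax_e8m1();   /* elements in one LMUL=1 register */
    vuint8m1_t lo = __riscv_vrgather_vv_u8m1(lut, __riscv_vget_v_u8m2_u8m1(idx, 0), vl1);
    vuint8m1_t hi = __riscv_vrgather_vv_u8m1(lut, __riscv_vget_v_u8m2_u8m1(idx, 1), vl1);
    vuint8m2_t out = __riscv_vundefined_u8m2();
    out = __riscv_vset_v_u8m1_u8m2(out, 0, lo);
    out = __riscv_vset_v_u8m1_u8m2(out, 1, hi);
    return out;
}
```

On hardware where an LMUL=2 gather executes in LMUL^2=4 uops, the two independent LMUL=1 gathers cost only 2.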
Thank you for the in-depth explanation; TBH I didn't pay much attention to possible implementations. This makes me wonder whether methods that do not use a lookup, but plain comparisons, wouldn't be better. I mean some variant of the "naive methods" from http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html#branchless-code-for-lookup-table. |
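For reference, a scalar sketch of one such comparison-based variant (my own illustration of the idea from that article, not code from this PR); each compare would map onto an RVV compare plus masked add:

```c
#include <stdint.h>

/* Branchless base64 alphabet lookup: start from 'A' and adjust the offset
 * with comparisons instead of a table.  Input must be a sextet (0..63). */
static inline uint8_t b64_encode_sextet(uint8_t i)
{
    int off = 'A';              /* 0..25  -> 'A'..'Z' (offset 65)  */
    off += (i >= 26) *   6;     /* 26..51 -> 'a'..'z' (offset 71)  */
    off += (i >= 52) * -75;     /* 52..61 -> '0'..'9' (offset -4)  */
    off += (i == 62) * -15;     /* 62     -> '+'      (offset -19) */
    off += (i == 63) *   3;     /* 63     -> '/'      (offset -16) */
    return (uint8_t)(i + off);
}
```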
You can just do two versions: one with 8-bit indices for VL<=256 and one with 16-bit indices for VL>256. The problem is that worst-case segmented loads/stores are implemented at one element per cycle. Here are some benchmarks of a four-segment store:
Note however that none of the above are high-performance cores, so we'll have to see how the good cores perform. We're supposed to get a Tenstorrent Ascalon devkit next year, which is an 8-wide OoO core. Fingers crossed. |
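To make the four-segment store concrete, here is a small sketch with the RVV 1.0 tuple intrinsics (names follow the intrinsics spec; toolchain support for the tuple forms varies):

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Sketch: write four LMUL=1 registers, one per base64 output character
 * position, as interleaved bytes with a 4-field segment store.  The store
 * emits c0[0],c1[0],c2[0],c3[0],c0[1],... which is the encoded byte order,
 * but its throughput differs a lot between implementations. */
static inline void store_segments4(uint8_t *dst,
                                   vuint8m1_t c0, vuint8m1_t c1,
                                   vuint8m1_t c2, vuint8m1_t c3, size_t vl)
{
    vuint8m1x4_t tup = __riscv_vcreate_v_u8m1x4(c0, c1, c2, c3);
    __riscv_vsseg4e8_v_u8m1x4(dst, tup, vl);
}
```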
I think this will only be faster for VLEN=128 on some systems with slower permute. I expect LMUL=1 vrgather to perform like vperm on x86 on all relevant application processors, with relatively low latency and good throughput. So for VLEN>256, where we can use two LMUL=2 vrgathers, an alternative implementation would have to use 2-4 instructions per LMUL=1 vector register to be faster than the LMUL=2 vrgathers. That means it's only really interesting for VLEN=128, where we have an expected 4^2=16 cycles from the LMUL=4 vrgather. So an implementation would need fewer than 4-8 LMUL=4 instructions to beat the gather implementation. (4-8 for 1x-2x throughput of comparisons compared to permutations) Btw, I say "cycles" here for simplicity, but I mean it as a relative performance measure. The segmented load/store implementation is a lot cheaper, except for the load/stores themselves, because the deinterleaving of bits can happen for the four sextets separately in four LMUL=1 registers, so very little work is duplicated. |
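A sketch of that per-sextet deinterleaving, assuming a 3-field segment load and the RVV 1.0 intrinsics (helper name and structure are illustrative only):

```c
#include <riscv_vector.h>
#include <stdint.h>

/* Sketch: load vl groups of 3 input bytes and split them into the four
 * 6-bit indices, each kept in its own LMUL=1 register, so the bit
 * extraction is done once per sextet and nothing is duplicated. */
static inline void load_sextets(const uint8_t *src, size_t vl, vuint8m1_t s[4])
{
    vuint8m1x3_t in = __riscv_vlseg3e8_v_u8m1x3(src, vl);
    vuint8m1_t b0 = __riscv_vget_v_u8m1x3_u8m1(in, 0);
    vuint8m1_t b1 = __riscv_vget_v_u8m1x3_u8m1(in, 1);
    vuint8m1_t b2 = __riscv_vget_v_u8m1x3_u8m1(in, 2);

    s[0] = __riscv_vsrl_vx_u8m1(b0, 2, vl);                                   /* b0 >> 2            */
    s[1] = __riscv_vor_vv_u8m1(
               __riscv_vsll_vx_u8m1(__riscv_vand_vx_u8m1(b0, 0x03, vl), 4, vl),
               __riscv_vsrl_vx_u8m1(b1, 4, vl), vl);                          /* (b0&3)<<4 | b1>>4  */
    s[2] = __riscv_vor_vv_u8m1(
               __riscv_vsll_vx_u8m1(__riscv_vand_vx_u8m1(b1, 0x0f, vl), 2, vl),
               __riscv_vsrl_vx_u8m1(b2, 6, vl), vl);                          /* (b1&15)<<2 | b2>>6 */
    s[3] = __riscv_vand_vx_u8m1(b2, 0x3f, vl);                                /* b2 & 63            */
}
```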
Ok, I've implemented the offset-based LUT approach and ran some benchmarks:
On the X60 the 64-element LUT is about 10% faster and on the C908 the 16-element one is about 20% faster. This may sound contradictory to what I wrote above, but the cores have a weird architecture due to targeting a very low-power design point: they have a datapath width of VLEN/2, and two VLEN/2-wide ALUs for some instructions. Contrast the above implementation with the dual-issue, somewhat OoO C910 core, which instead can execute two LMUL=1 vrgathers per cycle, the same dispatch rate as its vector addition. You can check the instruction throughputs on my rvv-bench website: https://camel-cdr.github.io/rvv-bench-results/index.html I haven't yet been able to run the benchmark on the C910, because it implements an incompatible draft version of RVV, and I have to backport the codegen for it. |
I managed to run the C910 benchmark now: https://camel-cdr.github.io/rvv-bench-results/milkv_pioneer/base64-encode.html |
Addresses #380