# Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, and 128-bit Vector lengths #9290
## Conversation
Please also provide numbers from …
Thanks for the comment! I ran perplexity with SVE 128, 256, and 512-bit and non-SVE (NEON). The following are the commands and partial logs.

**Q_8**

Q_8 Graviton Neon
$ ./build_neon/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q8_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 375.16 ms

Q_8 Graviton SVE 256 (Merged)
$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 364.19 ms

Q_8 SVE 128 GRACE VM 32
$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q8_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 202.84 ms

Q_8 SVE 512 FX700
$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q8_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 48 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_print_timings: load time = 6316.84 ms

**Q_4**

Q_4 Neon Graviton
$ ./build_neon/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 241.27 ms

Q_4 SVE 128 GRACE VM 32
$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 135.31 ms

Q_4 SVE 256 Graviton
$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 222.06 ms

Q_4 SVE 512 FX700
$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q4_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 48 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Thanks @Vithulep, I (accidentally) compiled master with your PR with …
> Can you try replacing the statement …

I have used svcntb() while running, so I never faced a segmentation fault.
With svcntb it fails as well. It's crashing instantly. Maybe it's a different issue entirely, because I don't see any logs from loading the model.
@Vithulep better to ignore my segfault for now and go ahead with the PR. It could be some compiler bug on my end with the clang build or the Arm sysroot it is using. I cannot reproduce the crash with gcc right now.
I want to investigate it further; can you give me the compiler configuration of your clang build and other system information? Thanks for reporting it.
Sure.
A few more details I was able to collect last night:
Here are the numbers from …

## Q_8

On a FX700 48-core machine

On a GRACE VM 32-core machine
@Vithulep Since there are quite a number of instruction sets to keep track of, can you help me understand how these kernels compare to the …? I'm interested in a comparison in terms of:
Previously, in #5780, SVE was implemented specifically for the 256-bit vector length; if the architecture only supports a 128-bit or 512-bit SVE vector length, the code falls back to NEON intrinsics. So, to get the best performance out of architectures whose vector length is not 256-bit, we have implemented vector-length-agnostic (VLA) SVE: the code uses SVE intrinsics matching whichever vector length is available. The INT8MM kernel is currently implemented with NEON intrinsics and is used in combination with either the SVE or the NEON kernels.
Thanks, this is helpful. Since this PR adds 256-bit SVE support as well, have you compared the performance against the 256-bit SVE implementation on …? I would happily run the benchmarks myself, but I don't have access to hardware that supports SVE.
In my PR, I have implemented SVE for the 128-bit and 512-bit vector lengths; the 256-bit SVE implementation is taken from #7433, and all of them are dispatched through the switch case. So there is no performance difference in the 256-bit SVE case. Here are the results for Q8 with 256-bit SVE; the results for Q4 are similar.
- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts
Thank you, I understand now. I've sent a few style changes: Vithulep#1. We can merge after it is accepted.
ggml : style changes + fix 512-bit nb loop check
Thanks. I have cross-checked it.
Co-authored-by: Georgi Gerganov <[email protected]>
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
* Removed WhiteSpaces
* ggml : style changes + fix 512-bit nb loop check
  - fix local scope in switch cases
  - consistent predicate names
  - empty lines when necessary
  - opening braces, spaces
  - const-correctness
  - add asserts
* Update ggml/src/ggml-quants.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This PR introduces vector-length-agnostic (VLA) support in the SVE (Scalable Vector Extension) kernels for the q4_0_q8_0 and q8_0_q8_0 vector dot products on the Arm architecture. It covers 128-bit, 256-bit, and 512-bit vector lengths, and the VLA dispatch is implemented using a switch case. The 256-bit SVE implementation is taken from PR #7433; the 128-bit and 512-bit SVE implementations are written from scratch.
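A minimal sketch of the dispatch idea, assuming an SVE-capable compiler and hardware; this is not the actual ggml-quants.c code, and the kernel names mentioned in the comments are placeholders. It only shows how the runtime vector length reported by svcntb() can select one of the fixed-width paths in a switch case:

```c
// Sketch only: compile with e.g. `gcc -O2 -march=armv8-a+sve vla_dispatch.c`
#include <arm_sve.h>   // ACLE SVE intrinsics, provides svcntb()
#include <stdio.h>

int main(void) {
    // svcntb() returns the SVE vector length in bytes at run time:
    // 16, 32 or 64 bytes for 128-, 256- and 512-bit vectors respectively.
    const unsigned long vl_bits = svcntb() * 8;

    switch (vl_bits) {
        case 128:
            printf("would run the 128-bit SVE dot-product kernel\n");
            break;
        case 256:
            printf("would run the 256-bit SVE dot-product kernel\n");
            break;
        case 512:
            printf("would run the 512-bit SVE dot-product kernel\n");
            break;
        default:
            // Any other vector length keeps using the existing NEON path.
            printf("would fall back to NEON (VL = %lu bits)\n", vl_bits);
            break;
    }
    return 0;
}
```

In the real kernels, each case holds the SVE implementation for that vector width, while unsupported lengths fall back to the existing NEON code.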
Comparing NEON against SVE on machines that support 128-bit (GRACE VM, 32 CPUs), 256-bit (Graviton 3), and 512-bit (FX700) vector lengths, we observe that 128-bit SVE (GRACE VM, 32 CPUs) is roughly 1.1x to 1.5x faster than NEON. Similarly, 512-bit SVE (FX700) is roughly 2x to 4.7x faster than NEON, and from PR #7433 we know that 256-bit SVE (Graviton 3) is roughly 1.1x to 1.5x faster than NEON.
Here are the performance numbers measured on a FX700 48-core (A64FX) machine and a GRACE VM 32-CPU machine.
### Q4_0_Q8_0
$ ./build/bin/llama-cli --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q4_0.gguf-prompt.bin
### Q8_0_Q8_0
$ ./build/bin/llama-cli --model models/llama-2-7b-chat.Q8_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q8_0.gguf-prompt.bin
Performance measured on a FX700 48-core (A64FX) machine.
Performance measured on a GRACE VM 32-CPU machine.