ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot #11227
Conversation
Could you post the results of the following command?
llama-bench --model ${PATH_TO_MODEL} -p 1,2,4,8,512 -t 8,16,32,48
Thank you for your quick reply.
[benchmark tables: Original (NEON) vs. This PR (SVE)]
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (ggml-org#11227)

* Add SVE support for q4_K_q8_K
* Update ggml/src/ggml-cpu/ggml-cpu-quants.c: change to use K_SCALE_SIZE

Co-authored-by: Georgi Gerganov <[email protected]>
@fj-y-saito I suspect that there is a bug in this change. Recently the GitHub CI Arm runners started supporting SVE, and I noticed that one of the jobs fails: https://github.com/ggml-org/llama.cpp/actions/runs/13738980652/job/38426122244?pr=12269#step:6:12864
OK, I'll check.
I think the test program is missing [...]. This fix now passes the test in my environment.
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_K_q8_K vector dot on the Arm architecture. A similar proposal for SVE support was made in PR #7433.
Verifying Features
This PR contains the SVE implementation of the vector dot product used to compute the Q4_K quantization.
Running a Q4_K_M-quantized Llama-3.1-8B model, I compared the outputs of the NEON and SVE implementations one after another and confirmed that the values always match.
I also verified that the perplexity matches between the NEON and SVE implementations.
Performance Check
Performance was measured on a Fujitsu FX700.
Performance improves as follows; the values are tokens per second.
The command used to measure the performance is