ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot #11227
Conversation
Could you post the results of the following command?
llama-bench --model ${PATH_TO_MODEL} -p 1,2,4,8,512 -t 8,16,32,48
Thank you for your quick reply.
[benchmark tables: Original (NEON) vs. This PR (SVE)]
ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot (ggml-org#11227)

* Add SVE support for q4_K_q8_K
* Update ggml/src/ggml-cpu/ggml-cpu-quants.c: change to use K_SCALE_SIZE

Co-authored-by: Georgi Gerganov <[email protected]>
@fj-y-saito I suspect that there is a bug in this change. Recently the GitHub CI Arm runners started supporting SVE, and I noticed that one of the jobs fails: https://github.com/ggml-org/llama.cpp/actions/runs/13738980652/job/38426122244?pr=12269#step:6:12864
OK, I'll check.
I think the test program is missing [...]. This fix now passes the test in my environment.
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q4_K_q8_K vector dot on the Arm architecture. A similar proposal for SVE support was made in PR #7433.
Verifying Features
This PR contains the SVE implementation of the vector dot product used to compute the Q4_K quantization.
Running a Q4_K_M-quantized Llama-3.1-8B model, I compared the outputs of the NEON and SVE implementations one after another and confirmed that the values always match.
I also verified that the perplexity matches between the NEON and SVE implementations.
Performance Check
Performance was measured on a Fujitsu FX700.
Performance improves as follows; the values are tokens per second.
The command used to measure the performance is