
ggml: aarch64: implement SVE kernels for q6_K_q8_K vector dot #12361

Merged
merged 1 commit into ggml-org:master on Mar 18, 2025

Conversation

fj-y-saito
Contributor

This PR introduces support for SVE (Scalable Vector Extension) kernels for the q6_K_q8_K vector dot on the Arm architecture. A similar SVE support proposal was made in PR #11227.

Verifying Features

This PR contains the SVE implementation of the vector dot used to compute the Q6_K quantization.
By running a Q4_K_M quantized model of Llama-3.1-8B, I confirmed that the values match: the outputs of the NEON and SVE implementations were compared one after another and always agreed.
I also verified that the perplexity matches between the NEON and SVE implementations.

| NEON | SVE (this PR) |
| ---: | ---: |
| 6.5778 +/- 0.04061 | 6.5778 +/- 0.04061 |

Performance check

Performance was measured on an FX700 with llama-bench; the results improved as follows.

Original

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp1 |          2.60 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp2 |          3.07 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp4 |          3.28 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp8 |          3.42 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         pp512 |          3.39 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         tg128 |          2.60 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp1 |          5.24 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp2 |          6.01 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp4 |          6.51 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp8 |          6.80 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         pp512 |          6.74 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         tg128 |          5.17 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp1 |          7.38 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp2 |          8.40 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp4 |          9.19 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp8 |          9.64 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         pp512 |         10.07 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         tg128 |          7.26 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp1 |          9.38 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp2 |         10.74 ± 0.07 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp4 |         11.95 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp8 |         12.63 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         pp512 |         13.35 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         tg128 |          9.21 ± 0.01 |

This PR

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp1 |          4.02 ± 0.06 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp2 |          4.50 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp4 |          4.72 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |           pp8 |          4.85 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         pp512 |          4.58 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      12 |         tg128 |          3.97 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp1 |          7.78 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp2 |          8.69 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp4 |          9.24 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |           pp8 |          9.55 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         pp512 |          9.10 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      24 |         tg128 |          7.60 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp1 |         10.60 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp2 |         12.11 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp4 |         13.10 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |           pp8 |         13.66 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         pp512 |         13.56 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      36 |         tg128 |         10.38 ± 0.01 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp1 |         13.15 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp2 |         15.19 ± 0.03 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp4 |         16.70 ± 0.28 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |           pp8 |         17.73 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         pp512 |         17.95 ± 0.02 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CPU        |      48 |         tg128 |         12.80 ± 0.05 |

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 13, 2025
Comment on lines +8292 to +8295
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_1, q8bytes_1), scale_lane_1);
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_2, q8bytes_2), scale_lane_2);
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_3, q8bytes_3), scale_lane_3);
isum_tmp = svmla_s32_x(pg32_8, isum_tmp, svdot_s32(vzero, q6bytes_4, q8bytes_4), scale_lane_4);
Member

Maybe something to try is to have 4 separate accumulators here. I don't have a machine that supports SVE to give this a try.

@fj-y-saito
Contributor Author
Mar 18, 2025

When I implemented this fix and measured the elapsed time of ggml_vec_dot_q6_K_q8_K with perf, I found a performance degradation of about 5%.
So I think it's better to leave it as it is.

My reasoning is as follows. With separate accumulators:

  • The dependency chain of seven instructions on the critical path inside the loop is reduced to one.
  • Three add instructions have to be added outside the for loop to sum up the separated accumulators (these adds form a dependency chain of two).

In short, the in-loop dependency chain shrinks from 7 to 1, while the instruction count increases by 3.
If the loop trip count were large, the proposed modification would be expected to improve performance. In this case, however, the loop runs only twice, so the slowdown from the extra instructions dominates. (The two variants are sketched below.)
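
For illustration, here is a minimal, hypothetical C sketch of the two accumulation strategies using SVE ACLE intrinsics. It is not the PR's actual kernel: it works on plain int8 buffers with one shared scale vector, and the function names (dot_single_acc, dot_split_acc), pointer layout, and loop structure are made up purely to isolate the dependency-chain difference being discussed.

```c
// Build with an SVE-capable toolchain, e.g. gcc -O2 -march=armv8.2-a+sve
#include <arm_sve.h>
#include <stdint.h>

// One accumulator: the four svmla calls per iteration form a serial chain,
// each one waiting on the result of the previous one.
svint32_t dot_single_acc(const int8_t *a, const int8_t *b,
                         svint32_t scale, int iters) {
    const svbool_t  pg8   = svptrue_b8();
    const svbool_t  pg32  = svptrue_b32();
    const svint32_t vzero = svdup_n_s32(0);
    const int step = (int) svcntb();              // bytes per SVE vector
    svint32_t isum = vzero;
    for (int i = 0; i < iters; ++i) {
        for (int k = 0; k < 4; ++k) {
            const svint8_t va = svld1_s8(pg8, a + (4 * i + k) * step);
            const svint8_t vb = svld1_s8(pg8, b + (4 * i + k) * step);
            isum = svmla_s32_x(pg32, isum, svdot_s32(vzero, va, vb), scale);
        }
    }
    return isum;
}

// Four accumulators: the svmla calls within one iteration are independent,
// but three extra svadd instructions are needed after the loop to combine
// them (a dependency depth of two when added pairwise).
svint32_t dot_split_acc(const int8_t *a, const int8_t *b,
                        svint32_t scale, int iters) {
    const svbool_t  pg8   = svptrue_b8();
    const svbool_t  pg32  = svptrue_b32();
    const svint32_t vzero = svdup_n_s32(0);
    const int step = (int) svcntb();
    svint32_t acc0 = vzero, acc1 = vzero, acc2 = vzero, acc3 = vzero;
    for (int i = 0; i < iters; ++i) {
        const int8_t *pa = a + 4 * i * step;
        const int8_t *pb = b + 4 * i * step;
        acc0 = svmla_s32_x(pg32, acc0,
                svdot_s32(vzero, svld1_s8(pg8, pa), svld1_s8(pg8, pb)), scale);
        acc1 = svmla_s32_x(pg32, acc1,
                svdot_s32(vzero, svld1_s8(pg8, pa + step), svld1_s8(pg8, pb + step)), scale);
        acc2 = svmla_s32_x(pg32, acc2,
                svdot_s32(vzero, svld1_s8(pg8, pa + 2 * step), svld1_s8(pg8, pb + 2 * step)), scale);
        acc3 = svmla_s32_x(pg32, acc3,
                svdot_s32(vzero, svld1_s8(pg8, pa + 3 * step), svld1_s8(pg8, pb + 3 * step)), scale);
    }
    // The three adds below are the extra instructions mentioned above.
    return svadd_s32_x(pg32, svadd_s32_x(pg32, acc0, acc1),
                             svadd_s32_x(pg32, acc2, acc3));
}
```

With a trip count of only 2, the three extra svadd instructions in dot_split_acc are not amortized, which is consistent with the roughly 5% regression reported above.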

ggerganov merged commit d9a1452 into ggml-org:master on Mar 18, 2025
47 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
@a-ghorbani
Contributor

Hey, is this implementation expected to give a boost on mobile devices as well (which AFAIK usually have a 128-bit SVE vector length), or is it meant for server-grade CPUs with larger SVE widths? I tested it on my Pixel 9 (which has SVE and SVE2 at 128 bits) and couldn't see any performance improvement, so I wanted to check whether I am missing something or this is expected.

@fj-y-saito
Contributor Author

This implementation will improve performance if the processor meets the condition below for its SIMD width (a rough throughput comparison is sketched after this list):

SIMD width is 128

  • Number of NEON pipelines < Number of SVE pipelines

SIMD width is 256 or more

  • Number of NEON pipelines < Number of SVE pipelines x 2
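
To make the pipeline arithmetic above concrete, here is a small, hypothetical back-of-the-envelope sketch (not part of the PR). The vector widths and pipeline counts in it are illustrative assumptions, not measured values for any particular CPU; substitute the numbers for your target core.

```c
// Estimate int8 lanes processed per cycle by the NEON and SVE paths,
// given the vector width and the number of SIMD pipelines.
#include <stdio.h>

static int int8_lanes_per_cycle(int vector_bits, int pipelines) {
    return (vector_bits / 8) * pipelines;
}

int main(void) {
    // Assumed phone-class core: 128-bit SVE with as many SVE pipes as NEON
    // pipes -> equal throughput, so no speedup is expected (consistent with
    // the Pixel 9 observation above).
    printf("128-bit SVE: NEON %d vs SVE %d lanes/cycle\n",
           int8_lanes_per_cycle(128, 2), int8_lanes_per_cycle(128, 2));

    // Assumed wide-SVE core (512-bit): each SVE pipe handles 4x the lanes of
    // a 128-bit NEON pipe, so the condition above is easily satisfied.
    printf("512-bit SVE: NEON %d vs SVE %d lanes/cycle\n",
           int8_lanes_per_cycle(128, 2), int8_lanes_per_cycle(512, 2));
    return 0;
}
```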
