
ggml: aarch64: Implement SVE Kernels for Int 8 Quantization #14117

Open · wants to merge 19 commits into master
Conversation

Vithulep (Contributor)

This PR adds SVE kernel support for the Int8 data type on the ARM architecture.
Major code changes:

  1. Implement SVE intrinsics code for quantize_row_q8_0()
  2. Implement SVE intrinsics code for dequantize_row_q8_0()
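For illustration only, here is a minimal sketch of what an SVE-based Q8_0 quantization kernel can look like. This is not the PR's actual code: the block layout is simplified (a plain `float` scale instead of ggml's fp16 `d` field), the name `quantize_row_q8_0_sve_sketch` is made up for this example, and the predication and rounding choices are assumptions. It only shows the general shape of the technique: find the per-block max magnitude, derive the scale, then scale, round, and narrow to int8. It assumes the ACLE intrinsics from `<arm_sve.h>` and a compiler targeting SVE (e.g. `-march=armv8-a+sve`).

```c
#include <arm_sve.h>
#include <stdint.h>

#define QK8_0 32

// Simplified Q8_0 block: ggml's real block_q8_0 stores the scale as fp16 (ggml_half);
// a plain float is used here only to keep the sketch self-contained.
typedef struct {
    float  d;           // per-block scale
    int8_t qs[QK8_0];   // quantized values
} block_q8_0_sketch;

static void quantize_row_q8_0_sve_sketch(const float * x, block_q8_0_sketch * y, int64_t k) {
    const int64_t nb = k / QK8_0;             // number of blocks
    const int64_t vl = (int64_t) svcntw();    // 32-bit lanes per SVE vector

    for (int64_t i = 0; i < nb; i++) {
        const float * xb = x + i*QK8_0;

        // 1) find the max |x| in the block (predicated, so any vector length works)
        svfloat32_t vmax = svdup_n_f32(0.0f);
        for (int64_t j = 0; j < QK8_0; j += vl) {
            const svbool_t    pg = svwhilelt_b32_s64(j, (int64_t) QK8_0);
            const svfloat32_t v  = svld1_f32(pg, xb + j);
            vmax = svmax_f32_m(pg, vmax, svabs_f32_x(pg, v));
        }
        const float amax = svmaxv_f32(svptrue_b32(), vmax);

        // 2) derive the per-block scale and its reciprocal
        const float d  = amax / 127.0f;
        const float id = (d != 0.0f) ? 1.0f/d : 0.0f;
        y[i].d = d;

        // 3) scale, round to nearest, convert to int32, and store the low byte of each lane
        for (int64_t j = 0; j < QK8_0; j += vl) {
            const svbool_t    pg = svwhilelt_b32_s64(j, (int64_t) QK8_0);
            const svfloat32_t v  = svld1_f32(pg, xb + j);
            const svfloat32_t vs = svrinta_f32_x(pg, svmul_n_f32_x(pg, v, id));
            const svint32_t   vi = svcvt_s32_f32_x(pg, vs);
            svst1b_s32(pg, y[i].qs + j, vi);  // narrowing store: int32 -> int8
        }
    }
}
```

The `svwhilelt` predicate keeps the inner loops correct for any SVE vector length, at the cost of recomputing the predicate each iteration; a vector-length-specialized kernel could hoist or drop it.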

Performance

Performance is nearly the same before and after this PR; the change introduces an SVE intrinsic implementation for the Int8 path used by Mamba while maintaining comparable performance.

Task 1: Prompt Length: 128 tokens, Generated Tokens: 1 token

| Threads | Baseline (pre-PR) (tokens/sec) | This PR (SVE) (tokens/sec) |
|--------:|-------------------------------:|---------------------------:|
| 8       | 28.03                          | 27.78                      |
| 16      | 52                             | 51.61                      |
| 32      | 94.46                          | 93.54                      |
| 64      | 150.05                         | 153.04                     |

Task 2: Prompt Length: 1024 tokens, Generated Tokens: 1 token

| Threads | Baseline (pre-PR) (tokens/sec) | This PR (SVE) (tokens/sec) |
|--------:|-------------------------------:|---------------------------:|
| 8       | 27.45                          | 27.16                      |
| 16      | 49.84                          | 49.47                      |
| 32      | 87.59                          | 86.78                      |
| 64      | 141.09                         | 141.03                     |

Task 3: Prompt Length: 8192 tokens, Generated Tokens: 1 token

| Threads | Baseline (pre-PR) (tokens/sec) | This PR (SVE) (tokens/sec) |
|--------:|-------------------------------:|---------------------------:|
| 8       | 27.49                          | 27.21                      |
| 16      | 51.14                          | 50.61                      |
| 32      | 89.4                           | 88.52                      |
| 64      | 141.97                         | 141.3                      |

The command used to measure the performance is

 ./build/bin/llama-bench -m falcon-mamba-7b-q8_0.gguf -t 8,16,32,64 -p 128,1024,8192 -n 0

Perplexity

There is no change in model accuracy as a result of this PR; a summary is below.

| Baseline (pre-PR)  | This PR (SVE)      |
|--------------------|--------------------|
| 7.6508 +/- 0.67260 | 7.6508 +/- 0.67260 |
Command:  ./build/bin/llama-perplexity -s 0 -np 128 -t 64 -m falcon-mamba-7b-q8_0.gguf -c 128 -b 128 --chunks 16 -f scripts/wikitext-2-raw/wiki.test.raw

@Vithulep Vithulep changed the title ggml: aarch64: Implement SVE Q8 kernels for vector functions ggml: aarch64: Implement SVE Int 8 Quantization kernels Jun 11, 2025
@Vithulep Vithulep changed the title ggml: aarch64: Implement SVE Int 8 Quantization kernels ggml: aarch64: Implement SVE Kernels for Int 8 Quantization Jun 11, 2025
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 11, 2025
@Vithulep (Contributor, Author)

The CI failures seem CMake-related and are also occurring in other PRs. Since I haven’t modified any CMake files, they don’t appear to be caused by this PR.

@compilade (Collaborator) commented Jun 13, 2025

> they don’t appear to be caused by this PR.

@Vithulep There seems to be one failure related to Q8_0 quants in one of the ARM runners for the test-quantize-fns test, see https://github.com/ggml-org/llama.cpp/actions/runs/15626597321/job/44021925255?pr=14117#step:6:21067

q8_0 reference implementation error: FAILED (0.000175)

Not sure if it's directly related, but it might be.

Review comment on the diff (hunk `@@ -340,20 +340,37 @@ void dequantize_row_q5_1(...)`), on the added line `// SVE Support added for Scaler Implementation`:

Collaborator:

Dequantization from Q8_0 is not performed during inference, so it's not as time-sensitive (hence the plain scalar code, which is also simpler to maintain).
Only the quantization of intermediate tensors that are matrix-multiplied with types having Q8_0 as their vec_dot_type is exercised by the perplexity and speed benchmarks you've shared.

Did you test the dequantization changes for correctness outside of inference?

Contributor Author:

The dequantize_row_q8_0() kernel is called during inference, but only a very small number of times: one call per generated token. Hence its effect cannot be seen in the speedup.

Collaborator:

> The dequantize_row_q8_0() kernel is called during inference, but only a very small number of times: one call per generated token.

Ah yes I had forgotten that ggml_get_rows dequantizes what it extracts. It's used when converting tokens to embeddings at the beginning of the model graph. It's not really a bottleneck, though.

Thanks, I was wrong about it not being called.
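For completeness, dequantization is the inverse mapping: each int8 value is sign-extended, converted back to float, and multiplied by the block scale `d`. A minimal sketch in the same style as the quantization example above (again illustrative only, not the PR's code; it reuses the simplified `block_q8_0_sketch` layout and the made-up naming from that sketch):

```c
#include <arm_sve.h>
#include <stdint.h>

// Uses QK8_0 and the simplified block_q8_0_sketch layout from the quantization sketch above.
static void dequantize_row_q8_0_sve_sketch(const block_q8_0_sketch * x, float * y, int64_t k) {
    const int64_t nb = k / QK8_0;             // number of blocks
    const int64_t vl = (int64_t) svcntw();    // 32-bit lanes per SVE vector

    for (int64_t i = 0; i < nb; i++) {
        const float d = x[i].d;               // per-block scale

        for (int64_t j = 0; j < QK8_0; j += vl) {
            const svbool_t    pg = svwhilelt_b32_s64(j, (int64_t) QK8_0);
            const svint32_t   vi = svld1sb_s32(pg, x[i].qs + j);  // sign-extending load: int8 -> int32
            const svfloat32_t vf = svcvt_f32_s32_x(pg, vi);
            svst1_f32(pg, y + i*QK8_0 + j, svmul_n_f32_x(pg, vf, d));
        }
    }
}
```

A standalone round-trip check (quantize a random row, dequantize it, and compare against the input, expecting a per-element error on the order of half the scale) is one way to exercise these paths outside of inference, which appears to be what the test-quantize-fns failure quoted earlier is measuring against the reference implementation.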

tjohnman and others added 5 commits June 17, 2025 15:37
… (ggml-org#330)

* Check for reverse prompt by characters instead of tokens (ggml-org#292)

* Update main.cpp

Wording.

* Cleanup.

* Remove unnecessary use of std::stringstream.

---------

Co-authored-by: Johnman <tjohnman@github>
Co-authored-by: Georgi Gerganov <[email protected]>
@wenlujon

Just curious: since there's no performance gain (it even seems like a slight drop) from adding the SVE version, why replace the NEON version?

@Vithulep (Contributor, Author)

> > they don’t appear to be caused by this PR.
>
> @Vithulep There seems to be one failure related to Q8_0 quants in one of the ARM runners for the test-quantize-fns test, see https://github.com/ggml-org/llama.cpp/actions/runs/15626597321/job/44021925255?pr=14117#step:6:21067
>
> q8_0 reference implementation error: FAILED (0.000175)
>
> Not sure if it's directly related, but it might be.

Hi @compilade, I have fixed this error, but I am now getting a timeout error in test-quantize-perf. I'm not sure why; do you have any idea about this?

@abhijain1204fujitsu

Hi @compilade, could you please help with the timeout error above?

@Vithulep (Contributor, Author) commented Jul 4, 2025

Hi @compilade, @ggerganov, I reproduced the failing test-arg-parser check (https://github.com/ggml-org/llama.cpp/actions/runs/16066769913/job/45342657348?pr=14117#step:6:21066) on my machine, following the same steps shared earlier.
It passes correctly there.

Below is the snapshot
./bin/test-arg-parser --type q8_0

test-arg-parser: make sure there is no duplicated arguments in any examples

test-arg-parser: test invalid usage

error while handling argument "-m": expected value for argument

usage:
-m,    --model FNAME                    model path (default: `models/$filename` with filename from `--hf-file`
                                        or `--model-url` if set, otherwise models/7B/ggml-model-f16.gguf)
                                        (env: LLAMA_ARG_MODEL)


to show complete usage, run with -h
error while handling argument "-ngl": stoi

usage:
-ngl,  --gpu-layers, --n-gpu-layers N   number of layers to store in VRAM
                                        (env: LLAMA_ARG_N_GPU_LAYERS)


to show complete usage, run with -h
error while handling argument "-sm": invalid value

usage:
-sm,   --split-mode {none,layer,row}    how to split the model across multiple GPUs, one of:
                                        - none: use one GPU only
                                        - layer (default): split layers and KV across GPUs
                                        - row: split rows across GPUs
                                        (env: LLAMA_ARG_SPLIT_MODE)


to show complete usage, run with -h
error: invalid argument: --draft
test-arg-parser: test valid usage

test-arg-parser: test environment variables (valid + invalid usages)

error while handling environment variable "LLAMA_ARG_THREADS": stoi


test-arg-parser: test environment variables being overwritten

warn: LLAMA_ARG_MODEL environment variable is set, but will be overwritten by command line argument -m
test-arg-parser: test curl-related functions

test-arg-parser: test good URL

test-arg-parser: test bad URL

test-arg-parser: test max size error
  expected error: error: cannot make GET request: Maximum file size exceeded

test-arg-parser: test timeout error
  expected error: error: cannot make GET request: Timeout was reached

test-arg-parser: all tests OK

@ggerganov (Member)

> Just curious: since there's no performance gain (it even seems like a slight drop) from adding the SVE version, why replace the NEON version?

I am also wondering this. Is there a hardware that has SVE but does not have NEON?
