ggml: aarch64: Implement SVE Kernels for Int 8 Quantization #14117
Conversation
The CI failures seem CMake-related and are also occurring in other PRs. Since I haven't modified any CMake files, they don't appear to be caused by this PR.

@Vithulep There seems to be one failure related to
Not sure if it's directly related, but it might be.
ggml/src/ggml-quants.c (outdated)
@@ -340,20 +340,37 @@ void dequantize_row_q5_1(const block_q5_1 * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k) {
    }
}

// SVE support added for scalar implementation
Dequantization from Q8_0 is not performed during inference, and so it's not as time-sensitive (hence the plain scalar code, which is also simpler to maintain). Only quantization of the intermediate tensors matrix-multiplied with types having Q8_0 as their vec_dot_type is exercised during the perplexity and speed benchmarks you've shared.

Did you test the dequantization changes for correctness outside of inference?
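For reference, the plain scalar dequantization being discussed looks roughly like this (a simplified sketch of ggml's dequantize_row_q8_0; the real struct stores the scale as fp16 and converts it with GGML_FP16_TO_FP32):

#include <assert.h>
#include <stdint.h>

#define QK8_0 32

// Simplified block layout: in ggml the scale d is stored as fp16.
typedef struct {
    float  d;            // per-block scale
    int8_t qs[QK8_0];    // quantized values
} block_q8_0;

// Scalar dequantization: y[j] = qs[j] * d within each block.
// It runs rarely during inference, so plain scalar code keeps it simple.
static void dequantize_row_q8_0(const block_q8_0 * x, float * y, int64_t k) {
    assert(k % QK8_0 == 0);
    const int64_t nb = k / QK8_0;
    for (int64_t i = 0; i < nb; i++) {
        const float d = x[i].d;
        for (int j = 0; j < QK8_0; ++j) {
            y[i*QK8_0 + j] = x[i].qs[j] * d;
        }
    }
}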
The dequantize_row_q8_0() kernel is called during inference, but only a very small number of times: one call per token generated. Hence its effect can't be seen in the speedup.
> The dequantize_row_q8_0() kernel is called during inference, but only a very small number of times: one call per token generated.

Ah yes, I had forgotten that ggml_get_rows dequantizes what it extracts. It's used when converting tokens to embeddings at the beginning of the model graph. It's not really a bottleneck, though. Thanks, I was wrong about it not being called.
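For illustration, a minimal sketch of that lookup (ggml API from memory, so exact headers and calls may differ between versions; tensor names and sizes are hypothetical): ggml_get_rows selects one embedding row per token id, dequantizing when the embedding matrix is quantized.

#include "ggml.h"   // ggml_graph_compute_with_ctx may live in ggml-cpu.h in newer versions
#include <stdint.h>

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // Hypothetical tiny embedding matrix (f32 here; with a quantized type
    // such as Q8_0, get_rows is where dequantization would happen).
    const int n_embd = 4, n_vocab = 8, n_tokens = 2;
    struct ggml_tensor * tok_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_vocab);
    struct ggml_tensor * tokens   = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_tokens);
    ((int32_t *) tokens->data)[0] = 3;
    ((int32_t *) tokens->data)[1] = 5;

    // One row lookup per token: this is the once-per-token dequantization path.
    struct ggml_tensor * emb = ggml_get_rows(ctx, tok_embd, tokens);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, emb);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/ 1);

    ggml_free(ctx);
    return 0;
}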
Just curious: since there's no performance gain (there even seems to be a slight drop) with the SVE version, why replace the NEON version?
Hi @compilade, I fixed this error, but I'm now getting a timeout in test-quantize-perf. I'm not sure why; do you have any idea about this?

Hi @compilade, can you please help with the above timeout error?
for (int j = 0; j < QK8_0; j += ggml_f32_epr) {
    srcv1  = svld1_f32(pg, x + i*32 + j);
    asrcv1 = svabs_f32_m(inactive1, pg, srcv1);
    sv_max = svmax_f32_m(pg, sv_max, asrcv1);
}
I don't know why exactly, but this is an infinite loop, at least on a c7g instance (on AWS). The timeout in the CI is likely caused by a similar situation.

This is reproducible on a c7g instance by running:
$ git clone --branch Q8_0_SVE_implementaion_quantize_an_dequantize_function https://github.com/Vithulep/llama.cpp.git
$ cd llama.cpp
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo
$ make -j2 test-quantize-perf
$ ./bin/test-quantize-perf --type q8_0
q8_0
quantize_row_q_reference
4096 values (0.02 MB)
min cycles/32 vals : 0.00
avg cycles/32 vals : 0.00
float32 throughput : 1.93 GB/s
quantized throughput : 0.51 GB/s
quantize_row_q
4096 values (0.02 MB)
<hangs here>
The infinite loop can be located with perf (a sampling profiler):
$ echo 2 | sudo tee /proc/sys/kernel/perf_event_paranoid
$ perf record --call-graph=fp -- ./bin/test-quantize-perf --type q8_0
q8_0
quantize_row_q_reference
4096 values (0.02 MB)
min cycles/32 vals : 0.00
avg cycles/32 vals : 0.00
float32 throughput : 1.44 GB/s
quantized throughput : 0.38 GB/s
quantize_row_q
4096 values (0.02 MB)
^C[ perf record: Woken up 19 times to write data ]
[ perf record: Captured and wrote 4.678 MB perf.data (38305 samples) ]
$ perf report # then locate the problematic function and press "a" to see the assembly
For some reason, it seems like ggml_cpu_get_sve_cnt() returns 0 on that hardware, but only in test-quantize-perf, not in llama-cli. Not sure why.
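If ggml_cpu_get_sve_cnt() does return 0 and the loop step ggml_f32_epr is derived from it, then j += ggml_f32_epr increments by 0 and j never advances, which would explain the hang. Below is a hedged diagnostic sketch, not the project's actual fix (assumes compilation with SVE enabled, e.g. -march=armv8-a+sve):

#include <arm_sve.h>
#include <assert.h>
#include <stdio.h>

int main(void) {
    // Ask the hardware directly, bypassing any cached value that may be 0.
    int bytes = (int) svcntb();   // SVE vector length in bytes
    int lanes = (int) svcntw();   // 32-bit float lanes per SVE vector

    printf("SVE vector length: %d bytes, %d f32 lanes\n", bytes, lanes);

    // A zero lane count turns `j += lanes` into an infinite loop; a guard
    // like this would at least fail fast instead of hanging.
    assert(lanes > 0);
    return 0;
}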
This PR adds SVE kernel support for the Int8 datatype, specific to the ARM architecture.
Major code changes:
Performance
The performance remained nearly the same before and after the PR changes; this PR introduces an SVE intrinsic implementation for Mamba Int8 that achieves performance comparable to the existing version.
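For context, a hedged sketch of what an SVE Q8_0 quantization inner loop can look like (a standalone illustration, not this PR's exact code; assumes QK8_0 is a multiple of the SVE f32 lane count and SVE-enabled compilation):

#include <arm_sve.h>
#include <stdint.h>

#define QK8_0 32

// Quantize one block of 32 floats to int8 with a shared per-block scale.
static void quantize_block_q8_0_sve(const float * x, int8_t * qs, float * d_out) {
    const svbool_t pg  = svptrue_b32();
    const int      epr = (int) svcntw();   // f32 lanes per SVE register

    // Pass 1: find max |x| over the block.
    svfloat32_t vmax = svdup_n_f32(0.0f);
    for (int j = 0; j < QK8_0; j += epr) {
        svfloat32_t v = svld1_f32(pg, x + j);
        vmax = svmax_f32_m(pg, vmax, svabs_f32_x(pg, v));
    }
    const float amax = svmaxv_f32(pg, vmax);

    const float d  = amax / 127.0f;         // scale so that max |q| == 127
    const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
    *d_out = d;

    // Pass 2: scale, round to nearest, and store as int8.
    for (int j = 0; j < QK8_0; j += epr) {
        svfloat32_t v  = svmul_n_f32_x(pg, svld1_f32(pg, x + j), id);
        svint32_t   vi = svcvt_s32_f32_x(pg, svrinta_f32_x(pg, v));
        svst1b_s32(pg, qs + j, vi);         // truncating store s32 -> s8
    }
}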
Task1: Prompt Length: 128 tokens, Generated Tokens: 1 token
Task2: Prompt Length: 1024 tokens, Generated Tokens: 1 token
Task3: Prompt Length: 8192 tokens, Generated Tokens: 1 token
The command used to measure the performance is
Perplexity
There is no change in model accuracy as a result of this PR. A summary is below.