# Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, and 128-bit Vector lengths #9290
## Conversation
Please also provide numbers from …
Thanks for the comment! I ran perplexity with SVE 128, 256, and 512-bit and non-SVE (NEON). The following are the commands and partial logs.

**Q_8**

Q_8 Graviton Neon
$ ./build_neon/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q8_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 375.16 ms

Q_8 Graviton SVE 256 (Merged)
$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 364.19 ms

Q_8 SVE 128 GRACE VM 32
$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q8_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 202.84 ms

Q_8 SVE 512 FX700
$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q8_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 48 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
llama_print_timings: load time = 6316.84 ms

**Q_4**

Q_4 Neon Graviton
$ ./build_neon/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_print_timings: load time = 241.27 ms

Q_4 SVE 128 GRACE VM 32
$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 135.31 ms

Q_4 SVE 256 Graviton
$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
llama_print_timings: load time = 222.06 ms

Q_4 SVE 512 FX700
$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q4_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4
system_info: n_threads = 32 / 48 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Thanks @Vithulep, I (accidentally) compiled master with your PR with …
> Can you try replacing the statement …

I have used svcntb() while running, so I never faced a segmentation fault.
With svcntb it fails as well. It's crashing instantly. Maybe it's a different issue entirely, because I don't see any logs from loading the model.
@Vithulep better to ignore my segfault for now and go ahead with the PR. It could be some compiler bug on my end with the clang build or the Arm sysroot it is using. I cannot reproduce the crash with gcc right now.
I want to investigate it further; can you give me the compiler configuration of your clang build and other system information? Thanks for reporting it.
Sure.
A few more details I was able to collect last night:
Here are the numbers from …

## Q_8

On a FX700 48-core machine

On a GRACE VM 32-core machine
@Vithulep Since there are quite a number of instruction sets to keep track of, can you help me understand how these kernels compare to the …? I'm interested in a comparison in terms of:
Previously, in #5780, SVE was implemented specifically for the 256-bit vector length; if the architecture only supports a 128-bit or 512-bit SVE vector length, the code falls back to NEON intrinsics. So, to get the best performance out of architectures whose vector length is not 256-bit, we have implemented vector-length-agnostic (VLA) SVE: the code uses SVE intrinsics matching whichever vector length is available. The INT8MM kernel is currently implemented with NEON intrinsics and is used in combination with either the SVE or the NEON kernels.
Thanks, this is helpful. Since this PR adds 256-bit SVE support as well, have you compared the performance against the 256-bit SVE implementation on …? I would happily run the benchmarks myself, but I don't have access to hardware that supports SVE.
In my PR, I have implemented SVE for the 128-bit and 512-bit vector lengths; the 256-bit SVE implementation is taken from #7433, and all of them are dispatched through the switch case. So there is no performance difference in the 256-bit SVE case. Here are the results for Q8 with 256-bit SVE; the results for Q4 are similar.
- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts
Thank you, I understand now. I've sent a few style changes: Vithulep#1. We can merge after it is accepted.
ggml : style changes + fix 512-bit nb loop check
Thanks. I have cross-checked it.
Co-authored-by: Georgi Gerganov <[email protected]>
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths
* Removed WhiteSpaces
* ggml : style changes + fix 512-bit nb loop check
  - fix local scope in switch cases
  - consistent predicate names
  - empty lines when necessary
  - opening braces, spaces
  - const-correctness
  - add asserts
* Update ggml/src/ggml-quants.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This PR introduces vector-length-agnostic (VLA) support in the SVE (Scalable Vector Extension) kernels for the q4_0_q8_0 and q8_0_q8_0 vector dot products on the Arm architecture. It covers 128-bit, 256-bit, and 512-bit vector lengths, and the VLA dispatch is implemented using a switch case. The 256-bit SVE implementation is taken from PR #7433; the 128-bit and 512-bit SVE implementations are written from scratch.
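A minimal sketch of the dispatch idea, assuming an SVE-capable compiler and hardware; this is not the actual ggml-quants.c code, and the kernel names mentioned in the comments are placeholders. It only shows how the runtime vector length reported by svcntb() can select one of the fixed-width paths in a switch case:

```c
// Sketch only: compile with e.g. `gcc -O2 -march=armv8-a+sve vla_dispatch.c`
#include <arm_sve.h>   // ACLE SVE intrinsics, provides svcntb()
#include <stdio.h>

int main(void) {
    // svcntb() returns the SVE vector length in bytes at run time:
    // 16, 32 or 64 bytes for 128-, 256- and 512-bit vectors respectively.
    const unsigned long vl_bits = svcntb() * 8;

    switch (vl_bits) {
        case 128:
            printf("would run the 128-bit SVE dot-product kernel\n");
            break;
        case 256:
            printf("would run the 256-bit SVE dot-product kernel\n");
            break;
        case 512:
            printf("would run the 512-bit SVE dot-product kernel\n");
            break;
        default:
            // Any other vector length keeps using the existing NEON path.
            printf("would fall back to NEON (VL = %lu bits)\n", vl_bits);
            break;
    }
    return 0;
}
```

In the real kernels, each case holds the SVE implementation for that vector width, while unsupported lengths fall back to the existing NEON code.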
Comparing NEON against SVE on machines that support 128-bit (GRACE VM, 32 CPUs), 256-bit (Graviton 3), and 512-bit (FX700) vector lengths, we observe that 128-bit SVE (GRACE VM, 32 CPUs) is roughly 1.1x to 1.5x faster than NEON. Similarly, 512-bit SVE (FX700) is roughly 2x to 4.7x faster than NEON, and from PR #7433 we know that 256-bit SVE (Graviton 3) is roughly 1.1x to 1.5x faster than NEON.
Here are the performance numbers measured on a FX700 48-core (A64FX) machine and a GRACE VM 32-CPU machine.
### Q4_0_Q8_0
$ ./build/bin/llama-cli --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q4_0.gguf-prompt.bin
### Q8_0_Q8_0
$ ./build/bin/llama-cli --model models/llama-2-7b-chat.Q8_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q8_0.gguf-prompt.bin
Performance measured on a FX700 48-core (A64FX) machine.
Performance measured on a GRACE VM 32-CPU machine.