Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, and 128-bit Vector lengths #9290


Merged
6 commits merged on Sep 9, 2024

Conversation

@Vithulep (Contributor) commented Sep 3, 2024

This PR introduces vector-length-agnostic (VLA) support for the SVE (Scalable Vector Extension) kernels of the q4_0_q8_0 and q8_0_q8_0 vector dot products on the Arm architecture. It covers 128-bit, 256-bit, and 512-bit vector lengths, dispatched with a switch case on the runtime vector length. The SVE 256-bit implementation is taken from PR #7433; the 128-bit and 512-bit implementations are written from scratch.

Comparing against NEON on machines that support 128-bit (Grace VM, 32 CPUs), 256-bit (Graviton 3), and 512-bit (FX700) vector lengths, we observe that 128-bit SVE on the Grace VM is roughly 1.1x to 1.5x faster than NEON, and 512-bit SVE on the FX700 is roughly 2x to 4.7x faster than NEON. From PR #7433 we know that 256-bit SVE on Graviton 3 is roughly 1.1x to 1.5x faster than NEON.
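
To make the dispatch concrete, here is a minimal, hedged sketch of its shape (not the actual ggml-quants.c code, and the function names are illustrative only): the kernel reads the SVE register width once at run time with svcntb() and switches into a width-specific path. To keep the example self-contained and compilable, every case below reuses the same generic predicated int8 dot product; in the PR itself each case is unrolled for its specific width.

```c
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

// Generic predicated SVE int8 dot product (stand-in for the width-specific bodies).
static int32_t dot_s8_sve(const int8_t * x, const int8_t * y, size_t n) {
    svint32_t acc = svdup_n_s32(0);
    for (size_t i = 0; i < n; i += svcntb()) {
        const svbool_t pg = svwhilelt_b8_u64(i, n);   // active lanes for this step
        const svint8_t vx = svld1_s8(pg, x + i);      // inactive lanes load as 0
        const svint8_t vy = svld1_s8(pg, y + i);
        acc = svdot_s32(acc, vx, vy);
    }
    return svaddv_s32(svptrue_b32(), acc);
}

// Dispatch on the runtime vector length, as this PR does with a switch case.
int32_t dot_s8(const int8_t * x, const int8_t * y, size_t n) {
    const int vector_length = svcntb() * 8;           // SVE width in bits

    switch (vector_length) {
        case 128: return dot_s8_sve(x, y, n);         // 16 int8 lanes (e.g. Grace)
        case 256: return dot_s8_sve(x, y, n);         // 32 lanes (path from PR #7433)
        case 512: return dot_s8_sve(x, y, n);         // 64 lanes (e.g. A64FX)
        default: {                                    // any other width: plain fallback
            int32_t sum = 0;
            for (size_t i = 0; i < n; ++i) sum += (int32_t) x[i] * y[i];
            return sum;
        }
    }
}
```

Built with an SVE-enabled compiler (e.g. -march=armv8-a+sve), the same binary resolves to the 128-bit path on Grace, the 256-bit path on Graviton 3, and the 512-bit path on A64FX.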

Here are the performance numbers measured on an FX700 48-core (A64FX) machine and a Grace VM 32-CPU machine.

### Q4_0_Q8_0
$ ./build/bin/llama-cli --model models/llama-2-7b-chat.Q4_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q4_0.gguf-prompt.bin

### Q8_0_Q8_0
$ ./build/bin/llama-cli --model models/llama-2-7b-chat.Q8_0.gguf --temp 0.1 --threads 2 --prompt 'AI is going to' --n-predict 512 --seed 0 --prompt-cache llama-2-7b-chat.Q8_0.gguf-prompt.bin

Performance measured on an FX700 48-core (A64FX) machine.

[image: performance results]

Performance measured on a Grace VM 32-CPU machine.

[image: performance results]

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Sep 3, 2024
@ggerganov (Member)

Please also provide numbers from llama-bench in order to see the prompt processing speed difference and from llama-perplexity in order to make sure the computation is correct.

@Vithulep (Contributor, Author) commented Sep 5, 2024

Please also provide numbers from llama-bench in order to see the prompt processing speed difference and from llama-perplexity in order to make sure the computation is correct.

Thanks for the comment! I ran perplexity with 128-, 256-, and 512-bit SVE and with non-SVE (NEON). The commands and partial logs follow.

-------------------- Q_8 --------------------------

Q_8 Graviton Neon

$ ./build_neon/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q8_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 44.856 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 3.07 seconds per pass - ETA 0.20 minutes
[1]7.1769,[2]5.9443,[3]6.8531,[4]4.4417,
Final estimate: PPL = 4.4417 +/- 0.76438

llama_print_timings: load time = 375.16 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 12245.33 ms / 512 tokens ( 23.92 ms per token, 41.81 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 12299.17 ms / 513 tokens

Q_8 Graviton SVE 256 (Merged)

$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 43.991 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.63 seconds per pass - ETA 0.10 minutes
[1]7.2120,[2]5.9646,[3]6.8482,[4]4.4350,
Final estimate: PPL = 4.4350 +/- 0.76148

llama_print_timings: load time = 364.19 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 6494.23 ms / 512 tokens ( 12.68 ms per token, 78.84 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 6546.73 ms / 513 tokens

Q_8 SVE 128 GRACE VM 32

$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q8_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 29.527 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.71 seconds per pass - ETA 0.03 minutes
[1]7.2120,[2]5.9646,[3]6.8104,[4]4.4166,
Final estimate: PPL = 4.4166 +/- 0.75761

llama_print_timings: load time = 202.84 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 2822.60 ms / 512 tokens ( 5.51 ms per token, 181.39 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 2864.59 ms / 513 tokens

Q_8 SVE 512 FX700

$ ./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q8_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 48 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 105.496 ms
perplexity: calculating perplexity over 4 chunks, batch_size=128
perplexity: 3.99 seconds per pass - ETA 0.25 minutes
[1]7.1426,[2]5.9335,[3]6.8278,[4]4.4216,
Final estimate: PPL = 4.4216 +/- 0.75837

llama_print_timings: load time = 6316.84 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 15873.91 ms / 512 tokens ( 31.00 ms per token, 32.25 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 16000.65 ms / 513 tokens

------------------ Q_4 ------------------------------------

Q_4 Neon Graviton

$ ./build_neon/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 43.849 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 3.03 seconds per pass - ETA 0.20 minutes
[1]7.6753,[2]6.5941,[3]7.4843,[4]4.8210,
Final estimate: PPL = 4.8210 +/- 0.85493

llama_print_timings: load time = 241.27 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 12103.49 ms / 512 tokens ( 23.64 ms per token, 42.30 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 12156.36 ms / 513 tokens

Q_4 SVE 128 GRACE VM 32

$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 29.592 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.72 seconds per pass - ETA 0.03 minutes
[1]7.7460,[2]6.5982,[3]7.5013,[4]4.8223,
Final estimate: PPL = 4.8223 +/- 0.85661

llama_print_timings: load time = 135.31 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 2854.41 ms / 512 tokens ( 5.58 ms per token, 179.37 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 2894.22 ms / 513 tokens

Q_4 SVE 256 Graviton

$ ./build_sve/bin/llama-perplexity -s 0 -np 1 -t 32 -m ../../models/llama-2-7b-chat.Q4_0.gguf -f ../../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 32 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 43.704 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.57 seconds per pass - ETA 0.10 minutes
[1]7.7460,[2]6.5982,[3]7.5307,[4]4.8365,
Final estimate: PPL = 4.8365 +/- 0.85967

llama_print_timings: load time = 222.06 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 6243.64 ms / 512 tokens ( 12.19 ms per token, 82.00 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 6296.64 ms / 513 tokens

Q_4 SVE 512 FX700

./build_sve/bin/perplexity -s 0 -np 1 -t 32 -m ./models/llama-2-7b-chat.Q4_0.gguf -f ../wikitext-2-raw-v1 -c 128 -b 128 --chunks 4


system_info: n_threads = 32 / 48 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 108.764 ms
perplexity: calculating perplexity over 4 chunks, batch_size=128
perplexity: 5.38 seconds per pass - ETA 0.35 minutes
[1]7.6943,[2]6.5693,[3]7.4721,[4]4.8070,
Final estimate: PPL = 4.8070 +/- 0.85285

And below is a summary.
[image: perplexity summary]

@jdomke (Contributor) commented Sep 5, 2024

Thanks @Vithulep, I (accidentally) compiled master with your PR with -msve-vector-bits=256 and ran it on an FX700 system, which results in a segfault. Are you able to reproduce this behavior? Maybe we need compile-time and runtime checks for the SVE width 😢

@Vithulep (Contributor, Author) commented Sep 5, 2024

Thanks @Vithulep , I (accidentally) compiled master with your PR with -msve-vector-bits=256 and ran on a FX700 system which results in a seg fault. Are you able to reproduce this behavior? Maybe we need compile time and runtime checks for the sve width 😢

Can you try replacing the statement
const int vector_length = ggml_sve_cnt_b*8;
with the following one:
const int vector_length = svcntb()*8;

I have always used svcntb() at runtime and have never faced a segmentation fault.
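
Not part of the PR, but a small standalone check (assuming an SVE-enabled build, e.g. -march=armv8-a+sve) that illustrates the difference: svcntb() reports the width the hardware/OS grants at run time, while -msve-vector-bits=N bakes a fixed width into the binary and exposes it through the __ARM_FEATURE_SVE_BITS macro. A 256-bit build running on 512-bit A64FX hardware is exactly the kind of mismatch this would flag.

```c
#include <arm_sve.h>
#include <stdio.h>

int main(void) {
    // svcntb() returns the SVE register size in bytes, as granted at run time.
    const unsigned runtime_bits = (unsigned) svcntb() * 8;
    printf("runtime SVE vector length: %u bits\n", runtime_bits);

#if defined(__ARM_FEATURE_SVE_BITS) && __ARM_FEATURE_SVE_BITS != 0
    // __ARM_FEATURE_SVE_BITS is N when the build used -msve-vector-bits=N.
    printf("compile-time fixed width:  %u bits\n", (unsigned) __ARM_FEATURE_SVE_BITS);
    if (__ARM_FEATURE_SVE_BITS != runtime_bits) {
        printf("warning: compile-time width differs from the hardware width\n");
    }
#else
    printf("built vector-length agnostic (no fixed -msve-vector-bits)\n");
#endif
    return 0;
}
```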

@jdomke (Contributor) commented Sep 5, 2024

Can you try replacing the statement const int vector_length = ggml_sve_cnt_b*8; with the following one const int vector_length = svcntb()*8;

I have used svcntb() while running so never faced segmentation fault.

With svcntb() it fails as well. It's crashing instantly. Maybe it's a different issue entirely, because I don't see any logs from loading the model.

@jdomke (Contributor) commented Sep 5, 2024

@Vithulep better to ignore my segfault for now and go ahead with the PR. It could be a compiler bug on my end with the clang build or the Arm sysroot it is using; I cannot reproduce the crash with GCC right now.

@Vithulep (Contributor, Author) commented Sep 5, 2024

@Vithulep better to ignore my segfault for now and go ahead with the pr. it could be some compiler bug on my end with the clang build or the arm sysroot it is using. i cannot reproduce the crash with gcc right now.

I want to investigate it further; can you give me the configuration of your clang compiler and other system information? Thanks for reporting it.

@slaren mentioned this pull request Sep 6, 2024
@jdomke (Contributor) commented Sep 6, 2024

I want to investigate it further; can you give me compiler configuration of your clang compiler and other system information. Thanks for reporting it.

Sure.

  • FX1000 (Fugaku)
  • LLVM 19 (rc3) at commit 437434d
  • Arm GNU toolchain 12.3.rel1 as sysroot
  • [email protected]

A few more details I was able to collect last night:

  • for ./bin/llama-cli I cannot even get to a main breakpoint in gdb
  • with ./bin/llama-bench-matmult it fails when returning from ggml_new_tensor_2d (a printf/fflush right before the return statement is OK, right after the function is not -> maybe a corrupted stack?!)
  • -O0 is OK, but -O1 and anything above shows the same issue

@Vithulep (Contributor, Author) commented Sep 6, 2024

Please also provide numbers from llama-bench in order to see the prompt processing speed difference and from llama-perplexity in order to make sure the computation is correct.

Here are the numbers from llama-bench
## Q_4
$ ./build_neon/bin/llama-bench -m ../../models/llama-2-7b-chat.Q8_0.gguf -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
$ ./build_sve/bin/llama-bench -m ../../models/llama-2-7b-chat.Q8_0.gguf -n 0 -n 16 -p 64 -t 1,2,4,8,16,32

## Q_8
$ ./build_neon/bin/llama-bench -m ../../models/llama-2-7b-chat.Q8_0.gguf -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
$ ./build_sve/bin/llama-bench -m ../../models/llama-2-7b-chat.Q8_0.gguf -n 0 -n 16 -p 64 -t 1,2,4,8,16,32

On an FX700 48-core machine

Q_8 SVE 512-bit
[image: llama-bench results]

Q_8 Neon
[image: llama-bench results]

Q_4 SVE 512-bit
[image: llama-bench results]

Q_4 Neon
[image: llama-bench results]

On a Grace VM 32-core machine

Q_8 SVE 128-bit
[image: llama-bench results]

Q_8 Neon
[image: llama-bench results]

Q_4 SVE 128-bit
[image: llama-bench results]

Q_4 Neon
[image: llama-bench results]

@ggerganov (Member)

@Vithulep Since there are quite a number of instruction sets to keep track of, can you help me understand how these kernels compare to the Q4_0_4_4, Q4_0_4_8, and Q4_0_8_8 formats for which we already have SVE/INT8MM support on master (see ggml-aarch64.c)? They were introduced in #5780.

I'm interested in comparison in terms of:

  • performance at Q4_0 quantization
  • hardware coverage of the two implementations

@Vithulep (Contributor, Author) commented Sep 9, 2024

@Vithulep Since there are quite a number of instruction sets to keep track of, can you help me understand how these kernels compare to the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 formats for which we already have SVE/INT8MM support on master (see the ggml-aarch64.c). They were introduced here: #5780.

I'm interested in comparison in terms of:

  • performance at Q4_0 quantization
  • hardware coverage of the two implementations

Previously, in #5780, the kernels were implemented specifically for a 256-bit SVE vector length. If the architecture supports a 128-bit or 512-bit SVE vector length instead, those kernels fall back to NEON intrinsics.

So, to get the best performance out of architectures whose SVE vector length is not 256 bits, we have implemented vector-length-agnostic SVE: the kernels use SVE intrinsics for whichever vector length is available.

The INT8MM kernel is currently implemented with NEON intrinsics, and it is used in combination with either the SVE or the NEON intrinsics.
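
For context on the NEON fallback mentioned above, here is a hedged sketch (not the actual ggml kernel, and the function name is illustrative) of a fixed 128-bit NEON int8 dot product of the kind that 128-bit and 512-bit SVE machines previously fell back to; the VLA SVE paths in this PR replace that fallback on such machines.

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

int32_t dot_s8_neon(const int8_t * x, const int8_t * y, size_t n) {
    int32x4_t acc = vdupq_n_s32(0);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {                    // fixed 16 int8 lanes per step
        const int8x16_t vx = vld1q_s8(x + i);
        const int8x16_t vy = vld1q_s8(y + i);
        // widening multiply, then pairwise accumulate into 32-bit lanes
        acc = vpadalq_s16(acc, vmull_s8(vget_low_s8(vx),  vget_low_s8(vy)));
        acc = vpadalq_s16(acc, vmull_s8(vget_high_s8(vx), vget_high_s8(vy)));
    }
    int32_t sum = vaddvq_s32(acc);                    // horizontal add
    for (; i < n; ++i) {
        sum += (int32_t) x[i] * y[i];                 // scalar tail
    }
    return sum;
}
```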

@ggerganov (Member)

Thanks, this is helpful. Since this PR adds 256-bit SVE support as well, have you compared the performance against the 256-bit SVE implementation on master? I think it would be helpful to run llama-bench. The build instructions in #9321 could be of help.

I would happily run the benches myself, but I don't have access to hardware that supports SVE.

@Vithulep (Contributor, Author) commented Sep 9, 2024

Thanks, this is helpful. Since this PR adds 256-bit SVE support as well, have you compared the performance against the 256-bit SVE implementation on master? I think it would be helpful to run llama-bench. The build instructions in #9321 could be of help.

I would happily run the benches myself, but I don't have access to hardware that supports SVE.

In my PR I have implemented SVE for 128-bit and 512-bit vector lengths; the 256-bit SVE code is taken from #7433, and all of these implementations are placed in a switch case. So there is no performance difference in the 256-bit SVE case.

Here are the results for Q8 with 256-bit SVE; the results for Q4 are similar.

PR #7433 (merged SVE-256)
[image: llama-bench results]

This PR (under the switch case)
[image: llama-bench results]

- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts
@ggerganov (Member)

Thank you, I understand now. I've sent a few style changes: Vithulep#1. We can merge after they are accepted.

ggml : style changes + fix 512-bit nb loop check
@Vithulep (Contributor, Author) commented Sep 9, 2024

Thank you, I understand now. I've sent a few style changes: Vithulep#1. We can merge after accepted.

Thanks. I have cross-checked it.

Co-authored-by: Georgi Gerganov <[email protected]>
@ggerganov merged commit 5fac4d5 into ggml-org:master Sep 9, 2024
1 check passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Implemented vector length agnostic SVE using switch case for 512-bit, 256-bit, 128-bit vector lengths

* Removed WhiteSpaces

* ggml : style changes + fix 512-bit nb loop check

- fix local scope in switch cases
- consistent predicate names
- empty lines when necessary
- opening braces, spaces
- const-correctness
- add asserts

* Update ggml/src/ggml-quants.c

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024