
Conversation

0cc4m (Collaborator) commented Oct 12, 2025

This heavily refactors the caching structure of the MMQ shader and also makes it more modular, so it can work with other kinds of quants.

Basically, instead of turning the quants into 8-bit integers while loading them into shared memory, the quant structs are now copied through shared memory into registers and only reshaped into 8-bit integers directly before the integer dot operation. This saves both shared memory and registers.
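A rough sketch of the new flow, using a q4_0-style block as an example. The function names follow the ones mentioned later in this thread, but the signatures, constants, and the `dotPacked4x8AccSatEXT` built-in from GL_EXT_integer_dot_product are assumptions here, not the actual shader code:

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_integer_dot_product : require

// Raw q4_0 block: one fp16 scale plus 32 4-bit quants packed into 4 uints.
struct block_q4_0_packed { float16_t d; uint32_t qs[4]; };

layout (binding = 0) readonly buffer A { block_q4_0_packed data_a[]; };

const uint SHMEM_BLOCKS = 64u;   // hypothetical tile size in blocks
const uint WMITER = 2u;          // hypothetical per-thread block count

// The tile is kept in shared memory as raw quant structs instead of expanded int8,
// which roughly halves the shared memory needed for the A tile.
shared block_q4_0_packed buf_a[SHMEM_BLOCKS];

void block_a_to_shmem(uint dst_idx, uint src_idx) {
    buf_a[dst_idx] = data_a[src_idx];          // plain struct copy, global -> shared
}

block_q4_0_packed cache_a[WMITER];             // per-thread register copy of the blocks

void block_a_to_registers(uint reg_idx, uint shmem_idx) {
    cache_a[reg_idx] = buf_a[shmem_idx];
}

// Reshape to 8-bit values only here, directly before the integer dot product.
int32_t mmq_dot_product(uint reg_idx, uint pair, int32_t q8_lo, int32_t q8_hi) {
    int32_t lo = int32_t( cache_a[reg_idx].qs[pair]        & 0x0F0F0F0Fu);  // low nibbles
    int32_t hi = int32_t((cache_a[reg_idx].qs[pair] >> 4u) & 0x0F0F0F0Fu);  // high nibbles
    int32_t sum = dotPacked4x8AccSatEXT(lo, q8_lo, 0);
    // (the -8 offset of q4_0 is typically folded in later via the q8_1 block sum)
    return dotPacked4x8AccSatEXT(hi, q8_hi, sum);
}
```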

TODO:

  • Q2_K
  • Q3_K
  • Q4_K
  • Q5_K
  • Q6_K

Q2_K performance is not that good yet. Mapping the 256-wide quant structure to 32-wide Q8_1 structures is not that easy to do efficiently, so I'm still trying to find the best way to do that. @jeffbolznv Let me know if you see any obvious issues with the implementation.
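For reference, here is roughly the size mismatch, with the block layouts written out as GLSL structs. Field names follow ggml's C definitions; the shader-side declarations may differ:

```glsl
// One q2_K super-block covers 256 values.
struct block_q2_K {
    uint8_t   scales[16];  // per-16-value sub-block: 4-bit scale + 4-bit min
    uint8_t   qs[64];      // 256 x 2-bit quants, four per byte
    float16_t d;           // super-block scale for the scales
    float16_t dmin;        // super-block scale for the mins
};

// One q8_1 block covers only 32 values.
struct block_q8_1 {
    float16_t d;           // delta
    float16_t s;           // d * sum(qs)
    int8_t    qs[32];      // 32 x 8-bit quants
};

// So each block_q2_K has to be matched against 8 consecutive block_q8_1 blocks,
// and every 32-value slice needs its own 2-bit quants and 4-bit scale/min
// extracted from the 256-wide layout before the integer dot product.
```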

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 12, 2025
jeffbolznv (Collaborator) commented:

Interesting. How is the performance for the legacy quants?

Having the values decoded to 8b in shared memory would allow for using int8 coopmat, so this change seems to prevent that. But if using coopmat for this isn't planned then I guess that's fine.

0cc4m (Collaborator, Author) commented Oct 12, 2025

> Interesting. How is the performance for the legacy quants?

It's a ~10% improvement for Intel, a little less so on AMD and Nvidia.

> Having the values decoded to 8b in shared memory would allow for using int8 coopmat, so this change seems to prevent that. But if using coopmat for this isn't planned then I guess that's fine.

Yeah, I gave that a try when I first created this shader and didn't find a good way to use coopmat. I plan to take another look, but I guess I'd create a separate shader for it. There wasn't a good way to add k-quants to the structure it had.

0cc4m (Collaborator, Author) commented Oct 15, 2025

@jeffbolznv I'm trying to investigate the low performance for q2_k with Nvidia Nsight Graphics, but it's giving me some weird results:
This is the q2_k shader:
[Nsight Graphics screenshot: q2_k shader profile]
This is the q4_0 shader:
[Nsight Graphics screenshot: q4_0 shader profile]
One difference I can see is shared memory, but I actually requested less shared memory for q2_k than for q4_0, so I don't know what's going on there.
Also, the instruction count is quite a bit larger for q2_k, which may be related to the third-most common stall being NOINST.

Additionally, I get something like 12.81 TFLOPS on a normal run, but 14.90 TFLOPS if I disable FP16. (The test is MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1))

The hotspots otherwise are the integer dot math, the mul_q8_1 function, and the data load from global to shared memory in block_a_to_shmem:
[Nsight Graphics screenshot: shader hotspots]

Any clue what is going on?

jeffbolznv (Collaborator) commented:

> One difference I can see is shared memory, but I actually requested less shared memory for q2_k than for q4_0, so I don't know what's going on there

This could be register spilling to shared memory. Might be worth trying a smaller tile size to not be so close to the register limit.

What is the relative performance of Q2_K and Q4_0, in the old and new paths?

0cc4m (Collaborator, Author) commented Oct 15, 2025

From memory, it's something like 10-14 TFLOPS for the scalar float16 path and around 24 TFLOPS for the q4_0 integer dot one.

0cc4m (Collaborator, Author) commented Oct 16, 2025

```
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  384 runs -  2614.45 us/run -  60.13 GFLOP/run -  23.00 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  348 runs -  2885.67 us/run -  60.13 GFLOP/run -  20.84 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  304 runs -  3302.36 us/run -  60.13 GFLOP/run -  18.21 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  324 runs -  3099.11 us/run -  60.13 GFLOP/run -  19.40 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  328 runs -  3053.64 us/run -  60.13 GFLOP/run -  19.69 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  182 runs -  5507.77 us/run -  60.13 GFLOP/run -  10.92 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  230 runs -  4363.89 us/run -  60.13 GFLOP/run -  13.78 TFLOPS
```

These are the actual values.
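For reference, the GFLOP/run column is just the matmul FLOP count, and the TFLOPS values follow directly from the time per run; e.g. for q4_0:

$$2 \cdot m \cdot n \cdot k = 2 \cdot 4096 \cdot 512 \cdot 14336 \approx 60.13~\text{GFLOP}, \qquad \frac{60.13~\text{GFLOP}}{2614.45~\mu\text{s}} \approx 23.00~\text{TFLOPS}$$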

Edit: They also improve significantly without fp16 enabled, which is odd.

```
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  438 runs -  2284.85 us/run -  60.13 GFLOP/run -  26.32 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  370 runs -  2711.59 us/run -  60.13 GFLOP/run -  22.17 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  382 runs -  2619.61 us/run -  60.13 GFLOP/run -  22.95 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  280 runs -  3578.32 us/run -  60.13 GFLOP/run -  16.80 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  430 runs -  2327.71 us/run -  60.13 GFLOP/run -  25.83 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  186 runs -  5417.37 us/run -  60.13 GFLOP/run -  11.10 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1):                  152 runs -  6588.63 us/run -  60.13 GFLOP/run -   9.13 TFLOPS
```

This is not just a testing fluke; it also increases pp512 of a model that uses the mmq shader.

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 1921.58 ± 2.88 |

ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 2241.67 ± 9.24 |

Edit 2: This is due to the accumulator type, so it may be a caching improvement from the 32-bit sums.

SavicStefan (Contributor) commented:

I think you can also use the optimization from PR #16203.

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-k-quants branch from 3e4ff93 to 7984fc5 on October 25, 2025 at 13:33
SavicStefan pushed a commit to SavicStefan/llama-stefan.cpp that referenced this pull request on Oct 28, 2025
SavicStefan (Contributor) commented:

This uses ACC_TYPE_VEC2 and each mmq_dot_product computes two of them.

Performance comparison (without coopmat and coopmat2), NVIDIA GeForce RTX 4060 Ti:

| Kernel | Before (us/run) | After (us/run) | Δ % |
| --- | --- | --- | --- |
| MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2376.51 | 2368.66 | +0.33% |
| MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2484.84 | 2500.16 | -0.62% |
| MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2697.12 | 2773.05 | -2.82% |
| MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2793.02 | 2828.11 | -1.26% |
| MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 2564.39 | 2573.99 | -0.37% |
| MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5250.73 | 5211.97 | +0.74% |
| MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3872.56 | 3815.29 | +1.48% |
| MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 3095.60 | 3133.28 | -1.22% |
| MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 5161.67 | 3933.89 | +23.79% |
| MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1) | 4903.09 | 4049.95 | +17.40% |

0cc4m (Collaborator, Author) commented Oct 28, 2025

I had that implemented, but found that using single float32 accumulators is more performant than f16vec2 accumulators; I'm not sure why. This PR will be ready to merge soon, so we can experiment more with optimizations then.
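To make the difference concrete, here is a minimal sketch of the two accumulator variants (placeholder names; ACC_TYPE / ACC_TYPE_VEC2 stand in for the shader's actual macros, and the helper functions are hypothetical):

```glsl
#extension GL_EXT_shader_explicit_arithmetic_types : require

const uint SUMS = 8u;   // hypothetical number of per-thread output accumulators

// Variant used in this PR: one float32 accumulator per output value.
float sums_f32[SUMS];
void accumulate_f32(uint i, int32_t iq_sum, float16_t d_a, float16_t d_b) {
    sums_f32[i] += float(iq_sum) * float(d_a) * float(d_b);
}

// Suggested variant: two neighbouring outputs packed into one f16vec2 accumulator,
// with each mmq_dot_product step producing both halves at once.
f16vec2 sums_f16x2[SUMS / 2u];
void accumulate_f16vec2(uint i, i32vec2 iq_sum, float16_t d_a, f16vec2 d_b) {
    sums_f16x2[i] += f16vec2(iq_sum) * d_a * d_b;
}
```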

SavicStefan (Contributor) commented:

I also made mmq_dot_product return ACC_TYPE_VEC2.
Sounds good, I will test it more and then create a PR once you merge this, so we can see whether we can use it or not.

0cc4m force-pushed the 0cc4m/vulkan-mmq-dp4a-k-quants branch from 07c0ee4 to 40d75d9 on October 28, 2025 at 15:34
0cc4m (Collaborator, Author) commented Oct 28, 2025

Here are some quick performance results:

AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | pp512 | 254.63 ± 0.15 | 304.70 ± 0.73 | 542.62 ± 5.28 | +78.1% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | pp512 | 259.54 ± 0.07 | 291.79 ± 0.36 | 500.78 ± 2.20 | +71.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | pp512 | 857.51 ± 0.32 | 295.10 ± 0.15 | 682.91 ± 6.78 | +131.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 919.96 ± 0.75 | 282.70 ± 0.14 | 622.37 ± 3.68 | +120.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 0 | pp512 | 926.34 ± 0.28 | 739.29 ± 4.82 | 815.45 ± 0.71 | +10.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 1000.29 ± 0.45 | 666.76 ± 0.97 | 723.91 ± 3.87 | +8.6% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 0 | pp512 | 973.35 ± 1.07 | 646.31 ± 2.04 | 788.65 ± 7.93 | +22.0% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 1 | pp512 | 1055.39 ± 0.58 | 587.53 ± 3.58 | 706.83 ± 4.05 | +20.3% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 0 | pp512 | 487.07 ± 0.21 | 648.16 ± 2.26 | 677.87 ± 1.24 | +4.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 1 | pp512 | 506.18 ± 0.19 | 592.97 ± 0.46 | 619.94 ± 0.43 | +4.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | pp512 | 344.51 ± 2.37 | 386.40 ± 1.66 | 573.07 ± 2.68 | +48.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 351.32 ± 1.94 | 356.26 ± 1.63 | 518.79 ± 1.69 | +45.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | pp512 | 1026.61 ± 2.39 | 583.41 ± 3.72 | 1223.48 ± 4.55 | +109.7% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1062.38 ± 8.99 | 566.24 ± 4.57 | 1170.69 ± 5.01 | +106.7% |

Intel A770

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | pp512 | 241.88 ± 0.05 | 814.40 ± 0.82 | +236.7% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | pp512 | 107.65 ± 0.05 | 269.01 ± 0.12 | +149.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | pp512 | 226.89 ± 0.39 | 721.43 ± 0.48 | +218.0% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 106.13 ± 0.06 | 258.39 ± 0.10 | +143.5% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 0 | pp512 | 905.61 ± 1.13 | 1180.42 ± 1.38 | +30.3% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 276.18 ± 0.13 | 297.01 ± 0.18 | +7.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 0 | pp512 | 902.66 ± 1.14 | 1191.98 ± 1.58 | +32.1% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 1 | pp512 | 276.27 ± 0.14 | 298.57 ± 0.20 | +8.1% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 0 | pp512 | 798.27 ± 0.78 | 821.94 ± 1.81 | +3.0% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 1 | pp512 | 265.18 ± 0.16 | 266.45 ± 0.12 | +0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | pp512 | 300.39 ± 1.28 | 536.20 ± 2.08 | +78.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 128.36 ± 0.11 | 173.50 ± 0.04 | +35.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | pp512 | 473.30 ± 2.12 | 940.74 ± 3.95 | +98.8% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 441.60 ± 3.40 | 814.58 ± 2.46 | +84.5% |

Nvidia RTX 3090 (without coopmat)

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | pp512 | 1293.97 ± 2.28 | 1238.57 ± 3.38 | -4.3% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | pp512 | 1274.88 ± 4.68 | 1222.75 ± 2.59 | -4.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | pp512 | 1237.19 ± 3.68 | 1736.91 ± 9.89 | +40.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 1222.45 ± 1.87 | 1708.99 ± 3.28 | +39.8% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 0 | pp512 | 2037.84 ± 12.53 | 2273.29 ± 4.16 | +11.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 1996.66 ± 9.93 | 2227.06 ± 5.90 | +11.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 0 | pp512 | 2012.62 ± 7.30 | 2163.48 ± 15.52 | +7.5% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 1 | pp512 | 1975.37 ± 2.99 | 2121.59 ± 8.64 | +7.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 0 | pp512 | 2006.05 ± 4.47 | 2134.07 ± 6.57 | +6.4% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 1 | pp512 | 1976.58 ± 10.48 | 2095.79 ± 10.55 | +6.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | pp512 | 1173.09 ± 5.58 | 1198.19 ± 6.87 | +2.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 1139.77 ± 2.96 | 1158.97 ± 9.83 | +1.7% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | pp512 | 1464.30 ± 15.08 | 2439.57 ± 24.30 | +66.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1458.39 ± 13.15 | 2419.96 ± 19.97 | +65.9% |

0cc4m marked this pull request as ready for review on October 28, 2025 at 15:52
0cc4m requested a review from jeffbolznv on October 28, 2025 at 15:55
0cc4m (Collaborator, Author) commented Oct 28, 2025

AMD Radeon RX 6800 XT

| model | size | params | ngl | fa | test | t/s (ROCm) | t/s (before) | t/s (after) | diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 0 | pp512 | 1014.91 ± 1.27 | 783.30 ± 0.71 | 996.89 ± 0.82 | +27.3% |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | 99 | 1 | pp512 | 1115.28 ± 0.35 | 760.07 ± 6.63 | 968.06 ± 0.18 | +27.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 0 | pp512 | 1257.70 ± 1.83 | 725.19 ± 11.39 | 1583.06 ± 2.16 | +118.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 1418.17 ± 0.59 | 694.91 ± 12.54 | 1510.78 ± 0.51 | +117.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 0 | pp512 | 1695.17 ± 1.96 | 1636.45 ± 9.58 | 1937.64 ± 1.71 | +18.4% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 2008.64 ± 0.72 | 1522.08 ± 38.09 | 1829.96 ± 0.46 | +20.2% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 0 | pp512 | 1603.18 ± 1.52 | 1600.08 ± 14.43 | 1908.14 ± 1.33 | +19.3% |
| llama 8B Q4_1 | 4.77 GiB | 8.03 B | 99 | 1 | pp512 | 1877.70 ± 1.07 | 1494.55 ± 35.07 | 1802.73 ± 0.84 | +20.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 0 | pp512 | 1628.49 ± 1.31 | 1383.79 ± 5.98 | 1461.32 ± 1.01 | +5.6% |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | 99 | 1 | pp512 | 1917.50 ± 1.50 | 1322.85 ± 12.82 | 1405.28 ± 0.70 | +6.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 0 | pp512 | 1166.06 ± 5.48 | 851.60 ± 7.83 | 1059.15 ± 5.21 | +24.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 1361.20 ± 3.18 | 817.02 ± 7.40 | 1006.37 ± 3.29 | +23.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 0 | pp512 | 2237.97 ± 8.42 | 1313.06 ± 27.29 | 1974.57 ± 41.37 | +50.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 2709.14 ± 15.67 | 1330.93 ± 6.46 | 2014.92 ± 26.38 | +51.4% |

jeffbolznv (Collaborator) left a comment:


LGTM. I didn't review the new mmq shader code in great detail.

0cc4m merged commit bcf5bda into master on Oct 29, 2025 (63 of 64 checks passed)
0cc4m deleted the 0cc4m/vulkan-mmq-dp4a-k-quants branch on October 29, 2025 at 13:39