vulkan: Allow up to 4096 elements for mul_mat_id row_ids #13326


Merged: 1 commit into ggml-org:master, May 9, 2025

Conversation

jeffbolznv (Collaborator)

This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512, so nei0 * nei1 = 4096, which exceeds the previous limit. Increase this array size to accommodate it.

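For orientation, here is a minimal sketch of the host-side guard implied by the description above (the helper name is hypothetical; the actual backend uses GGML_ASSERT directly, and the matching change also grows the shared-memory row_ids array in the mul_mat_id shader):

```cpp
#include <cassert>
#include <cstdint>

// Sketch only, not the literal diff. nei0 and nei1 are the dimensions of the
// expert-ID tensor fed to mul_mat_id; the shader stages one row id per element
// in a fixed-size shared-memory array (row_ids), so the host must reject
// tensors that would overflow it.
static void check_mul_mat_id_limits(int64_t nei0, int64_t nei1) {
    // Previously capped at 3072; Qwen3-30B-A3B-Q2_K produces an 8 x 512
    // expert-ID tensor, i.e. 4096 elements, so the cap is raised to 4096.
    assert(nei0 * nei1 <= 4096);
}
```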
jeffbolznv requested a review from 0cc4m on May 6, 2025, 02:51
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 6, 2025
0cc4m (Collaborator) commented May 6, 2025

I haven't increased it so far since I assume this comes with a performance penalty. Are you aware of a dynamic way of resizing this array? Worst case, we might need to load the shader with different array sizes.

jeffbolznv (Collaborator, Author)

There's no way to dynamically allocate the shared memory. The closest thing is to use a spec constant to size it.

The main ways we'd lose performance from this are if we needed to use a smaller tile size, or through reduced occupancy. I don't think this affects the tile size when there's 32KB (uses medium) or >=48KB (uses large) of shared memory available. There's a chance it'll reduce occupancy a bit, but if we see a reduction in performance we could use a spec constant and do multiple variants as you suggested.
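For reference, a minimal sketch of the spec-constant approach mentioned above, written against the plain Vulkan C API (the struct and function names are illustrative, not taken from ggml-vulkan): the shader would declare the shared array size as a specialization constant, and each pipeline variant bakes in a different value at creation time.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Illustrative sketch, not ggml-vulkan code. The matching shader-side
// declaration would look like:
//   layout (constant_id = 0) const uint ROW_IDS_SIZE = 3072;
//   shared u16vec2 row_ids[ROW_IDS_SIZE];
// Filling VkSpecializationInfo lets the same SPIR-V module be compiled into
// pipeline variants with different shared-memory array sizes.
struct RowIdsSpec {
    uint32_t                 size;   // e.g. 3072 or 4096
    VkSpecializationMapEntry entry;
    VkSpecializationInfo     info;
};

static void init_row_ids_spec(RowIdsSpec &s, uint32_t row_ids_size) {
    s.size  = row_ids_size;
    s.entry = { /*constantID=*/0, /*offset=*/0, /*size=*/sizeof(uint32_t) };
    s.info  = {};
    s.info.mapEntryCount = 1;
    s.info.pMapEntries   = &s.entry;
    s.info.dataSize      = sizeof(uint32_t);
    s.info.pData         = &s.size;
    // Pass &s.info as VkPipelineShaderStageCreateInfo::pSpecializationInfo
    // when creating the compute pipeline for this variant.
}
```

Each variant still consumes the shared memory it declares, so the occupancy trade-off described above applies per variant.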

nalf3in commented May 6, 2025

This fixes #13164 for me, and performance is within the margin of error for a small dense model (Qwen3 4B) on my GPUs.

Performance details

Baseline: 91a86a6 (sampling : don't consider -infinity values in top_n_sigma (#13344))

RX 480

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host ::
prompt eval time =    7130.20 ms /  1930 tokens (    3.69 ms per token,   270.68 tokens per second)
       eval time =   20442.94 ms /   400 tokens (   51.11 ms per token,    19.57 tokens per second)

RTX 2070

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan1 -ngl 99 -c 8192 --host ::
prompt eval time =    5008.83 ms /  1930 tokens (    2.60 ms per token,   385.32 tokens per second)
       eval time =    9593.73 ms /   324 tokens (   29.61 ms per token,    33.77 tokens per second)

This PR: 282702877cc9d38e850254de9c189728553494fd (vulkan: Allow up to 4096 elements for mul_mat_id row_ids)

RX 480

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host ::
prompt eval time =    7192.84 ms /  1930 tokens (    3.73 ms per token,   268.32 tokens per second)
       eval time =   17046.93 ms /   331 tokens (   51.50 ms per token,    19.42 tokens per second)

RTX 2070

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan1 -ngl 99 -c 8192 --host ::
prompt eval time =    4872.91 ms /  1930 tokens (    2.52 ms per token,   396.07 tokens per second)
       eval time =    9747.11 ms /   338 tokens (   28.84 ms per token,    34.68 tokens per second)

characharm (Contributor)

Master branch

| Metric | AMD 9070XT | Intel ARC A770 |
| --- | --- | --- |
| Prompt eval time | 332.51 ms / 17 tokens → 19.56 ms/token → 51.13 tok/sec | 315.37 ms / 17 tokens → 18.55 ms/token → 53.91 tok/sec |
| Main eval time | 3443.46 ms / 156 tokens → 22.07 ms/token → 45.30 tok/sec | 18010.72 ms / 439 tokens → 41.03 ms/token → 24.37 tok/sec |
| Total time | 3775.97 ms / 173 tokens | 18326.09 ms / 456 tokens |

This PR

| Metric | AMD 9070XT | Intel ARC A770 |
| --- | --- | --- |
| Prompt eval time | 240.54 ms / 17 tokens → 14.15 ms/token → 70.68 tok/sec | 321.42 ms / 17 tokens → 18.91 ms/token → 52.89 tok/sec |
| Main eval time | 10676.77 ms / 460 tokens → 23.21 ms/token → 43.08 tok/sec | 39865.32 ms / 955 tokens → 41.74 ms/token → 23.96 tok/sec |
| Total time | 10917.31 ms / 477 tokens | 40186.74 ms / 972 tokens |

Model: Qwen-14B-Q4_0.gguf

Mushoz commented May 8, 2025

@characharm Shouldn't you use llama-bench instead? The prompt eval speed seems unreliable since it's measured over only 17 tokens. I wouldn't expect a performance boost for the 9070XT.

And the token generation numbers are definitely not comparable, since the 9070XT generated 156 tokens in the first test and 460 in the second. Obviously the second one is going to be slower, all other things being equal, as token generation slows down as the context fills up.

llama-bench allows you to (see the example command after this list):

  1. Average multiple runs for more reliable data (5 runs by default)
  2. Prompt process longer sequences for more reliable data (512 by default)
  3. Keep generation length identical (128 by default)
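For example, an invocation along these lines exercises those defaults (the model path simply mirrors the runs above; -p, -n, and -r set prompt length, generation length, and repetition count):

./build/bin/llama-bench -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -p 512 -n 128 -r 5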

characharm (Contributor) commented May 8, 2025

Okay, I figured out how to tell llama-bench which GPU to use. I've been using this build for a few days with different models, and I haven't noticed any negative changes.

AMD GPU: RX 9070 XT
Master (build: 8c83449c (5315))
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | pp512 | 1984.34 ± 3.61 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | tg128 | 58.90 ± 0.24 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | pp512 | 1429.91 ± 2.00 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | tg128 | 47.56 ± 0.05 |

build: 8c83449c (5315)

This PR (build: 06daca9 (5288))
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | pp512 | 1984.12 ± 1.39 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | tg128 | 58.88 ± 0.17 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | pp512 | 1431.86 ± 0.87 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | tg128 | 47.55 ± 0.07 |

build: 06daca9 (5288)

Intel GPU: Arc A770
Master (build: 8c83449c (5315))
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | pp512 | 372.05 ± 0.36 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | tg128 | 24.98 ± 0.01 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | pp512 | 199.26 ± 0.51 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | tg128 | 26.68 ± 0.04 |

build: 8c83449c (5315)

This PR (build: 06daca9 (5288))
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | pp512 | 372.12 ± 0.55 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | tg128 | 24.99 ± 0.01 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | pp512 | 199.51 ± 0.55 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | tg128 | 26.71 ± 0.02 |

build: 06daca9 (5288)

0cc4m (Collaborator) left a review:

LGTM

@0cc4m 0cc4m merged commit 02115dc into ggml-org:master May 9, 2025
84 of 86 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...