vulkan: Allow up to 4096 elements for mul_mat_id row_ids #13326


Merged: 1 commit into ggml-org:master, May 9, 2025

Conversation

jeffbolznv (Collaborator)

This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512, so nei0 * nei1 = 4096, which exceeds the previous limit. Increase this array size to accommodate it.

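For orientation, here is a minimal sketch of the host-side guard implied by the description above (the helper name is hypothetical; the actual backend uses GGML_ASSERT directly, and the matching change also grows the shared-memory row_ids array in the mul_mat_id shader):

```cpp
#include <cassert>
#include <cstdint>

// Sketch only, not the literal diff. nei0 and nei1 are the dimensions of the
// expert-ID tensor fed to mul_mat_id; the shader stages one row id per element
// in a fixed-size shared-memory array (row_ids), so the host must reject
// tensors that would overflow it.
static void check_mul_mat_id_limits(int64_t nei0, int64_t nei1) {
    // Previously capped at 3072; Qwen3-30B-A3B-Q2_K produces an 8 x 512
    // expert-ID tensor, i.e. 4096 elements, so the cap is raised to 4096.
    assert(nei0 * nei1 <= 4096);
}
```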
jeffbolznv requested a review from 0cc4m on May 6, 2025, 02:51
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 6, 2025
0cc4m (Collaborator) commented May 6, 2025

I haven't increased it so far since I assume this comes with a performance penalty. Are you aware of a dynamic way of resizing this array? Worst case, we might need to load the shader with different array sizes.

jeffbolznv (Collaborator, Author)

There's no way to dynamically allocate the shared memory. The closest thing is to use a spec constant to size it.

The main ways we'd lose performance from this are if we needed to use a smaller tile size, or through reduced occupancy. I don't think this affects the tile size when there's 32KB (uses medium) or >=48KB (uses large) of shared memory available. There's a chance it'll reduce occupancy a bit, but if we see a reduction in performance we could use a spec constant and do multiple variants as you suggested.
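For reference, a minimal sketch of the spec-constant approach mentioned above, written against the plain Vulkan C API (the struct and function names are illustrative, not taken from ggml-vulkan): the shader would declare the shared array size as a specialization constant, and each pipeline variant bakes in a different value at creation time.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Illustrative sketch, not ggml-vulkan code. The matching shader-side
// declaration would look like:
//   layout (constant_id = 0) const uint ROW_IDS_SIZE = 3072;
//   shared u16vec2 row_ids[ROW_IDS_SIZE];
// Filling VkSpecializationInfo lets the same SPIR-V module be compiled into
// pipeline variants with different shared-memory array sizes.
struct RowIdsSpec {
    uint32_t                 size;   // e.g. 3072 or 4096
    VkSpecializationMapEntry entry;
    VkSpecializationInfo     info;
};

static void init_row_ids_spec(RowIdsSpec &s, uint32_t row_ids_size) {
    s.size  = row_ids_size;
    s.entry = { /*constantID=*/0, /*offset=*/0, /*size=*/sizeof(uint32_t) };
    s.info  = {};
    s.info.mapEntryCount = 1;
    s.info.pMapEntries   = &s.entry;
    s.info.dataSize      = sizeof(uint32_t);
    s.info.pData         = &s.size;
    // Pass &s.info as VkPipelineShaderStageCreateInfo::pSpecializationInfo
    // when creating the compute pipeline for this variant.
}
```

Each variant still consumes the shared memory it declares, so the occupancy trade-off described above applies per variant.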

nalf3in commented May 6, 2025

This fixes #13164 for me, and performance is within the margin of error for a small dense model (Qwen3 4B) on my GPUs.

Performance details

Baseline: 91a86a6 (sampling : don't consider -infinity values in top_n_sigma (#13344))

RX 480

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host ::
prompt eval time =    7130.20 ms /  1930 tokens (    3.69 ms per token,   270.68 tokens per second)
       eval time =   20442.94 ms /   400 tokens (   51.11 ms per token,    19.57 tokens per second)

RTX 2070

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan1 -ngl 99 -c 8192 --host ::
prompt eval time =    5008.83 ms /  1930 tokens (    2.60 ms per token,   385.32 tokens per second)
       eval time =    9593.73 ms /   324 tokens (   29.61 ms per token,    33.77 tokens per second)

This PR: 282702877cc9d38e850254de9c189728553494fd (vulkan: Allow up to 4096 elements for mul_mat_id row_ids)

RX 480

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan0 -ngl 99 -c 8192 --host ::
prompt eval time =    7192.84 ms /  1930 tokens (    3.73 ms per token,   268.32 tokens per second)
       eval time =   17046.93 ms /   331 tokens (   51.50 ms per token,    19.42 tokens per second)

RTX 2070

./build/bin/llama-server -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -dev Vulkan1 -ngl 99 -c 8192 --host ::
prompt eval time =    4872.91 ms /  1930 tokens (    2.52 ms per token,   396.07 tokens per second)
       eval time =    9747.11 ms /   338 tokens (   28.84 ms per token,    34.68 tokens per second)

characharm (Contributor)

Master branch

| Metric | AMD 9070XT | Intel ARC A770 |
| --- | --- | --- |
| Prompt eval time | 332.51 ms / 17 tokens → 19.56 ms/token → 51.13 tok/sec | 315.37 ms / 17 tokens → 18.55 ms/token → 53.91 tok/sec |
| Main eval time | 3443.46 ms / 156 tokens → 22.07 ms/token → 45.30 tok/sec | 18010.72 ms / 439 tokens → 41.03 ms/token → 24.37 tok/sec |
| Total time | 3775.97 ms / 173 tokens | 18326.09 ms / 456 tokens |

This PR

| Metric | AMD 9070XT | Intel ARC A770 |
| --- | --- | --- |
| Prompt eval time | 240.54 ms / 17 tokens → 14.15 ms/token → 70.68 tok/sec | 321.42 ms / 17 tokens → 18.91 ms/token → 52.89 tok/sec |
| Main eval time | 10676.77 ms / 460 tokens → 23.21 ms/token → 43.08 tok/sec | 39865.32 ms / 955 tokens → 41.74 ms/token → 23.96 tok/sec |
| Total time | 10917.31 ms / 477 tokens | 40186.74 ms / 972 tokens |

Model: Qwen-14B-Q4_0.gguf

Mushoz commented May 8, 2025

@characharm Shouldn't you use llama-bench instead? The prompt eval speed seems unreliable since it's measured over only 17 tokens. I wouldn't expect a performance boost for the 9070XT.

And the token generation numbers are definitely not comparable, since the 9070XT generated 156 tokens in the first test and 460 in the second. Obviously the second one is going to be slower, all other things being equal, as token generation slows down as the context fills up.

llama-bench allows you to (see the example command after this list):

  1. Average multiple runs for more reliable data (5 runs by default)
  2. Prompt process longer sequences for more reliable data (512 by default)
  3. Keep generation length identical (128 by default)
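For example, an invocation along these lines exercises those defaults (the model path simply mirrors the runs above; -p, -n, and -r set prompt length, generation length, and repetition count):

./build/bin/llama-bench -m /share/Qwen3-4B-UD-Q4_K_XL.gguf -p 512 -n 128 -r 5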

characharm (Contributor) commented May 8, 2025

Okay, I figured out how to tell llama-bench which GPU to use. I've been using this build for a few days with different models, and I haven't noticed any negative changes.

AMD GPU: RX 9070 XT
Master (build: 8c83449c (5315))
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | pp512 | 1984.34 ± 3.61 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | tg128 | 58.90 ± 0.24 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | pp512 | 1429.91 ± 2.00 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | tg128 | 47.56 ± 0.05 |

build: 8c83449c (5315)

This PR (build: 06daca9 (5288))
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | pp512 | 1984.12 ± 1.39 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | tg128 | 58.88 ± 0.17 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | pp512 | 1431.86 ± 0.87 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | tg128 | 47.55 ± 0.07 |

build: 06daca9 (5288)

Intel GPU: Arc A770
Master (build: 8c83449c (5315))
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | pp512 | 372.05 ± 0.36 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | RPC,Vulkan | 99 | tg128 | 24.98 ± 0.01 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | pp512 | 199.26 ± 0.51 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | RPC,Vulkan | 99 | tg128 | 26.68 ± 0.04 |

build: 8c83449c (5315)

This PR (build: 06daca9 (5288))
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Arc(TM) A770 Graphics (Intel Corporation) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | pp512 | 372.12 ± 0.55 |
| qwen2 14B Q4_0 | 7.95 GiB | 14.77 B | Vulkan | 99 | tg128 | 24.99 ± 0.01 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | pp512 | 199.51 ± 0.55 |
| gemma3 12B Q6_K | 8.99 GiB | 11.77 B | Vulkan | 99 | tg128 | 26.71 ± 0.02 |

build: 06daca9 (5288)

0cc4m (Collaborator) left a review:

LGTM

@0cc4m 0cc4m merged commit 02115dc into ggml-org:master May 9, 2025
84 of 86 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 9, 2025
* origin/master: (39 commits)
server : vision support via libmtmd (ggml-org#12898)
sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
metal : optimize MoE for large batches (ggml-org#13388)
CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
llama : do not crash if there is no CPU backend (ggml-org#13395)
CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
llama-run: add support for downloading models from ModelScope (ggml-org#13370)
mtmd : fix batch_view for m-rope (ggml-org#13397)
llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
ci : limit write permission to only the release step + fixes (ggml-org#13392)
mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
server : (webui) fix a very small misalignment (ggml-org#13387)
server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
convert : support rope_scaling type and rope_type (ggml-org#13349)
mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
context : allow cache-less context for embeddings (ggml-org#13108)
...