vulkan: Allow up to 4096 elements for mul_mat_id row_ids #13326
Conversation
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate it.
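For context, a minimal standalone sketch of the arithmetic behind the failure; the 3072 bound comes from the assert above and the 4096 bound from this PR's title, while the constant names are purely illustrative:

```cpp
#include <cassert>
#include <cstdint>

int main() {
    // Shape of the expert-id tensor reported above: 8 x 512.
    const int64_t nei0 = 8;
    const int64_t nei1 = 512;

    const int64_t old_row_ids_capacity = 3072; // bound in the failing GGML_ASSERT
    const int64_t new_row_ids_capacity = 4096; // bound proposed by this PR

    assert(nei0 * nei1 == 4096);                 // 8 * 512
    assert(nei0 * nei1 >  old_row_ids_capacity); // why the assert fired
    assert(nei0 * nei1 <= new_row_ids_capacity); // why 4096 is enough for this model
    return 0;
}
```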
I haven't increased it so far since I assume this comes with a performance penalty. Are you aware of a dynamic way to resize this array? Worst case, we might need to load the shader with different array sizes.
There's no way to dynamically allocate the shared memory; the closest thing is to use a spec constant to size it. The main ways we'd lose performance from this are if we needed to use a smaller tile size, or through reduced occupancy. I don't think this affects the tile size when there's 32 KB (uses the medium tile) or >=48 KB (uses the large tile) of shared memory available. There's a chance it'll reduce occupancy a bit, but if we see a reduction in performance we could use a spec constant and do multiple variants as you suggested.
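To illustrate the spec-constant idea, here is a minimal host-side sketch using the stock Vulkan API; the constant ID, the capacity values, and the helper are assumptions for illustration, not code from ggml's Vulkan backend:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Hypothetical spec constant: in the shader this would pair with
//   layout(constant_id = 0) const uint ROW_IDS_CAP = 3072;
//   shared u16vec2 row_ids[ROW_IDS_CAP];
// so each pipeline variant only reserves the shared memory it needs.
struct RowIdsSpec {
    uint32_t                 capacity;
    VkSpecializationMapEntry entry;
    VkSpecializationInfo     info;
};

// Fill the structure in place so the internal pointers stay valid.
static void init_row_ids_spec(RowIdsSpec &s, uint32_t capacity) {
    s.capacity           = capacity;
    s.entry.constantID   = 0;                // matches constant_id = 0 above
    s.entry.offset       = 0;
    s.entry.size         = sizeof(uint32_t);
    s.info.mapEntryCount = 1;
    s.info.pMapEntries   = &s.entry;
    s.info.dataSize      = sizeof(uint32_t);
    s.info.pData         = &s.capacity;
}
```

Passing &s.info as the pSpecializationInfo of the VkPipelineShaderStageCreateInfo when building each variant would allow, say, one pipeline sized for 3072 row ids and another for 4096, at the cost of compiling multiple pipelines.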
This fixes #13164 for me and performance is within margin of error for a small dense model (Qwen3 4B) on my GPUs.

Performance details (benchmark tables omitted):
- 91a86a6 (sampling : don't consider -infinity values in top_n_sigma (#13344)): RX 480, RTX 2070
- 282702877cc9d38e850254de9c189728553494fd (vulkan: Allow up to 4096 elements for mul_mat_id row_ids): RX 480, RTX 2070
Benchmark comparison for Qwen-14B-Q4_0.gguf, master branch vs. this PR (table data omitted).
@characharm Shouldn't you use llama-bench instead? The prompt eval speed seems unreliable since it's only 17 tokens, and I wouldn't expect a performance boost for the 9070 XT. The token generation numbers are also not comparable, since the 9070 XT generated 156 tokens in the first test and 460 in the second; all other things being equal, the second run will be slower because token generation slows down as the context fills up. llama-bench lets you run repeatable tests with fixed prompt and generation lengths.
Okay, I figured out how to tell llama-bench which GPU to use. I've been using this build for a few days with different models, and I haven't noticed any negative changes.

Benchmark runs (tables omitted):
- AMD GPU (RX 9070 XT): build 8c83449c (5315) vs. this PR (build 06daca9 (5288))
- Intel GPU (Arc A770): build 8c83449c (5315) vs. this PR (build 06daca9 (5288))
LGTM
* origin/master: (39 commits)
  server : vision support via libmtmd (ggml-org#12898)
  sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
  metal : optimize MoE for large batches (ggml-org#13388)
  CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
  llama : do not crash if there is no CPU backend (ggml-org#13395)
  CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
  imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
  llama-run: add support for downloading models from ModelScope (ggml-org#13370)
  mtmd : fix batch_view for m-rope (ggml-org#13397)
  llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
  rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
  vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
  server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
  ci : limit write permission to only the release step + fixes (ggml-org#13392)
  mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
  server : (webui) fix a very small misalignment (ggml-org#13387)
  server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
  convert : support rope_scaling type and rope_type (ggml-org#13349)
  mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
  context : allow cache-less context for embeddings (ggml-org#13108)
  ...