Name and Version
version: 6250 (e92734d)
built with MSVC 19.44.35211.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server.exe ^
-m DeepSeek-R1-0528-UD-IQ1_S.gguf ^
-fa ^
-c 32768 ^
-t 16 ^
-mg 1 ^
-ngl 999 ^
-ts 50,50 ^
-ctk q4_0 ^
-ctv q4_0 ^
-ot "\.ffn_.*_exps\.weight=CPU" ^
-v
Problem description & steps to reproduce
Hardware / Drivers
- CPU: AMD Ryzen 9 9950X (16C/32T)
- RAM: 192 GB DDR5
- GPUs: 2 × NVIDIA GeForce RTX 3090 24 GB (no NVLink)
- NVIDIA Driver: 581.08
- CUDA: 13.0
Build
cmake .. -G "Visual Studio 17 2022" -A x64 `
-DCMAKE_TOOLCHAIN_FILE="C:/Development/vcpkg/scripts/buildsystems/vcpkg.cmake" `
-DVCPKG_TARGET_TRIPLET="x64-windows" `
-DGGML_CUDA=ON `
-DGGML_CUDA_GRAPHS=ON `
-DGGML_CUDA_F16=ON `
-DGGML_CUDA_FA_ALL_QUANTS=ON `
-DGGML_NATIVE=ON `
-DGGML_LTO=ON `
-DCMAKE_CUDA_ARCHITECTURES="86" `
-DCMAKE_BUILD_TYPE=Release
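For completeness, a minimal sketch of the build step that would follow this configure (assumed, not stated above; with the Visual Studio generator the configuration is selected at build time):
# assumed follow-up build step (not in the original report)
cmake --build . --config Release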
Reproduction steps
- Build llama.cpp with the flags above.
- Launch llama-server with the command line above.
- Run a short prompt → generation works.
- Run a long prompt (~24k-token prefill) → prompt processing stalls (a request sketch follows this list).
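A minimal way to drive the long-prompt repro without a UI (a sketch, assuming the server's default bind of 127.0.0.1:8080 and its OpenAI-compatible chat endpoint; payload.json is a hypothetical file, not from the original report):
REM payload.json is a hypothetical example file, e.g.:
REM {"messages":[{"role":"user","content":"<~24k tokens of text>"}],"max_tokens":64}
curl http://127.0.0.1:8080/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d @payload.json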
Tested (control models that work with the same flags & prompt shape)
- ERNIE-4.5-300B-A47B; MoE, no MLA
- GLM-4.5; MoE, no MLA
- Qwen3-235B-A22B; MoE, no MLA
First Bad Commit
Update (bisect results):
- b6187 (Aug 17) – prompt completes normally (~13m56s @ 30.7 t/s).
- e9288e8 (Aug 19) – still good (~13m43s @ 31.2 t/s).
- 2f37014 (Aug 20) – still good (~13m39s @ 31.4 t/s).
- 7a6e91a (Aug 20) – still good (~13m46s @ 31.1 t/s).
- 13aeb7a (Aug 20, PR #15454 “CUDA: refactor FA support/selection code”) – first bad commit.
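For anyone re-checking the range, a sketch of the bisect workflow (assuming the llama.cpp release tags above are fetched locally and the long-prompt repro serves as the pass/fail test; run from Git Bash or any POSIX shell):
# commit/tag names taken from the results above
git bisect start
git bisect bad 13aeb7a   # long prompt stalls
git bisect good b6187    # long prompt completes
# at each step: rebuild with the CMake flags above, re-run the repro,
# then mark the commit with `git bisect good` or `git bisect bad`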
Relevant log output
slot update_slots: id 0 | task 396 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 25699
slot update_slots: id 0 | task 396 | kv cache rm [2, end)
slot update_slots: id 0 | task 396 | prompt processing progress, n_past = 2050, n_tokens = 2048, progress = 0.079692
srv update_slots: decoding batch, n_tokens = 2048
clear_adapter_lora: call
set_embeddings: value = 0
# (stall after this point)