Description
Name and Version
version: 6237 (97ae596)
built with MSVC 19.44.35211.0 for x64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server.exe `
  -m Qwen3-235B-A22B-UD-Q4_K_XL.gguf `
  -fa `
  -c 32768 `
  -t 16 `
  -ngl 999 `
  -ts 50,50 `
  -ctk q4_0 `
  -ctv q4_0 `
  -b 1024 `
  -ub 1024 `
  -ot ".ffn_.*_exps.=CPU" `
  -v
Problem description & steps to reproduce
CPU: AMD Ryzen 9 9950X (16C/32T)
RAM: 192 GB DDR5
GPUs: 2 × RTX 3090 24 GB (no NVLink)
NVIDIA driver: 581.08
CUDA Version: 13.0
Long-prompt decoding crashes on recent commits on a hybrid (CPU+GPU) setup.
The same command and model are stable on the older commit b6187 (reaching ~3.0 t/s on long generations), but newer commits crash during prompt processing.
Built with:
cmake .. -G "Visual Studio 17 2022" -A x64 `
-DCMAKE_TOOLCHAIN_FILE="C:/Development/vcpkg/scripts/buildsystems/vcpkg.cmake" `
-DVCPKG_TARGET_TRIPLET="x64-windows" `
-DGGML_CUDA=ON `
-DGGML_CUDA_GRAPHS=OFF `
-DGGML_CUDA_F16=ON `
-DGGML_CUDA_FA_ALL_QUANTS=ON `
-DGGML_NATIVE=ON `
-DGGML_LTO=ON `
-DCMAKE_CUDA_ARCHITECTURES="86" `
-DCMAKE_BUILD_TYPE=Release
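The configure step above is followed by the usual CMake build step; the exact invocation is a sketch, not taken from the report:

# Build step (sketch; the -j thread count is an assumption)
cmake --build . --config Release -j 16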
Reproduction steps
- Build llama.cpp with the flags above.
- Launch llama-server with the command attached to the issue.
- Run a short prompt → generation works.
- Run a long prompt (~25k tokens prefill) → the server crashes (see the repro sketch below).
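A minimal way to drive the long-prompt case, assuming the server is on the default 127.0.0.1:8080 and using llama-server's /completion endpoint (the filler text and repeat count are placeholders, not the actual prompt from the report):

# Repro sketch (PowerShell). Assumptions: default host/port, /completion endpoint;
# the repeated sentence is filler standing in for the real ~25k-token prompt.
$prompt = "The quick brown fox jumps over the lazy dog. " * 4000
$body   = @{ prompt = $prompt; n_predict = 64 } | ConvertTo-Json
Invoke-RestMethod -Uri "http://127.0.0.1:8080/completion" -Method Post `
    -ContentType "application/json" -Body $body

A short prompt sent the same way completes normally; only the long prefill triggers the crash.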
First Bad Commit
Good: b6187 (no crash). Performance:
- Short prompt (21 tok): prompt 11.5 t/s; gen 2.8–3.0 t/s
- Long prompt (≈24,961 tok): prompt 78.6 t/s; gen ~3.0 t/s
Bad: b6237 (97ae5961) crashes on a long prompt with the same command & model.
- Short prompt (21 tok): prompt ~12.3 t/s; gen ~2.8–3.0 t/s
- Long prompt: crash
Relevant log output
slot launch_slot_: id 0 | task 8 | processing task
slot update_slots: id 0 | task 8 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 24964
slot update_slots: id 0 | task 8 | kv cache rm [3, end)
slot update_slots: id 0 | task 8 | prompt processing progress, n_past = 1027, n_tokens = 1024, progress = 0.041019
srv update_slots: decoding batch, n_tokens = 1024
clear_adapter_lora: call
set_embeddings: value = 0
# (crash shortly after this point)