Misc. bug: Long-prompt decode crash with MoE #15481

@sighpher

Description

Name and Version

version: 6237 (97ae596)
built with MSVC 19.44.35211.0 for x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe `
-m Qwen3-235B-A22B-UD-Q4_K_XL.gguf `
-fa `
-c 32768 `
-t 16 `
-ngl 999 `
-ts 50,50 `
-ctk q4_0 `
-ctv q4_0 `
-b 1024 `
-ub 1024 `
-ot ".ffn_.*_exps.=CPU" `
-v

Problem description & steps to reproduce

CPU: AMD Ryzen 9 9950X (16C/32T)

RAM: 192 GB DDR5

GPUs: 2 × RTX 3090 24 GB (no NVLink)

NVIDIA driver: 581.08
CUDA version: 13.0

Long-prompt decoding crashes on recent commits with a hybrid (CPU+GPU) setup.
The same command and model are stable on the older build b6187 (reaching ~3.0 t/s on long generations); newer builds crash during prompt processing.

Built with

cmake .. -G "Visual Studio 17 2022" -A x64 `
-DCMAKE_TOOLCHAIN_FILE="C:/Development/vcpkg/scripts/buildsystems/vcpkg.cmake" `
-DVCPKG_TARGET_TRIPLET="x64-windows" `
-DGGML_CUDA=ON `
-DGGML_CUDA_GRAPHS=OFF `
-DGGML_CUDA_F16=ON `
-DGGML_CUDA_FA_ALL_QUANTS=ON `
-DGGML_NATIVE=ON `
-DGGML_LTO=ON `
-DCMAKE_CUDA_ARCHITECTURES="86" `
-DCMAKE_BUILD_TYPE=Release
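
For a side-by-side check, the known-good build can be reproduced from its release tag with the same flags. A minimal sketch, assuming the b6187 release tag is available in the clone (the build directory name is arbitrary):

git fetch --tags
git checkout b6187
cmake -B build-b6187 -G "Visual Studio 17 2022" -A x64 `
-DCMAKE_TOOLCHAIN_FILE="C:/Development/vcpkg/scripts/buildsystems/vcpkg.cmake" `
-DVCPKG_TARGET_TRIPLET="x64-windows" `
-DGGML_CUDA=ON `
-DGGML_CUDA_GRAPHS=OFF `
-DGGML_CUDA_F16=ON `
-DGGML_CUDA_FA_ALL_QUANTS=ON `
-DGGML_NATIVE=ON `
-DGGML_LTO=ON `
-DCMAKE_CUDA_ARCHITECTURES="86" `
-DCMAKE_BUILD_TYPE=Release
cmake --build build-b6187 --config Release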

Reproduction steps

  1. Build llama.cpp with the flags above.
  2. Launch llama-server with the command line above.
  3. Send a short prompt → generation works.
  4. Send a long prompt (~25k tokens of prefill) → the server crashes; a request sketch follows these steps.
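
A request sketch for step 4, assuming llama-server is listening on its default port 8080 and using the /completion endpoint; the filler sentence and repeat count are placeholders for any prompt long enough to span many 1024-token ubatches:

# Build a roughly 25k-token prompt out of filler text (content is irrelevant).
$prompt = ("The quick brown fox jumps over the lazy dog. " * 2500)
$body = @{ prompt = $prompt; n_predict = 32 } | ConvertTo-Json
# Default llama-server address; adjust if --host/--port were changed.
Invoke-RestMethod -Uri "http://127.0.0.1:8080/completion" `
  -Method Post -ContentType "application/json" -Body $body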

First Bad Commit

Good: b6187 (no crash). Performance:

  • Short prompt (21 tok): prompt 11.5 t/s; gen 2.8–3.0 t/s

  • Long prompt (≈24,961 tok): prompt 78.6 t/s; gen ~3.0 t/s

Bad: 6237 (97ae5961) crashes on the long prompt with the same command and model.

  • Short prompt (21 tok): prompt ~12.3 t/s; gen ~2.8–3.0 t/s

  • Long prompt: crash
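
With a known-good and a known-bad build, git bisect can narrow this down to the first bad commit. A sketch, assuming the upstream bNNNN release tags are fetched and b6237 is the crashing build:

git fetch --tags
git bisect start
git bisect bad b6237
git bisect good b6187
# At each step: rebuild with the flags above, rerun the long prompt,
# then mark the result until the first bad commit is reported:
#   git bisect good    (no crash)
#   git bisect bad     (crash)
git bisect reset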

Relevant log output

slot launch_slot_: id  0 | task 8 | processing task
slot update_slots: id  0 | task 8 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 24964
slot update_slots: id  0 | task 8 | kv cache rm [3, end)
slot update_slots: id  0 | task 8 | prompt processing progress, n_past = 1027, n_tokens = 1024, progress = 0.041019
srv  update_slots: decoding batch, n_tokens = 1024
clear_adapter_lora: call
set_embeddings: value = 0
# (crash shortly after this point)
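
The log stops before any fault information, so a native stack trace would help pin down the crash site. A sketch using cdb from the Windows SDK debugging tools (the install path is an assumption), launching the server under the debugger with the same arguments as above:

# -g/-G skip the initial and final breakpoints; execution runs until the fault.
& "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe" -g -G `
  llama-server.exe -m Qwen3-235B-A22B-UD-Q4_K_XL.gguf -fa -c 32768 -t 16 `
  -ngl 999 -ts 50,50 -ctk q4_0 -ctv q4_0 -b 1024 -ub 1024 `
  -ot ".ffn_.*_exps.=CPU" -v
# At the crash prompt, dump the faulting thread's stack:
#   .ecxr ; kb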
