Misc. bug: Long-prompt decode crash with MoE #15481

@sighpher

Description

Name and Version

version: 6237 (97ae596)
built with MSVC 19.44.35211.0 for x64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

llama-server.exe `
-m Qwen3-235B-A22B-UD-Q4_K_XL.gguf `
-fa `
-c 32768 `
-t 16 `
-ngl 999 `
-ts 50,50 `
-ctk q4_0 `
-ctv q4_0 `
-b 1024 `
-ub 1024 `
-ot ".ffn_.*_exps.=CPU" `
-v

Problem description & steps to reproduce

CPU: AMD Ryzen 9 9950X (16C/32T)

RAM: 192 GB DDR5

GPUs: 2 × RTX 3090 24 GB (no NVLink)

NVIDIA driver: 581.08
CUDA version: 13.0

Long-prompt decoding crashes on recent commits with a hybrid (CPU+GPU) setup.
The same command and model are stable on the older build b6187 (reaching ~3.0 t/s on long generations); newer builds crash during prompt processing.

Built with

cmake .. -G "Visual Studio 17 2022" -A x64 `
-DCMAKE_TOOLCHAIN_FILE="C:/Development/vcpkg/scripts/buildsystems/vcpkg.cmake" `
-DVCPKG_TARGET_TRIPLET="x64-windows" `
-DGGML_CUDA=ON `
-DGGML_CUDA_GRAPHS=OFF `
-DGGML_CUDA_F16=ON `
-DGGML_CUDA_FA_ALL_QUANTS=ON `
-DGGML_NATIVE=ON `
-DGGML_LTO=ON `
-DCMAKE_CUDA_ARCHITECTURES="86" `
-DCMAKE_BUILD_TYPE=Release
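
For a side-by-side check, the known-good build can be reproduced from its release tag with the same flags. A minimal sketch, assuming the b6187 release tag is available in the clone (the build directory name is arbitrary):

git fetch --tags
git checkout b6187
cmake -B build-b6187 -G "Visual Studio 17 2022" -A x64 `
-DCMAKE_TOOLCHAIN_FILE="C:/Development/vcpkg/scripts/buildsystems/vcpkg.cmake" `
-DVCPKG_TARGET_TRIPLET="x64-windows" `
-DGGML_CUDA=ON `
-DGGML_CUDA_GRAPHS=OFF `
-DGGML_CUDA_F16=ON `
-DGGML_CUDA_FA_ALL_QUANTS=ON `
-DGGML_NATIVE=ON `
-DGGML_LTO=ON `
-DCMAKE_CUDA_ARCHITECTURES="86" `
-DCMAKE_BUILD_TYPE=Release
cmake --build build-b6187 --config Release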

Reproduction steps

  1. Build llama.cpp with the flags above.
  2. Launch llama-server with the command line above.
  3. Send a short prompt → generation works.
  4. Send a long prompt (~25k tokens of prefill) → the server crashes; a request sketch follows these steps.
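
A request sketch for step 4, assuming llama-server is listening on its default port 8080 and using the /completion endpoint; the filler sentence and repeat count are placeholders for any prompt long enough to span many 1024-token ubatches:

# Build a roughly 25k-token prompt out of filler text (content is irrelevant).
$prompt = ("The quick brown fox jumps over the lazy dog. " * 2500)
$body = @{ prompt = $prompt; n_predict = 32 } | ConvertTo-Json
# Default llama-server address; adjust if --host/--port were changed.
Invoke-RestMethod -Uri "http://127.0.0.1:8080/completion" `
  -Method Post -ContentType "application/json" -Body $body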

First Bad Commit

Good: b6187 (no crash). Performance:

  • Short prompt (21 tok): prompt 11.5 t/s; gen 2.8–3.0 t/s

  • Long prompt (≈24,961 tok): prompt 78.6 t/s; gen ~3.0 t/s

Bad: 6237 (97ae5961) crashes on the long prompt with the same command and model.

  • Short prompt (21 tok): prompt ~12.3 t/s; gen ~2.8–3.0 t/s

  • Long prompt: crash
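
With a known-good and a known-bad build, git bisect can narrow this down to the first bad commit. A sketch, assuming the upstream bNNNN release tags are fetched and b6237 is the crashing build:

git fetch --tags
git bisect start
git bisect bad b6237
git bisect good b6187
# At each step: rebuild with the flags above, rerun the long prompt,
# then mark the result until the first bad commit is reported:
#   git bisect good    (no crash)
#   git bisect bad     (crash)
git bisect reset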

Relevant log output

slot launch_slot_: id  0 | task 8 | processing task
slot update_slots: id  0 | task 8 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 24964
slot update_slots: id  0 | task 8 | kv cache rm [3, end)
slot update_slots: id  0 | task 8 | prompt processing progress, n_past = 1027, n_tokens = 1024, progress = 0.041019
srv  update_slots: decoding batch, n_tokens = 1024
clear_adapter_lora: call
set_embeddings: value = 0
# (crash shortly after this point)
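
The log stops before any fault information, so a native stack trace would help pin down the crash site. A sketch using cdb from the Windows SDK debugging tools (the install path is an assumption), launching the server under the debugger with the same arguments as above:

# -g/-G skip the initial and final breakpoints; execution runs until the fault.
& "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\cdb.exe" -g -G `
  llama-server.exe -m Qwen3-235B-A22B-UD-Q4_K_XL.gguf -fa -c 32768 -t 16 `
  -ngl 999 -ts 50,50 -ctk q4_0 -ctv q4_0 -b 1024 -ub 1024 `
  -ot ".ffn_.*_exps.=CPU" -v
# At the crash prompt, dump the faulting thread's stack:
#   .ecxr ; kb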
