Context
Final stage of qwen3-moe-forward-gpu-v1 v1.7.0 cascade. After M85 qtype-aware dispatch fix (#1529 squash 89cb26af7) closed the L6 moe_ffn_out NaN root cause, M-GPU-MOE-1.x reached ACTIVE_ALGORITHM_LEVEL (M86 #1530 squash 65bc42577). The remaining work to flip ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME is M-GPU-MOE-3.
Two-part scope
Part 1: fp-accumulator-order alignment
Post-M85 cosine similarity: ~85% of the 48 layers reach cos > 0.99 between CPU forward_qwen3_moe (LAZY-FUSED-MATVEC) and GPU forward_qwen3_moe_cuda. The remaining 7 layers (L7, L9, L12, L20, L23, L29, L46) sit at cos 0.94-0.987.
Cause: the f32 accumulation order differs between:
- CPU: fused_q6k_parallel_matvec (Rust SIMD via rayon, deterministic per-thread reduction)
- GPU: q6k_gemv (CUDA warp-shuffle reduction)
Both decode the same Q6_K bytes correctly; f32 sum-of-products is non-associative. Fix is kernel-level reduction-order alignment, not algorithmic.
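The non-associativity claim can be illustrated with a minimal sketch. The two functions below are hypothetical stand-ins, not the actual fused_q6k_parallel_matvec or q6k_gemv reductions: one accumulates left to right, the other follows a shuffle-down style tree in which lane i adds lane i + offset at each step.

```rust
/// Left-to-right accumulation, as a single scalar loop would do it.
fn sequential_sum(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Tree reduction patterned after a warp shuffle-down:
/// at each step, lane i adds the value held by lane i + offset.
/// Illustrative only; requires a power-of-two number of lanes.
fn shuffle_tree_sum(xs: &[f32]) -> f32 {
    assert!(xs.len().is_power_of_two(), "lane count must be a power of two");
    let mut lanes = xs.to_vec();
    let mut offset = lanes.len() / 2;
    while offset > 0 {
        for i in 0..offset {
            lanes[i] += lanes[i + offset];
        }
        offset /= 2;
    }
    lanes[0]
}
```

For the input [1.0e8, 1.0, -1.0e8, 1.0], whose exact sum is 2, sequential_sum returns 1.0 (the first 1.0 is absorbed by 1.0e8) while shuffle_tree_sum returns 2.0. The operands are identical and only the association order differs, which is the kind of divergence reduction-order alignment is meant to remove.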
Part 2: throughput + memory target
- ≥150 tok/s on RTX 4090 (≥5× CPU baseline of ~30 tok/s; allows headroom below dense Q4_K target of ~440 tok/s since MoE has expert-dispatch overhead)
- VRAM ≤95% on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (cached 17.3 GB GGUF)
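The arithmetic behind these targets can be sketched as follows. The 24 GB figure is an assumption (the nominal RTX 4090 VRAM size, not stated above), the sketch treats the GGUF file size as the on-device weight footprint, and it ignores the GB/GiB distinction, so the headroom number is only a rough guide.

```rust
// Assumption: nominal RTX 4090 VRAM capacity; not stated in the issue.
const VRAM_TOTAL_GB: f64 = 24.0;
// From the targets above.
const VRAM_CAP_FRACTION: f64 = 0.95;
const GGUF_SIZE_GB: f64 = 17.3;
const CPU_BASELINE_TOK_S: f64 = 30.0;
const TARGET_TOK_S: f64 = 150.0;

/// VRAM that may be in use at the 95% cap.
fn vram_budget_gb() -> f64 {
    VRAM_TOTAL_GB * VRAM_CAP_FRACTION
}

/// Space left under the cap for KV cache, activations, and scratch buffers,
/// assuming the weights occupy roughly the GGUF file size.
fn headroom_after_weights_gb() -> f64 {
    vram_budget_gb() - GGUF_SIZE_GB
}

/// Required speedup of the GPU path over the CPU baseline.
fn speedup_over_cpu() -> f64 {
    TARGET_TOK_S / CPU_BASELINE_TOK_S
}
```

Under these assumptions the cap is about 22.8 GB, leaving roughly 5.5 GB beyond the weights, and the throughput target corresponds to a 5× speedup over the CPU baseline.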
Acceptance
- All 48 layers cos ≥ 0.99 between CPU and GPU forward
- apr run throughput ≥150 tok/s on the cached 17.3 GB Qwen3-Coder GGUF (RTX 4090, Ada sm_89)
- VRAM steady-state ≤95% utilization
- qwen3-moe-forward-gpu-v1 bumps v1.7.0 → v1.8.0 and flips ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME
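The per-layer cosine criterion could be checked along these lines. This is a sketch under assumptions: cosine_similarity and failing_layers are hypothetical helper names, and the per-layer activations are assumed to be available as flat f32 vectors from both forward paths.

```rust
/// Cosine similarity of two equal-length activation vectors.
/// Accumulates in f64 so the metric itself adds little rounding noise.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Indices of layers whose CPU/GPU cosine is not at least `threshold`.
/// The negated comparison also flags NaN results (e.g. all-zero outputs).
fn failing_layers(cpu: &[Vec<f32>], gpu: &[Vec<f32>], threshold: f64) -> Vec<usize> {
    cpu.iter()
        .zip(gpu.iter())
        .enumerate()
        .filter(|(_, (c, g))| !(cosine_similarity(c, g) >= threshold))
        .map(|(i, _)| i)
        .collect()
}
```

Calling failing_layers with the 48 per-layer outputs and a threshold of 0.99 should return an empty list once the first acceptance item holds. Flagging NaN as a failure is a deliberate choice here, given that this cascade started from a NaN in moe_ffn_out.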
Cross-refs