M-GPU-MOE-3 — throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment

## Context

Final stage of `qwen3-moe-forward-gpu-v1` v1.7.0 cascade. After M85 qtype-aware dispatch fix (#1529 squash `89cb26af7`) closed the L6 `moe_ffn_out` NaN root cause, M-GPU-MOE-1.x reached **ACTIVE_ALGORITHM_LEVEL** (M86 #1530 squash `65bc42577`). The remaining work to flip ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME is M-GPU-MOE-3.

## Two-part scope

### Part 1: fp-accumulator-order alignment

Post-M85 cosine: ~85% of layers reach cos > 0.99 between CPU `forward_qwen3_moe` (LAZY-FUSED-MATVEC) and GPU `forward_qwen3_moe_cuda`. The remaining **7 layers (L7, L9, L12, L20, L23, L29, L46)** sit at cos 0.94-0.987.
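For reference, a minimal sketch of the per-layer cosine metric quoted above. The function name `cosine` and the choice to accumulate in f64 are assumptions made for illustration (f64 keeps the metric's own rounding from muddying the comparison); the project's actual parity harness may compute it differently.

```rust
/// Cosine similarity between two same-length activation vectors.
/// Accumulates in f64 (an assumption, not necessarily what the real
/// harness does) so the metric adds little rounding noise of its own.
/// Returns NaN if either vector is all zeros.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}
```

Here `a` would be a layer's CPU output and `b` the same layer's GPU output.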

**Cause**: fp accumulation order differs between:
- CPU `fused_q6k_parallel_matvec` (Rust SIMD via rayon, deterministic per-thread reduction)
- GPU `q6k_gemv` (CUDA warp-shuffle reduction)

Both decode the same Q6_K bytes correctly, but f32 sum-of-products is non-associative, so different reduction orders round differently. The fix is **kernel-level reduction-order alignment**, not an algorithmic change.
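To illustrate the non-associativity, here is a small sketch comparing a left-to-right sum against a 32-lane tree reduction whose pairing order mimics a warp shuffle-down (offsets 16, 8, 4, 2, 1). Both function names are hypothetical, and neither reproduces the actual order used by `fused_q6k_parallel_matvec` or `q6k_gemv`; the point is only that the same 32 values can sum to different f32 results.

```rust
/// Left-to-right f32 accumulation (one possible sequential order).
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Tree reduction over 32 lanes: lane i adds lane i + offset for
/// offset = 16, 8, 4, 2, 1, the pairing a shuffle-down reduction uses.
fn sum_tree32(xs: &[f32; 32]) -> f32 {
    let mut lanes = *xs;
    let mut offset = 16;
    while offset > 0 {
        for i in 0..offset {
            lanes[i] += lanes[i + offset];
        }
        offset /= 2;
    }
    lanes[0]
}

/// Input chosen so the two orders disagree: in the sequential order the
/// 1.0 is absorbed by 1e8 (f32 spacing near 1e8 is 8), while the tree
/// order cancels 1e8 against -1e8 first and the 1.0 survives.
fn demo_input() -> [f32; 32] {
    let mut xs = [0.0f32; 32];
    xs[0] = 1.0e8;
    xs[1] = 1.0;
    xs[16] = -1.0e8;
    xs
}
```

With `demo_input()`, `sum_sequential` yields `0.0` and `sum_tree32` yields `1.0`. Aligning the kernels means making both sides pair and accumulate partial products in the same order, not changing what is computed.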

### Part 2: throughput + memory target

- **≥150 tok/s** on RTX 4090 (≥5× CPU baseline of ~30 tok/s; allows headroom below dense Q4_K target of ~440 tok/s since MoE has expert-dispatch overhead)
- **VRAM ≤95%** on `Qwen3-Coder-30B-A3B-Instruct-Q4_K_M` (cached 17.3 GB GGUF)

## Acceptance

- All 48 layers cos ≥ 0.99 between CPU and GPU forward
- `apr run` throughput ≥150 tok/s on the cached 17.3 GB Qwen3-Coder GGUF (RTX 4090, Ada sm_89)
- VRAM steady-state ≤95% utilization
- `qwen3-moe-forward-gpu-v1` v1.7.0 → v1.8.0 ACTIVE_ALGORITHM_LEVEL → **ACTIVE_RUNTIME**
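The three measurable criteria above could be checked by a gate along these lines. This is a sketch under stated assumptions: `GateInput`, `gate_failures`, and the constant names are hypothetical, layer indices are assumed zero-based, and how the cosines, throughput, and VRAM figures get measured is out of scope here.

```rust
// Thresholds taken from the acceptance list above.
const NUM_LAYERS: usize = 48;
const MIN_COS: f64 = 0.99;
const MIN_TOK_PER_S: f64 = 150.0;
const MAX_VRAM_FRACTION: f64 = 0.95;

/// Hypothetical container for already-measured values.
struct GateInput<'a> {
    layer_cos: &'a [f64],  // CPU-vs-GPU cosine per layer
    tok_per_s: f64,        // measured decode throughput
    vram_used_bytes: u64,  // steady-state VRAM in use
    vram_total_bytes: u64, // total VRAM on the device
}

/// Returns one message per violated criterion; empty means the gate passes.
/// Comparisons are written as `!(x >= t)` so a NaN measurement fails.
fn gate_failures(input: &GateInput) -> Vec<String> {
    let mut failures = Vec::new();
    if input.layer_cos.len() != NUM_LAYERS {
        failures.push(format!(
            "expected {} layers, got {}",
            NUM_LAYERS,
            input.layer_cos.len()
        ));
    }
    for (i, &c) in input.layer_cos.iter().enumerate() {
        if !(c >= MIN_COS) {
            failures.push(format!("L{}: cos {:.4} < {}", i, c, MIN_COS));
        }
    }
    if !(input.tok_per_s >= MIN_TOK_PER_S) {
        failures.push(format!(
            "throughput {:.1} tok/s < {}",
            input.tok_per_s, MIN_TOK_PER_S
        ));
    }
    let frac = input.vram_used_bytes as f64 / input.vram_total_bytes as f64;
    if !(frac <= MAX_VRAM_FRACTION) {
        failures.push(format!(
            "VRAM {:.1}% > {:.0}%",
            frac * 100.0,
            MAX_VRAM_FRACTION * 100.0
        ));
    }
    failures
}
```

An empty result would correspond to the first three bullets holding; the v1.7.0 → v1.8.0 status flip remains a separate step.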

## Cross-refs

- Companion-repo: paiml/claude-code-parity-apr § Sub-extension 2 deliverable 5 (PENDING) + R10 risk
- Related: #386 (Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — adjacent kernel-perf class)
