fix(M-GPU-MOE-1.4 step c): qtype-aware dispatch in expert_swiglu_cuda — closes L6 moe_ffn_out NaN #1529
Merged
… — closes L6 moe_ffn_out NaN
Five-Whys root cause (validated by code inspection):
1. Why NaN at L6 moe_ffn_out? Garbage matvec output for gate or up
at L6 overflows in silu(gate)*up product.
2. Why garbage matvec? Q6_K bytes interpreted as Q4_K bytes by GPU.
3. Why wrong byte interpretation? expert_swiglu_cuda calls q4k_matvec
UNCONDITIONALLY for both gate AND up (no qtype check).
4. Why no qtype-aware dispatch on GPU? Helper authored after CPU
sibling; the qtype detail wasn't ported.
5. Why does CPU sibling have it? Qwen3-Coder-30B-A3B-Instruct
Q4_K_M is MIXED quant — expert tensors per layer can be either
Q4_K (12) or Q6_K (14). CPU expert_swiglu_quantized dispatches
via matvec_for_qtype with explicit comments calling out this
exact issue.
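The first "why" can be reproduced in isolation. The following is a minimal sketch, not the project's kernel: `silu` and `swiglu` are hypothetical helper names, and the `1.0e20` magnitudes in the usage note are illustrative rather than measured values. It shows how finite but oversized gate/up outputs become `inf` in the `silu(gate)*up` product, and how a later sum over mixed-sign infinities yields NaN.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
/// (Hypothetical helper for illustration; not the project's implementation.)
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU product: silu(gate[i]) * up[i].
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

With `gate = [1.0e20, 1.0e20]` and `up = [1.0e20, -1.0e20]`, each product exceeds `f32::MAX` and saturates to `+inf` and `-inf`; adding the two, as a subsequent dot product would, gives NaN.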
Why L0-L5 MATCH:
For layers 0-5 of canonical Qwen3-Coder GGUF, gate_exps + up_exps
are Q4_K. The unconditional q4k_matvec is correct for those
layers, output matches CPU within cosine 0.9999.
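For readers unfamiliar with the parity metric, here is a hedged sketch of a per-layer cosine check and a first-NaN scan. The function names `cosine` and `first_nan_layer` are assumptions made for illustration; the actual harness may compute these differently.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns None when the lengths differ, any element is non-finite,
/// or either vector has zero norm. (Hypothetical helper.)
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        if !x.is_finite() || !y.is_finite() {
            return None;
        }
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Index of the first layer whose output contains a NaN, if any.
fn first_nan_layer(layers: &[Vec<f32>]) -> Option<usize> {
    layers.iter().position(|l| l.iter().any(|v| v.is_nan()))
}
```

Returning `Option<usize>` from the scan mirrors the `Some(6)` / `None` style in which the bisection results are reported in this PR.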
Why L6 first NaN:
Layer 6 (or a layer near it; bisection identifies L6 as the first
divergent layer) has at least one tensor at Q6_K. The GPU feeds Q6_K
bytes into the Q4_K kernel, which interprets the 256x6-bit
super-block layout as 256x4-bit, producing scaled-by-16x garbage
that overflows in silu(gate)*up within 1-2 layers.
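The layout mismatch is visible from block sizes alone. The sketch below assumes the standard ggml K-quant super-block layouts (256 weights per block; 144 bytes for Q4_K, 210 bytes for Q6_K); the constant and function names are hypothetical, and the 2048-weight row in the usage note is an illustrative width, not a value taken from this PR.

```rust
/// Weights per K-quant super-block (assumed ggml layout).
const QK_K: usize = 256;
/// Q4_K block: 2 (d) + 2 (dmin) + 12 (scales) + 128 (4-bit quants) bytes.
const Q4_K_BLOCK_BYTES: usize = 144;
/// Q6_K block: 128 (low 4 bits) + 64 (high 2 bits) + 16 (scales) + 2 (d) bytes.
const Q6_K_BLOCK_BYTES: usize = 210;

/// Bytes occupied by one row of `n_weights` quantized weights.
/// Returns None if the row is not a whole number of super-blocks.
fn row_bytes(n_weights: usize, block_bytes: usize) -> Option<usize> {
    if n_weights % QK_K != 0 {
        return None;
    }
    Some(n_weights / QK_K * block_bytes)
}

/// Byte offset at which a kernel assuming `block_bytes` expects row `row` to start.
fn row_offset(row: usize, n_weights: usize, block_bytes: usize) -> Option<usize> {
    row_bytes(n_weights, block_bytes).map(|b| row * b)
}
```

For a 2048-weight row, a Q4_K kernel strides 1152 bytes per row while the Q6_K data actually occupies 1680, so every block after the first is read from the wrong offset in addition to being decoded with the wrong field layout.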
Fix:
- Extend expert_swiglu_cuda signature with 3 qtype parameters
(gate_qtype, up_qtype, down_qtype)
- Add private matvec_qtype_cuda dispatch helper that mirrors CPU
matvec_for_qtype: Q4_K → q4k_matvec, Q6_K → q6k_gemv,
UnsupportedOperation otherwise
- Update both moe_ffn_forward_layer_cuda + _with_router callers
to pass layer.{gate,up,down}_exps.qtype
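The dispatch described in the list above can be sketched as follows. This is a minimal, host-only illustration under stated assumptions: `Kernel`, `DispatchError`, `select_kernel`, and `plan_expert_swiglu` are hypothetical names, the real `matvec_qtype_cuda` launches kernels rather than returning an enum, and the numeric type ids are the ones quoted in this commit message (Q4_K = 12, Q6_K = 14).

```rust
/// GGML type ids as cited in the commit message.
const GGML_TYPE_Q4_K: u32 = 12;
const GGML_TYPE_Q6_K: u32 = 14;

/// Which GPU kernel a tensor should be routed to (stand-in for a real launch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kernel {
    Q4kMatvec,
    Q6kGemv,
}

/// Hypothetical error type; carries the rejected qtype id.
#[derive(Debug, PartialEq)]
enum DispatchError {
    UnsupportedOperation(u32),
}

/// Pick a kernel from the tensor's qtype instead of assuming Q4_K.
fn select_kernel(qtype: u32) -> Result<Kernel, DispatchError> {
    match qtype {
        GGML_TYPE_Q4_K => Ok(Kernel::Q4kMatvec),
        GGML_TYPE_Q6_K => Ok(Kernel::Q6kGemv),
        other => Err(DispatchError::UnsupportedOperation(other)),
    }
}

/// Resolve the gate, up, and down kernels independently, since each
/// expert tensor in a mixed-quant file can carry its own qtype.
fn plan_expert_swiglu(
    gate_qtype: u32,
    up_qtype: u32,
    down_qtype: u32,
) -> Result<[Kernel; 3], DispatchError> {
    Ok([
        select_kernel(gate_qtype)?,
        select_kernel(up_qtype)?,
        select_kernel(down_qtype)?,
    ])
}
```

Failing loudly on an unknown qtype, rather than falling back to the Q4_K path, is the property that would have turned the L6 NaN into an immediate error.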
Discharges:
- M-GPU-MOE-1.4 implementation_stage: PARTIALLY_DISCHARGED → DISCHARGED
- FALSIFY-QW3-MOE-GPU-INVARIANTS-001: PARTIALLY_DISCHARGED → DISCHARGED
- FALSIFY-MOE-SUB-004 (sibling contract): PROPOSED → DISCHARGED
(this PR title cites L6 moe_ffn_out by name)
Pending heavy-harness re-run to promote
FALSIFY-QW3-MOE-GPU-PARITY-001 from PARTIALLY_DISCHARGED →
ALGORITHM_LEVEL_DISCHARGED → DISCHARGED.
Lib-only drift-prevention tests added:
- expert_swiglu_cuda_signature_has_three_qtype_params (compilation
gate)
- falsify_qw3_moe_gpu_qtype_aware_dispatch_rejects_unknown (asserts
same rejection set as CPU matvec_for_qtype)
Architecture portability: this fix is purely algorithmic dispatch
logic — same on sm_89 (Ada RTX 4090) and sm_120 (Blackwell GB10).
Discharges M83 finding that the bug is arch-portable.
Contract: qwen3-moe-forward-gpu-v1 v1.5.0 → v1.6.0.
`pv validate` 0/0; 4 lib tests pass.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
…NaN, L6 0.999651 MATCH
Live on Blackwell GB10 (gx10) post-fix:
- first NaN_GPU on moe_router : None
- first NaN_GPU on moe_ffn_out : None ← KEY: was Some(6); now None
- L6 moe_ffn_out: NanGpu → 0.999651 MATCH
- L0-L5 byte-identical (already Q4_K-only on both sides)
- L7+ no longer NaN-poisoned; ~85% of layers cos > 0.99
Wall time 11.70s (warm cache; first-run was 23.18s).
What this discharges:
- FALSIFY-QW3-MOE-GPU-INVARIANTS-001 finiteness: PARTIALLY → DISCHARGED
- FALSIFY-MOE-SUB-004: PROPOSED → DISCHARGED (this PR cites L6)
- M-GPU-MOE-1.4 stage: PARTIALLY_DISCHARGED → DISCHARGED
What stays partial:
- FALSIFY-QW3-MOE-GPU-PARITY-001 cosine: ALGORITHM_LEVEL_DISCHARGED
(~7-8 layers at cos 0.94-0.987, below 0.99 threshold).
Cause: fp accumulator order variance between CPU
fused_q6k_parallel_matvec (Rust SIMD rayon) and GPU q6k_gemv
(CUDA warp-shuffle). Both decode same Q6_K bytes correctly; f32
sum-of-products is non-associative. M-GPU-MOE-3 territory, not
the step-c NaN bug.
Co-Authored-By: Claude Opus 4.7 <[email protected]>
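The accumulator-order effect cited in this commit can be demonstrated on four numbers. This is a small sketch with hypothetical helper names; it illustrates why two correct reductions can disagree in f32, and does not claim to reproduce the specific cosine gaps reported here.

```rust
/// Left-to-right f32 accumulation, as a scalar loop would do.
fn sum_sequential(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Sum each chunk first, then combine the partial sums, which is the
/// shape of a parallel (SIMD lane or warp-level) reduction.
fn sum_chunked(xs: &[f32], chunk_len: usize) -> f32 {
    xs.chunks(chunk_len.max(1))
        .map(|c| sum_sequential(c))
        .fold(0.0f32, |acc, p| acc + p)
}
```

For `[1.0e8, 1.0, -1.0e8, 1.0]` the sequential sum is `1.0` and the chunked sum with `chunk_len = 2` is `0.0`, while the exact answer is `2.0`: f32 spacing near `1.0e8` is 8, so a `1.0` added next to the large term is rounded away, and which ones are lost depends on the grouping.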
noahgift added a commit that referenced this pull request May 6, 2026
…ALGORITHM_LEVEL post 1.x cascade (#1530)
Status promotion amendment after the M-GPU-MOE-1.4 step (c) cascade
closure (v1.6.0 / aprender PR #1529).
What flips:
- metadata.status: DRAFT → ACTIVE_ALGORITHM_LEVEL
- M-GPU-MOE-1 implementation_stage (umbrella): PENDING → SHIPPED
(covers full 1.x sub-cascade 1.0 → 1.4 step c)
- metadata.status comment refreshed (was stale "Scaffold +
architecture amendments + preload-bug fix")
Why ACTIVE_ALGORITHM_LEVEL not ACTIVE_RUNTIME:
Mirrors CPU sibling qwen3-moe-forward-v1 cadence: ALGORITHM_LEVEL =
"algorithm bound on main; finite output for canonical prompt".
RUNTIME flip waits on M-GPU-MOE-3 (throughput ≥150 tok/s + memory
budget) per original v1.0 contract convention.
Per-AC status:
- AC_GPU_MOE_001 (cosine ≥0.99 vs CPU): ALGORITHM_LEVEL_DISCHARGED
- AC_GPU_MOE_002 (cosine ≥0.99 vs HF FP16): blocked on fixture
- AC_GPU_MOE_003 (top-5 token recovery): pending heavy re-run
- AC_GPU_MOE_004 (output finiteness): DISCHARGED (M85)
- AC_GPU_MOE_005 (deterministic per-token): ALGORITHM_LEVEL_DISCHARGED
- AC_GPU_MOE_006 (throughput ≥150 tok/s): PENDING M-GPU-MOE-3
- AC_GPU_MOE_007 (VRAM ≤95%): PENDING M-GPU-MOE-3
YAML-only; production hot paths byte-unchanged.
`pv validate` 0/0.
Co-authored-by: Claude Opus 4.7 <[email protected]>
Merged
noahgift added a commit that referenced this pull request May 6, 2026
…SUB-004 DISCHARGED (#1531)
Status promotion amendment: FALSIFY-MOE-SUB-004 PROPOSED → DISCHARGED.
Rule: "The M-GPU-MOE-1.4 fix PR title/body MUST mention one of:
{moe_router, moe_expert_gate, moe_expert_up, moe_expert_swigl,
moe_expert_out, moe_ffn_out}."
Discharge evidence: aprender PR #1529 squash 89cb26a (M85) title is
"fix(M-GPU-MOE-1.4 step c): qtype-aware dispatch in expert_swiglu_cuda —
closes L6 moe_ffn_out NaN", which explicitly cites `moe_ffn_out` (one of
the 6 enumerated stages) by name. The PR body further cites "moe_ffn_out
at layer 6" multiple times in the Five-Whys analysis and per-layer
bisection result table.
All four FALSIFY-MOE-SUB-* tests now DISCHARGED:
- SUB-001 (parse): DISCHARGED at v1.4.0 (M82)
- SUB-002 (byte-identity / heavy harness): DISCHARGED at v1.5.0 (M83)
- SUB-003 (bisection-pinpoints-stage): DISCHARGED at v1.5.0 (M83)
- SUB-004 (fix-PR-cites-stage): DISCHARGED at v1.6.0 (this amendment)
M-MOE-SUB-4 (per-expert sub-stages) stays PENDING; it was optional, and
M-MOE-SUB-3's MoeRouter+MoeFfnOut precision was sufficient for M85's fix.
YAML-only; production hot paths byte-unchanged.
`pv validate` 0/0.
Co-authored-by: Claude Opus 4.7 <[email protected]>
Summary
Closes the M-GPU-MOE-1.4 NaN root cause at layer 6 `moe_ffn_out` identified by the LIVE bisection in PR #1527/#1528 (M83/M84).
Post-fix verified on gx10 Blackwell GB10: ZERO NaN across all 48 layers; L6 `moe_ffn_out` cos goes from `NanGpu` → `0.999651 MATCH`.

Five-Whys (validated by code inspection)
1. Why NaN at L6 `moe_ffn_out`? Garbage matvec output for gate or up at L6 overflows in the `silu(gate) * up` product.
2. Why garbage matvec? Q6_K bytes interpreted as Q4_K bytes by GPU.
3. Why wrong byte interpretation? `expert_swiglu_cuda` calls `q4k_matvec` UNCONDITIONALLY for both gate AND up (no qtype check).
4. Why no qtype-aware dispatch on GPU? Helper authored after CPU sibling; the qtype detail wasn't ported.
5. Why does CPU sibling have it? Qwen3-Coder-30B-A3B-Instruct Q4_K_M is MIXED quant: expert tensors per layer can be either Q4_K (12) or Q6_K (14). CPU `expert_swiglu_quantized` dispatches via `matvec_for_qtype` with explicit comments calling out this exact issue.

Why L0–L5 MATCH
For layers 0-5 of canonical Qwen3-Coder GGUF, gate_exps + up_exps are Q4_K. The unconditional `q4k_matvec` was correct for those layers (cos > 0.9999 in pre-fix bisection).

Why L6 first NaN
Layer 6 has at least one tensor at Q6_K. The GPU feeds Q6_K bytes into the Q4_K kernel, which interprets the 256×6-bit super-block layout as 256×4-bit, producing scaled-by-16x garbage that overflows in `silu(gate)*up` within 1-2 layers.

Fix
- Extend `expert_swiglu_cuda` signature with 3 qtype parameters (`gate_qtype`, `up_qtype`, `down_qtype`)
- Add private `matvec_qtype_cuda` dispatch helper that mirrors CPU `matvec_for_qtype`: Q4_K → `q4k_matvec`, Q6_K → `q6k_gemv`, `UnsupportedOperation` otherwise
- Update both `moe_ffn_forward_layer_cuda` + `_with_router` callers to pass `layer.{gate,up,down}_exps.qtype`

Verification
L6 row pre/post comparison:

Discharges

What stays partial
About 7-8 layers (L7, L9, L12, L20, L23, L29, L46, etc.) sit at cos 0.94–0.987, below the 0.99 threshold. Cause: floating-point accumulator order variance between CPU `fused_q6k_parallel_matvec` (Rust SIMD via rayon) and GPU `q6k_gemv` (CUDA warp-shuffle reduction). Both decode the same Q6_K bytes correctly; the f32 sum-of-products is just non-associative. This is M-GPU-MOE-3 territory (throughput-stage kernel refinement), not the step-c NaN bug.

Architecture portability
Fix is purely host-side dispatch logic, so it is the same on sm_89 (Ada RTX 4090) and sm_120 (Blackwell GB10). M83 finding ("bug is arch-portable, single fix discharges both") confirmed.

Drift-prevention tests
- `expert_swiglu_cuda_signature_has_three_qtype_params` (compilation gate)
- `falsify_qw3_moe_gpu_qtype_aware_dispatch_rejects_unknown` (asserts UnsupportedOperation on non-{Q4_K,Q6_K} qtype)

Test plan
- `pv validate` 0/0
- `cargo test -p aprender-serve --features cuda --lib expert_swiglu_cuda` 4/4 pass
- `evidence/m-gpu-moe-1-4-postfix-gx10-2026-05-06/`

🤖 Generated with Claude Code