noahgift · 2026-05-06T07:03:48Z

Summary

The M-GPU-MOE-1.4 NaN root cause is pinpointed to layer 6 moe_ffn_out. An operator-dispatched run of the M80 heavy harness (PR #1524) on Blackwell GB10 (gx10) against the cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal.

This discharges 2 of the 4 still-open falsifiers in the M-MOE-SUB cascade and narrows the M-GPU-MOE-1.4 fix scope to a single-layer surface.

Per-layer cos-sim summary

Full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt (853 lines)

layer | moe_router (cos / verdict)        | moe_ffn_out (cos / verdict)
------|-----------------------------------|-----------------------------------
L00   | 1.000000 MATCH                    | 0.999987 MATCH
L01   | 0.999998 MATCH                    | 0.999904 MATCH
L02   | 0.999998 MATCH                    | 0.999953 MATCH
L03   | 0.999975 MATCH                    | 0.999876 MATCH
L04   | 0.999994 MATCH                    | 0.999861 MATCH
L05   | 0.999993 MATCH                    | 0.999896 MATCH
L06   | 0.999986 MATCH                    |             NanGpu  ← FIRST NaN
L07   | 0.818903 DIVERGE                  |             NanGpu  ← NaN poisons router
L08–L47 | DIVERGE                         |             NanGpu  ← downstream poisoning
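
For readers unfamiliar with the table's columns, the per-stage verdict can be sketched roughly as follows. This is a minimal illustration, not the harness's code: the `Verdict` enum, the function names, and the 0.99 threshold are all assumptions chosen only so that the values in the table (0.999861 MATCH, 0.818903 DIVERGE) fall on the expected sides.

```rust
/// Verdict labels mirror the table above; the enum itself is hypothetical.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// ASSUMPTION: illustrative threshold, not the value used by the real harness.
const MATCH_THRESHOLD: f64 = 0.99;

/// Cosine similarity of two equally sized tensors, accumulated in f64
/// so that the comparison itself does not add rounding noise.
fn cos_sim(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "tensors must have the same length");
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Classify one stage: a NaN anywhere in the GPU tensor wins over cos-sim.
fn verdict(cpu: &[f32], gpu: &[f32]) -> Verdict {
    if gpu.iter().any(|v| v.is_nan()) {
        return Verdict::NanGpu;
    }
    // A NaN cos-sim (e.g. from an all-zero tensor) fails the comparison
    // and is therefore reported as Diverge.
    if cos_sim(cpu, gpu) >= MATCH_THRESHOLD {
        Verdict::Match
    } else {
        Verdict::Diverge
    }
}
```

Checking for NaN before computing the similarity keeps the two failure modes separate, which is what lets the table distinguish the first NaN at L06 from the router divergence at L07.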

Harness decision tree firing:

If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare).
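
The branch quoted above can be sketched as a small classifier over the per-layer moe_ffn_out verdicts. Only the "layer-N specific" branch comes from the harness output; the other variants, and every name in the block, are hypothetical and included only to make the function total.

```rust
/// Hypothetical verdict type, matching the labels in the cos-sim table.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Hypothetical diagnosis; only `LayerSpecific` corresponds to the
/// decision-tree branch quoted from the harness output.
#[derive(Debug, PartialEq)]
enum Diagnosis {
    /// No NaN on the GPU side in any layer.
    NoNan,
    /// NaN already present at layer 0.
    NanAtFirstLayer,
    /// First NaN at layer N > 0 and every earlier layer is MATCH.
    LayerSpecific(usize),
    /// First NaN at layer N > 0, but an earlier layer had already diverged.
    DivergedBeforeNan { first_nan: usize, first_diverge: usize },
}

fn diagnose(moe_ffn_out: &[Verdict]) -> Diagnosis {
    // Index of the first layer whose GPU output contains NaN, if any.
    let Some(first_nan) = moe_ffn_out.iter().position(|v| *v == Verdict::NanGpu) else {
        return Diagnosis::NoNan;
    };
    if first_nan == 0 {
        return Diagnosis::NanAtFirstLayer;
    }
    // Look for any non-MATCH layer strictly before the first NaN.
    match moe_ffn_out[..first_nan].iter().position(|v| *v != Verdict::Match) {
        None => Diagnosis::LayerSpecific(first_nan),
        Some(first_diverge) => Diagnosis::DivergedBeforeNan { first_nan, first_diverge },
    }
}
```

Fed the table above (six MATCH layers followed by NanGpu through L47), this returns `LayerSpecific(6)`.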

Status promotions

Falsifier           | v1.4.0                     | v1.5.0
--------------------|----------------------------|------------------------------------------
FALSIFY-MOE-SUB-001 | DISCHARGED                 | DISCHARGED
FALSIFY-MOE-SUB-002 | ALGORITHM_LEVEL_DISCHARGED | DISCHARGED
FALSIFY-MOE-SUB-003 | PROPOSED                   | DISCHARGED
FALSIFY-MOE-SUB-004 | PROPOSED                   | unchanged (pending M-GPU-MOE-1.4 fix PR)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding

This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → the bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch — Q4K/Q6K matvec compiled and ran on sm_120 first-shot.

This is a stronger signal than expected. We initially assumed gx10 might fail to reproduce or hit JIT issues; instead it gave us a clean, arch-portable bisection result, which means a single fix at the bisected stage discharges both arch-specific manifestations.

Bug surface narrowed for M-GPU-MOE-1.4 fix

crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs (per-layer GPU helper)
crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs (per-expert SwiGLU)
CudaExecutor::q4k_matvec / q6k_gemv (custom PTX)

Hypotheses for layer-6-specific NaN (priority order):

1. Numerical overflow in expert SwiGLU at L6: layer-6 intermediate activations have a distribution that causes silu(gate) * up to overflow the accumulator
2. Expert weight distribution at L6: layer-6 experts have weights that, combined with the CPU-traced L5 output, produce large activations
3. Q4K dequant accumulator at L6: a specific Q4K block at layer 6 has a scale value that causes overflow during the fused dequant + matmul
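
To probe the first hypothesis on the CPU side, one option is to scan the SwiGLU intermediate for magnitude and non-finite values. The sketch below assumes the gate and up projections are available as plain f32 slices; `swiglu_stats` and `SwigluStats` are hypothetical names, not part of aprender-serve, and the real kernels may accumulate in a different precision.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Hypothetical summary of one expert's SwiGLU intermediate.
#[derive(Debug, PartialEq)]
struct SwigluStats {
    /// Largest finite |silu(gate) * up| seen.
    max_abs: f32,
    /// Index of the first element that is inf or NaN, if any.
    first_non_finite: Option<usize>,
}

fn swiglu_stats(gate: &[f32], up: &[f32]) -> SwigluStats {
    assert_eq!(gate.len(), up.len(), "gate and up must have the same length");
    let mut max_abs = 0.0f32;
    let mut first_non_finite = None;
    for (i, (&g, &u)) in gate.iter().zip(up).enumerate() {
        let h = silu(g) * u;
        if h.is_finite() {
            max_abs = max_abs.max(h.abs());
        } else if first_non_finite.is_none() {
            first_non_finite = Some(i);
        }
    }
    SwigluStats { max_abs, first_non_finite }
}
```

An overflow to infinity does not have to be NaN at the point where it occurs: a later accumulation that adds +inf and -inf, or multiplies inf by 0, is what produces NaN. A large `max_abs` at L6 relative to L0 through L5 would therefore be worth noting even if `first_non_finite` is `None` in f32.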

Verification

$ pv validate contracts/trace-moe-gpu-sub-stages-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

$ ssh gx10 "cd ~/src/aprender && cargo test -p aprender-serve --features cuda --release \
    --test qwen3_moe_gpu_per_stage_diff falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff \
    -- --include-ignored --nocapture"
test falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 23.18s

Production hot paths

Byte-unchanged. YAML + evidence-only — additive-purity invariant pinned in v1.1.0 still holds.

Test plan

pv validate 0/0
LIVE harness ran on gx10 (Blackwell GB10) producing first NaN_GPU at L6 moe_ffn_out
Evidence captured in evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/
No production code touched

🤖 Generated with Claude Code

…on DISCHARGED on gx10

Operator-dispatched run of the M80 heavy harness on Blackwell GB10 (gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal pinpointing the M-GPU-MOE-1.4 NaN root cause to **layer 6 moe_ffn_out**.

Per-layer cos-sim summary (full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt):
- L0–L5 ALL MATCH (cos > 0.99986 on both moe_router AND moe_ffn_out)
- L6 first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+ all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

Status promotions:
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR pending; must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding: This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch.

Bug surface narrowed for M-GPU-MOE-1.4 fix scope:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

YAML + evidence-only; production hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0 still holds). `pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

…tep (b) bisection result (#1528)

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b) from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY: Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. A single fix at the bisected stage discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6). It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED: Q/K norm runs in attention, which is BEFORE FFN; if it were missing, the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0 amendment in PR #1527 records the same bisection result from the falsifier-side. This PR records it from the parent kernel-contract side; both refer to the same evidence dir.

YAML-only; production hot paths byte-unchanged. `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <[email protected]>

noahgift enabled auto-merge (squash) May 6, 2026 07:03

noahgift mentioned this pull request May 6, 2026

contract(qwen3-moe-forward-gpu-v1): v1.4.0 → v1.5.0 — M-GPU-MOE-1.4 step (b) bisection result #1528

Merged

4 tasks

noahgift merged commit 3d40fdd into main May 6, 2026
11 checks passed

noahgift deleted the contract/moe-sub-v1.5.0-live-bisection-gx10 branch May 6, 2026 07:28

noahgift mentioned this pull request May 6, 2026

fix(M-GPU-MOE-1.4 step c): qtype-aware dispatch in expert_swiglu_cuda — closes L6 moe_ffn_out NaN #1529

Merged

5 tasks

contract(trace-moe-gpu-sub-stages-v1): v1.4.0 → v1.5.0 — LIVE bisection DISCHARGED on gx10#1527

noahgift merged 1 commit into main from contract/moe-sub-v1.5.0-live-bisection-gx10
