contract(qwen3-moe-forward-gpu-v1): v1.4.0 → v1.5.0 — M-GPU-MOE-1.4 step (b) bisection result #1528

Merged: noahgift merged 2 commits into main from contract/qwen3-moe-gpu-v1.5.0-bisection-result on May 6, 2026

noahgift commented May 6, 2026

Summary

Parallel companion PR to #1527. Records the same M-GPU-MOE-1.4 bisection result from the parent kernel-contract side. Both PRs reference the same evidence dir at evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/.

Bisection result: the first NaN_GPU on moe_ffn_out appears at layer 6. The run was on Blackwell GB10 (gx10) and took 23.18 s wall-clock.

Implementation stage promotion

| Stage | v1.4.0 | v1.5.0 |
| --- | --- | --- |
| M-GPU-MOE-1.4 | PENDING | PARTIALLY_DISCHARGED (steps a+b done; step c fix OPEN) |

Key findings

  1. Layer 6 is the first NaN-emitting GPU stage. L0–L5 all MATCH on both moe_router AND moe_ffn_out (cos > 0.99986). At L6 the router is still finite but moe_ffn_out is NaN. L7+ diverge on the router because of downstream NaN poisoning.

  2. Bug is arch-portable. Reproduced on sm_120 (Blackwell GB10), showing the same defect class as on sm_89 (Ada RTX 4090, original M-GPU-MOE-1.3 finding). Two architectures failing the same way points to an algorithmic / numerical cause, NOT kernel codegen.

  3. v1.4.0's "missing Q/K RMSNorm" hypothesis is REFUTED. Q/K norm runs in attention, which precedes the FFN in each layer; if the norm were missing, divergence would appear earlier than L6.

  4. Bug surface narrowed to:

    • moe_ffn_forward_layer_cuda.rs
    • expert_swiglu_cuda.rs
    • CudaExecutor::q4k_matvec / q6k_gemv
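
The MATCH and NaN_GPU labels in finding 1 can be illustrated with a minimal sketch. It assumes each GPU stage output is compared to a CPU reference by cosine similarity, that any NaN on the GPU side takes precedence over the similarity check, and that the cutoff is a caller-supplied threshold. `Verdict`, `cosine`, and `classify` are hypothetical names for illustration, not the harness's actual API.

```rust
/// Per-layer comparison verdict, mirroring the labels quoted from the harness output.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Cosine similarity between two equal-length activation vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Classify one GPU stage output against its CPU reference.
/// `threshold` is an assumed cutoff, not the harness's real setting.
fn classify(cpu: &[f32], gpu: &[f32], threshold: f32) -> Verdict {
    // A NaN anywhere in the GPU output is reported before any similarity check.
    if gpu.iter().any(|v| v.is_nan()) {
        return Verdict::NanGpu;
    }
    if cosine(cpu, gpu) > threshold {
        Verdict::Match
    } else {
        Verdict::Diverge
    }
}
```

Checking for NaN first matters because a cosine similarity computed over a NaN-containing vector is itself NaN, and every comparison against NaN is false, which would otherwise be reported as an ordinary divergence.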

Hypotheses refined for step (c) fix

  1. Numerical overflow in expert SwiGLU at L6
  2. Expert weight distribution at L6 produces large activations
  3. Overflow in the Q4K dequant accumulator at L6
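
To make hypotheses 1 and 2 concrete, here is a hedged sketch of how large activations could turn into NaN inside a SwiGLU expert. It is a plain f32 CPU illustration, not the code in `expert_swiglu_cuda.rs`; the function names and the input magnitudes are assumptions chosen only to trigger the effect.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Elementwise SwiGLU gating: silu(gate) * up.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

/// Dot product standing in for one output row of the down projection.
fn dot(w: &[f32], h: &[f32]) -> f32 {
    w.iter().zip(h).map(|(a, b)| a * b).sum()
}
```

With `gate = [1e20, 1e20]` and `up = [1e20, -1e20]`, each product exceeds the f32 range, so `swiglu` yields `+inf` and `-inf`. Summing those in `dot` with weights `[1.0, 1.0]` evaluates `inf + (-inf)`, which is NaN. The intermediate infinities are easy to miss because only the final reduction produces NaN.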

Verification

$ pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

Evidence + raw harness output: see #1527.

Production hot paths

Byte-unchanged. YAML-only.

Test plan

🤖 Generated with Claude Code

…tep (b) bisection result

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b)
from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output:
  "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH:
   bug is layer-N specific (rare)."
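
The quoted branch can be sketched as a small predicate over per-layer verdicts for `moe_ffn_out`. Only this one branch comes from the harness output; the `Status` enum and both function names are hypothetical, and the other branches of the decision tree are not modeled.

```rust
/// Per-layer verdict for moe_ffn_out (hypothetical type for illustration).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    Match,
    Diverge,
    NanGpu,
}

/// Index of the first layer whose verdict is NanGpu, if any.
fn first_nan_gpu(layers: &[Status]) -> Option<usize> {
    layers.iter().position(|s| *s == Status::NanGpu)
}

/// True when the quoted branch fires: the first NaN layer is > 0
/// and every earlier layer is a MATCH.
fn is_layer_specific(layers: &[Status]) -> bool {
    match first_nan_gpu(layers) {
        Some(n) if n > 0 => layers[..n].iter().all(|s| *s == Status::Match),
        _ => false,
    }
}
```

For a sequence in which layers 0 through 5 are `Match` and layer 6 is `NanGpu`, `first_nan_gpu` returns `Some(6)` and `is_layer_specific` returns `true`, which corresponds to the result recorded here.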

ARCHITECTURAL PORTABILITY:
Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3
NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both
architectures produce NaN at the same layer → bug is algorithmic /
numerical, NOT kernel codegen. A single fix at the bisected stage
discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6).
It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED —
Q/K norm runs in attention which is BEFORE FFN; if it were missing
the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0
amendment in PR #1527 records the same bisection result from
the falsifier-side. This PR records it from the parent kernel-
contract side; both refer to the same evidence dir.

YAML-only — production hot paths byte-unchanged.

`pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@noahgift noahgift enabled auto-merge (squash) May 6, 2026 07:11
@noahgift noahgift merged commit 22d4e00 into main May 6, 2026
10 checks passed
@noahgift noahgift deleted the contract/qwen3-moe-gpu-v1.5.0-bisection-result branch May 6, 2026 07:49