contract(trace-moe-gpu-sub-stages-v1): v1.4.0 → v1.5.0 — LIVE bisection DISCHARGED on gx10#1527

Merged
noahgift merged 1 commit into main from contract/moe-sub-v1.5.0-live-bisection-gx10 on May 6, 2026

noahgift commented May 6, 2026

Summary

M-GPU-MOE-1.4 NaN root cause pinpointed to layer 6 moe_ffn_out. An operator-dispatched run of the M80 heavy harness (PR #1524) on Blackwell GB10 (gx10) against the cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal.

This discharges two more of the four falsifiers in the M-MOE-SUB cascade (FALSIFY-MOE-SUB-002 and FALSIFY-MOE-SUB-003), leaving only FALSIFY-MOE-SUB-004 open, and narrows the M-GPU-MOE-1.4 fix scope to a single-layer surface.

Per-layer cos-sim summary

Full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt (853 lines)

layer   | moe_router (cos / verdict) | moe_ffn_out (cos / verdict)
--------|----------------------------|----------------------------
L00     | 1.000000 MATCH             | 0.999987 MATCH
L01     | 0.999998 MATCH             | 0.999904 MATCH
L02     | 0.999998 MATCH             | 0.999953 MATCH
L03     | 0.999975 MATCH             | 0.999876 MATCH
L04     | 0.999994 MATCH             | 0.999861 MATCH
L05     | 0.999993 MATCH             | 0.999896 MATCH
L06     | 0.999986 MATCH             | NanGpu  ← FIRST NaN
L07     | 0.818903 DIVERGE           | NanGpu  ← NaN poisons router
L08–L47 | DIVERGE                    | NanGpu  ← downstream poisoning
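
For readers unfamiliar with the per-stage comparison, the following is a minimal sketch of how a verdict column like the one above could be derived from a CPU trace and a GPU trace of the same stage. It is not the harness's code: the function names, the `Verdict` enum, and in particular the `MATCH_THRESHOLD` value are assumptions made for illustration (the table only shows that 0.999861 is a MATCH and 0.818903 is a DIVERGE).

```rust
/// Verdict for one traced stage. Variant names mirror the table above;
/// the enum itself is hypothetical and not taken from the harness.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Assumed cut-off between MATCH and DIVERGE. The real harness threshold
/// is not stated in this PR.
const MATCH_THRESHOLD: f64 = 0.99;

/// Cosine similarity of two equal-length activation vectors.
/// Accumulates in f64 so the comparison itself does not add rounding noise.
fn cosine_similarity(cpu: &[f32], gpu: &[f32]) -> f64 {
    assert_eq!(cpu.len(), gpu.len(), "stage outputs must have equal length");
    let (mut dot, mut norm_cpu, mut norm_gpu) = (0.0f64, 0.0f64, 0.0f64);
    for (&c, &g) in cpu.iter().zip(gpu) {
        let (c, g) = (f64::from(c), f64::from(g));
        dot += c * g;
        norm_cpu += c * c;
        norm_gpu += g * g;
    }
    // A zero-norm input yields NaN here, which `classify` reports as Diverge.
    dot / (norm_cpu.sqrt() * norm_gpu.sqrt())
}

/// Classify one stage: any NaN on the GPU side wins over the cosine check.
fn classify(cpu: &[f32], gpu: &[f32]) -> Verdict {
    if gpu.iter().any(|v| v.is_nan()) {
        return Verdict::NanGpu;
    }
    if cosine_similarity(cpu, gpu) >= MATCH_THRESHOLD {
        Verdict::Match
    } else {
        Verdict::Diverge
    }
}
```

Checking for NaN before computing the cosine keeps the NaN case explicit; otherwise a NaN similarity would silently fall into the DIVERGE bucket and hide the first-NaN layer.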

Harness decision tree firing:

If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare).
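
The branch quoted above can be sketched as a small function over the per-layer `moe_ffn_out` verdicts. Only the `LayerSpecific` outcome corresponds to the rule quoted from the harness output; the other three outcomes, and all type and function names, are assumptions added so the example is total.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Outcome of scanning the per-layer verdicts (hypothetical type).
#[derive(Debug, PartialEq)]
enum Bisection {
    /// No layer produced NaN on the GPU side.
    NoNan,
    /// NaN already at layer 0, so there are no earlier layers to compare.
    NanFromFirstLayer,
    /// First NaN at `layer` > 0 and every earlier layer is a MATCH:
    /// the "bug is layer-N specific" branch quoted above.
    LayerSpecific { layer: usize },
    /// First NaN at `nan_layer`, but an earlier layer had already diverged.
    EarlierDivergence { nan_layer: usize, first_diverge: usize },
}

/// Scan verdicts in layer order and decide which branch fires.
fn bisect(ffn_out: &[Verdict]) -> Bisection {
    let Some(nan_layer) = ffn_out.iter().position(|v| *v == Verdict::NanGpu) else {
        return Bisection::NoNan;
    };
    if nan_layer == 0 {
        return Bisection::NanFromFirstLayer;
    }
    match ffn_out[..nan_layer].iter().position(|v| *v != Verdict::Match) {
        None => Bisection::LayerSpecific { layer: nan_layer },
        Some(first_diverge) => Bisection::EarlierDivergence { nan_layer, first_diverge },
    }
}
```

Fed the `moe_ffn_out` column from the table (six MATCH entries followed by NanGpu for the remaining layers), this sketch returns `LayerSpecific { layer: 6 }`.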

Status promotions

Falsifier           | v1.4.0                     | v1.5.0
--------------------|----------------------------|-----------------------------------------
FALSIFY-MOE-SUB-001 | DISCHARGED                 | DISCHARGED
FALSIFY-MOE-SUB-002 | ALGORITHM_LEVEL_DISCHARGED | DISCHARGED
FALSIFY-MOE-SUB-003 | PROPOSED                   | DISCHARGED
FALSIFY-MOE-SUB-004 | PROPOSED                   | unchanged (pending M-GPU-MOE-1.4 fix PR)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding

This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → the bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch — Q4K/Q6K matvec compiled and ran on sm_120 first-shot.

This is a stronger signal than expected. We initially assumed gx10 might fail to reproduce the bug or hit JIT issues. Instead it gave a clean, architecture-portable bisection result, which means a single fix at the bisected stage should discharge both architecture-specific manifestations.
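
The portability argument reduces to comparing the first-NaN layer across architectures. A minimal sketch of that inference follows; the `Locus` enum and `classify_locus` function are hypothetical, and the two non-matching outcomes are assumptions that go beyond what this PR observed.

```rust
/// Where the bug most likely lives, judged only from first-NaN layers
/// (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Locus {
    /// Every architecture hits NaN at the same layer.
    AlgorithmicOrNumerical,
    /// NaN appears on all architectures but at different layers.
    PossiblyArchSpecific,
    /// No data, or at least one architecture never produced NaN.
    NotReproduced,
}

/// Each entry is (architecture name, first layer with NaN on moe_ffn_out).
fn classify_locus(first_nan_by_arch: &[(&str, Option<usize>)]) -> Locus {
    if first_nan_by_arch.is_empty()
        || first_nan_by_arch.iter().any(|(_, layer)| layer.is_none())
    {
        return Locus::NotReproduced;
    }
    let first = first_nan_by_arch[0].1;
    if first_nan_by_arch.iter().all(|(_, layer)| *layer == first) {
        Locus::AlgorithmicOrNumerical
    } else {
        Locus::PossiblyArchSpecific
    }
}
```

With `[("sm_89", Some(6)), ("sm_120", Some(6))]`, which encodes the "same layer" observation above, the sketch returns `AlgorithmicOrNumerical`.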

Bug surface narrowed for M-GPU-MOE-1.4 fix

  • crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs (per-layer GPU helper)
  • crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs (per-expert SwiGLU)
  • CudaExecutor::q4k_matvec / q6k_gemv (custom PTX)

Hypotheses for layer-6-specific NaN (priority order):

  1. Numerical overflow in expert SwiGLU at L6: the layer-6 intermediate activations have a distribution that causes silu(gate) * up to overflow the accumulator
  2. Expert weight distribution at L6: layer-6 experts have weights that, combined with the CPU-traced L5 output, produce large activations
  3. Q4K dequant accumulator at L6: a specific Q4K block at layer 6 has a scale value that causes overflow during the fused dequant + matmul
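
To make the overflow hypotheses concrete, here is a small CPU-side sketch of how an overflowing `silu(gate) * up` product turns into NaN one step later. It is illustrative only: it uses f32 throughout, whereas the precision of the actual GPU accumulators is not stated in this PR, and every function name is hypothetical rather than taken from `expert_swiglu_cuda.rs`.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^-x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU intermediate: silu(gate) * up.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// Dot product with an f32 accumulator, standing in for one output
/// element of the down projection.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Index of the first value that is infinite or NaN, if any.
fn first_non_finite(values: &[f32]) -> Option<usize> {
    values.iter().position(|v| !v.is_finite())
}
```

With `gate = [1e20, 1e20]` and `up = [1e20, -1e20]`, each product exceeds the f32 range, so the intermediate is `[+inf, -inf]`. Neither element is NaN yet, but the following dot product with `[1.0, 1.0]` adds `+inf` and `-inf` and yields NaN. A check such as `first_non_finite` on the intermediate would therefore flag the problem one stage before the NaN appears, which is one way the three hypotheses could be told apart.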

Verification

$ pv validate contracts/trace-moe-gpu-sub-stages-v1.yaml
0 error(s), 0 warning(s)
Contract is valid.

$ ssh gx10 "cd ~/src/aprender && cargo test -p aprender-serve --features cuda --release \
    --test qwen3_moe_gpu_per_stage_diff falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff \
    -- --include-ignored --nocapture"
test falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 23.18s

Production hot paths

Byte-unchanged. YAML + evidence-only — additive-purity invariant pinned in v1.1.0 still holds.

Test plan

  • pv validate 0/0
  • LIVE harness ran on gx10 (Blackwell GB10) producing first NaN_GPU at L6 moe_ffn_out
  • Evidence captured in evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/
  • No production code touched

🤖 Generated with Claude Code

…on DISCHARGED on gx10

Operator-dispatched run of the M80 heavy harness on Blackwell GB10
(gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF
completed cleanly in 23.18s and produced the expected bisection
signal pinpointing the M-GPU-MOE-1.4 NaN root cause to **layer 6
moe_ffn_out**.

Per-layer cos-sim summary (full output:
evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt):
- L0–L5 ALL MATCH (cos > 0.99986 on both moe_router AND moe_ffn_out)
- L6 first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+ all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output:
  "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH:
   bug is layer-N specific (rare)."

Status promotions:
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED
  (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED
  (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR
  pending — must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding:
This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN
(PR #1493) was characterized on sm_89 (Ada RTX 4090). Both
architectures produce NaN at the same layer → bug is algorithmic /
numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT
pre-warming did NOT block the dispatch.

Bug surface narrowed for M-GPU-MOE-1.4 fix scope:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

YAML + evidence-only — production hot paths byte-unchanged
(additive-purity invariant pinned in v1.1.0 still holds).

`pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <[email protected]>

noahgift enabled auto-merge (squash) May 6, 2026 07:03
noahgift merged commit 3d40fdd into main May 6, 2026
11 checks passed
noahgift deleted the contract/moe-sub-v1.5.0-live-bisection-gx10 branch May 6, 2026 07:28
noahgift added a commit that referenced this pull request May 6, 2026
…tep (b) bisection result (#1528)

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b)
from an operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output:
  "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH:
   bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY:
Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3
NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both
architectures produce NaN at the same layer → bug is algorithmic /
numerical, NOT kernel codegen. A single fix at the bisected stage
discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6).
It's in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED —
Q/K norm runs in attention which is BEFORE FFN; if it were missing
the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0
amendment in PR #1527 records the same bisection result from
the falsifier-side. This PR records it from the parent kernel-
contract side; both refer to the same evidence dir.

YAML-only — production hot paths byte-unchanged.

`pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <[email protected]>