contract(trace-moe-gpu-sub-stages-v1): v1.4.0 → v1.5.0 - LIVE bisection DISCHARGED on gx10 #1527
Merged
…on DISCHARGED on gx10

Operator-dispatched run of the M80 heavy harness on Blackwell GB10 (gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal pinpointing the M-GPU-MOE-1.4 NaN root cause to **layer 6 moe_ffn_out**.

Per-layer cos-sim summary (full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt):
- L0–L5 ALL MATCH (cos > 0.99986 on both moe_router AND moe_ffn_out)
- L6 first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+ all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

Status promotions:
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR pending; must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.

Architectural portability finding: This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch.

Bug surface narrowed for M-GPU-MOE-1.4 fix scope:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

YAML + evidence-only: production hot paths byte-unchanged (additive-purity invariant pinned in v1.1.0 still holds). `pv validate` 0/0.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
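The per-layer summary and the decision-tree rule quoted above can be sketched roughly as follows. This is a minimal illustration, not the M80 harness itself: the names `Verdict`, `classify`, `first_nan_layer`, and `is_layer_specific` are hypothetical, and the match threshold is left as a parameter because the source reports only the observed similarities (cos > 0.99986), not the cutoff the harness uses.

```rust
/// Verdict for one (layer, stage) comparison of a CPU reference against GPU output.
/// Labels mirror the summary (MATCH / DIVERGE / NaN_GPU); the type is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// Cosine similarity, accumulated in f64 to limit rounding error.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return 0.0;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Classify one stage output. `threshold` is an assumed parameter.
fn classify(cpu: &[f32], gpu: &[f32], threshold: f64) -> Verdict {
    if gpu.iter().any(|v| v.is_nan()) {
        return Verdict::NanGpu;
    }
    if cosine_similarity(cpu, gpu) >= threshold {
        Verdict::Match
    } else {
        Verdict::Diverge
    }
}

/// Index of the first layer whose GPU output contains a NaN, if any.
fn first_nan_layer(verdicts: &[Verdict]) -> Option<usize> {
    verdicts.iter().position(|v| *v == Verdict::NanGpu)
}

/// The quoted rule: first NaN at a layer > 0 and every earlier layer MATCHes.
fn is_layer_specific(verdicts: &[Verdict]) -> bool {
    match first_nan_layer(verdicts) {
        Some(n) if n > 0 => verdicts[..n].iter().all(|v| *v == Verdict::Match),
        _ => false,
    }
}
```

Fed six `Match` verdicts followed by one `NanGpu` for `moe_ffn_out`, `first_nan_layer` returns `Some(6)` and `is_layer_specific` returns `true`, which is the branch the harness reported.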
noahgift added a commit that referenced this pull request on May 6, 2026
…tep (b) bisection result (#1528)

Records the LIVE bisection result for M-GPU-MOE-1.4 step (b) from operator-dispatched run on Blackwell GB10 (gx10) on 2026-05-06.

WHAT THE BISECTION FOUND:
- First NaN_GPU on `moe_ffn_out` = layer 6
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Decision tree firing per harness output: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."

ARCHITECTURAL PORTABILITY:
Bisection ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → bug is algorithmic / numerical, NOT kernel codegen. A single fix at the bisected stage discharges both arch-specific manifestations.

BUG SURFACE NARROWED:
- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs
- CudaExecutor::q4k_matvec / q6k_gemv

The bug is NOT in the routing logic (router stays finite at L6). It is in the per-expert FFN computation at layer 6 specifically.

HYPOTHESES (refined from v1.4.0, priority order):
1. Numerical overflow in expert SwiGLU at L6
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator at L6 overflow

The v1.4.0 hypothesis "missing per-head Q/K RMSNorm" is REFUTED: Q/K norm runs in attention, which is BEFORE FFN; if it were missing, the divergence would appear earlier than L6.

IMPLEMENTATION_STAGE M-GPU-MOE-1.4: PENDING → PARTIALLY_DISCHARGED.
- Step (a) instrumentation cascade: COMPLETE (M50→M81)
- Step (b) LIVE bisection: COMPLETE (this evidence)
- Step (c) fix: OPEN

Sibling contract `trace-moe-gpu-sub-stages-v1` v1.4.0 → v1.5.0 amendment in PR #1527 records the same bisection result from the falsifier side. This PR records it from the parent kernel-contract side; both refer to the same evidence dir.

YAML-only: production hot paths byte-unchanged. `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <[email protected]>
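Hypothesis 1 above can be illustrated with a small, self-contained sketch of how an overflowing SwiGLU product turns into NaN once it reaches an accumulator. This is plain f32 CPU code written for illustration, not the code in `expert_swiglu_cuda.rs`. The helper names are hypothetical and the input magnitudes are deliberately extreme. The source does not state which precision or accumulator width the GPU kernels use, so the actual overflow threshold there may differ.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SwiGLU: silu(gate) * up.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// One row of a down projection: a plain dot-product accumulator.
fn dot(w: &[f32], x: &[f32]) -> f32 {
    w.iter().zip(x).map(|(&a, &b)| a * b).sum()
}

/// Returns true if any element is NaN or infinite; a cheap guard one
/// might place after each sub-stage while narrowing down a fault.
fn has_non_finite(xs: &[f32]) -> bool {
    xs.iter().any(|v| !v.is_finite())
}
```

With `gate = [1e20, 1e20]` and `up = [1e20, -1e20]`, each product exceeds `f32::MAX`, so `swiglu` yields `[+inf, -inf]`. No NaN exists yet, but the next accumulation (`dot(&[1.0, 1.0], &h)`) computes `inf + (-inf)`, which is NaN. That matches the general shape of the observation: one stage goes non-finite and everything downstream is poisoned.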
Summary
M-GPU-MOE-1.4 NaN root cause pinpointed to layer 6 moe_ffn_out. Operator-dispatched run of the M80 heavy harness (PR #1524) on Blackwell GB10 (gx10) against cached 18 GB Qwen3-Coder-30B-A3B-Instruct GGUF completed cleanly in 23.18s and produced the expected bisection signal.

This discharges 2 of the 4 still-open falsifiers in the M-MOE-SUB cascade and narrows the M-GPU-MOE-1.4 fix scope to a single-layer surface.
Per-layer cos-sim summary
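Full output: evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/m80-bisection.txt (853 lines)
- L0–L5: ALL MATCH on both moe_router AND moe_ffn_out (cos > 0.99986)
- L6: first NaN_GPU on moe_ffn_out (router still finite at L6)
- L7+: all DIVERGE on router (downstream NaN poisoning)

Harness decision tree firing: "If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare)."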
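Status promotions
- FALSIFY-MOE-SUB-002: ALGORITHM_LEVEL_DISCHARGED → DISCHARGED (heavy harness ran cleanly with --include-ignored)
- FALSIFY-MOE-SUB-003: PROPOSED → DISCHARGED (bisection-pinpoints-stage; stage = L6 moe_ffn_out)
- FALSIFY-MOE-SUB-004: unchanged PROPOSED (M-GPU-MOE-1.4 fix PR pending; must cite L6 moe_ffn_out by name)

Contract status: ACTIVE_ALGORITHM_LEVEL → ACTIVE.
Architectural portability finding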
This ran on sm_120 (Blackwell GB10). Original M-GPU-MOE-1.3 NaN bug (PR #1493) was characterized on sm_89 (Ada RTX 4090). Both architectures produce NaN at the same layer → the bug is algorithmic / numerical, NOT kernel codegen. trueno#200 Blackwell PTX JIT pre-warming did NOT block the dispatch — Q4K/Q6K matvec compiled and ran on sm_120 first-shot.
This is a stronger signal than expected. We initially assumed gx10 might fail to reproduce or hit JIT issues; instead it gave us a clean, arch-portable bisection result, which means a single fix at the bisected stage discharges both arch-specific manifestations.
Bug surface narrowed for M-GPU-MOE-1.4 fix
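- crates/aprender-serve/src/gguf/cuda/moe_ffn_forward_layer_cuda.rs (per-layer GPU helper)
- crates/aprender-serve/src/gguf/cuda/expert_swiglu_cuda.rs (per-expert SwiGLU)
- CudaExecutor::q4k_matvec / q6k_gemv (custom PTX)

Hypotheses for layer-6-specific NaN (priority order):
1. Numerical overflow in expert SwiGLU at L6 (the `silu(gate) * up` product)
2. Expert weight distribution at L6 produces large activations
3. Q4K dequant accumulator overflow at L6

Verification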
Production hot paths
Byte-unchanged. YAML + evidence-only — additive-purity invariant pinned in v1.1.0 still holds.
Test plan
- `pv validate` 0/0
- evidence/m-gpu-moe-1-4-bisection-gx10-2026-05-06/

🤖 Generated with Claude Code