feat(aprender-serve): M-GPU-MOE-1.3 — preload_weights_gpu MoE-aware (partial discharge)#1491

Merged: noahgift merged 2 commits into main from feat/m-gpu-moe-1-3-preload-bug-fix on May 4, 2026

noahgift commented on May 4, 2026

Contributor

Summary

Implements M-GPU-MOE-1.3 per qwen3-moe-forward-gpu-v1 v1.3.0 (PR #1490). Two-commit fix.

What this PR fixes

| File | Change |
| --- | --- |
| gguf/config.rs | Add `is_moe: bool` field to `ArchConstraints` |
| gguf/arch_constraints_fallback.rs | Set `is_moe: true` for the qwen3_moe arm; add raw `qwen3moe`/`qwen3_5moe` aliases |
| cuda/executor/weights.rs | `build_indexed_weights` skips ffn_gate/up/down lookups when `arch.is_moe` |
| cuda/types.rs | `ValidatedLayerWeights::validate` skips FfnGate/FfnUp/FfnDown role checks when `arch.is_moe` |
| gguf/cuda/mod.rs | Skip parity_gate (Jidoka load-time gate) for MoE, because it runs the dense forward against placeholders |
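
To make the first two rows concrete, here is a minimal sketch of how an `is_moe` flag and the raw GGUF aliases could fit together. It is an illustration only: the real `ArchConstraints` has more fields, and the body of `from_architecture` shown here is an assumption, not the code in this PR.

```rust
/// Hypothetical, trimmed-down stand-in for `ArchConstraints`; the real
/// struct in gguf/config.rs carries more fields than shown here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ArchConstraints {
    /// True when the FFN is routed through experts instead of the dense
    /// ffn_gate/ffn_up/ffn_down tensors.
    pub is_moe: bool,
}

impl ArchConstraints {
    /// Map an architecture string to its constraints. The raw GGUF
    /// spellings (`qwen3moe`, `qwen3_5moe`) share an arm with the
    /// normalized name, so callers that skip normalization still match.
    pub fn from_architecture(arch: &str) -> Self {
        match arch {
            "qwen3_moe" | "qwen3moe" | "qwen3_5moe" => Self { is_moe: true },
            // Every other (dense) architecture keeps the flag off.
            _ => Self { is_moe: false },
        }
    }
}
```

With this shape, `ArchConstraints::from_architecture("qwen3moe").is_moe` is `true` even though the string never went through a normalization step.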

Test progression on lambda-vector RTX 4090

BEFORE this PR:
  panic at OwnedQuantizedModelCuda::new
  → build_indexed_weights demands `blk.0.ffn_gate.weight` (doesn't exist for MoE)

AFTER commit 1 (build_indexed_weights + ValidatedLayerWeights gates):
  panic at parity_gate (matmul_fused.rs:211)
  → dense forward indexes `layer.ffn_up_weight.data` (placeholder, byte_size=0)

AFTER commit 2 (parity_gate MoE guard):
  ✅ CPU forward succeeds (forward_qwen3_moe LAZY-FUSED-MATVEC)
  ✅ OwnedQuantizedModelCuda construction succeeds  
  ✅ GPU forward executes (forward_qwen3_moe_cuda)
  ❌ Asserts at gpu_logits.iter().all(|v| v.is_finite())
     → GPU produces NaN/Inf (separate downstream numerical bug)

What this PR partially discharges

  • FALSIFY-QW3-MOE-GPU-PRELOAD-001 ✅ — wrapper construction succeeds (was the original bug)
  • FALSIFY-QW3-MOE-GPU-INVARIANTS-001 ⚠️ — partial (output length OK; finiteness FAILS due to NaN/Inf)
  • FALSIFY-QW3-MOE-GPU-PARITY-001 ❌ — blocked by downstream NaN/Inf bug

New downstream bug (next iteration)

GPU forward produces NaN/Inf logits. Likely M-GPU-MOE-1.5 candidates:

  • Q4K matmul accumulator overflow in expert_swiglu_cuda
  • SwiGLU silu producing Inf for large inputs
  • Top-k router weight renormalization div-by-zero
  • Missing per-head Q/K RMSNorm in MoE GPU path

Bisection via apr trace --json --payload per M32d Step 2 methodology.
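
Two of the candidates above are easy to reproduce on the CPU in plain f32 arithmetic. The sketch below is not the kernel code; `silu`, `swiglu`, `renormalize_naive`, and `renormalize_guarded` are hypothetical helpers, and the uniform fallback in the guarded variant is an illustrative policy rather than the project's choice.

```rust
/// SiLU, written as x * sigmoid(x) = x / (1 + e^-x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU on a single element: silu(gate) * up.
/// silu itself stays finite for finite input, but the product with `up`
/// can exceed f32::MAX and become Inf.
pub fn swiglu(gate: f32, up: f32) -> f32 {
    silu(gate) * up
}

/// Naive top-k renormalization: divide every weight by the sum.
/// If all selected weights are 0.0, this computes 0.0 / 0.0 = NaN.
pub fn renormalize_naive(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    weights.iter().map(|w| w / sum).collect()
}

/// Guarded variant (assumed policy): fall back to uniform weights when
/// the sum is not safely positive. A NaN sum also takes the fallback,
/// because `NaN > x` is false.
pub fn renormalize_guarded(weights: &[f32]) -> Vec<f32> {
    let sum: f32 = weights.iter().sum();
    if sum > f32::EPSILON {
        weights.iter().map(|w| w / sum).collect()
    } else {
        let uniform = 1.0 / weights.len() as f32;
        weights.iter().map(|_| uniform).collect()
    }
}
```

An Inf produced by the first path typically becomes NaN one step later (for example `Inf - Inf` or `0.0 * Inf`), which would be consistent with a fully non-finite logit vector, though that is a hypothesis until the bisection confirms it.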

Stacking

This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment + evidence).

Test plan

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 4, 2026 22:08
@noahgift noahgift force-pushed the feat/m-gpu-moe-1-3-preload-bug-fix branch from eba93d0 to 8b07755 Compare May 4, 2026 22:28
noahgift and others added 2 commits May 5, 2026 00:50
…partial discharge)

Per qwen3-moe-forward-gpu-v1 v1.3.0 amendment (PR #1490).

WHAT THIS PR FIXES:

  ArchConstraints + build_indexed_weights + ValidatedLayerWeights all
  made MoE-aware via new `is_moe: bool` field on ArchConstraints.

  (1) `crates/aprender-serve/src/gguf/config.rs` — adds `is_moe: bool`
      field to `ArchConstraints` struct.

  (2) `crates/aprender-serve/src/gguf/arch_constraints_fallback.rs` —
      sets `is_moe: false` on all 19 dense arch entries; sets
      `is_moe: true` on the qwen3_moe arm. Also adds the raw GGUF arch
      string `qwen3moe` (no underscore) and `qwen3_5moe` to the same
      arm — these reach `from_architecture` from
      `ValidatedModelConfig::from_apr` without going through
      `normalize_architecture`.

  (3) `crates/aprender-serve/src/cuda/executor/weights.rs` —
      `build_indexed_weights` gates the 3 FFN-related quant lookups
      (ffn_gate.weight, ffn_up.weight, ffn_down.weight) on
      `arch.is_moe`; uses (0u64, 0usize) sentinels for MoE. Same
      gating for the 3 qtype resolutions.

  (4) `crates/aprender-serve/src/cuda/types.rs` —
      `ValidatedLayerWeights::validate` skips the FfnGate/FfnUp/FfnDown
      role checks when `arch.is_moe`. The MoE forward path
      (`forward_qwen3_moe_cuda`) routes FFN through `moe_layers`
      parameter, never reading these from the indexed weights.
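
  Item (3) can be pictured with the following sketch. It is a simplified model, not the code in weights.rs: `WeightRef`, `lookup_ffn_weight`, and the `HashMap` cache are hypothetical stand-ins, and only the `(0u64, 0usize)` sentinel and the tensor naming come from this commit.

```rust
use std::collections::HashMap;

/// (device pointer, byte length). (0, 0) is the sentinel used for MoE
/// layers, which have no dense FFN tensors to point at.
pub type WeightRef = (u64, usize);

/// Hypothetical helper: resolve one dense FFN weight for a layer.
/// `role` is one of "ffn_gate", "ffn_up", "ffn_down".
pub fn lookup_ffn_weight(
    cache: &HashMap<String, WeightRef>,
    layer: usize,
    role: &str,
    is_moe: bool,
) -> Result<WeightRef, String> {
    if is_moe {
        // MoE layers route the FFN through experts, so the dense tensor
        // does not exist. Return the sentinel instead of failing.
        return Ok((0u64, 0usize));
    }
    let name = format!("blk.{layer}.{role}.weight");
    cache
        .get(&name)
        .copied()
        .ok_or_else(|| format!("Quantized weight '{name}' not cached"))
}
```

  The sentinel is only safe as long as nothing downstream dereferences it, which is why the validation step in item (4) has to skip the same three roles.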

WHAT THIS PR PARTIALLY DISCHARGES:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new in v1.3.0) — wrapper
  construction now succeeds for qwen3_moe GGUFs. Before this PR,
  `OwnedQuantizedModelCuda::new(model, 0)` panicked at:

    UnsupportedOperation { operation: "preload_weights_gpu",
      reason: "PAR-043: Failed to build indexed weights:
               Invalid launch config: Quantized weight
               'blk.0.ffn_gate.weight' not cached" }

  After this PR, that specific path no longer fails. Verified by
  re-running M-GPU-MOE-1.2 heavy test — it now progresses past
  `OwnedQuantizedModelCuda::new`.

NEW DOWNSTREAM BUG (not blocking this PR):

  After the wrapper construction fix, the heavy test now panics in
  CPU forward `matmul_fused.rs:211` with
  `index out of bounds: the len is 0 but the index is N`. This is a
  separate bug class: some code on the CPU forward path dereferences
  `layer.ffn_up_weight.data` (or similar), which is the
  `dense_ffn_placeholder` (byte_size=0) for MoE layers per
  `transformer.rs:348-353`. Likely root cause: the CPU
  `forward_qwen3_moe` does NOT touch the dense placeholders directly,
  but some preload/validation/init step does. Needs a follow-up PR
  (M-GPU-MOE-1.4) to either (a) skip dense-FFN-data access for MoE
  layers, or (b) replace the placeholder with a proper sentinel.

  This PR DOES NOT regress the previous behaviour: the previous
  state was "wrapper construction fails", which masked the
  downstream bug. M-GPU-MOE-1.4 will surface and fix it.

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors
  cargo check -p aprender-serve --features cuda  → 0 errors
  cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
      --features cuda                            → 3 helpers pass

  Heavy test on lambda-vector RTX 4090:
    BEFORE this PR: panic at OwnedQuantizedModelCuda::new
                    (preload_weights_gpu / build_indexed_weights)
    AFTER this PR:  panic moved to CPU forward matmul_fused.rs:211
                    (downstream bug, separate PR scope)

  Net: progress by one bug class. The M-GPU-MOE-1.3 stage is
  FUNCTIONALLY DISCHARGED as defined; an M-GPU-MOE-1.4 follow-up is
  needed for full PARITY-001 discharge.

NOTE ON PR STACKING:

  This PR depends on PR #1490 (contract v1.2.0 → v1.3.0 amendment +
  evidence file) being on aprender main first. The contract pinned
  the architectural decision; this PR implements it.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
      FALSIFY-QW3-MOE-GPU-PRELOAD-001 (partial discharge)

Co-Authored-By: Claude Opus 4.7 <[email protected]>
Follow-up to the previous M-GPU-MOE-1.3 commit. The parity_gate
(Jidoka stop-the-line in `OwnedQuantizedModelCuda::with_max_seq_len`)
also runs the dense forward paths
(`forward_single_with_cache` CPU + `forward_gpu_resident` GPU) on
construction. For MoE these dispatch to `fused_matmul_f32` against
the `dense_ffn_placeholder` (byte_size=0), causing rayon-parallel
panics in `matmul_fused.rs:211`.

Fix: skip parity_gate when `arch.is_moe`, mirroring the rationale
already in v1.3.0's amendment_history block.

  - The parity gate's purpose is "stop the line if GPU diverges
    from CPU" — for dense models, it's load-time safety.
  - For MoE, the equivalent gate is FALSIFY-QW3-MOE-GPU-PARITY-001
    (qwen3_moe_gpu_parity.rs), which exercises the MoE-specific
    forward paths and bypasses the dense path the gate runs.
  - Net: MoE models lose load-time parity but gain
    test-time parity via the qwen3_moe_gpu_parity test.
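
The guard described above can be sketched as follows. `GateOutcome` and `run_parity_gate` are hypothetical names chosen for illustration; the real gate lives inside `OwnedQuantizedModelCuda::with_max_seq_len` and its signature is not shown in this commit.

```rust
/// Result of the load-time gate in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum GateOutcome {
    /// The gate was not run (MoE model).
    Skipped,
    /// Dense CPU and GPU forward agreed.
    Passed,
    /// Dense CPU and GPU forward diverged: stop the line.
    Diverged,
}

/// Run the dense parity check unless the architecture is MoE.
/// `dense_parity_check` stands in for the dense CPU-vs-GPU comparison
/// and returns true when the two agree.
pub fn run_parity_gate<F>(is_moe: bool, dense_parity_check: F) -> GateOutcome
where
    F: FnOnce() -> bool,
{
    if is_moe {
        // The dense path would index zero-length placeholders and
        // panic, so the closure must never be invoked here.
        return GateOutcome::Skipped;
    }
    if dense_parity_check() {
        GateOutcome::Passed
    } else {
        GateOutcome::Diverged
    }
}
```

Returning a distinct `Skipped` value, rather than reusing `Passed`, keeps it visible in logs that MoE models did not receive load-time parity coverage.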

VERIFICATION ON LAMBDA-VECTOR RTX 4090:

  Test progresses much further now:

    BEFORE: panic at OwnedQuantizedModelCuda::new build_indexed_weights
            (FALSIFY-QW3-MOE-GPU-PRELOAD-001 falsifier)
    AFTER previous commit: panic at parity_gate matmul_fused.rs:211
            (downstream bug — exposed but not yet fixed)
    AFTER this commit: CPU forward succeeds, GPU forward executes,
            then asserts at gpu_logits.iter().all(|v| v.is_finite())
            because the GPU produces NaN/Inf logits.

  Test output:
    [GH-129] Early kernel preload: 49 modules compiled
    [PMAT-082] cuBLASLt FP8 JIT warmed (2048x16x2048)
    [PMAT-053] FP8 weight cache: 193 matrices cached (728.8 MB)
    FALSIFY-QW3-MOE-GPU-PARITY-001: running GPU forward...
    panicked at qwen3_moe_gpu_parity.rs:168:
    all GPU logits must be finite (no NaN/Inf)

PARTIAL DISCHARGE:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction succeeds.
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001 — partial (output length OK
                                       implicitly; finiteness FAILS).
  FALSIFY-QW3-MOE-GPU-PARITY-001 — blocked by NaN/Inf bug.

NEW DOWNSTREAM BUG:

  GPU forward (forward_qwen3_moe_cuda body, M-GPU-MOE-1.1.2 PR
  #1477) produces NaN/Inf for at least the canonical 3-token
  Qwen3-Coder prompt. This is the NEXT bug to investigate
  (M-GPU-MOE-1.5 follow-up). Likely candidates:
    - Q4K matmul accumulator overflow in expert_swiglu_cuda
    - Per-expert SwiGLU silu activation produces Inf for large inputs
    - Top-k router weight renormalization division by zero
    - missing per-head Q/K RMSNorm path for MoE (qk_norm tensors
      loaded but not applied)
  Bisection via `apr trace --json --payload` per the M32d Step 2
  surface methodology (per qwen3-moe-forward-gpu-v1 v1.1.0
  PARITY-001 if_fails).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@noahgift noahgift force-pushed the feat/m-gpu-moe-1-3-preload-bug-fix branch from 8b07755 to 2ebacaf Compare May 4, 2026 22:50
@noahgift noahgift merged commit f0cbe37 into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the feat/m-gpu-moe-1-3-preload-bug-fix branch May 4, 2026 23:09
noahgift added a commit that referenced this pull request May 4, 2026
…on plan (#1492)

Records the next-bug-class finding from the M-GPU-MOE-1.3 partial
discharge (PR #1491 squash f0cbe37, MERGED 2026-05-04 on aprender
main).

WHAT 1.3 DISCHARGED:

  FALSIFY-QW3-MOE-GPU-PRELOAD-001 — wrapper construction now
  succeeds for qwen3_moe GGUFs.

WHAT 1.3 EXPOSED (NOT YET FIXED):

  Heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 +
  cached 17.3 GB Qwen3-Coder GGUF runs end-to-end but fails:

    assert!(gpu_logits.iter().all(|v| v.is_finite()),
            "all GPU logits must be finite (no NaN/Inf)");

  Blocks FALSIFY-QW3-MOE-GPU-PARITY-001 + PARTIAL-discharges
  FALSIFY-QW3-MOE-GPU-INVARIANTS-001.

BISECTION PLAN (M-GPU-MOE-1.4):

  (a) Instrumentation — extend apr trace --json --payload to
      capture per-stage tensors on MoE GPU path
  (b) Bisection — diff CPU vs GPU per-stage to find first
      NaN/Inf-producing stage. Candidates:
        * Q4K matvec accumulator overflow
        * SwiGLU silu Inf
        * top-k router renorm div-by-zero
        * missing per-head Q/K RMSNorm in MoE GPU path
  (c) Fix — apply at bisected stage; class TBD by result
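
  Step (b) reduces to a linear scan once per-stage tensors are captured. The sketch below assumes the captures are already in memory as `(stage name, values)` pairs in forward order; `first_non_finite_stage` is a hypothetical helper, not part of `apr trace`.

```rust
/// Given stage captures in forward order, return the name of the first
/// stage whose tensor contains a NaN or Inf, or None if all are finite.
pub fn first_non_finite_stage<'a>(stages: &[(&'a str, Vec<f32>)]) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, tensor)| tensor.iter().any(|v| !v.is_finite()))
        .map(|(name, _)| *name)
}
```

  Because NaN propagates through almost every later operation, only the first hit is informative; later stages will usually be non-finite as a consequence rather than a cause.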

THIS PR ADDS:

  * v1.3.0 → v1.4.0 amendment_history block (~110 lines)
  * NEW M-GPU-MOE-1.4 implementation_stage (PENDING)
  * M-GPU-MOE-1.3 status updated PENDING → PARTIALLY_DISCHARGED
  * Top-level version + status block updated

VALIDATION: pv validate → 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract"
— this PR pins the bisection-and-fix plan BEFORE code. Code
follows in M-GPU-MOE-1.4 fix PR (separate scope).

Refs: M52, M53, M54, M55, M56, R10,
      qwen3-moe-forward-gpu-v1 v1.4.0,
      FALSIFY-QW3-MOE-GPU-INVARIANTS-001 (finiteness sub-check).

Co-authored-by: Claude Opus 4.7 <[email protected]>
noahgift added a commit that referenced this pull request May 5, 2026
Sibling contract to trace-attn-sub-stages-v1; pins the SaveTensorStage
extensions needed for M-GPU-MOE-1.4 NaN/Inf bisection per
qwen3-moe-forward-gpu-v1 v1.4.0 amendment.

WHY THIS CONTRACT:

  After M-GPU-MOE-1.3 partial fix (PR #1491 squash f0cbe37
  MERGED), heavy `qwen3_moe_gpu_parity` test on lambda-vector
  RTX 4090 produces 100% NaN logits. Steps 1-9 are CPU-shared
  with finite output; step 10 (GPU MoE FFN — moe_ffn_forward_layer_cuda
  → expert_swiglu_cuda → q4k_matvec/q6k_gemv) is the only candidate.
  Per qwen3-moe-forward-gpu-v1 v1.4.0 bisection plan step (a),
  apr trace needs new SaveTensorStage variants for MoE GPU stages.

WHAT IT DEFINES:

  * 2 mandatory new SaveTensorStage variants:
    - MoeRouter — top-k weights post-softmax/renormalize
    - MoeFfnOut — aggregated MoE FFN output (Σ w_e * expert_out_e)
  * 4 optional per-expert variants (with expert_id qualifier):
    - MoeExpertGate, MoeExpertUp, MoeExpertSwigl, MoeExpertOut
    - Promoted to mandatory only if 2-stage bisection isn't
      precise enough.

  * Bisection chain CPU-vs-GPU:
    cos_sequence = [
      cos(CPU.ffn_norm,    GPU.ffn_norm),     # parent enum
      cos(CPU.moe_router,  GPU.moe_router),   # NEW
      cos(CPU.moe_ffn_out, GPU.moe_ffn_out),  # NEW
    ]
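
  The chain above can be evaluated with a plain cosine similarity per stage. This is a sketch under stated assumptions: `cosine` and `first_divergent_stage` are hypothetical helpers, the captures are assumed to be flat f32 slices of equal length, and the threshold is a caller-chosen value, not one defined by the contract.

```rust
/// Cosine similarity between two equal-length tensors, accumulated in
/// f64. Returns NaN if either input contains NaN or has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "tensors must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (x, y) in a.iter().zip(b) {
        let (x, y) = (f64::from(*x), f64::from(*y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}

/// Walk (stage, cpu, gpu) captures in forward order and return the
/// first stage whose cosine is NaN or below `threshold`.
pub fn first_divergent_stage<'a>(
    stages: &[(&'a str, Vec<f32>, Vec<f32>)],
    threshold: f32,
) -> Option<&'a str> {
    stages
        .iter()
        .find(|(_, cpu, gpu)| {
            let c = cosine(cpu, gpu);
            // A plain `c < threshold` is false for NaN, so test it
            // explicitly or a NaN stage would be reported as healthy.
            c.is_nan() || c < threshold
        })
        .map(|(name, _, _)| *name)
}
```

  If `ffn_norm` and `moe_router` agree but `moe_ffn_out` does not, the fault is localized to the expert computation and aggregation, which is the case where the optional per-expert variants would be needed.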

FALSIFICATION TESTS (4):

  FALSIFY-MOE-SUB-001: New variants exist + parse correctly
  FALSIFY-MOE-SUB-002: Existing 20-stage byte-identity preserved
  FALSIFY-MOE-SUB-003: Bisection identifies first NaN-producing stage
  FALSIFY-MOE-SUB-004: Fix PR cites bisected stage by name
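
  A check in the spirit of FALSIFY-MOE-SUB-001 could look like the sketch below. The enum here is a three-variant stand-in, not the real 20-stage `SaveTensorStage`, and the snake_case spellings are assumed from the names used in the bisection chain rather than taken from the parser.

```rust
use std::str::FromStr;

/// Stand-in for `SaveTensorStage`: one pre-existing stage plus the two
/// mandatory MoE variants this contract introduces.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SaveTensorStage {
    FfnNorm,
    MoeRouter,
    MoeFfnOut,
}

impl FromStr for SaveTensorStage {
    type Err = String;

    /// Parse the (assumed) snake_case stage name.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ffn_norm" => Ok(Self::FfnNorm),
            "moe_router" => Ok(Self::MoeRouter),
            "moe_ffn_out" => Ok(Self::MoeFfnOut),
            other => Err(format!("unknown stage '{other}'")),
        }
    }
}
```

  Keeping the existing arms untouched and only appending new ones is what lets a byte-identity check such as FALSIFY-MOE-SUB-002 continue to hold for the older stages.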

IMPLEMENTATION_STAGES (5):

  M-MOE-SUB-0: This contract scaffold (SHIPPED)
  M-MOE-SUB-1: Add MoeRouter + MoeFfnOut variants (PENDING)
  M-MOE-SUB-2: Wire MoeRouter into both CPU + GPU forward (PENDING)
  M-MOE-SUB-3: Wire MoeFfnOut + run heavy bisection (PENDING)
  M-MOE-SUB-4: OPTIONAL per-expert promotion (PENDING)

VALIDATION: pv validate exits 0 errors, 0 warnings.

Per CLAUDE.md "NEVER write code before writing a provable contract"
— this PR pins the trace-stage architecture before code lands.
M-MOE-SUB-1 follows in a separate PR.

Refs: M-GPU-MOE-1.4, R10, qwen3-moe-forward-gpu-v1 v1.4.0,
      trace-attn-sub-stages-v1 (sibling pattern).

Co-Authored-By: Claude Opus 4.7 <[email protected]>