test(aprender-serve): qwen3_moe_gpu_parity — M-GPU-MOE-1.2 cosine ≥0.99 falsifier#1484

Merged
noahgift merged 1 commit into main from feat/qwen3-moe-gpu-parity-test-m-1-2 on May 4, 2026
@noahgift noahgift commented May 4, 2026

Summary

  • Authors the FALSIFY-QW3-MOE-GPU-PARITY-001 test scaffold from qwen3-moe-forward-gpu-v1 v1.1.0 implementation_stages M-GPU-MOE-1.2.
  • New test file crates/aprender-serve/tests/qwen3_moe_gpu_parity.rs. Follows the M32d.2 CPU-vs-HF-FP16 template (qwen3_moe_parity.rs) line-for-line.
  • #[cfg(feature = "cuda")] + #[ignore] on the heavy test (CI default skips; explicit --include-ignored runs it on RTX 4090).
  • Three helper unit tests (cosine_similarity sanity coverage) DO run by default.

What the test does (when invoked with --include-ignored)

  1. Loads the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf once (mmap).
  2. Builds CPU forward_qwen3_moe reference logits (LAZY-FUSED-MATVEC ground truth).
  3. Builds GPU OwnedQuantizedModelCuda::forward_qwen3_moe_cuda logits.
  4. Computes cosine similarity over the full 151936-dim vocab.
  5. Asserts cos_sim ≥ 0.99 per the contract's formal bound.
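
Steps 4 and 5 can be sketched as follows. This is a minimal illustration rather than the helper in qwen3_moe_gpu_parity.rs: the f64 accumulation and the choice to return 0.0 for a zero-norm vector are assumptions, and the four-element arrays stand in for the 151936-dim logits.

```rust
/// Cosine similarity between two equal-length logit vectors.
/// Assumption: returns 0.0 when either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    (dot / (norm_a.sqrt() * norm_b.sqrt())) as f32
}

fn main() {
    // Toy stand-ins for the CPU reference and GPU logits.
    let cpu_logits = [1.0f32, 2.0, 3.0, 4.0];
    let gpu_logits = [1.01f32, 1.98, 3.02, 3.99];
    let cos_sim = cosine_similarity(&cpu_logits, &gpu_logits);
    // The parity gate: fail loudly if the two backends diverge.
    assert!(cos_sim >= 0.99, "parity gate failed: cos_sim = {}", cos_sim);
    println!("cos_sim = {:.6}", cos_sim);
}
```

Cosine similarity ignores overall scale, so a backend that returned uniformly scaled logits would still pass this gate; it measures direction only.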

Dependency

When the heavy test is run on lambda-vector, the M-GPU-MOE-1.1.2 full forward integration (PR #1477) must be on main first. Currently main has the v1.0-redo stub which returns UnsupportedOperation — running the heavy test against the stub will panic (correct behaviour for a falsifier against an incomplete impl).

The mut gpu_model binding carries a #[allow(unused_mut)] because PR #1477 changes the receiver from &self to &mut self.

Test plan

🤖 Generated with Claude Code

…99 falsifier

Authors the FALSIFY-QW3-MOE-GPU-PARITY-001 test scaffold from contract
qwen3-moe-forward-gpu-v1 v1.1.0 implementation_stages M-GPU-MOE-1.2.

WHAT THE TEST DOES (when run with `--include-ignored` against the
cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf on RTX 4090):

  1. Loads the GGUF once (single mmap).
  2. Builds moe_layers: Vec<Qwen3MoeQuantizedLayer> once.
  3. Builds CPU OwnedQuantizedModel #1 → runs forward_qwen3_moe on a
     fixed prompt → cpu_logits (the LAZY-FUSED-MATVEC ground truth).
  4. Builds CPU OwnedQuantizedModel #2 → wraps into
     OwnedQuantizedModelCuda → runs forward_qwen3_moe_cuda on the same
     prompt → gpu_logits.
  5. Computes cosine_similarity(cpu_logits, gpu_logits) over the
     full 151936-dim vocab.
  6. Asserts cos_sim ≥ 0.99 per the contract's formal bound.

The test follows the qwen3_moe_parity.rs (M32d.2 CPU-vs-HF-FP16)
template line-for-line — same canonical GGUF paths array, same
fixture-skip pattern, same cosine_similarity helper. The only
difference is the second forward pass dispatches to
forward_qwen3_moe_cuda instead of treating an FP32 fixture as truth.

CI WIRING:

  - #[cfg(feature = "cuda")] gates the entire file (no GPU host =
    no compile)
  - #[ignore] on the heavy test (CI default skips; explicit
    `--include-ignored` runs it)
  - 3 helper unit tests (cosine_similarity_unit_vectors / handles_zero
    / within_threshold) DO run by default — they cover the cosine
    helper itself
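
The gating and fixture-skip pattern described above might look roughly like this. It is a sketch under stated assumptions: `first_existing`, the test name `heavy_gpu_parity`, and the placeholder path are hypothetical, and the body of the heavy test is elided.

```rust
use std::path::PathBuf;

/// Hypothetical fixture-skip helper: first candidate path that exists on disk.
fn first_existing(candidates: &[&str]) -> Option<PathBuf> {
    candidates
        .iter()
        .map(|c| PathBuf::from(*c))
        .find(|p| p.exists())
}

/// Heavy test: compiled only with `--features cuda`, and skipped unless
/// `--include-ignored` (or `--ignored`) is passed to the test binary.
#[cfg(feature = "cuda")]
#[test]
#[ignore]
fn heavy_gpu_parity() {
    // Placeholder path; the real test has its own canonical GGUF paths array.
    let Some(gguf) = first_existing(&["/path/to/model.gguf"]) else {
        eprintln!("fixture not found, skipping");
        return;
    };
    // ... load the model, run CPU and GPU forward passes, compare logits ...
    let _ = gguf;
}

fn main() {
    // The helper itself runs anywhere: a missing fixture yields None.
    assert!(first_existing(&["/definitely/not/here.gguf"]).is_none());
}
```

Returning early when the fixture is absent keeps the test green on machines without the GGUF, which is why the `#[ignore]` attribute is still needed to keep the expensive path out of default runs.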

WHEN THE TEST PASSES:

  - The aprender PR #1477 (M-GPU-MOE-1.1.2 full forward integration)
    must be on main first. Currently main has the v1.0-redo stub;
    running this test against the stub returns UnsupportedOperation
    error and the test panics (correct behaviour for a falsifier
    against an incomplete impl).

  - Once #1477 lands, run the test on lambda-vector with:
        cargo test -p aprender-serve --test qwen3_moe_gpu_parity \
            --features cuda -- --include-ignored

  - On PASS, the contract's M-GPU-MOE-1.2 stage flips PENDING →
    SHIPPED and (with PARITY-002 from the v1 sibling) the gate
    discharges qwen3-moe-forward-gpu-v1 v1.1.0 DRAFT →
    ACTIVE_ALGORITHM_LEVEL.

PR #1477 changes forward_qwen3_moe_cuda's receiver from `&self` to
`&mut self` (kernel cache mutation). The `mut gpu_model` binding here
carries a forward-looking #[allow(unused_mut)] note for that reason.
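
A small sketch of why the receiver change matters for the binding. `GpuModel`, its `kernel_cache` field, and the kernel names are hypothetical stand-ins, not types from aprender-serve.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a GPU wrapper that caches compiled kernels.
struct GpuModel {
    kernel_cache: HashMap<String, u32>,
}

impl GpuModel {
    fn new() -> Self {
        Self { kernel_cache: HashMap::new() }
    }

    /// With a `&mut self` receiver the forward pass may populate the cache.
    fn forward(&mut self, kernel: &str) -> u32 {
        let next_id = self.kernel_cache.len() as u32;
        *self.kernel_cache.entry(kernel.to_string()).or_insert(next_id)
    }
}

fn main() {
    // If `forward` took `&self`, this `mut` would trigger the `unused_mut`
    // lint; the attribute keeps the binding warning-free under either receiver.
    #[allow(unused_mut)]
    let mut gpu_model = GpuModel::new();
    assert_eq!(gpu_model.forward("q4k_gemv"), 0);
    assert_eq!(gpu_model.forward("q4k_gemv"), 0); // cache hit, same id
    assert_eq!(gpu_model.forward("q6k_gemv"), 1);
}
```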

Refs: qwen3-moe-forward-gpu-v1 v1.1.0 :: M-GPU-MOE-1.2 +
      FALSIFY-QW3-MOE-GPU-PARITY-001 + companion-spec M51 + R10.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@noahgift noahgift force-pushed the feat/qwen3-moe-gpu-parity-test-m-1-2 branch from 6302645 to 0a0d7b3 Compare May 4, 2026 20:37
noahgift added a commit that referenced this pull request May 4, 2026
…QuantizedModelWgpu)

Pre-implementation architecture amendment for M-GPU-MOE-2 (wgpu
fallback). Mirrors the v1.1.0 option D amendment that pinned the
CUDA substrate before M-GPU-MOE-1.0 implementation; this one pins
the wgpu substrate before any wgpu code lands.

Why now: M-GPU-MOE-1 is in flight (1.0-redo SHIPPED, 1.1.1 SHIPPED,
1.1.2 OPEN as PR #1477, 1.2 test scaffold OPEN as PR #1484).
Choosing the wgpu seam early prevents the wrong-type-stub waste
that bit M-GPU-MOE-1.0 (PR #1460 placed forward_qwen3_moe_gpu on
OwnedQuantizedModel; one cycle later #1464 redid it on
OwnedQuantizedModelCuda — option D).

FOUR options considered:
  (I)   OwnedQuantizedModelWgpu wrapper type (analog of v1.1.0 option D) — CHOSEN
  (II)  GpuExecutor trait abstracting CUDA + wgpu — REJECTED (over-engineered)
  (III) Backend enum inside renamed OwnedQuantizedModelGpu — REJECTED (invasive)
  (IV)  Defer wgpu indefinitely — REJECTED (violates CLAUDE.md backend-agnostic mandate)

Option I picks wgpu by code-path symmetry, not by trait abstraction:
new file tree at `crates/aprender-serve/src/gguf/wgpu/` mirrors
`crates/aprender-serve/src/gguf/cuda/` line-for-line. A maintenance-mode
reviewer can verify a parity bug by diff, not by elaborate test
infrastructure.
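
The difference between option I and option II can be illustrated with a toy sketch. The type and method names come from the contract text; the fields, parameter list, and return type are assumptions made for illustration only (the real methods take more arguments).

```rust
/// Illustrative CPU-side model shared by both wrappers.
struct OwnedQuantizedModel {
    vocab_size: usize,
}

/// CUDA wrapper: a concrete type with its own inherent method.
struct OwnedQuantizedModelCuda {
    model: OwnedQuantizedModel,
}

/// wgpu wrapper: a second concrete type, not an impl of a shared trait.
struct OwnedQuantizedModelWgpu {
    model: OwnedQuantizedModel,
}

impl OwnedQuantizedModelCuda {
    // Assumed signature; placeholder body returns zeroed logits.
    fn forward_qwen3_moe_cuda(&mut self, token_ids: &[u32]) -> Result<Vec<f32>, String> {
        if token_ids.is_empty() {
            return Err("empty prompt".to_string());
        }
        Ok(vec![0.0; self.model.vocab_size])
    }
}

impl OwnedQuantizedModelWgpu {
    // Mirrors the CUDA method so the two code paths can be compared by diff.
    fn forward_qwen3_moe_wgpu(&mut self, token_ids: &[u32]) -> Result<Vec<f32>, String> {
        if token_ids.is_empty() {
            return Err("empty prompt".to_string());
        }
        Ok(vec![0.0; self.model.vocab_size])
    }
}

fn main() {
    let mut cuda = OwnedQuantizedModelCuda { model: OwnedQuantizedModel { vocab_size: 8 } };
    let mut wgpu = OwnedQuantizedModelWgpu { model: OwnedQuantizedModel { vocab_size: 8 } };
    let a = cuda.forward_qwen3_moe_cuda(&[1, 2, 3]).unwrap();
    let b = wgpu.forward_qwen3_moe_wgpu(&[1, 2, 3]).unwrap();
    assert_eq!(a.len(), b.len());
}
```

Nothing in the type system forces the two wrappers to stay aligned here; under this design that job falls to review by diff and to the parity tests.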

M-GPU-MOE-2 decomposed into four substages mirroring M-GPU-MOE-1.x:
  M-GPU-MOE-2.0 stub on OwnedQuantizedModelWgpu
  M-GPU-MOE-2.1 per-expert wgpu dispatch helpers (expert_swiglu_wgpu,
                moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2 full forward integration (replaces 2.0 stub body)
  M-GPU-MOE-2.3 cosine-vs-CPU parity test on hardware with wgpu

Two new blockers documented:
  - wgpu adapter selection probe for non-NVIDIA hardware
  - trueno-gpu Q6_K QuantizeKernel coverage check before 2.1

Companion-spec records this as M52 (no companion contract bump).

Validation:
  pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
  → 0 error(s), 0 warning(s). Contract is valid.

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 option I.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@noahgift noahgift enabled auto-merge (squash) May 4, 2026 20:37
noahgift added a commit that referenced this pull request May 4, 2026
….99 falsifier (wgpu) (#1488)

wgpu sibling of `qwen3_moe_gpu_parity.rs` (M-GPU-MOE-1.2, PR #1484).
Asserts cosine ≥ 0.99 between APR's CPU `forward_qwen3_moe` reference
and the wgpu `OwnedQuantizedModelWgpu::forward_qwen3_moe_wgpu`
integration on the same prompt.

Same falsifier ID as the cuda sibling
(FALSIFY-QW3-MOE-GPU-PARITY-001) — wgpu is a SECOND backend
implementing the same contract gate, not a different gate. Same
threshold (≥ 0.99), same canonical 17.3 GB Qwen3-Coder GGUF, same
3-token canonical prompt as the cuda test.

CI WIRING:

  - #[cfg(feature = "gpu")] gates the file (matches the gate on
    OwnedQuantizedModelWgpu in gguf/mod.rs)
  - #[ignore] on the heavy test (CI default skips; explicit
    `--include-ignored` runs it on a wgpu-capable adapter — Apple
    Silicon Metal, AMD Vulkan, Intel ARC Vulkan)
  - 2 helper unit tests (cosine_similarity sanity coverage) DO run
    by default

WHEN THE TEST PASSES:

  - M-GPU-MOE-2.0 stub returns UnsupportedOperation, so this test
    currently panics at the wgpu forward call (correct behaviour
    for a falsifier against an incomplete impl).
  - M-GPU-MOE-2.1 (per-expert wgpu helpers via trueno-gpu
    QuantizeKernel + GemmKernel compute pipelines) + M-GPU-MOE-2.2
    (full forward integration analog of forward_qwen3_moe_cuda)
    must both land before this test passes on hardware.
  - On hardware with wgpu support, run with --include-ignored to
    exercise. PASS discharges FALSIFY-QW3-MOE-GPU-PARITY-001 for
    the wgpu backend (cuda backend discharged by sibling test).

DEPENDS ON: PR #1485 (v1.2.0 amendment + M-GPU-MOE-2.0 stub).
Branch is stacked on the v1.2.0 contract branch; once #1485 lands
on main, this PR's base flips to main automatically.

Refs: M52, M53, R10, qwen3-moe-forward-gpu-v1 v1.2.0 ::
M-GPU-MOE-2.3 + FALSIFY-QW3-MOE-GPU-PARITY-001 (wgpu).

Co-authored-by: Claude Opus 4.7 <[email protected]>
@noahgift noahgift merged commit 8cbb7b5 into main May 4, 2026
10 checks passed
@noahgift noahgift deleted the feat/qwen3-moe-gpu-parity-test-m-1-2 branch May 4, 2026 21:02
noahgift added a commit that referenced this pull request May 4, 2026
…arity test (#1485)

* contract(qwen3-moe-forward-gpu-v1): v1.1.0 → v1.2.0 — option I (OwnedQuantizedModelWgpu)

* feat(aprender-serve): OwnedQuantizedModelWgpu stub — M-GPU-MOE-2.0 (#1487)

Implements M-GPU-MOE-2.0 per qwen3-moe-forward-gpu-v1 v1.2.0 option I
(see PR #1485 amendment). Analog of M-GPU-MOE-1.0-redo (PR #1464) for
the wgpu backend.

WHAT THIS PR ADDS:

  * crates/aprender-serve/src/gguf/wgpu_backend/mod.rs — new module
    with OwnedQuantizedModelWgpu struct + new() + stub method
    forward_qwen3_moe_wgpu(). Mirrors cuda/mod.rs structure.

  * crates/aprender-serve/src/gguf/wgpu_model.rs — re-export shim
    `pub use super::wgpu_backend::OwnedQuantizedModelWgpu`. Mirrors
    cuda_model.rs.

  * crates/aprender-serve/src/gguf/mod.rs: adds the two new modules
    behind `#[cfg(feature = "gpu")]` (the existing wgpu feature
    flag, `gpu = ["trueno/gpu"]` per Cargo.toml line 208).

WHY MODULE NAMED `wgpu_backend`:

The Rust ecosystem already has a `wgpu` crate. A module named `wgpu`
inside the same crate would shadow it inside the file's body. The
public re-export still presents `OwnedQuantizedModelWgpu` (no ugly
suffix) thanks to wgpu_model.rs.

WHY THIS IS A STUB:

Same staging discipline as M-GPU-MOE-1.0-redo — contract first,
scaffold second, implementation third. The body of
forward_qwen3_moe_wgpu validates preconditions (mirroring the cuda
sibling's boundary) then returns RealizarError::UnsupportedOperation
whose reason points at the v1.2.0 amendment block for the M-GPU-MOE-2
staging plan. Until M-GPU-MOE-2.2 lands, callers on non-CUDA
hardware fall back to OwnedQuantizedModel::forward_qwen3_moe (CPU
LAZY-FUSED-MATVEC, ~30 tok/s).
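
The stub-then-fallback behaviour described above can be sketched like this. Everything here is hypothetical: `ForwardError` only mirrors the shape of the `UnsupportedOperation { operation, reason }` error quoted in this thread, and the function names, reason string, and toy CPU path are placeholders, not the aprender-serve API.

```rust
/// Hypothetical local error type, not RealizarError itself.
#[derive(Debug)]
enum ForwardError {
    InvalidInput(String),
    UnsupportedOperation { operation: String, reason: String },
}

/// Stub stage: validate preconditions, then report "not implemented yet".
fn forward_wgpu_stub(token_ids: &[u32]) -> Result<Vec<f32>, ForwardError> {
    if token_ids.is_empty() {
        return Err(ForwardError::InvalidInput("empty prompt".to_string()));
    }
    Err(ForwardError::UnsupportedOperation {
        operation: "forward_qwen3_moe_wgpu".to_string(),
        reason: "pending M-GPU-MOE-2.2".to_string(),
    })
}

/// Placeholder for the CPU reference path.
fn forward_cpu(token_ids: &[u32]) -> Vec<f32> {
    token_ids.iter().map(|&t| t as f32).collect()
}

/// Caller-side fallback: use the CPU path while the GPU path is a stub.
fn forward_with_fallback(token_ids: &[u32]) -> Result<Vec<f32>, ForwardError> {
    match forward_wgpu_stub(token_ids) {
        Err(ForwardError::UnsupportedOperation { .. }) => Ok(forward_cpu(token_ids)),
        other => other,
    }
}

fn main() {
    let logits = forward_with_fallback(&[1, 2, 3]).expect("CPU fallback should succeed");
    assert_eq!(logits, vec![1.0, 2.0, 3.0]);
    // Precondition failures are not masked by the fallback.
    assert!(matches!(forward_with_fallback(&[]), Err(ForwardError::InvalidInput(_))));
}
```

Matching only on the unsupported-operation variant means genuine input errors still surface to the caller instead of being silently retried on the CPU.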

VERIFICATION:

  cargo check -p aprender-serve                  → 0 errors (default)
  cargo check -p aprender-serve --features cuda  → 0 errors (cuda)
  cargo check -p aprender-serve --features gpu   → 0 errors (wgpu)
  cargo test -p aprender-serve --lib --features gpu \
      owned_quantized_model_wgpu_tests           → 1 passed

Lib unit test asserts the function signature exists and matches the
cuda sibling step-for-step (compile-time checks via fn pointer
coercion — no runtime model construction needed at the stub stage).
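
A compile-time signature check via fn pointer coercion can be sketched as below. `CudaModel`, `WgpuModel`, `ForwardFn`, and the simplified signature are hypothetical; the actual lib test may be structured differently.

```rust
struct CudaModel;
struct WgpuModel;

impl CudaModel {
    fn forward(&mut self, _token_ids: &[u32]) -> Result<Vec<f32>, String> {
        Ok(Vec::new())
    }
}

impl WgpuModel {
    // Stub stage: same signature as the CUDA sibling, error body.
    fn forward(&mut self, _token_ids: &[u32]) -> Result<Vec<f32>, String> {
        Err("stub".to_string())
    }
}

/// Shape both backends must share, parameterised over the receiver type.
type ForwardFn<M> = fn(&mut M, &[u32]) -> Result<Vec<f32>, String>;

fn main() {
    // These coercions stop compiling if either signature drifts,
    // so no model needs to be constructed from real weights.
    let _cuda: ForwardFn<CudaModel> = CudaModel::forward;
    let wgpu: ForwardFn<WgpuModel> = WgpuModel::forward;
    assert!(wgpu(&mut WgpuModel, &[1, 2, 3]).is_err());
}
```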

DEPENDS ON: PR #1485 (qwen3-moe-forward-gpu-v1 v1.2.0 option I
amendment). Branch is stacked on the v1.2.0 contract branch; once
#1485 lands on main, this PR rebases onto main directly.

NEXT STAGES per v1.2.0:

  M-GPU-MOE-2.1  per-expert wgpu dispatch helpers
                 (expert_swiglu_wgpu, moe_ffn_forward_layer_wgpu)
  M-GPU-MOE-2.2  full forward integration mirror of cuda sibling
  M-GPU-MOE-2.3  cosine-vs-CPU parity test on wgpu hardware

Refs: M52, R10, qwen3-moe-forward-gpu-v1 v1.2.0 :: M-GPU-MOE-2.0.

Co-authored-by: Claude Opus 4.7 <[email protected]>

* test(aprender-serve): qwen3_moe_wgpu_parity — M-GPU-MOE-2.3 cosine ≥0.99 falsifier (wgpu) (#1488)

---------

Co-authored-by: Claude Opus 4.7 <[email protected]>
noahgift added a commit that referenced this pull request May 4, 2026
…d-bug fix plan

Live-dogfood finding 2026-05-04 on lambda-vector RTX 4090: the
M-GPU-MOE-1.2 heavy `qwen3_moe_gpu_parity` test (FALSIFY-QW3-MOE-
GPU-PARITY-001) cannot run on the cached 17.3 GB Qwen3-Coder GGUF
because `OwnedQuantizedModelCuda::new` itself fails:

  UnsupportedOperation { operation: "preload_weights_gpu",
    reason: "PAR-043: Failed to build indexed weights:
             Invalid launch config: Quantized weight
             'blk.0.ffn_gate.weight' not cached" }

ROOT CAUSE (5-whys in evidence file):

  `executor.build_indexed_weights` at
  `crates/aprender-serve/src/cuda/executor/weights.rs:325-373`
  unconditionally requires `blk.{i}.ffn_gate.weight`,
  `.ffn_up.weight`, `.ffn_down.weight` to be cached for every
  layer. For MoE these names DO NOT EXIST — MoE has 128 expert
  gates per layer (`blk.{i}.ffn_gate_exps.weight`) loaded into
  the `moe_layers` parameter at forward-time.

  M-GPU-MOE-1.1.2 (PR #1477)'s forward body sidesteps the indexed
  weights for FFN, but the wrapper construction goes through
  `preload_weights_gpu` BEFORE forward is ever called. Wrapper
  construction fails first.
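
The failure mode can be reproduced in miniature. This sketch is not the code in weights.rs and not the contract's fix: `check_ffn_weights` and its `is_moe` flag are hypothetical, and only the tensor names and the error text are taken from the report above.

```rust
use std::collections::HashSet;

/// Dense-FFN tensor names for one layer (names from the error report).
fn dense_ffn_names(layer: usize) -> Vec<String> {
    ["ffn_gate", "ffn_up", "ffn_down"]
        .iter()
        .map(|n| format!("blk.{}.{}.weight", layer, n))
        .collect()
}

/// Hypothetical preload check. `is_moe` is an assumed flag for illustration.
fn check_ffn_weights(
    cached: &HashSet<String>,
    num_layers: usize,
    is_moe: bool,
) -> Result<(), String> {
    if is_moe {
        // MoE expert weights are supplied via `moe_layers` at forward time.
        return Ok(());
    }
    for layer in 0..num_layers {
        for name in dense_ffn_names(layer) {
            if !cached.contains(&name) {
                return Err(format!("Quantized weight '{}' not cached", name));
            }
        }
    }
    Ok(())
}

fn main() {
    // An MoE checkpoint caches expert tensors, not dense FFN tensors.
    let cached = HashSet::from(["blk.0.ffn_gate_exps.weight".to_string()]);
    // An unconditional dense check reproduces the reported failure.
    let err = check_ffn_weights(&cached, 1, false).unwrap_err();
    assert_eq!(err, "Quantized weight 'blk.0.ffn_gate.weight' not cached");
    // Skipping the dense requirement for MoE lets construction proceed.
    assert!(check_ffn_weights(&cached, 1, true).is_ok());
}
```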

WHY DEFAULT CI DIDN'T CATCH IT:

  Lib-only stub test (PR #1464) only checks signature at compile
  time. Heavy `qwen3_moe_gpu_parity.rs` (PR #1484) is `#[ignore]`d
  + needs RTX 4090 + 17.3 GB GGUF. First `--include-ignored`
  dogfood on lambda-vector found this 2026-05-04.

THIS PR ADDS:

  (1) Evidence file
      `evidence/m-gpu-moe-1-2-blocked-by-preload-bug-2026-05-04/findings.md`
      documenting the live failure + 5-whys + fix architecture.

  (2) Contract `qwen3-moe-forward-gpu-v1` v1.2.0 → v1.3.0:
      * New v1.3.0 amendment_history block (~110 lines) describing
        the bug, root cause, and three-step fix architecture
      * New implementation_stage `M-GPU-MOE-1.3` between 1.2 and 2
        with status PENDING
      * New falsification_test FALSIFY-QW3-MOE-GPU-PRELOAD-001
        (hardware test + lib-only sibling)
      * Top-level version "1.2.0" → "1.3.0"
      * Status comment expanded to mention M-GPU-MOE-1.3 as a
        precondition for ACTIVE_ALGORITHM_LEVEL flip

VALIDATION: pv validate contracts/qwen3-moe-forward-gpu-v1.yaml
            → 0 errors, 0 warnings. Contract is valid.

WHAT THIS PR DOES NOT DO:

  Does NOT implement the fix. Per CLAUDE.md "NEVER write code
  before writing a provable contract", this PR pins the contract
  first. The fix lands in a separate PR (M-GPU-MOE-1.3 stage):
  ~30 LOC in weights.rs + 1-2 callers + ArchConstraints field +
  drift-prevention test.

  Does NOT block PR #1485's already-shipped 3-commit cascade
  (M52/M54). The cascade is correct; M-GPU-MOE-1.3 is a sibling
  bug-fix.

Refs: M52, M53, M54, R10, qwen3-moe-forward-gpu-v1 v1.3.0,
      FALSIFY-QW3-MOE-GPU-PRELOAD-001 (new).

Co-Authored-By: Claude Opus 4.7 <[email protected]>
noahgift added a commit that referenced this pull request May 4, 2026
…d-bug fix plan (#1490)