contracts(qwen3-moe-forward-gpu): v1.7.2 → v1.8.0 — M-GPU-MOE-3 cascade DISCHARGE — true root cause is CPU/CUDA activation-qtype mismatch (#1583)#1825

Merged: noahgift merged 2 commits into `main` from `contracts/qwen3-moe-forward-gpu-v1.8.0-cascade-discharge` on May 19, 2026

Conversation

@noahgift

Contributor

Summary

Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause.

True root cause

| Path | What it actually computes |
|---|---|
| CPU `fused_q4k_parallel_matvec` | Q4_K(weights) × Q8_K(activations) |
| CUDA `q4k_matvec` | Q4_K(weights) × f32_activations |

The two paths compute DIFFERENT MATHEMATICAL OPERATIONS. The 2.88% per-matvec delta comes from the lossy Q8_K activation quantization on the CPU path, which the CUDA path skips. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop on the 7 problem layers (L7/L9/L12/L20/L23/L29/L46).
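To make the mismatch concrete, here is a minimal sketch (not the project's kernels) of how quantizing only the activations shifts a matvec result. It assumes a simplified symmetric int8 scheme with one scale per 256-element block as a stand-in for Q8_K, keeps the weights as plain floats in both paths (so Q4_K weight quantization is not modeled), and uses Python doubles in place of f32. The helper names, matrix sizes, and random inputs are all hypothetical, and the printed delta is not expected to match the 2.88% measured in the cascade.

```python
import math
import random

BLOCK = 256  # assumed block size, standing in for a Q8_K-style super-block


def quantize_dequantize_int8(x, block=BLOCK):
    """Round-trip a vector through symmetric per-block int8 (simplified Q8_K stand-in)."""
    out = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0.0 else 1.0
        for v in chunk:
            q = max(-127, min(127, round(v / scale)))  # lossy rounding step
            out.append(q * scale)
    return out


def matvec(rows, x):
    """Dense matrix-vector product over plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


def relative_l2_error(reference, candidate):
    """||reference - candidate|| / ||reference||."""
    num = math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, candidate)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cols, n_rows = 1024, 64  # hypothetical sizes, not Qwen3 dimensions
weights = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(n_rows)]
activations = [rng.gauss(0.0, 1.0) for _ in range(cols)]

y_float = matvec(weights, activations)                           # CUDA-like: float activations
y_int8 = matvec(weights, quantize_dequantize_int8(activations))  # CPU-like: quantized activations
rel_err = relative_l2_error(y_float, y_int8)
print(f"relative L2 delta from activation quantization: {rel_err:.4%}")
```

Both paths share identical weights here, so any nonzero delta is attributable to the activation round-trip alone, which is the same isolation argument the falsifier PRs make.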

What v1.8.0 REFUTES

  • v1.7.2's "per-expert SwiGLU f32 intermediates" attribution — refuted by #1818 (SwiGLU intrinsic precision is ulp-scale across all input distributions including extreme [-20, 20])
  • v1.0.0..v1.7.1 "Q6_K fp-accumulator-order" framing — refuted by #1801 + #1816 (Q6_K per-matvec is ulp-scale on both synthetic AND real Qwen3 weights)
  • Q6_K-specific root-cause hypothesis — refuted by #1816's structural finding (L7/L9/L12 use Q4_K, not Q6_K, for ffn_down_exps)

Status change

| Version | Status | Attribution |
|---|---|---|
| v1.7.2 | ACTIVE_ALGORITHM_LEVEL | "SwiGLU f32 intermediates" (WRONG) |
| v1.8.0 | ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE | activation-qtype mismatch (verified) |

The 47/48 layers cos≥0.99 measurement stands; the L47 cliff and the drop to cos ≈ 0.94 on the 7 problem layers are now documented as the natural compositional consequence of the CPU/CUDA algorithm mismatch.
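For readers unfamiliar with the metric, the per-layer "cos" figures are cosine similarities between CPU and GPU layer outputs. The sketch below shows one way such a gate could be computed. The function names and the list-of-lists input shape are hypothetical, and only the 0.99 threshold is taken from this PR; the real gate lives in the Rust parity test, not in this code.

```python
import math

COS_THRESHOLD = 0.99  # per-layer bar quoted in this PR


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layers_below_threshold(cpu_layers, gpu_layers, threshold=COS_THRESHOLD):
    """Return (layer_index, cos) for each layer whose CPU/GPU outputs miss the threshold."""
    failures = []
    for idx, (cpu, gpu) in enumerate(zip(cpu_layers, gpu_layers)):
        cos = cosine_similarity(cpu, gpu)
        if cos < threshold:
            failures.append((idx, cos))
    return failures


# Toy data: layer 0 matches exactly, layer 1 diverges.
cpu_outputs = [[1.0, 0.0], [1.0, 0.0]]
gpu_outputs = [[1.0, 0.0], [1.0, 1.0]]
failures = layers_below_threshold(cpu_outputs, gpu_outputs)
print(failures)
```

A gate of this shape reports which layers fall short rather than a single pass/fail bit, which is what makes a statement like "47/48 layers cos≥0.99" possible.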

Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)

| Option | What it does | Tradeoff |
|---|---|---|
| 1 | CPU uses f32 activations (match CUDA) | Slows CPU (loses maddubs 4-8×) |
| 2 (recommended) | CUDA uses Q8_K activations (match CPU) | DP4A could be FASTER. `PackedDp4aQ4KQ8Kernel` already exists in trueno-gpu; just need a CUDA f32→Q8_K activation quant kernel to feed it. |
| 3 | Document divergence; relax cos threshold | Cheapest |
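As background for Option 2, DP4A is a GPU instruction that multiplies four pairs of 8-bit integers and adds the products to a 32-bit accumulator, so an integer dot product needs both operands in int8 form. The following is a rough Python emulation of that arithmetic under simplifying assumptions: one symmetric scale per vector, no Q4_K sub-block scales or mins, and hypothetical helper names. It says nothing about how `PackedDp4aQ4KQ8Kernel` is actually implemented.

```python
def quantize_int8(x):
    """Symmetric int8 quantization with a single scale for the whole vector (simplified)."""
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax > 0.0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in x], scale


def dp4a(a4, b4, acc):
    """Emulate one DP4A step: four int8 products added to an integer accumulator."""
    return acc + sum(int(a) * int(b) for a, b in zip(a4, b4))


def int8_dot(qa, scale_a, qb, scale_b):
    """Integer dot product in groups of four, rescaled to float once at the end."""
    assert len(qa) == len(qb) and len(qa) % 4 == 0
    acc = 0
    for i in range(0, len(qa), 4):
        acc = dp4a(qa[i:i + 4], qb[i:i + 4], acc)
    return acc * scale_a * scale_b


weights = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -2.0]
activations = [1.0, 0.5, -0.5, 0.25, 2.0, -1.0, 0.75, 0.1]
qw, sw = quantize_int8(weights)
qx, sx = quantize_int8(activations)
approx = int8_dot(qw, sw, qx, sx)                           # quantized-activation path
exact = sum(w * x for w, x in zip(weights, activations))    # float-activation path
print(f"int8 path: {approx:.4f}   float path: {exact:.4f}")
```

The inner loop is pure integer arithmetic and the float rescale happens once per dot product, which is the property behind the table's "DP4A could be FASTER" tradeoff. Under this option the GPU result would also carry the same activation-quantization error as the CPU path, so the two would agree with each other rather than with an f32 reference.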

Test plan

  • `python3 -c "import yaml; yaml.safe_load(...)"` PASS
  • `pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0 errors, 0 warnings, Contract is valid

Cross-refs

🤖 Generated with Claude Code

…de DISCHARGE amendment (#1583 spec advancement)

Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier
PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned
the true root cause of the 0.94-cos drop on real Qwen3 layers
L7/L9/L12/L20/L23/L29/L46.

## TRUE root cause

  CPU fused_q4k_parallel_matvec  = Q4_K(weights) × Q8_K(activations)
  CUDA q4k_matvec                = Q4_K(weights) × f32_activations

Different mathematical operations. The 2.88% per-matvec delta is lossy
Q8_K activation quantization. Compounded across 128 experts × 48 layers,
it produces the observed ~6% cumulative cos drop.

## What this amendment REFUTES

- v1.7.2's 'per-expert SwiGLU f32 intermediates' attribution
  → refuted by #1818 (SwiGLU intrinsic precision is ulp-scale)
- v1.0.0..v1.7.1 'Q6_K fp-accumulator-order' framing
  → refuted by #1801 (synthetic ulp-scale) + #1816 (real Q6_K ulp-scale)
- Q6_K-specific root-cause hypothesis
  → refuted by #1816's structural finding (L7/L9/L12 are Q4_K, not Q6_K)

## Status change

  v1.7.2: ACTIVE_ALGORITHM_LEVEL (with wrong SwiGLU attribution)
  v1.8.0: ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE
          - 47/48 layers cos≥0.99 stands
          - root cause documented (activation-qtype algorithm mismatch)
          - L47 cliff is the natural compositional consequence
          - 0.94-cos on 7 problem layers is documented, not a bug

## Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)

OPTION 1: CPU uses f32 activations (slow CPU)
OPTION 2: CUDA uses Q8_K activations (RECOMMENDED — DP4A faster)
          PackedDp4aQ4KQ8Kernel already exists; just need CUDA
          f32→Q8_K activation quant kernel to feed it.
OPTION 3: Document divergence; relax cos threshold

## Validation

- python3 yaml.safe_load: PASS
- pv validate contracts/qwen3-moe-forward-gpu-v1.yaml: 0 errors, 0 warnings

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Cascade: #1801, #1805, #1811, #1816, #1818, #1821, #1822
- Sibling: tests/qwen3_moe_per_layer_gpu_parity.rs
  (FALSIFY-QW3-MOE-PER-LAYER-001) — real-model parity gate

Co-Authored-By: Claude Opus 4.7 <[email protected]>
@noahgift enabled auto-merge (squash) May 19, 2026 20:51
@noahgift merged commit 2615af6 into main May 19, 2026
10 checks passed
@noahgift deleted the contracts/qwen3-moe-forward-gpu-v1.8.0-cascade-discharge branch May 19, 2026 21:58