
noahgift · 2026-05-19T20:51:03Z

Summary

Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause.

True root cause

Path	What it actually computes
CPU `fused_q4k_parallel_matvec`	Q4_K(weights) × Q8_K(activations)
CUDA `q4k_matvec`	Q4_K(weights) × f32_activations

The two paths compute different mathematical operations: the CPU path quantizes activations to Q8_K before the matvec, while the CUDA path consumes them as f32. The 2.88% per-matvec delta comes from the lossy Q8_K activation quantization. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop on the 7 problem layers (L7/L9/L12/L20/L23/L29/L46).
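To make the mismatch concrete, here is a minimal sketch of how quantizing only the activations changes a matvec result. It is an illustration under stated assumptions, not the project's kernels: `quantize_roundtrip` is a hypothetical simplified int8 quantizer with one scale per 256-element block (not the real Q8_K layout), the weights are left as plain floats because both paths share the same Q4_K weights, and the random data will not reproduce the 2.88% figure measured on real Qwen3 tensors.

```python
import math
import random

BLOCK = 256  # assumed block size for this sketch


def quantize_roundtrip(xs, block=BLOCK):
    """Quantize to int8 with one scale per block, then dequantize to float.

    Hypothetical simplified quantizer; not the real Q8_K format.
    """
    out = []
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        out.extend(round(v / scale) * scale for v in chunk)
    return out


def dot(a, b):
    return math.fsum(x * y for x, y in zip(a, b))


def rel_l2(ref, got):
    """Relative L2 error of `got` with respect to `ref`."""
    num = math.sqrt(sum((r - g) ** 2 for r, g in zip(ref, got)))
    den = math.sqrt(sum(r * r for r in ref))
    return num / den


rng = random.Random(0)
rows, cols = 8, 1024
weights = [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]
acts = [rng.gauss(0, 1) for _ in range(cols)]

# CUDA-like path: activations used at full precision.
y_f32 = [dot(w, acts) for w in weights]

# CPU-like path: activations quantized before the dot product.
acts_q = quantize_roundtrip(acts)
y_q8 = [dot(w, acts_q) for w in weights]

delta = rel_l2(y_f32, y_q8)
print(f"relative L2 delta from activation quantization: {delta:.4%}")
```

The weights are identical in both paths, so any nonzero delta here comes purely from the activation round trip, which is the effect the table above describes.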

What v1.8.0 REFUTES

v1.7.2's "per-expert SwiGLU f32 intermediates" attribution — refuted by #1818 (SwiGLU intrinsic precision is ulp-scale across all input distributions including extreme [-20, 20])
v1.0.0..v1.7.1 "Q6_K fp-accumulator-order" framing — refuted by #1801 + #1816 (Q6_K per-matvec is ulp-scale on both synthetic AND real Qwen3 weights)
Q6_K-specific root-cause hypothesis — refuted by #1816's structural finding (L7/L9/L12 use Q4_K, not Q6_K, for ffn_down_exps)

Status change

Version	Status	Attribution
v1.7.2	ACTIVE_ALGORITHM_LEVEL	"SwiGLU f32 intermediates" (WRONG)
v1.8.0	ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE	activation-qtype mismatch (verified)

The 47/48 layers cos≥0.99 measurement stands; the L47 cliff and the 0.94-cos drop on 7 problem layers are now documented as the natural compositional consequence of the CPU/CUDA algorithm mismatch.
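The per-layer cos figures quoted above are cosine similarities between CPU and GPU layer outputs. A minimal sketch of such a check follows; the helper names `cosine` and `layers_below` and the toy vectors are assumptions for illustration and are not taken from `tests/qwen3_moe_per_layer_gpu_parity.rs`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = math.fsum(x * y for x, y in zip(a, b))
    na = math.sqrt(math.fsum(x * x for x in a))
    nb = math.sqrt(math.fsum(y * y for y in b))
    return dot / (na * nb)


def layers_below(cpu_layers, gpu_layers, threshold=0.99):
    """Return (layer_index, cos) for every layer whose cos is below threshold."""
    flagged = []
    for i, (cpu_out, gpu_out) in enumerate(zip(cpu_layers, gpu_layers)):
        c = cosine(cpu_out, gpu_out)
        if c < threshold:
            flagged.append((i, c))
    return flagged


# Toy data: layer 0 matches exactly, layer 1 diverges.
cpu = [[1.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
gpu = [[1.0, 0.0, 0.0], [3.0, 2.0, 1.0]]
flagged = layers_below(cpu, gpu)
print(flagged)
```

Under Option 3 in the fix paths, only the `threshold` argument of a gate like this would change.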

Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)

Option	What it does	Tradeoff
1	CPU uses f32 activations (match CUDA)	Slows CPU (loses maddubs 4-8×)
2 (recommended)	CUDA uses Q8_K activations (match CPU)	DP4A could be FASTER. `PackedDp4aQ4KQ8Kernel` already exists in trueno-gpu; just need a CUDA f32→Q8_K activation quant kernel to feed it.
3	Document divergence; relax cos threshold	Cheapest
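For Option 2, the missing piece is a kernel that turns f32 activations into int8 values plus scales so that an integer dot product can consume them. The following Python reference model sketches that data flow under loud assumptions: `quantize_q8_blocks`, `dp4a`, and `int_dot_blockwise` are hypothetical names, the block layout is simplified and is not the real Q8_K format, and the weights are quantized with the same int8 helper purely as a stand-in (real Q4_K uses 4-bit values with sub-block scales and minimums). It says nothing about the actual interface of `PackedDp4aQ4KQ8Kernel`.

```python
import random


def quantize_q8_blocks(xs, block=256):
    """Return (int8 values, per-block scales). Simplified, not real Q8_K."""
    q, scales = [], []
    for start in range(0, len(xs), block):
        chunk = xs[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(v / scale))) for v in chunk)
    return q, scales


def dp4a(a4, b4, acc):
    """Emulate a 4-way int8 dot product accumulated into an integer."""
    return acc + sum(x * y for x, y in zip(a4, b4))


def int_dot_blockwise(wq, w_scales, aq, a_scales, block=256):
    """Integer dot per block, then rescale by both block scales."""
    total = 0.0
    for bi, start in enumerate(range(0, len(aq), block)):
        acc = 0  # integer accumulator for this block
        for k in range(start, min(start + block, len(aq)), 4):
            acc = dp4a(wq[k:k + 4], aq[k:k + 4], acc)
        total += acc * w_scales[bi] * a_scales[bi]
    return total


rng = random.Random(1)
n = 512
acts = [rng.uniform(-4.0, 4.0) for _ in range(n)]
wts = [rng.uniform(-1.0, 1.0) for _ in range(n)]

aq, a_scales = quantize_q8_blocks(acts)
wq, w_scales = quantize_q8_blocks(wts)

approx = int_dot_blockwise(wq, w_scales, aq, a_scales)
exact = sum(w * a for w, a in zip(wts, acts))
print(f"integer-path dot: {approx:.6f}  float dot: {exact:.6f}")
```

The inner loop stays in integers and touches the float scales only once per block, which is the property that lets a DP4A-style kernel be fast.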

Test plan

`python3 -c "import yaml; yaml.safe_load(...)"` PASS
`pv validate contracts/qwen3-moe-forward-gpu-v1.yaml` → 0 errors, 0 warnings, Contract is valid

Cross-refs

🤖 Generated with Claude Code

…de DISCHARGE amendment (#1583 spec advancement)

Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause of the 0.94-cos drop on real Qwen3 layers L7/L9/L12/L20/L23/L29/L46.

## TRUE root cause

CPU fused_q4k_parallel_matvec = Q4_K(weights) × Q8_K(activations)
CUDA q4k_matvec = Q4_K(weights) × f32_activations

Different mathematical operations. The 2.88% per-matvec delta is lossy Q8_K activation quantization. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop.

## What this amendment REFUTES

- v1.7.2's 'per-expert SwiGLU f32 intermediates' attribution → refuted by #1818 (SwiGLU intrinsic precision is ulp-scale)
- v1.0.0..v1.7.1 'Q6_K fp-accumulator-order' framing → refuted by #1801 (synthetic ulp-scale) + #1816 (real Q6_K ulp-scale)
- Q6_K-specific root-cause hypothesis → refuted by #1816's structural finding (L7/L9/L12 are Q4_K, not Q6_K)

## Status change

v1.7.2: ACTIVE_ALGORITHM_LEVEL (with wrong SwiGLU attribution)
v1.8.0: ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE

- 47/48 layers cos≥0.99 stands
- root cause documented (activation-qtype algorithm mismatch)
- L47 cliff is the natural compositional consequence
- 0.94-cos on 7 problem layers is documented, not a bug

## Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)

OPTION 1: CPU uses f32 activations (slow CPU)
OPTION 2: CUDA uses Q8_K activations (RECOMMENDED; DP4A faster). PackedDp4aQ4KQ8Kernel already exists; just need CUDA f32→Q8_K activation quant kernel to feed it.
OPTION 3: Document divergence; relax cos threshold

## Validation

- python3 yaml.safe_load: PASS
- pv validate contracts/qwen3-moe-forward-gpu-v1.yaml: 0 errors, 0 warnings

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Cascade: #1801, #1805, #1811, #1816, #1818, #1821, #1822
- Sibling: tests/qwen3_moe_per_layer_gpu_parity.rs (FALSIFY-QW3-MOE-PER-LAYER-001), the real-model parity gate

Co-Authored-By: Claude Opus 4.7 <[email protected]>


noahgift enabled auto-merge (squash) May 19, 2026 20:51

Merge branch 'main' into contracts/qwen3-moe-forward-gpu-v1.8.0-cascade-discharge

23c195d

noahgift merged commit 2615af6 into main May 19, 2026
10 checks passed

noahgift deleted the contracts/qwen3-moe-forward-gpu-v1.8.0-cascade-discharge branch May 19, 2026 21:58

noahgift mentioned this pull request May 20, 2026

M-GPU-MOE-3 Option 2: CUDA f32→Q8K activation quant kernel (close v1.8.0 discharge with parity fix) #1838

Open

contracts(qwen3-moe-forward-gpu): v1.7.2 → v1.8.0 — M-GPU-MOE-3 cascade DISCHARGE — true root cause is CPU/CUDA activation-qtype mismatch (#1583)#1825

noahgift merged 2 commits into main from contracts/qwen3-moe-forward-gpu-v1.8.0-cascade-discharge
