
noahgift · 2026-05-19T16:43:02Z

Summary

Fifth falsifier in the M-GPU-MOE-3 cascade. Tests CPU `f32::exp` vs CUDA `ex2.approx.f32 * LOG2_E` precision parity on the SwiGLU activation across 5 input distributions.

Per #1816's pivot ranking, SwiGLU was the highest-EV remaining candidate because:

- Both Q4_K and Q6_K MoE FFN paths reach SwiGLU at the same point (qtype-agnostic).
- CPU `expert_swiglu_quantized` uses `f32::exp` (libm, ~1 ulp).
- CUDA `FusedSwigluKernel` uses `ex2.approx.f32` with `LOG2_E` multiplier (PTX intrinsic, ~2 ulps).
- Algebraically equivalent sigmoid; different f32 precision behavior.
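To make the "algebraically equivalent sigmoid" point concrete, here is a minimal CPU-only sketch of the two forms. It is an illustration under assumptions, not the code under test: the function names are hypothetical, and `f32::exp2` stands in for the PTX `ex2.approx.f32` intrinsic, whose rounding behavior it does not reproduce.

```rust
use std::f32::consts::LOG2_E;

/// SiLU via the natural exponential: g / (1 + e^(-g)).
fn silu_exp(g: f32) -> f32 {
    g / (1.0 + (-g).exp())
}

/// SiLU via the base-2 exponential, using e^(-g) == 2^(-g * log2(e)).
/// ASSUMPTION: `f32::exp2` is a CPU stand-in, not the PTX `ex2.approx.f32`.
fn silu_exp2(g: f32) -> f32 {
    g / (1.0 + (-g * LOG2_E).exp2())
}

/// Elementwise SwiGLU: silu(gate) * up, with the SiLU form passed in.
fn swiglu(gate: f32, up: f32, silu: fn(f32) -> f32) -> f32 {
    silu(gate) * up
}

/// Largest relative difference between the two forms on an evenly spaced
/// sweep of `n` gate values in [lo, hi] (hypothetical helper).
fn max_rel_between_forms(lo: f32, hi: f32, n: usize) -> f32 {
    assert!(n >= 2);
    let mut worst = 0.0f32;
    for i in 0..n {
        let g = lo + (hi - lo) * (i as f32) / ((n - 1) as f32);
        let a = swiglu(g, 1.5, silu_exp);
        let b = swiglu(g, 1.5, silu_exp2);
        // Guard the division when the reference value is exactly zero.
        let denom = a.abs().max(f32::MIN_POSITIVE);
        worst = worst.max((a - b).abs() / denom);
    }
    worst
}

fn main() {
    println!("max_rel = {:e}", max_rel_between_forms(-20.0, 20.0, 4001));
}
```

The sweep only shows that the two formulations agree closely on a CPU; the actual CPU-vs-CUDA comparison requires the ignored integration test on a GPU host.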

Empirical result (lambda-vector RTX 4090)

```
distribution   lo       hi       max_abs    max_rel    cpu_l2
--------------------------------------------------------------
uniform        -1.00    1.00     5.960e-8   2.369e-7   11.300
moderate       -5.00    5.00     1.907e-6   4.303e-7   362.262
extreme_neg    -20.00   -10.00   4.657e-9   9.970e-7   0.107
extreme_pos    10.00    20.00    0.000e0    0.000e0    14930.386
mixed          -20.00   20.00    7.629e-6   9.803e-7   5998.385
```

Hypothesis FALSIFIED. rel_diff (the `max_rel` column) stays at ulp-scale (≤ 1e-6) across all distributions, including the extreme [-20, 20] range. The intrinsic precision differential is NOT visible at the activation level.

Notable: `extreme_pos` gives rel_diff = 0.0 exactly — both intrinsics converge to `silu(g) ≈ g` since `exp(-g) ≈ 0` in both, and the multiply is bit-identical.
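The exact-zero result is consistent with f32 absorption: once `exp(-g)` drops below half an ulp of 1.0, the sum `1 + exp(-g)` rounds to exactly 1.0, so small differences between exponential implementations cannot reach the output. A minimal sketch of that effect at the top of the range (the `silu` helper is hypothetical, not the repository's function):

```rust
/// SiLU in f32: g / (1 + e^(-g)).
fn silu(g: f32) -> f32 {
    g / (1.0 + (-g).exp())
}

fn main() {
    let g = 20.0f32;
    let e = (-g).exp(); // roughly 2e-9
    let half_ulp_of_one = f32::EPSILON / 2.0; // roughly 6e-8

    // e is far below half an ulp of 1.0, so the addition absorbs it.
    assert!(e < half_ulp_of_one);
    assert_eq!(1.0f32 + e, 1.0f32);

    // With the denominator exactly 1.0, silu(g) is exactly g.
    assert_eq!(silu(g), g);

    // Lower in the range the sum is no longer exactly 1.0, so silu(g) < g.
    assert!(silu(10.0) < 10.0);
    println!("silu(20) = {}, silu(10) = {}", silu(20.0), silu(10.0));
}
```

At the lower end of [10, 20] the denominator is slightly above 1.0, but a few-ulp difference in `exp(-g)` is still orders of magnitude smaller than the rounding granularity of the sum, which is a plausible reason both paths agree bit-for-bit there as well.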

Cumulative cascade status — 6 hypotheses ruled out

✅ Per-matvec Q6_K reduction-order on synthetic (test(m-gpu-moe-3): FALSIFY-Q6K-FP-ACC-001 — per-matvec divergence is ulp-scale, NOT the 0.94-cos source (#1583 PR-3f) #1801)
✅ Activation distribution amplification (test(m-gpu-moe-3): FALSIFY-Q6K-AMP-002 — activation amplification ALSO falsified, only chain-length remains (#1583 PR-3g) #1805)
✅ Accumulator-chain length compounding (test(m-gpu-moe-3): FALSIFY-Q6K-CHAIN-003 — chain length ALSO falsified; all 3 hypotheses eliminated (#1583 PR-3h) #1811)
✅ Per-matvec Q6_K on real Qwen3 weights (test(m-gpu-moe-3): FALSIFY-Q6K-REAL-WEIGHT-004 — real Q6_K matches synthetic + qtype-mix structural finding (#1583 PR-3i) #1816)
✅ Q6_K-specific root cause — structural qtype-mix (test(m-gpu-moe-3): FALSIFY-Q6K-REAL-WEIGHT-004 — real Q6_K matches synthetic + qtype-mix structural finding (#1583 PR-3i) #1816)
✅ SwiGLU activation parity (this PR)

Remaining candidates after this PR

1. 🎯 Q4_K real-weight matvec parity (next HIGHEST EV): not yet directly tested CPU vs CUDA on real Qwen3 Q4_K weights. The Q4_K kernel path is structurally different from Q6_K (different super-block layout, different reduction shape).
2. Compositional FFN-block: full gate × up × SwiGLU × down × weighted-sum on real weights + real activations. The 5-op chain on real distributions may exhibit divergence that no single primitive shows.
3. Top-K weighted-sum accumulation order: bit-identical by code inspection but worth empirical verification.
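As background for the accumulation-order candidate, f32 addition is not associative, so the same weighted terms summed in a different order can give a different result. A minimal sketch with deliberately extreme values (the `weighted_sum` helper and the inputs are hypothetical, chosen only to make the effect visible):

```rust
/// Accumulate weights[i] * values[i] in the order given by `order`.
fn weighted_sum(weights: &[f32], values: &[f32], order: &[usize]) -> f32 {
    order
        .iter()
        .fold(0.0f32, |acc, &i| acc + weights[i] * values[i])
}

fn main() {
    let weights = [1.0f32; 4];
    let values = [1.0e8f32, 1.0, -1.0e8, 1.0];

    // Order A adds 1.0 to 1e8 first; the 1.0 is lost to rounding.
    let a = weighted_sum(&weights, &values, &[0, 1, 2, 3]);
    // Order B cancels the large terms first, so both 1.0 terms survive.
    let b = weighted_sum(&weights, &values, &[0, 2, 1, 3]);

    assert_eq!(a, 1.0);
    assert_eq!(b, 2.0);
    println!("order A = {a}, order B = {b}");
}
```

Real expert outputs are unlikely to differ in scale this sharply, so any order effect would be far smaller; the sketch only shows why an empirical check can be worthwhile even when the code looks order-identical.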

What this PR ships

`crates/aprender-serve/tests/falsify_swiglu_cpu_cuda_005.rs` (~310 LOC)
- 1 `#[ignore]` integration test (5-distribution sweep, ~0.12s on RTX 4090)
- 5 unit tests on helpers (synthetic_range, cpu_swiglu identity behavior at extreme/zero gate, max_rel_diff)

Test plan

`cargo check -p aprender-serve --features cuda --test falsify_swiglu_cpu_cuda_005` clean
`cargo test --release --features cuda -p aprender-serve --test falsify_swiglu_cpu_cuda_005 -- --ignored --nocapture` PASS on lambda-vector with empirical sweep emitted

Cross-refs

Issue: M-GPU-MOE-3 — throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment #1583 (M-GPU-MOE-3)
Predecessors: test(m-gpu-moe-3): FALSIFY-Q6K-FP-ACC-001 — per-matvec divergence is ulp-scale, NOT the 0.94-cos source (#1583 PR-3f) #1801, test(m-gpu-moe-3): FALSIFY-Q6K-AMP-002 — activation amplification ALSO falsified, only chain-length remains (#1583 PR-3g) #1805, test(m-gpu-moe-3): FALSIFY-Q6K-CHAIN-003 — chain length ALSO falsified; all 3 hypotheses eliminated (#1583 PR-3h) #1811, test(m-gpu-moe-3): FALSIFY-Q6K-REAL-WEIGHT-004 — real Q6_K matches synthetic + qtype-mix structural finding (#1583 PR-3i) #1816
CPU side: `crates/aprender-serve/src/gguf/qwen3_moe_load.rs::expert_swiglu_quantized`
CUDA side: `crates/aprender-serve/src/cuda/executor/kernel.rs::CudaExecutor::fused_swiglu_host`
PTX kernel: `crates/aprender-gpu/src/kernels/elementwise/swiglu.rs::FusedSwigluKernel`

🤖 Generated with Claude Code

…cision is ulp-scale, NOT the amplifier (#1583 PR-3j)

Fifth falsifier in M-GPU-MOE-3 cascade. Tests CPU `f32::exp` vs CUDA `ex2.approx.f32 * LOG2_E` parity on the SwiGLU activation across 5 input distributions (uniform/moderate/extreme_neg/extreme_pos/mixed).

## Empirical result (lambda-vector RTX 4090)

```
distribution   lo       hi       max_abs    max_rel    cpu_l2
------------------------------------------------------------------
uniform        -1.00    1.00     5.960e-8   2.369e-7   11.300
moderate       -5.00    5.00     1.907e-6   4.303e-7   362.262
extreme_neg    -20.00   -10.00   4.657e-9   9.970e-7   0.107
extreme_pos    10.00    20.00    0.000e0    0.000e0    14930.386
mixed          -20.00   20.00    7.629e-6   9.803e-7   5998.385
```

**Hypothesis FALSIFIED.** rel_diff stays at ulp-scale (≤ 1e-6) across all distributions, including the most extreme [-20, 20] range. The `ex2.approx.f32` vs `f32::exp` precision differential is NOT visible at the SwiGLU activation level.

## Cumulative cascade status: 6 hypotheses ruled out

1. Per-matvec Q6_K reduction-order on synthetic (#1801)
2. Activation distribution amplification (#1805)
3. Accumulator-chain length compounding (#1811)
4. Per-matvec Q6_K on real Qwen3 weights (#1816)
5. Q6_K-specific root cause: structural qtype-mix (#1816)
6. SwiGLU activation parity (this PR)

## Remaining candidates

1. Q4_K real-weight matvec parity (highest EV next)
2. Compositional FFN-block chain on real Qwen3 weights
3. Top-K weighted-sum accumulation order

## What this PR ships

- `tests/falsify_swiglu_cpu_cuda_005.rs`: 1 #[ignore] integration test (5-distribution sweep, ~0.12s on RTX 4090) + 5 unit tests on helpers (synthetic_range, cpu_swiglu identity behavior, max_rel_diff).

Per `feedback_falsifier_cascade_decomposes_magnitude.md`: 1 PR ≈ 1 falsifier.

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Predecessors: #1801, #1805, #1811, #1816
- CPU side: `expert_swiglu_quantized` in qwen3_moe_load.rs
- CUDA side: `CudaExecutor::fused_swiglu_host` (uses FusedSwigluKernel PTX with ex2.approx.f32)

Co-Authored-By: Claude Opus 4.7 <[email protected]>

…CUDA Q4_K matvec 237,775× divergence vs CPU (#1583 PR-3k) (#1821)

Sixth falsifier in M-GPU-MOE-3 cascade. The structural finding in #1816 (3 of 7 problem layers use Q4_K for ffn_down_exps) suggested the Q4_K kernel as the highest-EV remaining candidate. This PR empirically confirms it.

## EMPIRICAL RESULT: DISCHARGE-CLASS FINDING

lambda-vector RTX 4090, real Qwen3-Coder-30B-A3B Q4_K bytes:

    source tensor: blk.0.attn_k.weight (16 rows × 512 cols, 4608 bytes)
    cos          = 0.999994
    max_rel_diff = 5.469e-2   ← 5.47 PERCENT per-element error
    cpu_l2       = 0.754
    gpu_l2       = 0.755

**237,775× amplification** over #1816's Q6_K real-weight baseline (2.281e-7). Several orders of magnitude beyond anything in #1801/#1805/#1811/#1816/#1818.

## Why this explains the 0.94-cos drop in #1583

- 3 of 7 problem layers (L7/L9/L12) use Q4_K ffn_down_exps directly
- All 7 problem layers use Q4_K for ffn_gate_exps + ffn_up_exps
- Per-matvec ~5% error compounds across 128 experts × MoE FFN block
- Naturally produces the 0.94-cos cumulative drop on real-model forward

## CASCADE DISCHARGE

The M-GPU-MOE-3 cascade has empirically pinned root cause to: CudaExecutor::q4k_matvec vs CPU fused_q4k_parallel_matvec on real Qwen3 Q4_K bytes. Q6_K was a red herring: the original #1583 framing led the cascade through 5 dead-end hypotheses before #1816's structural finding redirected to Q4_K.

## Fix scope (multi-week, references in #1583 as PR-3h+)

1. Bisect WHICH part of the CUDA Q4_K path produces the 5% delta: dequant (Q4_K → f32), reduction (warp-shuffle), or both
2. Align CUDA Q4_K kernel reduction order to match CPU fused_q4k_parallel_matvec rayon midi-tile reduction
3. Re-run qwen3_moe_per_layer_gpu_parity.rs and verify all 48 layers move from ~85% to 100% cos≥0.99
4. Flip qwen3-moe-forward-gpu-v1 v1.7.0 → v1.8.0 ACTIVE_RUNTIME

## What this PR ships

- tests/falsify_q4k_real_weight_006.rs: direct sibling of #1816's Q6_K test but for Q4_K. 1 #[ignore] integration test (2.2s on RTX 4090) + 5 unit tests.

The 6-PR cascade structure:

- #1801: Q6_K synthetic, ulp-scale
- #1805: activation distribution, flat
- #1811: chain length, flat
- #1816: Q6_K real, ulp-scale + qtype-mix structural pivot
- #1818: SwiGLU intrinsic, ulp-scale
- #1816+#1818 + this PR = ROOT CAUSE PINNED to Q4_K kernel

Per feedback_falsifier_cascade_decomposes_magnitude.md: 1 PR ≈ 1 falsifier. Per feedback_predict_then_verify_closes_cascade.md: this PR's measurement closes the M-GPU-MOE-3 root-cause search; fix-PR is separate scope.

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Predecessors: #1801, #1805, #1811, #1816, #1818
- Real-model sibling: tests/qwen3_moe_per_layer_gpu_parity.rs (FALSIFY-QW3-MOE-PER-LAYER-001)

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>
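The #1821 measurement pairs a cosine of 0.999994 with a max relative difference of about 5%. The two metrics answer different questions: cosine is dominated by the large elements, while max relative difference is set by the single worst element. A minimal sketch with made-up vectors (both helpers are hypothetical and are not the test file's own `max_rel_diff`):

```rust
/// Cosine similarity, accumulated in f64 to keep the metric itself stable.
fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// Largest per-element |r - o| / |r|, guarded against a zero reference.
fn max_rel(reference: &[f32], other: &[f32]) -> f32 {
    reference
        .iter()
        .zip(other)
        .map(|(&r, &o)| (r - o).abs() / r.abs().max(f32::MIN_POSITIVE))
        .fold(0.0f32, f32::max)
}

fn main() {
    // One small element is off by 5%; the large elements match exactly.
    let reference = [1.0f32, 1.0, 1.0, 0.01];
    let other = [1.0f32, 1.0, 1.0, 0.0105];
    println!(
        "cos = {:.9}, max_rel = {:e}",
        cosine(&reference, &other),
        max_rel(&reference, &other)
    );
}
```

A high cosine next to a percent-level max relative difference is therefore not contradictory; it suggests the error is concentrated in small-magnitude outputs.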

…es Q8K activation quant, CUDA uses f32: different algorithms (#1583 PR-3l DISCHARGE) (#1822)

Seventh falsifier in M-GPU-MOE-3 cascade. Bisects #1821's 5% Q4_K divergence by comparing THREE paths on identical Q4_K bytes:

    A = CPU fused_q4k_parallel_matvec (production-MoE path)
    B = CPU dequantize_q4_k_to_f32 + naive_f32_matvec (isolates dequant)
    C = CUDA q4k_matvec (suspected broken in #1821)

## EMPIRICAL RESULT: INVERTS #1821

    pair               rel_diff    1-cos
    A vs B (CPU)       2.883e-2    7.093e-6    ← CPU fused ≠ CPU dequant
    A vs C (CPU-GPU)   2.883e-2    7.033e-6    ← CPU fused ≠ CUDA
    B vs C (deq-GPU)   5.028e-7    -1.192e-7   ← CPU dequant ≈ CUDA ✅

**CUDA q4k_matvec is CORRECT.** Path B (manual CPU dequant + naive f32 dot) matches path C (CUDA) to ulp-scale 5e-7.

**CPU fused_q4k_parallel_matvec is the divergent path.** It disagrees with BOTH the CPU naive-dequant reference AND CUDA by the SAME 2.88% delta.

## True root cause: CPU pre-quantizes activations

parallel_k.rs:181-182 docstring confirms: 'Pre-quantizes f32 activations to Q8_K once per matmul, enabling integer-only inner loops (maddubs) for ~4-8x speedup'

So:

    CPU  fused_q4k_parallel_matvec = Q4_K(W) × Q8_K(quantize(f32_act))
    CUDA q4k_matvec                = Q4_K(W) × f32_act (no quant)

**They compute DIFFERENT MATHEMATICAL OPERATIONS.** The 2.88% per-matvec delta is the lossy Q8_K activation quantization.

## What this means for M-GPU-MOE-3 (#1583)

The 0.94-cos drop on real Qwen3 L7/L9/L12/L20/L23/L29/L46 is **NOT a kernel correctness bug**. It is the natural compositional consequence of CPU using Q8K activation quant while CUDA uses f32 activations. 2-3% per-matvec compounds across 128 experts × 48 layers to produce the observed ~6% cumulative drop.

#1583's original framing (kernel-level reduction-order alignment in Q6_K) was triple-wrong: not reduction-order, not Q6_K, not kernel-correctness. The actual issue is an **activation-qtype algorithm mismatch** between CPU and CUDA Q4_K paths.

## Fix paths (multi-week, M-GPU-MOE-3 fix scope)

OPTION 1: CPU uses f32 activations (match CUDA)
- Add fused_q4k_f32_parallel_matvec (no Q8K step)
- Slows CPU (loses maddubs 4-8× speedup)

OPTION 2: CUDA uses Q8_K activations (match CPU), RECOMMENDED
- Add Q8_K activation quant before q4k_matvec
- Could be FASTER on GPU via DP4A integer ops on Ampere+
- Modest CUDA kernel scope

OPTION 3: Accept divergence, relax contract cos threshold
- Update qwen3-moe-forward-gpu-v1 to cos≥0.93
- Cheapest

## Full cascade discharge

Seven falsifiers, six wrong hypotheses, one true root cause:

- #1801 Q6_K synthetic reduction-order → ulp-scale
- #1805 activation distribution → flat
- #1811 chain length compounding → flat
- #1816 Q6_K real weights + qtype-mix → ulp-scale + L7/9/12 are Q4_K
- #1818 SwiGLU intrinsic precision → ulp-scale
- #1821 Q4_K real weights (CPU as truth) → 5%, misattributed to CUDA
- THIS Q4_K bisection → CPU has Q8K quant step

## What this PR ships

- tests/falsify_q4k_bisect_dequant_007.rs: three-path bisection (~330 LOC), 1 #[ignore] integration test (2.2s on RTX 4090) + 5 unit tests.

Per feedback_test_methodology_can_fake_bugs.md: this PR is the textbook case of why bisection beats single-comparison parity tests. #1821 used the CPU as ground truth without verifying the CPU was implementing the same operation as CUDA.

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>
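The root cause named in #1822 can be illustrated with a toy comparison of a dot product on f32 activations against the same dot product after the activations pass through an int8 quantizer. This is a simplified sketch under assumptions: it uses one symmetric scale per block and keeps the weights in f32, which is not the real Q8_K or Q4_K layout, and all names are hypothetical.

```rust
/// Symmetric int8 quantization of one block: scale = max|x| / 127,
/// q = round(x / scale). ASSUMPTION: simplified, not the real Q8_K format.
fn quantize_block(x: &[f32]) -> (f32, Vec<i8>) {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if amax == 0.0 {
        return (0.0, vec![0; x.len()]);
    }
    let scale = amax / 127.0;
    let q: Vec<i8> = x
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (scale, q)
}

/// Reference path: weights times unquantized f32 activations.
fn dot_f32(w: &[f32], x: &[f32]) -> f32 {
    w.iter().zip(x).map(|(a, b)| a * b).sum()
}

/// Quantized-activation path: activations are rounded to int8 first.
fn dot_quantized_act(w: &[f32], x: &[f32]) -> f32 {
    let (scale, q) = quantize_block(x);
    w.iter().zip(&q).map(|(a, &b)| a * (b as f32) * scale).sum()
}

fn main() {
    // Deterministic made-up data in [-1, 1].
    let x: Vec<f32> = (0..64).map(|i| ((i * 37 % 101) as f32 - 50.0) / 50.0).collect();
    let w: Vec<f32> = (0..64).map(|i| ((i * 53 % 97) as f32 - 48.0) / 48.0).collect();
    let exact = dot_f32(&w, &x);
    let approx = dot_quantized_act(&w, &x);
    println!("f32 = {exact}, int8-act = {approx}, abs diff = {:e}", (exact - approx).abs());
}
```

Each activation moves by at most about half a quantization step, so the dot products differ by a bounded but generally nonzero amount. Two paths that disagree in this way implement different operations, which is why neither can serve as ground truth for the other.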

…de DISCHARGE amendment (#1583 spec advancement) (#1825)

Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause of the 0.94-cos drop on real Qwen3 layers L7/L9/L12/L20/L23/L29/L46.

## TRUE root cause

    CPU  fused_q4k_parallel_matvec = Q4_K(weights) × Q8_K(activations)
    CUDA q4k_matvec                = Q4_K(weights) × f32_activations

Different mathematical operations. The 2.88% per-matvec delta is lossy Q8_K activation quantization. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop.

## What this amendment REFUTES

- v1.7.2's 'per-expert SwiGLU f32 intermediates' attribution → refuted by #1818 (SwiGLU intrinsic precision is ulp-scale)
- v1.0.0..v1.7.1 'Q6_K fp-accumulator-order' framing → refuted by #1801 (synthetic ulp-scale) + #1816 (real Q6_K ulp-scale)
- Q6_K-specific root-cause hypothesis → refuted by #1816's structural finding (L7/L9/L12 are Q4_K, not Q6_K)

## Status change

v1.7.2: ACTIVE_ALGORITHM_LEVEL (with wrong SwiGLU attribution)
v1.8.0: ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE
- 47/48 layers cos≥0.99 stands
- root cause documented (activation-qtype algorithm mismatch)
- L47 cliff is the natural compositional consequence
- 0.94-cos on 7 problem layers is documented, not a bug

## Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)

OPTION 1: CPU uses f32 activations (slow CPU)
OPTION 2: CUDA uses Q8_K activations (RECOMMENDED: DP4A faster). PackedDp4aQ4KQ8Kernel already exists; just need CUDA f32→Q8_K activation quant kernel to feed it.
OPTION 3: Document divergence; relax cos threshold

## Validation

- python3 yaml.safe_load: PASS
- pv validate contracts/qwen3-moe-forward-gpu-v1.yaml: 0 errors, 0 warnings

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Cascade: #1801, #1805, #1811, #1816, #1818, #1821, #1822
- Sibling: tests/qwen3_moe_per_layer_gpu_parity.rs (FALSIFY-QW3-MOE-PER-LAYER-001), the real-model parity gate

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>

noahgift enabled auto-merge (squash) May 19, 2026 16:43

noahgift added 3 commits May 19, 2026 19:38

Merge branch 'main' into feat/m-gpu-moe-3-falsify-swiglu-cpu-cuda-005

ee43804

Merge branch 'main' into feat/m-gpu-moe-3-falsify-swiglu-cpu-cuda-005

22d20a2

Merge branch 'main' into feat/m-gpu-moe-3-falsify-swiglu-cpu-cuda-005

c7bf0ec

noahgift merged commit ecd3ebf into main May 19, 2026
10 checks passed

noahgift deleted the feat/m-gpu-moe-3-falsify-swiglu-cpu-cuda-005 branch May 19, 2026 19:07

noahgift mentioned this pull request May 19, 2026

contracts(qwen3-moe-forward-gpu): v1.7.2 → v1.8.0 — M-GPU-MOE-3 cascade DISCHARGE — true root cause is CPU/CUDA activation-qtype mismatch (#1583) #1825

Merged

2 tasks

noahgift mentioned this pull request May 20, 2026

M-GPU-MOE-3 Option 2: CUDA f32→Q8K activation quant kernel (close v1.8.0 discharge with parity fix) #1838

Open


test(m-gpu-moe-3): FALSIFY-SWIGLU-CPU-CUDA-005 — SwiGLU intrinsic precision is ulp-scale, NOT the amplifier (#1583 PR-3j)#1818

noahgift merged 4 commits into main from feat/m-gpu-moe-3-falsify-swiglu-cpu-cuda-005
