
noahgift · 2026-05-19T11:11:51Z

Summary

Third falsifier in the M-GPU-MOE-3 cascade. After #1801 ruled out simple per-matvec reduction-order (ulp-scale ~1e-7) and #1805 ruled out activation distribution amplification (flat across bursty inputs), this PR tests compositional accumulator-chain length — the only remaining hypothesis from #1801's pivot.

Empirical result (lambda-vector RTX 4090)

```
depth  rel_diff     1-cos       cpu_l2
1      6.224e-5     0.000e0     1.000
2      1.059e-4    -1.192e-7    1.000
4      1.011e-4    -1.192e-7    1.000
8      5.595e-5    -1.192e-7    1.000
16     9.533e-4    -1.192e-7    1.000
32     9.063e-5    -2.384e-7    1.000
48     9.862e-5     1.192e-7    1.000
```

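rel_diff stays FLAT from N=1 to N=48: it ranges from 5.595e-5 to 9.533e-4 with no monotonic growth in depth (N=16 is the single largest value, and N=32 falls back to 9.063e-5). 1-cos stays within ±2.4e-7 of zero, i.e. cosine is 1.0 at the f32 noise floor. At N=48 (matching the real model's 48 layers), rel_diff is 9.862e-5, essentially the same as N=1's 6.224e-5.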
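For readers unfamiliar with the two columns, here is a minimal sketch of how such metrics can be computed. The PR does not spell out its exact formulas, so the relative-L2 definition of `rel_diff` and both function names below are assumptions for illustration, not the helpers in the test file. One thing the sketch makes visible: `f32::EPSILON` is about 1.192e-7, so table entries such as -1.192e-7 and -2.384e-7 are one or two f32 rounding steps, which is why 1-cos can come out slightly negative.

```rust
/// Relative L2 difference ||reference - other|| / ||reference||, accumulated in f64.
/// Assumption: one common definition; the PR may use a different one.
fn rel_diff(reference: &[f32], other: &[f32]) -> f64 {
    assert_eq!(reference.len(), other.len());
    let (mut num, mut den) = (0.0f64, 0.0f64);
    for (&r, &o) in reference.iter().zip(other) {
        let (r, o) = (f64::from(r), f64::from(o));
        num += (r - o) * (r - o);
        den += r * r;
    }
    (num / den).sqrt()
}

/// 1 - cosine similarity, computed entirely in f32.
/// Because every step rounds to f32, the result for near-identical
/// vectors can land a few multiples of f32::EPSILON either side of zero.
fn one_minus_cos_f32(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    1.0 - dot / (na * nb)
}

fn main() {
    let cpu = [3.0f32, 4.0];
    let gpu = [3.0f32, 4.5];
    println!("rel_diff      = {:.3e}", rel_diff(&cpu, &gpu));
    println!("1-cos         = {:.3e}", one_minus_cos_f32(&cpu, &gpu));
    println!("f32::EPSILON  = {:.3e}", f32::EPSILON);
}
```

A metric like this separates magnitude error (`rel_diff`) from directional error (1-cos); in the sweep above only the former is measurably non-zero.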

Hypothesis #3 ALSO FALSIFIED

The synthetic 48-step Q6_K matvec chain (with L2-norm between steps, mirroring RMSNorm's scale control) does NOT reproduce the 0.94-cos drop observed on real Qwen3 layers L7/L9/L12/L20/L23/L29/L46. Chain length is NOT the amplifier.
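The shape of the experiment can be sketched without a GPU. The example below is a simplified stand-in, not the PR's test: it uses dense f32 weights instead of Q6_K blocks, and it models "CPU vs CUDA" as two summation orders on the same machine (`dot_seq` vs `dot_lanes`). All names, the LCG generator, and the 64-wide vectors are assumptions made for illustration.

```rust
/// Deterministic pseudo-random values in [-1, 1) from a small LCG (illustrative only).
fn synth(n: usize, seed: u64) -> Vec<f32> {
    let mut s = seed;
    (0..n)
        .map(|_| {
            s = s.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            ((s >> 40) as f32 / (1u64 << 24) as f32) * 2.0 - 1.0
        })
        .collect()
}

/// Dot product with strict left-to-right accumulation.
fn dot_seq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0f32, |acc, (x, y)| acc + x * y)
}

/// Dot product with 8 interleaved partial sums, a stand-in for a lane-wise reduction.
fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    let mut lanes = [0.0f32; 8];
    for (i, (x, y)) in a.iter().zip(b).enumerate() {
        lanes[i % 8] += x * y;
    }
    lanes.iter().sum()
}

/// Scale the vector to unit L2 norm, mirroring the scale control between steps.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

/// Run `depth` matvec + normalize steps using the given reduction order.
fn chain(depth: usize, n: usize, dot: fn(&[f32], &[f32]) -> f32) -> Vec<f32> {
    let mut x = synth(n, 1);
    l2_normalize(&mut x);
    for step in 0..depth {
        let w = synth(n * n, 100 + step as u64); // fresh n x n weights per step
        x = w.chunks(n).map(|row| dot(row, &x)).collect();
        l2_normalize(&mut x);
    }
    x
}

/// Relative L2 difference, accumulated in f64.
fn rel_diff(a: &[f32], b: &[f32]) -> f64 {
    let num: f64 = a.iter().zip(b).map(|(x, y)| (f64::from(*x) - f64::from(*y)).powi(2)).sum();
    let den: f64 = a.iter().map(|x| f64::from(*x).powi(2)).sum();
    (num / den).sqrt()
}

fn main() {
    for depth in [1, 2, 4, 8, 16, 32, 48] {
        let a = chain(depth, 64, dot_seq);
        let b = chain(depth, 64, dot_lanes);
        println!("{depth:>3}  {:.3e}", rel_diff(&a, &b));
    }
}
```

If chain length were the amplifier, the printed column would grow with depth; the falsifier's claim is that it does not.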

Cascade state after this PR

All three candidate amplifiers from #1801's pivot are now empirically eliminated:

| # | Hypothesis | Falsifier | Result |
|---|------------|-----------|--------|
| 1 | Per-matvec reduction-order | #1801 | ulp-scale ~1e-7 |
| 2 | Activation distribution non-uniformity | #1805 | flat across distributions |
| 3 | Accumulator-chain length | this PR | flat from N=1 to N=48 |

Where the cascade pivots next

Real-model divergence must come from sources NOT in synthetic q6k matvec chains. The candidates are:

1. Q4_K matmul parity: gate/up projections in MoE FFN are Q4_K, not Q6_K. Different kernel = potentially different reduction order.
2. SwiGLU activation parity: CPU vs CUDA use different sigmoid intrinsics (`exp(-g)` vs `ex2.approx.f32`). Algebraically equivalent but different f32-precision on extreme inputs.
3. Real Qwen3 weight pattern: synthetic random weights might not hit corner cases that real-model Q6_K weights do.
4. Top-K weighted sum: host-side f32 accumulation; bit-identical by inspection.

Recommended next cascade PR: candidate #3 (real-weight single-matvec). If real Qwen3 Q6_K weights also produce ulp-scale divergence, the bug is in Q4_K/SwiGLU/weighted-sum. If real weights show 1e-3+ divergence, synthetic-random was hiding it and #1801's premise needs revisiting.

What this PR ships

`crates/aprender-serve/tests/falsify_q6k_chain_length_003.rs` (~330 LOC)
- 1 `#[ignore]` integration test (chain sweep N ∈ {1,2,4,8,16,32,48}, 0.22s on RTX 4090)
- 6 unit tests on the chain helpers (L2 normalize, weight builder, rel_diff, cosine, deterministic synthetic vec)

Test plan

`cargo check -p aprender-serve --features cuda --test falsify_q6k_chain_length_003` clean
`cargo test --release --features cuda -p aprender-serve --test falsify_q6k_chain_length_003 -- --ignored --nocapture` PASS on lambda-vector with empirical sweep emitted

Cross-refs

Issue: #1583 (M-GPU-MOE-3: throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment)
Predecessors: #1801 (FALSIFY-Q6K-FP-ACC-001: per-matvec divergence is ulp-scale, NOT the 0.94-cos source), #1805 (FALSIFY-Q6K-AMP-002: activation amplification ALSO falsified, only chain-length remains)
Memory: `feedback_falsifier_cascade_decomposes_magnitude.md`, `feedback_falsifier_chain_assert_difference.md`

🤖 Generated with Claude Code

…d; all three #1801 hypotheses now eliminated (#1583 PR-3h)

Third falsifier in the M-GPU-MOE-3 cascade. After #1801 ruled out simple per-matvec reduction-order (ulp-scale ~1e-7) and #1805 ruled out activation distribution amplification (flat across bursty inputs), this PR tests the only remaining hypothesis from #1801's pivot: **compositional accumulator-chain length**.

## Empirical result (lambda-vector RTX 4090)

```
depth  rel_diff     1-cos       cpu_l2
--------------------------------------------------
1      6.224e-5     0.000e0     1.000
2      1.059e-4    -1.192e-7    1.000
4      1.011e-4    -1.192e-7    1.000
8      5.595e-5    -1.192e-7    1.000
16     9.533e-4    -1.192e-7    1.000
32     9.063e-5    -2.384e-7    1.000
48     9.862e-5     1.192e-7    1.000
```

**rel_diff stays flat from N=1 to N=48** with NO scaling. Cosine stays at 1.0 within f32 noise floor. At N=48 (matching real model's 48 layers), rel_diff is 9.862e-5, essentially identical to N=1's 6.224e-5.

## Hypothesis #3 ALSO falsified

The synthetic chain of 48 Q6_K matvecs with L2-norm between steps does NOT reproduce the 0.94-cos drop observed on real Qwen3 layers L7/L9/L12/L20/L23/L29/L46. Chain length is NOT the amplifier.

## What the cascade has NOT yet tested

The real-model divergence must come from sources NOT captured by synthetic q6k matvec chains. Remaining candidates worth a next-cascade-PR:

1. **Q4_K matmul parity**: gate/up projections in MoE FFN are Q4_K, not Q6_K. Different kernel = potentially different reduction order.
2. **SwiGLU activation parity**: CPU vs CUDA use different sigmoid intrinsics (`exp(-g)` vs `ex2.approx.f32`). Algebraically equivalent but different f32-precision behavior on extreme inputs.
3. **Real Qwen3 weight pattern**: synthetic random weights might not hit corner cases that real-model Q6_K weights do. Load actual L7 q6k bytes from cached GGUF and re-run #1801's single-matvec test.
4. **Top-K weighted sum**: host-side f32 accumulation; bit-identical by inspection but worth verifying empirically.

**Highest EV next falsifier**: candidate #3 (real-weight single-matvec). If real Qwen3 Q6_K weights ALSO produce ulp-scale per-matvec divergence, the bug must be in Q4_K/SwiGLU/weighted-sum. If real weights show 1e-3+ divergence, synthetic-random was hiding it and the cascade pivots.

## What this PR ships

- `tests/falsify_q6k_chain_length_003.rs`: 1 `#[ignore]` integration test (chain sweep N ∈ {1,2,4,8,16,32,48} in 0.22s on RTX 4090) + 6 unit tests on the chain helpers (L2 normalize, weight builder, rel_diff, cosine, deterministic synthetic vec).

Per `feedback_falsifier_cascade_decomposes_magnitude.md`: 1 PR ≈ 1 falsifier. Per `feedback_falsifier_chain_assert_difference.md`: this PR's assertions are sanity floors; the load-bearing artifact is the per-N rel_diff table.

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Predecessors: #1801 (single-matvec baseline), #1805 (activation sweep)
- Memory: `feedback_falsifier_cascade_decomposes_magnitude.md`, `feedback_falsifier_chain_assert_difference.md`

Co-Authored-By: Claude Opus 4.7 <[email protected]>

…cision is ulp-scale, NOT the amplifier (#1583 PR-3j) (#1818)

Fifth falsifier in M-GPU-MOE-3 cascade. Tests CPU `f32::exp` vs CUDA `ex2.approx.f32 * LOG2_E` parity on the SwiGLU activation across 5 input distributions (uniform/moderate/extreme_neg/extreme_pos/mixed).

## Empirical result (lambda-vector RTX 4090)

```
distribution      lo      hi    max_abs    max_rel      cpu_l2
------------------------------------------------------------------
uniform        -1.00    1.00   5.960e-8   2.369e-7      11.300
moderate       -5.00    5.00   1.907e-6   4.303e-7     362.262
extreme_neg   -20.00  -10.00   4.657e-9   9.970e-7       0.107
extreme_pos    10.00   20.00    0.000e0    0.000e0   14930.386
mixed         -20.00   20.00   7.629e-6   9.803e-7    5998.385
```

**Hypothesis FALSIFIED.** rel_diff stays at ulp-scale (≤ 1e-6) across all distributions, including the most extreme [-20, 20] range. The `ex2.approx.f32` vs `f32::exp` precision differential is NOT visible at the SwiGLU activation level.

## Cumulative cascade status: 6 hypotheses ruled out

1. Per-matvec Q6_K reduction-order on synthetic (#1801)
2. Activation distribution amplification (#1805)
3. Accumulator-chain length compounding (#1811)
4. Per-matvec Q6_K on real Qwen3 weights (#1816)
5. Q6_K-specific root cause, structural qtype-mix (#1816)
6. SwiGLU activation parity (this PR)

## Remaining candidates

1. Q4_K real-weight matvec parity (highest EV next)
2. Compositional FFN-block chain on real Qwen3 weights
3. Top-K weighted-sum accumulation order

## What this PR ships

- `tests/falsify_swiglu_cpu_cuda_005.rs`: 1 `#[ignore]` integration test (5-distribution sweep, ~0.12s on RTX 4090) + 5 unit tests on helpers (synthetic_range, cpu_swiglu identity behavior, max_rel_diff).

Per `feedback_falsifier_cascade_decomposes_magnitude.md`: 1 PR ≈ 1 falsifier.

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Predecessors: #1801, #1805, #1811, #1816
- CPU side: `expert_swiglu_quantized` in qwen3_moe_load.rs
- CUDA side: `CudaExecutor::fused_swiglu_host` (uses FusedSwigluKernel PTX with ex2.approx.f32)

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>
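The identity being tested here is exp(-g) = 2^(-g · log2 e). The sketch below compares the two formulations on the CPU only. It is an illustration under stated assumptions: Rust's `f32::exp2` is an accurate library routine, not the approximate `ex2.approx.f32` PTX instruction, so this shows only the rounding introduced by the extra multiply by `LOG2_E`, and the function names are hypothetical rather than the repository's `expert_swiglu_quantized` or `fused_swiglu_host`.

```rust
use std::f32::consts::LOG2_E;

/// SiLU gate via the natural exponential: g * sigmoid(g) = g / (1 + exp(-g)).
fn silu_exp(g: f32) -> f32 {
    g / (1.0 + (-g).exp())
}

/// Same gate via the base-2 exponential: exp(-g) == 2^(-g * log2(e)).
/// The product -g * LOG2_E is rounded to f32 before exponentiation.
fn silu_exp2(g: f32) -> f32 {
    g / (1.0 + (-g * LOG2_E).exp2())
}

/// SwiGLU: silu(gate) * up, element-wise, with a pluggable gate function.
fn swiglu(gate: &[f32], up: &[f32], silu: fn(f32) -> f32) -> Vec<f32> {
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}

/// Largest element-wise relative difference, guarded against division by zero.
fn max_rel_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x - y).abs() / x.abs().max(f32::MIN_POSITIVE))
        .fold(0.0, f32::max)
}

fn main() {
    // Sweep the gate over [-20, 20], the widest range in the table above.
    let gate: Vec<f32> = (0..=4000).map(|i| -20.0 + i as f32 * 0.01).collect();
    let up = vec![1.5f32; gate.len()];
    let a = swiglu(&gate, &up, silu_exp);
    let b = swiglu(&gate, &up, silu_exp2);
    println!("max_rel_diff over [-20, 20] = {:.3e}", max_rel_diff(&a, &b));
}
```

The largest relative differences are expected at strongly negative gates, where exp(-g) dominates the denominator and the rounded exponent argument matters most.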

…CUDA Q4_K matvec 237,775× divergence vs CPU (#1583 PR-3k) (#1821)

Sixth falsifier in M-GPU-MOE-3 cascade. The structural finding in #1816 (3 of 7 problem layers use Q4_K for ffn_down_exps) suggested Q4_K kernel as the highest-EV remaining candidate. This PR empirically confirms it.

## EMPIRICAL RESULT: DISCHARGE-CLASS FINDING

lambda-vector RTX 4090, real Qwen3-Coder-30B-A3B Q4_K bytes:

- source tensor: blk.0.attn_k.weight (16 rows × 512 cols, 4608 bytes)
- cos = 0.999994
- max_rel_diff = 5.469e-2 ← 5.47 PERCENT per-element error
- cpu_l2 = 0.754
- gpu_l2 = 0.755

**237,775× amplification** over #1816's Q6_K real-weight baseline (2.281e-7). Three orders of magnitude beyond anything in #1801/#1805/#1811/#1816/#1818.

## Why this explains the 0.94-cos drop in #1583

- 3 of 7 problem layers (L7/L9/L12) use Q4_K ffn_down_exps directly
- All 7 problem layers use Q4_K for ffn_gate_exps + ffn_up_exps
- Per-matvec ~5% error compounds across 128 experts × MoE FFN block
- Naturally produces the 0.94-cos cumulative drop on real-model forward

## CASCADE DISCHARGE

The M-GPU-MOE-3 cascade has empirically pinned root cause to: CudaExecutor::q4k_matvec vs CPU fused_q4k_parallel_matvec on real Qwen3 Q4_K bytes.

Q6_K was a red herring: the original #1583 framing led the cascade through 5 dead-end hypotheses before #1816's structural finding redirected to Q4_K.

## Fix scope (multi-week, references in #1583 as PR-3h+)

1. Bisect WHICH part of the CUDA Q4_K path produces the 5% delta: dequant (Q4_K → f32), reduction (warp-shuffle), or both
2. Align CUDA Q4_K kernel reduction order to match CPU fused_q4k_parallel_matvec rayon midi-tile reduction
3. Re-run qwen3_moe_per_layer_gpu_parity.rs and verify all 48 layers move from ~85% to 100% cos≥0.99
4. Flip qwen3-moe-forward-gpu-v1 v1.7.0 → v1.8.0 ACTIVE_RUNTIME

## What this PR ships

- tests/falsify_q4k_real_weight_006.rs: direct sibling of #1816's Q6_K test but for Q4_K. 1 #[ignore] integration test (2.2s on RTX 4090) + 5 unit tests.

The 6-PR cascade structure:

- #1801: Q6_K synthetic, ulp-scale
- #1805: activation distribution, flat
- #1811: chain length, flat
- #1816: Q6_K real, ulp-scale + qtype-mix structural pivot
- #1818: SwiGLU intrinsic, ulp-scale
- #1816+#1818 + this PR = ROOT CAUSE PINNED to Q4_K kernel

Per feedback_falsifier_cascade_decomposes_magnitude.md: 1 PR ≈ 1 falsifier. Per feedback_predict_then_verify_closes_cascade.md: this PR's measurement closes the M-GPU-MOE-3 root-cause search; fix-PR is separate scope.

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Predecessors: #1801, #1805, #1811, #1816, #1818
- Real-model sibling: tests/qwen3_moe_per_layer_gpu_parity.rs (FALSIFY-QW3-MOE-PER-LAYER-001)

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>

…es Q8K activation quant, CUDA uses f32 — different algorithms (#1583 PR-3l DISCHARGE) (#1822)

Seventh falsifier in M-GPU-MOE-3 cascade. Bisects #1821's 5% Q4_K divergence by comparing THREE paths on identical Q4_K bytes:

- A = CPU fused_q4k_parallel_matvec (production-MoE path)
- B = CPU dequantize_q4_k_to_f32 + naive_f32_matvec (isolates dequant)
- C = CUDA q4k_matvec (suspected broken in #1821)

## EMPIRICAL RESULT: INVERTS #1821

| pair | rel_diff | 1-cos | note |
|------|----------|-------|------|
| A vs B (CPU) | 2.883e-2 | 7.093e-6 | CPU fused ≠ CPU dequant |
| A vs C (CPU-GPU) | 2.883e-2 | 7.033e-6 | CPU fused ≠ CUDA |
| B vs C (deq-GPU) | 5.028e-7 | -1.192e-7 | CPU dequant ≈ CUDA ✅ |

**CUDA q4k_matvec is CORRECT.** Path B (manual CPU dequant + naive f32 dot) matches path C (CUDA) to ulp-scale 5e-7.

**CPU fused_q4k_parallel_matvec is the divergent path.** It disagrees with BOTH the CPU naive-dequant reference AND CUDA by the SAME 2.88% delta.

## True root cause: CPU pre-quantizes activations

parallel_k.rs:181-182 docstring confirms: 'Pre-quantizes f32 activations to Q8_K once per matmul, enabling integer-only inner loops (maddubs) for ~4-8x speedup'

So:

- CPU fused_q4k_parallel_matvec = Q4_K(W) × Q8_K(quantize(f32_act))
- CUDA q4k_matvec = Q4_K(W) × f32_act (no quant)

**They compute DIFFERENT MATHEMATICAL OPERATIONS.** The 2.88% per-matvec delta is the lossy Q8_K activation quantization.

## What this means for M-GPU-MOE-3 (#1583)

The 0.94-cos drop on real Qwen3 L7/L9/L12/L20/L23/L29/L46 is **NOT a kernel correctness bug**. It is the natural compositional consequence of CPU using Q8K activation quant while CUDA uses f32 activations. 2-3% per-matvec compounds across 128 experts × 48 layers to produce the observed ~6% cumulative drop.

#1583's original framing (kernel-level reduction-order alignment in Q6_K) was triple-wrong: not reduction-order, not Q6_K, not kernel-correctness. The actual issue is an **activation-qtype algorithm mismatch** between CPU and CUDA Q4_K paths.

## Fix paths (multi-week, M-GPU-MOE-3 fix scope)

- OPTION 1: CPU uses f32 activations (match CUDA)
  - Add fused_q4k_f32_parallel_matvec (no Q8K step)
  - Slows CPU (loses maddubs 4-8× speedup)
- OPTION 2: CUDA uses Q8_K activations (match CPU), RECOMMENDED
  - Add Q8_K activation quant before q4k_matvec
  - Could be FASTER on GPU via DP4A integer ops on Ampere+
  - Modest CUDA kernel scope
- OPTION 3: Accept divergence, relax contract cos threshold
  - Update qwen3-moe-forward-gpu-v1 to cos≥0.93
  - Cheapest

## Full cascade discharge

Seven falsifiers, six wrong hypotheses, one true root cause:

- #1801: Q6_K synthetic reduction-order → ulp-scale
- #1805: activation distribution → flat
- #1811: chain length compounding → flat
- #1816: Q6_K real weights + qtype-mix → ulp-scale + L7/9/12 are Q4_K
- #1818: SwiGLU intrinsic precision → ulp-scale
- #1821: Q4_K real weights (CPU as truth) → 5%, misattributed to CUDA
- THIS: Q4_K bisection → CPU has Q8K quant step

## What this PR ships

- tests/falsify_q4k_bisect_dequant_007.rs: three-path bisection (~330 LOC), 1 #[ignore] integration test (2.2s on RTX 4090) + 5 unit tests.

Per feedback_test_methodology_can_fake_bugs.md: this PR is the textbook case of why bisection beats single-comparison parity tests. #1821 used the CPU as ground truth without verifying the CPU was implementing the same operation as CUDA.

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>
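To see why quantizing activations changes the result far beyond ulp-scale, consider the following minimal sketch. It is not the real Q8_K format or the `fused_q4k_parallel_matvec` code path: weights stay in f32, activations are quantized with a simplified symmetric int8 scheme (one scale per 256-element block), and every name here is hypothetical. It contrasts an f32-activation dot product (the role of paths B and C above) with a quantized-activation dot product (the role of path A).

```rust
const BLOCK: usize = 256; // assumed block size for the illustrative quantizer

/// Deterministic pseudo-random values in [-1, 1) from a small LCG (illustrative only).
fn synth(n: usize, seed: u64) -> Vec<f32> {
    let mut s = seed;
    (0..n)
        .map(|_| {
            s = s.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            ((s >> 40) as f32 / (1u64 << 24) as f32) * 2.0 - 1.0
        })
        .collect()
}

/// Simplified symmetric int8 quantization: x ≈ scale * q with q in [-127, 127].
fn quantize_i8(block: &[f32]) -> (f32, Vec<i8>) {
    let amax = block.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    if amax == 0.0 {
        return (0.0, vec![0; block.len()]);
    }
    let scale = amax / 127.0;
    let q = block.iter().map(|x| (x / scale).round() as i8).collect();
    (scale, q)
}

/// Reference path: weights × f32 activations, no quantization.
fn dot_f32(w: &[f32], x: &[f32]) -> f32 {
    w.iter().zip(x).map(|(a, b)| a * b).sum()
}

/// Quantized path: activations are rounded to int8 per block before the dot product.
fn dot_q8_act(w: &[f32], x: &[f32]) -> f32 {
    w.chunks(BLOCK)
        .zip(x.chunks(BLOCK))
        .map(|(wb, xb)| {
            let (scale, q) = quantize_i8(xb);
            let acc: f32 = wb.iter().zip(&q).map(|(w, &qi)| w * f32::from(qi)).sum();
            scale * acc
        })
        .sum()
}

fn main() {
    let w = synth(512, 7);
    let noise = synth(512, 11);
    // Activations partly correlated with the weights so the dot product is not near zero.
    let x: Vec<f32> = w.iter().zip(&noise).map(|(a, b)| 0.5 * a + 0.5 * b).collect();
    let exact = dot_f32(&w, &x);
    let quant = dot_q8_act(&w, &x);
    println!("f32 act  = {exact:.6}");
    println!("int8 act = {quant:.6}");
    println!("rel diff = {:.3e}", ((exact - quant) / exact).abs());
}
```

Each activation can be off by up to half a quantization step, so the two paths compute different operations by construction; how large the resulting gap is depends on the data, and the 2.88% reported above is a property of the real tensors, not something this toy reproduces.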

…de DISCHARGE amendment (#1583 spec advancement) (#1825)

Discharge amendment for the full M-GPU-MOE-3 cascade. Seven falsifier PRs (#1801, #1805, #1811, #1816, #1818, #1821, #1822) empirically pinned the true root cause of the 0.94-cos drop on real Qwen3 layers L7/L9/L12/L20/L23/L29/L46.

## TRUE root cause

- CPU fused_q4k_parallel_matvec = Q4_K(weights) × Q8_K(activations)
- CUDA q4k_matvec = Q4_K(weights) × f32_activations

Different mathematical operations. The 2.88% per-matvec delta is lossy Q8_K activation quantization. Compounded across 128 experts × 48 layers, it produces the observed ~6% cumulative cos drop.

## What this amendment REFUTES

- v1.7.2's 'per-expert SwiGLU f32 intermediates' attribution → refuted by #1818 (SwiGLU intrinsic precision is ulp-scale)
- v1.0.0..v1.7.1 'Q6_K fp-accumulator-order' framing → refuted by #1801 (synthetic ulp-scale) + #1816 (real Q6_K ulp-scale)
- Q6_K-specific root-cause hypothesis → refuted by #1816's structural finding (L7/L9/L12 are Q4_K, not Q6_K)

## Status change

- v1.7.2: ACTIVE_ALGORITHM_LEVEL (with wrong SwiGLU attribution)
- v1.8.0: ACTIVE_ALGORITHM_LEVEL_WITH_DOCUMENTED_DIVERGENCE
  - 47/48 layers cos≥0.99 stands
  - root cause documented (activation-qtype algorithm mismatch)
  - L47 cliff is the natural compositional consequence
  - 0.94-cos on 7 problem layers is documented, not a bug

## Fix paths (OUT OF SCOPE for this PR; tracked as M-GPU-MOE-3 PR-4)

- OPTION 1: CPU uses f32 activations (slow CPU)
- OPTION 2: CUDA uses Q8_K activations (RECOMMENDED, DP4A faster). PackedDp4aQ4KQ8Kernel already exists; just need CUDA f32→Q8_K activation quant kernel to feed it.
- OPTION 3: Document divergence; relax cos threshold

## Validation

- python3 yaml.safe_load: PASS
- pv validate contracts/qwen3-moe-forward-gpu-v1.yaml: 0 errors, 0 warnings

## Cross-refs

- Issue: #1583 (M-GPU-MOE-3)
- Cascade: #1801, #1805, #1811, #1816, #1818, #1821, #1822
- Sibling: tests/qwen3_moe_per_layer_gpu_parity.rs (FALSIFY-QW3-MOE-PER-LAYER-001), the real-model parity gate

Co-authored-by: Noah Gift <[email protected]>
Co-authored-by: Claude Opus 4.7 <[email protected]>

noahgift enabled auto-merge (squash) May 19, 2026 11:11

noahgift merged commit c8c865f into main May 19, 2026
11 checks passed

noahgift deleted the feat/m-gpu-moe-3-falsify-chain-length-003 branch May 19, 2026 11:31

noahgift mentioned this pull request May 19, 2026

contracts(qwen3-moe-forward-gpu): v1.7.2 → v1.8.0 — M-GPU-MOE-3 cascade DISCHARGE — true root cause is CPU/CUDA activation-qtype mismatch (#1583) #1825

Merged

2 tasks

noahgift mentioned this pull request May 20, 2026

M-GPU-MOE-3 Option 2: CUDA f32→Q8K activation quant kernel (close v1.8.0 discharge with parity fix) #1838

Open


test(m-gpu-moe-3): FALSIFY-Q6K-CHAIN-003 — chain length ALSO falsified; all 3 hypotheses eliminated (#1583 PR-3h)#1811

noahgift merged 1 commit into main from feat/m-gpu-moe-3-falsify-chain-length-003
