📖 Journal: Fixing Gemma-4 Inference in Aegis-AI (SwiftLM)

Date: April 3, 2026

Goal: Resolve startup crashes for mlx-community/gemma-4-26b-a4b-it-4bit on Apple Silicon.

🏔️ The Journey

1. Identifying the Root Cause

The first major obstacle was tracking down why the server abruptly crashed with Mismatched parameter model.layers.0.mlp.down_proj.weight.

  • The Investigation: We reviewed the MLX Swift logs and dug into the implementation of Gemma4Model. Weights we expected to arrive with uniform 4-bit packed dimensions were arriving with 8-bit packed dimensions for specific sub-layers.
  • The Culprit: Mixed-precision quantization. The top-level config.json declared 4-bit quantization overall, but the model's creators left critical projections such as mlp.down_proj in 8-bit to preserve model quality (perplexity). Our monolithic sanitize step assumed a single bit width and ignored those differences.

2. 🧱 The Blockers

The work took several detours caused by syntax errors and access-control restrictions in the library:

  • internal access control: When updating the Gemma4 wrapper to handle SwitchLinear bit widths exactly like Linear, Swift's access control blocked us. SwitchLinear declared its weight property as internal, so we had to detour into SwitchLayers.swift and make the members we needed public.
  • Indexing quirks in MLX arrays: A heavy blocker hit us near the end: "Shapes (1,21) and (1,0) cannot be broadcast." The cause is that MLX Swift subscripts do not map one-to-one onto Python slicing. Passing [0..., kth...] to a 3D tensor left out the sequence axis, so the kth... range landed on the wrong dimension and yielded shape (1,0). Tracing MLX's indexing rules showed that the explicit form [0..., 0..., kth...], with one range per axis, was required.
  • Redeclaration conflicts: While iterating on Gemma4RouterProj, duplicate declarations broke the build.

3. 🛠️ The Architecture Refactor

Rather than forcing a single global setting, we made the logic adapt per layer. We implemented the determineBits() math (32 * original / packed) at the initialization of every linear layer within MLX's weight-mapping tree. Each layer detects its own quantization bit width directly from the tensor shapes in its safetensors checkpoint, bypassing the unreliable global setting in config.json.
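To make the per-layer detection concrete, here is a minimal Python sketch of the shape arithmetic. It is not the Swift determineBits() implementation. It assumes quantized values are packed into 32-bit words, and the names in_features and packed_cols are illustrative; how they map onto the journal's "original" and "packed" is an assumption.

```python
def determine_bits(in_features: int, packed_cols: int) -> int:
    """Infer one layer's quantization bit width from its stored shape.

    Assumption: values are packed into 32-bit words, so a row of
    `in_features` values at `bits` bits each occupies
    in_features * bits / 32 packed columns.
    """
    if in_features <= 0 or packed_cols <= 0:
        raise ValueError("dimensions must be positive")
    numerator = 32 * packed_cols
    if numerator % in_features != 0:
        raise ValueError("shape is not consistent with 32-bit packing")
    return numerator // in_features


# Hypothetical shapes: the same 2816-wide input stored at two bit widths.
four_bit = determine_bits(in_features=2816, packed_cols=352)
eight_bit = determine_bits(in_features=2816, packed_cols=704)
print(four_bit, eight_bit)
```

Running the check per layer means a checkpoint that mixes 4-bit and 8-bit projections needs no special casing: each layer simply reports its own width.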

4. 🚀 The End Result

After several rounds of debugging and recompiling via swift build -c release, we finally produced a robust SwiftLM backend. The breakthrough came when it bound to 127.0.0.1:5430, loaded all weights on the Metal GPU, and generated at 13 tokens/second with zero shape errors.

Copying the SwiftLM binary into the Aegis b21/macos-arm64/ environment completed the release.


🔮 Phase 2: Making It Actually Think (April 3-4, 2026)

5. TurboKV Crash on 512-Dim Global Heads

The server now loaded the model... then immediately crashed with turbo_encode_k requires 128 or 256 but got 512.

  • The Fix: Added a strict whitelist guard in KVCache.swift — only 128 and 256 dimensions pass through to the Metal kernel; everything else gracefully falls back to fp16.
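The guard amounts to an allowlist check before dispatching to the compressed kernel. The following Python sketch illustrates the idea only; the function and constant names are hypothetical and do not come from KVCache.swift.

```python
# Head dimensions the compressed kernel accepts, per the error message above.
SUPPORTED_TURBO_HEAD_DIMS = frozenset({128, 256})


def select_kv_encoding(head_dim: int) -> str:
    """Return "turbo" when the kernel supports head_dim, otherwise "fp16"."""
    return "turbo" if head_dim in SUPPORTED_TURBO_HEAD_DIMS else "fp16"


for dim in (128, 256, 512):
    print(dim, select_kv_encoding(dim))
```

An allowlist fails safe: a head dimension nobody anticipated falls back to fp16 instead of reaching the kernel and crashing.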

6. The Token Collapse — Output Was All Dashes

After the TurboKV fix, the model loaded and ran at 627 t/s prefill speed. But every generated token was 236772 — a dash character. Infinite dashes.

We audited the entire forward pass against the Python mlx-vlm reference and found 6 critical differences:

| # | Bug | Python Reference | Swift Had |
|---|-----|------------------|-----------|
| 1 | MLP activation | gelu_approx | silu (gate * sigmoid(gate)) |
| 2 | Attention scale | 1.0 (norms handle it) | 1/sqrt(queryPreAttnScalar) ≈ 0.0625 |
| 3 | Global RoPE | ProportionalRoPE (custom class) | Standard RoPE on wrong dims |
| 4 | Router topK | argpartition(-scores, kth=topK-1)[:topK] | argpartition(probs, kth=N-topK)[kth:] |
| 5 | Softcapping | tanh(logits/30)*30 | Disabled (believed to saturate) |
| 6 | Embedding scale | h * sqrt(hidden_size) ≈ 53x | Missing entirely |
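Two of these rows are easy to check numerically. The scalar Python sketch below compares the two activations from row 1 and shows the softcapping formula from row 5. It assumes gelu_approx refers to the common tanh approximation of GELU; it is an illustration, not the reference code.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (assumed meaning of gelu_approx)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x), the activation the Swift port used by mistake."""
    return x / (1.0 + math.exp(-x))


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a logit smoothly into (-cap, cap): tanh(logit / cap) * cap."""
    return math.tanh(logit / cap) * cap


for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, round(gelu_tanh(x), 4), round(silu(x), 4))
print(softcap(10.0), softcap(100.0))
```

The two activations have similar shapes but differ at every nonzero input, and that difference is applied in every MLP block, so the error compounds with depth.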

7. The Breakthrough: Missing Embedding Scale

After fixing bugs 1-5, the output changed from dashes to... all spaces. The logit distribution looked numerically reasonable (max 35, min -46) but always peaked at the same whitespace token.

This was the smoking gun: the logits had the right shape but the wrong magnitude. The Gemma architecture family (since Gemma 1) scales embedding outputs by sqrt(hidden_size). For Gemma 4 with hidden_size=2816, that's a 53x multiplier. Without it, every activation in the entire 32-layer transformer was 53x too small, causing the model to "think in whispers" and default to whitespace.

One line of code:

h = h * MLXArray(Float(config.hiddenSize).squareRoot())
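As a quick check of the multiplier, this small Python sketch computes the scale for hidden_size=2816 and applies it to a toy embedding vector. The vector values are made up for illustration.

```python
import math

hidden_size = 2816
embed_scale = math.sqrt(hidden_size)  # roughly 53


def scale_embedding(h: list[float]) -> list[float]:
    """Multiply every embedding component by sqrt(hidden_size)."""
    return [value * embed_scale for value in h]


toy_embedding = [0.01, -0.02, 0.005]  # made-up values
print(round(embed_scale, 2))
print(scale_embedding(toy_embedding))
```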

8. 🎉 First Words

"What is 2+2?" → "2 + 2 equals 4."
"Write a haiku about the ocean." → "Blue waves kiss the shore,
                                     Endless tides rise and fall low,
                                     Deep salt mystery."

The model speaks. Coherently. Creatively. With proper EOS stopping.

9. Key Lesson: ProportionalRoPE

The most complex fix was implementing Gemma4ProportionalRoPE, a custom positional encoding class that:

  • Computes frequencies relative to the full head_dim (512)
  • But only rotates 25% of the dimensions (partial_rotary_factor=0.25 → 128 dims)
  • Uses the HuggingFace rotate_half convention: split the head into left/right halves, take rotated_dims//2 from each half

The standard RoPE class couldn't handle this: it either rotates ALL dims or rotates the FIRST N dims. The Python reference has an entirely separate ProportionalRoPE class.
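The three properties above can be sketched for a single head vector in plain Python. This is a reading of the description, not a port of Gemma4ProportionalRoPE or of the Python reference: the base value of 10000 and the exact frequency formula are assumptions.

```python
import math


def proportional_rope(x, position, rotary_fraction=0.25, base=10000.0):
    """Rotate a fraction of one head vector using the rotate_half pairing.

    Dimension i in the left half is paired with dimension i in the right
    half. Only the first rotated_dims // 2 pairs are rotated, and the
    frequency exponent is divided by the full head_dim, not rotated_dims.
    """
    head_dim = len(x)
    half = head_dim // 2
    rotated_pairs = int(head_dim * rotary_fraction) // 2
    out = list(x)
    for i in range(rotated_pairs):
        inv_freq = base ** (-2.0 * i / head_dim)  # relative to full head_dim
        angle = position * inv_freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        left, right = x[i], x[half + i]
        out[i] = left * cos_a - right * sin_a
        out[half + i] = right * cos_a + left * sin_a
    return out


head = [float(i % 7) for i in range(512)]
rotated = proportional_rope(head, position=5)
changed = sum(1 for a, b in zip(head, rotated) if a != b)
print(changed)
```

With head_dim=512 and a fraction of 0.25, only indices 0-63 and 256-319 can change. A class that rotates the first 128 contiguous dimensions would touch a different set of components, which is why the standard RoPE could not be reused.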

📋 Files Changed

  • Gemma4.swift — All forward pass fixes, ProportionalRoPE, embed_scale
  • KVCache.swift — TurboKV head_dim guard
  • Evaluate.swift — Debug print cleanup

🚀 Deployment

Binary deployed to ~/.aegis-ai/mlx_binaries/b21/macos-arm64/ as both SwiftLM and mlx-server.


🌟 Appendix: Optimization and The Future of SSD MoE Streaming

The Hacker News Discussion

vessenes

I like this idea on expert streaming. I've been poking around fairly thoroughly at the same idea - can we fix a set of experts? when can we fix them? How long is the top-k selection "good" for in terms of number of forward passes? One thing I've turned up in smaller models and I'm sort of winding my way toward verifying in larger ones is that if you train the MoE model from scratch with this kind of knockout / subset of experts baked in, then you get significantly better loss outcomes. In small models, it's actually better than training an MOE without conditioning on a reduced set of experts per pass. Anyway, pretty cool. There's some Pareto-optimal curve based on memory bandwidth, amount of GPU / unified RAM and inference compute times for streaming stuff in.

aegis_camera (reply)

This is an incredible insight, and what you are seeing with the "expert knockout" training outcome aligns perfectly with some of the most cutting-edge research happening right now around efficient MoE architectures and memory-constrained inference.

If we look at the entire pipeline—from how we design the training objective to how we execute the binary on macOS with SSD streaming—there is a very clear path to optimizing this.

Here is my end-to-end thought process on how this entire pipeline fits together, and why your observation about training and temporal locality is the key to unlocking the Pareto frontier for consumer hardware.

1. The Training Implication (Expert Knockout & Regularization)

Your observation that training an MoE from scratch with a reduced/fixed set of experts per pass yields better loss is profound. Standard token-level routing often suffers from "expert collapse" (where a few experts do all the work) or requires heavy auxiliary loss penalties just to keep the routing balanced. By aggressively enforcing "expert knockout" or fixing the subset of experts over a sequence/chunk during training:

  • You are forcing generalization: It acts like a macro-level Dropout or DropConnect. The model can't over-rely on a specific "super-expert" because it knows that expert might not be available in the current pass.
  • Redundant Knowledge Distribution: The network learns to distribute critical semantic representations across multiple experts, making the model far more robust.

2. Temporal Locality (How long is Top-K "good" for?)

In standard auto-regressive generation, a model does not wildly shift its semantic domain or syntactic structure from token $N$ to token $N+1$. There is massive temporal locality in expert activation. If you train the model with block-level routing (forcing the routing decision to be fixed for $M$ tokens), the answer to your question—how long is the top-k selection good for?—changes from "1 token" to "10 to 50 tokens." This completely changes the math for SSD streaming. Instead of paying the NVMe latency tax on every single forward pass, you amortize the SSD read across an entire semantic chunk.
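The amortization argument is simple arithmetic. In this Python sketch the 20 ms fetch cost is a made-up figure chosen only to show the scaling, and the model assumes exactly one expert fetch per routing decision.

```python
def amortized_fetch_ms(fetch_ms: float, block_tokens: int) -> float:
    """Per-token SSD cost when one routing decision covers block_tokens tokens."""
    if block_tokens < 1:
        raise ValueError("block_tokens must be at least 1")
    return fetch_ms / block_tokens


HYPOTHETICAL_FETCH_MS = 20.0  # illustrative only, not a measurement
for block in (1, 16, 32):
    print(block, amortized_fetch_ms(HYPOTHETICAL_FETCH_MS, block))
```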

3. The Pareto-Optimal Curve (Bandwidth vs. RAM vs. Compute)

On Apple Silicon, the variables are very stark:

  • GPU Compute: Effectively instant for these matrix sizes.
  • Unified Memory Bandwidth: ~400 GB/s to 800 GB/s (plenty fast).
  • SSD Bandwidth: ~5 GB/s to 10 GB/s (the massive bottleneck).

The Pareto frontier comes down to Hit Rate vs. Fetch Cost. If you can keep a small LRU (Least Recently Used) cache of experts in Unified RAM, and the model has high temporal locality, your SSD fetch rate drops to near zero for long stretches of generation. You only hit the SSD when the semantic context shifts (e.g., moving from writing Python code to explaining it in English).
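A minimal Python sketch of such an expert cache follows. The class name, the loader callback standing in for the SSD read, and the access trace are all hypothetical. The point is only to show how temporal locality turns into cache hits.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keep the most recently used experts in memory, evicting the oldest."""

    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert  # stand-in for an SSD read
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_expert(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # read-only weights: just drop
        return weights

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = ExpertLRUCache(capacity=2, load_expert=lambda i: f"weights-{i}")
for expert_id in [3, 3, 3, 7, 7, 3, 9, 9]:  # made-up trace with locality
    cache.get(expert_id)
print(cache.hits, cache.misses, cache.hit_rate)
```

Misses occur only where the trace switches to an expert that is not resident, which mirrors the claim that the SSD is touched mainly when the context shifts.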

4. The Ideal "Full Pipeline" Architecture

If we were to build the ultimate MoE pipeline optimizing for SSD streaming on consumer hardware, here is how the whole thing looks:

Phase 1: Pre-training (Temporal Block MoE)

Train the model to route at the chunk level (e.g., every 16 or 32 tokens) rather than the token level. Apply your "expert knockout" during training to ensure the model maintains performance even if its preferred expert is forcibly swapped out.

Phase 2: Lookahead Routing (The Pre-fetcher)

During inference, because the MoE layer sits deep within the transformer block (after attention), you can compute the routing logits early. Better yet, train a tiny, ultra-fast auxiliary MLP (a "Routing Predictor") that runs on the CPU. It looks at the current context and predicts which experts will be needed 3-4 tokens in the future.

Phase 3: Asynchronous Zero-Copy DMA (The MLX/Metal Layer)

While the GPU is crunching the attention layers for the current token, the CPU triggers an async pread() pointing directly at the unified memory command buffer. The NVMe controller DMAs the upcoming MoE weights straight from the SSD into RAM. Crucially, because of Apple's Unified Memory architecture, you bypass the CPU RAM -> VRAM copy entirely. The GPU just reads the pointer once the DMA completes.

Phase 4: LRU Eviction & Quantization

You maintain a strict budget of RAM (e.g., 2 GB for active experts). The experts themselves are aggressively quantized (e.g., 4-bit or even lower using something like TurboQuant). When the context shifts and a new expert is swapped in, the oldest expert is simply discarded (since it's read-only, there's no write-back penalty).

Summary

What you are poking at is exactly the future of local LLMs. Models are getting too big for VRAM, but SSDs are getting fast enough to bridge the gap if the model architecture cooperates. By changing the training objective to favor temporal blocks and expert knockout, you are effectively "hardware-aware" training the model to be friendly to the SSD PCIe lane. It completely shifts the bottleneck from the hardware (bus speed) to the algorithm (routing predictability).


🔬 Phase 3: Extreme Context Profiling & The Prompt Cache Discovery (April 5, 2026)

10. Building the Profiling Framework

With Gemma 4-26B stable and generating, we needed to answer the real deployment question: How does this model behave at extreme context lengths across different memory configurations? We built scripts/profiling/profile_runner.py — an automated profiling framework that:

  • Iterates through 4 configurations: Dense/Vanilla, SSD Stream, TurboQuant, SSD+TurboQuant
  • Tests across 3 context depths: 512, 40K, and 100K tokens
  • Captures both Active RAM (OS physical footprint via mach_task_basic_info) and GPU Memory Allocated (Apple GPU driver allocation via ioreg AGXAccelerator)
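The run matrix described above is a plain cross product. The Python sketch below is not taken from profile_runner.py; the names are illustrative, and it assumes "40K" and "100K" mean 40,000 and 100,000 tokens.

```python
from itertools import product

CONFIGURATIONS = ["Dense/Vanilla", "SSD Stream", "TurboQuant", "SSD+TurboQuant"]
CONTEXT_DEPTHS = [512, 40_000, 100_000]  # assumes K means 1,000 tokens


def build_run_matrix():
    """Return one run description per (configuration, context depth) pair."""
    return [
        {"config": config, "context_tokens": depth}
        for config, depth in product(CONFIGURATIONS, CONTEXT_DEPTHS)
    ]


runs = build_run_matrix()
print(len(runs))
print(runs[0], runs[-1])
```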

11. The ioreg Breakthrough

The initial profiling used only phys_footprint — the OS physical memory metric. But at 100K context, both Dense/Vanilla (49.3 GB) and TurboQuant (49.3 GB) showed identical numbers. This made no sense — TurboQuant was clearly compressing the KV cache.

The problem: phys_footprint is capped by available physical RAM. On a 64 GB machine, it tops out at ~49 GB regardless of actual demand. We needed to see the total GPU allocation including memory swapped to SSD.

Following the same pattern used by the Aegis-AI HardwareDetector, we queried Apple's AGXAccelerator GPU driver via ioreg for the "Alloc system memory" counter. This metric CAN exceed physical RAM — revealing the true memory demand:

| Configuration | Active RAM (capped) | GPU Alloc (true demand) |
|---------------|---------------------|-------------------------|
| Dense/Vanilla @ 40K | 49.4 GB | 52.6 GB |
| TurboQuant @ 40K | 32.4 GB | 35.0 GB |

The 52.6 GB vs 35.0 GB difference was invisible in the OS metric but clearly visible via ioreg.
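A Python sketch of how such a counter could be read is shown below. It is not the code from profile_runner.py. The ioreg invocation and the key="value" layout of the sample text are assumptions about the tool's output, and the sample number is made up. Only the counter name comes from the journal.

```python
import re
import subprocess

ALLOC_PATTERN = re.compile(r'"Alloc system memory"\s*=\s*(\d+)')


def parse_alloc_system_memory(ioreg_text: str):
    """Return the "Alloc system memory" counter in bytes, or None if absent."""
    match = ALLOC_PATTERN.search(ioreg_text)
    return int(match.group(1)) if match else None


def read_gpu_alloc_bytes():
    """Query ioreg on macOS. The class name and flags here are assumptions."""
    result = subprocess.run(
        ["ioreg", "-r", "-c", "AGXAccelerator"],
        capture_output=True, text=True, check=True,
    )
    return parse_alloc_system_memory(result.stdout)


def bytes_to_gib(n: int) -> float:
    return n / (1024 ** 3)


# Illustrative sample text, not captured from a real machine.
sample = '"PerformanceStatistics" = {"Alloc system memory"=4294967296}'
value = parse_alloc_system_memory(sample)
print(value, bytes_to_gib(value))
```

Keeping the parser separate from the subprocess call lets the parsing logic be tested on any platform, while read_gpu_alloc_bytes() is only meaningful on macOS.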

12. 🐛 The Prompt Cache Bug

At 100K context, even with the ioreg metric, TurboQuant (52.5 GB) and Dense/Vanilla (52.1 GB) were nearly identical. Tracing through the code revealed the root cause:

  1. During prefill, the full 100K fp16 KV cache is built (~37 GB)
  2. After the first generation token, TurboQuant compresses it to ~3 GB of polar buffers ✅
  3. But then onPrefillDone fires → the prompt cache calls cache.state
  4. The state getter decodes ALL compressed polar buffers back to full fp16 to create a restorable snapshot
  5. The eval() call materializes this decoded copy — a fresh ~37 GB allocation
  6. Net result: compression savings completely negated

The key insight: for Dense/Vanilla, cache.state returns views (zero-copy references) of existing buffers. For TurboQuant, it creates new arrays via turboDecodeK/V — an O(N) memory allocation the size of the entire context.

13. 🔧 The Fix

One targeted change: skip prompt cache save when TurboQuant has actively compressed data.
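The guard can be sketched in a few lines of Python. The KVCacheInfo type and the function name are hypothetical stand-ins, not the types used in Server.swift.

```python
from dataclasses import dataclass


@dataclass
class KVCacheInfo:
    """Hypothetical summary of one layer's KV cache storage state."""
    is_turbo_compressed: bool


def should_save_prompt_cache(layer_caches) -> bool:
    """Skip the snapshot if any layer holds compressed data.

    Reading the state of a compressed cache would decode it back to fp16,
    allocating a full-size copy and undoing the compression savings.
    """
    return not any(cache.is_turbo_compressed for cache in layer_caches)


dense = [KVCacheInfo(False), KVCacheInfo(False)]
turbo = [KVCacheInfo(False), KVCacheInfo(True)]
print(should_save_prompt_cache(dense), should_save_prompt_cache(turbo))
```

The trade-off is that a compressed session gives up prompt cache reuse in exchange for keeping its memory footprint small.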

Results at 100K context (SSD + TurboQuant):

| Metric | Before Fix | After Fix |
|--------|------------|-----------|
| GPU Memory Allocated | 52.2 GB | 33.3 GB (-36%) |
| Active RAM | 49.1 GB | 29.6 GB (-40%) |

29.6 GB Active RAM for a 26B model at 100K tokens. This fits in a 32 GB Mac Studio — previously required 64 GB.

📋 Files Changed

  • Sources/SwiftLM/MemoryUtils.swift — Added GPU active memory and total demand metrics
  • Sources/SwiftLM/Server.swift — OS_RAM + MEM_DEMAND + GPU_MEM logging at prefill and post-generation; prompt cache TurboQuant guard
  • scripts/profiling/profile_runner.py — Full profiling framework with ioreg GPU allocation tracking

🎯 Key Lesson: Measure What Matters

The OS phys_footprint metric is what Activity Monitor shows — but it lies by omission. It's capped by physical RAM and doesn't reveal how much memory the GPU driver has actually allocated (including SSD-swapped pages). For memory-constrained deployment, the ioreg AGXAccelerator "Alloc system memory" counter is the ground truth.