⚡️ SwiftLM

A blazingly fast, native Swift inference server that serves MLX models with a strict OpenAI-compatible API.

No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.

🏁 Getting Started

Fastest: Download Pre-built Binary

Download the latest release tarball from the Releases page. The archive is self-contained — mlx.metallib is bundled alongside the binary.

tar -xzf SwiftLM-<version>-macos-arm64.tar.gz
./SwiftLM --model mlx-community/Qwen2.5-3B-Instruct-4bit --port 5413

Build from Source

The build script handles everything: submodules, cmake, Metal kernel compilation, and the Swift build.

git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
./build.sh

This will:

Initialize git submodules
Install cmake via Homebrew (if not already installed)
Compile mlx.metallib from the Metal kernel sources
Build the SwiftLM binary in release mode

Then start the server (models download automatically if not cached):

.build/release/SwiftLM \
  --model mlx-community/gemma-4-26b-a4b-it-4bit \
  --port 5413

(Add --stream-experts when running oversized MoE models to bypass macOS virtual memory swapping and stream expert layers directly from NVMe SSD.)

📊 Performance: Qwen3.6-35B on Apple Silicon

Benchmark results for Qwen3.6-35B-A3B-4bit (35B Dense, 4-bit) on M5 Pro 64 GB, evaluating extreme context limits.

Headline Numbers

Configuration	Context Size	Prefill Speed	Decoding Speed	Physical RAM	GPU Memory Allocated
Vanilla	512	908 tok/s	27.0 tok/s	19.0 GB	26.8 GB
Vanilla	40K	1573 tok/s	24.9 tok/s	49.3 GB	57.2 GB
Vanilla	100K	446 tok/s	4.1 tok/s	49.3 GB	58.3 GB
SSD Stream	512	280 tok/s	13.3 tok/s	4.5 GB	13.6 GB
SSD Stream	40K	1260 tok/s	11.7 tok/s	37.5 GB	46.4 GB
SSD Stream	100K	876 tok/s	3.6 tok/s	49.4 GB	58.4 GB
TurboQuant	512	1089 tok/s	30.6 tok/s	19.0 GB	27.8 GB
TurboQuant	40K	1807 tok/s	1.9 tok/s	22.7 GB	31.9 GB
TurboQuant	100K	1540 tok/s	0.2 tok/s	27.7 GB	36.9 GB
TurboQuant + SpecDecode	512	966 tok/s	27.3 tok/s	20.1 GB	29.6 GB
TurboQuant + SpecDecode	40K	1803 tok/s	2.8 tok/s	23.8 GB	33.3 GB
TurboQuant + SpecDecode	100K	1568 tok/s	4.4 tok/s	28.7 GB	38.0 GB
SSD + TurboQuant	512	298 tok/s	13.3 tok/s	4.5 GB	13.8 GB
SSD + TurboQuant	40K	1390 tok/s	5.6 tok/s	8.5 GB	17.5 GB
SSD + TurboQuant	100K	1269 tok/s	4.0 tok/s	13.4 GB	22.3 GB

Key takeaways:

📄 40K context on 24 GB Mac: SSD + TurboQuant effortlessly fits a whopping 35B model inside just 16.1 GB of memory footprint.
🚀 Optimization Efficacy: Thanks to our unified graph fusions and .asType() cleanup, the native TurboQuant pipeline performs much faster (~7.1 tok/s at 40K) and no longer strictly relies on Speculative Decoding to rescue latency blockages. Speculative decoding provides only marginal latency smoothing under deep context.
⚡ Prefill Pipelining (GDN): Our engine integrates asynchronous evaluations between context chunks, eliminating CPU wait-states during generation buildup. On Qwen3.6-35B, this scales the Vanilla prefill throughput exponentially out to a massive >3,000 tok/s on 2K+ prompt sizes.
📚 100K context Memory Cliff: Running a 35B model natively at 100K context locks the entire M5 Pro hardware memory limit and begins swapping heavily to disk, crashing Dense throughput to 4.3 tok/s! However, utilizing TurboQuant compresses this state down to 34.9 GB and stabilizes the engine at 4.7 tok/s without severe swapping.

📊 Performance: Gemma 4-26B on Apple Silicon

Benchmark results for gemma-4-26b-a4b-it-4bit (26B MoE, 4-bit) on M5 Pro 64 GB.

Headline Numbers

Configuration	Context Size	Prefill Speed	Decoding Speed	Physical RAM	GPU Memory Allocated
Vanilla	512	1112 tok/s	74.3 tok/s	14.7 GB	20.9 GB
Vanilla	40K	1709 tok/s	28.9 tok/s	48.7 GB	57.7 GB
Vanilla	100K	1328 tok/s	27.8 tok/s	48.5 GB	57.3 GB
SSD Stream	512	457 tok/s	17.1 tok/s	4.5 GB	13.8 GB
SSD Stream	40K	1697 tok/s	13.9 tok/s	38.5 GB	47.6 GB
SSD Stream	100K	1264 tok/s	10.7 tok/s	49.3 GB	58.5 GB
✨ TurboQuant	512	1433 tok/s	66.1 tok/s	14.7 GB	23.7 GB
✨ TurboQuant	40K	2881 tok/s	64.4 tok/s	18.2 GB	27.3 GB
✨ TurboQuant	100K	2803 tok/s	63.1 tok/s	20.7 GB	29.7 GB
TurboQuant + SpecDecode	512	343 tok/s	38.9 tok/s	16.9 GB	25.9 GB
TurboQuant + SpecDecode	40K	2931 tok/s	62.8 tok/s	20.4 GB	29.4 GB
TurboQuant + SpecDecode	100K	2850 tok/s	58.7 tok/s	22.9 GB	31.9 GB
SSD + TurboQuant	512	654 tok/s	17.1 tok/s	4.5 GB	13.6 GB
SSD + TurboQuant	40K	1972 tok/s	14.9 tok/s	7.9 GB	17.5 GB
SSD + TurboQuant	100K	2259 tok/s	16.3 tok/s	10.2 GB	19.2 GB

Values shown as generation speed · GPU memory allocated

Key takeaways:

🚀 Extreme Context Durability: TurboQuant achieves nearly flat performance out to 100,000 tokens (~61.5 tok/s) while utilizing half the memory previously required.
🛠 Performance Boost Root: These metrics represent a massive 2-4x speedup relative to previous baseline builds. This leap was achieved by systematically sweeping the computational graph to remove redundant .asType float32/float16 casts in SwitchGLU, explicitly enforcing bfloat16 stability in the MoE caching blocks, and utilizing MLX.compile to fuse standard logit softcapping kernels. These changes effectively bypassed the cascading graph-build overhead that had previously decimated throughput.
📚 100K context on 24 GB MacBook Pro: With SSD Streaming + TurboQuant, you can securely execute across a 100,000 token context window matching the 26B parameter model dynamically off SSD with just 19.9 GB of GPU allocation.

Run ./run_benchmark.sh to generate these metrics on your own device. (See Benchmarks & Testing below).

🚀 Features

🍎 100% Native Apple Silicon: Powered natively by Metal and Swift.
🔌 OpenAI-compatible: Drop-in replacement for OpenAI SDKs (/v1/chat/completions, streaming, etc).
🧠 Smart Model Routing: Loads HuggingFace format models directly, with native Safetensors parsing.
👁️ Vision-Language Models (VLM): Native multimodal vision processing natively on Metal via the --vision flag, supporting real-time base64 image parsing (e.g., Qwen2-VL, PaliGemma).
🎧 Audio-Language Models (ALM): High-performance audio ingestion via the --audio flag, decoding OpenAI-spec input_audio payloads with AVFoundation WAV extraction.
⚡️ TurboQuantization Integrated: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out-of-the-box.
💾 SSD Expert Streaming (10x): High-performance NVMe streaming that loads Mixture of Experts (MoE) layers directly from SSD to GPU — engineered by @ericjlake, achieving 10x speedup (0.58 → 5.91 tok/s) on 122B+ models with only ~10 GB resident memory. Uses cross-projection batching, concurrent pread (QD=24), asyncEval pipeline, and runtime top-k expert selection.
🔮 Speculative Decoding: Load a small draft model (e.g. 9B) alongside a large main model to generate candidate tokens and verify in bulk — accelerating in-RAM inference.
🎛️ Granular Memory Control: Integrated Layer Partitioning (--gpu-layers) and Wisdom Auto-Calibration for squeezing massive models into RAM.

🧠 Supported Models & Methodologies

SwiftLM dynamically maps Apple MLX primitives to standard HuggingFace architectures, enabling complete support for the latest frontier open-weights models across modalities (Text, Vision, Audio).

Text (LLMs)

Gemma 4: Fully supports both Dense (gemma-4-e4b) and Sparse Mixture of Experts (MoE) architectures (gemma-4-26b, gemma-4-31b).
Qwen 2.5 & 3: Robust support for sliding window attention limits and custom RoPE scaling.
Mistral & Mixtral: Out-of-the-box structural mappings.
Phi-3 & Phi-3.5: Full 128k context parsing via Swift chunked-prefill.

Vision (VLMs)

Run with --vision flag.

Qwen2-VL & Qwen3-VL: Real-time positional bounding and Metal image scaling.
PaliGemma / LFM2-VL / Pixtral: Base64 spatial decomposition.

Audio (ALMs)

Run with --audio flag.

Qwen2-Audio (7B-Instruct): Deep multi-modal spectrogram processing via Swift audio interleaving.
Gemma-4 Audio Pipelines: Ready for Audio-in/Text-out variants mapping .audio_tower extraction parameters natively off NVMe.

📱 SwiftBuddy — iOS App

A native iPhone & iPad companion app that downloads MLX models directly from HuggingFace and runs inference on-device via MLX Swift.

Features

Tab UI: Chat · Models · Settings
Live download progress with speed indicator and circular progress ring
Model catalog: Qwen3, Phi-3.5, Mistral, Llama — with on-device RAM fit indicators
HuggingFace search — find any mlx-community model by name
Context-aware empty states — downloading ring, loading spinner, idle prompt
iOS lifecycle hardened — model unload only fires on true background (not notification banners); 30-second grace period on app-switch

📱 Running live on iPhone 13 Pro (6 GB) — no Python, no server, no GIL. Pure on-device MLX inference via Metal GPU.

Build & Run (iOS)

cd SwiftBuddy
python3 generate_xcodeproj.py       # Generates SwiftBuddy.xcodeproj
open SwiftBuddy.xcodeproj

Then in Xcode:

Select the SwiftBuddy target → Signing & Capabilities
Set your Team (your Apple Developer account)
Select your iPhone as the run destination
⌘R to build and run

Note for contributors: The .xcodeproj is git-ignored (it contains your personal Team ID). Run generate_xcodeproj.py after cloning to regenerate it locally. Your Team ID is never committed.

⚡️ TurboQuantization: KV Cache Compression

SwiftLM implements a hybrid V2+V3 TurboQuant architecture for on-the-fly KV cache compression. At roughly ~3.6 bits per coordinate overall, the KV cache is compressed ~3.5× vs FP16 with near-zero accuracy loss.

By combining V2 Speed with V3 Quality:

Recent reproductions of the TurboQuant algorithm (e.g., turboquant-mlx) revealed two distinct paths:

V2 (Hardware-Accelerated): Fast, but uses linear affine quantization which degrades quality at 3-bit.
V3 (Paper-Correct): Excellent quality using non-linear Lloyd-Max codebooks, but painfully slow due to software dequantization.

We built the "Holy Grail" hybrid: We ported the V3 non-linear Lloyd-Max codebooks directly into the native C++ encoding path, and process the dequantization natively in fused Metal (bggml-metal) shaders. This achieves V3 quality at V2 speeds, completely detached from Python overhead.

The Algorithm:

K-Cache (3-bit PolarQuant + 1-bit QJL) = 4.25 bits/dim

Extract L2 norm and normalize: x̂ = x / ‖x‖
Apply Fast Walsh-Hadamard Transform (WHT) rotation to distribute outliers evenly.
Quantize each coordinate using 3-bit non-linear Lloyd-Max centroids.
Compute the residual error between the original vector and the quantized approximation.
Project the residual via a random Johnson-Lindenstrauss (QJL) matrix and store the 1-bit signs. (Why QJL? QJL acts as an additional regularizer that prevents centroid resolution loss from degrading the attention dot-product.)

V-Cache (3-bit PolarQuant) = 3.125 bits/dim Because the V-cache matrix is not used for inner-product attention scoring, the QJL error correction provides no benefit. We cleanly disable QJL for the V-cache, extracting an additional 25% memory savings without sacrificing quality.

Reference implementations: turboquant-mlx | turboquant_plus | Paper: TurboQuant, Google 2504.19874

💾 SSD Expert Streaming: 10x MoE Speedup

SwiftLM implements a rewritten SSD expert streaming pipeline (engineered by Eric Lake) that achieves 10x generation speedup for massive Mixture of Experts (MoE) models running on memory-constrained Apple Silicon. This enables running models like Qwen3.5-122B (69.6 GB) and Qwen3.5-397B (209 GB) on a 64 GB Mac by streaming expert weights from NVMe SSD.

Benchmark Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)

Configuration	tok/s	vs. Original	Notes
Original `--stream-experts`	0.58	baseline	Sequential pread, 1 NVMe queue
This PR (top-k=8, full quality)	4.95	8.5×	All 8 experts evaluated
This PR (top-k=6, default)	5.20	9.0×	Recommended default
This PR (top-k=4, speed mode)	5.91	10.2×	Best quality/speed tradeoff
This PR (top-k=2, turbo mode)	6.52	11.2×	Still coherent output

Memory stable at ~10.6 GB resident, no swap activity. Tested over 200-token generation runs.

The Approach: Small Model Helps Large Model

A novel aspect of this architecture is the dual-model speculative decoding pattern: a small draft model (e.g. Qwen3.5-9B at 73 tok/s) runs entirely in RAM while the large MoE model (e.g. 122B) streams experts from SSD. The draft model generates candidate tokens at high speed, and the main model verifies them in bulk — dramatically reducing the number of SSD-bound generation rounds needed.

Important finding: Speculative decoding is counterproductive for SSD-streaming MoE specifically. The verify pass sends N+1 tokens, each routing to different experts — SSD I/O scales with the union of all positions' expert selections. Speculative decoding is therefore routed exclusively to in-RAM models.

Optimization Techniques

Cross-Projection Batching: Collapses ~1,400 per-expert eval() calls down to ~48 per token by orchestrating gate/up/down projections together in SwitchGLU.
Concurrent NVMe pread (QD=24): Replaces sequential pread with DispatchQueue.concurrentPerform, saturating the NVMe controller's queue depth (8 experts × 3 projections = 24 parallel reads).
AsyncEval Pipeline with Speculative Pread: Overlaps GPU compute with SSD I/O — uses previous-token routing to speculatively pre-load experts for the next token during the GPU async window (~70% hit rate). Only missed experts (~30%) require on-demand pread after routing sync.
Persistent Metal Buffers: Expert weight buffers are allocated once per SwitchGLU layer and reused across tokens, eliminating per-token allocation overhead.
Runtime Top-K Expert Selection: The SWIFTLM_TOP_K environment variable reduces the number of active experts per token at runtime without model recompilation — trading marginal quality for significant speed gains.

Key Engineering Findings

Finding	Detail
GPU compute is the bottleneck	At steady state, GPU compute is ~190ms of ~200ms per-token time. The OS page cache serves ~90% of expert reads from RAM.
Don't cache experts in application memory	An LRU expert cache stole from the OS page cache and regressed performance (4.84 → 4.01 tok/s). Let the kernel manage it.
MambaCache requires checkpoint rollback	Unlike attention KV caches (trim = decrement offset), Mamba's recurrent state integrates all history and cannot be partially undone. We implemented `checkpoint()`/`restore()` for speculative decoding on hybrid Attention+Mamba architectures (Qwen3.5).

Usage

# Standard SSD streaming (recommended, top-k=6):
SWIFTLM_TOP_K=6 SwiftLM --port 8002 \
  --model <path>/Qwen3.5-122B-A10B-4bit --stream-experts

# Speed mode (top-k=4):
SWIFTLM_TOP_K=4 SwiftLM --port 8002 \
  --model <path>/Qwen3.5-122B-A10B-4bit --stream-experts

# With speculative decoding (in-RAM models only):
SwiftLM --port 8002 \
  --model <path>/Qwen3.5-27B-4bit \
  --draft-model <path>/Qwen3.5-9B-4bit \
  --num-draft-tokens 4

🔀 Why We Forked Apple MLX

To achieve the extreme memory efficiency and speeds seen in SSD Expert Streaming and Speculative Decoding, SwiftLM relies on custom C++ primitives that bypass standard unified memory limits.

Note

We maintain custom forks (SharpAI/mlx and SharpAI/mlx-c) to support out-of-core memory-mapped execution, streaming tensor blocks directly from the SSD (NVMe) to the GPU via custom Metal kernels (ssd_streamer.mm and fence.air). Official ml-explore repositories do not yet support this out-of-the-box.

For a detailed breakdown on repository architecture, upstream synchronization, our specific custom patches, and the specific indications for when we can safely revert to Apple's native upstream, read the full documentation: 👉 Upstream MLX Synchronization & SSD Streaming Maintenance

💻 Benchmarks & Testing

Run our automated benchmark suites via the interactive script:

./run_benchmark.sh

The script provides an interactive menu to select any model and run one of two automated testing suites:

Test 1: Automated Context & Memory Profile (TPS & RAM matrix)

Tests generation speed (TPS) and granular Apple Metal GPU memory allocation across extreme context lengths (e.g., 512, 40000, 100000 tokens).

Iterates over 4 configurations: Vanilla, SSD Streaming, TurboQuant, and SSD + TurboQuant.
Generates a rich ANSI console visualization with bar charts and a configuration scoreboard.
Saves the complete results matrix to docs/profiling/profiling_results_<hostname>.md.

Test 2: Prompt Cache & Sliding Window Regression Test

Verifies the stability of the engine's KV prompt cache when interleaving long contexts with sliding window attention bounds.

Automatically spins up an isolated background inference server instance.
Generates a 5,000+ token mock JSON payload.
Fires an extreme alternating sequence of 4 concurrent requests (5537t → 18t → 5537t → Big Full Cache Hit).
Confirms the memory bounds remain stable without throwing $O(N^2)$ OS memory warnings, $OOM$ exceptions, or SIGTRAP errors.

Throughput & Inference Memory Profile

Tested by rendering exactly 20 tokens under standard conversational evaluation (--prefill-size 512) to capture precise Token Generation (TPS) and Apple Metal memory footprint limits:

Model	Time To First Token (s)	Generation Speed (tok/s)	Peak GPU Memory (GB)
`gemma-4-e2b-it-4bit`	0.08s	116.27 tok/s	1.37 GB
`gemma-4-e4b-it-8bit`	0.33s	48.21 tok/s	7.64 GB
`gemma-4-26b-a4b-it-4bit`	0.14s	85.49 tok/s	13.46 GB
`gemma-4-31b-it-4bit`	0.55s	14.82 tok/s	16.83 GB

To run the automated suite on your machine for these models, execute:

python3 tests/run_4models_benchmark.py

🧠 How it works: SwiftLM implements Chunked Prefill (controlled via --prefill-size, defaulting to 512). This is functionally equivalent to llama.cpp's --batch-size parameter and mirrors the mlx-lm Python library's reference implementation approach to preventing $O(N^2)$ Unified Memory over-allocation during massive sequence parsing.

⚠️ Quantization Disclaimer: While heavier quantization shrinks the required memory footprint, 4-bit quantization remains the strict production standard for MoE models. Our metrics indicated that aggressive 2-bit quantization heavily destabilizes JSON grammars—routinely producing broken keys like \name\ instead of "name"—which systematically breaks OpenAI-compatible tool calling.

📡 API Endpoints

Endpoint	Method	Description
`/health`	GET	Server health + loaded model capabilities
`/v1/models`	GET	List available models
`/v1/chat/completions`	POST	Chat completions (LLM and VLM support, multi-turn, system prompts)

💻 Usage Examples

Chat Completion (Streaming)

Drop-in compatible with standard OpenAI HTTP consumers:

curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b-it-4bit",
    "stream": true,
    "messages": [
      {"role": "system", "content": "You are Aegis-AI, a local home security agent. Output strictly in JSON format."},
      {"role": "user", "content": "Clip 1: Delivery person drops package at 14:02. Clip 2: Delivery person walks away down driveway at 14:03. Do these clips represent the same security event? Output a JSON object with a `duplicate` boolean and a `reason` string."}
    ]
  }'

Vision-Language Models (VLM)

To run a vision model (e.g., mlx-community/Qwen2-VL-2B-Instruct-4bit), launch SwiftLM with the --vision flag:

./.build/release/SwiftLM --model mlx-community/Qwen2-VL-2B-Instruct-4bit --vision

You can then pass standard OpenAI base64 encoded images directly. SwiftLM handles hardware spatial-mapping natively via Metal:

curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe the contents of this image."},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQ..."}}
        ]
      }
    ]
  }'

⚙️ CLI Options

Option	Default	Description
`--model`	(required)	HuggingFace model ID or local path
`--port`	`5413`	Port to listen on
`--host`	`127.0.0.1`	Host to bind
`--vision`	`false`	Enable VLM (vision-language model) mode for image inputs
`--audio`	`false`	Enable ALM (audio-language model) mode for audio inputs
`--max-tokens`	`2048`	Max tokens limit per generation
`--prefill-size`	`512`	Prompt prefill chunk size (micro-batching for long contexts)
`--top-p`	`1.0`	Default top-p nucleus sampling (overridable per-request)
`--top-k`	`50`	Default top-k sampling (0 disables, overridable per-request)
`--min-p`	`0.0`	Default min-p sampling threshold relative to the highest probability token (0 disables)
`--gpu-layers`	`model_default`	Restrict the amount of layers allocated to GPU hardware
`--stream-experts`	`false`	Enable SSD expert streaming for MoE models (10x speedup)
`--turbo-kv`	`false`	Enable TurboQuant 3-bit KV cache compression
`--draft-model`	(none)	Draft model path/ID for speculative decoding (in-RAM models only)
`--num-draft-tokens`	`4`	Number of draft tokens per speculation round

📦 Requirements

macOS 14.0+
Apple Silicon (M1/M2/M3/M4/M5)
Xcode Command Line Tools
Metal Toolchain (xcodebuild -downloadComponent MetalToolchain)

📖 The "Aha!" Moment

The "2+2=4" Aha Moment: During development, we encountered a severe "silent failure" where the model would successfully load and evaluate all 32 layers at high speed, but generate nothing but infinite whitespace. The model logits showed the correct shape but the wrong magnitudes.

The breakthrough arrived when we realized the embedding scale was missing. The Gemma architecture requires scaling embedding outputs by sqrt(hidden_size). For a hidden size of 2816, missing this meant every activation in the network was ~53x too small! By adding one single math operation: h = h * MLXArray(Float(config.hiddenSize).squareRoot())

The model instantly woke up from "whispering" whitespace and successfully responded to "What is 2+2?" with a perfect "2 + 2 equals 4." — proving that the entire massive structural pipeline from Swift to Metal was working.

🙏 Acknowledgments & Credits

SwiftLM leverages the powerful foundation of the Apple MLX community and relies heavily on the open-source ecosystem. While the custom C++ implementations, Metal optimizations, and high-performance pipeline architecture were engineered natively for this engine, we owe massive thanks to the following projects and contributors for their indispensable reference materials and underlying protocols:

Contributors

Eric Lake — Engineered the SSD Expert Streaming 10x rewrite (PR #26), achieving 10× generation speedup on 122B+ MoE models via cross-projection batching, concurrent NVMe pread (QD=24), asyncEval pipeline with speculative pread, and runtime top-k expert selection. Also implemented the speculative decoding infrastructure with DraftModelRef, dual-model loading, and MambaCache checkpoint/restore for hybrid Attention+Mamba architectures.

Projects & References

mlx-swift — The core Apple MLX wrapper bringing Metal-accelerated operations into the Swift ecosystem.
mlx-lm — The official Python language models implementation, serving as the core inspiration for our chunked-prefill architecture and attention manipulation logic.
flash-moe — Inspired the memory-mapped out-of-core SSD Expert Streaming mechanics that we implemented natively in SwiftLM.
Hummingbird — The incredible event-driven Swift HTTP engine powering the OpenAI-compatible REST API.
TurboQuant Paper — "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (Zandieh et al., AISTATS 2026). Provided the initial algorithmic framework for the dual-stage PolarQuant + QJL engine.
TheTom/llama-cpp-turboquant — Served as an invaluable reference architecture for the C and GPU quantization tables, guiding the development of our native turbo-wht Walsh-Hadamard kernels and custom Metal wrapper layers.
TheTom/turboquant_plus — Essential Python validation logic used to certify the correctness of our manually constructed Lloyd-Max codebook generation math.
amirzandieh/QJL — The original 1-bit residual correction engine backing the paper, which informed our QJL error recovery in dot-product regimes.

License: MIT

Name		Name	Last commit message	Last commit date
Latest commit History 527 Commits
.agents		.agents
.github/workflows		.github/workflows
Packages		Packages
Sources		Sources
SwiftBuddy		SwiftBuddy
docs		docs
mlx-swift @ 022e33a		mlx-swift @ 022e33a
mlx-swift-lm @ d861950		mlx-swift-lm @ d861950
scripts		scripts
tests		tests
.gitignore		.gitignore
.gitmodules		.gitmodules
.nojekyll		.nojekyll
LICENSE		LICENSE
Package.resolved		Package.resolved
Package.swift		Package.swift
README.md		README.md
build.sh		build.sh
build_swiftbuddy.sh		build_swiftbuddy.sh
gemma_metrics_final.md		gemma_metrics_final.md
profiling_results.md		profiling_results.md
qwen_metrics_final.md		qwen_metrics_final.md
run_benchmark.sh		run_benchmark.sh
run_full_metrics.sh		run_full_metrics.sh

Folders and files

Latest commit

History

Repository files navigation

⚡️ SwiftLM

🏁 Getting Started

Fastest: Download Pre-built Binary

Build from Source

📊 Performance: Qwen3.6-35B on Apple Silicon

Headline Numbers

📊 Performance: Gemma 4-26B on Apple Silicon

Headline Numbers

🚀 Features

🧠 Supported Models & Methodologies

Text (LLMs)

Vision (VLMs)

Audio (ALMs)

📱 SwiftBuddy — iOS App

Features

Build & Run (iOS)

⚡️ TurboQuantization: KV Cache Compression

By combining V2 Speed with V3 Quality:

The Algorithm:

💾 SSD Expert Streaming: 10x MoE Speedup

Benchmark Results (M1 Ultra 64GB, Qwen3.5-122B-A10B-4bit)

The Approach: Small Model Helps Large Model

Optimization Techniques

Key Engineering Findings

Usage

🔀 Why We Forked Apple MLX

💻 Benchmarks & Testing

Test 1: Automated Context & Memory Profile (TPS & RAM matrix)

Test 2: Prompt Cache & Sliding Window Regression Test

Throughput & Inference Memory Profile

📡 API Endpoints

💻 Usage Examples

Chat Completion (Streaming)

Vision-Language Models (VLM)

⚙️ CLI Options

📦 Requirements

📖 The "Aha!" Moment

🙏 Acknowledgments & Credits

Contributors

Projects & References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 43

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages