What is the best way to run AI locally on a Mac?

vMLX is a free, open source Mac app that uses Apple's MLX framework to run LLMs locally on Apple Silicon. It provides prefix caching, paged KV cache, and continuous batching for faster inference than alternatives like LM Studio or Ollama.

Can I run DeepSeek, Llama, or Qwen locally on Mac with vMLX?

Yes. vMLX supports any MLX-compatible model from HuggingFace, including DeepSeek, Llama 3, Qwen, Gemma, Mistral, and hundreds more. Just download the model in MLX format and select it in the app.

Does vMLX require an internet connection or API keys?

No. vMLX runs entirely on your Mac with no cloud connection, no API keys, and no rate limits. All data stays on your machine. It's completely free and open source.

What Mac hardware do I need to run vMLX?

vMLX requires any Mac with Apple Silicon (M1 or later) running macOS. More unified memory allows larger models — 16GB handles up to 12-20B parameter models, while 64GB+ handles 70B+ models.

Open Source · macOS Native

Run AI locally.
No compromises.

Free, open source Mac app for local LLM inference on Apple Silicon. Prefix caching, paged KV cache, continuous batching, and MCP tools — features LM Studio and Ollama don't have.

Download for Mac View on GitHub

Free & Open Source · Apple Silicon Native · No Cloud Required

vMLX

Features

Everything you need to
run LLMs locally

The fastest MLX inference engine for Apple Silicon. Features that other local AI apps don't offer.

Prefix Cache

Reuse previously computed prefill tokens. Up to 9.7x faster time-to-first-token on cached prompts. Multi-context caching means switching conversations doesn't evict your cache.

Paged KV Cache

Memory-efficient key-value caching with configurable block sizes. Handle longer contexts without running out of unified memory.

Continuous Batching

Serve multiple concurrent requests with intelligent batch scheduling. Up to 256 concurrent sequences.

MCP Tools

Native Model Context Protocol support. Connect models to external tools and APIs for agentic workflows.

OpenAI-Compatible API

Drop-in replacement for OpenAI's chat completions endpoint. Use your existing tools, scripts, and integrations. Streaming, function calling, and structured output all work out of the box.

Performance

Built for Apple Silicon

Optimized for unified memory architecture. Run Llama, DeepSeek, Qwen, Gemma, and Mistral locally with maximum throughput.

256

Max concurrent sequences

512

Prefill batch size

20%

Auto cache memory

∞

Concurrent sessions

Benchmarks

vMLX vs LM Studio

Real benchmarks on Apple M3 Ultra (256 GB) with Llama 3.2 3B Instruct 4-bit.

vMLX (vllm-mlx)

Engine: vllm-mlx v0.1 (SimpleEngine + mlx-lm)
Flags: --continuous-batching --enable-prefix-cache --use-paged-cache
Cache: Paged KV cache, multi-context
API: OpenAI /v1/chat/completions (streaming)

LM Studio (MLX engine)

Engine: LM Studio 0.3.x built-in MLX backend
Flags: Default settings (auto prefix caching)
Cache: Single-slot (1 active context)
API: OpenAI /v1/chat/completions (streaming)

Head-to-Head: TTFT (Time to First Token)

Metric	vMLX	LM Studio MLX
~2.5K Token Context
Cold TTFT	0.50s	—
Warm TTFT (cached)	0.05s	—
Cache Speedup	9.7×	—
~10K Token Context
Cold TTFT	0.12s	6.12s
Warm TTFT (cached)	0.08s	0.29s
Cache Speedup	1.6×	21×
~50K Token Context
Cold TTFT	0.30s	—
Warm TTFT (cached)	0.22s	—
Cache Speedup	1.4×	—
~100K Token Context
Cold TTFT	0.65s	131.06s
Warm TTFT (cached)	0.45s	1.14s
Cold PP/s	154,121	686
Warm PP/s	222,462	78,635
Architecture
Cache type	Paged multi-context	Single-slot
Multi-conversation	✓ concurrent caching	✗ evicts on switch
Concurrent sequences	Up to 256	1

All measurements: TTFT via streaming OpenAI-compatible API. Cold = first request, no cache. Warm = same prefix cached. vMLX flags: --continuous-batching --enable-prefix-cache --use-paged-cache. LM Studio: default MLX engine settings. Model: mlx-community/Llama-3.2-3B-Instruct-4bit. Hardware: Apple M3 Ultra, 256 GB unified memory. Feb 2026.

Prompt Processing Speed

Cold = first request (full processing). Warm = same prefix cached. Up to 18.6× faster at 50K tokens.

Multi-Turn Prefix Cache

8-turn coding conversation with 12K system prompt. After turn 1, 99%+ tokens served from cache.

Get Started

Up and running in seconds

Download vMLX, install the MLX inference backend with one click, pick any model from HuggingFace, and start generating. No cloud, no API keys, no Docker.

✓ One-click vLLM-MLX installer
✓ Download any MLX-compatible model
✓ Automatic server start with smart defaults
✓ OpenAI-compatible API on localhost
✓ Full chat UI with advanced settings
✓ Latest DMG: contains a Developer ID signed and notarized app

terminal

# First run: vMLX auto-installs vLLM-MLX

$ open vMLX.app

Installing vllm-mlx via uv...

# Select a model, configure settings, hit Start

Server started on http://127.0.0.1:8000

# OpenAI-compatible API ready instantly

$ curl http://127.0.0.1:8000/v1/chat/completions \

-H "Content-Type: application/json" \

-d '{"model":"default","messages":[

{"role":"user","content":"Hello!"}]}'

Server Configuration

Configuration

Every parameter
at your fingertips

Fine-tune the MLX inference pipeline. Prefill batch sizes, cache memory, paged KV cache blocks, MCP tool bindings — vMLX exposes all 23 configuration flags.

FAQ

Questions & answers

What is the best app to run AI locally on a Mac?+

vMLX is the fastest open source Mac app for running LLMs locally on Apple Silicon. Unlike LM Studio or Ollama, vMLX provides prefix caching (up to 9.7x faster time-to-first-token), paged KV cache, continuous batching for up to 256 concurrent sequences, and MCP tool integration. It's free, requires no cloud connection, and works on any M1+ Mac.

How does vMLX compare to LM Studio and Ollama?+

At 100K token context, vMLX achieves 154,121 prompt tokens/sec (cold) vs LM Studio's 686 tok/s. vMLX uses paged multi-context KV caching (concurrent conversations stay cached), while LM Studio uses single-slot caching that evicts on switch. vMLX supports up to 256 concurrent sequences vs 1 for LM Studio. All three offer OpenAI-compatible APIs, but only vMLX exposes all 23 inference parameters.

Can I run DeepSeek, Llama, Qwen, or Gemma locally?+

Yes. vMLX supports any MLX-compatible model from HuggingFace including DeepSeek V3, Llama 3/4, Qwen 2.5/3, Gemma 3, Mistral, Phi, and hundreds more. Models run entirely on your Mac's Apple Silicon GPU. A 16GB Mac handles up to ~20B parameters, while 64GB+ handles 70B+ models.

What is prefix caching and why does it matter?+

Prefix caching stores computed KV states from previous prompt processing. When you send a new message that shares the same system prompt or history, cached tokens are reused instantly. In benchmarks, this reduces TTFT by up to 9.7x on 2.5K context. Critical for multi-turn conversations and agentic workflows.

Do I need internet or API keys?+

No. vMLX runs entirely on your Mac with zero cloud dependency. No API keys, no subscriptions, no rate limits. Your conversations and model weights stay 100% local and private. Internet is only needed to download models initially.

What Mac hardware do I need?+

Any Mac with Apple Silicon (M1, M2, M3, M4, M5 or later). More unified memory = larger models: 8GB handles ~3-7B, 16GB up to ~20B, 32-64GB handles 30-70B, and 128-512GB runs the largest open models at full precision.

Ready to run AI locally?

Free, open source, built for the Apple Silicon Mac you already have.
No cloud. No API keys. No rate limits.

Download for Mac

Run AI locally. No compromises.

Everything you need torun LLMs locally