Thanks to visit codestin.com
Credit goes to vmlx.net

Open Source · macOS Native

Run AI locally.
No compromises.

Free, open source Mac app for local LLM inference on Apple Silicon. Prefix caching, paged KV cache, continuous batching, and MCP tools — features LM Studio and Ollama don't have.

Free & Open Source · Apple Silicon Native · No Cloud Required
vMLX
vMLX Mac app — run AI locally on Apple Silicon with MLX
Features

Everything you need to
run LLMs locally

The fastest MLX inference engine for Apple Silicon. Features that other local AI apps don't offer.

Prefix Cache

Reuse previously computed prefill tokens. Up to 9.7x faster time-to-first-token on cached prompts. Multi-context caching means switching conversations doesn't evict your cache.

Paged KV Cache

Memory-efficient key-value caching with configurable block sizes. Handle longer contexts without running out of unified memory.

Continuous Batching

Serve multiple concurrent requests with intelligent batch scheduling. Up to 256 concurrent sequences.

MCP Tools

Native Model Context Protocol support. Connect models to external tools and APIs for agentic workflows.

OpenAI-Compatible API

Drop-in replacement for OpenAI's chat completions endpoint. Use your existing tools, scripts, and integrations. Streaming, function calling, and structured output all work out of the box.

Performance

Built for Apple Silicon

Optimized for unified memory architecture. Run Llama, DeepSeek, Qwen, Gemma, and Mistral locally with maximum throughput.

256
Max concurrent sequences
512
Prefill batch size
20%
Auto cache memory
Concurrent sessions
Benchmarks

vMLX vs LM Studio

Real benchmarks on Apple M3 Ultra (256 GB) with Llama 3.2 3B Instruct 4-bit.

vMLX (vllm-mlx)
Engine: vllm-mlx v0.1 (SimpleEngine + mlx-lm)
Flags: --continuous-batching --enable-prefix-cache --use-paged-cache
Cache: Paged KV cache, multi-context
API: OpenAI /v1/chat/completions (streaming)
LM Studio (MLX engine)
Engine: LM Studio 0.3.x built-in MLX backend
Flags: Default settings (auto prefix caching)
Cache: Single-slot (1 active context)
API: OpenAI /v1/chat/completions (streaming)
Head-to-Head: TTFT (Time to First Token)
MetricvMLXLM Studio MLX
~2.5K Token Context
Cold TTFT0.50s
Warm TTFT (cached)0.05s
Cache Speedup9.7×
~10K Token Context
Cold TTFT0.12s6.12s
Warm TTFT (cached)0.08s0.29s
Cache Speedup1.6×21×
~50K Token Context
Cold TTFT0.30s
Warm TTFT (cached)0.22s
Cache Speedup1.4×
~100K Token Context
Cold TTFT0.65s131.06s
Warm TTFT (cached)0.45s1.14s
Cold PP/s154,121686
Warm PP/s222,46278,635
Architecture
Cache typePaged multi-contextSingle-slot
Multi-conversation✓ concurrent caching✗ evicts on switch
Concurrent sequencesUp to 2561

All measurements: TTFT via streaming OpenAI-compatible API. Cold = first request, no cache. Warm = same prefix cached. vMLX flags: --continuous-batching --enable-prefix-cache --use-paged-cache. LM Studio: default MLX engine settings. Model: mlx-community/Llama-3.2-3B-Instruct-4bit. Hardware: Apple M3 Ultra, 256 GB unified memory. Feb 2026.

Prompt Processing Speed

Cold = first request (full processing). Warm = same prefix cached. Up to 18.6× faster at 50K tokens.

Multi-Turn Prefix Cache

8-turn coding conversation with 12K system prompt. After turn 1, 99%+ tokens served from cache.

Get Started

Up and running in seconds

Download vMLX, install the MLX inference backend with one click, pick any model from HuggingFace, and start generating. No cloud, no API keys, no Docker.

  • One-click vLLM-MLX installer
  • Download any MLX-compatible model
  • Automatic server start with smart defaults
  • OpenAI-compatible API on localhost
  • Full chat UI with advanced settings
  • Latest DMG: contains a Developer ID signed and notarized app
terminal
# First run: vMLX auto-installs vLLM-MLX
$ open vMLX.app
Installing vllm-mlx via uv...
 
# Select a model, configure settings, hit Start
Server started on http://127.0.0.1:8000
 
# OpenAI-compatible API ready instantly
$ curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[
    {"role":"user","content":"Hello!"}]}'
Server Configuration
vMLX server configuration
Configuration

Every parameter
at your fingertips

Fine-tune the MLX inference pipeline. Prefill batch sizes, cache memory, paged KV cache blocks, MCP tool bindings — vMLX exposes all 23 configuration flags.

prefill_batch_size max_concurrent_seq prefix_caching paged_kv_cache block_size cache_memory_% continuous_batching mcp_tools temperature top_p
FAQ

Questions & answers

What is the best app to run AI locally on a Mac?+

vMLX is the fastest open source Mac app for running LLMs locally on Apple Silicon. Unlike LM Studio or Ollama, vMLX provides prefix caching (up to 9.7x faster time-to-first-token), paged KV cache, continuous batching for up to 256 concurrent sequences, and MCP tool integration. It's free, requires no cloud connection, and works on any M1+ Mac.

How does vMLX compare to LM Studio and Ollama?+

At 100K token context, vMLX achieves 154,121 prompt tokens/sec (cold) vs LM Studio's 686 tok/s. vMLX uses paged multi-context KV caching (concurrent conversations stay cached), while LM Studio uses single-slot caching that evicts on switch. vMLX supports up to 256 concurrent sequences vs 1 for LM Studio. All three offer OpenAI-compatible APIs, but only vMLX exposes all 23 inference parameters.

Can I run DeepSeek, Llama, Qwen, or Gemma locally?+

Yes. vMLX supports any MLX-compatible model from HuggingFace including DeepSeek V3, Llama 3/4, Qwen 2.5/3, Gemma 3, Mistral, Phi, and hundreds more. Models run entirely on your Mac's Apple Silicon GPU. A 16GB Mac handles up to ~20B parameters, while 64GB+ handles 70B+ models.

What is prefix caching and why does it matter?+

Prefix caching stores computed KV states from previous prompt processing. When you send a new message that shares the same system prompt or history, cached tokens are reused instantly. In benchmarks, this reduces TTFT by up to 9.7x on 2.5K context. Critical for multi-turn conversations and agentic workflows.

Do I need internet or API keys?+

No. vMLX runs entirely on your Mac with zero cloud dependency. No API keys, no subscriptions, no rate limits. Your conversations and model weights stay 100% local and private. Internet is only needed to download models initially.

What Mac hardware do I need?+

Any Mac with Apple Silicon (M1, M2, M3, M4, M5 or later). More unified memory = larger models: 8GB handles ~3-7B, 16GB up to ~20B, 32-64GB handles 30-70B, and 128-512GB runs the largest open models at full precision.

Ready to run AI locally?

Free, open source, built for the Apple Silicon Mac you already have.
No cloud. No API keys. No rate limits.