Auto-detection
mistralrs run -m <model> works without flags. The binary infers the
architecture, chat template, and accelerator from the Hugging Face
repository.
# Linux or macOS. Auto-detects CUDA, Metal, or MKL.curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
# Windows, from PowerShell.irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex# Check detected hardware, build features, and Hugging Face connectivity.mistralrs doctor
# Chat with a model in your terminal.mistralrs run --quant 4 -m Qwen/Qwen3-4B
# Or serve the same model behind an OpenAI-compatible HTTP API with a built-in web UI.mistralrs serve --quant 4 -m Qwen/Qwen3-4BAuto-detection
mistralrs run -m <model> works without flags. The binary infers the
architecture, chat template, and accelerator from the Hugging Face
repository.
Quantization
--quant 4 prefers a prebuilt UQFF from mistralrs-community if one is
published, otherwise applies in-situ quantization. --quant auto picks
for your hardware. Supported methods: ISQ, GGUF, GPTQ, AWQ, HQQ, AFQ, FP8,
and MXFP4.
Auto-tuning
mistralrs tune -m <model> benchmarks the host and prints recommended
settings for context length, batch size, and quantization level.
Diagnostics
mistralrs doctor reports detected hardware, compiled accelerator
features, and Hugging Face connectivity before you load a model.
Agents
Local agent runtime with a server-side tool loop, web search, code execution, MCP, generated media, and sessions.
Multi-GPU
Tensor parallelism across local GPUs via NCCL, or across machines via the built-in ring backend.
Web UI
Mounted at /ui by default. Browser chat with reasoning blocks, tool and
code-execution call visualization, and inline search results. Pass --no-ui
to disable.