Thanks to visit codestin.com
Credit goes to EricLBuehler.github.io

Skip to content

Fast, flexible LLM inference.

A single binary for local inference, OpenAI-compatible serving, and agentic workloads.
# Linux or macOS. Auto-detects CUDA, Metal, or MKL.
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh
# Windows, from PowerShell.
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex
# Check detected hardware, build features, and Hugging Face connectivity.
mistralrs doctor
# Chat with a model in your terminal.
mistralrs run --quant 4 -m Qwen/Qwen3-4B
# Or serve the same model behind an OpenAI-compatible HTTP API with a built-in web UI.
mistralrs serve --quant 4 -m Qwen/Qwen3-4B

Auto-detection

mistralrs run -m <model> works without flags. The binary infers the architecture, chat template, and accelerator from the Hugging Face repository.

Quantization

--quant 4 prefers a prebuilt UQFF from mistralrs-community if one is published, otherwise applies in-situ quantization. --quant auto picks for your hardware. Supported methods: ISQ, GGUF, GPTQ, AWQ, HQQ, AFQ, FP8, and MXFP4.

Auto-tuning

mistralrs tune -m <model> benchmarks the host and prints recommended settings for context length, batch size, and quantization level.

Diagnostics

mistralrs doctor reports detected hardware, compiled accelerator features, and Hugging Face connectivity before you load a model.

Agents

Local agent runtime with a server-side tool loop, web search, code execution, MCP, generated media, and sessions.

Multi-GPU

Tensor parallelism across local GPUs via NCCL, or across machines via the built-in ring backend.

Web UI

Mounted at /ui by default. Browser chat with reasoning blocks, tool and code-execution call visualization, and inline search results. Pass --no-ui to disable.