Fast, flexible LLM inference.

A single binary for local inference, OpenAI-compatible serving, and agentic workloads.

Start with a task

Run a model locally Install mistral.rs and chat with a model in the terminal.

Serve an API Expose a model through an OpenAI-compatible HTTP endpoint.

Use Python Call models from Python without running a separate server.

Use Rust Embed mistral.rs in a Rust application.

Build an agent Use mistral.rs as a local runtime for tools, code execution, web search, multimodal inputs, and session state.

Fit a larger model Use smart quantization to reduce model memory use.

Install, verify, and run

# Linux or macOS. Auto-detects CUDA, Metal, or MKL.
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.sh | sh

# Windows, from PowerShell.
irm https://raw.githubusercontent.com/EricLBuehler/mistral.rs/master/install.ps1 | iex

# Check detected hardware, build features, and Hugging Face connectivity.
mistralrs doctor

# Chat with a model in your terminal.
mistralrs run --quant 4 -m Qwen/Qwen3-4B

# Or serve the same model behind an OpenAI-compatible HTTP API with a built-in web UI.
mistralrs serve --quant 4 -m Qwen/Qwen3-4B

Documentation sections

Start here Choose the right entry point for your task.

Tutorials Linear lessons from install to a working setup.

Guides Task-oriented recipes.

Reference Lookup pages for CLI flags, HTTP endpoints, and SDK methods.

Explanation Concepts and design decisions.

Features

Auto-detection

mistralrs run -m <model> works without flags. The binary infers the architecture, chat template, and accelerator from the Hugging Face repository.

Quantization

--quant 4 prefers a prebuilt UQFF from mistralrs-community if one is published, otherwise applies in-situ quantization. --quant auto picks for your hardware. Supported methods: ISQ, GGUF, GPTQ, AWQ, HQQ, AFQ, FP8, and MXFP4.

Auto-tuning

mistralrs tune -m <model> benchmarks the host and prints recommended settings for context length, batch size, and quantization level.

Diagnostics

mistralrs doctor reports detected hardware, compiled accelerator features, and Hugging Face connectivity before you load a model.

Agents

Local agent runtime with a server-side tool loop, web search, code execution, MCP, generated media, and sessions.

Multi-GPU

Tensor parallelism across local GPUs via NCCL, or across machines via the built-in ring backend.

Web UI

Mounted at /ui by default. Browser chat with reasoning blocks, tool and code-execution call visualization, and inline search results. Pass --no-ui to disable.

Fast, flexible LLM inference.

Start with a task

Install, verify, and run

Documentation sections

Features

Community