A blazing-fast ⚡, lightweight Rust 🦀 implementation of vLLM.
- 🔧 Pure Rust Backend – Absolutely no PyTorch required
- 🚀 High Performance (with session-based context cache) – Outperforms its Python counterparts
- 🧠 Minimalist Core – Core logic written in < 2000 lines of clean Rust
- 💻 Cross-Platform – Supports CUDA (Linux/Windows) and Metal (macOS)
- 🤖 Built-in Chatbot/API Server – Native Rust server for both CUDA and Metal
- 🐍 Lightweight Python Interface – PyO3-powered bindings for chat completion
- 🤝 Open for Contributions – PRs, issues, and stars are welcome!
A100 (Single Card, 40 GB)
| Model | Format | Size | Decoding Speed |
|---|---|---|---|
| Llama-3.1-8B | ISQ (BF16->Q4K) | 8B | 90.19 tokens/s |
| DeepSeek-R1-Distill-Llama-8B | Q2_K | 8B | 94.47 tokens/s |
| DeepSeek-R1-0528-Qwen3-8B | Q4_K_M | 8B | 95 tokens/s |
| GLM-4-9B-0414 | Q4_K_M | 9B | 70.38 tokens/s |
| QwQ-32B | Q4_K_M | 32B | 35.69 tokens/s |
| Qwen3-30B-A3B | Q4_K_M | 30B (MoE) | 75.91 tokens/s |
Models: Qwen3-0.6B (BF16), Qwen3-4B (Q4_K_M), Qwen3-8B (Q2_K); Concurrent Requests: 1-128; Max Model Length: 512-2048; Max Output Tokens / Request: 512-2048
| Model | Batch Size | Output Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|---|
| Qwen3-0.6B (BF16) | 128 | 63488 | 83.13 | 763.73 |
| Qwen3-0.6B (BF16) | 32 | 15872 | 23.53 | 674.43 |
| Qwen3-0.6B (BF16) | 1 | 456 | 9.23 | 49.42 |
| Qwen3-4B (Q4_K_M) | 1 | 1683 | 52.62 | 31.98 |
| Qwen3-8B (Q2_K) | 1 | 1300 | 80.88 | 16.07 |
Model: Qwen3-0.6B (BF16); Concurrent Requests: 256; Max Model Length: 1024; Max Output Tokens / Request: 1024
| Inference Engine | Tokens | Time (s) | Throughput (tokens/s) |
|---|---|---|---|
| vLLM (RTX 4070) (Reference) | 133,966 | 98.37 | 1361.84 |
| Nano-vLLM (RTX 4070) (Reference) | 133,966 | 93.41 | 1434.13 |
| vLLM.rs (A100) | 262,144 | 23.88 | 10977.55 (>40% faster than Nano-vLLM on the same A100) |
| Nano-vLLM (A100) | 262,144 | 34.22 | 7660.26 |
- ✅ LLaMa (LLaMa2, LLaMa3)
- ✅ Qwen (Qwen2, Qwen3)
- ✅ Qwen2 MoE
- ✅ Qwen3 MoE
- ✅ Mistral
- ✅ GLM4 (0414, Not ChatGLM)
Supports both Safetensors (including GPTQ- and AWQ-quantized models) and GGUF formats.
💡 1. A manual build is required for CUDA compute capability < 8.0 (e.g., V100).
💡 2. The prebuilt package includes the native context cache feature without relying on Flash Attention; a manual build is required to use the flash-context feature.
python3 -m pip install vllm_rs
💡 You can use any client compatible with the OpenAI API.
🤖 Client usage of the context cache is shown below the server commands.
# install server dependency
pip install fastapi uvicorn
# Start OpenAI API Server (default http://0.0.0.0:8000)
# openai.base_url = "http://localhost:8000/v1/"
# openai.api_key = "EMPTY"
# Local gguf file (`--f`), max output tokens for each request (`--max-tokens`), FP8 KV Cache (`--fp8-kvcache`, slight accuracy degradation)
python -m vllm_rs.server --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --host 0.0.0.0 --port 8000 --max-tokens 32768 --max-model-len 128000 --fp8-kvcache
# Use model weights from huggingface (`--m`: model_id, `--f`: gguf file)
python -m vllm_rs.server --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --host 0.0.0.0 --port 8000
# Multi-GPU (`--d`)
python -m vllm_rs.server --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --d 0,1 --host 0.0.0.0 --port 8000 --max-model-len 64000
# Multi-GPU for a safetensors model: local safetensors path (`--w`) with in-situ quantization to Q4K during model loading (enables maximum context length)
python -m vllm_rs.server --w /path/Qwen3-30B-A3B-Instruct-2507 --d 0,1 --host 0.0.0.0 --port 8000 --isq q4k --max-model-len 262144 --max-num-seqs 1
# Multi-GPU inference + context caching for a GGUF model. To cache context, include a `session_id` in the `extra_body` field when making a request through the OpenAI API (see the client sketch below).
# Keep the same `session_id` throughout a conversation and use a new `session_id` for each new conversation; unused session caches are cleared. No other API settings need to change.
python -m vllm_rs.server --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --d 0,1 --host 0.0.0.0 --port 8000 --max-model-len 64000 --max-num-seqs 8 --context-cache
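For reference, here is a minimal client sketch using the official `openai` Python package (any OpenAI-compatible client works). The `model` name and `session_id` value below are illustrative placeholders, not values required by vLLM.rs:

```python
# Minimal OpenAI-compatible client sketch (assumes `pip install openai`).
# The model name and session id are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1/", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server serves the model it was launched with
    messages=[{"role": "user", "content": "How are you?"}],
    # Context cache: keep the same session_id for the whole conversation,
    # and use a new session_id for each new conversation.
    extra_body={"session_id": "conversation-1"},
)
print(response.choices[0].message.content)
```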
# Interactive chat
# Load with model id
python -m vllm_rs.chat --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --fp8-kvcache
# Local GGUF file on the second device (device index 1, `--d 1`)
python -m vllm_rs.chat --d 1 --f /path/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
# Load unquantized safetensors model as GGUF quantized (e.g., q4k), with maximum model context length
python -m vllm_rs.chat --d 0 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --max-num-seqs 1 --max-tokens 16384
# Enable context cache for fast response (CUDA)
python -m vllm_rs.chat --d 0,1 --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --max-model-len 262144 --max-num-seqs 1 --context-cache
# ISQ q4k (macOS/Metal recommended, optional `--context-cache`)
python -m vllm_rs.chat --w /path/Qwen3-0.6B --isq q4k
# Chat completion
python -m vllm_rs.completion --f /path/qwq-32b-q4_k_m.gguf --prompts "How are you? | How to make money?"
# Chat completion (Multi-GPU, CUDA)
python -m vllm_rs.completion --w /home/GLM-4-9B-0414 --d 0,1 --batch 8 --max-model-len 1024 --max-tokens 1024
from vllm_rs import Engine, EngineConfig, SamplingParams, Message
cfg = EngineConfig(weight_path="/path/Qwen3-8B-Q2_K.gguf", max_model_len=4096)
engine = Engine(cfg, "bf16")
params = SamplingParams(temperature=0.6, max_tokens=256)
prompt = engine.apply_chat_template([Message("user", "How are you?")], True)
# Synchronous generation for batched input
outputs = engine.generate_sync([params,params], [prompt, prompt])
print(outputs)
params.session_id = xxx # set a session id to reuse the context cache (see the multi-turn sketch below)
# Streaming generation for single request
(seq_id, prompt_length, stream) = engine.generate_stream(params, prompt)
for item in stream:
    # item.datatype == "TOKEN"
    print(item.data)
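As a rough sketch of how the context cache fits into a multi-turn loop, built only from the API shown above: passing the full message history to `apply_chat_template` and using a plain string for `session_id` are assumptions, not documented behavior.

```python
from vllm_rs import Engine, EngineConfig, SamplingParams, Message

cfg = EngineConfig(weight_path="/path/Qwen3-8B-Q2_K.gguf", max_model_len=4096)
engine = Engine(cfg, "bf16")

params = SamplingParams(temperature=0.6, max_tokens=256)
params.session_id = "demo-session"  # assumed value; keep it constant so cached context is reused across turns

history = []  # accumulated Message objects for this conversation
for user_text in ["How are you?", "Summarize your last answer in one sentence."]:
    history.append(Message("user", user_text))
    prompt = engine.apply_chat_template(history, True)  # assumes the template accepts the full history
    (seq_id, prompt_length, stream) = engine.generate_stream(params, prompt)
    reply = ""
    for item in stream:
        if item.datatype == "TOKEN":
            reply += item.data
    print(reply)
    history.append(Message("assistant", reply))
```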
⚠️ The first build may take a while if Flash Attention is enabled.
⚠️ When enabling context caching or multi-GPU inference, you also need to compile the Runner (using `build.sh` or `run.sh`).
- Install the Rust toolchain
- On macOS, install Xcode command line tools
- For Python bindings, install Maturin
# install build dependencies (Linux)
sudo apt install libssl-dev pkg-config -y
pip install maturin
pip install maturin[patchelf] # For Linux/Windows
- Build the Python package
# Naive CUDA (single GPU only)
maturin build --release --features cuda,python
# Naive CUDA (+CUDA Graph, experimental)
./build.sh --release --features cuda,graph,python
# CUDA (with context-cache and FP8 KV Cache, no Flash Attention)
./build.sh --release --features cuda,nccl,python
# CUDA (+Flash Attention, only used in prefill stage)
./build.sh --release --features cuda,nccl,flash-attn,python
# CUDA (+Flash Attention, used in both prefill and decode stages; longer build time)
./build.sh --release --features cuda,nccl,flash-context,python
# macOS (Metal, single GPU only, with context cache and FP8 KV cache)
maturin build --release --features metal,python
- Install packages
# the package you built
pip install target/wheels/vllm_rs-*-cp38-abi3-*.whl --force-reinstall
pip install fastapi uvicorn
Run with `--i` for interactive chat and `--w` to specify a safetensors model path, or `--f` to load a local GGUF file:
# Naive CUDA (single card only, optional `--fp8-kvcache`)
cargo run --release --features cuda,nccl -- --i --d 0 --m unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF --f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --max-model-len 262144 --context-cache
# Multi-GPU CUDA (+Flash Attention; this script also builds the runner)
./run.sh --release --features cuda,nccl,flash-attn -- --i --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 262144 --context-cache
# Multi-GPU server mode (with `--fp8-kvcache` or `--context-cache`)
./run.sh --release --features cuda,nccl,flash-attn -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 100000 --max-num-seqs 4 --server --port 8000 --fp8-kvcache
# Multi-GPU server mode (with `--context-cache`; Flash Attention used in both prefill and decode, longer build time)
./run.sh --release --features cuda,nccl,flash-context -- --d 0,1 --w /path/Qwen3-30B-A3B-Instruct-2507 --isq q4k --max-model-len 100000 --max-num-seqs 4 --server --port 8000 --context-cache
# Naive CUDA (+CUDA Graph, experimental)
cargo run --release --features cuda,graph -- --i --f /path/qwq-32b-q4_k_m.gguf --presence-penalty 1.2 --frequency-penalty 1.2
# macOS (Metal)
cargo run --release --features metal -- --i --f /path/DeepSeek-R1-Distill-Llama-8B-Q2_K.gguf
# macOS (Metal, ISQ)
cargo run --release --features metal -- --i --w /path/Qwen3-0.6B --isq q4k --context-cache
Safetensor Models (Unquantized)
# CUDA
cargo run --release --features cuda,flash-attn -- --w /path/Qwen3-8B/ --prompts "How are you today?"
# Metal
cargo run --release --features metal -- --w /path/Qwen3-8B/ --prompts "How are you today?"
# Multi-GPUs (interactive mode)
./run.sh --release --features cuda,nccl -- --w /home/GLM-4-9B-0414 --d 0,1 --i --max-tokens 1024 --max-model-len 1024
# Multi-GPUs (server mode)
./run.sh --release --features cuda,nccl -- --w /home/GLM-4-9B-0414 --d 0,1 --max-tokens 1024 --max-model-len 1024 --server
# Multi-GPUs with Context Cache (interactive mode)
./run.sh --release --features cuda,nccl,flash-attn -- --w /home/GLM-4-9B-0414 --d 0,1 --i --max-tokens 1024 --max-model-len 1024 --context-cache

| Flag | Description |
|---|---|
| `--m` | Hugging Face model ID |
| `--w` | Path to a local safetensors model |
| `--f` | GGUF filename (when a model ID is given) or local GGUF file path |
| `--d` | Device ID (e.g., `--d 0`) |
| `--max-num-seqs` | Maximum number of concurrent requests (default: 32; 8 on macOS) |
| `--max-tokens` | Max tokens per response (default: 4096, up to `max_model_len`) |
| `--batch` | Benchmark only (overrides `--max-num-seqs` and ignores `--prompts`) |
| `--prompts` | Prompts separated by \| |
| `--dtype` | KV cache dtype: `bf16` (default), `f16`, or `f32` |
| `--isq` | Load an unquantized model in a GGUF-quantized format such as `q2k`, `q4k`, etc. |
| `--temperature` | Controls randomness: lower (0.0) → deterministic, higher (1.0) → creative/random |
| `--top-k` | Limits choices to the top k highest-probability tokens; smaller k → more stable, larger k → more random |
| `--top-p` | Dynamically chooses the smallest set of tokens whose cumulative probability ≥ p; typical range 0.8-0.95 |
| `--presence-penalty` | Controls whether the model avoids reusing tokens that have already appeared. Range [-2, 2]: higher positive values → more likely to introduce new tokens; negative values → more likely to repeat previously used tokens |
| `--frequency-penalty` | Controls whether the model reduces the probability of tokens that appear too often. Range [-2, 2]: higher positive values → stronger penalty on frequently repeated tokens; negative values → encourage more repetition |
| `--server` | Server mode for the Rust CLI; in Python, use `python -m vllm_rs.server` |
| `--fp8-kvcache` | Use FP8 KV cache (when the `flash-context` feature is not enabled) |
Watch it in action 🎉
vLLM.rs_Qwen3_Demo-Compressed.mp4
💡 Run any unquantized model in a GGUF-quantized format; `--isq` values other than `q4k` and `q8_0` may take a few minutes to quantize.
# macOS
cargo run --release --features metal -- --w /path/Qwen3-0.6B/ --isq q4k --prompts "How are you today?"
# CUDA
cargo run --release --features cuda,flash-attn -- --w /path/Qwen3-8B/ --isq q4k --prompts "How are you today?"
🚧 Under active development – breaking changes may occur!
- Batched inference (Metal)
- GGUF format support
- FlashAttention (CUDA)
- CUDA Graph
- OpenAI-compatible API (streaming support)
- Continuous batching
- Multi-GPU inference (Safetensors, GPTQ, AWQ, GGUF)
- Speed up prompt processing on Metal/macOS
- Chunked Prefill
- Session-based context cache (available on CUDA when `context-cache` is enabled)
- Model loading from the Hugging Face hub
- Model loading from ModelScope (China)
- Context cache for Metal/macOS
- FP8 KV Cache (CUDA)
- FP8 KV Cache (Metal)
- FP8 KV Cache (with Flash-Attn)
- Additional model support
- Candle-vLLM
- Python nano-vllm
💡 Like this project? Give it a ⭐ and contribute!