All notable changes to Atlas are documented here. The format is based on Keep a Changelog, and the project adheres to Semantic Versioning.
For per-release deep dives — kernel-level wins, the engineering history behind specific subsystems — see the Atlas Spark Journey.
0.1.0 — 2026-05-06
Initial public release. Atlas is a pure-Rust LLM inference engine targeting NVIDIA GB10 (DGX Spark, SM121) with twelve hand-tuned (Hardware × Model × Quantization) targets.
- Pure-Rust runtime — no Python, no PyTorch — for hybrid Attention + SSM/GDN/Mamba-2 architectures with NVFP4 / FP8 / BF16 quantization.
- 35 hyperoptimized CUDA kernels per target, compiled to PTX and embedded in the binary at build time. Multi-model image dispatches the right kernel set at startup from config.json.
- OpenAI- and Anthropic-compatible HTTP API (/v1/chat/completions, /v1/responses, /v1/messages, /v1/models, /v1/conversations, /tokenize, /detokenize, /health, /metrics).
- Tool calling with grammar-constrained decoding (Hermes, Qwen3-Coder, Mistral, MiniMax-XML formats).
- MTP speculative decoding (K=2 pipelined verify), self-speculative layer-skipping, and N-gram speculative decoding.
- Prefix caching: radix-tree (RadixAttention) + SSM snapshot cache (Marconi-style). 10× warm-cache TTFT reduction.
- KV cache dtypes: BF16, FP8, NVFP4, turbo3, turbo4. Optional per-layer high-precision overlay (--kv-high-precision-layers).
- Multi-GPU expert parallelism (EP=2 over RoCEv2) for models that exceed a single GB10's weight budget (122B-class, MiniMax M2.7).
- Vision encoder (Qwen3-VL, Qwen3.6 ViT).
- High-speed NVMe KV swap (sliding-window, io_uring) for long-context decoding past the HBM cap.
- Bearer-token authentication (--require-auth + --auth-tokens-file), constant-time validated. Default bind is 127.0.0.1; --bind 0.0.0.0 warns when used.
- Twelve supported (GB10, model, quant) targets across Qwen3.5 / Qwen3.6 / Qwen3-Next / Qwen3-VL / Gemma-4 / Mistral-Small-4 / MiniMax-M2.7 / Nemotron-H families.
- mdBook documentation at book/src/, rustdoc at target/doc/, Docker image avarok/atlas-gb10:latest.
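
As an illustration of what "constant-time validated" means in the bearer-token entry above, here is a minimal sketch of a token comparison that does not exit on the first mismatching byte. The function names `constant_time_eq` and `token_is_valid` are hypothetical and are not taken from the Atlas source; the actual implementation may differ, and production code would more likely rely on a vetted crate such as `subtle`.

```rust
/// Compare two byte strings without returning early on the first mismatch.
/// Hypothetical helper for illustration; not Atlas's actual code.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // Length is treated as non-secret in this sketch, so a length
    // mismatch is allowed to return immediately.
    if a.len() != b.len() {
        return false;
    }
    // Accumulate the XOR of every byte pair so the loop always runs
    // over the full input regardless of where the inputs differ.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Check a presented token against every allowed token.
/// The loop does not stop at the first match, so the number of
/// comparisons does not depend on which token (if any) matched.
fn token_is_valid(presented: &str, allowed: &[String]) -> bool {
    let mut ok = false;
    for token in allowed {
        ok |= constant_time_eq(presented.as_bytes(), token.as_bytes());
    }
    ok
}
```

A plain `==` on strings may stop at the first differing byte, which can leak how much of a guessed token was correct through response timing. The sketch avoids that at the source level only; an optimizing compiler is free to transform the loop, which is one reason a dedicated crate is usually preferred.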
For the kernel-level perf history — long-context regression sweeps,
the parking_lot migration, the libcuda + libnccl CI stubs, the
multi-stage scheduler refactor — see
docs/ATLAS_SPARK_JOURNEY.md and the
book/ chapters under deep-dives/.