Atlas Inference Engine

Pure Rust LLM Inference
Universal Inference At Unimaginable Speeds

🧭 Philosophy

The foundation of any given field of science is philosophy. It is that which inspires direction, structure, and mission.

Atlas began as a solution to a widely known problem with other (Python) inference engines built by data scientists: the code was steeped in a polyglot codebase with an ever-shifting ecosystem of dependencies, patches, and cross-dependencies. One day your workaround for running a model works; the next day you have to update to a nightly branch of several dependencies and inject a new workaround. This is not how you build a software ecosystem; it is how you build a proof of concept. We thank data scientists for the great and hard work they did in proving that LLMs can revolutionize our world and its economy, and challenge us to reach higher epochs. Now the software engineers take the torch to turn a proof of concept into something designed to withstand the test of time.

Main Objective

Similar to how llama.cpp was built with the intent to prove you don't need $10,000-$100,000 GPUs to run LLMs, Atlas is built with the intent to consistently force the narrative that, as hardware continues to advance, we should not have to pay premium cloud API prices for inference. Atlas, by virtue of its philosophy, maximizes speed for each hardware/model combination, paving the way for meaningfully powerful and intelligent LLMs to run locally in such a way that the model is truly useful.

Design Choices

Free and Open Source, Always

We have promised this since the beginning. We believe great software comes from opening the source, not from keeping it closed. The more eyes, the better. That brings us to the next point.

Community-First

If you have followed us since the inception of our Discord, you know that our commitment to the community is, as one user humorously put it, "cracked". We want to build something incredible, and that means we not only build for you, but you, having access to the source code, can now build for others in ways that triumph over existing solutions. This is the only way we all win. We are the Pirates of the inference space.

Monorepo

We chose a monorepo design to ensure that, as we head further into the agentic age of coding, the average data scientist or engineer can contribute meaningful PRs to any part of the system. Eventually, since this is a monorepo, there will be a day where the repo is autonomously self-improving and self-patching. This is most efficient and most effective when all the code is in one place, not many.

Hardware+Model Specific Kernels

We make no compromises or generalizations. Each hardware and model combination has its own unique properties that require fine-tuning custom kernels that leverage the model for that specific hardware configuration. The end result? 2-3x faster kernels all around.

AI-Friendly Codebase

It took a significant amount of time to build this codebase. We also know people will want to submit AI-generated PRs. We can't stop you, and in fact, given SOTA, you might just have to! The good news is that this codebase was built with enough guardrails, structure, and abstraction to guide your AI to absorb the entire monorepo and contribute meaningfully. There's enough context to keep this from going off the rails like a crazy train. Ultimately, this means that instead of waiting days or weeks for model support, you can fork this repo and ask your AI to integrate the model; within hours you will more likely than not have a working model running. Unlike some other inference engines out there, we will not be condescending when good-faith PRs that simply work are posted. We are not stymied by bureaucracy, and we want to enable the community to rapidly expand this monorepo ecosystem safely and effectively.

Theory-Friendly Codebase

arXiv receives countless AI papers every day. Nobody can keep up. Some papers will be relevant to this project; others will not. Research that improves quality, alignment, or speed ought to be considered by our community as something we can integrate cleanly. Feel free to open a PoC PR here that explains what you did, why you did it, and how it works.

Plug and Play Design

Our system is modular, with tight abstraction boundaries and trait requirements that force the architecture to take on a certain form. This form is designed to prevent pigeon-holing the project into the wrong direction. The business logic is the same across all hardware/model combinations, just the concrete implementations differ.

🏛️ Architecture

The diagram below shows how a single HTTP request flows from the API surface down to hardware-specific CUDA kernel execution. Dashed borders mark the plug-and-play abstraction boundaries — the traits and registries where a new hardware target, model family, communication backend, or storage backend plugs in without touching the layers above or below it.

flowchart TB
    %% ── Colours & styles ──────────────────────────────────────────────
    classDef server fill:#2d6a4f,stroke:#1b4332,color:#d8f3dc
    classDef scheduler fill:#1e6091,stroke:#184e77,color:#d9ed92
    classDef model fill:#b5179e,stroke:#7209b7,color:#ffe5fc
    classDef layer fill:#7209b7,stroke:#560bad,color:#ffd6ff
    classDef kernel fill:#f48c06,stroke:#dc2f02,color:#fff
    classDef storage fill:#264653,stroke:#1d3557,color:#a8dadc
    classDef comm fill:#3a86ff,stroke:#1d3557,color:#fff
    classDef trait stroke-dasharray: 6 4,stroke-width:2px

    %% ── Top layer: HTTP API ───────────────────────────────────────────
    HTTP["HTTP Server (spark-server)<br/>OpenAI · Anthropic · Responses"]:::server
    SCHED["Scheduler<br/>batches, MTP verify, KV alloc"]:::scheduler

    HTTP --> SCHED

    %% ── Model abstraction (plug-in #1) ────────────────────────────────
    subgraph MODEL ["🔌 trait Model"]
      direction TB
      TRANSFORMER["TransformerModel<br/>generic prefill/decode loop"]:::model
    end
    class MODEL trait

    SCHED --> MODEL

    %% ── Weight loader abstraction (plug-in #2) ────────────────────────
    subgraph LOADER ["🔌 trait ModelWeightLoader"]
      direction LR
      QW35["Qwen3.5<br/>27B/35B/122B"]:::layer
      QW36["Qwen3.6<br/>35B-A3B"]:::layer
      QWNEXT["Qwen3-Next<br/>80B-A3B"]:::layer
      QWVL["Qwen3-VL<br/>30B-A3B"]:::layer
      GEMMA["Gemma-4<br/>26B/31B"]:::layer
      MISTRAL["Mistral-Small-4<br/>119B"]:::layer
      MINIMAX["MiniMax M2.7<br/>229B-A10B"]:::layer
      NEMO["Nemotron-3<br/>Nano/Super"]:::layer
    end
    class LOADER trait

    TRANSFORMER --> LOADER

    %% ── Layer trait (plug-in #3) ──────────────────────────────────────
    subgraph LAYERS ["🔌 trait TransformerLayer"]
      direction LR
      ATTN["Attention<br/>(GQA, MLA, sliding)"]:::layer
      SSM["SSM<br/>(Mamba-2, GDN)"]:::layer
      MOE["MoE<br/>(routed + shared)"]:::layer
      FFN["Dense FFN<br/>(GeGLU, SwiGLU)"]:::layer
      MTP["MTP Head<br/>(draft proposer)"]:::layer
    end
    class LAYERS trait

    LOADER --> LAYERS

    %% ── GPU backend (plug-in #4) ──────────────────────────────────────
    subgraph GPU ["🔌 trait GpuBackend"]
      direction LR
      CUDA["CUDA backend<br/>(GB10 / Blackwell)"]:::kernel
      AMD["AMD ROCm<br/>(future)"]:::kernel
      APPLE["Apple Metal<br/>(future)"]:::kernel
    end
    class GPU trait

    LAYERS --> GPU

    %% ── Kernel registry (plug-in #5) ──────────────────────────────────
    subgraph KERNELS ["🔌 kernels/<hw>/<model>/<quant>/ — auto-discovered"]
      direction LR
      K_GB10["gb10/qwen3.5-35b-a3b/nvfp4<br/>+ 11 other targets"]:::kernel
    end
    class KERNELS trait

    CUDA --> KERNELS

    %% ── EP / multi-GPU (plug-in #6) ───────────────────────────────────
    subgraph EP ["🔌 trait CommBackend"]
      direction LR
      NCCL["NCCL<br/>(EP=2, all-reduce)"]:::comm
    end
    class EP trait

    LAYERS -.-> EP

    %% ── Storage backend (plug-in #7) ──────────────────────────────────
    subgraph STORE ["🔌 trait StorageBackend"]
      direction LR
      IORING["io_uring<br/>(NVMe KV offload)"]:::storage
    end
    class STORE trait

    SCHED -.-> STORE

    %% ── Cross-references ──────────────────────────────────────────────
    KERNELS -. "kernels selected by<br/>(hardware × model × quant)<br/>at build time" .-> CUDA

Reading the Diagram

Solid boxes are concrete implementations. Dashed borders with 🔌 are the trait-based abstraction boundaries — each is a Rust trait (or a filesystem convention for kernels) where a new integration plugs in:

| Plug Point | What It Abstracts | To Add New Support |
| --- | --- | --- |
| `trait Model` | Full model forward pass | Rarely needed: the existing `TransformerModel` handles all architectures via composable layers |
| `trait ModelWeightLoader` | HuggingFace → layer translation | Implement one struct with weight-name patterns for your model family (`factory.rs` adds one match arm) |
| `trait TransformerLayer` | Per-layer compute (attn, SSM, MoE, FFN) | Compose existing layer types or implement a new one for novel architectures |
| `trait GpuBackend` | All GPU memory and kernel ops | Swap the CUDA driver for another accelerator backend |
| `kernels/<hw>/<model>/<quant>/` | Hardware-tuned CUDA kernels | Drop a new directory with `MODEL.toml` + `.cu` files; `build.rs` auto-discovers it |
| `trait CommBackend` | Multi-GPU collective communication | Implement for MPI, GDR, or custom interconnects |
| `trait StorageBackend` | NVMe KV-cache offload I/O | Implement for CXL, RDMA, or other storage tiers |
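
To make the `kernels/<hw>/<model>/<quant>/` convention concrete, here is a minimal sketch of what directory auto-discovery could look like. It is not the project's actual `build.rs`; `KernelTarget` and `discover_targets` are hypothetical names, and the only rule assumed is the one stated above: a leaf directory counts as a target when it contains a `MODEL.toml`.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// One (hardware, model, quant) kernel target found on disk (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct KernelTarget {
    pub hw: String,
    pub model: String,
    pub quant: String,
}

/// Names of the immediate subdirectories of `dir`, sorted for determinism.
fn subdirs(dir: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.file_type()?.is_dir() {
            names.push(entry.file_name().to_string_lossy().into_owned());
        }
    }
    names.sort();
    Ok(names)
}

/// Walk `root/<hw>/<model>/<quant>/` and keep every leaf that has a MODEL.toml.
pub fn discover_targets(root: &Path) -> io::Result<Vec<KernelTarget>> {
    let mut targets = Vec::new();
    for hw in subdirs(root)? {
        for model in subdirs(&root.join(&hw))? {
            for quant in subdirs(&root.join(&hw).join(&model))? {
                let leaf = root.join(&hw).join(&model).join(&quant);
                if leaf.join("MODEL.toml").is_file() {
                    targets.push(KernelTarget {
                        hw: hw.clone(),
                        model: model.clone(),
                        quant,
                    });
                }
            }
        }
    }
    Ok(targets)
}

fn main() {
    // Assumes it is run from a directory that contains `kernels/`.
    match discover_targets(Path::new("kernels")) {
        Ok(targets) => {
            for t in targets {
                println!("{}/{}/{}", t.hw, t.model, t.quant);
            }
        }
        Err(e) => eprintln!("could not scan ./kernels: {e}"),
    }
}
```

A filesystem convention like this means adding a target touches no existing source file, which is presumably why kernels use a directory layout while the other plug points use traits.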

Data Flow Summary

HTTP → spark-server receives OpenAI/Anthropic requests, tokenizes, and enqueues
Scheduler → batches sequences, orchestrates prefill/decode/speculative-verify steps
Model → generic loop: embed → [layer₀ … layerₙ] → norm → lm_head
Layers → each layer dispatches through GpuBackend to launch kernels from AtlasRegistry
Kernels → pre-compiled PTX selected by (hardware × model × quant) target at build time
EP → CommBackend handles cross-GPU all-reduce after MoE expert computation
Storage → StorageBackend spills/restores KV blocks to NVMe for long-context sequences
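
The Model step in the summary (embed → layers → norm → lm_head) can be sketched on the CPU in a few lines. This is a toy under stated assumptions, not Atlas code: `SketchLayer`, `AddBias`, and `ToyModel` are hypothetical, the layers are trivial stand-ins for attention/SSM/MoE/FFN, and the norm is an unweighted RMS norm chosen for brevity.

```rust
/// Hypothetical, simplified layer interface: transform the hidden state in place.
pub trait SketchLayer {
    fn forward(&self, hidden: &mut [f32]);
}

/// Toy layer that adds a constant, standing in for a real attention/SSM/MoE/FFN block.
pub struct AddBias(pub f32);

impl SketchLayer for AddBias {
    fn forward(&self, hidden: &mut [f32]) {
        for h in hidden.iter_mut() {
            *h += self.0;
        }
    }
}

/// RMS normalisation without learned weights.
fn rms_norm(hidden: &mut [f32]) {
    let mean_sq = hidden.iter().map(|h| h * h).sum::<f32>() / hidden.len() as f32;
    let inv = 1.0 / (mean_sq + 1e-6).sqrt();
    for h in hidden.iter_mut() {
        *h *= inv;
    }
}

pub struct ToyModel {
    pub embed: Vec<Vec<f32>>,   // vocab x hidden
    pub layers: Vec<Box<dyn SketchLayer>>,
    pub lm_head: Vec<Vec<f32>>, // vocab x hidden
}

impl ToyModel {
    /// embed -> [layer_0 .. layer_n] -> norm -> lm_head; returns one logit per vocab entry.
    pub fn forward(&self, token: usize) -> Vec<f32> {
        let mut hidden = self.embed[token].clone();
        for layer in &self.layers {
            layer.forward(&mut hidden);
        }
        rms_norm(&mut hidden);
        self.lm_head
            .iter()
            .map(|row| row.iter().zip(&hidden).map(|(w, h)| w * h).sum::<f32>())
            .collect()
    }
}

fn main() {
    let model = ToyModel {
        embed: vec![vec![1.0, 0.0], vec![0.0, 1.0]],
        layers: vec![Box::new(AddBias(1.0)) as Box<dyn SketchLayer>],
        lm_head: vec![vec![1.0, 0.0], vec![0.0, 1.0]],
    };
    println!("logits for token 0: {:?}", model.forward(0));
}
```

The loop never inspects which concrete layer it is calling, which is the property that lets one generic model drive many architectures.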

📦 What We Ship Today

We have to walk before we can run. Today's Atlas is targeted at a single hardware platform — NVIDIA's GB10 (DGX Spark, SM121) — and twelve hand-tuned (Hardware × Model × Quantization) targets. Every supported model below runs off one multi-model binary; the right kernel set is selected at startup from the model's config.json. No swapping images, no rebuilding, no per-model magic — just point Atlas at a HuggingFace ID.

| Family | Model | HuggingFace ID | Params / active | Architecture |
| --- | --- | --- | --- | --- |
| Qwen3.5 | Qwen3.5-27B | `Kbenkhaled/Qwen3.5-27B-NVFP4` | 27B dense | Hybrid SSM + attention, dense FFN, MRoPE |
| Qwen3.5 | Qwen3.5-35B-A3B | `Sehyo/Qwen3.5-35B-A3B-NVFP4` | 35B / 3B | GDN + attention + MoE, MTP |
| Qwen3.5 | Qwen3.5-122B-A10B | `Sehyo/Qwen3.5-122B-A10B-NVFP4` | 122B / 10B | GDN + attention + MoE, MTP |
| Qwen3.6 | Qwen3.6-35B-A3B | `Qwen/Qwen3.6-35B-A3B-FP8` | 35B / 3B | GDN + attention + MoE, MRoPE, vision tower |
| Qwen3-Next | Qwen3-Next-80B-A3B | `nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4` | 80B / 3B | SSM + attention + MoE |
| Qwen3-VL | Qwen3-VL-30B-A3B | `ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4` | 30B / 3B | Vision + attention + MoE |
| Gemma-4 | Gemma-4-26B-A4B | `bg-digitalservices/Gemma-4-26B-A4B-it-NVFP4A16` | 26B / 4B | Attention + MoE, GeGLU |
| Gemma-4 | Gemma-4-31B | `nvidia/Gemma-4-31B-IT-NVFP4` | 31B dense | Attention (sliding + full), GeGLU |
| Mistral | Mistral-Small-4-119B | `mistralai/Mistral-Small-4-119B-2603-NVFP4` | 119B / 6.5B | Attention + MoE |
| MiniMax | MiniMax-M2.7 | `lukealonso/MiniMax-M2.7-NVFP4` | 229B / ~10B | Attention + 256-expert MoE + MTP |
| Nemotron-H | Nemotron-3-Nano-30B-A3B | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4` | 30B / 3B | Mamba-2 + attention + MoE |
| Nemotron-H | Nemotron-3-Super-120B-A12B | `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` | 120B / 12B | Mamba-2 + attention + MoE |

This is a starting point, not a destination. The plug-and-play design above exists precisely so that AMD, Apple Silicon, Intel, and the next round of Blackwell parts can land here as community contributions, and so that the Llama 4s and DeepSeek V4s of next quarter slot in the same way the Qwens did this quarter. We did the hard part — bolting in the abstractions while bringing up the first twelve targets — so that adding the thirteenth is a weekend, not a quarter.

⚡ Performance

We're not going to spend much real estate on benchmark theatre. The numbers below are what the binary in this repository does on a single NVIDIA GB10, on a short prompt ("What is the capital of France?", max_tokens ≤ 30, temperature = 0.1), measured end-to-end through the HTTP API. They are reproducible: scripts/sweep_all_models.sh is the harness, and the source for every kernel that produced them is in this repository.

| Model | Mode | tok/s |
| --- | --- | --- |
| Qwen3.5-35B-A3B | MTP speculative (K=2) | 131 |
| Qwen3.5-35B-A3B | turbo4 KV | 77 |
| Qwen3.5-35B-A3B | No speculative | 70 |
| Qwen3-Next-80B-A3B | FP8 KV | 74 |
| Qwen3.5-122B-A10B | EP=2, MTP K=2 (600-tok sustained) | 46 |
| Qwen3.5-122B-A10B | FP8 KV, single-GPU tuned | 32 |
| Qwen3-VL-30B-A3B | NVFP4 KV | 97 |
| Nemotron-3-Nano-30B-A3B | FP8 KV | 88 |
| Nemotron-3-Super-120B | FP8 KV | 24 |
| Gemma-4-26B-A4B | default | 67 |
| Gemma-4-31B | `--max-batch-size 2` | 9 |
| Mistral-Small-4-119B | NVFP4 | 33 |
| Qwen3.5-27B (dense hybrid) | FP8 KV | 13 |

We compete with vLLM and TensorRT-LLM on the same GB10. On Qwen3.5-35B-A3B with MTP speculative decoding, Atlas decodes faster than the same model under NVIDIA's own vLLM build on the same hardware — meaningfully faster, on numbers we can hand you the script for. We will not put a bigger figure in this paragraph than the one that comes off our own benchmark scripts, and we publish the vLLM baseline command alongside ours so you can verify both. If you reproduce a faster vLLM number, file an issue. We would rather be measured than congratulated.

The kernel-by-kernel comparison against PyTorch eager (35 hyperoptimized CUDA kernels, all wins on production-relevant shapes) lives in the benchmarks chapter along with the methodology footnotes — read them; they matter.

🗜️ KV Cache Quantization

Atlas stores attention key/value state in one of six quantized formats, selected via --kv-cache-dtype. Lower bit-widths fit more tokens in GPU memory at the cost of precision; the Turbo family adds Walsh-Hadamard rotation and Lloyd-Max optimal codebooks to recover accuracy at the same bit rate. Mix dtypes per layer with --kv-high-precision-layers to keep boundary layers at BF16 while compressing the middle.
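
As a rough illustration of why a Walsh-Hadamard rotation helps before quantization, the sketch below applies an orthonormal fast Walsh-Hadamard transform to a 16-element group with a single outlier. It is a generic CPU version of the transform, not the Atlas kernel; `fwht` and `max_abs` are hypothetical helper names, and the group size of 16 is assumed only to mirror the per-group scale granularity listed for the Turbo formats.

```rust
/// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is orthonormal.
/// `x.len()` must be a power of two.
pub fn fwht(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Butterfly pass: combine elements that are `h` apart.
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (x[i], x[i + h]);
                x[i] = a + b;
                x[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Largest absolute value in the slice; this is what a per-group scale must cover.
pub fn max_abs(x: &[f32]) -> f32 {
    x.iter().fold(0.0_f32, |m, v| m.max(v.abs()))
}

fn main() {
    // One large outlier in an otherwise quiet group of 16 values.
    let mut group = vec![0.0_f32; 16];
    group[3] = 16.0;
    println!("max |x| before rotation: {}", max_abs(&group));
    fwht(&mut group);
    println!("max |x| after rotation:  {}", max_abs(&group));
    // The scaled transform is its own inverse, so a second pass restores the input.
    fwht(&mut group);
    println!("restored outlier: {}", group[3]);
}
```

The rotation spreads the outlier's energy evenly across the group, so the shared scale no longer has to stretch to cover one extreme value and the remaining elements keep more of the available code points.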

| CLI flag | Bits/element | Scale overhead | Technique | When to use |
| --- | --- | --- | --- | --- |
| `bf16` | 16 | none | Raw BF16 storage | Maximum precision; short-context or quality-critical workloads |
| `fp8` | 8 | Per-tensor FP32 scale (from checkpoint or online calibration via `--fp8-kv-calibration-tokens`) | FP8 E4M3 with static or calibrated per-tensor scale | Default. Safe baseline: half the memory of BF16, minimal quality loss for most models |
| `turbo8` | 8 | Per-group BF16 scale (2 bytes / 16 elements) | Walsh-Hadamard rotation → FP8 E4M3 + BF16 per-group scales | FP8-level memory with outlier suppression; recommended for many-layer models (e.g. MiniMax M2.7, 58 layers) where per-group FP8 scales compound |
| `nvfp4` | 4 | Per-group FP8 scale (1 byte / 16 elements) | E2M1 packed nibbles (NVIDIA NVFP4 format) | 4× compression vs BF16; good for long-context with `--kv-high-precision-layers auto` |
| `turbo4` | 4 | Per-group FP8 scale (1 byte / 16 elements) | Walsh-Hadamard rotation → Lloyd-Max optimal 4-bit codebook | ~2× lower MSE than NVFP4 at the same bit rate; same memory footprint |
| `turbo3` | 3 | Per-group FP8 scale (1 byte / 16 elements) | Walsh-Hadamard rotation → Lloyd-Max 3-bit codebook (8 levels, packed 8 values → 3 bytes) | Maximum compression (22% smaller than turbo4); experimental |
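
The bit widths and scale overheads in the table combine into an effective cost per element. The sketch below does that arithmetic; `effective_bits` is a hypothetical helper, and the per-tensor FP8 scale is treated as negligible, which is an assumption. With the scale byte counted, `turbo3` comes to 3.5 bits against 4.5 for `turbo4`, which reproduces the 22% figure. By the same accounting the 4-bit formats land nearer 3.6× than 4× against BF16; that is an inference from the table's own numbers, not a measurement.

```rust
/// Effective storage cost in bits per KV element: payload bits plus the
/// per-group scale amortised over the elements of the group.
pub fn effective_bits(payload_bits: f64, scale_bytes: f64, group_size: f64) -> f64 {
    payload_bits + scale_bytes * 8.0 / group_size
}

fn main() {
    // (flag, payload bits, scale bytes per group, group size), taken from the table.
    let formats = [
        ("bf16", 16.0, 0.0, 16.0),
        ("fp8", 8.0, 0.0, 16.0), // per-tensor scale assumed negligible
        ("turbo8", 8.0, 2.0, 16.0),
        ("nvfp4", 4.0, 1.0, 16.0),
        ("turbo4", 4.0, 1.0, 16.0),
        ("turbo3", 3.0, 1.0, 16.0),
    ];
    for (name, bits, scale, group) in formats {
        let eff = effective_bits(bits, scale, group);
        println!("{name:7} {eff:5.2} bits/elem  {:.2}x vs bf16", 16.0 / eff);
    }
    let saving = 1.0 - effective_bits(3.0, 1.0, 16.0) / effective_bits(4.0, 1.0, 16.0);
    println!("turbo3 vs turbo4: {:.1}% smaller", saving * 100.0);
}
```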

🚀 Quick Start

The whole supported model matrix lives in one Docker image. Pull it, mount your HuggingFace cache, point Atlas at any model ID from the model table.

Defaults below are tuned for maximum accuracy under agentic-coding workloads — 64K context window, BF16 MTP draft head (highest acceptance rate ⇒ highest end-to-end throughput), prefix caching for multi-turn tool loops, and FP8 KV cache with auto-promoted boundary layers. These are the recipes we use to drive opencode / Claude Code / Cline through Atlas on a single Spark.

Recipe A — Qwen3.6-35B-A3B (FP8 hybrid MoE, ~130 tok/s)

The default daily driver — 35 B params, 3 B active, GDN + attention + 256-expert MoE, MRoPE-positioned vision tower (text-only here).

docker pull avarok/atlas-gb10:latest

sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Qwen/Qwen3.6-35B-A3B-FP8 \
    --port 8888 \
    --max-seq-len 65536 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.90 \
    --scheduling-policy slai \
    --enable-prefix-caching \
    --speculative \
    --num-drafts 2 \
    --tool-call-parser qwen3_coder

Why these flags:

--max-seq-len 65536 — 64K window for long agent traces, file reads, multi-step tool use.
--kv-cache-dtype fp8 --kv-high-precision-layers auto — half the memory of BF16, no measurable quality loss; the auto heuristic keeps first/last attention blocks at BF16 where the routing distribution is most sensitive.
--scheduling-policy slai — SLAi scheduler (Atlas's default) reorders concurrent sequences to keep MTP verify batches dense.
--enable-prefix-caching — radix-tree prefix cache; tool-use sessions reuse the system prompt + tool-defs + earlier turns.
--speculative --num-drafts 2 — MTP draft head proposes 2 tokens per step. No --mtp-quantization flag ⇒ defaults to BF16, which gives the highest acceptance rate (lossier MTP projections lower acceptance and usually worsen end-to-end tok/s, despite the faster draft forward).
--tool-call-parser qwen3_coder — explicit Qwen XML tool format. Atlas auto-resolves the right parser from tool_defaults.toml per model; pass it anyway in production scripts.
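
To see why acceptance rate dominates the `--num-drafts` trade-off, here is a back-of-the-envelope model. It assumes each draft token is accepted independently with the same probability and that verification stops at the first rejection; this is a common simplification, not something Atlas measures or guarantees, and `expected_tokens_per_step` is a hypothetical helper. Under that assumption a step emits the verifier's own token plus the accepted prefix of the drafts, so the expectation is 1 + a + a² + … + a^K.

```rust
/// Expected tokens emitted per verify step with `num_drafts` draft tokens,
/// assuming i.i.d. acceptance with probability `accept_prob`.
pub fn expected_tokens_per_step(accept_prob: f64, num_drafts: u32) -> f64 {
    // Draft i only counts if all earlier drafts were accepted: a^i.
    (0..=num_drafts).map(|i| accept_prob.powi(i as i32)).sum()
}

fn main() {
    for accept in [0.5, 0.7, 0.9] {
        println!(
            "accept={accept:.1} K=2 -> {:.2} tokens/step",
            expected_tokens_per_step(accept, 2)
        );
    }
}
```

With K=2 the ceiling is 3 tokens per step, and the curve falls quickly as acceptance drops. That is consistent with the note above that a lossier draft head can lose more in acceptance than it gains in draft speed.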

Recipe B — Qwen3.5-35B-A3B (NVFP4, ~131 tok/s with MTP K=2)

The fastest model in the matrix on a single Spark.

sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Sehyo/Qwen3.5-35B-A3B-NVFP4 \
    --port 8888 \
    --max-seq-len 65536 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.90 \
    --scheduling-policy slai \
    --enable-prefix-caching \
    --speculative \
    --num-drafts 2 \
    --tool-call-parser qwen3_coder

Recipe C — Qwen3.5-122B-A10B (NVFP4, single Spark)

The 122B NVFP4 weights + Atlas runtime overhead leave only ~2 GB for KV cache on a 119.7 GB GB10, so this recipe sacrifices --speculative (the MTP draft head + draft KV costs ~1.5 GB) to keep a real 16 K context window. Verified end-to-end: model loads, /v1/chat/completions answers correctly, 4-way concurrent serves cleanly.

sudo docker run -d --name atlas \
  --network host --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  avarok/atlas-gb10:latest \
  serve Sehyo/Qwen3.5-122B-A10B-NVFP4 \
    --port 8888 \
    --max-seq-len 16384 \
    --kv-cache-dtype fp8 \
    --kv-high-precision-layers auto \
    --gpu-memory-utilization 0.92 \
    --scheduling-policy slai \
    --max-batch-size 1 \
    --max-num-seqs 4 \
    --oom-guard-mb 1024 \
    --ssm-cache-slots 0 \
    --tool-call-parser qwen3_coder

For 122B with both --speculative and a 64 K window, move to EP=2 across two Sparks (QUICKSTART.md §5). For long contexts on a single Spark, add --high-speed-swap --high-speed-swap-dir /path/on/nvme --high-speed-swap-cache-blocks-per-seq 64. High-speed swap (HSS) keeps a rolling 1024-token KV window in GPU memory and streams older blocks to NVMe through an io_uring orchestrator. The container needs --security-opt seccomp=unconfined --ulimit memlock=-1 for io_uring access.

Hitting the Endpoint

Atlas speaks OpenAI, Anthropic, and Responses APIs on the same port. curl, the OpenAI SDK, Open WebUI, opencode, Cline, Claude Code — point them at port 8888:

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"atlas",
    "messages":[{"role":"user","content":"Hello!"}],
    "max_tokens":256
  }'

Per-model recipes (vision input, multi-node EP=2, single-GPU 122B with the tighter budget) live in QUICKSTART.md. Build-from-source instructions are in CONTRIBUTING.md, and the kernel build pipeline is documented in docs/ARCHITECTURE.md.

🔬 Kernel Debugging

Atlas exposes a focused set of environment-gated diagnostic dumps for tracking down quality regressions — magnitude drift, expert-routing skew, MoE under-counting, and the rest of the bug class where the kernels run cleanly but the output slowly degrades. The dumps are zero-overhead when their env var is unset (single var() lookup per call, no GPU sync, no copy) so leaving the production binary instrumented is safe.
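
The env-gated pattern described above can be sketched as follows. This is an illustration of the idea (one `std::env::var` lookup, early return when unset), not the Atlas implementation: `dump_enabled`, `summarize`, and `maybe_dump` are hypothetical names, the `SKETCH_GATE_INPUT` marker and the log-line format are invented for the example, and `|x|` is assumed to mean the L2 norm. A real hook would also have to copy data back from the GPU, which this host-side sketch skips.

```rust
use std::env;

/// True when the named env var is set to exactly "1" (a single lookup per call).
pub fn dump_enabled(name: &str) -> bool {
    matches!(env::var(name).as_deref(), Ok("1"))
}

/// L2 norm plus the first five values of a host-side tensor.
pub fn summarize(x: &[f32]) -> (f32, Vec<f32>) {
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    (norm, x.iter().take(5).copied().collect())
}

/// Returns a formatted log line only when dumping is enabled; otherwise does no work.
pub fn maybe_dump(marker: &str, layer: usize, x: &[f32]) -> Option<String> {
    if !dump_enabled("ATLAS_DUMP_EXPERT_IDS") {
        return None;
    }
    let (norm, first5) = summarize(x);
    // Illustrative format only; the real marker lines may look different.
    Some(format!("{marker} layer={layer} |x|={norm:.4} first5={first5:?}"))
}

fn main() {
    let hidden = [0.5_f32, -1.0, 2.0, 0.25, 0.0, 3.0];
    match maybe_dump("SKETCH_GATE_INPUT", 0, &hidden) {
        Some(line) => println!("{line}"),
        None => println!("dumps disabled (ATLAS_DUMP_EXPERT_IDS is not set to 1)"),
    }
}
```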

For the full diagnostic playbook — including the cheapest-signal-first elimination ladder, how to build a byte-exact HF CPU oracle, the per-layer divergence comparator, and the methodological reversals that cost us hours — see DEBUGGING_METHODOLOGY.md. What follows is the env-var reference.

MoE-path dumps — `ATLAS_DUMP_EXPERT_IDS=1`

Set -e ATLAS_DUMP_EXPERT_IDS=1 on the container. The MoE prefill paths (both forward_prefill_fp8.rs and forward_prefill.rs) emit the following per-fire log lines, scoped to the last token of the chunk so the values are directly comparable to a single-pass reference forward at the same position:

| Log marker | Fires | What it captures | Use it to localize |
| --- | --- | --- | --- |
| `ATLAS_FP8_GROUPED_KERNEL` | once (FP8 only) | v1 vs v2 grouped-GEMM selection | Confirm which kernel variant is active |
| `ATLAS_EXPERT_LOAD` | once / server | Per-expert histogram + `truncated=true/false` flag | Spot `max_m_tiles` truncation against actual routing skew |
| `ATLAS_GATE_INPUT` | per layer × chunk | post-norm router input (`\|x\|` + `first5`) | Verify the MoE block input matches the reference |
| `ATLAS_GATE_LOGITS` | per layer × chunk | top-10 `(idx, val)` + mean + std of raw gate logits | Catch gate-matmul drift before softmax/topK |
| `ATLAS_EXPERT_IDS` | per layer × chunk | top-K indices + renormalized weights + sum | Confirm routing decisions match HF |
| `ATLAS_ROUTED_ONLY` | per layer × chunk | routed sum before shared blend | Isolate the routed-expert contribution |
| `ATLAS_SHARED_OUT` | per layer × chunk | shared-expert output (pre-sigmoid) | Verify the dense FFN branch independently |
| `ATLAS_SHARED_GATE` | per layer × chunk | `dot(input, gate_weight)` + sigmoid value | Confirm shared-expert attenuation matches |
| `ATLAS_MOE_OUT` | per layer × chunk | final MoE block output (routed + blended) | The full-block ground truth vs the reference |

SSM-path dumps

The SSM (GDN / Mamba-2) prefill in qwen3_ssm/trait_prefill.rs adds three pre-norm hooks under the same env var:

| Log marker | What it captures |
| --- | --- |
| `ATLAS_PRENORM_HIDDEN` | Residual stream entering this layer (= previous layer's output) |
| `ATLAS_PRENORM_OUTPROJ` | SSM `out_proj` output before residual add |
| `ATLAS_PRENORM_SUM` | hidden + out_proj (the input to `post_attention_layernorm`) |

Together, those plus the MoE dumps above give a complete trace of the residual stream at every layer boundary for any token in any chunk.

Path-toggle env vars

For bisecting which code path is at fault, two override toggles let you swap the routed-expert dispatch at runtime without rebuilding:

| Env var | Effect |
| --- | --- |
| `ATLAS_FP8_MOE_COALESCED=0` | Forces the FP8 grouped-GEMM v1 kernel (default is v2; v1 has a documented numerical bug for some `(token, expert)` tiles) |
| `ATLAS_FORCE_NVFP4_MOE=1` | Routes an FP8 model's MoE through the NVFP4 path, which is useful for cross-validating that the bug is in one specific quant path |

How we use these in practice — 3-step workflow

The order matters; this is the same workflow that found and fixed three compounding MoE bugs (commits 6a5fd3d, 34626d3, adf39ce, ffdb41d) on the Qwen3.6-A3B long-context investigation:

Build an HF reference oracle. A single-precision forward pass through HF Transformers on the same token IDs (read them back from Atlas's /tokenize; do not re-render the chat template), with output_hidden_states=True and per-layer hooks on mlp.gate, mlp.shared_expert, and mlp.shared_expert_gate. Record |x| + first5 per layer for the last token.
Spin up Atlas with -e ATLAS_DUMP_EXPERT_IDS=1. Fire the same prompt. The MoE markers above give you per-layer Atlas values comparable to the oracle.
Per-layer comparator. A short script (the comparator pattern is captured in DEBUGGING_METHODOLOGY.md §4) prints ratio = |Atlas| / |HF| and overlap = |top-K_Atlas ∩ top-K_HF| per layer. The first layer where the ratio falls outside [0.95, 1.05] or overlap drops below 6/8 is your first-divergent layer — start drilling there.
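
The comparator step can be sketched as below. This is not the script from DEBUGGING_METHODOLOGY.md; `LayerTrace`, `LayerDiff`, `compare`, and `first_divergent` are hypothetical names, and the inputs are assumed to have already been parsed from the Atlas dump lines and the HF hooks. The thresholds are the ones stated in the workflow: ratio within [0.95, 1.05] and overlap of at least 6 of 8.

```rust
use std::collections::HashSet;

/// Per-layer values for the last token: output magnitude and routed expert ids.
pub struct LayerTrace {
    pub norm: f32,
    pub top_k: Vec<u32>,
}

pub struct LayerDiff {
    pub layer: usize,
    pub ratio: f32,     // |Atlas| / |HF|
    pub overlap: usize, // |top-K_Atlas ∩ top-K_HF|
}

/// Pair up layers from both traces and compute ratio and top-K overlap for each.
pub fn compare(atlas: &[LayerTrace], hf: &[LayerTrace]) -> Vec<LayerDiff> {
    atlas
        .iter()
        .zip(hf)
        .enumerate()
        .map(|(layer, (a, h))| {
            let hf_ids: HashSet<u32> = h.top_k.iter().copied().collect();
            let overlap = a.top_k.iter().filter(|id| hf_ids.contains(*id)).count();
            LayerDiff { layer, ratio: a.norm / h.norm, overlap }
        })
        .collect()
}

/// First layer whose ratio leaves [0.95, 1.05] or whose overlap drops below `min_overlap`.
pub fn first_divergent(diffs: &[LayerDiff], min_overlap: usize) -> Option<usize> {
    diffs
        .iter()
        .find(|d| !(0.95_f32..=1.05).contains(&d.ratio) || d.overlap < min_overlap)
        .map(|d| d.layer)
}

fn main() {
    let ids: Vec<u32> = (0..8).collect();
    // Made-up numbers: layer 1 is 3.4x too large on the Atlas side.
    let hf = vec![
        LayerTrace { norm: 1.0, top_k: ids.clone() },
        LayerTrace { norm: 2.0, top_k: ids.clone() },
    ];
    let atlas = vec![
        LayerTrace { norm: 1.01, top_k: ids.clone() },
        LayerTrace { norm: 6.8, top_k: ids.clone() },
    ];
    let diffs = compare(&atlas, &hf);
    for d in &diffs {
        println!("layer {:2} ratio={:.3} overlap={}/8", d.layer, d.ratio, d.overlap);
    }
    println!("first divergent layer: {:?}", first_divergent(&diffs, 6));
}
```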

For the 2026-05-20 MoE bug hunt this localized the issue from "16K context produces gibberish" to "L0 MoE output magnitude 3.4× too large because of three compounding bugs: v1 grouped-GEMM, missing zero-init, broken max_m_tiles heuristic" within a few iterations. After all three fixes, all 40 layers landed in [0.977, 1.021] of HF baseline — at the FP8 quantization noise floor.

🔌 Adding a New Hardware Target

The full recipe is in docs/HARDWARE.md. The short version: implement two traits (ComputeTarget for the build-time compiler, GpuBackend for the runtime), drop kernel sources into kernels/<your-hw>/, add one match arm in the registry. There is a MockGpuBackend in spark-runtime that lets you write and test the entire scaffold without owning the hardware — every layer above the GPU trait is hardware-agnostic, so unit tests can run on a laptop. We bolted the project from "single CUDA target" to "trait-pluggable across vendors" specifically so that the AMD, Apple, and Intel ports stop being our problem and start being yours.
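
To illustrate why a mock backend makes the upper layers testable without hardware, here is a deliberately tiny sketch. It is not the real `GpuBackend` trait or the real `MockGpuBackend`; the trait `SketchGpuBackend`, its four methods, the `SketchMock` type, and the `"relu"` kernel name are all invented for this example. The real signatures are in docs/HARDWARE.md and the source.

```rust
/// Hypothetical, heavily reduced stand-in for a runtime GPU trait.
pub trait SketchGpuBackend {
    /// Allocate `len` f32 elements and return an opaque buffer handle.
    fn alloc(&mut self, len: usize) -> usize;
    fn upload(&mut self, buf: usize, data: &[f32]);
    fn download(&self, buf: usize) -> Vec<f32>;
    /// Launch a named kernel over one buffer.
    fn launch(&mut self, kernel: &str, buf: usize);
}

/// Host-memory mock: buffers are plain Vecs, kernels are plain Rust loops.
#[derive(Default)]
pub struct SketchMock {
    buffers: Vec<Vec<f32>>,
    pub launches: Vec<String>, // recorded so tests can assert on dispatch order
}

impl SketchGpuBackend for SketchMock {
    fn alloc(&mut self, len: usize) -> usize {
        self.buffers.push(vec![0.0; len]);
        self.buffers.len() - 1
    }
    fn upload(&mut self, buf: usize, data: &[f32]) {
        self.buffers[buf].copy_from_slice(data);
    }
    fn download(&self, buf: usize) -> Vec<f32> {
        self.buffers[buf].clone()
    }
    fn launch(&mut self, kernel: &str, buf: usize) {
        match kernel {
            "relu" => self.buffers[buf].iter_mut().for_each(|v| *v = v.max(0.0)),
            other => panic!("unknown kernel: {other}"),
        }
        self.launches.push(kernel.to_string());
    }
}

/// Hardware-agnostic code: it only sees the trait, never a concrete device.
pub fn apply_relu<B: SketchGpuBackend>(gpu: &mut B, input: &[f32]) -> Vec<f32> {
    let buf = gpu.alloc(input.len());
    gpu.upload(buf, input);
    gpu.launch("relu", buf);
    gpu.download(buf)
}

fn main() {
    let mut gpu = SketchMock::default();
    println!("{:?}", apply_relu(&mut gpu, &[-1.0, 2.0, -0.5]));
    println!("launched: {:?}", gpu.launches);
}
```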

🧬 Adding a New Model

Same story, smaller surface. Implement ModelWeightLoader (one struct, the existing Qwen3AttentionLayer/MoeLayer/Qwen3SsmLayer/NemotronMamba2Layer primitives cover most architectures), add one line to the factory dispatch, optionally drop a MODEL.toml for sampling defaults and behavior knobs. Kernels are reused; the scheduler is untouched; the server is oblivious. The step-by-step cookbook is in docs/HARDWARE.md. Once your loader produces coherent output on the integration coherence prompt, you are done — file the PR.
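
The "one struct plus one match arm" shape can be sketched like this. Everything here is hypothetical: `SketchWeightLoader`, `LayerKind`, `ExampleFamilyLoader`, `loader_for`, and the `"example_family"` key are invented for illustration, and the tensor-name patterns merely follow common HuggingFace naming. The real `ModelWeightLoader` trait will have a different and larger surface.

```rust
/// Which layer primitive a checkpoint tensor belongs to (illustrative subset).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum LayerKind {
    Attention,
    Moe,
    DenseFfn,
}

/// Hypothetical, reduced loader interface: map tensor names to layer kinds.
pub trait SketchWeightLoader {
    fn classify(&self, tensor_name: &str) -> Option<LayerKind>;
}

/// One struct per model family, holding that family's weight-name patterns.
pub struct ExampleFamilyLoader;

impl SketchWeightLoader for ExampleFamilyLoader {
    fn classify(&self, name: &str) -> Option<LayerKind> {
        if name.contains(".self_attn.") {
            Some(LayerKind::Attention)
        } else if name.contains(".mlp.experts.") || name.contains(".mlp.gate.") {
            Some(LayerKind::Moe)
        } else if name.contains(".mlp.") {
            Some(LayerKind::DenseFfn)
        } else {
            None
        }
    }
}

/// Factory dispatch: supporting a new family means adding one arm here.
pub fn loader_for(model_type: &str) -> Option<Box<dyn SketchWeightLoader>> {
    match model_type {
        "example_family" => Some(Box::new(ExampleFamilyLoader)),
        _ => None,
    }
}

fn main() {
    let loader = loader_for("example_family").expect("family is registered");
    for name in [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.experts.7.down_proj.weight",
        "model.embed_tokens.weight",
    ] {
        println!("{name} -> {:?}", loader.classify(name));
    }
}
```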

📚 Citations

We did not invent the kernels we ship. We picked the right ideas from the right papers, fused them together, and tuned them for one chip until they pinned the bandwidth ceiling. Atlas owes a direct intellectual debt to:

FlashAttention-2 — Tri Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR 2024. arXiv:2307.08691 — tiled online softmax, Q/K/V SMEM staging, causal masking. Foundation of our prefill kernel.
FlashAttention-4 — Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao. FlashAttention-4: Taming the Hardware. 2025. arXiv:2603.05451 — conditional softmax rescaling and software polynomial sw_exp (3 FMA + ldexpf instead of going through the SFU). Both shipped in our GQA-fused paged Flash Attention.
FlashInfer — Ye, Chen, Lai, Zhao, Zheng, Shao, Hou, Jin, Zuo, Yin, Chen, Ceze. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. MLSys 2025 (Best Paper). arXiv:2501.01005 — block-sparse paged KV cache, page index prefetch to SMEM, the gather-SMEM-MMA pattern for scattered pages. Informed our paged attention design.
SageAttention 3 — Zhang, Huang, Zhang, Wei, Zhu, Chen. SageAttention3: Microscaling FP4 Attention on Blackwell GPUs. NeurIPS 2025 Spotlight. arXiv:2505.11594 — FP4 attention with FP8 per-block microscales. On the SM121 roadmap once silicon-level FP4 MMA arrives upstream.
LeanAttention — Roy, Vassilieva, Willke, Mendis. LeanAttention: Hardware-Aware Scalable Attention for LLM Inference. 2024. arXiv:2405.10480 — stream-K tile scheduling for near-100% SM occupancy in split-K decode attention. Planned next.

If you wrote one of these papers and you spot a misattribution or a wrong technique credit on our side, open an issue. We would rather be corrected than wrong.

⚖️ License and Enterprise Edition

Atlas operates under a dual-license model. Both are real, both are intentional, and neither is a teaser for the other.

Community Edition: AGPLv3. Free, open, copyleft. Use it to run inference on your own hardware, or for research, hobby projects, side projects, or hosted demos, to name a few examples. If you want to make money from Atlas, purchase a commercial license.
Enterprise Edition: commercial license. If you need to ship Atlas inside a closed-source product, run it as a SaaS backend without inheriting the AGPLv3 source-disclosure obligation, or simply want a support relationship with the people who wrote the kernels, contact sales. Enterprise customers also receive prioritized model and hardware ports.

This split exists for a single reason: the commercial license keeps us building Atlas full-time, and the AGPL community license keeps the project honest. What is in this repository is what we run.
