Home

forge-infer

A minimal LLM inference server with a real paged KV-cache, continuous batching and speculative decoding. The model is a deterministic stand-in so the serving systems can be read, tested and benchmarked without a GPU.

Thirty-second orientation

A request arrives over HTTP, is tokenised, and joins a waiting queue. An engine loop runs one decode iteration at a time: a scheduler admits what fits, reserves KV blocks, preempts under pressure, runs a forward pass through the model, and retires finished sequences. Tokens stream back as Server-Sent Events. Three systems do the real work; the model is a one-method trait you swap out.

```mermaid
flowchart LR
    C["HTTP client"] --> SRV["axum server<br/>src/server.rs"]
    SRV --> ENG["Engine loop<br/>src/engine.rs"]
    ENG --> SCH["Scheduler<br/>src/scheduler.rs"]
    SCH --> KV["PagedKVCache<br/>src/paged_cache.rs"]
    ENG --> M["Model trait<br/>src/model.rs"]
    SRV --> TK["Tokeniser<br/>src/tokeniser.rs"]
    ENG -. "single stream" .-> SP["SpeculativeDecoder<br/>src/speculative.rs"]

    classDef sky fill:#0d1117,stroke:#38bdf8,color:#f5f7fa;
    classDef cyan fill:#0d1117,stroke:#22d3ee,color:#f5f7fa;
    classDef em fill:#0d1117,stroke:#34d399,color:#f5f7fa;
    class C,SRV,TK sky;
    class ENG,SCH,KV cyan;
    class M,SP em;
```
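
The iteration described in the orientation paragraph (admit what fits, reserve KV blocks, preempt under pressure, forward, retire) can be sketched in a few dozen lines. This is a toy under stated assumptions, not the code in `src/engine.rs` or `src/scheduler.rs`: the `Seq` and `Engine` types, their fields, `blocks_for`, and the hash used in `forward` are all hypothetical names chosen for illustration, and blocks are tracked as a bare counter rather than a block table. It also omits guards a real scheduler needs, such as rejecting a sequence that can never fit in the cache.

```rust
use std::collections::VecDeque;

/// Hypothetical per-request state; field names are illustrative only.
#[derive(Debug, Clone)]
struct Seq {
    id: u32,
    tokens: Vec<u32>, // prompt followed by generated tokens
    generated: usize,
    max_tokens: usize,
    blocks: usize, // KV blocks currently reserved for this sequence
}

/// Hypothetical engine state: two queues and a free-block counter.
struct Engine {
    waiting: VecDeque<Seq>,
    running: Vec<Seq>,
    free_blocks: usize,
    block_size: usize,
}

/// Number of fixed-size blocks needed to hold `len` tokens.
fn blocks_for(len: usize, block_size: usize) -> usize {
    (len + block_size - 1) / block_size
}

/// Stand-in forward pass: the next token is a deterministic hash of the context.
fn forward(tokens: &[u32]) -> u32 {
    tokens.iter().fold(17u32, |h, &t| h.wrapping_mul(31).wrapping_add(t)) % 256
}

impl Engine {
    fn new(total_blocks: usize, block_size: usize) -> Self {
        Engine { waiting: VecDeque::new(), running: Vec::new(), free_blocks: total_blocks, block_size }
    }

    fn submit(&mut self, id: u32, prompt: Vec<u32>, max_tokens: usize) {
        self.waiting.push_back(Seq { id, tokens: prompt, generated: 0, max_tokens, blocks: 0 });
    }

    /// One decode iteration; returns the sequences that finished in it.
    fn step(&mut self) -> Vec<Seq> {
        // 1. Admit waiting sequences, in order, while their blocks fit.
        while let Some(front) = self.waiting.front() {
            let need = blocks_for(front.tokens.len() + 1, self.block_size);
            if need > self.free_blocks {
                break;
            }
            let mut seq = self.waiting.pop_front().unwrap();
            self.free_blocks -= need;
            seq.blocks = need;
            self.running.push(seq);
        }
        // 2. Reserve room for the next token; under pressure, preempt the most
        //    recently admitted sequence and return its blocks to the pool.
        let mut i = 0;
        while i < self.running.len() {
            let need = blocks_for(self.running[i].tokens.len() + 1, self.block_size);
            if need > self.running[i].blocks {
                if self.free_blocks == 0 {
                    let mut victim = self.running.pop().unwrap();
                    self.free_blocks += victim.blocks;
                    victim.blocks = 0;
                    self.waiting.push_front(victim);
                    continue; // re-check slot i, which may itself have been the victim
                }
                self.free_blocks -= 1;
                self.running[i].blocks += 1;
            }
            i += 1;
        }
        // 3. Forward pass: one new token per running sequence.
        for seq in self.running.iter_mut() {
            let next = forward(&seq.tokens);
            seq.tokens.push(next);
            seq.generated += 1;
        }
        // 4. Retire finished sequences and free their blocks.
        let mut finished = Vec::new();
        let mut still_running = Vec::new();
        for seq in self.running.drain(..) {
            if seq.generated >= seq.max_tokens {
                self.free_blocks += seq.blocks;
                finished.push(seq);
            } else {
                still_running.push(seq);
            }
        }
        self.running = still_running;
        finished
    }
}
```

Driving the sketch means calling `submit` once per request and then `step` in a loop, collecting whatever each call returns. Because a short request retires as soon as it hits its limit, its blocks are available to the next waiting request on the following iteration, which is the point of scheduling at iteration granularity.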

The pages

This wiki is the complete documentation set. Start with Architecture, then read the subsystem page for whatever you came to understand.

Orientation and design

Page	What it covers
Architecture	Module map, request lifecycle, the split that makes the policy testable, concurrency model.
Design-Decisions	The roads not taken: hash model, recompute preemption, the StepPlan split, and more, each with the rejected alternative.
Comparisons	forge-infer set honestly against vLLM, TGI, llama.cpp and the typical tutorial.
Roadmap-and-Limitations	What I will add, what I will not, and what this is not.

The subsystems

Page	What it covers
Paged-KV-Cache	Block allocation, why it kills fragmentation, the transactional `append`, a worked allocation.
Continuous-Batching	Iteration-level scheduling, admission, preemption and resume, with the exact decision rules.
Speculative-Decoding	Draft-then-verify, the rejection-sampling acceptance test, the exactness proof-by-test.
Engine	The loop that joins the scheduler to the model, step by step, with its two entry points.
Model-and-Tokeniser	The `Model` trait, the deterministic `TinyTransformer`, the byte-level codec and its data format.
HTTP-Server	The axum endpoints, the OpenAI-compatible shape, the SSE streaming protocol.

Reference and operations

Page	What it covers
API-Reference	Every public Rust symbol and the full HTTP surface, grouped by module.
Configuration-and-Tuning	Every knob, what it trades off, and how to set it for a goal.
Benchmarks	What forge-bench measures, real numbers on Apple Silicon, how to read them honestly.
Testing-Strategy	The 37 tests, how they are organised, and why determinism makes the hard parts testable.
Security-Model	The threat model, what is defended and what is not, and a hardening checklist.
Writing-a-Model-Backend	Implementing `Model` against real weights and wiring the KV-cache through.
Examples-and-Recipes	Copy-paste recipes over HTTP and as a Rust library.
FAQ	Short answers to the questions people actually ask.
Troubleshooting	Concrete symptoms and fixes for build and runtime issues.

Try it now

```sh
cargo test                                   # 37 tests across the serving stack
cargo run --release --bin forge-infer        # serves 127.0.0.1:8080
cargo run --release --bin forge-bench        # prints the throughput table

curl -s localhost:8080/generate -d '{"prompt":"hello forge","max_tokens":24}'
curl -sN localhost:8080/v1/completions -d '{"prompt":"stream me","max_tokens":20,"stream":true}'
```

The one simplification, stated plainly

Every component a real engine has is present and named the way the literature names it: a Model trait, a PagedKVCache with a block table, a Scheduler that returns a StepPlan each iteration, a SpeculativeDecoder with an acceptance test. The only thing faked is the model itself, a deterministic hash-based TinyTransformer. That keeps the build fast, the speculative acceptance test assertable, and every benchmark reproducible. To serve real text you implement Model::forward against your weights; nothing else in the stack changes. The reasoning is on Design-Decisions, and the how-to is on Writing-a-Model-Backend.
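
To make the "one-method trait" idea concrete, here is a minimal sketch of what such a seam could look like. The signature is an assumption: the real `Model` trait in `src/model.rs` may take different arguments (for instance, handles into the KV-cache), and `HashModel` and `greedy` are hypothetical names standing in for `TinyTransformer` and the engine's sampling step. The hash is FNV-1a, picked here only because it is short and deterministic.

```rust
/// Hypothetical shape of a one-method model trait. The real trait in
/// src/model.rs may differ; this signature is an assumption.
trait Model {
    /// Return one logit per vocabulary entry for the next token.
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Hypothetical deterministic stand-in: the context is hashed and the
/// hash picks a single winning token.
struct HashModel {
    vocab: usize,
}

impl Model for HashModel {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        // FNV-1a over the little-endian bytes of every context token.
        let mut h: u64 = 0xcbf29ce484222325;
        for t in tokens {
            for b in t.to_le_bytes() {
                h ^= b as u64;
                h = h.wrapping_mul(0x100000001b3);
            }
        }
        let winner = (h % self.vocab as u64) as usize;
        let mut logits = vec![0.0_f32; self.vocab];
        logits[winner] = 1.0;
        logits
    }
}

/// Greedy decoding against any `Model`: the caller never needs to know
/// whether the logits came from a hash or from real weights.
fn greedy<M: Model>(model: &M, prompt: &[u32], max_tokens: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    for _ in 0..max_tokens {
        let logits = model.forward(&ctx);
        // Index of the largest logit; the first maximum wins on ties.
        let mut best = 0;
        for (i, &l) in logits.iter().enumerate() {
            if l > logits[best] {
                best = i;
            }
        }
        ctx.push(best as u32);
    }
    ctx[prompt.len()..].to_vec()
}
```

Calling `greedy(&HashModel { vocab: 256 }, &[1, 2, 3], 8)` twice yields the same eight tokens, which is the property that makes an acceptance test assertable. Swapping in real weights would mean writing another `impl Model` and leaving `greedy` untouched.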

SarmaLinux . sarmalinux.com . forge-infer repository
