Home
A minimal LLM inference server with a real paged KV-cache, continuous batching and speculative decoding. The model is a deterministic stand-in so the serving systems can be read, tested and benchmarked without a GPU.
A request arrives over HTTP, is tokenised, and joins a waiting queue. An engine loop runs one decode iteration at a time: a scheduler admits what fits, reserves KV blocks, preempts under pressure, runs a forward pass through the model, and retires finished sequences. Tokens stream back as Server-Sent Events. Three systems do the real work; the model is a one-method trait you swap out.
```mermaid
flowchart LR
    C["HTTP client"] --> SRV["axum server<br/>src/server.rs"]
    SRV --> ENG["Engine loop<br/>src/engine.rs"]
    ENG --> SCH["Scheduler<br/>src/scheduler.rs"]
    SCH --> KV["PagedKVCache<br/>src/paged_cache.rs"]
    ENG --> M["Model trait<br/>src/model.rs"]
    SRV --> TK["Tokeniser<br/>src/tokeniser.rs"]
    ENG -. "single stream" .-> SP["SpeculativeDecoder<br/>src/speculative.rs"]
    classDef sky fill:#0d1117,stroke:#38bdf8,color:#f5f7fa;
    classDef cyan fill:#0d1117,stroke:#22d3ee,color:#f5f7fa;
    classDef em fill:#0d1117,stroke:#34d399,color:#f5f7fa;
    class C,SRV,TK sky;
    class ENG,SCH,KV cyan;
    class M,SP em;
```
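The admit, reserve, preempt and retire steps of the engine loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not forge-infer's code: only the names `Scheduler` and `StepPlan` come from this page. The `Seq` type, the fields, `plan_step`, `commit_step`, the block size of 4 and the "preempt the newest sequence first" policy are all hypothetical. The forward pass would run between `plan_step` and `commit_step`.

```rust
use std::collections::VecDeque;

// Tokens per KV block (assumed value, for illustration only).
const BLOCK_SIZE: usize = 4;

// Hypothetical sequence record: tokens held so far and the stop length.
#[derive(Debug)]
struct Seq {
    id: u32,
    len: usize,
    max_len: usize,
}

// What one iteration will do. The fields are assumptions.
#[derive(Debug, Default)]
struct StepPlan {
    run: Vec<u32>,       // sequences that decode this iteration
    preempted: Vec<u32>, // sequences pushed back to the waiting queue
}

fn blocks_for(tokens: usize) -> usize {
    (tokens + BLOCK_SIZE - 1) / BLOCK_SIZE
}

struct Scheduler {
    total_blocks: usize,
    waiting: VecDeque<Seq>,
    running: Vec<Seq>,
}

impl Scheduler {
    // Blocks needed for every running sequence to grow by one token.
    fn reserved(&self) -> usize {
        self.running.iter().map(|s| blocks_for(s.len + 1)).sum()
    }

    // Decide what runs next: preempt under pressure, otherwise admit what fits.
    fn plan_step(&mut self) -> StepPlan {
        let mut plan = StepPlan::default();

        // Preempt the most recently admitted sequences until the rest fit.
        while self.reserved() > self.total_blocks {
            let victim = self
                .running
                .pop()
                .expect("reserved() > 0 implies a running sequence");
            plan.preempted.push(victim.id);
            self.waiting.push_front(victim); // resumes first later on
        }

        // Admit in FIFO order, but not in an iteration that had to preempt.
        if plan.preempted.is_empty() {
            while let Some(need) = self.waiting.front().map(|s| blocks_for(s.len + 1)) {
                if self.reserved() + need > self.total_blocks {
                    break;
                }
                let seq = self.waiting.pop_front().expect("front() was Some");
                self.running.push(seq);
            }
        }

        plan.run = self.running.iter().map(|s| s.id).collect();
        plan
    }

    // After the forward pass: every running sequence gained one token.
    // Returns the ids of sequences that reached max_len and were retired.
    fn commit_step(&mut self) -> Vec<u32> {
        let mut finished = Vec::new();
        for s in &mut self.running {
            s.len += 1;
        }
        self.running.retain(|s| {
            let done = s.len >= s.max_len;
            if done {
                finished.push(s.id);
            }
            !done
        });
        finished
    }
}
```

Keeping the decision (`plan_step`) separate from its effects is what lets a policy like this be tested without a model or a server. The sketch does not handle a waiting sequence that can never fit; the real decision rules are on Continuous-Batching.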
This wiki is the complete documentation set. Start with Architecture, then the subsystem page for whatever you came to understand.
Design

| Page | What it covers |
|---|---|
| Architecture | Module map, request lifecycle, the split that makes the policy testable, concurrency model. |
| Design-Decisions | The roads not taken: hash model, recompute preemption, the StepPlan split, and more, each with the rejected alternative. |
| Comparisons | forge-infer set honestly against vLLM, TGI, llama.cpp and the typical tutorial. |
| Roadmap-and-Limitations | What I will add, what I will not, and what this is not. |

Subsystems

| Page | What it covers |
|---|---|
| Paged-KV-Cache | Block allocation, why it kills fragmentation, the transactional append, a worked allocation. |
| Continuous-Batching | Iteration-level scheduling, admission, preemption and resume, with the exact decision rules. |
| Speculative-Decoding | Draft-then-verify, the rejection-sampling acceptance test, the exactness proof-by-test. |
| Engine | The loop that joins the scheduler to the model, step by step, with its two entry points. |
| Model-and-Tokeniser | The Model trait, the deterministic TinyTransformer, the byte-level codec and its data format. |
| HTTP-Server | The axum endpoints, the OpenAI-compatible shape, the SSE streaming protocol. |
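
For a first feel of what "block table" and "transactional append" mean before opening Paged-KV-Cache, here is a minimal sketch. It is an illustration under assumptions, not the project's `PagedKVCache`: the `PagedCache` type, its fields and the `append` and `release` signatures are hypothetical. It tracks only block ownership, not the cached tensors.

```rust
use std::collections::HashMap;

// Hypothetical paged cache: fixed-size physical blocks handed out on demand.
struct PagedCache {
    block_size: usize,
    free: Vec<usize>,                 // free physical block ids
    tables: HashMap<u32, Vec<usize>>, // sequence id -> block table
    lens: HashMap<u32, usize>,        // sequence id -> tokens stored
}

impl PagedCache {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        Self {
            block_size,
            // Reversed so that pop() hands out block 0 first.
            free: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lens: HashMap::new(),
        }
    }

    // Append `n` tokens to `seq`. Transactional: either every block that is
    // needed gets reserved, or nothing changes and the shortfall is returned.
    fn append(&mut self, seq: u32, n: usize) -> Result<(), usize> {
        let len = self.lens.get(&seq).copied().unwrap_or(0);
        let have = self.tables.get(&seq).map_or(0, |t| t.len());
        let need = (len + n + self.block_size - 1) / self.block_size;
        let extra = need.saturating_sub(have);

        // Check before mutating, so a failed append leaves no partial state.
        if extra > self.free.len() {
            return Err(extra - self.free.len());
        }

        let table = self.tables.entry(seq).or_default();
        for _ in 0..extra {
            table.push(self.free.pop().expect("capacity checked above"));
        }
        self.lens.insert(seq, len + n);
        Ok(())
    }

    // Return all of a sequence's blocks to the free list.
    fn release(&mut self, seq: u32) {
        if let Some(blocks) = self.tables.remove(&seq) {
            self.free.extend(blocks);
        }
        self.lens.remove(&seq);
    }
}
```

Because a sequence's blocks need not be adjacent, any free block can serve any sequence, so the only wasted space is the unused tail of each sequence's last block.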

Reference and ops

| Page | What it covers |
|---|---|
| API-Reference | Every public Rust symbol and the full HTTP surface, grouped by module. |
| Configuration-and-Tuning | Every knob, what it trades off, and how to set it for a goal. |
| Benchmarks | What forge-bench measures, real numbers on Apple Silicon, how to read them honestly. |
| Testing-Strategy | The 37 tests, how they are organised, and why determinism makes the hard parts testable. |
| Security-Model | The threat model, what is defended and what is not, and a hardening checklist. |
| Writing-a-Model-Backend | Implementing Model against real weights and wiring the KV-cache through. |
| Examples-and-Recipes | Copy-paste recipes over HTTP and as a Rust library. |
| FAQ | Short answers to the questions people actually ask. |
| Troubleshooting | Concrete symptoms and fixes for build and runtime issues. |
```bash
cargo test                               # 37 tests across the serving stack
cargo run --release --bin forge-infer    # serves 127.0.0.1:8080
cargo run --release --bin forge-bench    # prints the throughput table
```

```bash
curl -s localhost:8080/generate -d '{"prompt":"hello forge","max_tokens":24}'
curl -sN localhost:8080/v1/completions -d '{"prompt":"stream me","max_tokens":20,"stream":true}'
```

Every component a real engine has is present and named the way the literature names it: a Model trait, a PagedKVCache with a block table, a Scheduler that returns a StepPlan each iteration, a SpeculativeDecoder with an acceptance test. The only thing faked is the model itself, a deterministic hash-based TinyTransformer. That keeps the build fast, the speculative acceptance test assertable, and every benchmark reproducible. To serve real text you implement Model::forward against your weights; nothing else in the stack changes. The reasoning is on Design-Decisions, and the how-to is on Writing-a-Model-Backend.
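
To make "a one-method trait with a deterministic hash-based stand-in" concrete, here is a minimal sketch. The signature of `forward` is an assumption (the real one is on Model-and-Tokeniser), and `HashModel` and `generate` are hypothetical names, not the project's TinyTransformer. The hash is plain FNV-1a, chosen here only because it is short and deterministic.

```rust
// Assumed shape of the trait: one method, tokens in, next token out.
// The real Model::forward may take and return different types.
trait Model {
    fn forward(&self, tokens: &[u32]) -> u32;
}

// Hypothetical stand-in: the "prediction" is a hash of the context.
struct HashModel {
    vocab: u32,
}

impl Model for HashModel {
    fn forward(&self, tokens: &[u32]) -> u32 {
        // 64-bit FNV-1a over the little-endian bytes of each token id.
        let mut h: u64 = 0xcbf29ce484222325;
        for &t in tokens {
            for b in t.to_le_bytes() {
                h ^= b as u64;
                h = h.wrapping_mul(0x100000001b3);
            }
        }
        (h % self.vocab as u64) as u32
    }
}

// Greedy decode loop: the same prompt always yields the same tokens.
fn generate(model: &dyn Model, prompt: &[u32], max_tokens: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_tokens {
        let next = model.forward(&tokens);
        tokens.push(next);
    }
    tokens[prompt.len()..].to_vec()
}
```

Since the output depends only on the input tokens, a test can assert exact sequences, which is the property that makes the speculative acceptance test and the benchmarks repeatable. A backend with real weights would implement the same trait and leave callers such as `generate` unchanged.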
SarmaLinux . sarmalinux.com . forge-infer repository