Architecture
forge-infer is organised so each serving technique lives in its own module with its own tests, and the model is a swappable trait. This page is the map: the modules, the request lifecycle, the one design decision that makes the whole policy testable, and the concurrency model.
| Module | File | Key symbols | Responsibility |
|---|---|---|---|
| model | src/model.rs | Model, TinyTransformer, StepLogits | The model trait and the deterministic stand-in. |
| tokeniser | src/tokeniser.rs | encode, decode, decode_one, EOS_TOKEN | Reversible byte-level codec. |
| paged_cache | src/paged_cache.rs | PagedKVCache, BlockTable, CacheError | Block-based KV allocation and block tables. |
| scheduler | src/scheduler.rs | Scheduler::schedule, StepPlan, Sequence, SeqState | Continuous batching, admission, preemption. |
| speculative | src/speculative.rs | SpeculativeDecoder::step, SpeculationResult | Draft-then-verify decoding with an acceptance test. |
| engine | src/engine.rs | Engine::step, Engine::run_to_completion | The loop joining the scheduler to the model. |
| server | src/server.rs | router, AppState, completions | The axum HTTP server and OpenAI-compatible endpoint. |
```mermaid
sequenceDiagram
    participant C as Client
    participant H as axum handler
    participant T as Tokeniser
    participant Sc as Scheduler
    participant K as PagedKVCache
    participant M as Model
    C->>H: POST /v1/completions {prompt, max_tokens, stream}
    H->>T: encode(prompt)
    T-->>H: token ids
    H->>Sc: submit(Sequence)
    loop one iteration per decode step
        Sc->>Sc: schedule() builds a StepPlan
        Sc->>K: admit + reserve blocks
        alt blocks exhausted
            Sc->>K: free blocks of least-progressed sequence
            Sc->>Sc: preempt and requeue
        end
        Sc->>M: forward(context) for each id in decode_batch
        M-->>Sc: logits, then argmax token
        Sc->>Sc: push_token, retire on eos or limit
    end
    H->>T: decode(tokens)
    H-->>C: SSE stream of token deltas, then [DONE]
```
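The admit and preempt branch of the lifecycle can be sketched in a few lines. This is a minimal illustration, not the code in src/paged_cache.rs or src/scheduler.rs: `Seq`, `blocks_needed`, `can_admit` and `pick_victim` are hypothetical names, and "least-progressed" is assumed here to mean the running sequence with the fewest generated tokens.

```rust
/// Hypothetical, cut-down view of a running sequence.
struct Seq {
    id: usize,
    generated: usize, // tokens produced so far
}

/// Blocks needed to hold `tokens` tokens at `block_size` tokens per block
/// (ceiling division; `block_size` is assumed to be non-zero).
fn blocks_needed(tokens: usize, block_size: usize) -> usize {
    (tokens + block_size - 1) / block_size
}

/// Admission check: the prompt is admitted only if all of its blocks fit.
fn can_admit(free_blocks: usize, prompt_len: usize, block_size: usize) -> bool {
    blocks_needed(prompt_len, block_size) <= free_blocks
}

/// Picks the preemption victim: the sequence with the fewest generated tokens.
/// Ties resolve to the earliest entry in `running`.
fn pick_victim(running: &[Seq]) -> Option<usize> {
    running.iter().min_by_key(|s| s.generated).map(|s| s.id)
}
```

With a block size of 16, a 17-token prompt needs two blocks, so `can_admit(1, 17, 16)` is false and the request stays in the waiting set.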
The scheduler and the engine are deliberately split. Scheduler::schedule (src/scheduler.rs) mutates only the cache and the waiting and running sets, and returns its decisions as a plain StepPlan value. Engine::step (src/engine.rs) is the only thing that ever calls Model::forward. Two payoffs:
- The scheduling policy is unit-testable without a model or a GPU. The tests in src/scheduler.rs construct a scheduler, submit sequences, call schedule(), and assert directly on the returned StepPlan and on cache block counts. preempts_when_blocks_run_out and admission_blocks_when_prompt_does_not_fit run with no forward pass at all.
- The model is swappable. Because the engine talks to a Model trait, replacing TinyTransformer with a real backend is a matter of implementing one method, forward(&self, context: &[TokenId]) -> StepLogits.
The cost of the split is one extra indirection in the engine loop: schedule() returns the batch, then step() walks plan.decode_batch calling the model. I pay that every time for a policy I can test in isolation. The tempting alternative, a single tick() that schedules and runs the forward pass together, would tangle the two and force every scheduler test to stand up a model.
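The shape of the split can be shown with a stripped-down sketch. The types below are simplified stand-ins, not the real definitions: `SeqId`, the `max_running` budget (standing in for the block budget) and the free function `step` are assumptions made for illustration. The point is that `schedule()` returns plain data, so it can be asserted on without any model in scope.

```rust
// Hypothetical, simplified types; the real ones live in src/scheduler.rs and src/engine.rs.
type SeqId = usize;

/// Plain-data result of one scheduling pass: no model handle inside.
#[derive(Debug, PartialEq)]
struct StepPlan {
    decode_batch: Vec<SeqId>,
}

struct Scheduler {
    waiting: Vec<SeqId>,
    running: Vec<SeqId>,
    max_running: usize, // stand-in for the cache block budget
}

impl Scheduler {
    /// Admits waiting sequences while there is room, then reports who decodes this step.
    fn schedule(&mut self) -> StepPlan {
        while self.running.len() < self.max_running && !self.waiting.is_empty() {
            let id = self.waiting.remove(0);
            self.running.push(id);
        }
        StepPlan { decode_batch: self.running.clone() }
    }
}

/// The only place that touches the model: walks the plan and calls `forward` per sequence.
fn step(plan: &StepPlan, forward: impl Fn(SeqId) -> u32) -> Vec<(SeqId, u32)> {
    plan.decode_batch.iter().map(|&id| (id, forward(id))).collect()
}
```

A scheduler test then needs only a `Scheduler` value and an assertion on the returned plan; the closure passed to `step` is where a real model would be plugged in.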
```rust
pub trait Model: Send + Sync {
    fn vocab_size(&self) -> usize;
    fn num_layers(&self) -> usize;
    fn eos_token(&self) -> TokenId;
    fn forward(&self, context: &[TokenId]) -> StepLogits;
}
```

StepLogits carries the next-token logits and offers argmax for greedy decoding and prob_of(token) for the speculative acceptance test. TinyTransformer::forward maps a trailing eight-token window through a splitmix-style hash (hash_context) to a peaked logits distribution. It is a pure function of the context, which is exactly what lets the cache, the scheduler and the decoder be verified deterministically. The draft variant from TinyTransformer::draft shifts its peak on a deterministic quarter of contexts, so a draft and a target agree on most tokens and disagree on some, the realistic regime for speculation.
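To make the trait concrete, here is a minimal sketch of a custom backend. Several things are assumptions: `TokenId` is taken to be `u32`, `StepLogits` is modelled as a wrapper around a `Vec<f32>` with `prob_of` computed as a softmax, and `SuccessorModel` is a hypothetical toy that always predicts the token after the last one. The trait is repeated so the block stands alone; the real definitions in src/model.rs may differ.

```rust
// Assumed for illustration: TokenId = u32 and StepLogits wrapping a Vec<f32>.
type TokenId = u32;

struct StepLogits {
    logits: Vec<f32>,
}

impl StepLogits {
    /// Index of the largest logit (greedy decoding). Ties go to the lowest index.
    /// Assumes a non-empty vocabulary.
    fn argmax(&self) -> TokenId {
        let mut best = 0;
        for (i, &v) in self.logits.iter().enumerate() {
            if v > self.logits[best] {
                best = i;
            }
        }
        best as TokenId
    }

    /// Softmax probability of `token`, with max-subtraction for numerical stability.
    fn prob_of(&self, token: TokenId) -> f32 {
        let max = self.logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let sum: f32 = self.logits.iter().map(|&v| (v - max).exp()).sum();
        (self.logits[token as usize] - max).exp() / sum
    }
}

trait Model: Send + Sync {
    fn vocab_size(&self) -> usize;
    fn num_layers(&self) -> usize;
    fn eos_token(&self) -> TokenId;
    fn forward(&self, context: &[TokenId]) -> StepLogits;
}

/// Hypothetical backend: always predicts (last token + 1) mod vocab.
struct SuccessorModel {
    vocab: usize,
}

impl Model for SuccessorModel {
    fn vocab_size(&self) -> usize { self.vocab }
    fn num_layers(&self) -> usize { 1 }
    fn eos_token(&self) -> TokenId { 0 }
    fn forward(&self, context: &[TokenId]) -> StepLogits {
        let last = context.last().copied().unwrap_or(0) as usize;
        let mut logits = vec![0.0_f32; self.vocab];
        logits[(last + 1) % self.vocab] = 5.0; // peak on the successor token
        StepLogits { logits }
    }
}
```

Like the stand-in described above, this toy is a pure function of its context, so anything built on top of it can be tested deterministically.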
The server shares one Arc<dyn Model> across requests so the model weights stay resident, and gives each request its own Engine with its own PagedKVCache and scheduler (AppState::new_engine in src/server.rs). That keeps the example readable, at the cost of not yet sharing one long-lived engine loop across live HTTP traffic.
A production deployment would route all requests into a single engine loop so the continuous batching scheduler interleaves decode steps across concurrent requests. That loop is exactly what Engine::step implements, and the benchmark's continuous-batching path drives 64 requests through one engine to demonstrate it. Closing that gap on the server side is the first item on the Roadmap.
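The current per-request arrangement, one shared model and one engine per request, can be sketched with the standard library alone. Everything here is a simplified assumption: the `Model` trait is cut down to a single hypothetical `next_token` method, `Echo` is a toy model, and plain threads stand in for axum handlers.

```rust
use std::sync::Arc;
use std::thread;

type TokenId = u32;

// Cut-down stand-in for the Model trait: one method is enough for the sketch.
trait Model: Send + Sync {
    fn next_token(&self, context: &[TokenId]) -> TokenId;
}

/// Toy model: predicts the last token plus one.
struct Echo;

impl Model for Echo {
    fn next_token(&self, context: &[TokenId]) -> TokenId {
        context.last().copied().unwrap_or(0) + 1
    }
}

/// Per-request state: owns its token buffer and holds a handle to the shared model.
struct Engine {
    model: Arc<dyn Model>,
    tokens: Vec<TokenId>,
}

impl Engine {
    /// Decodes `steps` tokens greedily and returns prompt plus completion.
    fn run(mut self, steps: usize) -> Vec<TokenId> {
        for _ in 0..steps {
            let next = self.model.next_token(&self.tokens);
            self.tokens.push(next);
        }
        self.tokens
    }
}

/// Serves each prompt on its own thread with its own Engine and one shared model.
fn serve(prompts: Vec<Vec<TokenId>>, steps: usize) -> Vec<Vec<TokenId>> {
    let model: Arc<dyn Model> = Arc::new(Echo);
    let handles: Vec<_> = prompts
        .into_iter()
        .map(|p| {
            let engine = Engine { model: Arc::clone(&model), tokens: p };
            thread::spawn(move || engine.run(steps))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Cloning the `Arc` copies a pointer, not the weights, which is why the model can stay resident while engines come and go. The trade-off is the one described above: engines that never meet cannot batch their decode steps together.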
- An oversized prompt (more blocks than the cache holds) is never admitted. schedule() rolls the admit back cleanly and the request waits; admission_blocks_when_prompt_does_not_fit pins this down.
- Under block pressure the scheduler preempts. A preempted request is invisible to the client: it still completes, it just yields the engine for a while.
- run_to_completion carries a one-million-iteration guard so a pathological loop cannot hang the process.
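The iteration guard in the last point follows a common pattern, sketched below. The signature, the `RunError` type and the choice to return an error when the guard trips are all assumptions made for illustration; the source only states that a one-million-iteration guard exists, not how Engine::run_to_completion reports it.

```rust
/// Guard value taken from the text above; the constant name is hypothetical.
const MAX_ITERATIONS: usize = 1_000_000;

#[derive(Debug, PartialEq)]
enum RunError {
    IterationLimit,
}

/// Calls `step` until it reports completion (returns true) or the guard trips.
/// On success, returns the number of steps taken.
fn run_to_completion(mut step: impl FnMut() -> bool, limit: usize) -> Result<usize, RunError> {
    for i in 1..=limit {
        if step() {
            return Ok(i);
        }
    }
    Err(RunError::IterationLimit)
}
```

A step closure that never finishes, such as `|| false`, returns `Err(RunError::IterationLimit)` after `limit` calls instead of spinning forever.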
Each module has its own page: Engine, Model-and-Tokeniser, HTTP-Server, Paged-KV-Cache, Continuous-Batching and Speculative-Decoding. The choices behind the split are on Design-Decisions, and the full symbol list is on API-Reference.
SarmaLinux . sarmalinux.com . forge-infer repository