A small LLM serving control plane for learning batching, streaming, token scheduling, and cache-aware inference systems.
mini-llm-serve is a compact LLM serving system focused on the serving control plane around model execution.
It does not try to replace vLLM, TensorRT-LLM, SGLang, or llama.cpp. Instead, it isolates the scheduling and systems layer so the core serving problems are easier to study end-to-end:
- request lifecycle management
- prefill / decode separation
- token-budget-based scheduling
- streaming response delivery
- TTFT / TBT observability
- prefix cache metadata
- executor dispatch and result routing
- reproducible benchmark scenarios
The execution backend is currently a Python mock executor. The point is to make scheduler behavior visible and testable before introducing real GPU inference.
Modern LLM serving stacks are powerful, but they are also large and difficult to understand from first principles.
This project takes the opposite approach:
- small enough to read
- real enough to expose serving tradeoffs
- structured enough to extend toward production-style components
The design goal is not a "toy demo". It is a minimal, runnable model of the control plane behind LLM inference serving.
| Area | What exists today |
|---|---|
| API | Connect RPC inference service and admin/metrics endpoints |
| Control plane | Go request lifecycle, scheduler, executor manager, metrics |
| Execution backend | Python mock LLM executor over Connect RPC |
| Scheduling | prefill/decode separation, token budget, small/large prefill queues |
| Streaming | unary and server-streaming generation paths |
| Observability | Prometheus metrics, runtime stats, TTFT, TBT, queue wait, batch size |
| Cache model | prefix cache metadata with hit/miss and saved-token metrics |
| Benchmarks | cache miss, cache hit, mixed prompt workloads |
Mini LLM Serve uses token-aware work scheduling. A request is represented as a lifecycle object, while prefill and decode are scheduled as separate work items.
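To make token-aware scheduling concrete, here is a minimal sketch of forming one batch under a token budget. It assumes a single FIFO queue and a per-item token cost; the `WorkItem` fields and the `nextBatch` function are hypothetical, and the real scheduler (which also separates small and large prefill queues) may differ.

```go
package main

import "fmt"

// Phase distinguishes the two kinds of work described above.
type Phase int

const (
	Prefill Phase = iota
	Decode
)

// WorkItem is a hypothetical schedulable unit. Field names are assumptions.
type WorkItem struct {
	RequestID string
	Phase     Phase
	Tokens    int // tokens this item would consume from the budget
}

// nextBatch takes items from the front of the queue in FIFO order until the
// next item would exceed the token budget. An item larger than the whole
// budget is admitted alone so that it cannot starve.
func nextBatch(queue []WorkItem, budget int) (batch, rest []WorkItem) {
	used, i := 0, 0
	for i < len(queue) {
		cost := queue[i].Tokens
		if len(batch) > 0 && used+cost > budget {
			break
		}
		batch = append(batch, queue[i])
		used += cost
		i++
	}
	return batch, queue[i:]
}

func main() {
	queue := []WorkItem{
		{RequestID: "a", Phase: Decode, Tokens: 1},
		{RequestID: "b", Phase: Decode, Tokens: 1},
		{RequestID: "c", Phase: Prefill, Tokens: 147},
		{RequestID: "d", Phase: Prefill, Tokens: 200},
	}
	batch, rest := nextBatch(queue, 160)
	fmt.Println("batch size:", len(batch), "remaining:", len(rest))
}
```

Decode items cost one token each in this sketch, while a prefill item costs its prompt length, so a single large prefill can consume most of a batch's budget.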
The important internal loop is:
    GenerateRequest
      -> Request
      -> WorkItem
      -> Scheduler
      -> ExecutorManager
      -> Python Mock Executor
      -> Event
      -> next WorkItem or final response
This split keeps responsibilities clear:
- `Request` owns the user-visible lifecycle.
- `WorkItem` is one schedulable unit of execution.
- `Event` drives the state machine after executor output.
- `Scheduler` chooses work by sequence and token budget.
- `ExecutorManager` dispatches batches to backend executors.
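The event-driven part of this split can be sketched as follows. This is an illustrative model only: the states, the fields, and the `advance` function are assumptions made for the example, and the executor is replaced by a loop that emits one token per step.

```go
package main

import "fmt"

// State is a hypothetical request lifecycle state.
type State int

const (
	Prefilling State = iota
	Decoding
	Finished
)

// WorkKind marks a work item as prefill or decode.
type WorkKind int

const (
	PrefillWork WorkKind = iota
	DecodeWork
)

// Request owns the user-visible lifecycle (fields are assumptions).
type Request struct {
	ID        string
	State     State
	MaxTokens int
	Generated int
}

// WorkItem is one schedulable unit of execution.
type WorkItem struct {
	RequestID string
	Kind      WorkKind
}

// Event carries executor output back to the state machine.
type Event struct {
	RequestID string
	NewTokens int
}

// advance applies an Event to the Request and returns the next WorkItem,
// or nil once the request has produced all of its tokens.
func advance(r *Request, ev Event) *WorkItem {
	r.Generated += ev.NewTokens
	if r.Generated >= r.MaxTokens {
		r.State = Finished
		return nil
	}
	r.State = Decoding
	return &WorkItem{RequestID: r.ID, Kind: DecodeWork}
}

func main() {
	r := &Request{ID: "req-1", State: Prefilling, MaxTokens: 3}
	item := &WorkItem{RequestID: r.ID, Kind: PrefillWork}
	steps := 0
	for item != nil {
		// Stand-in for the mock executor: every work item yields one token.
		item = advance(r, Event{RequestID: item.RequestID, NewTokens: 1})
		steps++
	}
	fmt.Println("steps:", steps, "finished:", r.State == Finished)
}
```

Keeping the transition logic in one function like `advance` means the scheduler and executor manager never need to know how many tokens a request still owes.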
The benchmark uses a Python mock executor, so the results should be read as control-plane behavior, not GPU inference performance.
Workload:
- `1000` requests per scenario
- `100` concurrency
- Go server + Python mock executor
- metrics computed as per-run deltas
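Computing metrics as per-run deltas means snapshotting cumulative counters before and after a scenario and subtracting, so that earlier runs against the same server do not leak into the result. A minimal sketch, assuming monotonically increasing counters; the `Counters` type and its fields are hypothetical, and the "before" values are made up for the example:

```go
package main

import "fmt"

// Counters is a hypothetical snapshot of cumulative server counters.
type Counters struct {
	Requests    uint64
	PrefixHits  uint64
	TokensSaved uint64
}

// Delta returns after minus before, field by field. It assumes the counters
// only ever increase and were not reset between the two snapshots.
func Delta(before, after Counters) Counters {
	return Counters{
		Requests:    after.Requests - before.Requests,
		PrefixHits:  after.PrefixHits - before.PrefixHits,
		TokensSaved: after.TokensSaved - before.TokensSaved,
	}
}

func main() {
	// Snapshot taken after an earlier run with no cache hits (illustrative).
	before := Counters{Requests: 1000, PrefixHits: 0, TokensSaved: 0}
	// Snapshot taken after a second run of 1000 requests that all hit the cache.
	after := Counters{Requests: 2000, PrefixHits: 1000, TokensSaved: 147000}
	fmt.Printf("%+v\n", Delta(before, after))
}
```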
| Scenario | Throughput | Avg Latency | Avg TTFT | Avg TBT | Prefix Hits | Tokens Saved |
|---|---|---|---|---|---|---|
| `cache_miss` | 3.28 req/s | 30.502s | 1.7322s | 0.4109s | 0 | 0 |
| `cache_hit` | 4.10 req/s | 24.341s | 0.3250s | 0.3430s | 1000 | 147000 |
| `mixed_prompt` | 4.22 req/s | 23.682s | 1.2117s | 0.3209s | 0 | 0 |
Key observation:
Prefix cache metadata reduced average TTFT from `1.7322s` to `0.3250s`, about an `81%` reduction in this mock workload.
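In the `cache_hit` scenario, 147000 saved tokens over 1000 hits works out to 147 prompt tokens skipped per request. The sketch below shows one way such hit, miss, and saved-token accounting could work using metadata only, with no KV tensors stored. It is an assumption-laden illustration: the `PrefixCache` type, the string key, and the longest-prefix scan are not taken from the project's code.

```go
package main

import "fmt"

// PrefixCache is a hypothetical metadata-only prefix cache. It records which
// token prefixes have been seen and keeps hit/miss/saved-token counters.
type PrefixCache struct {
	entries     map[string]struct{}
	Hits        int
	Misses      int
	TokensSaved int
}

// NewPrefixCache returns an empty cache.
func NewPrefixCache() *PrefixCache {
	return &PrefixCache{entries: make(map[string]struct{})}
}

// prefixKey turns a token prefix into a map key. A real system would more
// likely hash fixed-size blocks; this keeps the sketch short.
func prefixKey(tokens []int) string { return fmt.Sprint(tokens) }

// Insert records a token sequence as a cached prefix.
func (c *PrefixCache) Insert(tokens []int) {
	c.entries[prefixKey(tokens)] = struct{}{}
}

// Lookup finds the longest cached prefix of tokens and returns how many
// prompt tokens the prefill step could skip. It scans from longest to
// shortest, which is quadratic but easy to read.
func (c *PrefixCache) Lookup(tokens []int) int {
	for n := len(tokens); n > 0; n-- {
		if _, ok := c.entries[prefixKey(tokens[:n])]; ok {
			c.Hits++
			c.TokensSaved += n
			return n
		}
	}
	c.Misses++
	return 0
}

func main() {
	c := NewPrefixCache()
	c.Insert([]int{1, 2, 3}) // e.g. a shared system prompt

	saved := c.Lookup([]int{1, 2, 3, 4, 5}) // shares the cached prefix
	missed := c.Lookup([]int{9, 9})         // no cached prefix
	fmt.Println("saved:", saved, "missed:", missed)
	fmt.Println("hits:", c.Hits, "misses:", c.Misses, "tokens saved:", c.TokensSaved)
}
```

A scheduler could subtract the returned count from a prefill item's token cost, which is how a metadata hit would translate into lower TTFT.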
Detailed benchmark notes and reports are available under docs.
    cd llm_serve
    make run

The executor listens on 127.0.0.1:19991 by default.

    make run

Default endpoints:
- inference service: `127.0.0.1:8800`
- admin / metrics: `127.0.0.1:8801`

    curl http://127.0.0.1:8801/metrics

    make bench-smoke
    make bench-cache-miss
    make bench-cache-hit
    make bench-mixed-prompt

Override benchmark parameters directly through the CLI:

    go run ./cmd/bench --mode mixed_prompt --requests 1000 --concurrency 50 --timeout-ms 15000

Repository layout:

    cmd/
      bench/        benchmark CLI
      client/       simple client wrapper
      server/       Go serving process
    internal/
      cache/        prefix cache metadata
      executor/     executor manager and Connect backend
      handler/      request admission
      metrics/      Prometheus metrics and runtime stats
      model/        Request, WorkItem, Event, Batch
      scheduler/    token-budget scheduler and queues
      state/        request lifecycle state machine
      transport/    Connect RPC transport handlers
    llm_serve/      Python mock executor
    proto/          protobuf API definitions
    docs/           reports, plans, benchmark notes
Detailed reports, benchmark notes, and implementation plans are available under docs.
This repository intentionally does not implement:
- real GPU kernels
- real KV block allocation
- PagedAttention or FlashAttention
- distributed multi-node inference
- production autoscaling
- full OpenAI API compatibility
Those are inference-engine or production-platform concerns. This project focuses on the serving control plane.