Mini LLM Serve

A small LLM serving control plane for learning batching, streaming, token scheduling, and cache-aware inference systems.

Chinese (中文) · Docs · Quick Start

Overview

mini-llm-serve is a compact LLM serving system focused on the serving control plane around model execution.

It does not try to replace vLLM, TensorRT-LLM, SGLang, or llama.cpp. Instead, it isolates the scheduling and systems layer so the core serving problems are easier to study end-to-end:

request lifecycle management
prefill / decode separation
token-budget-based scheduling
streaming response delivery
TTFT / TBT observability
prefix cache metadata
executor dispatch and result routing
reproducible benchmark scenarios

The execution backend is currently a Python mock executor. The point is to make scheduler behavior visible and testable before introducing real GPU inference.

Motivation

Modern LLM serving stacks are powerful, but they are also large and difficult to understand from first principles.

This project takes the opposite approach:

small enough to read
real enough to expose serving tradeoffs
structured enough to extend toward production-style components

The design goal is not a "toy demo". It is a minimal, runnable model of the control plane behind LLM inference serving.

Feature Highlights

Area	What exists today
API	Connect RPC inference service and admin/metrics endpoints
Control plane	Go request lifecycle, scheduler, executor manager, metrics
Execution backend	Python mock LLM executor over Connect RPC
Scheduling	prefill/decode separation, token budget, small/large prefill queues
Streaming	unary and server-streaming generation paths
Observability	Prometheus metrics, runtime stats, TTFT, TBT, queue wait, batch size
Cache model	prefix cache metadata with hit/miss and saved-token metrics
Benchmarks	cache miss, cache hit, mixed prompt workloads
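
To make the "Cache model" row concrete, here is a minimal sketch of what prefix cache metadata with hit/miss and saved-token counters could look like. It is not the code under internal/cache; the `PrefixCache` type, its fields, and the `Lookup` method are hypothetical names chosen for illustration, and hashing the whole prefix string with SHA-256 is an assumption.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// PrefixCache is a hypothetical, metadata-only cache: it remembers which
// prompt prefixes have been seen and how many tokens each one covers.
// No KV tensors are stored.
type PrefixCache struct {
	entries     map[[32]byte]int // prefix hash -> token count of the prefix
	Hits        int
	Misses      int
	TokensSaved int
}

func NewPrefixCache() *PrefixCache {
	return &PrefixCache{entries: make(map[[32]byte]int)}
}

// Lookup returns how many prefill tokens could be skipped for this prefix.
// On a miss it records the prefix so that later requests can hit.
func (c *PrefixCache) Lookup(prefix string, prefixTokens int) int {
	key := sha256.Sum256([]byte(prefix))
	if n, ok := c.entries[key]; ok {
		c.Hits++
		c.TokensSaved += n
		return n
	}
	c.Misses++
	c.entries[key] = prefixTokens
	return 0
}

func main() {
	cache := NewPrefixCache()
	system := "You are a helpful assistant."
	cache.Lookup(system, 147) // first request: miss, prefix recorded
	cache.Lookup(system, 147) // second request: hit, 147 tokens counted as saved
	fmt.Println(cache.Hits, cache.Misses, cache.TokensSaved)
}
```

Because only metadata is tracked, a hit changes what the scheduler and metrics see (fewer prefill tokens to account for) without any real KV reuse taking place.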

Architecture

Mini LLM Serve uses token-aware work scheduling. A request is represented as a lifecycle object, while prefill and decode are scheduled as separate work items.
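
As a rough illustration of token-aware scheduling, the sketch below fills one batch from separate decode and prefill queues until a token budget is spent. It is a simplified sketch, not the repository's scheduler: the `WorkItem` fields, the `nextBatch` function, the decode-first ordering, and the skip-if-it-does-not-fit policy are all assumptions made for this example.

```go
package main

import "fmt"

// Phase distinguishes the two kinds of schedulable work.
type Phase int

const (
	Prefill Phase = iota
	Decode
)

// WorkItem is one schedulable unit of execution for a request.
// Field names are illustrative, not the repository's actual types.
type WorkItem struct {
	RequestID string
	Phase     Phase
	Tokens    int // tokens this item would process in one step
}

// nextBatch walks the decode queue, then the prefill queue, and admits
// items while they fit in the remaining token budget. Items that do not
// fit are returned in remaining so they stay queued for a later batch.
func nextBatch(decode, prefill []WorkItem, budget int) (batch, remaining []WorkItem) {
	for _, queue := range [][]WorkItem{decode, prefill} {
		for _, item := range queue {
			if item.Tokens <= budget {
				batch = append(batch, item)
				budget -= item.Tokens
			} else {
				remaining = append(remaining, item)
			}
		}
	}
	return batch, remaining
}

func main() {
	decode := []WorkItem{{"r1", Decode, 1}, {"r2", Decode, 1}}
	prefill := []WorkItem{{"r3", Prefill, 512}, {"r4", Prefill, 64}}
	batch, remaining := nextBatch(decode, prefill, 128)
	fmt.Println(len(batch), len(remaining))
}
```

With a budget of 128 tokens, both decode items and the 64-token prefill are admitted, while the 512-token prefill waits. This is the kind of tradeoff a token budget exposes: large prefills can be deferred so that in-flight decode steps are not starved.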

The important internal loop is:

GenerateRequest
  -> Request
  -> WorkItem
  -> Scheduler
  -> ExecutorManager
  -> Python Mock Executor
  -> Event
  -> next WorkItem or final response

This split keeps responsibilities clear:

Request owns the user-visible lifecycle.
WorkItem is one schedulable unit of execution.
Event drives the state machine after executor output.
Scheduler chooses work by sequence and token budget.
ExecutorManager dispatches batches to backend executors.
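
The Request/Event relationship described above can be sketched as a small state machine. This is a hedged illustration only: the state names, the `EventKind` values, and the `Apply` method are hypothetical and are not taken from internal/state or internal/model.

```go
package main

import "fmt"

// State is the user-visible lifecycle stage of a request (illustrative).
type State int

const (
	Prefilling State = iota
	Decoding
	Finished
)

// EventKind is what the executor reports back for one WorkItem (illustrative).
type EventKind int

const (
	PrefillDone EventKind = iota
	TokenGenerated
)

// Request owns the lifecycle; WorkItems are derived from its current state.
type Request struct {
	ID        string
	State     State
	Generated int // tokens produced so far
	MaxTokens int // stop condition for this sketch
}

// Apply advances the request after an executor event and reports whether
// another WorkItem should be scheduled for it.
func (r *Request) Apply(kind EventKind) (needsMoreWork bool) {
	switch kind {
	case PrefillDone:
		r.State = Decoding
	case TokenGenerated:
		r.Generated++
	}
	if r.Generated >= r.MaxTokens {
		r.State = Finished
		return false
	}
	return true
}

func main() {
	req := &Request{ID: "r1", State: Prefilling, MaxTokens: 2}
	for _, ev := range []EventKind{PrefillDone, TokenGenerated, TokenGenerated} {
		more := req.Apply(ev)
		fmt.Println(req.State == Finished, req.Generated, more)
	}
}
```

Keeping the transition logic in one method means the scheduler never needs to know why a request continues or stops; it only sees whether a next WorkItem was produced.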

Benchmark Highlights

The benchmark uses a Python mock executor, so the results should be read as control-plane behavior, not GPU inference performance.

Workload:

1000 requests per scenario
100 concurrency
Go server + Python mock executor
metrics computed as per-run deltas

Scenario	Throughput	Avg Latency	Avg TTFT	Avg TBT	Prefix Hits	Tokens Saved
`cache_miss`	3.28 req/s	30.502s	1.7322s	0.4109s	0	0
`cache_hit`	4.10 req/s	24.341s	0.3250s	0.3430s	1000	147000
`mixed_prompt`	4.22 req/s	23.682s	1.2117s	0.3209s	0	0

Key observation:

Prefix cache metadata reduced average TTFT from 1.7322s to 0.3250s, about an 81% reduction in this mock workload.

Detailed benchmark notes and reports are available under docs.

Quick Start

1. Start the Python mock executor

cd llm_serve
make run

The executor listens on 127.0.0.1:19991 by default.

2. Start the Go server

make run

Default endpoints:

inference service: 127.0.0.1:8800
admin / metrics: 127.0.0.1:8801

3. Check metrics

curl http://127.0.0.1:8801/metrics

4. Run benchmarks

make bench-smoke
make bench-cache-miss
make bench-cache-hit
make bench-mixed-prompt

Override benchmark parameters directly through the CLI:

go run ./cmd/bench --mode mixed_prompt --requests 1000 --concurrency 50 --timeout-ms 15000

Project Layout

cmd/
  bench/        benchmark CLI
  client/       simple client wrapper
  server/       Go serving process
internal/
  cache/        prefix cache metadata
  executor/     executor manager and Connect backend
  handler/      request admission
  metrics/      Prometheus metrics and runtime stats
  model/        Request, WorkItem, Event, Batch
  scheduler/    token-budget scheduler and queues
  state/        request lifecycle state machine
  transport/    Connect RPC transport handlers
llm_serve/      Python mock executor
proto/          protobuf API definitions
docs/           reports, plans, benchmark notes

Documentation

Detailed reports, benchmark notes, and implementation plans are available under docs.

Non-Goals

This repository intentionally does not implement:

real GPU kernels
real KV block allocation
PagedAttention or FlashAttention
distributed multi-node inference
production autoscaling
full OpenAI API compatibility

Those are inference-engine or production-platform concerns. This project focuses on the serving control plane.
