# Inference Lab
LLM inference simulator for analyzing serving systems.
Simulates GPU clusters serving LLM inference workloads with realistic
performance modeling.
## Features

- **Accurate Performance Modeling**: Models compute (FLOPS) and memory bandwidth constraints
- **Multiple Scheduling Policies**: FCFS, Priority, SJF, and more
- **Chunked Prefill**: Simulates realistic request interleaving
- **KV Cache Management**: Models GPU memory and KV cache utilization
- **Workload Generation**: Supports Poisson, Gamma, and closed-loop patterns
- **WebAssembly Support**: Run simulations in the browser via WASM
- **CLI Tool**: Standalone binary for command-line usage
## How does it work?
inference-lab uses discrete-event simulation to model the behavior of a
multi-GPU node serving LLM inference requests with the vLLM library. It
contains a facsimile of the vLLM queueing, scheduling, and execution logic,
with only the actual model inference replaced by a performance model based on
the supplied GPU specs and model architecture.
Within each simulation step, the simulator:

1. Processes any newly arrived requests, adding them to the scheduling queue.
2. Schedules requests to serve based on the selected scheduling policy.
3. Calculates the compute and memory bandwidth usage for the workload that the scheduled requests represent, and the theoretical time required to execute the workload on the specified hardware.
4. Increments the simulation time by the calculated execution time, updating the state of all requests accordingly (see the sketch below).
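The sketch below is a minimal, self-contained rendering of that loop, assuming a plain FCFS queue and a stubbed execution-time model. Every name in it (`Sim`, `Request`, the field names) is illustrative, not inference-lab's actual API.

```rust
struct Request {
    arrival: f64,          // arrival time in seconds
    tokens_this_step: u32, // prompt length at first, then 1 per decode step
    output_left: u32,      // decode tokens still to produce
}

struct Sim {
    clock: f64,
    pending: Vec<Request>, // generated but not yet arrived
    queue: Vec<Request>,   // waiting or currently running requests
    max_num_batched_tokens: u32,
}

impl Sim {
    fn step(&mut self) {
        // 1. Admit requests whose arrival time has passed.
        let now = self.clock;
        let (arrived, not_yet): (Vec<_>, Vec<_>) = std::mem::take(&mut self.pending)
            .into_iter()
            .partition(|r| r.arrival <= now);
        self.pending = not_yet;
        self.queue.extend(arrived);

        // 2. Schedule a batch (plain FCFS here) within the token budget.
        let mut batch = Vec::new();
        let mut budget = self.max_num_batched_tokens;
        while !self.queue.is_empty() && self.queue[0].tokens_this_step <= budget {
            budget -= self.queue[0].tokens_this_step;
            batch.push(self.queue.remove(0));
        }

        // 3. Theoretical execution time for the batch. Stubbed here; the real
        //    simulator derives it from the GPU and model specs (see the
        //    roofline-style sketch after the caveats below).
        let dt = 0.001 * batch.len().max(1) as f64;

        // 4. Advance the clock and update request state.
        self.clock += dt;
        for mut r in batch {
            if r.output_left > 0 {
                r.output_left -= 1;
                r.tokens_this_step = 1; // subsequent steps are single-token decode
                self.queue.push(r);     // request keeps running
            } // otherwise the request has finished and is dropped
        }
    }
}

fn main() {
    let mut sim = Sim {
        clock: 0.0,
        pending: vec![Request { arrival: 0.0, tokens_this_step: 512, output_left: 64 }],
        queue: Vec::new(),
        max_num_batched_tokens: 8192,
    };
    while !(sim.pending.is_empty() && sim.queue.is_empty()) {
        sim.step();
    }
    println!("simulation finished at t = {:.3} s", sim.clock);
}
```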
Caveats:

- This assumes perfectly optimized GPU execution, ignoring kernel launch overheads, poorly optimized kernels, application overhead, thermals, etc.
- We simulate tensor parallel execution, but don't model multi-GPU communication overheads.
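The first caveat is easiest to see in a roofline-style time estimate, where a batch is assumed to take exactly the larger of its compute-bound and bandwidth-bound times. The sketch below illustrates that idea using the H100 and Llama-3-70B numbers from the WebAssembly example further down; it is not the crate's actual performance model, which also has to account for attention and KV-cache traffic.

```rust
/// Roofline-style step-time estimate: a batch takes the larger of its
/// compute-bound and memory-bandwidth-bound times (illustrative only).
struct Hardware {
    compute_flops: f64,    // peak FLOP/s
    memory_bandwidth: f64, // bytes/s
}

struct Model {
    num_parameters: f64,
    bytes_per_param: f64,
}

/// Time for one forward pass over `batch_tokens` new tokens.
fn step_time(hw: &Hardware, m: &Model, batch_tokens: f64) -> f64 {
    // Roughly 2 FLOPs per parameter per token for the matmuls.
    let flops = 2.0 * m.num_parameters * batch_tokens;
    // Every weight is read from HBM at least once per forward pass.
    let bytes = m.num_parameters * m.bytes_per_param;

    let compute_bound = flops / hw.compute_flops;
    let bandwidth_bound = bytes / hw.memory_bandwidth;
    compute_bound.max(bandwidth_bound) // "perfectly optimized": whichever bound dominates
}

fn main() {
    // Values taken from the WebAssembly config example below.
    let h100 = Hardware { compute_flops: 1.513e15, memory_bandwidth: 3.35e12 };
    let llama_70b = Model { num_parameters: 70e9, bytes_per_param: 2.0 };

    // A decode step for 64 sequences (64 new tokens) is bandwidth-bound.
    println!("decode, 64 seqs:   {:.1} ms", step_time(&h100, &llama_70b, 64.0) * 1e3);
    // Prefilling an 8192-token prompt is compute-bound.
    println!("prefill, 8192 tok: {:.1} ms", step_time(&h100, &llama_70b, 8192.0) * 1e3);
}
```

Under these numbers the decode step is dominated by weight reads while the long prefill is dominated by FLOPs, which is why the max of the two bounds is the natural "perfectly optimized" estimate.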
## Installation

### As a Rust Library

```bash
cargo add inference-lab
```

### As an npm Package (WASM)

```bash
npm install @doublewordai/inference-lab
```

### As a CLI Tool

```bash
cargo install inference-lab
```
## Usage

### CLI

Note: the CLI tool is only available if you install it using `cargo install inference-lab` (see above).

```bash
# Run with default configuration
inference-lab --config configs/config.toml

# Example output shows TTFT, E2E latency, throughput, and utilization metrics
```
### Rust Library

```rust
use inference_lab::simulation::Simulator;
use inference_lab::config::SimulationConfig;

let config = SimulationConfig::from_file("config.toml")?;
let mut simulator = Simulator::new(config);
let results = simulator.run();

println!("Mean TTFT: {:.2} ms", results.ttft_mean * 1000.0);
println!("P99 E2E: {:.2} ms", results.e2e_p99 * 1000.0);
println!("Throughput: {:.1} tok/s", results.throughput);
```
### WebAssembly

```javascript
import init, { run_simulation } from '@doublewordai/inference-lab';

await init();

const config = {
  hardware: {
    name: "H100",
    compute_flops: 1.513e15,
    memory_bandwidth: 3.35e12,
    memory_capacity: 85899345920,
    bytes_per_param: 2
  },
  model: {
    name: "Llama-3-70B",
    num_parameters: 70000000000,
    num_layers: 80,
    hidden_dim: 8192,
    num_heads: 64,
    num_kv_heads: 8,
    max_seq_len: 8192
  },
  scheduler: {
    max_num_batched_tokens: 8192,
    max_num_seqs: 256,
    policy: "fcfs",
    enable_chunked_prefill: true,
    block_size: 16
  },
  workload: {
    arrival_pattern: "poisson",
    arrival_rate: 5.0,
    num_requests: 400,
    seed: 42,
    input_len_dist: {
      type: "lognormal",
      mean: 6.9,
      std_dev: 0.7
    },
    output_len_dist: {
      type: "lognormal",
      mean: 5.3,
      std_dev: 0.8
    }
  }
};

const results = run_simulation(JSON.stringify(config));
console.log('TTFT P50:', results.metrics.ttft_p50);
console.log('Throughput:', results.metrics.output_tokens_per_sec);
```
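For intuition about the KV cache numbers this config implies: assuming the cache is stored at the same 2-byte precision as the weights and the usual grouped-query-attention layout, each token occupies roughly 2 (K and V) × num_layers × num_kv_heads × (hidden_dim / num_heads) × 2 bytes = 2 × 80 × 8 × 128 × 2 ≈ 320 KiB, so an 8192-token sequence holds about 2.5 GiB of KV cache. The simulator's internal accounting may differ in detail.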
## Configuration

Configuration files use TOML format and specify:

- **Hardware**: GPU specs (FLOPS, bandwidth, VRAM)
- **Model**: LLM architecture (parameters, layers, heads)
- **Scheduler**: Policies, max tokens, chunked prefill settings
- **Workload**: Request arrival patterns and distributions

Example configurations are in the `configs/` directory:

- `config.toml` - Default H100 + Llama-3-70B setup
- `test_blog.toml` - Closed-loop benchmark (64 users)
- `qwen3_30b_a3b.toml` - Qwen model configuration
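Those files are the authoritative reference. Purely as a sketch of the shape, a config mirroring the JSON fields from the WebAssembly example above might look something like this (the exact field names and nesting in the real configs may differ):

```toml
# Sketch only; see configs/config.toml for the real thing.
[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8
max_seq_len = 8192

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 400
seed = 42
input_len_dist = { type = "lognormal", mean = 6.9, std_dev = 0.7 }
output_len_dist = { type = "lognormal", mean = 5.3, std_dev = 0.8 }
```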
## Building

### Native Binary

```bash
cargo build --release
./target/release/inference-lab --config configs/config.toml
```

### WASM Package

```bash
npm run build
# Outputs to pkg/ directory
```

## Publishing

```bash
# Publish to npm (requires authentication)
npm run build
npm publish --access public

# Publish Rust crate
cargo publish
```
## Project Structure

```
inference-lab/
├── src/
│   ├── simulation/   # Core simulator logic
│   ├── scheduler/    # Scheduling policies (FCFS, Priority, SJF)
│   ├── compute/      # Performance calculations
│   ├── kv_cache/     # KV cache management
│   ├── request/      # Request generation and tracking
│   ├── metrics/      # Performance metrics collection
│   ├── config/       # Configuration structures
│   ├── lib.rs        # Library root
│   ├── main.rs       # CLI entry point
│   └── wasm.rs       # WebAssembly bindings
├── configs/          # Example configurations
├── Cargo.toml        # Rust package manifest
└── package.json      # npm package manifest
```
## Metrics

The simulator tracks:

- **TTFT** (Time to First Token): Prefill latency
- **E2E** (End-to-End): Total request latency
- **TPOT** (Time Per Output Token): Decode latency per token
- **Throughput**: Tokens generated per second
- **Utilization**: Compute and memory bandwidth usage
- **KV Cache**: Memory utilization over time

Results include percentiles (p50, p90, p95, p99) and means.
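As a quick illustration of how such summaries are derived from per-request samples (not necessarily the exact method used internally), a nearest-rank percentile over recorded latencies looks like this:

```rust
/// Nearest-rank percentile over latency samples (illustrative only).
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    let mut ttft = vec![0.12, 0.08, 0.31, 0.09, 0.27, 0.11]; // seconds
    println!("TTFT p50 = {:.2} s", percentile(&mut ttft, 50.0));
    println!("TTFT p99 = {:.2} s", percentile(&mut ttft, 99.0));
}
```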
## License
MIT
## Repository
https://github.com/doublewordai/inference-lab