#openai #llm #mocking

bin+lib llmsim

LLM Traffic Simulator - A lightweight, high-performance LLM API simulator

2 unstable releases

new 0.2.0 Jan 18, 2026
0.1.0 Jan 10, 2026

#637 in Simulation

1,778 downloads per month

MIT license

1MB
6.5K SLoC

LLMSim

LLM Traffic Simulator - A lightweight, high-performance LLM API simulator for load testing, CI/CD, and local development.

Overview

LLMSim replicates realistic LLM API behavior without running actual models. It solves common challenges when testing LLM-integrated applications:

  • Cost: Real API calls during load tests are expensive
  • Rate Limits: Production APIs prevent realistic load testing
  • Reproducibility: Real models produce variable responses
  • Traffic Realism: LLM responses have unique characteristics (streaming, variable latency, token-based billing)

Features

  • Multi-Provider API Support - OpenAI Chat Completions and OpenResponses APIs
  • Realistic Latency Simulation - Normally distributed time-to-first-token (TTFT) and inter-token delays
  • Streaming Support - Server-Sent Events (SSE) for both OpenAI and OpenResponses streaming formats
  • Accurate Token Counting - Uses tiktoken-rs (a Rust port of OpenAI's tiktoken tokenizer)
  • Error Injection - Rate limits (429), server errors (500/503), timeouts
  • Multiple Response Generators - Lorem ipsum, echo, fixed, random, sequence
  • Model-Specific Profiles - GPT-5, GPT-4, Claude, Gemini latency profiles
  • Real-time Stats Dashboard - TUI dashboard with live metrics (requests, tokens, latency, errors)
  • Stats API - JSON endpoint for programmatic access to server metrics

Installation

cargo install llmsim

Demo

Console UI Demo

Usage

CLI Server

# Start with defaults (port 8080, lorem generator)
llmsim serve

# Start with real-time stats dashboard (TUI)
llmsim serve --tui

# All options
llmsim serve \
  --port 8080 \
  --host 0.0.0.0 \
  --generator lorem \
  --target-tokens 150 \
  --tui

# Using config file
llmsim serve --config config.yaml

Stats Dashboard

The --tui flag launches an interactive terminal dashboard showing real-time metrics:

  • Requests: Total, active, streaming vs non-streaming, requests/sec
  • Tokens: Prompt, completion, total, tokens/sec
  • Latency: Average, min, max response times
  • Errors: Total errors, rate limits (429), server errors (5xx), timeouts
  • Charts: RPS and token rate sparklines, model distribution

Controls: q to quit, r to force refresh.

As a Library

use llmsim::{
    openai::{ChatCompletionRequest, Message},
    generator::LoremGenerator,
    latency::LatencyProfile,
};

// Create a latency profile (drives TTFT and inter-token delays)
let latency = LatencyProfile::gpt5();

// Count tokens with the model's tokenizer (via tiktoken-rs)
let tokens = llmsim::count_tokens("Hello, world!", "gpt-5").unwrap();

// Generate a simulated response for a ChatCompletionRequest
// (here `request` stands for a request you have built or deserialized elsewhere)
let generator = LoremGenerator::new(100);
let response = generator.generate(&request);

API Endpoints

OpenAI API (/openai/v1/...)

Endpoint                        Method  Description
/openai/v1/chat/completions     POST    Chat completions (streaming & non-streaming)
/openai/v1/models               GET     List available models
/openai/v1/models/{model_id}    GET     Get specific model details
/openai/v1/responses            POST    Responses API (streaming & non-streaming)

When using OpenAI SDKs, set the base URL to http://localhost:8080/openai/v1.
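
For a quick end-to-end check, any HTTP client can exercise the simulator exactly like the real API. The sketch below is illustrative rather than part of the crate: it assumes a server started with llmsim serve on port 8080, plus the reqwest (blocking and json features) and serde_json crates.

use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the client at the simulator instead of api.openai.com
    let base = "http://localhost:8080/openai/v1";

    // Standard OpenAI-style chat completion body
    let body = json!({
        "model": "gpt-5",
        "messages": [{ "role": "user", "content": "Hello, world!" }]
    });

    let resp = reqwest::blocking::Client::new()
        .post(format!("{base}/chat/completions"))
        .json(&body)
        .send()?;

    println!("status: {}", resp.status());
    println!("body:   {}", resp.text()?);
    Ok(())
}

Streaming works the same way: set "stream": true in the request body and read the SSE chunks from the response.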

OpenResponses API (/openresponses/v1/...)

OpenResponses is an open-source specification for building multi-provider, interoperable LLM interfaces.

Endpoint                     Method  Description
/openresponses/v1/responses  POST    Create response (streaming & non-streaming)

LLMSim endpoints

Endpoint       Method  Description
/health        GET     Health check
/llmsim/stats  GET     Real-time server statistics (JSON)
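
The exact shape of the stats JSON isn't documented here, so the sketch below simply fetches and prints it; it assumes the same reqwest setup as the earlier example and a server on port 8080.

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Liveness check
    let health = reqwest::blocking::get("http://localhost:8080/health")?;
    println!("health: {}", health.status());

    // Current server metrics, as JSON
    let stats = reqwest::blocking::get("http://localhost:8080/llmsim/stats")?.text()?;
    println!("stats: {stats}");
    Ok(())
}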

Configuration

YAML Config File

server:
  port: 8080
  host: "0.0.0.0"

latency:
  profile: "gpt5"
  # Custom values (optional):
  # ttft_mean_ms: 600
  # ttft_stddev_ms: 150
  # tbt_mean_ms: 40
  # tbt_stddev_ms: 12

response:
  generator: "lorem"
  target_tokens: 100

errors:
  rate_limit_rate: 0.01
  server_error_rate: 0.001
  timeout_rate: 0.0
  timeout_after_ms: 30000

models:
  available:
    - "gpt-5"
    - "gpt-5-mini"
    - "gpt-4o"
    - "claude-opus"

Supported Models

Family    Models
GPT-5     gpt-5, gpt-5-mini, gpt-5.1, gpt-5.2, gpt-5-codex
O-Series  o3, o3-mini, o4, o4-mini
GPT-4     gpt-4, gpt-4-turbo, gpt-4o, gpt-4o-mini, gpt-4.1
Claude    claude-opus, claude-sonnet, claude-haiku (with versions)
Gemini    gemini-pro

Latency Profiles

Profile        TTFT Mean  TBT Mean
gpt-5          600ms      40ms
gpt-5-mini     300ms      20ms
gpt-4          800ms      50ms
gpt-4o         400ms      25ms
o-series       2000ms     30ms
claude-opus    1000ms     60ms
claude-sonnet  500ms      30ms
claude-haiku   200ms      15ms
instant        0ms        0ms
fast           10ms       1ms
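
These profiles feed the normal-distribution latency model described under Features. As an illustration only (not the crate's internals), the sketch below samples a TTFT and a few inter-token delays from the gpt-5 means in the table, reusing the standard deviations from the sample YAML config; it assumes the rand and rand_distr crates.

use rand_distr::{Distribution, Normal};
use std::time::Duration;

fn main() {
    let mut rng = rand::thread_rng();

    // gpt-5 profile from the table: 600ms TTFT, 40ms between tokens;
    // the stddevs (150ms / 12ms) mirror the commented-out values in the sample config
    let ttft = Normal::new(600.0_f64, 150.0).unwrap();
    let tbt = Normal::new(40.0_f64, 12.0).unwrap();

    // Sample a first-token delay, clamped so it never goes negative
    let first_token = Duration::from_millis(ttft.sample(&mut rng).max(0.0) as u64);
    println!("time to first token: {first_token:?}");

    // Then one delay per generated token
    for _ in 0..5 {
        let delay = Duration::from_millis(tbt.sample(&mut rng).max(0.0) as u64);
        println!("inter-token delay: {delay:?}");
    }
}

The crate's own profiles may add different jitter; the point is only how TTFT and inter-token pacing combine.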

Use Cases

  • Load Testing - Simulate thousands of concurrent LLM requests
  • CI/CD Pipelines - Fast, deterministic tests for LLM integrations
  • Local Development - Develop without API keys or costs
  • Chaos Engineering - Test behavior under failure scenarios
  • Cost Estimation - Estimate token usage before production

Requirements

  • Rust 1.83+ (for building from source), or
  • Docker

License

MIT License - see LICENSE for details.

Contributing

See CONTRIBUTING.md for contribution guidelines.

Dependencies

~39–59MB
~707K SLoC