Save 15-95% on eligible context automatically. For a quick overview, see the README Compression section.
OmniRoute implements a modular prompt compression pipeline that runs proactively before requests reach upstream providers, so your token savings happen transparently: no changes to your workflow are needed.
```
Client Request
  └─ Compression Strategy Selector
       ├─ Combo override?          → use combo setting
       ├─ Auto-trigger threshold?  → use auto mode
       ├─ Default mode?            → use global setting
       └─ Off?                     → skip compression
  └─ Selected Compression Mode
       ├─ Off:        no compression
       ├─ Lite:       safe whitespace/formatting cleanup (~15%)
       ├─ Standard:   caveman-speak filler removal (~30%)
       ├─ Aggressive: history aging + summarization (~50%)
       ├─ Ultra:      heuristic pruning + code-block thinning (~75%)
       ├─ RTK:        command-aware terminal/tool-output filtering (60-90% upstream range)
       └─ Stacked:    ordered multi-engine pipeline, usually RTK then Caveman (78-95% eligible range)
  └─ Compressed Request → Provider
```
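The selector precedence can be sketched in a few lines of Python; the function and argument names below are illustrative, not OmniRoute's actual internals:

```python
# Illustrative sketch of the selection precedence; names are hypothetical,
# not OmniRoute's actual internals.
def select_mode(combo_mode, default_mode, token_count,
                auto_trigger_mode=None, auto_trigger_tokens=None):
    """Resolve the compression mode for one request."""
    if combo_mode is not None:                  # 1. combo override wins
        return combo_mode
    if (auto_trigger_mode and auto_trigger_tokens
            and token_count >= auto_trigger_tokens):
        return auto_trigger_mode                # 2. auto-trigger threshold
    if default_mode and default_mode != "off":
        return default_mode                     # 3. global default
    return "off"                                # 4. skip compression

print(select_mode(None, "lite", 50_000, "stacked", 32_000))  # stacked
print(select_mode("rtk", "lite", 1_000))                     # rtk
```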
Off applies no compression: all messages pass through unchanged.
Lite is the safest mode, with zero semantic change and only formatting cleanup:
| Technique | Description |
|---|---|
| `collapseWhitespace` | Merge consecutive blank lines and trailing spaces |
| `dedupSystemPrompt` | Remove duplicate system messages |
| `compressToolResults` | Compress verbose tool/function outputs |
| `removeRedundantContent` | Strip repeated instructions |
| `replaceImageUrls` | Shorten base64 image data URIs |
Best for: Always-on usage, safety-critical workflows.
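As a rough illustration of the kind of cleanup lite mode performs (the helper below is a hypothetical sketch, not the engine's `collapseWhitespace` implementation):

```python
import re

def collapse_whitespace(text: str) -> str:
    """Trim trailing spaces and merge runs of blank lines."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)  # trailing spaces
    text = re.sub(r"\n{3,}", "\n\n", text)                   # 3+ newlines -> 2
    return text

print(collapse_whitespace("line one   \n\n\n\nline two"))
```

Content is never touched, only whitespace, which is why this mode is safe to leave on permanently.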
Standard mode, inspired by Caveman, removes filler words and verbose phrasing while preserving meaning:
- Removes filler words ("please", "I think", "basically", "actually")
- Condenses verbose phrases ("in order to" → "to", "as a result of" → "because")
- Strips polite hedging ("Would you mind...", "If you could possibly...")
- 30+ regex rules tuned for coding prompts
Best for: Daily coding workflows, cost-conscious teams.
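A minimal sketch of how such regex rules compose; the two rules below are illustrative stand-ins for the 30+ the engine ships:

```python
import re

# Two illustrative rules standing in for the engine's 30+ rule set.
RULES = [
    (re.compile(r"\bin order to\b", re.I), "to"),
    (re.compile(r"\b(?:please|basically|actually)\b ?", re.I), ""),
]

def condense(text: str) -> str:
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return re.sub(r"  +", " ", text).strip()

print(condense("Please refactor this in order to basically reduce duplication."))
# -> refactor this to reduce duplication.
```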
Aggressive mode adds smart history management for long sessions:
- Message Aging: older messages get progressively compressed
- Tool Result Summarization: long tool outputs are replaced with summaries
- Structural Integrity Guards: ensure `tool_use`/`tool_result` pairs stay consistent
- Context Window Awareness: respects per-model token limits
Best for: Extended debugging sessions, large codebases.
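Message aging can be illustrated with a toy truncation pass; the thresholds and the `…[aged]` marker below are invented for the example:

```python
# Toy aging pass: messages older than the most recent `keep_recent`
# are truncated. Thresholds and the "[aged]" marker are invented.
def age_history(messages, keep_recent=4, max_old_chars=200):
    cutoff = len(messages) - keep_recent
    aged = []
    for i, msg in enumerate(messages):
        text = msg["content"]
        if i < cutoff and len(text) > max_old_chars:
            text = text[:max_old_chars] + " …[aged]"
        aged.append({**msg, "content": text})
    return aged

history = [{"role": "user", "content": "x" * 1000} for _ in range(6)]
aged = age_history(history)
print(len(aged[0]["content"]), len(aged[-1]["content"]))  # 208 1000
```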
Ultra mode applies maximum compression for token-critical scenarios:
- Heuristic Pruning: removes messages below a relevance threshold
- Code Block Thinning: compresses repetitive code examples
- Binary Search Truncation: finds the optimal cut point for the context window
- All Aggressive mode features included
Best for: When you're hitting context limits repeatedly.
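Binary search truncation amounts to finding the largest history suffix that still fits a token budget. A hedged sketch with a stand-in token counter:

```python
# Find the largest history suffix that fits `budget` tokens via binary
# search. `token_len` is a stand-in counter (~4 chars per token).
def fit_suffix(messages, budget, token_len=lambda m: len(m["content"]) // 4):
    lo, hi = 0, len(messages)
    while lo < hi:
        mid = (lo + hi) // 2
        if sum(token_len(m) for m in messages[mid:]) <= budget:
            hi = mid              # suffix fits; try keeping more history
        else:
            lo = mid + 1          # too big; drop more from the front
    return messages[lo:]

msgs = [{"content": "a" * 400} for _ in range(10)]  # ~100 tokens each
print(len(fit_suffix(msgs, 350)))  # 3
```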
RTK mode is optimized for verbose tool outputs that appear in coding-agent sessions:
- Detects command/output classes such as `git status`, `git diff`, `git log`, test runners, TypeScript/Vite/Webpack builds, ESLint/Biome/Prettier, npm audit/installs, Docker logs, infra output, and generic shell output
- Applies JSON filter packs from `open-sse/services/compression/engines/rtk/filters/`
- Ships 49 built-in filters with inline verify samples
- Removes ANSI control sequences, progress bars, repeated lines, and non-actionable noise
- Preserves failures, errors, warnings, changed files, summaries, and the tail of long output
- Supports trust-gated project filters, global filters, and optional redacted raw-output recovery
Best for: Agent sessions with shell, build, test, git, grep, and file-output transcripts.
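A toy filter in the spirit of RTK's approach (not its actual JSON filter format): strip ANSI codes, progress noise, and duplicate lines, while always keeping failures, errors, warnings, and the tail of the output:

```python
import re

ANSI = re.compile(r"\x1b\[[0-9;]*m")                  # color escape codes
NOISE = re.compile(r"^(Progress|Downloading|[#=\-]{5,})", re.I)

def filter_output(text: str, keep_tail: int = 5) -> str:
    """Keep failures/errors/warnings and the tail; drop noise and repeats."""
    lines, seen, kept = text.splitlines(), set(), []
    for i, raw in enumerate(lines):
        line = ANSI.sub("", raw).rstrip()
        important = re.search(r"\b(FAIL|ERROR|WARN)", line, re.I)
        in_tail = i >= len(lines) - keep_tail
        if not important and not in_tail:
            if not line or NOISE.match(line) or line in seen:
                continue
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)

log = "\n".join([f"Progress {i}%" for i in range(10)] + ["FAIL tests/a.ts", "done"])
print(filter_output(log, keep_tail=2))  # only the FAIL line and the tail survive
```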
Stacked mode runs multiple compression engines in a deterministic order; the default pipeline is `RTK -> Caveman`. That order keeps terminal/tool output compact first, then applies Caveman semantic condensation to the remaining natural-language prompt. Stacked pipelines can be configured globally or through compression combos assigned to routing combos.
Best for: Mixed context with large tool logs plus human instructions or assistant summaries.
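A stacked pipeline is just ordered function composition: each engine receives the previous engine's output. The two engines below are toy placeholders, not the real RTK or Caveman implementations:

```python
# A stacked pipeline is ordered composition: each engine receives the
# previous engine's output. The two engines here are toy placeholders.
def run_stacked(text, engines):
    for name, engine in engines:
        text = engine(text)
    return text

pipeline = [
    ("rtk", lambda t: t.replace("npm WARN deprecated foo\n", "")),
    ("caveman", lambda t: t.replace("in order to ", "to ")),
]
out = run_stacked("npm WARN deprecated foo\nrun tests in order to verify\n", pipeline)
print(out)  # run tests to verify
```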
OmniRoute documents compression savings from two sources: upstream project benchmarks and OmniRoute's own engine composition.
| Source | Upstream README number used here |
|---|---|
| Caveman | ~75% fewer output tokens, 65% benchmark average output savings, a 22-87% range, and ~46% from its input-compression tool |
| RTK | 60-90% command-output savings; sample session ~118,000 -> ~23,900 tokens, or 79.7% saved (~80%) |
For overlapping tool/context payloads, the default OmniRoute combo stacks the engines as `RTK -> Caveman`. The combined savings are multiplicative, not additive:

```
combined = 1 - (1 - RTK savings) * (1 - Caveman input savings)
average  = 1 - (1 - 0.80) * (1 - 0.46) = 89.2%
range    = 1 - (1 - 0.60..0.90) * (1 - 0.46) = 78.4-94.6%
```

That 78-95% number applies when both RTK and Caveman can reduce the same input/context payload.
Caveman response output mode is separate: when enabled, use Caveman's own output savings (65%
average, ~75% headline, 22-87% range). Total billing savings depend on your prompt/output mix.
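The multiplicative stacking arithmetic can be verified directly:

```python
def stacked_savings(rtk: float, caveman: float) -> float:
    """combined = 1 - (1 - rtk) * (1 - caveman)"""
    return 1 - (1 - rtk) * (1 - caveman)

print(round(stacked_savings(0.80, 0.46), 3))  # 0.892  (average)
print(round(stacked_savings(0.60, 0.46), 3))  # 0.784  (low end)
print(round(stacked_savings(0.90, 0.46), 3))  # 0.946  (high end)
```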
```
Without compression: 47K tokens sent to the LLM
With Lite:       40K tokens sent     (15% saved; safe, always-on)
With Standard:   33K tokens sent     (30% saved; caveman-speak rules)
With Aggressive: 24K tokens sent     (50% saved; aging + summarization)
With Ultra:      12K tokens sent     (75% saved; heuristic pruning)
With RTK:        19K-5K tokens sent  (60-90% saved on command/tool output)
With Stacked:    10K-2.5K tokens sent (78-95% eligible RTK+Caveman range)
```
Navigate to Dashboard → Context & Cache:
- Caveman: mode selection, language packs, preview, and global defaults
- RTK: command-filter preview, RTK safety settings, and filter catalog
- Compression Combos: named engine pipelines assigned to routing combos
- Auto-Trigger Threshold: automatically engage compression when the token count exceeds the threshold
In Dashboard → Context & Cache → Compression Combos, assign a compression combo to a routing combo:

```
Combo: "free-forever"
Compression Combo: "coding-agent-stack"
Pipeline: RTK -> Caveman
Targets:
  1. gc/gemini-3-flash
  2. if/kimi-k2-thinking
```

This lets you use stacked compression on free/coding providers while keeping lite mode on paid subscriptions.
```shell
# Get compression settings
curl http://localhost:20128/api/settings/compression

# Update compression settings
curl -X PUT http://localhost:20128/api/settings/compression \
  -H "Content-Type: application/json" \
  -d '{"defaultMode":"stacked","autoTriggerMode":"stacked","autoTriggerTokens":32000}'

# Preview a specific RTK/stacked payload
curl -X POST http://localhost:20128/api/compression/preview \
  -H "Content-Type: application/json" \
  -d '{"mode":"rtk","messages":[{"role":"tool","content":"npm test output here"}]}'

# List RTK filter packs
curl http://localhost:20128/api/context/rtk/filters

# Test RTK directly with optional command metadata
curl -X POST http://localhost:20128/api/context/rtk/test \
  -H "Content-Type: application/json" \
  -d '{"command":"npm test","text":"FAIL tests/example.test.ts\nError: boom"}'
```

The compression engine always preserves:
- ✅ Code blocks (fenced and inline)
- ✅ URLs and file paths
- ✅ JSON structures and structured data
- ✅ Identifiers and protected technical tokens
- ✅ Mathematical expressions
- ✅ Tool/function call definitions
- ✅ System prompts (in lite mode)
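One common way to guarantee that kind of preservation, shown here as a hypothetical sketch rather than the engine's real mechanism, is to mask protected spans before compression and restore them verbatim afterwards:

```python
import re

INLINE_CODE = re.compile(r"`[^`]+`")   # protect inline code spans

def compress_protected(text, compress):
    """Mask protected spans, compress, then restore them verbatim."""
    blocks = []
    def stash(m):
        blocks.append(m.group(0))
        return f"\x00BLOCK{len(blocks) - 1}\x00"
    masked = INLINE_CODE.sub(stash, text)
    masked = compress(masked)
    for i, block in enumerate(blocks):
        masked = masked.replace(f"\x00BLOCK{i}\x00", block)
    return masked

out = compress_protected("please refactor `in order to helper` in order to simplify",
                         lambda t: t.replace("in order to", "to"))
print(out)  # please refactor `in order to helper` to simplify
```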
RTK raw-output recovery redacts common API keys, bearer tokens, Slack tokens, AWS access keys, passwords, tokens, and secrets before anything is persisted.
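As an illustration of that kind of redaction (the pattern list below is a small, hypothetical subset of what a real redactor would need):

```python
import re

# A small, hypothetical subset of redaction patterns; the real set is broader.
SECRETS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                         # AWS access key id
    re.compile(r"xox[baprs]-[0-9A-Za-z-]+"),                 # Slack token
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._-]+"),             # bearer tokens
    re.compile(r"(?i)\b(?:password|secret|token)\b\s*[:=]\s*\S+"),
]

def redact(text: str) -> str:
    for pattern in SECRETS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact("password=hunter2 key AKIAABCDEFGHIJKLMNOP"))
# -> [REDACTED] key [REDACTED]
```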
Every compressed request includes stats in the server logs:

```json
{
  "originalTokens": 47200,
  "compressedTokens": 40120,
  "savingsPercent": 15.0,
  "techniquesUsed": ["collapseWhitespace", "dedupSystemPrompt"],
  "mode": "lite",
  "engine": "caveman",
  "compressionComboId": "coding-agent-stack",
  "durationMs": 0.8,
  "rtkRawOutputPointers": []
}
```

| Phase | Modes | Status |
|---|---|---|
| Phase 1 | Off, Lite | ✅ Shipped |
| Phase 2 | Standard, Aggressive, Ultra | ✅ Shipped |
| Phase 3 | RTK, Stacked, Compression Combos | ✅ Shipped |
| Phase 4 | Per-model adaptive, ML-based pruning | 🗓️ Planned |
Standard mode compression rules are inspired by Caveman by JuliusBrussee (⭐ 51K+), the viral "why use many token when few token do trick" project. Caveman reports ~75% fewer output tokens, a 65% benchmark-average output savings, a 22-87% output range, and ~46% savings from its input-compression tool.
RTK mode is inspired by RTK (Rust Token Killer) by RTK AI, the high-performance command-output compression project for terminal, build, test, git, and tool-output filtering. RTK reports 60-90% savings, with its README sample session showing ~80% saved.
- Environment Config: compression environment variables
- Architecture Guide: compression pipeline internals
- User Guide: getting started with compression
- RTK Compression: RTK filters, trust model, verify gate, raw-output recovery
- Compression Engines: Caveman, RTK, stacked, APIs, MCP, dashboard
- Compression Rules Format: JSON rule-pack format
- Compression Language Packs: language-specific Caveman rules