The Context Optimization Layer for LLM Applications
Tool outputs are 70-95% redundant boilerplate. Headroom compresses that away.
The setup: 100 production log entries. One critical error buried at position 67.
BEFORE: 100 log entries (18,952 chars):
```json
[
{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", "message": "Request processed successfully - latency=50ms", "request_id": "req-000000", "status_code": 200},
{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", "message": "Request processed successfully - latency=51ms", "request_id": "req-000001", "status_code": 200},
{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", "message": "Request processed successfully - latency=52ms", "request_id": "req-000002", "status_code": 200},
// ... 64 more INFO entries ...
{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "message": "Connection pool exhausted", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
// ... 32 more INFO entries ...
]
```

AFTER: Headroom compresses to 6 entries (1,155 chars):

```json
[
{"timestamp": "2024-12-15T00:00:00Z", "level": "INFO", "service": "api-gateway", ...},
{"timestamp": "2024-12-15T01:01:00Z", "level": "INFO", "service": "user-service", ...},
{"timestamp": "2024-12-15T02:02:00Z", "level": "INFO", "service": "inventory", ...},
{"timestamp": "2024-12-15T03:47:23Z", "level": "FATAL", "service": "payment-gateway", "error_code": "PG-5523", "resolution": "Increase max_connections to 500 in config/database.yml", "affected_transactions": 1847},
{"timestamp": "2024-12-15T02:38:00Z", "level": "INFO", "service": "inventory", ...},
{"timestamp": "2024-12-15T03:39:00Z", "level": "INFO", "service": "auth", ...}
]
```

What happened: the first 3 items, the FATAL error, and the last 2 items were kept. The critical error at position 67 was automatically preserved.
The question we asked Claude: "What caused the outage? What's the error code? What's the fix?"
| | Baseline | Headroom |
|---|---|---|
| Input tokens | 10,144 | 1,260 |
| Correct answers | 4/4 | 4/4 |
Both responses: "payment-gateway service, error PG-5523, fix: Increase max_connections to 500, 1,847 transactions affected"
87.6% fewer tokens. Same answer.
Run it yourself: `python examples/needle_in_haystack_test.py`
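The selection behind that result is easy to picture: keep the head and tail of the list, plus anything anomalous. A minimal sketch of the idea (illustrative only; Headroom's SmartCrusher is statistical and content-aware rather than a fixed rule):

```python
# Toy illustration of "keep head + anomalies + tail"; not Headroom's actual algorithm.
from typing import Any


def crush_logs(entries: list[dict[str, Any]], head: int = 3, tail: int = 2) -> list[dict[str, Any]]:
    """Keep the first `head` items, the last `tail` items, and any non-INFO anomalies."""
    anomalous = {"WARN", "WARNING", "ERROR", "FATAL"}
    keep = {
        i
        for i, entry in enumerate(entries)
        if i < head or i >= len(entries) - tail or entry.get("level") in anomalous
    }
    return [entries[i] for i in sorted(keep)]


# 100 near-identical INFO entries with one FATAL at position 67 -> 6 survivors,
# matching the AFTER output shown above.
```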
The setup: An Agno agent with 4 tools (GitHub Issues, ArXiv Papers, Code Search, Database Logs) investigating a memory leak. Total tool output: 62,323 chars (~15,580 tokens).
```python
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel
# Wrap your model - that's it!
base_model = Claude(id="claude-sonnet-4-20250514")
model = HeadroomAgnoModel(wrapped_model=base_model)
agent = Agent(model=model, tools=[search_github, search_arxiv, search_code, query_db])
response = agent.run("Investigate the memory leak and recommend a fix")
```

Results with Claude Sonnet:
| | Baseline | Headroom |
|---|---|---|
| Tokens sent to API | 15,662 | 6,100 |
| API requests | 2 | 2 |
| Tool calls | 4 | 4 |
| Duration | 26.5s | 27.0s |
76.3% fewer tokens. Same comprehensive answer.
Both found: Issue #42 (memory leak), the cleanup_worker() fix, OutOfMemoryError logs (7.8GB/8GB, 847 threads), and relevant research papers.
Run it yourself: `python examples/multi_tool_agent_test.py`
Headroom optimizes LLM context before it hits the provider — without changing your agent logic or tools.
```mermaid
flowchart LR
User["Your App"]
Entry["Headroom"]
Transform["Context<br/>Optimization"]
LLM["LLM Provider"]
Response["Response"]
User --> Entry --> Transform --> LLM --> Response
```

```mermaid
flowchart TB
subgraph Pipeline["Transform Pipeline"]
CA["Cache Aligner<br/><i>Stabilizes dynamic tokens</i>"]
SC["Smart Crusher<br/><i>Removes redundant tool output</i>"]
CM["Context Manager<br/><i>Fits token budget</i>"]
CA --> SC --> CM
end
subgraph CCR["CCR: Compress-Cache-Retrieve"]
Store[("Compressed<br/>Store")]
Tool["Retrieve Tool"]
Tool <--> Store
end
LLM["LLM Provider"]
CM --> LLM
SC -. "Stores originals" .-> Store
LLM -. "Requests full context<br/>if needed" .-> Tool
Headroom never throws data away. It compresses aggressively and retrieves precisely.
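Conceptually, CCR works like a content-addressed stash: the compressor stores the original under a key, and the model is given a retrieval tool to call when the compressed view is not enough. A minimal sketch of that idea (the store, key scheme, and tool name here are illustrative, not Headroom's actual API):

```python
# Illustrative sketch of Compress-Cache-Retrieve; names and key scheme are hypothetical.
import hashlib

_originals: dict[str, str] = {}  # stands in for Headroom's compressed store


def compress_and_stash(tool_output: str) -> str:
    """Return a compact view of the output plus a key the LLM can use to get it back."""
    key = hashlib.sha256(tool_output.encode()).hexdigest()[:12]
    _originals[key] = tool_output
    preview = tool_output[:200]  # stand-in for real, structure-aware compression
    return f"{preview}\n[compressed; call retrieve_full_context('{key}') for the original]"


def retrieve_full_context(key: str) -> str:
    """Exposed to the LLM as a tool: returns the original, uncompressed content."""
    return _originals.get(key, f"no stored content for key {key!r}")
```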
- Headroom intercepts context — Tool outputs, logs, search results, and intermediate agent steps.
- Dynamic content is stabilized — Timestamps, UUIDs, request IDs are normalized so prompts cache cleanly (see the sketch after this list).
- Low-signal content is removed — Repetitive or redundant data is crushed, not truncated.
- Original data is preserved — Full content is stored separately and retrieved only if the LLM asks.
- Provider caches finally work — Headroom aligns prompts so OpenAI, Anthropic, and Google caches actually hit.
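The stabilization step above is what lets provider-side prefix caching work: if a timestamp or request ID changes on every call, the prompt prefix never repeats byte-for-byte and the cache never hits. A rough sketch of that kind of normalization (the patterns are illustrative; Headroom's CacheAligner is more careful about what it rewrites and where):

```python
# Illustrative only: normalize high-churn tokens so repeated prompts share a stable prefix.
import re

UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.IGNORECASE
)
ISO_TIMESTAMP_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?\b")
REQUEST_ID_RE = re.compile(r"\breq-\d+\b")


def stabilize(text: str) -> str:
    """Replace volatile identifiers with fixed placeholders so the prefix stays stable."""
    text = UUID_RE.sub("<uuid>", text)
    text = ISO_TIMESTAMP_RE.sub("<timestamp>", text)
    text = REQUEST_ID_RE.sub("<request-id>", text)
    return text
```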
For deep technical details, see Architecture Documentation.
- Zero code changes - works as a transparent proxy
- 47-92% savings - depends on your workload (tool-heavy = more savings)
- Reversible compression - LLM retrieves original data via CCR
- Content-aware - code, logs, JSON each handled optimally
- Provider caching - automatic prefix optimization for cache hits
- Framework native - LangChain, Agno, MCP, agents supported
pip install "headroom-ai[proxy]"
headroom proxy --port 8787Point your tools at the proxy:
```bash
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Any OpenAI-compatible client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
```
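The same applies to your own code: any client built on the OpenAI SDK can route through the proxy by setting base_url (the model name and key below are placeholders):

```python
# Assumes the proxy from above is running on localhost:8787.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8787/v1",  # route requests through Headroom
    api_key="sk-...",                     # your real provider key, unchanged
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```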
```bash
pip install "headroom-ai[langchain]"
```

```python
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel
# Wrap your model - that's it!
llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
# Use exactly like before
response = llm.invoke("Hello!")
```

See the full LangChain Integration Guide for memory, retrievers, agents, and more.
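Because the wrapper behaves like any other LangChain chat model, it composes with LCEL chains, memory, and agents as usual. A minimal sketch (the prompt is illustrative):

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
prompt = ChatPromptTemplate.from_template("Summarize these logs:\n\n{logs}")

# Standard LCEL composition; Headroom optimizes the context before it reaches the provider.
chain = prompt | llm
result = chain.invoke({"logs": "...long tool output..."})
print(result.content)
```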
pip install "headroom-ai[agno]"from agno.agent import Agent
from agno.models.openai import OpenAIChat
from headroom.integrations.agno import HeadroomAgnoModel
# Wrap your model - that's it!
model = HeadroomAgnoModel(OpenAIChat(id="gpt-4o"))
agent = Agent(model=model)
# Use exactly like before
response = agent.run("Hello!")
# Check savings
print(f"Tokens saved: {model.total_tokens_saved}")See the full Agno Integration Guide for hooks, multi-provider support, and more.
| Framework | Integration | Docs |
|---|---|---|
| LangChain | HeadroomChatModel, memory, retrievers, agents | Guide |
| Agno | HeadroomAgnoModel, hooks, multi-provider | Guide |
| MCP | Tool output compression for Claude | Guide |
| Any OpenAI Client | Proxy server | Guide |
| Feature | Description | Docs |
|---|---|---|
| Memory | Persistent memory across conversations (zero-latency inline extraction) | Memory |
| Universal Compression | ML-based content detection + structure-preserving compression | Compression |
| SmartCrusher | Compresses JSON tool outputs statistically | Transforms |
| CacheAligner | Stabilizes prefixes for provider caching | Transforms |
| RollingWindow | Manages context limits without breaking tools | Transforms |
| CCR | Reversible compression with automatic retrieval | CCR Guide |
| LangChain | Memory, retrievers, agents, streaming | LangChain |
| Agno | Agent framework integration with hooks | Agno |
| Text Utilities | Opt-in compression for search/logs | Text Compression |
| LLMLingua-2 | ML-based 20x compression (opt-in) | LLMLingua |
| Code-Aware | AST-based code compression (tree-sitter) | Transforms |
These numbers are from actual API calls, not estimates:
| Scenario | Before | After | Savings | Verified |
|---|---|---|---|---|
| Code search (100 results) | 17,765 tokens | 1,408 tokens | 92% | Claude Sonnet |
| SRE incident debugging | 65,694 tokens | 5,118 tokens | 92% | GPT-4o |
| Codebase exploration | 78,502 tokens | 41,254 tokens | 47% | GPT-4o |
| GitHub issue triage | 54,174 tokens | 14,761 tokens | 73% | GPT-4o |
Overhead: ~1-5ms compression latency
When savings are highest: Tool-heavy workloads (search, logs, database queries)

When savings are lowest: Conversation-heavy workloads with minimal tool use
| Provider | Token Counting | Cache Optimization |
|---|---|---|
| OpenAI | tiktoken (exact) | Automatic prefix caching |
| Anthropic | Official API | cache_control blocks |
| Google | Official API | Context caching |
| Cohere | Official API | - |
| Mistral | Official tokenizer | - |
New models auto-supported via naming pattern detection.
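"Naming pattern detection" means the model ID alone is enough to pick a provider, tokenizer, and cache strategy. A sketch of the idea (the patterns and mapping are illustrative, not the library's actual table):

```python
# Illustrative mapping from model-name patterns to providers; not Headroom's real table.
import re

PROVIDER_PATTERNS = [
    (re.compile(r"^(gpt-|o\d)"), "openai"),
    (re.compile(r"^claude-"), "anthropic"),
    (re.compile(r"^gemini-"), "google"),
    (re.compile(r"^command"), "cohere"),
    (re.compile(r"^(mistral|mixtral)-"), "mistral"),
]


def detect_provider(model_id: str) -> str:
    for pattern, provider in PROVIDER_PATTERNS:
        if pattern.match(model_id):
            return provider
    return "unknown"


assert detect_provider("claude-sonnet-4-20250514") == "anthropic"
assert detect_provider("gpt-4o") == "openai"
```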
- Never removes human content - user/assistant messages preserved
- Never breaks tool ordering - tool calls and responses stay paired
- Parse failures are no-ops - malformed content passes through unchanged (see the sketch after this list)
- Compression is reversible - LLM retrieves original data via CCR
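The parse-failure guarantee above amounts to a pass-through guard around every transform. In sketch form (illustrative, not the library's code):

```python
# Illustrative: if content can't be parsed, it is returned exactly as received.
import json
from typing import Callable


def crush_if_possible(tool_output: str, crush: Callable[[object], str]) -> str:
    """Apply a compression function only when the payload parses; otherwise no-op."""
    try:
        payload = json.loads(tool_output)
    except (json.JSONDecodeError, TypeError, ValueError):
        return tool_output  # malformed content passes through unchanged
    return crush(payload)
```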
```bash
pip install headroom-ai                  # SDK only
pip install "headroom-ai[proxy]"         # Proxy server
pip install "headroom-ai[langchain]"     # LangChain integration
pip install "headroom-ai[agno]"          # Agno agent framework
pip install "headroom-ai[code]"          # AST-based code compression
pip install "headroom-ai[llmlingua]"     # ML-based compression
pip install "headroom-ai[all]"           # Everything
```

Requirements: Python 3.10+
| Guide | Description |
|---|---|
| Memory Guide | Persistent memory for LLMs |
| Compression Guide | Universal compression with ML detection |
| LangChain Integration | Full LangChain support |
| Agno Integration | Full Agno agent framework support |
| SDK Guide | Fine-grained control |
| Proxy Guide | Production deployment |
| Configuration | All options |
| CCR Guide | Reversible compression |
| Metrics | Monitoring |
| Troubleshooting | Common issues |
Add your project here! Open a PR or start a discussion.
```bash
git clone https://github.com/chopratejas/headroom.git
cd headroom
pip install -e ".[dev]"
pytest
```

See CONTRIBUTING.md for details.
Apache License 2.0 - see LICENSE.
Built for the AI developer community