Project Status: Alpha (launched February 2026) AIR Blackbox is an early-stage project actively seeking feedback and contributors. The architecture is stable, the APIs may evolve. If you're deploying AI agents and care about audit trails, we'd love your input — open an issue or start a discussion.
Project Status: Alpha (launched February 2026) AIR Blackbox is an early-stage project actively seeking feedback and contributors. The architecture is stable, the APIs may evolve. If you're deploying AI agents and care about audit trails, we'd love your input — open an issue or start a discussion.
View Interactive Demo — Walk through every feature with animated examples.
Your AI agent just sent an email, moved money, or changed production data. Someone asks: "Show me exactly what it saw and why it made that decision."
Can you answer that today?
AIR Blackbox Gateway is a flight recorder for AI systems. Drop it in front of any OpenAI-compatible provider and every LLM call produces a tamper-evident, replayable audit record — without exposing sensitive content to your observability stack.
# Add one line. Every AI decision is now recorded.
from openai import OpenAI
import air
client = air.air_wrap(OpenAI())15 repos. 200+ tests. CI on every push. Apache-2.0.
See it live: Interactive Demo — watch an agent run, inspect the audit chain, tamper with a record, and see the chain break.
Also: Test Suite — 30 tests across 8 LLM providers.
1. Start the stack
git clone https://github.com/airblackbox/gateway.git
cd gateway
cp .env.example .env # add your OPENAI_API_KEY
docker compose up --build2. Install the SDK
pip install air-blackbox-sdk3. Record everything
from openai import OpenAI
import air
client = air.air_wrap(OpenAI())
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": "What is a flight recorder?"}],
)
# x-run-id header = your audit trail
# Prompts vaulted in your MinIO, not a third-party cloud
# HMAC-SHA256 chain = tamper-proof recordWorks with your framework:
# LangChain
from air.integrations.langchain import air_langchain_llm
llm = air_langchain_llm("gpt-4o-mini")
# CrewAI
from air.integrations.crewai import air_crewai_llm
llm = air_crewai_llm("gpt-4o-mini")4. View traces → localhost:16686 (Jaeger)
5. Replay any run
go run ./cmd/replayctl replay runs/<run_id>.air.jsonLangfuse, Helicone, and Datadog answer "how is the system performing?"
AIR answers "what exactly happened, and can we prove it?"
| Observability Tools | AIR Blackbox Gateway | |
|---|---|---|
| Dashboards & latency | ✅ | ❌ (use Jaeger/Grafana) |
| Where data lives | Their cloud | Your vault (S3/MinIO) |
| PII in traces | ❌ Raw content exposed | ✅ Vault references only |
| Tamper-evident records | ❌ | ✅ SHA-256 + HMAC chain |
| Deterministic replay | ❌ | ✅ replayctl |
| Compliance reporting | ❌ | ✅ 22 controls (SOC 2 + ISO 27001) |
| Signed evidence export | ❌ | ✅ HMAC-attested packages |
| Agent guardrails | ❌ | ✅ Cost, loop, tool, PII |
AIR Blackbox provides tamper-evident audit chains for AI systems — an approach inspired by certificate transparency logs, applied to agent infrastructure. Not Langfuse (6k+ stars), not Helicone, not LangSmith. They're observability. This is accountability.
Platform engineers deploying agents that call LLMs. You need every request recorded without leaking PII into your observability stack. Drop this in front of your provider — zero code changes.
Compliance teams whose regulators are asking "show me what the AI did." AIR records give you structured reconstruction with SHA-256 checksums and signed evidence packages.
Startup CTOs who know "we can't prove what our AI did" will block enterprise deals, SOC 2, or insurance. Install this now so you're not scrambling later.
Agent builders moving beyond chatbots toward systems that operate across hours, call tools, and interact with production data. You need decision provenance, replay, and the ability to prove your agent did the right thing — or a clear record of where it didn't.
- Your agent sends an OpenAI-compatible request to the gateway (just change the base URL)
- The gateway assigns a
run_id, forwards the request, captures the response - Prompts and completions are vaulted in MinIO (S3-compatible) — traces contain references, not content
- An
.air.jsonrecord captures the full run: vault refs, model, tokens, timing, tool calls - OTel spans flow through the collector pipeline (normalize → vault → redact → export)
- Later:
replayctlreplays the run and reports behavioral drift
This is the part nobody else has.
Audit Chain — Every proxied request is appended to an HMAC-SHA256 chain. Each entry links to the previous entry's hash. Modify any record and the chain breaks from that point forward. Same integrity model as certificate transparency logs, without the blockchain overhead.
Compliance Reporting — The gateway evaluates your live configuration against 22 controls across SOC 2 (12 controls) and ISO 27001 (10 controls). Controls pass or fail based on what's actually enabled — vault, guardrails, analytics, audit chain. No self-assessment forms. The gateway evaluates itself.
Evidence Export — GET /v1/audit/export generates a signed evidence package: full audit chain, compliance report, time range, HMAC attestation. Hand it to your auditor as a single JSON document. The attestation can be independently verified against your signing key.
| Endpoint | Method | Description |
|---|---|---|
/v1/audit |
GET | Chain integrity + live compliance evaluation |
/v1/audit/export |
GET | Signed evidence package for regulators |
AIR is a witness, not a gatekeeper. It cannot cause your AI system to fail.
Non-blocking — Vault unreachable? Gateway still proxies. Your AI never stops because recording failed.
Lossy-safe — A dropped record is acceptable. A dropped request is not. Recording is best-effort; proxying is guaranteed.
Self-degrading — OTel Collector down? Spans dropped silently. Filesystem full? AIR records fail gracefully. Warnings logged, never errors returned.
Same contract as Datadog agents, OTel collectors, and every other production observability tool. Companies won't insert infrastructure that can break their pipeline.
You control all data. You choose what gets recorded.
| Mode | What's Stored | Use Case |
|---|---|---|
| Full vault (default) | Prompts + completions in your MinIO | Complete reconstruction |
| Metadata only | Model, tokens, timing, run_id | Lightweight audit, no content |
| Hash only | SHA-256 of request/response | Prove a call happened without storing what was said |
| Selective redaction | Content with PII/PHI stripped | Healthcare, fintech, enterprise |
"We can prove what happened without exposing the data." That's what makes this viable for regulated industries.
15 repos, all tested, all with CI/CD, all Apache-2.0.
| Layer | Repos | What It Does |
|---|---|---|
| Gateway | gateway (this repo) |
Proxy + vault + AIR records + guardrails + trust |
| SDK | python-sdk |
Python integrations — OpenAI, LangChain, CrewAI |
| Episode Ledger | agent-episode-store |
Groups AIR records into replayable task-level episodes |
| Eval Harness | eval-harness |
Replays episodes, scores results, detects regressions |
| Policy Engine | agent-policy-engine |
Risk-tiered autonomy, runtime enforcement |
| Collector | genai-semantic-normalizer, prompt-vault-processor, otel-processor-genai |
Normalize → vault → redact → metrics |
| Platform | air-platform |
Docker Compose orchestration + integration tests |
| Replay | agent-vcr, trace-regression-harness |
Record/replay agent runs, policy assertions on traces |
| Governance | mcp-policy-gateway, mcp-security-scanner, agent-tool-sandbox, aibom-policy-engine, runtime-aibom-emitter |
Tool firewall, security scanning, sandboxing, AI bill of materials |
| Trust | pkg/trust (this repo) |
HMAC audit chain, SOC 2 + ISO 27001 compliance, evidence export |
Visibility (what happened)
→ Detection (something is wrong)
→ Prevention (stop it automatically)
→ Optimization (make it better)
→ Trust (prove it to regulators)
→ Autonomy (let the agent act, safely)
Each layer builds on the one below. You can't detect what you can't see. You can't prevent what you can't detect. You can't trust what you can't prove. And you can't grant autonomy without trust.
| Version | Capability | Status |
|---|---|---|
| v0.1 | Recording, replay, vault, OTel pipeline, 8 providers | ✅ |
| v0.1 | Non-blocking proxy with streaming, auth, timeout safety | ✅ |
| v0.4 | Runaway agent kill-switch, cost guardrails, loop detection | ✅ |
| v0.5 | Policy enforcement, PII blocking, tool allowlists, HITL approval | ✅ |
| v0.6 | Cross-agent analytics, model routing, failure taxonomy | ✅ |
| v0.7 | HMAC-SHA256 audit chain, SOC 2 + ISO 27001 reporting, evidence export | ✅ |
| v0.8 | Python SDK (OpenAI, LangChain, CrewAI), CI/CD across all repos | ✅ |
| Phase | Timeline | Focus |
|---|---|---|
| Foundation | Q1–Q2 2026 | Episode model, durable state, pause/resume |
| Risk-Tiered Autonomy | Q3–Q4 2026 | Cost-of-error gating, approval workflows, sandbox replay |
| Multi-Agent Orchestration | Q1–Q2 2027 | Planner/executor/critic, escalation policies, shared state |
| External Trust Wedge | Q3–Q4 2027 | Trust layer as add-on, onboarding templates, incident runbooks |
| Enterprise Scale | 2028–2029 | Multi-tenant isolation, provenance search, SCIM provisioning |
| Variable | Default | Description |
|---|---|---|
LISTEN_ADDR |
:8080 |
Gateway listen address |
PROVIDER_URL |
https://api.openai.com |
Upstream LLM provider |
VAULT_ENDPOINT |
localhost:9000 |
MinIO/S3 endpoint |
VAULT_ACCESS_KEY |
minioadmin |
S3 access key |
VAULT_SECRET_KEY |
minioadmin |
S3 secret key |
VAULT_BUCKET |
air-runs |
S3 bucket name |
VAULT_USE_SSL |
false |
TLS for S3 |
OTEL_EXPORTER_OTLP_ENDPOINT |
localhost:4317 |
OTel collector gRPC |
RUNS_DIR |
./runs |
AIR record directory |
TRUST_SIGNING_KEY |
(none) | HMAC-SHA256 signing key |
Each run produces a .air.json file:
{
"version": "1.0.0",
"run_id": "550e8400-e29b-41d4-a716-446655440000",
"trace_id": "abc123...",
"timestamp": "2025-02-14T10:30:00Z",
"model": "gpt-4o-mini",
"provider": "openai",
"request_vault_ref": "vault://air-runs/550e8400.../request.json",
"response_vault_ref": "vault://air-runs/550e8400.../response.json",
"request_checksum": "sha256:a1b2c3...",
"response_checksum": "sha256:d4e5f6...",
"tokens": { "prompt": 25, "completion": 142, "total": 167 },
"duration_ms": 1230,
"status": "success"
}Apache-2.0. The open-source protocol layer will always be Apache-2.0.
The path to adoption: Open protocol → common dependency → operational expectation → compliance requirement.
See LICENSE for details. See Commercial Addendum for future commercial governance services.