AIR Blackbox Gateway

Project Status: Alpha (launched February 2026) AIR Blackbox is an early-stage project actively seeking feedback and contributors. The architecture is stable, the APIs may evolve. If you're deploying AI agents and care about audit trails, we'd love your input — open an issue or start a discussion.

View Interactive Demo — Walk through every feature with animated examples.

Your AI agent just sent an email, moved money, or changed production data. Someone asks: "Show me exactly what it saw and why it made that decision."

Can you answer that today?

AIR Blackbox Gateway is a flight recorder for AI systems. Drop it in front of any OpenAI-compatible provider and every LLM call produces a tamper-evident, replayable audit record — without exposing sensitive content to your observability stack.

# Add one line. Every AI decision is now recorded.
from openai import OpenAI
import air

client = air.air_wrap(OpenAI())

15 repos. 200+ tests. CI on every push. Apache-2.0.

See it live: Interactive Demo — watch an agent run, inspect the audit chain, tamper with a record, and see the chain break.

Also: Test Suite — 30 tests across 8 LLM providers.

Get Started in 5 Minutes

1. Start the stack

git clone https://github.com/airblackbox/gateway.git
cd gateway
cp .env.example .env   # add your OPENAI_API_KEY
docker compose up --build

2. Install the SDK

pip install air-blackbox-sdk

3. Record everything

from openai import OpenAI
import air

client = air.air_wrap(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is a flight recorder?"}],
)
# x-run-id header = your audit trail
# Prompts vaulted in your MinIO, not a third-party cloud
# HMAC-SHA256 chain = tamper-proof record

Works with your framework:

# LangChain
from air.integrations.langchain import air_langchain_llm
llm = air_langchain_llm("gpt-4o-mini")

# CrewAI
from air.integrations.crewai import air_crewai_llm
llm = air_crewai_llm("gpt-4o-mini")

4. View traces → localhost:16686 (Jaeger)

5. Replay any run

go run ./cmd/replayctl replay runs/<run_id>.air.json

Why This Exists

Langfuse, Helicone, and Datadog answer "how is the system performing?"

AIR answers "what exactly happened, and can we prove it?"

	Observability Tools	AIR Blackbox Gateway
Dashboards & latency	✅	❌ (use Jaeger/Grafana)
Where data lives	Their cloud	Your vault (S3/MinIO)
PII in traces	❌ Raw content exposed	✅ Vault references only
Tamper-evident records	❌	✅ SHA-256 + HMAC chain
Deterministic replay	❌	✅ `replayctl`
Compliance reporting	❌	✅ 22 controls (SOC 2 + ISO 27001)
Signed evidence export	❌	✅ HMAC-attested packages
Agent guardrails	❌	✅ Cost, loop, tool, PII

AIR Blackbox provides tamper-evident audit chains for AI systems — an approach inspired by certificate transparency logs, applied to agent infrastructure. Not Langfuse (6k+ stars), not Helicone, not LangSmith. They're observability. This is accountability.

Who This Is For

Platform engineers deploying agents that call LLMs. You need every request recorded without leaking PII into your observability stack. Drop this in front of your provider — zero code changes.

Compliance teams whose regulators are asking "show me what the AI did." AIR records give you structured reconstruction with SHA-256 checksums and signed evidence packages.

Startup CTOs who know "we can't prove what our AI did" will block enterprise deals, SOC 2, or insurance. Install this now so you're not scrambling later.

Agent builders moving beyond chatbots toward systems that operate across hours, call tools, and interact with production data. You need decision provenance, replay, and the ability to prove your agent did the right thing — or a clear record of where it didn't.

How It Works

Your agent sends an OpenAI-compatible request to the gateway (just change the base URL)
The gateway assigns a run_id, forwards the request, captures the response
Prompts and completions are vaulted in MinIO (S3-compatible) — traces contain references, not content
An .air.json record captures the full run: vault refs, model, tokens, timing, tool calls
OTel spans flow through the collector pipeline (normalize → vault → redact → export)
Later: replayctl replays the run and reports behavioral drift

The Trust Layer

This is the part nobody else has.

Audit Chain — Every proxied request is appended to an HMAC-SHA256 chain. Each entry links to the previous entry's hash. Modify any record and the chain breaks from that point forward. Same integrity model as certificate transparency logs, without the blockchain overhead.

Compliance Reporting — The gateway evaluates your live configuration against 22 controls across SOC 2 (12 controls) and ISO 27001 (10 controls). Controls pass or fail based on what's actually enabled — vault, guardrails, analytics, audit chain. No self-assessment forms. The gateway evaluates itself.

Evidence Export — GET /v1/audit/export generates a signed evidence package: full audit chain, compliance report, time range, HMAC attestation. Hand it to your auditor as a single JSON document. The attestation can be independently verified against your signing key.

Endpoint	Method	Description
`/v1/audit`	GET	Chain integrity + live compliance evaluation
`/v1/audit/export`	GET	Signed evidence package for regulators

Operational Guarantees

AIR is a witness, not a gatekeeper. It cannot cause your AI system to fail.

Non-blocking — Vault unreachable? Gateway still proxies. Your AI never stops because recording failed.

Lossy-safe — A dropped record is acceptable. A dropped request is not. Recording is best-effort; proxying is guaranteed.

Self-degrading — OTel Collector down? Spans dropped silently. Filesystem full? AIR records fail gracefully. Warnings logged, never errors returned.

Same contract as Datadog agents, OTel collectors, and every other production observability tool. Companies won't insert infrastructure that can break their pipeline.

Privacy & Data Boundaries

You control all data. You choose what gets recorded.

Mode	What's Stored	Use Case
Full vault (default)	Prompts + completions in your MinIO	Complete reconstruction
Metadata only	Model, tokens, timing, run_id	Lightweight audit, no content
Hash only	SHA-256 of request/response	Prove a call happened without storing what was said
Selective redaction	Content with PII/PHI stripped	Healthcare, fintech, enterprise

"We can prove what happened without exposing the data." That's what makes this viable for regulated industries.

The Ecosystem

15 repos, all tested, all with CI/CD, all Apache-2.0.

Layer	Repos	What It Does
Gateway	`gateway` (this repo)	Proxy + vault + AIR records + guardrails + trust
SDK	`python-sdk`	Python integrations — OpenAI, LangChain, CrewAI
Episode Ledger	`agent-episode-store`	Groups AIR records into replayable task-level episodes
Eval Harness	`eval-harness`	Replays episodes, scores results, detects regressions
Policy Engine	`agent-policy-engine`	Risk-tiered autonomy, runtime enforcement
Collector	`genai-semantic-normalizer`, `prompt-vault-processor`, `otel-processor-genai`	Normalize → vault → redact → metrics
Platform	`air-platform`	Docker Compose orchestration + integration tests
Replay	`agent-vcr`, `trace-regression-harness`	Record/replay agent runs, policy assertions on traces
Governance	`mcp-policy-gateway`, `mcp-security-scanner`, `agent-tool-sandbox`, `aibom-policy-engine`, `runtime-aibom-emitter`	Tool firewall, security scanning, sandboxing, AI bill of materials
Trust	`pkg/trust` (this repo)	HMAC audit chain, SOC 2 + ISO 27001 compliance, evidence export

The Value Ladder

Visibility (what happened)
  → Detection (something is wrong)
    → Prevention (stop it automatically)
      → Optimization (make it better)
        → Trust (prove it to regulators)
          → Autonomy (let the agent act, safely)

Each layer builds on the one below. You can't detect what you can't see. You can't prevent what you can't detect. You can't trust what you can't prove. And you can't grant autonomy without trust.

What's Shipped

Version	Capability	Status
v0.1	Recording, replay, vault, OTel pipeline, 8 providers	✅
v0.1	Non-blocking proxy with streaming, auth, timeout safety	✅
v0.4	Runaway agent kill-switch, cost guardrails, loop detection	✅
v0.5	Policy enforcement, PII blocking, tool allowlists, HITL approval	✅
v0.6	Cross-agent analytics, model routing, failure taxonomy	✅
v0.7	HMAC-SHA256 audit chain, SOC 2 + ISO 27001 reporting, evidence export	✅
v0.8	Python SDK (OpenAI, LangChain, CrewAI), CI/CD across all repos	✅

What's Next

Phase	Timeline	Focus
Foundation	Q1–Q2 2026	Episode model, durable state, pause/resume
Risk-Tiered Autonomy	Q3–Q4 2026	Cost-of-error gating, approval workflows, sandbox replay
Multi-Agent Orchestration	Q1–Q2 2027	Planner/executor/critic, escalation policies, shared state
External Trust Wedge	Q3–Q4 2027	Trust layer as add-on, onboarding templates, incident runbooks
Enterprise Scale	2028–2029	Multi-tenant isolation, provenance search, SCIM provisioning

Configuration

Variable	Default	Description
`LISTEN_ADDR`	`:8080`	Gateway listen address
`PROVIDER_URL`	`https://api.openai.com`	Upstream LLM provider
`VAULT_ENDPOINT`	`localhost:9000`	MinIO/S3 endpoint
`VAULT_ACCESS_KEY`	`minioadmin`	S3 access key
`VAULT_SECRET_KEY`	`minioadmin`	S3 secret key
`VAULT_BUCKET`	`air-runs`	S3 bucket name
`VAULT_USE_SSL`	`false`	TLS for S3
`OTEL_EXPORTER_OTLP_ENDPOINT`	`localhost:4317`	OTel collector gRPC
`RUNS_DIR`	`./runs`	AIR record directory
`TRUST_SIGNING_KEY`	(none)	HMAC-SHA256 signing key

AIR Record Format

Each run produces a .air.json file:

{
  "version": "1.0.0",
  "run_id": "550e8400-e29b-41d4-a716-446655440000",
  "trace_id": "abc123...",
  "timestamp": "2025-02-14T10:30:00Z",
  "model": "gpt-4o-mini",
  "provider": "openai",
  "request_vault_ref": "vault://air-runs/550e8400.../request.json",
  "response_vault_ref": "vault://air-runs/550e8400.../response.json",
  "request_checksum": "sha256:a1b2c3...",
  "response_checksum": "sha256:d4e5f6...",
  "tokens": { "prompt": 25, "completion": 142, "total": 167 },
  "duration_ms": 1230,
  "status": "success"
}

License

Apache-2.0. The open-source protocol layer will always be Apache-2.0.

The path to adoption: Open protocol → common dependency → operational expectation → compliance requirement.

See LICENSE for details. See Commercial Addendum for future commercial governance services.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
cmd		cmd
collector		collector
docs		docs
examples		examples
pkg		pkg
testdata		testdata
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
demo.html		demo.html
docker-compose.yaml		docker-compose.yaml
go.mod		go.mod
go.sum		go.sum
guardrails.yaml.example		guardrails.yaml.example

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AIR Blackbox Gateway

Get Started in 5 Minutes

Why This Exists

Who This Is For

How It Works

The Trust Layer

Operational Guarantees

Privacy & Data Boundaries

The Ecosystem

The Value Ladder

What's Shipped

What's Next

Configuration

AIR Record Format

License

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

airblackbox/gateway

Folders and files

Latest commit

History

Repository files navigation

AIR Blackbox Gateway

Get Started in 5 Minutes

Why This Exists

Who This Is For

How It Works

The Trust Layer

Operational Guarantees

Privacy & Data Boundaries

The Ecosystem

The Value Ladder

What's Shipped

What's Next

Configuration

AIR Record Format

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages