Description
Proposal
Agent-SRE is an AI-native SRE framework (1,071+ tests) that provides SLI/SLO tracking, chaos testing, canary deployments, and incident response for AI agent systems.
Integration Opportunity
OpenLit is OpenTelemetry-native with 50+ LLM provider integrations. Agent-SRE can complement this by adding the SRE discipline layer:
- SLO-Based Alerting - Agent-SRE defines SLOs (for example, P99 latency < 500ms, error rate < 1%) and tracks error budgets. These SLO metrics can be exported as OTel metrics that OpenLit visualizes alongside LLM traces.
- Chaos Test Spans - Agent-SRE chaos experiments (fault injection, latency simulation, token exhaustion) can be emitted as OTel spans, so they appear in OpenLit's trace view alongside the LLM calls they affect.
- Canary Metrics - During progressive rollouts of new LLM models or prompts, Agent-SRE tracks canary vs baseline performance. These comparison metrics can enrich OpenLit's monitoring dashboard.
- Error Budget Dashboard - OpenLit could display Agent-SRE's error budget consumption over time, helping teams decide when to ship new features and when to focus on reliability.
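To make the error budget idea above concrete, here is a minimal, self-contained sketch of the arithmetic behind it. It is not Agent-SRE's API: the `ErrorBudget` class and its field names are hypothetical and only illustrate how a value like "error budget consumed" could be derived before being exported as an OTel metric.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper, not part of Agent-SRE: tracks budget for one SLO window."""

    slo_target: float      # e.g. 0.99 means 99% of requests must succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that violated the SLI in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Fraction of the budget already spent (1.0 means fully spent).
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests == 0 else float("inf")
        return self.failed_requests / allowed

    @property
    def exhausted(self) -> bool:
        return self.consumed_fraction >= 1.0


# Example: a 99% success SLO (error rate < 1%) over 10,000 requests with 25 failures.
budget = ErrorBudget(slo_target=0.99, total_requests=10_000, failed_requests=25)
print(f"budget consumed: {budget.consumed_fraction:.0%}, exhausted: {budget.exhausted}")
```

A single number such as `consumed_fraction` is the kind of value that could be reported as a gauge per service and SLO, which is what a dashboard would need to plot budget burn over time.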
Proposed Approach
from agent_sre.exporters.otel import OTelSLOExporter

# Agent-SRE exports SLO metrics over the OTel protocol (OTLP)
exporter = OTelSLOExporter(
    endpoint='http://localhost:4318',  # OpenLit OTLP endpoint
    service_name='my-ai-service',
)

# slo_engine is assumed to be an already-configured Agent-SRE SLO engine (setup not shown)
slo_engine.add_exporter(exporter)

Since OpenLit already speaks OpenTelemetry natively, the integration is clean: Agent-SRE only needs to export its SLI measurements and chaos test data as OTel metrics and traces.
Our companion project Agent-Hypervisor also has a structured event bus that could export ring transitions, saga steps, and liability events via OTel.
Happy to submit a PR. See the Agent-SRE repository for the full API.