Tracer-Cloud/opensre


OpenSRE

OpenSRE v0.1: Build Your Own AI SRE Agents

The open-source framework for AI SRE agents, and the training and evaluation environment they need to improve. Connect the 60+ tools you already run, define your own workflows, and investigate incidents on your own infrastructure.


Quickstart · Docs · FAQ · Security


🚧 Public Alpha: Core workflows are usable for early exploration, though not yet fully stable. The project is in active development, and APIs and integrations may evolve.



Why OpenSRE?

When something breaks in production, the evidence is scattered across logs, metrics, traces, runbooks, and Slack threads. OpenSRE is an open-source framework for AI SRE agents that resolve production incidents, built to run on your own infrastructure.

We do that because SWE-bench [1] gave coding agents scalable training data and clear feedback, while production incident response still lacks an equivalent.

Distributed failures are slower, noisier, and harder to simulate and evaluate than local code tasks, which is why AI SRE, and AI for production debugging more broadly, remains unsolved.

OpenSRE is building that missing layer:

an open reinforcement learning environment for agentic infrastructure incident response, with end-to-end tests and synthetic incident simulations for realistic production failures

We do that by:

  • building easy-to-deploy, customizable AI SRE agents for production incident investigation and response
  • running scored synthetic RCA suites that check root-cause accuracy, required evidence, and adversarial red herrings (tests/synthetic)
  • running real-world end-to-end tests across cloud-backed scenarios including Kubernetes, EC2, CloudWatch, Lambda, ECS Fargate, and Flink (tests/e2e)
  • keeping semantic test-catalog naming so e2e vs synthetic and local vs cloud boundaries stay obvious (tests/README.md)
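
To make the three scoring axes above concrete, here is a minimal sketch of how a single synthetic scenario could be scored. It is an illustration only, not the scorer in tests/synthetic: the `Scenario` and `Report` schemas, the field names, and the exact-match rule for root causes are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Ground truth for one synthetic incident (hypothetical schema)."""
    root_cause: str
    required_evidence: set[str]
    red_herrings: set[str] = field(default_factory=set)


@dataclass
class Report:
    """What the agent concluded (hypothetical schema)."""
    root_cause: str
    cited_evidence: set[str]


def score(scenario: Scenario, report: Report) -> dict[str, float]:
    """Score a report on root-cause accuracy, evidence coverage, and decoy avoidance."""
    # Root-cause accuracy: exact label match, the simplest possible check.
    accuracy = 1.0 if report.root_cause == scenario.root_cause else 0.0

    # Required evidence: fraction of must-cite signals the report actually cited.
    required = scenario.required_evidence
    coverage = len(required & report.cited_evidence) / len(required) if required else 1.0

    # Red herrings: fraction of decoy signals the report did not cite.
    decoys = scenario.red_herrings
    avoided = 1.0 - len(decoys & report.cited_evidence) / len(decoys) if decoys else 1.0

    return {
        "root_cause_accuracy": accuracy,
        "evidence_coverage": coverage,
        "red_herring_avoidance": avoided,
    }


# Example: correct root cause, half the required evidence, and one decoy cited.
scenario = Scenario(
    root_cause="oom_kill",
    required_evidence={"pod_memory_metric", "kubelet_oom_event"},
    red_herrings={"unrelated_deploy"},
)
report = Report(root_cause="oom_kill", cited_evidence={"pod_memory_metric", "unrelated_deploy"})
print(score(scenario, report))
```

Keeping the three axes as separate numbers, rather than folding them into one score, makes it visible when an agent names the right root cause for the wrong reasons.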

Our mission is to build AI SRE agents on top of this, scale it to thousands of realistic infrastructure failure scenarios, and establish OpenSRE as the benchmark and training ground for AI SRE.

[1] SWE-bench: https://arxiv.org/abs/2310.06770


Install

The root installer URL auto-detects Unix shell vs PowerShell. Add --main when you want the latest rolling build from main instead of the latest stable release.

Latest stable release:

curl -fsSL https://install.opensre.com | bash

Latest build from main:

curl -fsSL https://install.opensre.com | bash -s -- --main

Homebrew:

brew tap tracer-cloud/tap
brew install tracer-cloud/tap/opensre

Windows (PowerShell):

irm https://install.opensre.com | iex

Quick Start

Configure once, then pick how you want to run investigations:

opensre onboard

Interactive shell: with no subcommand, opensre starts a REPL (TTY required). Describe incidents in plain language, stream investigations, and use slash commands such as /help, /status, /clear, /reset, /trust, /effort, and /exit. /effort sets reasoning depth for OpenAI and Codex providers (low, medium, high, xhigh, or max); other providers ignore it. Ctrl+C cancels an in-flight investigation without losing session state.

opensre

One-shot investigation. Run the agent once against an alert file:

opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
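
If you have a directory of alert files, the one-shot command above can be wrapped in a small script. The sketch below is a convenience wrapper, not part of OpenSRE: it assumes `opensre` is on your PATH and that each alert is a `.json` file, and it only records exit codes without interpreting them, since their meaning is not documented here.

```python
import subprocess
from pathlib import Path


def build_command(alert_file: Path) -> list[str]:
    """Build the one-shot CLI invocation for a single alert file."""
    return ["opensre", "investigate", "-i", str(alert_file)]


def investigate_all(alert_dir: Path) -> dict[str, int]:
    """Run one investigation per *.json file in alert_dir and collect exit codes."""
    results: dict[str, int] = {}
    for alert_file in sorted(alert_dir.glob("*.json")):
        # check=False: keep going even if one investigation exits non-zero.
        completed = subprocess.run(build_command(alert_file), check=False)
        results[alert_file.name] = completed.returncode
    return results
```

Calling `investigate_all(Path("tests/e2e/kubernetes/fixtures"))` would run the agent once per fixture in that directory; it raises `FileNotFoundError` if the `opensre` binary is not installed.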

Other useful commands:

opensre update
opensre uninstall   # remove opensre and all local data

Deployment

Deploy OpenSRE as a standard Python/FastAPI runtime using the repo Dockerfile or a managed app host such as Railway, EC2, ECS, or Vercel. Set LLM_PROVIDER plus the matching API key (see .env.example); hosted layouts that need persistence should also configure DATABASE_URI and REDIS_URI.

Full deployment steps, Railway notes, and opensre remote ops → docs/DEVELOPMENT.md
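
Before starting a hosted deployment, it can help to fail fast on missing configuration. The following is a minimal pre-flight sketch, not an OpenSRE feature: the function name and the `needs_persistence` flag are hypothetical, and it deliberately does not check the provider API key, because the key names are defined in .env.example rather than in this README.

```python
from collections.abc import Mapping


def check_deploy_env(env: Mapping[str, str], needs_persistence: bool = False) -> list[str]:
    """Return the names of documented variables that are missing or empty."""
    required = ["LLM_PROVIDER"]
    if needs_persistence:
        # Hosted layouts that need persistence also configure these two.
        required += ["DATABASE_URI", "REDIS_URI"]
    return [name for name in required if not env.get(name)]


# Example with a placeholder provider value (accepted values are not listed here).
missing = check_deploy_env({"LLM_PROVIDER": "example-provider"}, needs_persistence=True)
print(missing)
```

In a real entrypoint you would pass `os.environ` and exit with an error message when the returned list is non-empty.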


How OpenSRE Works


When an alert fires, OpenSRE automatically:

  1. Fetches the alert context and correlated logs, metrics, and traces
  2. Reasons across your connected systems to identify anomalies
  3. Generates a structured investigation report with probable root cause
  4. Suggests next steps and, optionally, executes remediation actions
  5. Posts a summary directly to Slack or PagerDuty, so no context switching is needed
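
The five steps above can be read as a simple pipeline. The sketch below is a toy model of that flow, not OpenSRE's actual agent code: every name (`Investigation`, `run_pipeline`, and the injected `fetch`, `detect`, `notify`, and `remediate` callables) is hypothetical, and the "reasoning" is reduced to picking the first detected anomaly.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Investigation:
    """State accumulated while handling one alert (hypothetical schema)."""
    alert: dict
    evidence: list[str] = field(default_factory=list)
    anomalies: list[str] = field(default_factory=list)
    root_cause: str = "undetermined"
    next_steps: list[str] = field(default_factory=list)


def run_pipeline(
    alert: dict,
    fetch: Callable[[dict], list[str]],
    detect: Callable[[list[str]], list[str]],
    notify: Callable[[str], None],
    remediate: Optional[Callable[[str], None]] = None,
) -> Investigation:
    inv = Investigation(alert=alert)
    # 1. Fetch alert context and correlated signals.
    inv.evidence = fetch(alert)
    # 2. Reason over the evidence to identify anomalies.
    inv.anomalies = detect(inv.evidence)
    # 3. Record a probable root cause (here: simply the first anomaly found).
    if inv.anomalies:
        inv.root_cause = inv.anomalies[0]
    # 4. Suggest next steps and, only if a remediator is supplied, execute them.
    inv.next_steps = [f"Mitigate: {anomaly}" for anomaly in inv.anomalies]
    if remediate is not None:
        for step in inv.next_steps:
            remediate(step)
    # 5. Post a summary to the configured channel.
    notify(f"{alert.get('title', 'alert')}: probable root cause is {inv.root_cause}")
    return inv


# Example run with stub integrations.
messages: list[str] = []
result = run_pipeline(
    alert={"title": "High pod restarts"},
    fetch=lambda alert: ["log: OOMKilled", "metric: memory at limit"],
    detect=lambda evidence: [e for e in evidence if "OOMKilled" in e],
    notify=messages.append,
)
print(messages[0])
```

Passing remediation as an optional callable mirrors the "optionally" in step 4: nothing is executed unless the caller explicitly wires in a remediator.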

For the current code-level agent architecture, which no longer includes the old graph and chain framework layers, see AGENT_ARCHITECTURE.md.


Benchmark

Regenerate numbers with make benchmark; refresh this table from cached results via make benchmark-update-readme. See docs/DEVELOPMENT.md for details.

No benchmark results yet.


Capabilities & integrations

  • 🔍 Structured incident investigation: Correlated root-cause analysis across all your signals
  • 📋 Runbook-aware reasoning: OpenSRE reads your runbooks and applies them automatically
  • 🔮 Predictive failure detection: Catch emerging issues before they page you
  • 🔗 Evidence-backed root cause: Every conclusion is linked to the data behind it
  • 🤖 Full LLM flexibility: Bring your own model (Anthropic, OpenAI, Ollama, Gemini, OpenRouter, NVIDIA NIM)

OpenSRE connects to 60+ tools across LLMs, observability, cloud infrastructure, data platforms, incident management, and MCP. The full matrix (with roadmap links) lives in the product docs; a detailed catalog is also maintained in-repo as the project grows.


Integrations

OpenSRE connects to 60+ tools and services across the modern cloud stack, from LLM providers and observability platforms to infrastructure, databases, and incident management.

| Category | Integrations | Roadmap |
| --- | --- | --- |
| AI / LLM Providers | Anthropic · OpenAI · Ollama · Google Gemini · OpenRouter · NVIDIA NIM · Bedrock | |
| Observability | Grafana (Loki · Mimir · Tempo) · Datadog · Honeycomb · Coralogix · CloudWatch · Sentry · Elasticsearch · Better Stack Telemetry | Splunk · New Relic · Victoria Logs |
| Infrastructure | Kubernetes · AWS (S3 · Lambda · EKS · EC2 · Bedrock) · GCP · Azure | Helm · ArgoCD |
| Database | MongoDB · ClickHouse · PostgreSQL · MySQL · MariaDB · MongoDB Atlas · Azure SQL · Snowflake | RDS |
| Data Platform | Apache Airflow · Apache Kafka · Apache Spark · Prefect · RabbitMQ | |
| Dev Tools | GitHub · GitHub MCP · Bitbucket · GitLab | |
| Incident Management | PagerDuty · Opsgenie · Jira · Alertmanager | Trello · ServiceNow · incident.io · Linear |
| Communication | Slack · Google Docs · Discord · Telegram | Notion · Teams · WhatsApp · Confluence |
| Agent Deployment | Vercel · EC2 · ECS · Railway | |
| Protocols | MCP · ACP · OpenClaw | |

OpenSRE is community-built. Looking for a safe first contribution? Browse good first issue tickets or see the Good First Issues guide. See CONTRIBUTING.md for the full workflow.

Local environment: SETUP.md (all platforms, Windows, MCP/OpenClaw).

Developing in this repo: docs/DEVELOPMENT.md (install from source, CI parity checks, dev container, benchmark, deployment detail, telemetry reference).

Join our Discord


Thanks go to these amazing people:

Contributors

Security

OpenSRE is designed with production environments in mind: structured and auditable LLM prompts, local transcript handling by default, and no silent bulk export of raw logs. See SECURITY.md for responsible disclosure.


Telemetry

Telemetry through PostHog (product analytics) and Sentry (errors) is opt-out, meaning it is on unless you disable it. Quick disable:

export OPENSRE_NO_TELEMETRY=1

Full matrix, DSN override, and local event logging → docs/DEVELOPMENT.md
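
When opensre is launched from a script or CI job rather than an interactive shell, the same opt-out can be applied per process. This is a small sketch under that assumption; the helper name `without_telemetry` is hypothetical, and it relies only on the `OPENSRE_NO_TELEMETRY=1` setting shown above.

```python
from collections.abc import Mapping


def without_telemetry(env: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of env with the documented telemetry opt-out set."""
    child_env = dict(env)
    child_env["OPENSRE_NO_TELEMETRY"] = "1"
    return child_env


# Example: the caller's environment is left untouched.
base = {"PATH": "/usr/bin"}
child = without_telemetry(base)
print(child["OPENSRE_NO_TELEMETRY"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when invoking opensre, for example `subprocess.run(["opensre", "update"], env=without_telemetry(os.environ))`.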


License

Apache 2.0. See LICENSE.
