
pavangudiwada/awesome-ai-sre

Awesome AI SRE

If this repository is useful, please consider starring ⭐ it.

Tools

Items marked with 💚 are open source projects.

AUTO-GENERATED FILE - DO NOT EDIT MANUALLY. Auto-generated by CI workflow or pre-commit hooks using node generate-readme.js.

Jump to: Incident Response | Observability | AIOps | IDP | IaC | Security | Deployment

Incident Response (31)

Name Summary Deployment Links
Agent SRE AgentSRE is built for enterprises that can't afford downtime. A fleet of AI agents automates detection, root cause analysis, and remediation - delivering faster recovery, lower cloud costs, and resilient operations. Hybrid Website
AlertD AlertD is an agentic AI teammate for SRE and DevOps on AWS, cutting alert noise and dashboard fatigue while delivering contextual answers and automated actions. SaaS Website LinkedIn
AutonomOps AI Autonomous operations platform that applies AI to improve SRE and incident management workflows. SaaS Website LinkedIn
Azure SRE Agent AI-powered reliability assistant for Azure that automates incident response, root-cause analysis, and mitigation workflows. SaaS Website GitHub LinkedIn X
Bacca.ai AI SRE for high-scale platforms that uses tribal knowledge to triage and mitigate incidents accurately. SaaS Website LinkedIn
Beeps AI-powered operations assistant focused on helping teams handle alerts and incident workflows faster. SaaS Website LinkedIn
Cleric Cleric is an autonomous AI SRE that helps engineering teams quickly diagnose production issues in complex cloud-native environments. SaaS Website LinkedIn
DrDroid AI that understands your production system, infrastructure, applications, and business context to investigate incidents and explain root causes. SaaS Website GitHub LinkedIn X
FireHydrant All-in-one incident management software for modern teams. FireHydrant helps you plan, respond, and resolve faster with smart alerting, on-call scheduling, AI-powered. SaaS Website LinkedIn
Harness AI-SRE Most incidents start with change, so why manage them in isolation? Learn how Harness AI-SRE connects the dots between alerts, changes, and workflows, powered. SaaS Website GitHub LinkedIn X
incident.io incident.io is an all-in-one incident management platform unifying on-call scheduling, real-time incident response, and integrated status pages – helping teams resolve. SaaS Website LinkedIn
IncidentFox AI incident response platform designed to help teams investigate and resolve operational issues. SaaS Website GitHub LinkedIn
NeuBird AI NeuBird AI's agentic AI SRE delivers autonomous incident resolution, helping teams cut MTTR by up to 90% and reclaim engineering hours lost to troubleshooting. SaaS Website LinkedIn
NOFire AI NOFire handles alerts, flags risky changes, turns incidents and tribal knowledge into lasting reliability memory. SaaS Website LinkedIn
OpsCompanion OpsCompanion is the AI-driven Operations Intelligence Engine that automates root cause analysis, resolves alerts, and unifies observability across your stack helping. SaaS Website LinkedIn
PagerDuty SRE Agent Transform critical operations with PagerDuty's AI first Operations Platform. Harness agentic AI and automation to accelerate work and build resilience. SaaS Website GitHub LinkedIn X
Phoebe The immune system for your software. AI agents that continuously investigate live data, diagnose emerging issues and generate preemptive fixes. SaaS Website LinkedIn
ProdRescue AI Automates incident reports and evidence-backed RCA for SRE teams from Slack war rooms or logs in minutes. SaaS Website LinkedIn X
Resolve AI Resolve AI handles all alerts, performs root cause analysis, and troubleshoots incidents within minutes. SaaS Website LinkedIn
RobinRelay AI on-call copilot for Slack that cuts MTTR by 75%. Reduce alert noise, recall past incident fixes, and save thousands of engineering hours yearly. SaaS Website LinkedIn
Rootly The all-in-one incident management platform, including AI SRE agents, built for fast-moving engineering teams to detect, manage, learn from, and resolve incidents faster. SaaS Website LinkedIn
RunLLM The AI SRE for mission-critical systems that delivers transparent investigations, evidence-backed root cause analysis, and continuous runbook improvement. SaaS Website LinkedIn
Scoutflo Your AI SRE for incident response and debugging. AI handles alerts, finds root causes, and fixes issues in minutes. SaaS Website LinkedIn
Sherlocks.ai Cut MTTR by 10x with AI SREs that investigate incidents 24/7, automate root cause analysis, and prevent outages before they happen. Try Sherlocks.ai free. SaaS Website LinkedIn
Steadwing Steadwing is an autonomous on-call engineer that finds root causes in under 5 minutes and fixes them. It correlates logs, metrics, traces, and code to deliver actionable RCAs and real remediation (PRs, rollbacks, config changes, and more), with 20+ integrations. SaaS Website LinkedIn X Product Hunt
TierZero AI TierZero's AI agents investigate incidents, triage alerts, and fix production problems automatically, so your engineers can ship faster. SaaS Website LinkedIn
💚Tracer OS-level AI SRE platform for high-compute workloads that accelerates alert investigation, root-cause analysis, and mitigation inside your environment. On-Prem Website GitHub
Traversal Traversal cuts through alert noise, surfaces root causes, and guides your team to remediation, so incidents get fixed in minutes, not hours. SaaS Website LinkedIn
Vibranium Labs AI reliability tooling company focused on incident response automation and operations intelligence. SaaS Website LinkedIn
Vigiles Incident management platform for modern teams with outage detection, on-call alerting, response coordination, status pages, and AI postmortems. SaaS Website
Wild Moose Wild Moose helps developers solve production issues faster, kicking off any root cause investigation automatically. Triggered by alerts, the AI moose autonomously. SaaS Website LinkedIn

Back to top ↑

Observability (15)

Name Summary Deployment Links
Agent0 by Dash0 Dash0's agentic AI platform for observability that helps engineers with incident triage, query building, instrumentation guidance, trace analysis, and dashboard creation. SaaS Website GitHub LinkedIn X
Better Stack Observability and incident management platform with AI SRE, eBPF-based tracing, logs, metrics, uptime monitoring, and on-call workflows. SaaS Website LinkedIn
Causely Causely pinpoints the root cause of errors so that you can consistently meet reliability expectations of application users in complex, cloud native environments. SaaS Website LinkedIn
DagKnows, Inc AI operations company focused on improving incident diagnostics and reliability workflows. SaaS Website LinkedIn
Datadog (Bits AI) See metrics from all of your apps, tools & services in one place with Datadog's cloud monitoring as a service solution. Try it for free. SaaS Website GitHub LinkedIn X
Deductive AI Deductive AI transforms your root-causing process by effortlessly understanding your entire codebase along with the telemetry data. SaaS Website LinkedIn
Deeptrace Automate and cut your on-call/debugging time in half with AI. SaaS Website LinkedIn
Edge Delta Observability pipeline and AI analytics platform for processing telemetry at scale and accelerating incident investigation. SaaS Website LinkedIn
Elastic Learn more about Elastic Observability. Elastic Observability resolves problems faster at reduced cost with an open source, AI-powered observability, that is accurate,. SaaS Website GitHub LinkedIn X
Lightrun Lightrun's AI SRE handles alerts, prevents issues early with live runtime context during development, and resolves alerts in minutes with verified RCA. SaaS Website LinkedIn
Logz.io Stop Chasing Alerts. Get Ahead of Problems with AI-Powered Observability. SaaS Website LinkedIn
Mezmo Combine intelligent telemetry with AI-driven observability to detect issues, pinpoint root cause, and power agentic operations across logs, metrics, and traces. SaaS Website LinkedIn
Observe, Inc. Observe is a modern observability platform built on a streaming data lake, for faster search and correlation at lower cost. SaaS Website LinkedIn
Sentry Application performance monitoring for developers and software teams to see errors more clearly, solve issues faster, and improve reliability continuously. SaaS Website GitHub X
SIXTA AI-powered root cause analysis for database reliability. SaaS Website LinkedIn

Back to top ↑

AIOps (18)

Name Summary Deployment Links
BigPanda AIOps platform for event correlation, incident detection, and response orchestration across modern IT operations. SaaS Website LinkedIn X
Ciroos Ciroos transforms SRE with AI-driven automation, reducing toil, detecting anomalies early, and accelerating incident investigations. SaaS Website LinkedIn
Cloudship AI AI platform for cloud and platform engineering workflows focused on reliability and operations. SaaS Website LinkedIn
Cokpit Cokpit scales with your needs, from startups to global enterprises. SaaS Website LinkedIn X
💚HolmesGPT Open source AI SRE agent that iteratively investigates incidents using data from your Kubernetes and observability stack. Hybrid Website GitHub LinkedIn X
💚K8sGPT K8sGPT is an AI-powered tool that helps diagnose and fix Kubernetes issues with intelligent insights and automated troubleshooting. Hybrid Website GitHub X
💚Kagent Open-source Kubernetes-native framework for building and running AI agents that automate DevOps operations and troubleshooting tasks. Hybrid Website GitHub
Komodor Komodor automatically detects, investigates and remediates complex issues to proactively reduce cloud costs, slash MTTR and vanquish TicketOps. SaaS Website LinkedIn
Kura AI platform for engineering operations and incident response automation in modern infrastructure environments. SaaS Website
NudgeBee Agentic AI platform for SRE & CloudOps, troubleshooting, cost optimization, and no-code workflow automation. SaaS Website LinkedIn
💚Obot Open source agent platform for creating, running, and integrating autonomous assistants across workflows. Hybrid Website GitHub
Opsy AI-powered reliability operations platform for faster incident response and SRE workflow automation. SaaS Website GitHub
Robusta Dev Robusta's AI assistant empowers teams to troubleshoot Prometheus and Kubernetes alerts faster, leading to reduced MTTR and enhanced engineering productivity. Multi Website GitHub LinkedIn X
RunWhen RunWhen is committed to simplifying troubleshooting for complex cloud systems with the help of AI powered Engineering Assistants capable of suggesting what to run, and. SaaS Website LinkedIn
SRE Bench Evaluation and benchmarking platform for SRE agents and operational AI reliability workflows. SaaS Website LinkedIn
SRE.ai SRE.ai is the most advanced natural language DevOps platform, powering automation and software delivery for fast-moving organizations at scale, freeing up teams to build. SaaS Website LinkedIn
💚Stakpak An open source agent that lives on your machines 24/7, keeps your apps running, and only pings when it needs a human. SaaS Website GitHub LinkedIn X
StarSling Multi-agent automation platform that orchestrates AI workflows for operations, troubleshooting, and remediation. SaaS Website LinkedIn X

Back to top ↑

IDP (2)

Name Summary Deployment Links
Rebase Every company needs to become an AI company. Rebase is the infrastructure to get there: connect all your systems, access any LLMs, and deploy AI agents across your. SaaS Website LinkedIn
StackGen Autonomous infrastructure platform powered by Aiden for platform engineering, DevOps, and SRE teams to automate provisioning, governance, and operations. Hybrid Website LinkedIn

Back to top ↑

IaC (1)

Name Summary Deployment Links
Ops0 ops0 automates how infrastructure is created, managed, and operated. Turn intent into IaC, apply updates intelligently, and resolve issues before they happen all powered. SaaS Website LinkedIn X

Back to top ↑

Security (2)

Name Summary Deployment Links
Cloudgeni AI-powered cloud infrastructure platform that detects misconfigurations, remediates security and compliance issues, and generates reviewable infrastructure changes through deterministic workflows. SaaS Website LinkedIn
Infrabase Infrabase scans code and organizational context to surface security gaps, cost spikes, and policy breaks before they ever hit your cloud. SaaS Website

Back to top ↑

Deployment (3)

Name Summary Deployment Links
Cutover Cutover's cloud-hosted Collaborative Automation platform connects teams and technology, helping you manage disaster recovery, migration, and release. SaaS Website LinkedIn X
Lens K8s IDE Kubernetes IDE for cluster operations and troubleshooting with AI-assisted diagnostics via Lens Prism. Hybrid Website GitHub LinkedIn X
💚Skyflo.ai Skyflo is an open-source AI agent for DevOps and cloud operations. It plans, executes, and verifies infrastructure changes across Kubernetes, CI/CD, and cloud platforms. Hybrid Website GitHub X

Back to top ↑