An intelligent Site Reliability Engineering system that automatically processes alerts through sequential agent chains, retrieves runbooks, and uses MCP (Model Context Protocol) servers to gather system information for comprehensive multi-stage incident analysis.
Inspired by the spirit of sci-fi AI, TARSy is your reliable companion for SRE operations.
- README.md: This file - project overview and quick start
- docs/architecture-overview.md: High-level architecture concepts and design principles
- docs/requirements.md: Application requirements and specifications
- docs/design.md: Detailed system design and architecture documentation
Before running TARSy, ensure you have the following tools installed:
- Python 3.11+ - Core backend runtime
- Node.js 18+ - Frontend development and build tools
- npm - Node.js package manager (comes with Node.js)
- uv - Modern Python package and project manager
  - Install: `pip install uv`
  - Alternative: `curl -LsSf https://astral.sh/uv/install.sh | sh`

Quick Check: Run `make check-prereqs` to verify all prerequisites are installed.
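If you'd rather verify by hand, each tool's standard version flag works too:

```bash
# Confirm each prerequisite is installed and on the PATH
python3 --version   # expect 3.11 or newer
node --version      # expect v18 or newer
npm --version
uv --version
```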
```bash
# 1. Initial setup (one-time only)
make setup
# 2. Configure API keys (REQUIRED)
# Edit backend/.env and set your API keys:
# - GEMINI_API_KEY (get from https://aistudio.google.com/app/apikey)
# - GITHUB_TOKEN (get from https://github.com/settings/tokens)
# 3. Ensure Kubernetes/OpenShift access (REQUIRED)
# See [K8s Access Requirements](#k8s-access-reqs) section below for details
# 4. Start all services
make dev
```

Services will be available at:
- TARSy Dashboard: http://localhost:5173 (operational monitoring)
- Alert Dev UI: http://localhost:3001 (alert testing)
- Backend API: http://localhost:8000 (docs at /docs)

Stop all services: `make stop`
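As a quick smoke test once the services are up, the health endpoints documented under API Endpoints below can be queried directly:

```bash
# Verify the backend is responding
curl http://localhost:8000/
curl http://localhost:8000/health
```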
- Sequential Agent Chains: Multi-stage workflows where specialized agents build upon each other's work for comprehensive analysis
- Configuration-Based Agents: Deploy new agents and chain definitions via YAML configuration without code changes
- Flexible Alert Processing: Accept arbitrary JSON payloads from any monitoring system
- Chain-Based Agent Architecture: Specialized agents with domain-specific tools and AI reasoning working in coordinated stages
- Comprehensive Audit Trail: Complete visibility into chain processing workflows with stage-level timeline reconstruction
- SRE Dashboard: Real-time monitoring and historical analysis with interactive chain timeline visualization
- Data Masking: Automatic protection of sensitive data in logs and responses
TARSy uses an AI-powered, chain-based architecture: alerts flow through sequential stages of specialized agents, each building on the previous stage's work and using domain-specific tools to deliver comprehensive expert recommendations to engineers.
For high-level architecture concepts: see the Architecture Overview
- Alert arrives from monitoring systems with flexible JSON payload
- Orchestrator selects appropriate agent chain based on alert type
- Runbook downloaded automatically from GitHub for chain guidance
- Sequential stages execute, with each agent building on the previous stage's data and using AI to select and run domain-specific tools
- Comprehensive multi-stage analysis provided to engineers with actionable recommendations
- Full audit trail captured with stage-level detail for monitoring and continuous improvement
```mermaid
sequenceDiagram
participant MonitoringSystem
participant Orchestrator
participant AgentChains
participant GitHub
participant AI
participant MCPServers
participant Dashboard
participant Engineer
MonitoringSystem->>Orchestrator: Send Alert
Orchestrator->>AgentChains: Assign Alert & Context
AgentChains->>GitHub: Download Runbook
loop Investigation Loop
AgentChains->>AI: Investigate with LLM
AI->>MCPServers: Query/Actuate as needed
end
AgentChains->>Dashboard: Send Analysis & Recommendations
Engineer->>Dashboard: Review & Take Action
```
- Start All Services: Run `make dev` to start the backend, dashboard, and alert UI
- Submit an Alert: Use the alert dev UI at http://localhost:3001 to simulate an alert
- Monitor via Dashboard: Watch real-time progress updates and historical analysis at http://localhost:5173
- View Results: See detailed processing timelines and comprehensive LLM analysis
- Stop Services: Run `make stop` when finished

Tip: Use `make urls` to see all available service endpoints and `make status` to check which services are running.
The system now supports flexible alert types from any monitoring source:
- Kubernetes Agent: Processes alerts from Kubernetes clusters (namespaces, pods, services, etc.)
- Any Monitoring System: Accepts arbitrary JSON payloads from Prometheus, AWS CloudWatch, ArgoCD, Datadog, etc. (see the example after this list)
- Agent-Agnostic Processing: New alert types can be added by creating specialized agents and updating agent registry
- LLM-Driven Analysis: Agents intelligently interpret any alert data structure without code changes to core system
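For illustration, here is what submitting such an alert might look like with `curl`. The payload field names (`alert_type`, `severity`, `data`) are hypothetical, since the system accepts arbitrary JSON; only the `POST /alerts` endpoint itself is documented under API Endpoints below:

```bash
# Sketch of submitting an arbitrary JSON alert (field names are illustrative)
curl -X POST http://localhost:8000/alerts \
  -H "Content-Type: application/json" \
  -d '{
    "alert_type": "NamespaceTerminating",
    "severity": "warning",
    "data": {"namespace": "payments", "cluster": "prod-east"}
  }'
```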
The LLM-driven approach with flexible data structures means diverse alert types can be handled from any monitoring source, as long as:
- A runbook exists for the alert type
- An appropriate specialized agent is available or can be created
- The MCP servers have relevant tools for the monitoring domain
TARSy requires read-only access to a Kubernetes or OpenShift cluster to analyze and troubleshoot Kubernetes infrastructure issues. The system uses the kubernetes-mcp-server, which connects to your cluster via kubeconfig.
TARSy does not use `oc` or `kubectl` commands directly. Instead, it:
- Uses Kubernetes MCP Server: Runs `kubernetes-mcp-server@latest` via npm
- Reads kubeconfig: Authenticates using your existing kubeconfig file
- Read-Only Operations: Configured with the `--read-only --disable-destructive` flags
- No Modifications: Cannot create, update, or delete cluster resources
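Pieced together from the points above, the underlying invocation is roughly the following sketch (the exact command line TARSy builds may differ):

```bash
# Approximate equivalent of how TARSy launches the MCP server (read-only)
npx -y kubernetes-mcp-server@latest \
  --read-only \
  --disable-destructive \
  --kubeconfig ~/.kube/config
```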
If you're already logged into your OpenShift/Kubernetes cluster:
```bash
# Verify your current access
oc whoami
oc cluster-info
# TARSy-bot will automatically use your current kubeconfig
# Default location: ~/.kube/config or $KUBECONFIG
```

To use a specific kubeconfig file:
```bash
# Set in backend/.env
KUBECONFIG=/path/to/your/kubeconfig
# Or set environment variable
export KUBECONFIG=/path/to/your/kubeconfig
```

Common Issues:
```bash
# Check kubeconfig validity
oc cluster-info
# Verify TARSy can access cluster
# Check backend logs for kubernetes-mcp-server errors
tail -f backend/logs/tarsy.log | grep kubernetes
# Test kubernetes-mcp-server independently
npx -y kubernetes-mcp-server@latest --kubeconfig ~/.kube/config --help
```

Permission Errors:
- Ensure your user/service account has at least the `view` cluster role
- Verify the kubeconfig points to the correct cluster
- Check network connectivity to the cluster API server
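To verify the first point, standard `kubectl`/`oc` commands can confirm read access; the username below is a placeholder:

```bash
# Confirm read access to cluster resources
kubectl auth can-i get pods --all-namespaces
kubectl auth can-i list events --all-namespaces

# OpenShift: grant the built-in view cluster role (requires admin privileges)
oc adm policy add-cluster-role-to-user view <username>
```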
- `GET /` - Health check endpoint
- `GET /health` - Comprehensive health check with service status
- `POST /alerts` - Submit a new alert for processing
- `GET /alert-types` - Get supported alert types
- `GET /processing-status/{alert_id}` - Get processing status
- `WebSocket /ws/{alert_id}` - Real-time progress updates
- `GET /api/v1/history/sessions` - List alert processing sessions with filtering and pagination
- `GET /api/v1/history/sessions/{session_id}` - Get detailed session with chronological timeline
- `GET /api/v1/history/health` - History service health check and database status
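For example, the history endpoints can be queried with `curl`; the filter parameter shown is illustrative, so check the interactive docs at `/docs` for the exact names:

```bash
# List processing sessions (pagination/filter params are illustrative)
curl "http://localhost:8000/api/v1/history/sessions?page=1"

# Fetch the detailed chronological timeline for a single session
curl "http://localhost:8000/api/v1/history/sessions/<session_id>"
```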
- Alert Types: Define any alert type in `config/agents.yaml` - no hardcoding required; just create corresponding runbooks
- MCP Servers: Update the `mcp_servers` configuration in `settings.py`, or define it in `config/agents.yaml`
- Agents: Create traditional hardcoded agent classes extending `BaseAgent`, or define configuration-based agents in `config/agents.yaml`
- LLM Providers: Add new providers by extending the LLM client configuration
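As a sketch of the configuration-based route, a new agent and its chain might be declared along these lines; the YAML keys are hypothetical, so consult the Architecture Overview below for the actual schema:

```bash
# Hypothetical config/agents.yaml fragment -- key names are illustrative only
cat <<'EOF' >> config/agents.yaml
agents:
  database-agent:
    mcp_servers: [postgres-mcp]

chains:
  database-incident-chain:
    alert_types: [DatabaseSlowQuery]
    stages:
      - agent: database-agent
EOF
```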
For detailed extensibility examples: see the Extensibility section in the Architecture Overview
```bash
# Run back-end and front-end (dashboard) tests
make test
```

The test suite includes comprehensive end-to-end integration tests covering the complete alert processing pipeline, agent specialization, error handling, and performance scenarios, with full mocking of external services.