TARSy

TARSy (Thoughtful Alert Response System) is an intelligent Site Reliability Engineering system that automatically processes alerts through sequential agent chains, retrieves runbooks, and uses MCP (Model Context Protocol) servers to gather system information for comprehensive multi-stage incident analysis.

Inspired by the spirit of sci-fi AI, TARSy is your reliable companion for SRE operations. 🚀

(Demo video: tarsy.webm)

Documentation

Prerequisites

For Development Mode

  • Python 3.13+ - Core backend runtime
  • Node.js 18+ - Frontend development and build tools
  • npm - Node.js package manager (comes with Node.js)
  • uv - Modern Python package and project manager
    • Install: pip install uv
    • Alternative: curl -LsSf https://astral.sh/uv/install.sh | sh

For Container Deployment (Additional)

  • Podman (or Docker) - Container runtime
  • podman-compose - Multi-container application management
    • Install: pip install podman-compose

Quick Check: Run make check-prereqs to verify all prerequisites are installed.

Quick Start

Development Mode (Direct Backend)

# 1. Initial setup (one-time only)
make setup

# 2. Configure API keys (REQUIRED)
# Edit backend/.env and set your API keys:
# - GOOGLE_API_KEY (get from https://aistudio.google.com/app/apikey)
# - GITHUB_TOKEN (get from https://github.com/settings/tokens)

# 3. Ensure Kubernetes/OpenShift access (REQUIRED)
# See the Kubernetes/OpenShift Access Requirements section below for details

# 4. Start all services  
make dev

Once started, the dashboard will be available at http://localhost:5173.

Stop all services: make stop

Container Deployment (Production-like)

For production-like testing with containerized services, authentication, and database:

# 1. Initial setup (one-time only)
make setup

# 2. Configure API keys and OAuth (REQUIRED)
# Edit backend/.env and set your API keys + OAuth configuration
# - See config/README.md for OAuth2 proxy customization (client IDs, secrets, org/team)
# - See docs/oauth2-proxy-setup.md for detailed GitHub OAuth setup guide
# - Configure LLM providers in backend/.env (GOOGLE_API_KEY, etc.)

# 3. Deploy complete containerized stack
make containers-deploy        # Preserves database data (recommended)
# OR for fresh start:
make containers-deploy-fresh  # Clean rebuild including database

Once deployed, the stack will be available at http://localhost:8080 (behind GitHub OAuth).

Container Management:

  • Update apps (preserve database): make containers-deploy
  • Fresh deployment: make containers-deploy-fresh
  • Stop containers: make containers-stop
  • View logs: make containers-logs
  • Check status: make containers-status
  • Clean up: make containers-clean (removes all containers and data)

OpenShift/Kubernetes Deployment

For deploying TARSy to OpenShift or Kubernetes clusters:

# Complete deployment with local builds
make openshift-deploy

📖 For the complete OpenShift deployment guide: See deploy/README.md

This deployment is designed for development and testing environments, serving as a reference for production deployments in separate repositories.

Key Features

  • 🛠️ Configuration-Based Agents: Deploy new agents and chain definitions via YAML configuration without code changes
  • 🔧 Flexible Alert Processing: Accept arbitrary JSON payloads from any monitoring system
  • 🧠 Chain-Based Agent Architecture: Specialized agents with domain-specific tools and AI reasoning working in coordinated stages
  • 📊 Comprehensive Audit Trail: Complete visibility into chain processing workflows with stage-level timeline reconstruction
  • 🖥️ SRE Dashboard: Real-time monitoring with live LLM streaming and interactive chain timeline visualization
  • 🔒 Data Masking: Automatic protection of sensitive data in logs and responses

Architecture

TARSy uses an AI-powered, chain-based architecture: alerts flow through sequential stages of specialized agents that build on each other's work, using domain-specific tools to produce comprehensive expert recommendations for engineers.

📖 For high-level architecture concepts: See Architecture Overview

How It Works

  1. Alert arrives from monitoring systems with flexible JSON payload
  2. Orchestrator selects appropriate agent chain based on alert type
  3. Runbook downloaded automatically from GitHub for chain guidance
  4. Sequential stages execute where each agent builds upon previous stage data using AI to select and execute domain-specific tools
  5. Comprehensive multi-stage analysis provided to engineers with actionable recommendations
  6. Full audit trail captured with stage-level detail for monitoring and continuous improvement

sequenceDiagram
    participant MonitoringSystem
    participant Orchestrator
    participant AgentChains
    participant GitHub
    participant AI
    participant MCPServers
    participant Dashboard
    participant Engineer

    MonitoringSystem->>Orchestrator: Send Alert
    Orchestrator->>AgentChains: Assign Alert & Context
    AgentChains->>GitHub: Download Runbook
    loop Investigation Loop
        AgentChains->>AI: Investigate with LLM
        AI->>MCPServers: Query/Actuate as needed
    end
    AgentChains->>Dashboard: Send Analysis & Recommendations
    Engineer->>Dashboard: Review & Take Action

Usage

Development Mode

  1. Start All Services: Run make dev to start backend and dashboard
  2. Submit an Alert: Use Manual Alert Submission at http://localhost:5173/submit-alert for testing TARSy
  3. Monitor via Dashboard: Watch real-time progress updates and historical analysis at http://localhost:5173
  4. View Results: See detailed processing timelines and comprehensive LLM analysis
  5. Stop Services: Run make stop when finished

Container Deployment Mode

  1. Deploy Stack: Run make containers-deploy (preserves database) or make containers-deploy-fresh (clean start)
  2. Login: Navigate to http://localhost:8080 and authenticate via GitHub OAuth
  3. Submit Alert: Use the dashboard at http://localhost:8080/submit-alert (OAuth protected)
  4. Monitor Processing: Watch real-time progress with full audit trail
  5. Stop Containers: Run make containers-stop when finished

Tip: Use make status or make containers-status to check which services are running.

Container Architecture

The containerized deployment provides a production-like environment with:

  • πŸ” OAuth2 Authentication: GitHub OAuth integration via oauth2-proxy
  • πŸ”„ Reverse Proxy: Nginx handles all traffic routing and CORS
  • πŸ—„οΈ PostgreSQL Database: Persistent storage for processing history
  • πŸ“¦ Production Builds: Optimized frontend and backend containers
  • πŸ”’ Security: All API endpoints protected behind authentication

Architecture Overview:

Browser → Nginx (8080) → OAuth2-Proxy → Backend (FastAPI)
                      ↘ Dashboard (Static Files)

📖 For OAuth2-proxy setup instructions: See docs/oauth2-proxy-setup.md
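The routing above can be sketched as a minimal Nginx configuration. This is an illustration only, not the shipped config: the upstream names, backend port 8000, and location blocks are assumptions (see the deploy/ and config/ directories for the real setup).

```nginx
# Hypothetical sketch of the routing in the diagram above -- not the shipped config.
server {
    listen 8080;

    # oauth2-proxy handles the GitHub OAuth flow and session cookies
    location /oauth2/ {
        proxy_pass http://oauth2-proxy:4180;
    }

    # API traffic goes to the FastAPI backend, gated by an auth subrequest
    location /api/ {
        auth_request /oauth2/auth;
        proxy_pass http://backend:8000;
    }

    # Everything else serves the pre-built dashboard static files
    location / {
        auth_request /oauth2/auth;
        root /usr/share/nginx/html;
        try_files $uri /index.html;
    }
}
```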

Supported Alert Types

The system supports flexible alert types from any monitoring source:

Current Agent Types

  • Kubernetes Agent: Processes alerts from Kubernetes clusters (namespaces, pods, services, etc.)

Flexible Alert Support

  • Any Monitoring System: Accepts arbitrary JSON payloads from Prometheus, AWS CloudWatch, ArgoCD, Datadog, etc.
  • Agent-Agnostic Processing: New alert types can be added by creating specialized agents and updating the agent registry
  • LLM-Driven Analysis: Agents intelligently interpret any alert data structure without code changes to core system

The LLM-driven approach with flexible data structures means diverse alert types can be handled from any monitoring source, as long as:

  • A runbook exists for the alert type
  • An appropriate specialized agent is available or can be created
  • The MCP servers have relevant tools for the monitoring domain

Kubernetes/OpenShift Access Requirements

TARSy requires read-only access to a Kubernetes or OpenShift cluster to analyze and troubleshoot Kubernetes infrastructure issues. The system uses the kubernetes-mcp-server, which connects to your cluster via kubeconfig.

🔗 How TARSy Accesses Your Cluster

TARSy does not use oc or kubectl commands directly. Instead, it:

  1. Uses the Kubernetes MCP Server: All cluster access goes through the kubernetes-mcp-server
  2. Reads kubeconfig: Authenticates using your existing kubeconfig file
  3. Read-Only Operations: Configured with --read-only --disable-destructive flags
  4. No Modifications: Cannot create, update, or delete cluster resources

βš™οΈ Setup Instructions

Option 1: Use Existing Session (Recommended)

If you're already logged into your OpenShift/Kubernetes cluster:

# Verify your current access
oc whoami
oc cluster-info

# TARSy will automatically use your current kubeconfig
# Default location: ~/.kube/config or $KUBECONFIG

Option 2: Custom Kubeconfig

To use a specific kubeconfig file:

# Set in backend/.env
KUBECONFIG=/path/to/your/kubeconfig

# Or set environment variable
export KUBECONFIG=/path/to/your/kubeconfig

API Endpoints

Core API

  • GET /health - Comprehensive health check with service status and warnings (HTTP 503 for degraded/unhealthy)
  • POST /api/v1/alerts - Submit a new alert for processing (returns session_id immediately)
  • GET /api/v1/alert-types - Get supported alert types
  • WebSocket /api/v1/ws - Real-time progress updates via WebSocket with channel subscriptions
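To illustrate the submission flow, here is a minimal sketch of posting an alert and capturing the returned session_id. The endpoint path and the immediate session_id return come from the list above; the payload field names and the backend port 8000 are assumptions, since TARSy accepts arbitrary JSON.

```python
import json
from urllib.request import Request, urlopen

# Illustrative payload -- field names are assumptions; TARSy accepts
# arbitrary JSON from any monitoring system.
alert = {
    "alert_type": "kubernetes",
    "severity": "critical",
    "data": {"namespace": "payments", "pod": "api-7f9c", "reason": "CrashLoopBackOff"},
}

def submit_alert(base_url: str = "http://localhost:8000") -> str:
    """POST the alert and return the session_id for follow-up queries."""
    req = Request(
        f"{base_url}/api/v1/alerts",
        data=json.dumps(alert).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["session_id"]

# The returned session_id can then be used with
# GET /api/v1/history/sessions/{session_id} to retrieve the timeline.
```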

History API

  • GET /api/v1/history/sessions - List alert processing sessions with filtering and pagination
  • GET /api/v1/history/sessions/{session_id} - Get detailed session with chronological timeline

System API

  • GET /api/v1/system/warnings - Active system warnings (MCP/LLM init failures, etc.)

Development

Adding New Components

  • Alert Types: Define any alert type in config/agents.yaml - no hardcoding required, just create corresponding runbooks
  • MCP Servers: Define custom MCP servers in config/agents.yaml with support for stdio, HTTP, and SSE transports. Can override built-in MCP servers (e.g., customize kubernetes-server with specific kubeconfig)
  • Agents: Create traditional hardcoded agent classes extending BaseAgent, or define configuration-based agents in config/agents.yaml. Can override built-in agents to customize behavior
  • Chains: Define multi-stage workflows in config/agents.yaml. Can override built-in chains to customize investigation workflows
  • LLM Providers: Built-in providers work out-of-the-box (OpenAI, Google, xAI, Anthropic, Vertex AI). Add custom providers via config/llm_providers.yaml for proxy configurations or model overrides
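To give a feel for configuration-based agents and chains, here is a hypothetical config/agents.yaml fragment. All keys, agent names, and the overall schema below are illustrative assumptions, not the real format; consult the Architecture Overview and the shipped config/agents.yaml for the actual schema.

```yaml
# Hypothetical illustration only -- keys and structure are assumptions.
agents:
  database-agent:
    mcp_servers: [postgres-server]
    custom_instructions: "Focus on connection pools and slow queries."

chains:
  database-incident-chain:
    alert_types: [DatabaseSlowdown]
    stages:
      - name: data-collection
        agent: database-agent
      - name: analysis
        agent: database-agent
```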

📖 For detailed extensibility examples: See the Extensibility section in the Architecture Overview

Database Migrations

TARSy uses Alembic for database schema versioning and migrations. The migration system automatically applies pending migrations on startup, ensuring your database schema is always up-to-date.

Quick Migration Workflow:

# 1. Modify SQLModel in backend/tarsy/models/
# 2. Generate migration from model changes
make migration msg="Add new field to AlertSession"

# 3. Review generated file in backend/alembic/versions/
# 4. Test migration
make migration-upgrade

# 5. If needed, rollback
make migration-downgrade

Available Migration Commands:

make migration msg="Description"     # Generate migration from model changes
make migration-manual msg="Desc"     # Create empty migration for manual changes
make migration-upgrade               # Apply all pending migrations
make migration-downgrade             # Rollback last migration
make migration-status                # Show current database version
make migration-history               # Show full migration history

📖 For complete migration documentation: See docs/database-migrations.md

Running Tests

# Run back-end and front-end (dashboard) tests
make test

The test suite includes comprehensive end-to-end integration tests covering the complete alert processing pipeline, agent specialization, error handling, and performance scenarios with full mocking of external services.
