A sophisticated multi-agent system for Site Reliability Engineering (SRE) operations with an interactive chat interface. This application enables engineers to query infrastructure, Kubernetes clusters, and monitoring systems using natural language.
This project is a multi-agent AI system that provides a conversational interface for SRE operations. It consists of:
- SRE Agent: A main orchestrating agent that intelligently routes queries to specialized agents
- Kubectl Agent: Handles Kubernetes-related queries through kubectl-ai MCP server (pods, deployments, services, logs, etc.)
- Prometheus Agent: Manages metrics and monitoring queries (HTTP request metrics, error rates, performance data)
- Web Interface: A modern React-based chat UI for interacting with the agents
The system is built with LangGraph for agent orchestration and supports multiple LLM providers. It provides real-time insight into infrastructure health and application performance, and helps correlate issues across different systems.
- **Intelligent Query Routing**: Automatically determines which specialized agents to invoke based on user queries
- **Multi-Agent Coordination**: Correlates data from multiple sources (Kubernetes, Prometheus) for unified insights
- **Rich Visualizations**: Displays metrics, pod status, and error rates in formatted tables and summaries
- **Web UI**: Modern, responsive chat interface for seamless interaction
- **Graceful Degradation**: Falls back to mock data when external services are unavailable
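The routing step can be pictured as a small classifier over the query text. The sketch below is illustrative only; the agent names and keyword heuristics are assumptions for demonstration, not the project's actual implementation:

```typescript
// Illustrative query-routing sketch (not the project's actual code).
// Agent names and keyword heuristics are assumptions.
type AgentName = "kubectl" | "prometheus";

function routeQuery(query: string): AgentName[] {
  const q = query.toLowerCase();
  const agents = new Set<AgentName>();
  // Kubernetes-flavored terms go to the kubectl agent.
  if (/\b(pods?|deployments?|services?|namespaces?|kubectl|logs?)\b/.test(q)) {
    agents.add("kubectl");
  }
  // Metrics-flavored terms go to the Prometheus agent.
  if (/\b(metrics?|error rate|latency|request rate|prometheus)\b/.test(q)) {
    agents.add("prometheus");
  }
  // Default: ask both so the orchestrator can correlate results.
  if (agents.size === 0) return ["kubectl", "prometheus"];
  return [...agents];
}
```

A real router would typically let the LLM make this decision, but the effect is the same: one query may fan out to several specialized agents.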
- Node.js 20 or higher
- npm 11.2.1 or higher
- Access to Kubernetes cluster (optional, for kubectl agent)
- Prometheus endpoint (optional, for metrics queries)
```bash
git clone <repository-url>
cd SRE-Agent
```

Install all project dependencies (this will install dependencies for both the root workspace and all apps):

```bash
npm install
```

Create a `.env` file in the project root with the following configuration:

```bash
# Google Gemini API Key (required)
GOOGLE_API_KEY=your-gemini-api-key-here

# Kubectl-AI MCP Server Configuration (optional)
# Note: Default port in start-kubectl-ai-mcp.sh is 8180
KUBECTL_AI_MCP_ENDPOINT=http://localhost:8180
KUBECTL_AI_API_KEY=your_api_key_here

# Prometheus Configuration (optional)
# Use localhost:9090 when using port forwarding
PROMETHEUS_ENDPOINT=http://localhost:9090
PROMETHEUS_API_KEY=your_prometheus_api_key

# Default LLM Model (optional)
DEFAULT_MODEL=google/gemini-2.0-flash-lite

# Proxy Configuration (if behind corporate firewall)
# HTTP_PROXY=http://your-proxy-server:port
# HTTPS_PROXY=http://your-proxy-server:port
# NO_PROXY=localhost,127.0.0.1,.local,.internal
```

**Note:**
- Refer to the `.env.example` file for the full configuration. Copy its contents to `.env` and update with relevant values.
- See `KUBECTL_AI_SETUP.md` for kubectl-ai MCP server setup and Prometheus port-forwarding instructions.
- See `PROXY_SETUP.md` for detailed proxy configuration if you're behind a corporate firewall.
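Since only the Gemini key is mandatory, configuration loading can fall back to the defaults shown above. The sketch below illustrates this; the function and field names are assumptions, not the project's actual code:

```typescript
// Illustrative config-loading sketch (names and structure are assumptions).
// Defaults mirror the .env example above.
interface SreAgentConfig {
  googleApiKey: string;
  kubectlMcpEndpoint: string;
  prometheusEndpoint: string;
  defaultModel: string;
}

function loadConfig(
  env: Record<string, string | undefined> = process.env
): SreAgentConfig {
  const googleApiKey = env.GOOGLE_API_KEY;
  if (!googleApiKey) {
    // Only the Gemini key is required; everything else has a default.
    throw new Error("GOOGLE_API_KEY is required");
  }
  return {
    googleApiKey,
    kubectlMcpEndpoint: env.KUBECTL_AI_MCP_ENDPOINT ?? "http://localhost:8180",
    prometheusEndpoint: env.PROMETHEUS_ENDPOINT ?? "http://localhost:9090",
    defaultModel: env.DEFAULT_MODEL ?? "google/gemini-2.0-flash-lite",
  };
}
```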
You have two options to run the application:

**Option A: Run both Agent and Web together (Recommended)**

```bash
npm run dev
```

This command starts both:
- **Agent Server**: Runs on port 2024 (default)
- **Web UI**: Runs on port 5173 (default Vite port)

**Option B: Run with simplified agent setup**

```bash
npm run dev:simple
```

This uses a simplified agent configuration without full LangGraph Studio integration.
Once the servers are running:
- **Web UI**: Open your browser and navigate to `http://localhost:5173`
- **Agent API**: Available at `http://localhost:2024`
If you need to start components separately:

```bash
cd apps/agents
npm run dev
```

Or with simplified mode:

```bash
cd apps/agents
npm run dev:simple
```

```bash
cd apps/web
npm run dev
```

The web UI will be available at `http://localhost:5173`.
Before starting the agent, ensure:

- **kubectl-ai MCP Server** is running (if using Kubernetes features):

  ```bash
  ./scripts/start-kubectl-ai-mcp.sh
  ```

- **Prometheus port forwarding** is active (if using metrics features):

  ```bash
  ./scripts/accessP8sViaPortFwd.sh
  # Or run in background:
  nohup ./scripts/accessP8sViaPortFwd.sh &
  ```

See `KUBECTL_AI_SETUP.md` for detailed setup instructions.
Then:

- **Access the Web UI**: Open `http://localhost:5173` in your browser
- **Configure Connection**: Enter the following:
  - Deployment URL: `http://localhost:2024` (for local development)
  - Assistant/Graph ID: `sre_agent`
  - LangSmith API Key: (Optional, only required for deployed servers)
- **Start Chatting**: Begin asking questions about your infrastructure!
Kubernetes Queries:
- "What's the status of the user-service deployment?"
- "Show me pods with high CPU usage"
- "List all pods in the default namespace"
- "What's the version of metrics-demo microservice?"
Prometheus Queries:
- "What's the current HTTP request rate for the API?"
- "Show me the error rate for metrics-demo in the last hour"
- "What's the success/failure rate for my application?"
Multi-Agent Queries:
- "Check ms version, pod health, success/failure metrics for metrics-demo ms in default namespace"
- "Is the high error rate related to pod failures?"
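For multi-agent queries like the ones above, correlation amounts to merging each agent's findings into one report. A minimal sketch, assuming hypothetical result shapes (all field names are illustrative, not the project's actual types):

```typescript
// Minimal correlation sketch (field names are illustrative assumptions).
interface KubectlFindings {
  unhealthyPods: number;
  version: string;
}
interface PromFindings {
  errorRatePct: number;
}

function correlate(k8s: KubectlFindings, prom: PromFindings): string {
  const lines = [
    `metrics-demo version: ${k8s.version}`,
    `Unhealthy pods: ${k8s.unhealthyPods}`,
    `Error rate: ${prom.errorRatePct.toFixed(1)}%`,
  ];
  // Flag a likely link between pod failures and elevated errors.
  if (k8s.unhealthyPods > 0 && prom.errorRatePct > 5) {
    lines.push("Possible correlation: pod failures may explain the high error rate.");
  }
  return lines.join("\n");
}
```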
**Input Query, Example #1:**

```
check ms version, pod health, success/failure metrics for metrics-demo ms in default namespace
```

**Response output #1:**

**Input Query, Example #2:**

```
perform same check again
```
The agent provides a comprehensive analysis including:
- Summary Section
- Detailed Analysis
- Recommendations
- Suggested Next Steps
See `SampleOutput.md` for the complete formatted output example.
```
SRE-Agent/
├── apps/
│   ├── agents/                  # LangGraph agents (SRE Agent, Research Agent)
│   │   └── src/
│   │       ├── sre-agent/       # Main SRE agent with kubectl & prometheus tools
│   │       └── research-agent/  # Research agent for document retrieval
│   └── web/                     # React-based web UI
│       └── src/
│           ├── components/      # UI components
│           ├── providers/       # LangGraph client providers
│           └── hooks/           # React hooks
├── langgraph.json               # LangGraph configuration
├── package.json                 # Root workspace configuration
└── .env                         # Environment variables (create this)
```
- **Agent Documentation**: See `apps/agents/src/sre-agent/README.md` for detailed SRE agent architecture
- **Web UI Documentation**: See `apps/web/README.md` for web interface details
- **kubectl-ai Setup**: See `KUBECTL_AI_SETUP.md` for kubectl-ai MCP server setup and configuration
- **Proxy Setup**: See `PROXY_SETUP.md` for corporate firewall configuration
- **Sample Output**: See `SampleOutput.md` for a complete example response
```bash
npm run build
```

```bash
npm run lint
npm run lint:fix  # Auto-fix linting issues
npm run format
```
- **Agent Server Won't Start**
  - Verify the `.env` file exists and contains the required API keys
  - Check that port 2024 is not already in use
  - Review logs in `apps/agents/logs/sre-agent.log`
- **Web UI Can't Connect to Agent**
  - Ensure the agent server is running on port 2024
  - Check that the Deployment URL in the web UI is `http://localhost:2024`
  - Verify the CORS configuration if accessing from a different origin
- **MCP Server Connection Failed**
  - Verify the kubectl-ai MCP server is running
  - Check `KUBECTL_AI_MCP_ENDPOINT` in `.env`
  - The agent will use mock data if the MCP server is unavailable
- **Prometheus Connection Failed**
  - Verify the Prometheus server is accessible
  - Check `PROMETHEUS_ENDPOINT` in `.env`
  - Ensure proper authentication if required
Private project - All rights reserved

