AI-Powered Autonomous Infrastructure Incident Response System
InfraGuard is an intelligent incident response platform that automatically detects, diagnoses, and fixes infrastructure issues using AI agents. Built for the AI Agents Assemble hackathon.
Infrastructure incidents cost companies millions in downtime. Traditional monitoring alerts humans who must:
- Wake up at 3 AM
- Manually diagnose the issue
- Research solutions
- Apply fixes
- Verify resolution
Average MTTR: 30-60 minutes per incident
InfraGuard reduces this to seconds by:
- Detecting anomalies in real-time with Prometheus
- Analyzing root causes with Kestra AI Agent
- Generating fixes automatically with Cline CLI
- Reviewing code quality with CodeRabbit
- Visualizing everything on a Vercel dashboard
| Technology | Usage | Prize Track |
|---|---|---|
| Cline CLI | Autonomous code generation for fixes | Infinity Build ($5K) |
| Kestra | AI Agent for data summarization & decisions | Wakanda Data ($4K) |
| Oumi | RL fine-tuned action selection model | Iron Intelligence ($3K) |
| Vercel | Production dashboard deployment | Stormbreaker ($2K) |
| CodeRabbit | AI code review on all PRs | Captain Code ($1K) |
┌─────────────────────────────────────────────────────────────┐
│                     GITHUB + CODERABBIT                     │
│           Reviews ALL PRs (human + AI-generated)            │
└─────────────────────────────────────────────────────────────┘
                               │
┌──────────────────────────────┴──────────────────────────────┐
│                      LOCAL ENVIRONMENT                      │
│                                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐  │
│  │ Minikube │ → │Prometheus│ → │  Kestra  │ → │  Cline   │  │
│  │ Cluster  │   │ +Grafana │   │ AI Agent │   │   CLI    │  │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         ▼                     ▼                     ▼
  ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
  │ AWS Bedrock │       │Google Colab │       │   Vercel    │
  │  (LLM API)  │       │(Oumi Train) │       │ (Dashboard) │
  └─────────────┘       └─────────────┘       └─────────────┘
- Monitors Kubernetes pods, CPU, memory, restarts
- Custom Prometheus alerts for common issues
- Sub-minute incident detection
- Kestra AI Agent summarizes system state
- Correlates metrics from multiple sources
- Identifies root causes automatically
- Cline generates targeted code patches
- Creates K8s manifest updates
- Opens PRs with proper documentation
- CodeRabbit reviews all generated code
- Catches bugs before they reach production
- Ensures best practices
- Real-time incident feed
- System health at a glance
- Action log with PR links
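As a rough illustration of the detection features listed above, the sketch below flags pods whose restart count over a recent window exceeds a threshold, given a response in the shape returned by the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The PromQL expression, the threshold, the Prometheus URL, and the function names are assumptions made for illustration; they are not taken from InfraGuard's actual alert rules or from `scripts/metrics-api.py`.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical defaults: URL, query, and threshold are illustrative assumptions.
PROMETHEUS_URL = "http://localhost:9090"
RESTART_QUERY = "increase(kube_pod_container_status_restarts_total[5m])"
RESTART_THRESHOLD = 3.0


def fetch_instant_query(base_url: str, query: str) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    params = urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=5) as resp:
        return json.load(resp)


def find_restarting_pods(payload: dict, threshold: float = RESTART_THRESHOLD) -> list:
    """Return (namespace, pod, restarts) tuples for series above the threshold,
    sorted with the most frequently restarting pod first."""
    if payload.get("status") != "success":
        return []
    flagged = []
    for series in payload["data"]["result"]:
        labels = series["metric"]
        # Instant-vector samples have the form [unix_timestamp, "value-as-string"].
        restarts = float(series["value"][1])
        if restarts > threshold:
            flagged.append((labels.get("namespace", "?"), labels.get("pod", "?"), restarts))
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

With Prometheus reachable on the assumed URL (for example through `kubectl port-forward`), calling `find_restarting_pods(fetch_instant_query(PROMETHEUS_URL, RESTART_QUERY))` would list candidate crash-looping pods. Keeping the parsing separate from the HTTP call makes the threshold logic easy to test against canned responses.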
- Docker
- Minikube
- Node.js 18+
- Python 3.10+
- kubectl and Helm 3 (used by the setup commands below)
# Clone the repo
git clone https://github.com/YOUR_USERNAME/infraguard
cd infraguard
# Start Minikube
minikube start --cpus=4 --memory=8192
# Deploy sample apps
kubectl apply -f k8s/manifests/sample-apps.yaml
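# Install monitoring stack (the chart repo must be added before the first install)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f k8s/manifests/prometheus-values.yaml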
# Start Kestra
docker run -d --name kestra -p 8080:8080 kestra/kestra:latest server local
# Start Metrics API
python scripts/metrics-api.py
# Open dashboard
npm run dev --prefix dashboard
# Inject a crash loop
./scripts/inject-incident.sh crash-loop
# Watch the magic happen in the dashboard!
# Cleanup
./scripts/inject-incident.sh cleanup
CodeRabbit reviews every PR in this repository:
- Reviewed 20+ PRs during development
- Caught 5 potential bugs
- Improved documentation quality
- Reviews AI-generated fixes from Cline
We fine-tuned an action selection model using Oumi's GRPO:
- Base Model: SmolLM2-360M-Instruct
- Training Data: 500 synthetic incident scenarios
- Reward Function: +10 (resolved), -10 (failed)
- Training Time: ~30 minutes on T4 GPU
See oumi/training/ for details.
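The reward scheme above can be sketched as a plain Python function. This is a minimal illustration, not the code in `oumi/training/`: it assumes each synthetic scenario records the action expected to resolve it and treats an exact (case-insensitive) match as "resolved". The `Scenario` schema, the action names, and the function names are hypothetical, and wiring the function into a GRPO trainer is left out.

```python
from dataclasses import dataclass

RESOLVED_REWARD = 10.0   # from the list above: +10 when the incident is resolved
FAILED_REWARD = -10.0    # from the list above: -10 when the action fails


@dataclass(frozen=True)
class Scenario:
    """A synthetic incident and the action assumed to resolve it (hypothetical schema)."""
    incident_type: str
    resolving_action: str


def incident_reward(scenario: Scenario, completion: str) -> float:
    """Score one model completion: +10 if it names the resolving action, else -10."""
    chosen = completion.strip().lower()
    if chosen == scenario.resolving_action.lower():
        return RESOLVED_REWARD
    return FAILED_REWARD


def batch_rewards(scenarios: list, completions: list) -> list:
    """Score a batch of completions, one per scenario, preserving order."""
    if len(scenarios) != len(completions):
        raise ValueError("scenarios and completions must have the same length")
    return [incident_reward(s, c) for s, c in zip(scenarios, completions)]
```

For example, `incident_reward(Scenario("crash-loop", "restart_pod"), "restart_pod")` yields the resolved reward, while any other completion yields the failed one. A binary reward like this is simple to reason about, at the cost of giving no partial credit for near-miss actions.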
infraguard/
├── k8s/
│   └── manifests/            # Kubernetes configurations
├── kestra/
│   └── workflows/            # Kestra flow definitions
├── dashboard/                # Next.js Vercel app
├── oumi/
│   └── training/             # Oumi training scripts
├── scripts/
│   ├── metrics-api.py        # Prometheus API wrapper
│   ├── inject-incident.sh    # Demo incident injection
│   └── cline-incident-fix.py
├── cline-tasks/              # Auto-generated Cline tasks
├── .coderabbit.yaml          # CodeRabbit configuration
└── README.md
MIT License - see LICENSE
Built with ❤️ for AI Agents Assemble