Welcome to Project Mayhem, the only platform where breaking your own stuff is not just encouraged, it's automated, AI-powered, and oddly satisfying! Our mission: unleash chaos (safely) on your cloud-native systems, so you can sleep at night knowing your infrastructure is tougher than a caffeinated SRE on call.
Why? Because real resilience is forged in the fires of (simulated) disaster. And because it's way more fun to watch your app survive a CPU spike than to explain to your boss why it didn't.
Our advanced NLP engine reads your postmortems and logs (so you don't have to) and invents new ways to break things. It learns from your past failures, so every disaster is a learning opportunity, literally.
The platform picks the worst possible time and place to inject faults, just like real life! Network partitions, CPU spikes, memory leaks, disk fill-ups, and latency gremlins are all on the menu.
- API key authentication with rate limiting
- Secure secrets management
- Input validation and sanitization
- Non-root Docker containers
- RBAC-enabled Kubernetes deployment
The more chaos you unleash, the smarter the platform gets. It's like a chaos monkey, but with a PhD and a sense of humor.
- Prometheus metrics integration
- Grafana dashboards with 15+ visualizations
- System health monitoring
- Chaos experiment tracking and analytics
- CPU spikes and memory leaks
- Network latency and partitions
- Database slowdowns and timeouts
- Container restarts and process kills
- SSL certificate expiry simulation
- And many more enterprise-grade scenarios!
- Backend: Python 3.9+ with Flask and SQLAlchemy
- Frontend: Modern HTML5/CSS3/JavaScript with Bootstrap 5
- Containerization: Docker with multi-stage builds
- Orchestration: Kubernetes with RBAC and security policies
- Monitoring: Prometheus + Grafana with custom dashboards
- AI/ML: Advanced NLP for log analysis and pattern detection
- Security: API authentication, rate limiting, input validation
- CI/CD: GitHub Actions with automated testing and deployment
- Database: SQLite (dev) / PostgreSQL (production)
- Docker 20.0+ and Docker Compose 2.0+
- Kubernetes 1.21+ (for production deployment)
- Python 3.9+ (for local development)
- Git for version control
git clone https://github.com/yourusername/project-mayhem.git
cd project-mayhem
# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# CHAOS_API_KEY=your_secure_api_key_here
# GRAFANA_PASSWORD=your_secure_password_here
# Start the complete stack
docker-compose up --build -d
# View logs
docker-compose logs -f chaos-orchestrator
- Main UI: http://localhost:5000
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin / your_password)
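For scripted use, the injection call can also be prepared from Python. The following is a minimal sketch using only the standard library. It assumes the `POST /inject` endpoint, the `X-API-Key` header, and the payload fields used in this guide's curl example; `build_inject_request` is a hypothetical helper name, not part of the platform.

```python
import json
import urllib.request


def build_inject_request(base_url, api_key, scenario, duration, intensity):
    """Build (but do not send) a POST /inject request.

    Hypothetical helper: the endpoint and field names mirror the curl
    example in this README and are assumed, not verified against the API.
    """
    payload = {"scenario": scenario, "duration": duration, "intensity": intensity}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/inject",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_inject_request(
    "http://localhost:5000", "your_api_key", "cpu_spike", 30, "medium"
)

# To send it against a running stack, uncomment:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status, resp.read().decode())
```

Building the request separately from sending it makes the payload easy to inspect or log before any chaos is actually unleashed.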
# Using the UI - navigate to http://localhost:5000
# Or using API with your API key:
curl -X POST http://localhost:5000/inject \
  -H "X-API-Key: your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "scenario": "cpu_spike",
    "duration": 30,
    "intensity": "medium"
  }'
# Production deployment to Kubernetes
./scripts/deploy_production.sh
# Or step by step:
kubectl apply -f configs/kubernetes_deployment_production.yaml
# Check deployment status
kubectl get pods -n chaos-engineering
# Generate secure API key
export CHAOS_API_KEY=$(openssl rand -hex 32)
# Create Kubernetes secrets
kubectl create secret generic chaos-secrets \
  --from-literal=api-key="${CHAOS_API_KEY}" \
  --from-literal=database-url="postgresql://..." \
  -n chaos-engineering
# Run all tests
python -m pytest src/tests/ -v
# Run with coverage
python -m pytest src/tests/ --cov=src --cov-report=html
# Start test environment
docker-compose up -d
# Run integration tests
python -m pytest src/tests/test_integration.py -v
# Cleanup
docker-compose down
# Install artillery
npm install -g artillery
# Run load tests
artillery run tests/load-test.yml
- System Metrics: CPU, Memory, Disk, Network utilization
- Chaos Experiments: Success/failure rates, duration tracking
- Application Metrics: Request rates, response times, error rates
- AI Insights: Pattern detection, failure predictions
# System metrics
system_cpu_percent
system_memory_percent
chaos_injections_total
chaos_scenarios_total{scenario_type="cpu_spike"}
# Application metrics
flask_http_request_duration_seconds
flask_http_request_total
The platform includes advanced NLP-based log analysis that:
- Identifies failure patterns automatically
- Predicts potential system failures
- Recommends targeted chaos scenarios
- Generates actionable insights
# Example: AI-generated chaos recommendations
analyzer = NLPLogAnalyzer()
analysis = analyzer.analyze_logs(your_logs)
patterns = analyzer.extract_failure_patterns(analysis)
prediction = analyzer.predict_failure_likelihood(current_metrics, patterns)
- Memory leaks: Detects memory-related failures in logs
- CPU exhaustion: Identifies CPU-related performance issues
- Network issues: Recognizes network partition patterns
- Service failures: Maps service-specific failure patterns
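To make the pattern categories above concrete, here is a deliberately simple keyword-matching sketch. It is not the platform's `NLPLogAnalyzer`; the regular expressions, the `FAILURE_PATTERNS` table, and `count_failure_patterns` are all hypothetical stand-ins that only illustrate mapping log lines to failure categories.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, one per failure category listed above.
FAILURE_PATTERNS = {
    "memory_leak": re.compile(r"out of memory|\boom|memory leak", re.IGNORECASE),
    "cpu_exhaustion": re.compile(r"cpu (usage|load)|throttl", re.IGNORECASE),
    "network_issue": re.compile(
        r"connection (refused|reset|timed out)|unreachable", re.IGNORECASE
    ),
    "service_failure": re.compile(
        r"service unavailable|crashloop|\b50[023]\b", re.IGNORECASE
    ),
}


def count_failure_patterns(log_lines):
    """Count how many log lines match each failure category."""
    counts = Counter()
    for line in log_lines:
        for category, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[category] += 1
    return counts


logs = [
    "2024-01-01 ERROR worker killed: Out of memory",
    "2024-01-01 WARN pod api-1 OOMKilled",
    "2024-01-01 ERROR connection refused to db:5432",
    "2024-01-01 INFO request completed in 12ms",
]
counts = count_failure_patterns(logs)
top_category = counts.most_common(1)[0][0]
print(dict(counts), top_category)
```

The most frequent category is a natural first candidate for a targeted chaos scenario, for example a `memory_leak` experiment when memory-related lines dominate.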
- API key-based authentication
- Rate limiting (configurable per endpoint)
- Input validation and sanitization
- Secure secrets management
- Non-root user execution
- Read-only root filesystem
- Security context policies
- Vulnerability scanning with Trivy
- RBAC policies with minimal permissions
- Network policies for traffic isolation
- Pod Security Standards enforcement
- Secret management with encryption
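The first two application-level items, API key authentication and rate limiting, can be sketched with the standard library alone. This is an assumption-laden illustration, not the platform's implementation: it uses a token bucket as the rate-limiting technique, and `generate_api_key`, `api_key_is_valid`, and `TokenBucket` are hypothetical names.

```python
import hmac
import secrets
import time


def generate_api_key():
    """Return 64 hex characters, the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def api_key_is_valid(presented, expected):
    """Compare keys in constant time so timing does not leak partial matches."""
    return hmac.compare_digest(presented.encode(), expected.encode())


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock  # injectable so tests can control time
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


key = generate_api_key()
bucket = TokenBucket(rate_per_sec=1, capacity=2)
if api_key_is_valid(key, key) and bucket.allow():
    print("request accepted")
```

In a real deployment one bucket would typically be kept per API key or per endpoint, which matches the "configurable per endpoint" wording above.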
# Core Configuration
ENV=production
FLASK_ENV=production
SECRET_KEY=your_flask_secret_key
# Database
DATABASE_URL=sqlite:///chaos_platform.db
# Security
CHAOS_API_KEY=your_secure_api_key
GRAFANA_PASSWORD=your_grafana_password
# Monitoring
LOG_LEVEL=INFO
PROMETHEUS_RETENTION=30d
# Optional: External Services
OPENAI_API_KEY=your_openai_key
SLACK_WEBHOOK_URL=your_slack_webhook
# configs/chaos_config.yaml
scenarios:
  cpu_spike:
    max_duration: 300
    intensity_levels: [low, medium, high]
    safety_checks: true
  memory_leak:
    max_duration: 180
    memory_limit_mb: 500
    cleanup_enabled: true
- Minimum: 2 CPU cores, 4GB RAM, 10GB storage
- Recommended: 4 CPU cores, 8GB RAM, 50GB storage
- Production: 8+ CPU cores, 16GB+ RAM, 100GB+ storage
- Horizontal pod autoscaling in Kubernetes
- Database connection pooling
- Prometheus metrics for scaling decisions
- Load balancer support with session affinity
We welcome contributions! Please see our Contributing Guide for details.
# Clone and setup development environment
git clone https://github.com/yourusername/project-mayhem.git
cd project-mayhem
# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Run pre-commit hooks
pre-commit install
# Start development server
python src/app.py
- Code formatting: Black + isort
- Linting: flake8 + pylint
- Security: bandit + safety
- Testing: pytest with 80%+ coverage
- Documentation: Sphinx with type hints
This project is licensed under the MIT License. See the LICENSE file for details. (You break it, you bought it!)
- Inspired by Netflix's Chaos Monkey, Gremlins, and every SRE who's ever said "What could possibly go wrong?"
- Special thanks to the open-source community for their contributions, memes, and moral support
- Built with ❤️ and a healthy dose of controlled chaos
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Security: [email protected]
Remember: In chaos we trust, but in monitoring we verify!