The Safe GenAI Platform: A reference implementation for ML inference with integrated safety, traffic governance, and observability.
Aether is a complete GenAI serving platform that treats safety as a first-class concern, not an afterthought. It integrates four purpose-built components into a unified system designed for demos and reference deployments.
Most GenAI platforms focus only on performance. But in production, you need more:
| Challenge | Typical Solution | Aether Solution |
|---|---|---|
| Prompt injection | Hope the LLM handles it | Sentinel: 3-tier defense for prompt-injection mitigation |
| PII leakage | Manual review | Sentinel: Auto-redaction before response |
| Cost explosion | Monthly bill shock | Atlas: Token-based quotas with forecasting |
| Performance at scale | Over-provision | Hyperion: Intelligent batching, 3x throughput |
| "What's happening?" | Grep through logs | MonitorX: ML-aware dashboards and alerting |
| Component | Role | Key Capabilities |
|---|---|---|
| Atlas | Traffic Gateway | Auth, rate limiting, quotas, safety compute budget |
| Sentinel | Safety Analysis | Tiered defense (HeuristicsβFast MLβDeep ML/LLM), PII detection |
| Hyperion | Model Serving | GPU inference, intelligent batching, caching, optimization |
| MonitorX | Observability | Real-time metrics, alerting, drift detection, dashboards |
- Safe by Design: Integrated safety layer catches attacks, PII leakage, toxic content
- One-Command Setup: Local infrastructure in minutes
- Self-Observing: Automatic metrics collection and alerting
- Quota-Aware: Built-in cost control and rate limiting
- High-Performance: GPU acceleration with intelligent batching
- Reference-Ready: Modular components, tests, and docs
- Docker & Docker Compose
- 8GB+ RAM recommended
- (Optional) NVIDIA GPU for acceleration
# Clone Aether
git clone https://github.com/BugVanquisher/Aether
cd Aether
# (Optional) copy demo env defaults
cp .env.example .env
# Clone all four component repositories
git clone https://github.com/BugVanquisher/Atlas
git clone https://github.com/BugVanquisher/Hyperion
git clone https://github.com/BugVanquisher/MonitorX
git clone https://github.com/BugVanquisher/Sentinel
# Setup and start WITH SAFETY (recommended!)
docker-compose -f docker-compose-with-sentinel.yml up -d
# Or without safety layer (original setup)
# ./setup-integrated-demo.shThat's it!
Each component (Atlas, Hyperion, MonitorX, Sentinel) is an independent repository. To pull the latest changes from all components:
./sync-repos.shThis script:
- Pulls
origin/mainfor each component - Skips repos with local changes (won't overwrite your work)
- Shows status for each repo
After syncing, rebuild containers if needed:
docker-compose -f docker-compose-with-sentinel.yml build# Run automated tests
./test-integrated-system.sh
# Expected output:
# β Hyperion inference successful
# β Atlas gateway working
# β Quota tracking active
# β MonitorX is collecting metrics
# β Rate limiting working- Atlas Gateway (entry point): http://localhost:8080
- Sentinel Safety API: http://localhost:8090
- Hyperion API: http://localhost:8000
- MonitorX Dashboard: http://localhost:8501
- InfluxDB UI: http://localhost:8086
Default API Key (demo only): demo-key-12345 (override via API_KEY)
- Demo credentials in this repo are for local testing only; rotate them before any real deployment.
- Set secrets through environment variables (see
.env.example) and remove demo defaults. - Disable demo bypass keys such as
SENTINEL_DEMO_KEYand set a strongADMIN_API_KEY.
- Set
API_KEY,ADMIN_API_KEY, and all storage tokens via env or secrets manager. - Enable Sentinel auth (
SENTINEL_API_KEYS) and RapidAPI proxy secret if applicable. - Rotate all demo credentials and disable any demo bypass keys.
Run the guided presentation:
./demo-presentation.shThis walks through:
- System health checks
- Complete inference flow
- Quota tracking
- Rate limiting in action
- Performance monitoring
- Cache optimization
- Prometheus metrics
Perfect for demos, presentations, or understanding the system!
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AETHER: SAFE GENAI PLATFORM β
β Production ML serving with integrated safety β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Clients
β
βΌ
βββββββββββββββββββββββββββ
β Atlas Gateway β β TRAFFIC GATEWAY
β Port 8080 β
β β
β β’ Authentication β
β β’ Rate Limiting β
β β’ Quota Control β
β β’ Safety Compute Budget β
βββββββββββββ¬ββββββββββββββ
β (authenticated)
βΌ
βββββββββββββββββββββββββββ
β Sentinel (Safety) β β SAFETY ANALYSIS
β Port 8090 β
β β
β β’ Tier 1: Heuristics β
β β’ Tier 2: Fast ML β
β β’ Tier 3: Deep ML/LLM β
β β’ Reports tier to Atlas β
βββββββββββββ¬ββββββββββββββ
β (if safe)
βΌ
βββββββββββββββββββββββββββ
β Hyperion Engine β β MODEL SERVING
β Port 8000 β
β β
β β’ GPU Inference β
β β’ Intelligent Batching β
β β’ Response Caching β
βββββββββββββ¬ββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββ
β Sentinel (Safety) β β OUTPUT FILTER
β β
β β’ PII Leakage Check β
β β’ Toxicity Filtering β
β β’ Secret Detection β
βββββββββββββ¬ββββββββββββββ
β
βΌ
Response
ββββββββββββββββββββββββββββββββββββββββ
β MonitorX (Observability) β
β API: 8001 | Dashboard: 8501 β
β β
β β’ Safety Metrics (block rate, FP) β
β β’ Performance Metrics (latency) β
β β’ ML-Aware Alerting β
ββββββββββββββββββββββββββββββββββββββββ
Supporting Infrastructure:
βββ Redis (6379): Caching + Quota Storage
βββ InfluxDB (8086): Time-Series Metrics
Why Atlas Before Sentinel?
Sentinel's Tier 3 uses LLM-based analysis (significant compute).
Atlas must protect this resource with quotas, just like inference.
Run the safety demonstration to see Sentinel in action:
./demo-safety.shThis showcases:
- Normal requests passing through safely
- Prompt injection attacks being blocked
- PII detection and flagging
- Toxic content being caught
- HIPAA violations detected
- Architecture Decision Records - Design decisions and rationale
- Integration Guide - Complete setup and configuration
- Troubleshooting Guide - Common issues and solutions
- API Reference - Endpoint documentation
- Production Deployment - Kubernetes and scaling
For detailed information on each component:
- Sentinel Documentation - Safety gateway details
- Atlas Documentation - Traffic governance details
- Hyperion Documentation - Inference engine internals
- MonitorX Documentation - Observability setup
# Send a request through the unified platform
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer demo-key-12345" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Explain machine learning in simple terms"}
],
"max_tokens": 100
}'curl http://localhost:8080/v1/usage \
-H "Authorization: Bearer demo-key-12345"# View batch statistics
curl http://localhost:8000/v1/batch/stats | jq
# Access real-time dashboard
open http://localhost:8501from aether import AetherClient
# Initialize client
client = AetherClient(
gateway_url="http://localhost:8080",
api_key="demo-key-12345"
)
# Make inference request
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "What is Aether?"}
],
max_tokens=100
)
print(response.choices[0].message.content)
# Check quota
usage = client.get_usage()
print(f"Daily usage: {usage['daily_usage']} / {usage['daily_limit']}")- API key authentication and authorization
- Rate limiting (QPS + burst control)
- Daily and monthly quota enforcement
- Safety compute budget (limit expensive Tier 3 checks)
- Token-level accounting (not just request counting)
- Priority-based routing (critical/high/normal/low)
- Usage forecasting and capacity planning
- Tiered Defense: Heuristics (<1ms) β Fast ML (5-15ms) β Deep ML/LLM (50-100ms)
- Prompt Injection Detection: Catches 95%+ of known attack patterns
- PII Detection & Redaction: Microsoft Presidio-powered entity recognition
- Toxicity Filtering: BERT-based toxicity classification
- HIPAA/GDPR Compliance: Healthcare and privacy policy enforcement
- Output Filtering: Catch leakage in generated responses
- Tier Reporting: Reports tier invoked back to Atlas for quota tracking
- GPU acceleration with CUDA support
- Intelligent request batching (2-8 requests/batch, 3x throughput)
- Redis caching for repeated queries
- Model optimization (quantization, compilation)
- Support for multiple model types
- Automatic CPU/GPU fallback
- Real-time metrics visualization
- Safety metrics: Block rate, false positive rate, tier distribution
- ML-aware alerting with adaptive thresholds
- Multi-channel alerting (Email, Slack, Webhooks)
- Historical analysis and trends
- CSV/JSON data export
docker-compose -f docker-compose-integrated.yml up -d# Deploy to Kubernetes cluster
kubectl apply -f kubernetes/namespace.yaml
kubectl apply -f kubernetes/
# Or use Helm
helm install aether ./helm/aether-platform# Copy environment template
cp .env.example .env
# Edit for your environment
vim .envKey variables:
# Atlas
ATLAS_RATE_LIMIT_QPS=100
ATLAS_ADMIN_TOKEN=<secure-token>
# Hyperion
HYPERION_DEVICE_TYPE=cuda # or 'cpu'
HYPERION_BATCH_SIZE=8
HYPERION_MODEL_NAME=microsoft/DialoGPT-small
# MonitorX
MONITORX_SLACK_WEBHOOK=<your-webhook>
[email protected]| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Health check |
/supervise |
POST | Safety supervision (input/output check) |
/dashboard |
GET | Compliance dashboard (requires auth) |
| Endpoint | Method | Description |
|---|---|---|
/healthz |
GET | Health check |
/v1/chat/completions |
POST | OpenAI-compatible inference |
/v1/usage |
GET | Check quota usage |
/v1/admin/keys |
POST | Register API key |
/v1/forecasting/forecast |
GET | Traffic prediction |
/metrics |
GET | Prometheus metrics |
| Endpoint | Method | Description |
|---|---|---|
/healthz |
GET | Health + device info |
/v1/llm/chat |
POST | Direct LLM inference |
/v1/batch/stats |
GET | Batch performance |
/v1/models |
GET | Available models |
/metrics |
GET | Prometheus metrics |
| Endpoint | Method | Description |
|---|---|---|
/api/v1/health |
GET | API health |
/api/v1/models |
POST | Register model |
/api/v1/metrics/inference |
POST | Submit metrics |
/alerts/active |
GET | Active alerts |
/alerts/history |
GET | Alert history |
Full API documentation available at /docs on each service.
# Run integration test suite
./test-integrated-system.sh
# Run individual component tests
cd Atlas && pytest
cd Hyperion && pytest
cd MonitorX && pytest# Install locust
pip install locust
# Run load test
locust -f tests/locustfile.py \
--host http://localhost:8080 \
--users 100 \
--spawn-rate 10# Quick health check
curl http://localhost:8080/healthz
curl http://localhost:8000/healthz
curl http://localhost:8001/api/v1/health
# Verify metrics collection
curl http://localhost:8080/metrics | grep atlas_requests_totalDeploy multiple models simultaneously:
hyperion-gpt2:
build: ./Hyperion
environment:
- MODEL_NAME=gpt2
hyperion-bert:
build: ./Hyperion
environment:
- MODEL_NAME=bert-base-uncasedConfigure MonitorX alerts:
thresholds = {
"latency": 2000.0, # Alert if >2s
"error_rate": 0.05, # Alert if >5% errors
"gpu_memory": 0.90, # Alert if >90% memory
}Scale with Kubernetes HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: hyperion-hpa
spec:
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70Aether is an integration project. Each component lives in its own repository:
- Atlas - Traffic governance
- Hyperion - ML inference engine
- MonitorX - Observability platform
- Sentinel - Safety gateway
- Make changes in the standalone component repo
- Commit and push to the component repo
- Run
./sync-repos.shin Aether to pull updates - Rebuild:
docker-compose -f docker-compose-with-sentinel.yml build
- Fork this repository
- Create a feature branch
- Submit a pull request
Apache License 2.0 - See LICENSE file for details.
Each component maintains its own Apache 2.0 license.
Aether demonstrates:
- Safety-First Design: Integrated safety layer from day one, not bolted on
- System Design: Composable architecture with clear separation of concerns
- DevOps Excellence: One-command deployment, comprehensive monitoring
- Production Mindset: Health checks, graceful degradation, observability
- Enterprise Features: Authentication, rate limiting, quota management, compliance
- Performance: GPU acceleration, intelligent batching, caching
Perfect for:
- Production ML deployments requiring safety compliance
- Learning ML infrastructure with security best practices
- Portfolio demonstrations of end-to-end platform design
- Rapid prototyping with built-in guardrails
- ML system benchmarking with safety metrics
- GitHub: github.com/BugVanquisher/Aether
- Documentation: Full Integration Guide
- Issues: Report a Bug
- Discussions: Community Forum
Aether integrates four purpose-built components:
- Sentinel - AI safety gateway with tiered defense
- Atlas - Traffic governance for LLM inference
- Hyperion - High-performance ML inference platform
- MonitorX - Production ML observability
Built with:
- Microsoft Presidio for PII detection
- Unitary Toxic-BERT for toxicity classification
- FastAPI for APIs
- InfluxDB for time-series storage