Nimbus is a self-hosted CI platform built around Firecracker microVMs, org-scoped storage, and end-to-end observability for AI evaluation workloads. It replaces GitHub-hosted runners while keeping execution on infrastructure you control.
- Getting Started
- Configuration Reference
- Operations Guide
- Onboarding Playbook
- GitHub Actions Migration Guide
- ROI Calculator
- Firecracker Security Hardening
- ClickHouse Schema
- Runbook
- Policy-as-Code
- Control Plane (src/nimbus/control_plane): Validates GitHub workflow_job webhooks (HMAC, timestamp, delivery replay fence), enforces distributed per-org rate limits, issues agent/cache tokens, and brokers job leases over Redis + Postgres. It also exposes SAML SSO, SCIM provisioning, service accounts, and compliance export logging.
- Host Agent (src/nimbus/host_agent): Runs Firecracker microVMs with snapshot boot and network fencing, plus Docker and GPU executors selected by capability labels. The agent layers in warm pools, resource/performance telemetry, offline egress enforcement, SBOM generation, and supply-chain allow/deny policies.
- Executor System (src/nimbus/runners): Pluggable executors share a common Executor protocol, pool manager, resource tracker, and watchdogs for timeouts and lease renewal fencing.
- Cache Proxy (src/nimbus/cache_proxy): Multi-tenant cache front-end with S3/local backends, org quotas, eviction metrics, and circuit-breakers to isolate backend failures.
- Docker Layer Cache (src/nimbus/docker_cache): Minimal OCI registry enforcing org-prefixed repositories, blob accounting, and scoped cache tokens for push/pull.
- Logging Pipeline (src/nimbus/logging_pipeline): ClickHouse-backed ingestion API with batched writes, scoped query filters, and hardened metrics endpoints.
- Web Dashboard (web): React/Vite SPA that surfaces job queues, agent health, logs, and compliance metadata via the public API.
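The job leases and lease renewal fencing mentioned above can be illustrated with a small in-memory sketch. This is not the Nimbus implementation: the `LeaseStore` class, its method names, and the use of a strictly increasing counter as the fence token are all assumptions made for illustration, and the real broker sits on Redis + Postgres and may rotate tokens differently.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    job_id: str
    agent_id: str
    fence: int         # fence token issued with this grant (hypothetical scheme)
    expires_at: float  # clock value after which the lease counts as stale


class LeaseStore:
    """In-memory stand-in for a lease broker; all names here are hypothetical."""

    def __init__(self, ttl: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._leases: Dict[str, Lease] = {}
        self._fences = itertools.count(1)  # strictly increasing fence tokens

    def acquire(self, job_id: str, agent_id: str) -> Optional[Lease]:
        current = self._leases.get(job_id)
        if current is not None and current.expires_at > self._clock():
            return None  # another agent still holds a live lease
        lease = Lease(job_id, agent_id, next(self._fences), self._clock() + self._ttl)
        self._leases[job_id] = lease
        return lease

    def renew(self, job_id: str, fence: int) -> bool:
        current = self._leases.get(job_id)
        # Reject renewals from superseded holders and from expired leases.
        if current is None or current.fence != fence or current.expires_at <= self._clock():
            return False
        current.expires_at = self._clock() + self._ttl
        return True

    def accept_result(self, job_id: str, fence: int) -> bool:
        # Only the holder of the newest fence token may report a result.
        current = self._leases.get(job_id)
        return current is not None and current.fence == fence
```

The point of the fence token is that an agent which stalls past its lease and then wakes up cannot renew or report results once another agent has been granted the job, because its token no longer matches the stored one.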
- Lease fencing with rotating fence tokens prevents duplicate job execution across agents.
- Webhook signature + timestamp validation with replay tracking (x-github-delivery) blocks tampering and replays.
- Agent/cache/service-account tokens are org scoped, versioned, and auditable through Postgres-backed ledgers.
- Offline-mode egress enforcement combines metadata endpoint deny-lists, regex policy packs, and explicit registry allow-lists.
- Rootfs attestation, cosign provenance checks, and per-job SBOM generation tighten host supply-chain posture.
- Metrics and admin endpoints require bearer tokens and can be IP-filtered for additional hardening.
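As a rough illustration of the webhook checks listed above, the sketch below combines HMAC signature verification, a timestamp window, and delivery-ID replay tracking. The signature format (`sha256=` followed by a hex HMAC-SHA256 of the raw body) matches GitHub's `X-Hub-Signature-256` header, and the delivery ID corresponds to `X-GitHub-Delivery`. Everything else is an assumption: the `WebhookVerifier` class, the `sign` helper, the 300-second window, and the source of the timestamp are hypothetical and not taken from the Nimbus code base.

```python
import hashlib
import hmac
import time
from typing import Callable, Dict


def sign(secret: bytes, body: bytes) -> str:
    """Build a signature in the 'sha256=<hex>' form used by X-Hub-Signature-256."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


class WebhookVerifier:
    """Hypothetical verifier: signature, timestamp window, then replay fence."""

    def __init__(self, secret: bytes, max_skew: float = 300.0,
                 clock: Callable[[], float] = time.time):
        self._secret = secret
        self._max_skew = max_skew
        self._clock = clock
        self._seen: Dict[str, float] = {}  # delivery id -> time first accepted

    def verify(self, body: bytes, signature: str, delivery_id: str, timestamp: float) -> bool:
        # 1. Constant-time comparison of the expected and supplied signatures.
        expected = sign(self._secret, body)
        if not hmac.compare_digest(expected.encode(), signature.encode()):
            return False
        # 2. Reject deliveries whose timestamp falls outside the allowed window.
        now = self._clock()
        if abs(now - timestamp) > self._max_skew:
            return False
        # 3. Drop old entries. Keeping them for twice the window means a pruned
        #    delivery could no longer pass the timestamp check anyway.
        horizon = 2 * self._max_skew
        self._seen = {d: t for d, t in self._seen.items() if now - t <= horizon}
        if delivery_id in self._seen:
            return False  # replayed delivery
        self._seen[delivery_id] = now
        return True
```

In a multi-instance deployment the seen-delivery map would need to live in shared storage rather than process memory; the dictionary here only keeps the sketch self-contained.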
- Install dependencies and bootstrap the environment (uv venv, uv pip install -e .).
- Configure the required environment variables and secrets (see Configuration).
- Follow the detailed setup in Getting Started to launch services and supporting infrastructure.
- Run the automated checks with uv run pytest to validate control plane, host agent, caching, and executor integrations.
Workflows can target Nimbus runners using capability-based labels:
# Secure isolation (default)
runs-on: [nimbus] # Uses Firecracker microVMs
# Fast startup for CI/CD
runs-on: [nimbus, docker] # ~200ms startup
# GPU acceleration for ML/AI
runs-on: [nimbus, gpu, pytorch, gpu-count:2] # 2 GPUs
# Custom configurations
runs-on: [nimbus, docker, image:node:18-alpine]

The control plane verifies workflow_job signatures, enforces per-org rate limits, and dispatches jobs to agents based on capability matching.
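To make capability matching concrete, here is a minimal sketch of how the label forms shown above could be parsed and matched against agents. The function names, the `JobRequest` type, the rule that `gpu` takes precedence over `docker`, and the agent capability sets are all hypothetical; the actual dispatch logic in Nimbus may differ.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class JobRequest:
    executor: str              # "firecracker", "docker", or "gpu"
    flags: FrozenSet[str]      # bare labels such as "docker" or "pytorch"
    options: Dict[str, str]    # key:value labels such as "gpu-count:2"


def parse_runs_on(labels: List[str]) -> JobRequest:
    """Split runs-on labels into bare flags and key:value options."""
    if "nimbus" not in labels:
        raise ValueError("not a Nimbus job: missing 'nimbus' label")
    flags: Set[str] = set()
    options: Dict[str, str] = {}
    for label in labels:
        if label == "nimbus":
            continue
        # Split on the first colon only, so image:node:18-alpine keeps its tag.
        key, sep, value = label.partition(":")
        if sep:
            options[key] = value
        else:
            flags.add(label)
    # Assumed precedence: gpu, then docker, otherwise the Firecracker default.
    if "gpu" in flags:
        executor = "gpu"
    elif "docker" in flags:
        executor = "docker"
    else:
        executor = "firecracker"
    return JobRequest(executor, frozenset(flags), options)


def eligible_agents(request: JobRequest, agents: Dict[str, Set[str]]) -> List[str]:
    """Return agents whose advertised capabilities cover the request."""
    needed = request.flags | {request.executor}
    return sorted(name for name, caps in agents.items() if needed <= caps)
```

For example, `parse_runs_on(["nimbus", "docker", "image:node:18-alpine"])` yields a request with executor `docker` and the option `image` set to `node:18-alpine`, which `eligible_agents` then matches only against agents that advertise the `docker` capability.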
Nimbus publishes curated container images, such as nimbus/ai-eval-runner (Node.js, eval2otel, ollama client) for evaluation workloads. See Getting Started for example usage.
- Day-two procedures, monitoring, and ClickHouse schema live in the Operations Guide.
- Firecracker jailer, seccomp, and capability dropping guidance is documented in Firecracker Security Hardening.
- src/nimbus/control_plane: FastAPI application, database models, RBAC/SCIM/SAML integrations, compliance tooling.
- src/nimbus/host_agent: Firecracker launcher, multi-executor orchestration, warm pools, egress enforcement, and SSH utilities.
- src/nimbus/runners: Executor implementations (Firecracker, Docker, GPU), pool manager, resource tracker, performance monitor.
- src/nimbus/cache_proxy & src/nimbus/docker_cache: Artifact/cache services with metrics, quota enforcement, and S3/OCI backends.
- src/nimbus/logging_pipeline: ClickHouse ingestion and querying service for job logs.
- web: Vite/React dashboard for operational monitoring.
- tests: Extensive pytest suite covering services, executors, security controls, and CLI tooling.
- Complete: Multi-tenant isolation, lease fencing, webhook replay protection, distributed rate limiting, metrics endpoint authentication, tenant analytics dashboard, multi-executor system with Firecracker/Docker/GPU support, warm pools, snapshot boot, comprehensive performance monitoring.
- In Progress: Enhanced GPU scheduling, ARM64 support, advanced resource optimization.
- Planned: Kubernetes executor, Windows containers, auto-scaling warm pools, cost optimization features.
Nimbus is ready for production deployments with a mature multi-executor architecture. See the Executor System Guide for comprehensive usage documentation. Contributions welcome in:
- New executor implementations (Kubernetes, ARM64, Windows)
- Advanced GPU scheduling and optimization
- Performance analysis and cost optimization
- Extended warm pool strategies