Enterprise-grade Kubernetes-native sandboxes — for humans and AI agents.
AgentTier is a Kubernetes-native platform that provides isolated, persistent sandbox environments for running AI agents. Each sandbox is a pod with its own persistent storage, network isolation, and interactive terminal access — managed declaratively through Custom Resource Definitions.
Key use cases:
- Kubernetes operator for isolated, persistent sandboxes — declarative CRDs manage the full pod + PVC + NetworkPolicy lifecycle so stopped sandboxes keep their files and resumed sandboxes re-attach the same volume.
- Run AI coding agents (Claude Code, Cursor, Aider) in secure, isolated environments
- Provide on-demand development environments for engineering teams
- Execute untrusted AI-generated code with kernel-level isolation (gVisor)
- Orchestrate multi-agent workflows with inter-sandbox communication
Dashboard with a mix of human developer sandboxes and Claude Code agent sandboxes.
Full PTY in the browser. This sandbox is running Claude Code against AWS Bedrock.
- Create, stop, resume, delete — sandboxes spin up from a template; stopping preserves the workspace, packages, and git state on a persistent volume; resume reattaches the same volume in seconds; idle and max-runtime caps auto-stop with grace.
- Clone any sandbox via VolumeSnapshot — `POST /api/v1/sandboxes/{id}/clone` takes a CSI VolumeSnapshot of the source PVC and provisions a new sandbox whose workspace is byte-identical to the source. Clones inherit the source's spec (template, env, ports, agent harness) so a fork is one HTTP call away. SDK + CLI surfaces; works on any CSI driver with snapshot support (EBS, GCE PD, Azure Disk, etc.).
- Sub-second cold starts — per-template warm pools, optional immediate PVC binding, and an opt-in image pre-pull DaemonSet take creation from ~10 s down to ~800 ms.
- Self-healing — bounded retries on infrastructure failures with structured Kubernetes events for every transition; clean Error state once the retry budget is exhausted.
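The self-healing behaviour above (bounded retries, one event per transition, a terminal Error state once the budget is spent) can be sketched in a few lines. This is only an illustration under stated assumptions: `InfraError`, `run_with_retry_budget`, `make_flaky`, the event strings, and the default budget of 3 are hypothetical and are not taken from the AgentTier controller; backoff between attempts is omitted.

```python
from typing import Callable, List, Tuple


class InfraError(Exception):
    """Hypothetical marker for a retryable infrastructure failure."""


def run_with_retry_budget(
    step: Callable[[], None], max_attempts: int = 3
) -> Tuple[str, List[str]]:
    """Run `step` up to `max_attempts` times; return (final_state, events)."""
    events: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            step()
        except InfraError as exc:
            # Record one structured event per failed attempt, then retry.
            events.append(f"attempt={attempt} result=retry reason={exc}")
            continue
        events.append(f"attempt={attempt} result=ok")
        return "Running", events
    # Retry budget exhausted: surface a terminal Error state.
    events.append(f"attempts={max_attempts} result=error")
    return "Error", events


def make_flaky(failures: int) -> Callable[[], None]:
    """Build a step that raises InfraError for its first `failures` calls."""
    remaining = [failures]

    def step() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise InfraError("pvc-not-bound")

    return step


print(run_with_retry_budget(make_flaky(2))[0])  # recovers within the budget
print(run_with_retry_budget(make_flaky(5))[0])  # exhausts the budget
```

Keeping the budget finite is what makes the final state unambiguous: a sandbox either reaches Running or lands in Error with the full attempt history attached.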
- Compose templates from other templates with field-level merge and per-sandbox overrides; the harness block defines the shell, tools, system prompt, hooks, and init scripts.
- Reference images out of the box — general coding, Claude Code on AWS Bedrock (with cloud-native credential injection), OpenClaw on AWS Bedrock (turnkey IRSA-driven config), Strands Agents on AWS Bedrock (Python SDK with IRSA), a LangGraph agent-mode image, an `rl-rollout` image (PyTorch + Ray RLlib + Gymnasium + Stable-Baselines3 with self-contained PPO and `/invoke`-shaped rollout examples), and minimal shell.
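Field-level merge of a template with a parent template and per-sandbox overrides can be pictured as a recursive dictionary merge. The sketch below is an assumption about the general shape, not AgentTier's actual merge code: `merge_fields` is a hypothetical helper, and replacing lists and scalars wholesale is a simplification the source does not confirm. The field names are borrowed from the template example in this README.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_fields(base: Dict[str, Any], overlay: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `overlay` wins field by field over `base`."""
    merged = deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: descend and merge field by field.
            merged[key] = merge_fields(merged[key], value)
        else:
            # Scalars and lists replace the inherited value (assumption).
            merged[key] = deepcopy(value)
    return merged


parent = {
    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
    "storage": {"size": "20Gi"},
    "idleTimeout": "2h",
}
child = {"resources": {"requests": {"cpu": "2"}}}
sandbox_override = {"storage": {"size": "50Gi"}}

# Parent first, then the composing template, then the per-sandbox override.
effective = merge_fields(merge_fields(parent, child), sandbox_override)
print(effective)
```

Applying the layers in a fixed order (parent, then child, then sandbox override) keeps the result predictable: the most specific layer always wins for the fields it sets, and everything else is inherited.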
- Run an agent on demand — configure a sandbox once with your code and install command, then call `/invoke` to run it; output streams back as Server-Sent Events and closing the connection cancels the in-pod process.
- Bring your own framework or harness — the LangGraph reference template ships in the box; the same shape works for Strands Agents, AutoGen, OpenHands, OpenClaw, or any pip-installable agent library. The framework owns the loop; AgentTier owns lifecycle, auth, transport, audit, and governance.
- Throttle, time out, and audit every invoke — per-sandbox concurrency caps return a clean 429 with `Retry-After`; default 30-minute per-invoke timeout with a cluster ceiling; OpenTelemetry spans, Prometheus metrics, and Kubernetes events emitted on every configure and invoke.
- Install logs persisted out-of-band — the trailing bytes of every `/configure` install command land in a per-sandbox ConfigMap rather than inline on the Sandbox CR, so etcd object size stays small at scale and `kubectl describe sandbox` stays clean. A lazy GET endpoint serves the log on demand.
- Optional local memory — Helm flag adds a `mem0` sidecar next to the agent. Bring-your-own memory (PVC-local, Pinecone, Postgres + pgvector, AgentCore Memory) is fully supported and documented.
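Because `/invoke` streams its output as Server-Sent Events, a client needs to split the stream into events. The following is a minimal parser for the standard SSE wire format, shown as a sketch: `parse_sse` is a hypothetical helper, and the `stdout` event name and payloads in the sample stream are invented for illustration, since the source does not document the event schema.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SSE event from an iterable of text lines."""
    fields: Dict[str, str] = {}
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; dispatch only if it has data.
            if data:
                fields["data"] = "\n".join(data)
                yield fields
            fields, data = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, commonly used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if name == "data":
            data.append(value)  # multiple data lines are joined with "\n"
        else:
            fields[name] = value


sample_stream = [
    "event: stdout\n",
    "data: hello\n",
    "\n",
    ": keep-alive\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]
for event in parse_sse(sample_stream):
    print(event)
```

In practice the lines would come from a streaming HTTP response; since closing the connection cancels the in-pod process, a client can abort a run simply by stopping iteration and closing the response.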
- Locked-down sandboxes by default — non-root user, read-only root filesystem with a writable in-memory `/tmp`, all capabilities dropped, `seccomp=RuntimeDefault`, and per-sandbox service accounts with no cluster permissions.
- Network isolation — default deny-all egress with a configurable allow-list and always-on DNS; optional gVisor RuntimeClass for kernel-level isolation of untrusted workloads.
- Cloud-native credentials — wire AWS Bedrock and other cloud APIs in via IRSA on EKS or workload identity on GKE; mount Kubernetes Secrets as env vars or files; no long-lived secrets baked into images.
- Hashed share-link tokens — share links store SHA-256 hashes at rest, never the raw secret, with constant-time comparison on validation.
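Storing only a SHA-256 hash and comparing in constant time is a standard pattern, and a small sketch makes the flow concrete. The function names `mint_share_token` and `verify_share_token` and the 32-byte token length are assumptions for illustration; the Router's actual implementation is not shown in this README.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def mint_share_token() -> Tuple[str, str]:
    """Return (raw_token, stored_hash); only the hash should be persisted."""
    raw = secrets.token_urlsafe(32)
    stored_hash = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return raw, stored_hash


def verify_share_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare it in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    # hmac.compare_digest avoids leaking the mismatch position via timing.
    return hmac.compare_digest(candidate, stored_hash)


raw_token, token_hash = mint_share_token()
print(verify_share_token(raw_token, token_hash))
print(verify_share_token("not-the-token", token_hash))
```

The raw token is handed to the user once and never written to storage, so a leak of the stored hashes does not directly expose usable share links.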
- Browser terminal that survives drops — full PTY over WebSocket with reconnect, plus an in-pod terminal endpoint (`/pty` on each sandbox) that bypasses the Kubernetes API server so long sessions survive load-balancer and apiserver-side timeouts; a tmux wrap keeps the same shell across reconnects, and tmux's alt-screen capability is stripped so fullscreen TUIs (Claude Code, vim, less) write into the browser scrollback instead of swallowing history.
- Bottom-pinned during fast TUI redraws — when an agent (Claude Code, vim, htop) is producing dense output the viewport stays pinned to the bottom so the input prompt stays visible. If you scroll up to read history, your scroll position is preserved — output keeps landing below without yanking you back to the prompt.
- Run commands programmatically — fire-and-forget or request-response exec, file upload/download/list, and port forwarding with authenticated previews through the Router; ports also surface as Ingress URLs when a preview domain is configured.
- Plug into any OIDC provider — Cognito, Okta, Azure AD, or anything OIDC-compliant. JWTs are verified against the provider's JWKS (RS256 signature + issuer + audience + expiry). API keys are minted on demand and stored as SHA-256 hashes with LRU caching. Auth fails closed: with no OIDC issuer configured the Router rejects every request with 401 unless an operator explicitly sets `auth.devAuth: true` for local development.
- Hierarchical governance policies — cluster and per-namespace caps on sandbox counts, CPU / memory / storage, idle and max-runtime timeouts, agent-mode concurrency, allowed templates, and approved image registries; violations return a structured response so UIs can pinpoint the failing field. The same policy is re-checked at agent `/configure` time so a policy that tightens after sandbox creation still gates code uploads.
- Per-IP and per-user rate limiting — opt-in token-bucket throttling on Router endpoints, with health checks and WebSocket terminals exempt; 429 responses carry `Retry-After`.
- Built-in audit trail — every lifecycle, terminal, credential, share, clone, and port-forward event is recorded as a Kubernetes event (and optionally a row in a SQL backend for long-term retention).
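Token-bucket throttling with a `Retry-After` hint, as mentioned for the Router, works roughly as follows. This is a generic sketch of the algorithm, not the Router's code: the `TokenBucket` class, its `rate` and `burst` parameters, and the injectable `now` clock are assumptions made so the example is deterministic.

```python
import math
import time
from typing import Optional, Tuple


class TokenBucket:
    """Allow `burst` requests at once, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, burst: int, now: Optional[float] = None) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one request."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, capped at the bucket size.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0
        # Denied: whole seconds until one full token is available again.
        return False, math.ceil((1.0 - self.tokens) / self.rate)


bucket = TokenBucket(rate=1.0, burst=2, now=0.0)
print(bucket.allow(now=0.0))  # (True, 0)
print(bucket.allow(now=0.0))  # (True, 0)
print(bucket.allow(now=0.0))  # (False, 1): respond 429 with Retry-After: 1
print(bucket.allow(now=1.0))  # (True, 0)
```

A per-IP or per-user limiter would keep one such bucket per key; exempt paths such as health checks would simply skip the `allow` call.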
- One-click sandbox management — dashboard cards show name, status, mode (Code or Agent), and key metadata; primary actions (Open Terminal / Stop / Resume / Delete) sit on the card; a gear icon opens a per-sandbox settings page at `/sandbox/<id>/settings` for ports, files, agent invoke, and (in time) governance overrides, network rules, and other deeper controls.
- Cluster glance — the left nav shows live node + pod counts plus headroom spare and warm-pool size, with a green dot when Cluster Autoscaler is running. Refreshes every ten seconds without polling the dashboard.
- Per-template warm pools, edited in place — the Settings page lets operators add and remove warm-pool entries one template at a time, see ready / pending / target counts per pool, and tune the optional headroom Deployment's replica count and per-replica CPU and memory without `helm upgrade`.
- Hierarchical workspace browser — the Files panel on the per-sandbox settings page lets users click into folders, breadcrumb back, download a single file, download a single folder as `.zip`, or download the entire workspace as `.zip`. The archive endpoint streams `tar` from the pod and re-encodes to zip on the fly server-side, so it works on every sandbox image without extra binaries.
- Browser-based admin — YAML template editor, time-ordered activity log with filters, live metrics + monthly cost estimator, and a Settings page for governance policies, warm pools, and cluster autoscaling.
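Re-encoding a tar stream to zip on the fly, as the archive endpoint is described as doing, can be illustrated with the Python standard library. This is a sketch of the technique only: `tar_to_zip` is a hypothetical helper, the real endpoint's language and buffering strategy are not documented here, and the sample tar is built in memory to stand in for the stream coming from the pod.

```python
import io
import shutil
import tarfile
import zipfile
from typing import BinaryIO


def tar_to_zip(tar_stream: BinaryIO, zip_out: BinaryIO) -> int:
    """Copy regular files from a tar stream into a zip; return the file count."""
    count = 0
    # "r|*" reads the tar sequentially, so the input never needs to be seekable.
    with tarfile.open(fileobj=tar_stream, mode="r|*") as tar, zipfile.ZipFile(
        zip_out, "w", zipfile.ZIP_DEFLATED
    ) as zf:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, and devices
            src = tar.extractfile(member)
            with src, zf.open(member.name, "w") as dst:
                shutil.copyfileobj(src, dst)  # chunked copy, no full buffering
            count += 1
    return count


# Build a tiny tar in memory to stand in for the stream from the sandbox pod.
payload = b"print('works!')\n"
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as t:
    folder = tarfile.TarInfo("workspace")
    folder.type = tarfile.DIRTYPE
    t.addfile(folder)
    info = tarfile.TarInfo("workspace/hello.py")
    info.size = len(payload)
    t.addfile(info, io.BytesIO(payload))
tar_buf.seek(0)

zip_buf = io.BytesIO()
print(tar_to_zip(tar_buf, zip_buf))
```

Doing the conversion on the server side means the sandbox image only needs `tar`, which is why no `zip` binary has to be present in the pod.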
- `pip install agenttier` — installs both the Python SDK (sync + async, typed models, auto-detected auth, streaming file transfers including workspace archive download, opt-in retry layer with backoff and `Retry-After`) and the same `agenttier` shell command on PATH.
- Cross-platform CLI — Go binary for sandbox and template management distributed for linux / darwin / windows on amd64 + arm64; the `pip` install gives you the same command tree without the Go dependency.
- REST API — documented endpoints for sandboxes, templates, governance, audit, sharing, port forwarding, files, archive (workspace as zip), configure, and invoke.
- Tracking and pre-warming — per-sandbox startup duration in logs and events for regression tracking; optional warm pool, immediate PVC binding, and image pre-pull eliminate cold-start latency.
- OpenTelemetry traces and Prometheus metrics — distributed traces across controller and router with trace IDs auto-injected into structured JSON log lines (one trace ID pivots between an OTel UI and `kubectl logs` in either direction). Spans cover every HTTP request (`router.<method>`), agent `/configure`, and `/invoke`. Spans carry a bucketed `actor_hash` instead of raw OIDC subjects so traces shipped to third-party stores don't carry PII. `/metrics` exposes sandbox counts, startup histograms, queue depth, error counters, terminal stats, and the agent-mode invoke / configure / throttle metrics.
- OTel Collector bundled with the chart — opt-in `observability.otelCollector.enabled=true` flag renders a Deployment + ConfigMap + Service running `otel/opentelemetry-collector-contrib` in your install namespace; deployments auto-point at it. Default exporter is `debug` so you can tail the collector's container logs to verify spans without provisioning external infra; replace the exporter to ship to Honeycomb / Datadog / Tempo / Jaeger / your existing collector.
- Continuous CVE scanning — every sandbox base image is scanned with Trivy on every push; findings land in the GitHub Security tab as SARIF.
- Single Helm chart — one `helm install` deploys controller, router, web UI, CRDs, RBAC, and optional add-ons (gVisor, ServiceMonitor, PDB, image pre-pull, OTel Collector, mem0 sidecar, rate limiting, cluster autoscaler, headroom).
- Production load-balancer support — opt-in Ingress template with AWS Load Balancer Controller defaults (4000 s idle timeout, sticky sessions, IP allow-list); compatible with ingress-nginx and Traefik via a single override.
- Multi-cluster ready — runs on EKS, GKE, AKS, and self-managed Kubernetes 1.27+ on any CNI with NetworkPolicy support.
- Highly available — multi-replica controllers with leader election; multi-replica router with HTTP-routed exec, files, and invoke so any replica can serve any request.
- Cluster autoscaling out of the box — opt-in upstream Cluster Autoscaler installs cloud-neutral via Helm (works on EKS, GKE, AKS, OpenStack, Cluster API). Pair it with the `headroom` Deployment to keep N+1 spare-node capacity warm: pause Pods at negative priority squat on a spare node, real sandboxes preempt them instantly, the evicted Pods trigger CAS to add the next spare in the background. Sandboxes never wait on a cold ASG round-trip.
- Container images you can verify — every image is multi-arch, cosign-signed via GitHub Actions OIDC, and ships SPDX + CycloneDX SBOMs as OCI attestations.
- Automated post-release retention — every release run prunes container manifests + GitHub Releases older than the latest 3 GA tags (older Releases demoted to pre-release, deep links preserved); git tags, PyPI versions, and active cosign signatures are never pruned.
These are tracked but not yet shipped: Terraform module for EKS, sharing and collaboration UX, webhook / email / Slack notifications, inter-sandbox networking, optional SQL backend for state, validating admission webhook, validating CRD admission for several spec fields that are currently spec-only, HPA + multi-replica Router, and additional reference images (OpenHands).
```
┌─────────────────────────────────────────────────────────────────┐
│                       Kubernetes Cluster                        │
│                                                                 │
│  ┌────────────┐  ┌────────────┐  ┌──────────┐  ┌────────────┐   │
│  │ Controller │  │   Router   │  │  Web UI  │  │    etcd    │   │
│  │ (operator) │  │ (API + WS) │  │ (nginx)  │  │ (built-in) │   │
│  └─────┬──────┘  └─────┬──────┘  └──────────┘  └────────────┘   │
│        │               │                                        │
│  ┌─────┴───────────────┴─────────────────────────────────────┐  │
│  │                   Sandbox Namespace(s)                    │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐                 │  │
│  │  │Sandbox 1 │  │Sandbox 2 │  │Sandbox N │  ...            │  │
│  │  │Pod + PVC │  │Pod + PVC │  │Pod + PVC │                 │  │
│  │  │+ NetPol  │  │+ NetPol  │  │+ NetPol  │                 │  │
│  │  └──────────┘  └──────────┘  └──────────┘                 │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```
```bash
git clone https://github.com/agenttier/agenttier.git
cd agenttier
./hack/quickstart.sh
```

This provisions an EKS cluster, builds container images, and deploys AgentTier in ~15 minutes. Run `./hack/quickstart.sh destroy` to tear down.
Install from the public Helm chart and container images at ghcr.io/agenttier/*:
```bash
# 1. Add the AgentTier Helm repo and refresh
helm repo add agenttier https://agenttier.github.io/agenttier/charts
helm repo update

# 2. Install the chart (CRDs are bundled)
helm install agenttier agenttier/agenttier \
  --namespace agenttier --create-namespace

# 3. Create a sandbox
kubectl apply -f - <<EOF
apiVersion: agenttier.io/v1alpha1
kind: Sandbox
metadata:
  name: my-sandbox
spec:
  templateRef:
    name: general-coding
    kind: ClusterSandboxTemplate
EOF

# 4. Check status
kubectl get sandboxes

# 5. Open a terminal
kubectl exec -it my-sandbox-pod -c sandbox -- /bin/bash
```

Docs site: https://agenttier.github.io/agenttier/

Pre-v0.2.0 users can still `helm repo add` at `https://agenttier.github.io/agenttier` (root); both paths resolve to the same charts.
Create → Running → Stop (pod deleted, PVC preserved) → Resume (new pod, same PVC) → Delete (all removed)
- Stop: Preserves all files, packages, git repos. No compute cost while stopped.
- Resume: Restores exact filesystem state. Takes ~5-10 seconds.
- Delete: Permanently removes sandbox and all data.
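The lifecycle above is a small state machine, and modelling it explicitly shows which resources exist in each phase. The sketch below is illustrative only: the `Phase` names other than Running, the `SandboxState` class, and the action strings are assumptions and do not mirror the Sandbox CRD's actual status fields.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    RUNNING = "Running"
    STOPPED = "Stopped"
    DELETED = "Deleted"


# (current phase, action) -> next phase; anything else is rejected.
TRANSITIONS = {
    (Phase.RUNNING, "stop"): Phase.STOPPED,
    (Phase.STOPPED, "resume"): Phase.RUNNING,
    (Phase.RUNNING, "delete"): Phase.DELETED,
    (Phase.STOPPED, "delete"): Phase.DELETED,
}


@dataclass
class SandboxState:
    phase: Phase = Phase.RUNNING
    has_pod: bool = True
    has_pvc: bool = True

    def apply(self, action: str) -> "SandboxState":
        key = (self.phase, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} a sandbox in phase {self.phase.value}")
        self.phase = TRANSITIONS[key]
        # A pod exists only while running; the PVC survives until delete.
        self.has_pod = self.phase is Phase.RUNNING
        self.has_pvc = self.phase is not Phase.DELETED
        return self


sandbox = SandboxState()
for action in ("stop", "resume", "delete"):
    sandbox.apply(action)
    print(action, sandbox.phase.value, sandbox.has_pod, sandbox.has_pvc)
```

The key invariant is that stop removes compute but never storage, which is why a resumed sandbox sees the same files.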
Templates define reusable sandbox configurations:
```yaml
apiVersion: agenttier.io/v1alpha1
kind: ClusterSandboxTemplate
metadata:
  name: claude-code-bedrock
spec:
  description: "AI coding environment with Claude Code CLI on Bedrock"
  image:
    repository: ghcr.io/agenttier/sandbox-claude-code:latest
  resources:
    requests: { cpu: "1", memory: 2Gi }
    limits: { cpu: "4", memory: 8Gi }
  storage:
    size: 20Gi
  network:
    allowInternet: true
  harness:
    shell: /bin/bash
    tools:
      - name: claude
        verifyCommand: "claude --version"
    hooks:
      onStart: "echo 'Sandbox ready'"
  timeout: 24h
  idleTimeout: 2h
```

Built-in templates: `general-coding`, `claude-code-bedrock` (Claude Code on Bedrock via IRSA), `openclaw-bedrock` (OpenClaw CLI on Bedrock via IRSA), `strands-bedrock` (Strands Agents Python SDK on Bedrock via IRSA), `langgraph-agent` (`mode: agent` reference), and `minimal-shell`.
```python
from agenttier import AgentTierClient

client = AgentTierClient(api_url="https://agenttier.company.com")
sandbox = client.create_sandbox(template="general-coding", name="my-sandbox")
sandbox.wait_until_running()

result = sandbox.exec("echo 'Hello from AgentTier!'")
print(result.stdout)  # "Hello from AgentTier!"

sandbox.files.write("/workspace/hello.py", "print('works!')")
sandbox.terminate()
```

Install: `pip install agenttier`
See helm/agenttier/values.yaml for all configuration options.
Key settings:
- `auth.oidc.*` — OIDC provider configuration (Cognito, Okta, Azure AD)
- `defaults.sandbox.*` — Default sandbox resources, storage, timeouts
- `security.gvisor.enabled` — Enable gVisor kernel isolation
- `networking.defaultPolicy` — Default network policy (deny-all or allow-internet)
The Router already sends RFC 6455 WebSocket control pings and application-level heartbeat messages every 30 seconds, so any load balancer with an idle timeout ≥ 60s will see traffic in both directions and keep the connection open. If you still see drops:
- On AWS ALB, the chart's `optional.ingress.annotations` sets `idle_timeout.timeout_seconds=4000` by default. Verify it's applied: `kubectl get ingress agenttier-webui -n agenttier -o yaml`.
- On AWS Classic ELB, set the annotation `service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600"` on the web-ui Service, or run:

  ```bash
  aws elb modify-load-balancer-attributes \
    --load-balancer-name <elb-name> \
    --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":3600}}'
  ```

- With multi-replica routers, enable sticky sessions on the target group so a reconnecting browser lands on the same pod. The chart's default ALB annotations already include `stickiness.enabled=true`.
Ensure the Router image includes the `Tty: true` fix in `StreamOptions`. Run `stty size` in the terminal; it should show your actual terminal dimensions (e.g., `40 120`), not `0 0`.
The sandbox image can't be pulled. Check:
- Template image reference: `kubectl get clustersandboxtemplate <name> -o jsonpath='{.spec.image.repository}'`
- Node can reach the registry: ECR images need the node role to have the `AmazonEC2ContainerRegistryReadOnly` policy
- For private registries, set `spec.image.pullSecret` in the sandbox spec
Never expose services to `0.0.0.0/0`. Instead:
- Use `loadBalancerSourceRanges` to restrict access to specific IPs
- Or use `kubectl port-forward` for local access (no public exposure)
- Or deploy an Ingress with OIDC authentication
All Dockerfiles use `public.ecr.aws/docker/library/*` base images. If image pulls fail with HTTP 429 (rate-limit) errors, ensure your Dockerfiles reference ECR Public, not Docker Hub directly.
- Kubernetes 1.27+
- CNI with NetworkPolicy support (Calico, Cilium, or AWS VPC CNI)
- CSI storage driver (EBS CSI, PD CSI, or any CSI-compliant driver)
- Helm 3.x
agenttier/
├── cmd/controller/ # Kubernetes operator entrypoint
├── cmd/router/ # REST API + WebSocket terminal server
├── cmd/cli/ # CLI tool
├── api/v1alpha1/ # CRD type definitions
├── pkg/controller/ # Reconciliation logic
├── pkg/router/ # HTTP handlers, auth, terminal bridge
├── web-ui/ # React frontend (TypeScript + Vite)
├── helm/agenttier/ # Helm chart
├── terraform/aws-eks/ # AWS infrastructure (EKS + Cognito + ECR)
├── images/ # Reference Dockerfiles
├── python-sdk/ # Python SDK (pip install agenttier)
├── docs/ # Documentation (MkDocs)
└── hack/ # Scripts (quickstart, codegen, load testing)
We welcome contributions! See CONTRIBUTING.md for:
- Development setup
- Coding standards
- Testing requirements
- Pull request process
Apache License 2.0 — see LICENSE for details.
Built with:
- controller-runtime — Kubernetes operator framework
- kubebuilder — CRD scaffolding
- gorilla/websocket — WebSocket implementation
- xterm.js — Terminal emulator for the browser
- React — Web UI framework