Enterprise-grade Kubernetes-native sandboxes — for humans and AI agents.
AgentTier is a Kubernetes-native platform that provides isolated, persistent sandbox environments for running AI agents. Each sandbox is a pod with its own persistent storage, network isolation, and interactive terminal access — managed declaratively through Custom Resource Definitions.
Key use cases:
- Kubernetes operator for isolated, persistent sandboxes — declarative CRDs manage the full pod + PVC + NetworkPolicy lifecycle so stopped sandboxes keep their files and resumed sandboxes re-attach the same volume.
- Run AI coding agents (Claude Code, Cursor, Aider) in secure, isolated environments
- Provide on-demand development environments for engineering teams
- Execute untrusted AI-generated code with kernel-level isolation (gVisor)
- Orchestrate multi-agent workflows with inter-sandbox communication
Dashboard with a mix of human developer sandboxes and Claude Code agent sandboxes.
Full PTY in the browser. This sandbox is running Claude Code against AWS Bedrock.
- Create, stop, resume, delete — sandboxes spin up from a template; stopping preserves the workspace, packages, and git state on a persistent volume; resume reattaches the same volume in seconds; idle and max-runtime caps auto-stop with grace.
- Sub-second cold starts — per-template warm pools, optional immediate PVC binding, and an opt-in image pre-pull DaemonSet take creation from ~10 s down to ~800 ms.
- Self-healing — bounded retries on infrastructure failures with structured Kubernetes events for every transition; clean Error state once the retry budget is exhausted.
- Compose templates from other templates with field-level merge and per-sandbox overrides; the harness block defines the shell, tools, system prompt, hooks, and init scripts.
- Reference images out of the box — general coding, Claude Code on AWS Bedrock (with cloud-native credential injection), OpenClaw on AWS Bedrock (turnkey IRSA-driven config), Strands Agents on AWS Bedrock (Python SDK with IRSA), minimal shell, and a LangGraph agent-mode image.
- Run an agent on demand: configure a sandbox once with your code and install command, then call /invoke to run it; output streams back as Server-Sent Events and closing the connection cancels the in-pod process.
- Bring your own framework or harness: the LangGraph reference template ships in the box; the same shape works for Strands Agents, AutoGen, OpenHands, OpenClaw, or any pip-installable agent library. The framework owns the loop; AgentTier owns lifecycle, auth, transport, audit, and governance.
- Throttle, time out, and audit every invoke: per-sandbox concurrency caps return a clean 429 with Retry-After; default 30-minute per-invoke timeout with a cluster ceiling; OpenTelemetry spans, Prometheus metrics, and Kubernetes events emitted on every configure and invoke.
- Optional local memory: a Helm flag adds a mem0 sidecar next to the agent. Bring-your-own memory (PVC-local, Pinecone, Postgres + pgvector, AgentCore Memory) is fully supported and documented.
- Locked-down sandboxes by default: non-root user, read-only root filesystem with a writable in-memory /tmp, all capabilities dropped, seccomp=RuntimeDefault, and per-sandbox service accounts with no cluster permissions.
- Network isolation: default deny-all egress with a configurable allow-list and always-on DNS; optional gVisor RuntimeClass for kernel-level isolation of untrusted workloads.
- Cloud-native credentials — wire AWS Bedrock and other cloud APIs in via IRSA on EKS or workload identity on GKE; mount Kubernetes Secrets as env vars or files; no long-lived secrets baked into images.
- Hashed share-link tokens — share links store SHA-256 hashes at rest, never the raw secret, with constant-time comparison on validation.
- Browser terminal that survives drops: full PTY over WebSocket with reconnect, plus an in-pod terminal endpoint (/pty on each sandbox) that bypasses the Kubernetes API server so long sessions survive load-balancer and apiserver-side timeouts; a tmux wrap keeps the same shell across reconnects, and tmux's alt-screen capability is stripped so fullscreen TUIs (Claude Code, vim, less) write into the browser scrollback instead of swallowing history.
- Run commands programmatically: fire-and-forget or request-response exec, file upload/download/list, and port forwarding with authenticated previews through the Router; ports also surface as Ingress URLs when a preview domain is configured.
- Plug into any OIDC provider — Cognito, Okta, Azure AD, or anything OIDC-compliant, plus API keys stored as SHA-256 hashes with LRU caching.
- Hierarchical governance policies: cluster and per-namespace caps on sandbox counts, CPU / memory / storage, idle and max-runtime timeouts, agent-mode concurrency, allowed templates, and approved image registries; violations return a structured response so UIs can pinpoint the failing field. The same policy is re-checked at agent /configure time so a policy that tightens after sandbox creation still gates code uploads.
- Per-IP and per-user rate limiting: opt-in token-bucket throttling on Router endpoints, with health checks and WebSocket terminals exempt; 429 responses carry Retry-After.
- Built-in audit trail: every lifecycle, terminal, credential, share, clone, and port-forward event is recorded as a Kubernetes event (and optionally a row in a SQL backend for long-term retention).
- One-click sandbox management: dashboard cards show name, status, mode (Code or Agent), and key metadata; primary actions (Open Terminal / Stop / Resume / Delete) sit on the card; a gear icon opens a per-sandbox settings page at /sandbox/<id>/settings for ports, files, agent invoke, and (in time) governance overrides, network rules, and other deeper controls.
- Cluster glance: the left nav shows live node + pod counts plus headroom spare and warm-pool size, with a green dot when Cluster Autoscaler is running. Refreshes every ten seconds without polling the dashboard.
- Per-template warm pools, edited in place: the Settings page lets operators add and remove warm-pool entries one template at a time, see ready / pending / target counts per pool, and tune the optional headroom Deployment's replica count and per-replica CPU and memory without helm upgrade.
- Hierarchical workspace browser: the Files panel on the per-sandbox settings page lets users click into folders, breadcrumb back, download a single file, download a single folder as .zip, or download the entire workspace as .zip. The archive endpoint streams tar from the pod and re-encodes to zip on the fly server-side, so it works on every sandbox image without extra binaries.
- Browser-based admin: YAML template editor, time-ordered activity log with filters, live metrics + monthly cost estimator, and a Settings page for governance policies, warm pools, and cluster autoscaling.
- pip install agenttier: installs both the Python SDK (sync + async, typed models, auto-detected auth, streaming file transfers including workspace archive download, opt-in retry layer with backoff and Retry-After) and the same agenttier shell command on PATH.
- Cross-platform CLI: Go binary for sandbox and template management distributed for linux / darwin / windows on amd64 + arm64; the pip install gives you the same command tree without the Go dependency.
- REST API: documented endpoints for sandboxes, templates, governance, audit, sharing, port forwarding, files, archive (workspace as zip), configure, and invoke.
- Tracking and pre-warming — per-sandbox startup duration in logs and events for regression tracking; optional warm pool, immediate PVC binding, and image pre-pull eliminate cold-start latency.
- OpenTelemetry traces and Prometheus metrics: distributed traces across controller and router with trace context in structured JSON logs; /metrics exposes sandbox counts, startup histograms, queue depth, error counters, terminal stats, and the agent-mode invoke / configure / throttle metrics.
- Continuous CVE scanning: every sandbox base image is scanned with Trivy on every push; findings land in the GitHub Security tab as SARIF.
- Single Helm chart: one helm install deploys controller, router, web UI, CRDs, RBAC, and optional add-ons (gVisor, ServiceMonitor, PDB, image pre-pull, OTel Collector, mem0 sidecar, rate limiting, cluster autoscaler, headroom).
- Production load-balancer support: opt-in Ingress template with AWS Load Balancer Controller defaults (4000 s idle timeout, sticky sessions, IP allow-list); compatible with ingress-nginx and Traefik via a single override.
- Multi-cluster ready — runs on EKS, GKE, AKS, and self-managed Kubernetes 1.27+ on any CNI with NetworkPolicy support.
- Highly available — multi-replica controllers with leader election; multi-replica router with HTTP-routed exec, files, and invoke so any replica can serve any request.
- Cluster autoscaling out of the box: opt-in upstream Cluster Autoscaler installs cloud-neutral via Helm (works on EKS, GKE, AKS, OpenStack, Cluster API). Pair it with the headroom Deployment to keep N+1 spare-node capacity warm: pause Pods at negative priority squat on a spare node, real sandboxes preempt them instantly, and the evicted Pods trigger Cluster Autoscaler to add the next spare in the background. Sandboxes never wait on a cold ASG round-trip.
- Container images you can verify: every image is multi-arch, cosign-signed via GitHub Actions OIDC, and ships SPDX + CycloneDX SBOMs as OCI attestations.
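The hashed share-link tokens mentioned in the feature list (SHA-256 at rest, constant-time comparison on validation) can be illustrated with a minimal Python sketch. This is not AgentTier's implementation; the function names `new_share_token` and `token_matches` and the 32-byte token length are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def new_share_token() -> Tuple[str, str]:
    """Return (raw_token, stored_hash); only the hash would be persisted."""
    raw = secrets.token_urlsafe(32)  # token length is an assumption
    stored = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return raw, stored


def token_matches(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare digests in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)
```

The raw token is handed to the user once; a leaked database row then exposes only a digest, and `hmac.compare_digest` avoids leaking how many leading characters matched.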
The following are tracked but not yet shipped: Terraform module for EKS, sharing and collaboration UX, webhook / email / Slack notifications, sandbox cloning via VolumeSnapshot, inter-sandbox networking, optional SQL backend for state, validating admission webhook, and additional reference images (OpenHands, RL training).
┌─────────────────────────────────────────────────────────────────┐
│                       Kubernetes Cluster                        │
│                                                                 │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
│  │ Controller │  │   Router   │  │   Web UI   │  │    etcd    │ │
│  │ (operator) │  │ (API + WS) │  │  (nginx)   │  │ (built-in) │ │
│  └─────┬──────┘  └─────┬──────┘  └────────────┘  └────────────┘ │
│        │               │                                        │
│  ┌─────┴───────────────┴──────────────────────────────────────┐ │
│  │                    Sandbox Namespace(s)                    │ │
│  │  ┌───────────┐  ┌───────────┐  ┌───────────┐               │ │
│  │  │ Sandbox 1 │  │ Sandbox 2 │  │ Sandbox N │  ...          │ │
│  │  │ Pod + PVC │  │ Pod + PVC │  │ Pod + PVC │               │ │
│  │  │ + NetPol  │  │ + NetPol  │  │ + NetPol  │               │ │
│  │  └───────────┘  └───────────┘  └───────────┘               │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
git clone https://github.com/agenttier/agenttier.git
cd agenttier
./hack/quickstart.sh
This provisions an EKS cluster, builds container images, and deploys AgentTier in ~15 minutes. Run ./hack/quickstart.sh destroy to tear down.
Install from the public Helm chart and container images at ghcr.io/agenttier/*:
# 1. Add the AgentTier Helm repo and refresh
helm repo add agenttier https://agenttier.github.io/agenttier/charts
helm repo update
# 2. Install the chart (CRDs are bundled)
helm install agenttier agenttier/agenttier \
--namespace agenttier --create-namespace
# 3. Create a sandbox
kubectl apply -f - <<EOF
apiVersion: agenttier.io/v1alpha1
kind: Sandbox
metadata:
  name: my-sandbox
spec:
  templateRef:
    name: general-coding
    kind: ClusterSandboxTemplate
EOF
# 4. Check status
kubectl get sandboxes
# 5. Open a terminal
kubectl exec -it my-sandbox-pod -c sandbox -- /bin/bash
Docs site: https://agenttier.github.io/agenttier/
Pre-v0.2.0 users can still helm repo add at https://agenttier.github.io/agenttier (root); both paths resolve to the same charts.
Create → Running → Stop (pod deleted, PVC preserved) → Resume (new pod, same PVC) → Delete (all removed)
- Stop: Preserves all files, packages, git repos. No compute cost while stopped.
- Resume: Restores exact filesystem state. Takes ~5-10 seconds.
- Delete: Permanently removes sandbox and all data.
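The lifecycle above can be modelled as a small state machine. The sketch below is a toy in-memory model, not the controller's reconciliation logic; the class name `SandboxModel`, the pod and PVC naming scheme, and the `Absent` phase label are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SandboxModel:
    """Toy model: the pod comes and goes, the PVC persists until delete."""
    name: str
    pvc: Optional[str] = None   # volume name, survives stop/resume
    pod: Optional[str] = None   # pod name, recreated on every resume
    pods_created: int = 0

    def _start_pod(self) -> None:
        self.pods_created += 1
        self.pod = f"{self.name}-pod-{self.pods_created}"  # hypothetical naming

    def create(self) -> None:
        self.pvc = f"{self.name}-pvc"
        self._start_pod()

    def stop(self) -> None:
        if self.pod is None:
            raise RuntimeError("sandbox is not running")
        self.pod = None          # pod deleted, PVC preserved

    def resume(self) -> None:
        if self.pvc is None or self.pod is not None:
            raise RuntimeError("only a stopped sandbox can be resumed")
        self._start_pod()        # new pod, same PVC

    def delete(self) -> None:
        self.pod = None
        self.pvc = None          # all removed

    @property
    def phase(self) -> str:
        if self.pvc is None:
            return "Absent"
        return "Running" if self.pod else "Stopped"
```

The point of the model is the invariant: across stop and resume the pod identity changes while the volume identity does not, which is why files, packages, and git state survive.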
Templates define reusable sandbox configurations:
apiVersion: agenttier.io/v1alpha1
kind: ClusterSandboxTemplate
metadata:
  name: claude-code-bedrock
spec:
  description: "AI coding environment with Claude Code CLI on Bedrock"
  image:
    repository: ghcr.io/agenttier/sandbox-claude-code:latest
  resources:
    requests: { cpu: "1", memory: 2Gi }
    limits: { cpu: "4", memory: 8Gi }
  storage:
    size: 20Gi
  network:
    allowInternet: true
  harness:
    shell: /bin/bash
    tools:
      - name: claude
        verifyCommand: "claude --version"
    hooks:
      onStart: "echo 'Sandbox ready'"
  timeout: 24h
  idleTimeout: 2h
Built-in templates: general-coding, claude-code-bedrock (Claude Code on Bedrock via IRSA), openclaw-bedrock (OpenClaw CLI on Bedrock via IRSA), strands-bedrock (Strands Agents Python SDK on Bedrock via IRSA), langgraph-agent (mode: agent reference), and minimal-shell.
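Templates can be composed from other templates with field-level merge and per-sandbox overrides. One plausible reading of "field-level merge" is a recursive dictionary merge, sketched below. The helper `merge_fields` is hypothetical, and the rule that lists and scalars are replaced wholesale is an assumption; the controller's actual merge semantics may differ.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_fields(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where `override` wins field by field.

    Nested mappings are merged recursively; any other value (scalars,
    lists) replaces the base value outright. Inputs are not mutated.
    """
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_fields(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Parent template, child template, then a per-sandbox override.
parent = {
    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
    "storage": {"size": "20Gi"},
    "idleTimeout": "2h",
}
child = {"storage": {"size": "50Gi"}}
sandbox_override = {"resources": {"requests": {"cpu": "2"}}}

effective = merge_fields(merge_fields(parent, child), sandbox_override)
```

Here `effective` keeps the parent's memory request and idle timeout, takes the child's storage size, and applies the per-sandbox CPU request last.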
from agenttier import AgentTierClient
client = AgentTierClient(api_url="https://agenttier.company.com")
sandbox = client.create_sandbox(template="general-coding", name="my-sandbox")
sandbox.wait_until_running()
result = sandbox.exec("echo 'Hello from AgentTier!'")
print(result.stdout) # "Hello from AgentTier!"
sandbox.files.write("/workspace/hello.py", "print('works!')")
sandbox.terminate()
Install: pip install agenttier
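The SDK's opt-in retry layer is described as using backoff and Retry-After. The sketch below shows that general technique in isolation. It does not use or reproduce the SDK's API: `backoff_delay`, `call_with_retry`, the 0.5 s base, the 30 s cap, and the `(status, headers, body)` tuple shape are all assumptions, and jitter is omitted to keep the example deterministic.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body), assumed shape


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    A Retry-After value given in whole seconds takes precedence; the
    HTTP-date form is not handled here and falls back to backoff.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return float(int(retry_after.strip()))
    return min(cap, base * (2 ** attempt))


def call_with_retry(send: Callable[[], Response], max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a production version would usually add random jitter to the exponential delay.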
See helm/agenttier/values.yaml for all configuration options.
Key settings:
- auth.oidc.*: OIDC provider configuration (Cognito, Okta, Azure AD)
- defaults.sandbox.*: Default sandbox resources, storage, timeouts
- security.gvisor.enabled: Enable gVisor kernel isolation
- networking.defaultPolicy: Default network policy (deny-all or allow-internet)
The Router already sends RFC 6455 WebSocket control pings and application-level heartbeat messages every 30 seconds, so any load balancer with an idle timeout ≥ 60s will see traffic in both directions and keep the connection open. If you still see drops:
- On AWS ALB, the chart's optional.ingress.annotations sets idle_timeout.timeout_seconds=4000 by default. Verify it's applied: kubectl get ingress agenttier-webui -n agenttier -o yaml.
- On AWS Classic ELB, set the annotation service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "3600" on the web-ui Service, or run:
  aws elb modify-load-balancer-attributes \
    --load-balancer-name <elb-name> \
    --load-balancer-attributes '{"ConnectionSettings":{"IdleTimeout":3600}}'
- With multi-replica routers, enable sticky sessions on the target group so a reconnecting browser lands on the same pod. The chart's default ALB annotations already include stickiness.enabled=true.
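The 60-second guidance above follows from the 30-second heartbeat: an idle timeout of twice the heartbeat interval tolerates one lost heartbeat before the load balancer closes the connection. A small sanity check for a candidate timeout might look like the sketch below; the helper name and the "one missed heartbeat" margin are assumptions, not chart behaviour.

```python
HEARTBEAT_INTERVAL_S = 30  # Router ping / heartbeat interval stated above


def idle_timeout_ok(idle_timeout_s: float,
                    heartbeat_s: float = HEARTBEAT_INTERVAL_S,
                    missed_allowed: int = 1) -> bool:
    """True if the idle timeout survives `missed_allowed` lost heartbeats."""
    return idle_timeout_s >= heartbeat_s * (missed_allowed + 1)
```

With the defaults, 60 s is the smallest passing value, and both the ALB default of 4000 s and the Classic ELB value of 3600 s pass comfortably.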
Ensure the Router image includes the Tty: true fix in StreamOptions. Run stty size in the terminal — it should show your actual terminal dimensions (e.g., 40 120), not 0 0.
The sandbox image can't be pulled. Check:
- Template image reference: kubectl get clustersandboxtemplate <name> -o jsonpath='{.spec.image.repository}'
- Node can reach the registry: ECR images need the node role to have the AmazonEC2ContainerRegistryReadOnly policy
- For private registries, set spec.image.pullSecret in the sandbox spec
Never expose services to 0.0.0.0/0. Use one of:
- loadBalancerSourceRanges to restrict to specific IPs
- Or use kubectl port-forward for local access (no public exposure)
- Or deploy an Ingress with OIDC authentication
All Dockerfiles use public.ecr.aws/docker/library/* base images. If you see 429 errors, ensure your Dockerfiles reference ECR Public, not Docker Hub directly.
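To check your own Dockerfiles for base images that do not come from ECR Public, a quick scan of the FROM lines is usually enough. The sketch below is a hypothetical helper, not part of the repository. It skips scratch and references to earlier build stages, and it flags everything else that lacks the public.ecr.aws/ prefix, including private registries and ARG-substituted image names.

```python
import re
from typing import List

ECR_PUBLIC_PREFIX = "public.ecr.aws/"
# Matches "FROM [--platform=...] <image>" and captures the image reference.
_FROM_RE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)


def non_ecr_base_images(dockerfile_text: str) -> List[str]:
    """Return base images in FROM lines that do not come from ECR Public."""
    stage_names = set()
    offenders = []
    for line in dockerfile_text.splitlines():
        match = _FROM_RE.match(line)
        if match is None:
            continue
        image = match.group(1)
        is_earlier_stage = image.lower() in stage_names
        if (image != "scratch" and not is_earlier_stage
                and not image.startswith(ECR_PUBLIC_PREFIX)):
            offenders.append(image)
        # Remember "AS <name>" aliases so later "FROM <name>" is not flagged.
        words = line.split()
        if len(words) >= 2 and words[-2].lower() == "as":
            stage_names.add(words[-1].lower())
    return offenders
```

Run it over each file under your image directory and treat a non-empty result as a candidate cause of registry 429 responses.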
- Kubernetes 1.27+
- CNI with NetworkPolicy support (Calico, Cilium, or AWS VPC CNI)
- CSI storage driver (EBS CSI, PD CSI, or any CSI-compliant driver)
- Helm 3.x
agenttier/
├── cmd/controller/ # Kubernetes operator entrypoint
├── cmd/router/ # REST API + WebSocket terminal server
├── cmd/cli/ # CLI tool
├── api/v1alpha1/ # CRD type definitions
├── pkg/controller/ # Reconciliation logic
├── pkg/router/ # HTTP handlers, auth, terminal bridge
├── web-ui/ # React frontend (TypeScript + Vite)
├── helm/agenttier/ # Helm chart
├── terraform/aws-eks/ # AWS infrastructure (EKS + Cognito + ECR)
├── images/ # Reference Dockerfiles
├── python-sdk/ # Python SDK (pip install agenttier)
├── docs/ # Documentation (MkDocs)
└── hack/ # Scripts (quickstart, codegen, load testing)
We welcome contributions! See CONTRIBUTING.md for:
- Development setup
- Coding standards
- Testing requirements
- Pull request process
Apache License 2.0 — see LICENSE for details.
Built with:
- controller-runtime — Kubernetes operator framework
- kubebuilder — CRD scaffolding
- gorilla/websocket — WebSocket implementation
- xterm.js — Terminal emulator for the browser
- React — Web UI framework