Harbor installed-agent adapters for benchmarking AgentPlane as a control-plane wrapper around coding agents on Terminal-Bench.
AgentPlane is not submitted as a model. It is submitted as a harness profile:
- agentplane-codex: AgentPlane + Codex CLI
- agentplane-claude-code: AgentPlane + Claude Code
The goal is to measure whether AgentPlane improves reproducibility, traceability, recovery, and failure analysis while preserving the same underlying model and benchmark constraints.
Experimental scaffold. Use the smoke run first. Do not submit leaderboard results until the generated proof bundle and ATIF trajectories have been reviewed for Terminal-Bench integrity compliance.
- Docker running locally
- uv
- Harbor benchmark framework installed with uv tool install harbor
- Provider API key for the selected executor/model
- No benchmark-specific hints, oracle files, test folders, or modified timeouts
Codex CLI is authenticated inside the benchmark container with
codex login --with-api-key; the local run wrapper passes a Harbor env
template (OPENAI_API_KEY=${OPENAI_API_KEY}) instead of putting the key value
in the process argv.
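
The difference between passing a template and passing the key value can be sketched as below. This is a minimal illustration, not the run wrapper's actual code: the helper names `env_template` and `assert_no_secret_in_argv` are hypothetical, and how Harbor consumes the template is not shown.

```python
def env_template(name: str) -> str:
    """Return a literal NAME=${NAME} template; the real value is never read here."""
    return f"{name}=${{{name}}}"


def assert_no_secret_in_argv(argv: list[str], secret: str) -> None:
    """Raise if the secret value appears anywhere in the argument list."""
    if secret and any(secret in arg for arg in argv):
        raise ValueError("API key value leaked into process argv")
```

Because only the template string reaches the argument list, the key value does not show up in process listings or shell history for the wrapper command.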
If Homebrew's Harbor registry CLI is also installed, set this in .env.local:
HARBOR_BIN=$HOME/.local/bin/harbor

Create a virtual environment and install the adapter in editable mode:
uv venv
uv pip install -e ".[dev]"

Copy the local environment template and add your API key:
cp .env.example .env.local
$EDITOR .env.local

Then run:
uv tool install harbor
./scripts/agentplane_bench.sh setup
./scripts/agentplane_bench.sh preflight
./scripts/agentplane_bench.sh oracle-smoke
./scripts/agentplane_bench.sh smoke

For a full Harbor run:
N= ./scripts/agentplane_bench.sh full

For a Terminal-Bench leaderboard-shaped run using the legacy tb CLI:
./scripts/agentplane_bench.sh leaderboard-tb

Estimate API cost before a full run:
./scripts/estimate_cost.py --model gpt-5-nano --tasks 80 --profile mid

Codex:
export OPENAI_API_KEY="..."
harbor run \
-d terminal-bench/terminal-bench-2 \
--agent-import-path agentplane_harbor_adapter.agentplane_codex:AgentPlaneCodexAgent \
-m gpt-5-nano \
-n 1

Claude Code:
export ANTHROPIC_API_KEY="..."
harbor run \
-d terminal-bench/terminal-bench-2 \
--agent-import-path agentplane_harbor_adapter.agentplane_claude_code:AgentPlaneClaudeCodeAgent \
-m anthropic/claude-sonnet-4-5 \
-n 1

Use the current Terminal-Bench/Harbor submission instructions before running a full submission. As of the last checked public docs, leaderboard submissions must use the official Terminal-Bench dataset, default agent timeout, default test timeout, and ATIF trajectories for passing trials.
Example Harbor run shape:
harbor run \
-d terminal-bench/terminal-bench-2 \
--agent-import-path agentplane_harbor_adapter.agentplane_codex:AgentPlaneCodexAgent \
-m gpt-5-nano

If the active submission route still requires the legacy Terminal-Bench CLI, use the official dataset form:
tb run \
--agent <published-agentplane-agent-name> \
--model <model> \
--dataset terminal-bench-core==0.1.1

See docs/cost.md before running a full dataset. A full run can cost hundreds of dollars on frontier models if tasks require many turns.
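
The cost warning above comes down to simple arithmetic over tasks, turns, and token prices. The sketch below is a rough illustration only. It does not reproduce scripts/estimate_cost.py, the function name `estimate_run_cost` is hypothetical, and every number passed to it should come from your provider's current price list and your own smoke-run logs.

```python
def estimate_run_cost(
    tasks: int,
    turns_per_task: int,
    input_tokens_per_turn: int,
    output_tokens_per_turn: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate total USD cost assuming every task uses the same turn budget."""
    input_tokens = tasks * turns_per_task * input_tokens_per_turn
    output_tokens = tasks * turns_per_task * output_tokens_per_turn
    return (
        input_tokens / 1_000_000 * usd_per_million_input
        + output_tokens / 1_000_000 * usd_per_million_output
    )
```

Input tokens usually dominate in agentic runs because the growing conversation is resent on each turn, so the per-turn input figure is the one most worth measuring rather than guessing.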
Each adapter run writes AgentPlane sidecar artifacts under:
.agentplane-harbor/
  proof.json
  versions.json
  git-diff.patch
  git-status.txt
  agentplane/
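
A minimal sketch of how the two git sidecar files could be captured is shown here. Which git commands the adapter actually runs is an assumption, and the function name `write_git_sidecars` and the injectable `run_git` parameter are hypothetical conveniences for testing.

```python
import subprocess
from pathlib import Path
from typing import Callable


def _run_git(args: list[str], cwd: Path) -> str:
    """Run a git subcommand in cwd and return its stdout."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout


def write_git_sidecars(
    workspace: Path,
    run_git: Callable[[list[str], Path], str] = _run_git,
) -> Path:
    """Write git-status.txt and git-diff.patch under .agentplane-harbor/."""
    out_dir = workspace / ".agentplane-harbor"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: porcelain status and a plain working-tree diff are sufficient.
    (out_dir / "git-status.txt").write_text(run_git(["status", "--porcelain"], workspace))
    (out_dir / "git-diff.patch").write_text(run_git(["diff"], workspace))
    return out_dir
```

Note that a plain `git diff` omits untracked files, so the status file is what reveals newly created files in this sketch.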
The proof bundle records:
- AgentPlane version
- executor version
- model name
- dataset/task metadata when Harbor exposes it
- generic policy hash
- run start/end timestamps
- final git status
- final diff
- AgentPlane task artifacts when available
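
The fields listed above could be modelled roughly as follows. This is a sketch under stated assumptions: the real proof.json schema is not documented here, so every field name, the `ProofBundle` class, and the choice of SHA-256 for the policy hash are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ProofBundle:
    agentplane_version: str
    executor_version: str
    model_name: str
    policy_hash: str
    started_at: str  # ISO 8601 timestamp
    ended_at: str  # ISO 8601 timestamp
    final_git_status: str
    final_diff: str
    # Only populated when Harbor exposes dataset/task metadata.
    task_metadata: Optional[dict] = None
    agentplane_artifacts: list = field(default_factory=list)


def policy_hash(policy_text: str) -> str:
    """Hash the generic policy text so reviewers can confirm it did not change."""
    return hashlib.sha256(policy_text.encode("utf-8")).hexdigest()


def to_json(bundle: ProofBundle) -> str:
    """Serialize with sorted keys so two identical bundles produce identical files."""
    return json.dumps(asdict(bundle), indent=2, sort_keys=True)
```

Hashing the policy rather than embedding it keeps the bundle small while still letting a reviewer verify that the same generic policy was used across every trial.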
The adapter must not:
- modify benchmark timeouts
- expose oracle solutions or hidden tests to the agent
- fetch task solutions from the internet
- change the grader or reward pipeline
- inject task-specific hints into AgentPlane policy
- store encrypted or obfuscated solutions in the adapter image
Reward hacking or cheating can invalidate the submission. Keep all AgentPlane policy generic and publish the exact adapter commit used for the run.
Primary benchmark score:
- Terminal-Bench success rate
- passed tasks / total trials
- official logs and ATIF trajectories
AgentPlane-specific proof:
- evidence completeness
- reproducible lifecycle artifacts
- failed-run diagnosis quality
- dirty-state prevention
- overhead in wall time and artifacts
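
Two of the metrics above reduce to simple ratios. The sketch below shows one way to compute them; the function names are hypothetical, and defining overhead relative to a bare-executor baseline run is an assumption.

```python
def success_rate(passed: int, total_trials: int) -> float:
    """Fraction of trials the official verifier marked as passed."""
    if total_trials <= 0:
        raise ValueError("total_trials must be positive")
    if not 0 <= passed <= total_trials:
        raise ValueError("passed must be between 0 and total_trials")
    return passed / total_trials


def wall_time_overhead(wrapped_seconds: float, baseline_seconds: float) -> float:
    """Relative extra wall time of the AgentPlane run versus the bare executor."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (wrapped_seconds - baseline_seconds) / baseline_seconds
```

Both runs should use the same model, dataset, and timeouts; otherwise the overhead figure mixes harness cost with unrelated differences.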
The adapter also writes local evaluator artifacts for each AgentPlane attempt:
- .agentplane-harbor/agentplane/plan.json
- .agentplane-harbor/agentplane/task-graph.json
- .agentplane-harbor/agentplane/evaluator-report.json
- .agentplane-harbor/agentplane/evaluator-feedback.txt
- .agentplane-harbor/agentplane/planner-attempt-<n>.log
- .agentplane-harbor/agentplane/evaluator-attempt-<n>.log
- .agentplane-harbor/agentplane/executor-attempt-<n>.log
The evaluator uses only public task-local signals and treats the official Harbor verifier as the scoring truth. A failed local evaluator triggers AgentPlane rework, but the agent command exits successfully so Harbor can record a normal trial and let the official verifier assign reward.
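
The control flow described above can be sketched as a bounded rework loop that always reports success to Harbor. This is an illustration, not the adapter's code: `run_with_rework`, its callback signatures, and the `max_attempts` limit are all hypothetical.

```python
from typing import Callable


def run_with_rework(
    execute: Callable[[int, str], None],
    evaluate: Callable[[], tuple[bool, str]],
    max_attempts: int = 3,
) -> int:
    """Run the executor, re-running with evaluator feedback until it passes or attempts run out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        execute(attempt, feedback)
        passed, feedback = evaluate()
        if passed:
            break
    # Always exit 0: the local evaluator is advisory, and only the
    # official Harbor verifier decides the trial's reward.
    return 0
```

Returning 0 even after a failed local evaluation keeps the trial from being recorded as an agent crash, so the official verifier still scores whatever state the workspace ended in.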
Before implementation, the adapter now runs a planning gate. The executor must inspect the workspace and write valid planning artifacts with a compact atomic task graph. The evaluator checks those artifacts before the implementation loop starts, so the Harbor profile better matches normal AgentPlane usage: plan, approve, execute scoped leaves, evaluate, repair, and finalize.
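
One check a planning gate of this kind might perform is confirming that the task graph is well formed before any implementation starts. The sketch below assumes a hypothetical task-graph.json shape of `{"tasks": [{"id": ..., "depends_on": [...]}]}`; the real schema and the function name `validate_task_graph` are not taken from the adapter.

```python
def validate_task_graph(graph: dict) -> list[str]:
    """Return a list of problems; an empty list means the graph passed."""
    tasks = graph.get("tasks")
    if not isinstance(tasks, list) or not tasks:
        return ["task graph has no tasks"]
    if not all(isinstance(task, dict) for task in tasks):
        return ["every task must be an object"]

    ids = [task.get("id") for task in tasks]
    if None in ids or len(set(ids)) != len(ids):
        return ["task ids must be present and unique"]

    deps = {task["id"]: set(task.get("depends_on", [])) for task in tasks}
    problems = [
        f"{task_id} depends on unknown task {dep}"
        for task_id, task_deps in deps.items()
        for dep in sorted(task_deps)
        if dep not in deps
    ]
    if problems:
        return problems

    # Kahn-style peeling: repeatedly remove tasks whose dependencies are all done.
    remaining = {task_id: set(task_deps) for task_id, task_deps in deps.items()}
    while remaining:
        ready = [task_id for task_id, task_deps in remaining.items() if not task_deps]
        if not ready:
            return ["task graph contains a dependency cycle"]
        for task_id in ready:
            del remaining[task_id]
        for task_deps in remaining.values():
            task_deps.difference_update(ready)
    return []
```

Rejecting cyclic or dangling dependencies up front means the executor only ever receives leaves that can actually be scheduled.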
The local evaluator is intentionally stricter than a smoke check for the task
families currently covered: planner artifacts are schema-checked without
penalizing safety disclaimers, circuit-fibsqrt is tested against deterministic
square-boundary and pseudo-random oracle cases, and make-mips-interpreter
requires concrete interpreter subsystems plus a valid non-uniform BMP frame.
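
As an illustration of the last check, a BMP frame can be validated from its headers alone. This sketch is not the evaluator's implementation: the function name is hypothetical, it assumes an uncompressed 24- or 32-bit image with a standard 40-byte BITMAPINFOHEADER, and it interprets "non-uniform" as "contains at least two different pixel values".

```python
import struct


def is_valid_nonuniform_bmp(data: bytes) -> bool:
    """Return True if data is a parseable BMP whose pixels are not all identical."""
    # 14-byte file header plus 40-byte BITMAPINFOHEADER.
    if len(data) < 54 or data[:2] != b"BM":
        return False
    pixel_offset = struct.unpack_from("<I", data, 10)[0]
    width, height = struct.unpack_from("<ii", data, 18)
    bits_per_pixel = struct.unpack_from("<H", data, 28)[0]
    if bits_per_pixel not in (24, 32) or width <= 0 or height == 0:
        return False

    bytes_per_pixel = bits_per_pixel // 8
    stride = (bits_per_pixel * width + 31) // 32 * 4  # rows are padded to 4 bytes
    rows = abs(height)  # negative height means top-down row order
    if pixel_offset < 54 or pixel_offset + stride * rows > len(data):
        return False

    seen = set()
    for row in range(rows):
        start = pixel_offset + row * stride
        for col in range(width):
            pos = start + col * bytes_per_pixel
            seen.add(data[pos:pos + bytes_per_pixel])
            if len(seen) > 1:
                return True
    return False
```

Comparing whole pixels rather than raw bytes matters here: a solid red frame contains several distinct byte values but is still uniform.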
The minimum useful claim is not "AgentPlane always scores higher". It is:
With the same executor and model, AgentPlane preserves benchmark validity and adds auditable task lifecycle evidence, reproducible artifacts, and clearer failure analysis.