Thanks to visit codestin.com
Credit goes to github.com

Skip to content

basilisk-labs/agentplane-harbor-adapter

Repository files navigation

AgentPlane Harbor Adapter

Harbor installed-agent adapters for benchmarking AgentPlane as a control-plane wrapper around coding agents on Terminal-Bench.

AgentPlane is not submitted as a model. It is submitted as a harness profile:

  • agentplane-codex: AgentPlane + Codex CLI
  • agentplane-claude-code: AgentPlane + Claude Code

The goal is to measure whether AgentPlane improves reproducibility, traceability, recovery, and failure analysis while preserving the same underlying model and benchmark constraints.

Status

Experimental scaffold. Use the smoke run first. Do not submit leaderboard results until the generated proof bundle and ATIF trajectories have been reviewed for Terminal-Bench integrity compliance.

Requirements

  • Docker running locally
  • uv
  • Harbor benchmark framework installed with uv tool install harbor
  • Provider API key for the selected executor/model
  • No benchmark-specific hints, oracle files, test folders, or modified timeouts

Codex CLI is authenticated inside the benchmark container with codex login --with-api-key; the local run wrapper passes a Harbor env template (OPENAI_API_KEY=${OPENAI_API_KEY}) instead of putting the key value in the process argv.

If Homebrew's Harbor registry CLI is also installed, set this in .env.local:

HARBOR_BIN=$HOME/.local/bin/harbor

Install for local development

uv venv
uv pip install -e ".[dev]"

One-command local path

Copy the local environment template and add your API key:

cp .env.example .env.local
$EDITOR .env.local

Then run:

uv tool install harbor
./scripts/agentplane_bench.sh setup
./scripts/agentplane_bench.sh preflight
./scripts/agentplane_bench.sh oracle-smoke
./scripts/agentplane_bench.sh smoke

For a full Harbor run:

N= ./scripts/agentplane_bench.sh full

For a Terminal-Bench leaderboard-shaped run using the legacy tb CLI:

./scripts/agentplane_bench.sh leaderboard-tb

Estimate API cost before a full run:

./scripts/estimate_cost.py --model gpt-5-nano --tasks 80 --profile mid

Run a smoke task

Codex:

export OPENAI_API_KEY="..."
harbor run \
  -d terminal-bench/terminal-bench-2 \
  --agent-import-path agentplane_harbor_adapter.agentplane_codex:AgentPlaneCodexAgent \
  -m gpt-5-nano \
  -n 1

Claude Code:

export ANTHROPIC_API_KEY="..."
harbor run \
  -d terminal-bench/terminal-bench-2 \
  --agent-import-path agentplane_harbor_adapter.agentplane_claude_code:AgentPlaneClaudeCodeAgent \
  -m anthropic/claude-sonnet-4-5 \
  -n 1

Leaderboard run

Use the current Terminal-Bench/Harbor submission instructions before running a full submission. As of the last checked public docs, leaderboard submissions must use the official Terminal-Bench dataset, default agent timeout, default test timeout, and ATIF trajectories for passing trials.

Example Harbor run shape:

harbor run \
  -d terminal-bench/terminal-bench-2 \
  --agent-import-path agentplane_harbor_adapter.agentplane_codex:AgentPlaneCodexAgent \
  -m gpt-5-nano

If the active submission route still requires the legacy Terminal-Bench CLI, use the official dataset form:

tb run \
  --agent <published-agentplane-agent-name> \
  --model <model> \
  --dataset terminal-bench-core==0.1.1

See docs/cost.md before running a full dataset. A full run can cost hundreds of dollars on frontier models if tasks require many turns.

Proof bundle

Each adapter run writes AgentPlane sidecar artifacts under:

.agentplane-harbor/
  proof.json
  versions.json
  git-diff.patch
  git-status.txt
  agentplane/

The proof bundle records:

  • AgentPlane version
  • executor version
  • model name
  • dataset/task metadata when Harbor exposes it
  • generic policy hash
  • run start/end timestamps
  • final git status
  • final diff
  • AgentPlane task artifacts when available

Integrity rules

The adapter must not:

  • modify benchmark timeouts
  • expose oracle solutions or hidden tests to the agent
  • fetch task solutions from the internet
  • change the grader or reward pipeline
  • inject task-specific hints into AgentPlane policy
  • store encrypted or obfuscated solutions in the adapter image

Reward hacking or cheating can invalidate the submission. Keep all AgentPlane policy generic and publish the exact adapter commit used for the run.

Interpreting results

Primary benchmark score:

  • Terminal-Bench success rate
  • passed tasks / total trials
  • official logs and ATIF trajectories

AgentPlane-specific proof:

  • evidence completeness
  • reproducible lifecycle artifacts
  • failed-run diagnosis quality
  • dirty-state prevention
  • overhead in wall time and artifacts

The adapter also writes local evaluator artifacts for each AgentPlane attempt:

  • .agentplane-harbor/agentplane/plan.json
  • .agentplane-harbor/agentplane/task-graph.json
  • .agentplane-harbor/agentplane/evaluator-report.json
  • .agentplane-harbor/agentplane/evaluator-feedback.txt
  • .agentplane-harbor/agentplane/planner-attempt-<n>.log
  • .agentplane-harbor/agentplane/evaluator-attempt-<n>.log
  • .agentplane-harbor/agentplane/executor-attempt-<n>.log

The evaluator uses only public task-local signals and treats the official Harbor verifier as the scoring truth. A failed local evaluator triggers AgentPlane rework, but the agent command exits successfully so Harbor can record a normal trial and let the official verifier assign reward.

Before implementation, the adapter now runs a planning gate. The executor must inspect the workspace and write valid planning artifacts with a compact atomic task graph. The evaluator checks those artifacts before the implementation loop starts, so the Harbor profile better matches normal AgentPlane usage: plan, approve, execute scoped leaves, evaluate, repair, and finalize.

The local evaluator is intentionally stricter than a smoke check for the task families currently covered: planner artifacts are schema-checked without penalizing safety disclaimers, circuit-fibsqrt is tested against deterministic square-boundary and pseudo-random oracle cases, and make-mips-interpreter requires concrete interpreter subsystems plus a valid non-uniform BMP frame.

The minimum useful claim is not "AgentPlane always scores higher". It is:

With the same executor and model, AgentPlane preserves benchmark validity and adds auditable task lifecycle evidence, reproducible artifacts, and clearer failure analysis.

About

⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors