AGENTS.md

Project Overview

AXIS (Agent Experience Index Score) is a synthetic testing framework for AI agents. It runs agents against scenarios, captures transcripts, and produces graded scores across four dimensions: goal achievement, environment quality, service quality, and agent quality.

  • ESM TypeScript, built with tsc, tested with vitest, CLI via commander
  • Live terminal display uses ink (React for CLIs), rendered to stderr
  • Runner is fully decoupled from display via a Logger interface

Terminology

  • AXIS Result (not "AXIS Score") - the composite 0–100 number. "AXIS Score" reads as "score score" since AXIS already stands for "Agent Experience Index Score".
  • Use "AXIS Result" in all user-facing text, display output, and documentation.
  • The internal property names (axisScore, averageAxisScore) are fine as code identifiers.

Architecture

| Layer | Key Files | Purpose |
| --- | --- | --- |
| CLI | src/cli.ts | Entry point, ink display, signal handling |
| Runner | src/runner/runner.ts, lifecycle.ts | Job orchestration, concurrency, isolation; runLifecyclePhase captures $AXIS_OUTPUT markdown for setup/teardown/beforeAll/afterAll phases |
| Adapters | src/adapters/*.ts | Spawn agent CLIs, parse NDJSON streams |
| Scoring | src/scoring/ | LLM judge + interaction-based evaluation pipeline |
| Reports | src/reports/writer.ts, reader.ts | Persistent .axis/reports/ store |
| Config | src/config/loader.ts, remote-scenarios.ts | Parses axis.config.*; mergeRemoteConfig clones any git-URL entries in scenarios into .axis/remotes/, inlines their paths, and folds each remote repo's env/mcp_servers/skills/artifacts/adapters into the parent (parent wins on collisions) |
| Display | src/ui/format.ts, LiveStatus.tsx, AnimatedTokens.tsx | Pure formatting + ink components (incl. live token counter) |
| Types | src/types/ | Shared interfaces (agent, config, output, scoring, report) |
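
The "parent wins on collisions" rule in the Config row can be pictured with a small merge helper. This is a minimal sketch, assuming env is a flat string map; EnvMap and mergeEnvParentWins are hypothetical names for illustration, and the real mergeRemoteConfig in src/config/remote-scenarios.ts may be structured differently.

```typescript
// Hypothetical type and helper, for illustration only.
type EnvMap = Record<string, string>;

// Fold remote env maps together, then spread the parent last so that any
// key defined in both places keeps the parent's value.
function mergeEnvParentWins(parent: EnvMap, remotes: EnvMap[]): EnvMap {
  const folded: EnvMap = {};
  for (const remote of remotes) {
    Object.assign(folded, remote);
  }
  return { ...folded, ...parent };
}

const mergedEnv = mergeEnvParentWins(
  { API_URL: "https://parent.example", LOG_LEVEL: "debug" },
  [{ API_URL: "https://remote.example", REGION: "eu" }],
);
```

Here mergedEnv keeps the parent's API_URL and gains REGION from the remote.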

Adapter Pattern

Built-in adapters split into two factories. NDJSON-style adapters (claude-code, codex) are created via createAgentAdapter(spec) from src/adapters/base/agent-adapter.ts. ACP-based adapters (claude-sdk, codex-sdk, gemini, goose, opencode, qwen-code, stakpak, blackbox, fast-agent, mistral-vibe, factory-droid, poolside, vtcode, cursor-agent, auggie, kimi, openhands, cline, kiro-cli, kilo, qoder) are created via createAcpBasedAdapter(spec) from src/adapters/base/acp-adapter.ts. Each adapter is a plain factory function (e.g. createGeminiAdapter()) that returns an AgentAdapter - no classes, no inheritance. The factory owns the shared plumbing:

  • Spawn + stdin.end + cleanup registration (SIGTERM on Ctrl-C)
  • 10-minute timeout → SIGTERM → SIGKILL after 5s grace (timer cleared on clean exit)
  • stderr capped at 100 KB
  • close event listener registered BEFORE stdout stream to avoid missing it
  • Raw output capture (NDJSON lines for lines mode, raw chunks for aggregate)
  • Token estimator wiring via StreamContext.feedAssistantText
  • CLI resolution (direct command → npx --yes <pkg> fallback)
  • Error precedence: extracted.metadata.error → stderr → "Agent process exited with non-zero code"
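
Two items from the list above, the stderr cap and the error precedence, can be sketched as pure functions. This is an illustration rather than the factory's actual code: appendCapped and resolveError are hypothetical names, the cap is counted in string length rather than bytes, and falling back to stderr only on a non-zero exit is an assumption.

```typescript
// 100 KB cap from the list above; measured in UTF-16 code units here,
// which only approximates bytes (assumption of this sketch).
const STDERR_CAP_CHARS = 100 * 1024;

// Append a chunk to the captured stderr without ever exceeding the cap.
function appendCapped(buffer: string, chunk: string, cap: number = STDERR_CAP_CHARS): string {
  const remaining = cap - buffer.length;
  return remaining <= 0 ? buffer : buffer + chunk.slice(0, remaining);
}

// Pick the error message in precedence order: the error extracted into
// metadata, then stderr, then a generic fallback. Returns undefined when
// the run did not fail.
function resolveError(
  exitCode: number,
  metadataError: string | undefined,
  stderr: string,
): string | undefined {
  if (metadataError) return metadataError;
  if (exitCode === 0) return undefined;
  const trimmed = stderr.trim();
  return trimmed.length > 0 ? trimmed : "Agent process exited with non-zero code";
}
```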

The NDJSON-style adapters (claude-code, codex) use lines mode for NDJSON parsing. Custom adapters can use either lines or aggregate mode (raw stdout capture). ACP-based adapters bypass streamConfig entirely - the ACP SDK handles framing.
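
To make lines mode concrete, here is a minimal sketch of NDJSON line splitting over a chunked stream. createLineSplitter is a hypothetical helper, not the factory's implementation; the point is that a JSON object may be split across chunks, so partial lines must be buffered until a newline arrives.

```typescript
// Hypothetical helper: buffers chunks and emits only complete lines.
function createLineSplitter(onLine: (line: string) => void) {
  let pending = "";
  return {
    push(chunk: string): void {
      pending += chunk;
      let newline = pending.indexOf("\n");
      while (newline !== -1) {
        // Drop a trailing carriage return so CRLF streams also work.
        const line = pending.slice(0, newline).replace(/\r$/, "");
        pending = pending.slice(newline + 1);
        if (line.length > 0) onLine(line);
        newline = pending.indexOf("\n");
      }
    },
    // Flush whatever is left when the stream closes without a final newline.
    end(): void {
      if (pending.length > 0) onLine(pending);
      pending = "";
    },
  };
}

const parsedEvents: unknown[] = [];
const splitter = createLineSplitter((line) => {
  try {
    parsedEvents.push(JSON.parse(line));
  } catch {
    // Lines that are not valid JSON are skipped in this sketch.
  }
});

splitter.push('{"type":"start"}\n{"type":"mes');
splitter.push('sage","text":"hi"}\nnot json\n');
splitter.end();
```

After these calls, parsedEvents holds the two JSON events; the "not json" line is skipped.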

Adding a new agent adapter

Call createAgentAdapter(spec) with an AgentAdapterSpec<State>. The spec is a single typed object - no class inheritance, no protected hooks:

| Spec field | Purpose |
| --- | --- |
| name | Adapter name (registered in src/adapters/registry.ts) |
| cliCommand? | CLI binary for resolveCommand; omit if user-supplied |
| timeoutMs? | Execution timeout (default 10 min) |
| requiredEnv? | Env vars validated by the runner pre-flight (e.g. ANTHROPIC_API_KEY) |
| hasLocalSession? | Detect a usable local CLI login (e.g. claude login, codex login). Runner calls this only when requiredEnv is missing; explicit API keys always win |
| isolationEnv? | Isolation vars (e.g. CLAUDE_CONFIG_DIR, CODEX_HOME). Signature: ({ workspace, home }) => Record<string, string>. Point *_HOME-style paths under home, never workspace |
| prepare? | Side effects (mkdir, MCP / skills writers) before spawn |
| resolveCommand? | Override how the CLI command is resolved |
| buildArgs | Build CLI arguments (prefix args from command resolution prepended automatically) |
| initialState | Per-run mutable state used by streamConfig handlers and getResult |
| streamConfig | How to process agent stdout. Discriminated union: { mode: "lines", onLine, onEnd? } or { mode: "aggregate", onChunk, onEnd? } |
| getResult | Build final { result, metadata? } from accumulated state after exit |

The streamConfig field uses a discriminated union so the mode and its handler can never get out of sync - no runtime assertions needed. getResult returns null for "no result" (never ""). Metadata overrides (e.g. upstream durationMs) are spread on top of base-computed fields.
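
The discriminated union can be sketched as follows. The handler signatures here are assumptions (the real types in src/adapters/base/agent-adapter.ts may pass a context object or different arguments), and feedStream and toResult are hypothetical helpers. The sketch shows that switching on mode narrows the type, so onLine is only reachable in lines mode and onChunk only in aggregate mode.

```typescript
// Assumed handler signatures; only the field names come from the spec table.
type StreamConfigSketch<State> =
  | { mode: "lines"; onLine: (line: string, state: State) => void; onEnd?: (state: State) => void }
  | { mode: "aggregate"; onChunk: (chunk: string, state: State) => void; onEnd?: (state: State) => void };

// Dispatch on the discriminant; the compiler narrows config in each branch.
function feedStream<State>(config: StreamConfigSketch<State>, state: State, data: string): void {
  switch (config.mode) {
    case "lines":
      for (const line of data.split("\n")) {
        if (line.length > 0) config.onLine(line, state);
      }
      break;
    case "aggregate":
      config.onChunk(data, state);
      break;
  }
}

// Normalize an empty result to null, matching the "never an empty string" rule.
function toResult(text: string): string | null {
  return text.length > 0 ? text : null;
}

interface DemoState {
  text: string;
}

const linesConfig: StreamConfigSketch<DemoState> = {
  mode: "lines",
  onLine: (line: string, state: DemoState) => {
    state.text += line.toUpperCase();
  },
};

const aggregateConfig: StreamConfigSketch<DemoState> = {
  mode: "aggregate",
  onChunk: (chunk: string, state: DemoState) => {
    state.text += chunk;
  },
};
```

Writing { mode: "lines", onChunk: ... } fails to type-check, which is the property the paragraph above relies on.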

For built-in adapters, register the factory in src/adapters/registry.ts. External custom adapters are loaded via the adapters field in axis.config.json - the runner dynamically imports the module and calls registerAdapter() before running any jobs.

Authentication: API key preferred, local CLI session as fallback

claude-code and codex declare requiredEnv (ANTHROPIC_API_KEY, CODEX_API_KEY) AND implement hasLocalSession. Pre-flight in runner.ts checks env vars first; if missing, it calls hasLocalSession and only throws if both signals are absent. API keys are the preferred path — they're explicit, work in CI, and don't bill against an individual subscription. The local-session fallback exists for laptop ergonomics: when a developer already has claude login / codex login set up, they shouldn't have to mint an API key just to try a scenario. The prepare hooks in those adapters materialize ~/.claude.json / ~/.codex/auth.json (and on macOS, the Keychain entry for Claude Code) into the isolated CLAUDE_CONFIG_DIR / CODEX_HOME so the CLI authenticates as if it were running normally. New adapters that wrap a CLI with both API-key and login flows should follow the same pattern.
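
The pre-flight order described above (env vars first, local session second, throw only when both are absent) can be sketched like this. AuthSpec, AuthDecision, and checkAuth are hypothetical names, and hasLocalSession is modelled as synchronous for brevity; the real check in runner.ts may be asynchronous and report errors differently.

```typescript
// Hypothetical shape; only requiredEnv and hasLocalSession come from the spec table.
interface AuthSpec {
  name: string;
  requiredEnv?: string[];
  hasLocalSession?: () => boolean;
}

type AuthDecision = "env" | "local-session";

function checkAuth(spec: AuthSpec, env: Record<string, string | undefined>): AuthDecision {
  // Explicit API keys always win, so check them first.
  const missing = (spec.requiredEnv ?? []).filter((key) => !env[key]);
  if (missing.length === 0) return "env";

  // Only consult the local login when a required variable is missing.
  if (spec.hasLocalSession?.()) return "local-session";

  throw new Error(`${spec.name}: missing ${missing.join(", ")} and no local CLI session found`);
}
```

A caller would pass process.env as the second argument; taking it as a parameter keeps the function easy to test.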

Error Handling

  • AgentMetadata.error is the canonical error field for failed runs
  • Runner checks both exitCode !== 0 and metadata.error for failure status
  • Friendly error classification in src/ui/format.ts via friendlyError() - maps common patterns (quota, rate limit, auth, timeout, network) to one-line messages
  • Error display: ↳ friendly message below failed rows in tables, Error: line in detail views
  • Scoring callbacks in cli.ts preserve "failed" status - never overwrite to "done"
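
The pattern-to-message mapping could look roughly like the sketch below. The regular expressions and message wording are invented for illustration and are not taken from src/ui/format.ts; friendlyErrorSketch is a hypothetical stand-in for friendlyError().

```typescript
// Hypothetical pattern table; order matters because the first match wins.
const FRIENDLY_PATTERNS: Array<{ pattern: RegExp; message: string }> = [
  { pattern: /quota/i, message: "Quota exceeded" },
  { pattern: /rate.?limit|\b429\b/i, message: "Rate limited" },
  { pattern: /unauthorized|invalid api key|\b401\b/i, message: "Authentication failed" },
  { pattern: /timed? ?out|ETIMEDOUT/i, message: "Timed out" },
  { pattern: /ECONNREFUSED|ENOTFOUND|network/i, message: "Network error" },
];

// Map a raw error string to a one-line message; unknown errors fall back
// to their first line so the table row stays short.
function friendlyErrorSketch(raw: string): string {
  for (const { pattern, message } of FRIENDLY_PATTERNS) {
    if (pattern.test(raw)) return message;
  }
  return (raw.split("\n")[0] ?? "").trim();
}
```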

Debug Mode

--debug enables raw output capture:

  • AgentInput.captureRawOutput signals adapters to collect raw stdout lines
  • AgentOutput.rawOutput carries the lines back to the runner
  • Report writer strips rawOutput from scenario JSON and writes it as {agent}.raw.ndjson
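
The report-writer step above can be sketched as a pure split: one object for the scenario JSON without rawOutput, and one optional file payload named {agent}.raw.ndjson. ScenarioRecord, FileToWrite, and splitRawOutput are hypothetical names, and the record shape is reduced to the fields needed for the example.

```typescript
// Hypothetical, trimmed-down record shape for illustration.
interface ScenarioRecord {
  agent: string;
  result: string | null;
  rawOutput?: string[];
}

interface FileToWrite {
  path: string;
  contents: string;
}

// Separate the raw lines from the record so the JSON report stays small.
function splitRawOutput(record: ScenarioRecord): {
  json: Omit<ScenarioRecord, "rawOutput">;
  rawFile: FileToWrite | null;
} {
  const { rawOutput, ...rest } = record;
  const rawFile =
    rawOutput && rawOutput.length > 0
      ? { path: `${record.agent}.raw.ndjson`, contents: rawOutput.join("\n") + "\n" }
      : null;
  return { json: rest, rawFile };
}

const splitDemo = splitRawOutput({
  agent: "codex",
  result: "ok",
  rawOutput: ['{"type":"start"}', '{"type":"end"}'],
});
```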

Documentation Policy

User-facing documentation lives in src/docs-site/ (Astro), published at https://axis.run. All changes to the CLI, scoring system, or configuration schema must be reflected there - the docs site is canonical and must stay in sync with the implementation.

README.md is intentionally lean: tagline, quick start, link tree to the docs site, and the programmatic API surface. Don't expand it back into a full reference - update the docs site instead.

| Change Type | Where to update |
| --- | --- |
| New/modified CLI flags or commands | src/docs-site/src/pages/cli.astro |
| New/modified config fields | src/docs-site/src/pages/configuration.astro |
| New/modified scenario schema fields | src/docs-site/src/pages/configuration.astro |
| Scoring algorithm changes | src/docs-site/src/pages/scoring.astro |
| Adapter contract / built-in changes | src/docs-site/src/pages/running.astro |
| Report / baseline format changes | src/docs-site/src/pages/running.astro + cli.astro |
| New/modified public exports | README.md Programmatic API section (kept here, not in docs) |

Build & Test

rm -rf dist && npm run build   # Always clean build - stale dist/ causes subtle issues
npm test                       # vitest, all unit tests

Workspace / Home isolation

Per-job temp layout (see createWorkspace in runner.ts):

/tmp/axis-<rand>/
  ├── work/   ← agent cwd (only scenario-provided files; what the agent scans)
  └── home/   ← agent HOME — .codex/, .claude/, .gemini/, .qwen/, user-scoped skills, MCP config
  • Adapter isolationEnv MUST place *_HOME-style paths under home, never workspace, so the agent doesn't see its own config when scanning files
  • HOME is set to home. AXIS_WORKSPACE is workspace. Lifecycle scripts can read both
  • Claude Code's MCP config is written to home/.claude/mcp.json and wired with --mcp-config <path> (no .mcp.json in cwd)
  • Claude/Claude-SDK skills go to CLAUDE_CONFIG_DIR/skills/ (under home); Gemini skills go to GEMINI_CLI_HOME/skills/ (under home)
  • Codex skills are the one exception: Codex only discovers skills under .agents/skills/ in cwd, so they live in workspace. Scenarios opting into Codex skills accept this limited visibility
  • Artifact capture walks workspace only — agent config never leaks into report artifacts
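
An isolationEnv that follows these rules might look like the sketch below. The signature comes from the spec table; the .claude subdirectory name is an assumption, and assertUnderHome is a hypothetical guard added only to show the "under home, never workspace" rule being checked.

```typescript
interface IsolationDirs {
  workspace: string;
  home: string;
}

// Hypothetical isolationEnv for a Claude-style adapter: the config
// directory is placed under home so a workspace scan never sees it.
const claudeIsolationEnv = ({ home }: IsolationDirs): Record<string, string> => ({
  CLAUDE_CONFIG_DIR: `${home}/.claude`,
});

// Hypothetical guard: every *_HOME or *_DIR variable must resolve under home.
function assertUnderHome(env: Record<string, string>, dirs: IsolationDirs): void {
  for (const [key, value] of Object.entries(env)) {
    const isPathVar = key.endsWith("_HOME") || key.endsWith("_DIR");
    if (isPathVar && !value.startsWith(`${dirs.home}/`)) {
      throw new Error(`${key} must live under home, got ${value}`);
    }
  }
}

const demoDirs: IsolationDirs = { workspace: "/tmp/axis-demo/work", home: "/tmp/axis-demo/home" };
const demoEnv = claudeIsolationEnv(demoDirs);
assertUnderHome(demoEnv, demoDirs);
```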

Gotchas

  • Always clean dist/ before testing changes - stale JS in dist can mask TypeScript errors
  • Use getUTCHours() for timestamp IDs - getHours() gives local time
  • Ink renders asynchronously - yield for 100ms before unmount to flush the final state
  • Gemini CLI streams assistant messages as deltas (delta: true) - the adapter accumulates them
  • Gemini settings.json must disable context discovery (discoveryMaxDirs: 0) or Gemini will scan the workspace tree before addressing the prompt, adding latency and unnecessary tool calls
  • Runner emits initial onJobUpdate AFTER pre-flight to avoid ink cursor corruption
  • The close event listener must be registered BEFORE readline to avoid missing it
  • Live token counter estimates tokens as chars / 5 (intentionally conservative) so the displayed count never has to be revised downward; runner enforces monotonicity in updateTokens
  • setupOutput / teardownOutput (captured from $AXIS_OUTPUT) live on RunResult but scoreRunResult returns a fresh ScoredRunResult - cli.ts re-propagates them onto the scored copy alongside artifacts, otherwise they vanish from the manifest
  • Run-level beforeAll / afterAll hooks (on AxisConfig) fire from cli.ts - not from the programmatic run() API. beforeAll runs before initReport; afterAll runs after finalizeReport so $AXIS_REPORT_DIR/report.json is on disk. Both abort the run on failure
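
Two of the gotchas above, the chars / 5 token estimate with a monotonic display and UTC-based timestamp IDs, can be sketched as small helpers. The function names, the rounding choice, and the ID format are assumptions for illustration; only chars / 5, the monotonicity requirement, and the use of getUTCHours() come from the notes above.

```typescript
// Conservative live estimate: one token per five characters.
// Rounding down is an assumption of this sketch.
function estimateTokens(charCount: number): number {
  return Math.floor(charCount / 5);
}

// Monotonic update: the displayed count may grow but never shrink.
function nextDisplayedTokens(current: number, candidate: number): number {
  return Math.max(current, candidate);
}

// Timestamp ID built from UTC getters so it does not depend on the
// machine's time zone. The YYYYMMDD-HHMMSS layout is hypothetical.
function timestampId(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const day = `${date.getUTCFullYear()}${pad(date.getUTCMonth() + 1)}${pad(date.getUTCDate())}`;
  const time = `${pad(date.getUTCHours())}${pad(date.getUTCMinutes())}${pad(date.getUTCSeconds())}`;
  return `${day}-${time}`;
}
```

Swapping in getHours() would make two machines in different time zones produce different IDs for the same instant, which is the failure the gotcha warns about.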