AXIS (Agent Experience Index Score) is a synthetic testing framework for AI agents. It runs agents against scenarios, captures transcripts, and produces graded scores across four dimensions: goal achievement, environment quality, service quality, and agent quality.
- ESM TypeScript, built with `tsc`, tested with `vitest`, CLI via `commander`
- Live terminal display uses `ink` (React for CLIs), rendered to stderr
- Runner is fully decoupled from display via a `Logger` interface
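The decoupling described in the last bullet can be pictured with a minimal sketch. Everything below is an assumption for illustration: `JobUpdate`, `runJobs`, `createMemoryLogger`, and the `info` method are hypothetical, and the project's real `Logger` interface may expose different methods.

```typescript
// Hypothetical shapes: the real Logger interface in the project may differ.
interface JobUpdate {
  jobId: string;
  status: "running" | "done" | "failed";
}

interface Logger {
  info(message: string): void;
  onJobUpdate(update: JobUpdate): void;
}

// A runner that only knows about the Logger interface, never about ink or stderr.
function runJobs(jobIds: string[], logger: Logger): void {
  for (const jobId of jobIds) {
    logger.onJobUpdate({ jobId, status: "running" });
    // ... the agent would be spawned and awaited here ...
    logger.onJobUpdate({ jobId, status: "done" });
  }
  logger.info(`finished ${jobIds.length} job(s)`);
}

// A display-free logger that records events in memory, e.g. for unit tests.
function createMemoryLogger(): Logger & { events: string[] } {
  const events: string[] = [];
  return {
    events,
    info: (message) => {
      events.push(`info:${message}`);
    },
    onJobUpdate: (update) => {
      events.push(`${update.jobId}:${update.status}`);
    },
  };
}

const memoryLogger = createMemoryLogger();
runJobs(["a", "b"], memoryLogger);
// memoryLogger.events → ["a:running", "a:done", "b:running", "b:done", "info:finished 2 job(s)"]
```

Because the runner depends only on the interface, an ink-backed logger and a silent test logger are interchangeable.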
- AXIS Result (not "AXIS Score"): the composite 0–100 number. "AXIS Score" reads as "score score" since AXIS already stands for "Agent Experience Index Score".
- Use "AXIS Result" in all user-facing text, display output, and documentation.
- The internal property names (`axisScore`, `averageAxisScore`) are fine as code identifiers.
| Layer | Key Files | Purpose |
|---|---|---|
| CLI | `src/cli.ts` | Entry point, ink display, signal handling |
| Runner | `src/runner/runner.ts`, `lifecycle.ts` | Job orchestration, concurrency, isolation; `runLifecyclePhase` captures `$AXIS_OUTPUT` markdown for setup/teardown/beforeAll/afterAll phases |
| Adapters | `src/adapters/*.ts` | Spawn agent CLIs, parse NDJSON streams |
| Scoring | `src/scoring/` | LLM judge + interaction-based evaluation pipeline |
| Reports | `src/reports/writer.ts`, `reader.ts` | Persistent `.axis/reports/` store |
| Config | `src/config/loader.ts`, `remote-scenarios.ts` | Parses `axis.config.*`; `mergeRemoteConfig` clones any git-URL entries in `scenarios` into `.axis/remotes/`, inlines their paths, AND folds each remote repo's `env`/`mcp_servers`/`skills`/`artifacts`/`adapters` into the parent (parent wins on collisions) |
| Display | `src/ui/format.ts`, `LiveStatus.tsx`, `AnimatedTokens.tsx` | Pure formatting + ink components (incl. live token counter) |
| Types | `src/types/` | Shared interfaces (agent, config, output, scoring, report) |
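The "parent wins on collisions" rule from the Config row can be illustrated with a small sketch. `MergeableConfig` and `foldRemoteInto` are hypothetical names, only two of the folded fields are modeled, and the cloning and path-inlining steps of the real `mergeRemoteConfig` are left out.

```typescript
// Hypothetical, simplified config shape: only two of the folded fields are modeled.
interface MergeableConfig {
  env?: Record<string, string>;
  mcp_servers?: Record<string, { command: string }>;
}

// Fold a remote repo's settings into the parent. Spreading the parent last
// means its values overwrite the remote's whenever both define the same key.
function foldRemoteInto(parent: MergeableConfig, remote: MergeableConfig): MergeableConfig {
  return {
    env: { ...remote.env, ...parent.env },
    mcp_servers: { ...remote.mcp_servers, ...parent.mcp_servers },
  };
}

const mergedConfig = foldRemoteInto(
  { env: { API_URL: "https://parent.example" } },
  { env: { API_URL: "https://remote.example", REGION: "eu" } },
);
// mergedConfig.env → { API_URL: "https://parent.example", REGION: "eu" }
```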
Built-in adapters split into two factories. NDJSON-style adapters (`claude-code`, `codex`) are created via `createAgentAdapter(spec)` from `src/adapters/base/agent-adapter.ts`. ACP-based adapters (`claude-sdk`, `codex-sdk`, `gemini`, `goose`, `opencode`, `qwen-code`, `stakpak`, `blackbox`, `fast-agent`, `mistral-vibe`, `factory-droid`, `poolside`, `vtcode`, `cursor-agent`, `auggie`, `kimi`, `openhands`, `cline`, `kiro-cli`, `kilo`, `qoder`) are created via `createAcpBasedAdapter(spec)` from `src/adapters/base/acp-adapter.ts`. Each adapter is a plain factory function (e.g. `createGeminiAdapter()`) that returns an `AgentAdapter`: no classes, no inheritance. The factory owns the shared plumbing:
- Spawn + `stdin.end` + cleanup registration (SIGTERM on Ctrl-C)
- 10-minute timeout → SIGTERM → SIGKILL after 5s grace (timer cleared on clean exit)
- stderr capped at 100 KB
- `close` event listener registered BEFORE stdout stream to avoid missing it
- Raw output capture (NDJSON lines for `lines` mode, raw chunks for `aggregate`)
- Token estimator wiring via `StreamContext.feedAssistantText`
- CLI resolution (direct command → `npx --yes <pkg>` fallback)
- Error precedence: `extracted.metadata.error` → `stderr` → `"Agent process exited with non-zero code"`
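Two of the plumbing rules above, the stderr cap and the error precedence, are simple enough to sketch as pure functions. The names `appendCapped` and `pickErrorMessage` are hypothetical, and the cap below counts characters rather than bytes, which is a simplifying assumption.

```typescript
// 100 KB cap from the list above; measured in characters here for simplicity (assumption).
const STDERR_LIMIT_CHARS = 100 * 1024;

// Append a chunk to the buffered stderr text without ever exceeding the cap.
function appendCapped(buffer: string, chunk: string, limit: number = STDERR_LIMIT_CHARS): string {
  const remaining = limit - buffer.length;
  return remaining <= 0 ? buffer : buffer + chunk.slice(0, remaining);
}

// Error precedence: adapter-extracted error first, then stderr, then a generic message.
function pickErrorMessage(extractedError: string | undefined, stderr: string): string {
  if (extractedError) return extractedError;
  const trimmed = stderr.trim();
  if (trimmed) return trimmed;
  return "Agent process exited with non-zero code";
}

const cappedStderr = appendCapped("abc", "defgh", 5);
// cappedStderr → "abcde"
const chosenError = pickErrorMessage(undefined, "  connection refused\n");
// chosenError → "connection refused"
```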
The NDJSON-style adapters (`claude-code`, `codex`) use `lines` mode for NDJSON parsing. Custom adapters can use either `lines` or `aggregate` mode (raw stdout capture). ACP-based adapters bypass `streamConfig` entirely; the ACP SDK handles framing.
Call `createAgentAdapter(spec)` with an `AgentAdapterSpec<State>`. The spec is a single typed object: no class inheritance, no protected hooks:
| Spec field | Purpose |
|---|---|
| `name` | Adapter name (registered in `src/adapters/registry.ts`) |
| `cliCommand?` | CLI binary for `resolveCommand`; omit if user-supplied |
| `timeoutMs?` | Execution timeout (default 10 min) |
| `requiredEnv?` | Env vars validated by the runner pre-flight (e.g. `ANTHROPIC_API_KEY`) |
| `hasLocalSession?` | Detect a usable local CLI login (e.g. `claude login`, `codex login`). Runner calls this only when `requiredEnv` is missing; explicit API keys always win |
| `isolationEnv?` | Isolation vars (e.g. `CLAUDE_CONFIG_DIR`, `CODEX_HOME`). Signature: `({ workspace, home }) => Record<string, string>`. Point `*_HOME`-style paths under `home`, never `workspace` |
| `prepare?` | Side effects (mkdir, MCP / skills writers) before spawn |
| `resolveCommand?` | Override how the CLI command is resolved |
| `buildArgs` | Build CLI arguments (prefix args from command resolution prepended automatically) |
| `initialState` | Per-run mutable state used by `streamConfig` handlers and `getResult` |
| `streamConfig` | How to process agent stdout. Discriminated union: `{ mode: "lines", onLine, onEnd? }` or `{ mode: "aggregate", onChunk, onEnd? }` |
| `getResult` | Build final `{ result, metadata? }` from accumulated state after exit |
The `streamConfig` field uses a discriminated union so the mode and its handler can never get out of sync, so no runtime assertions are needed. `getResult` returns `null` for "no result" (never `""`). Metadata overrides (e.g. upstream `durationMs`) are spread on top of base-computed fields.
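A minimal sketch of how such a discriminated union behaves is shown here. The handler signatures, `processStdout`, `DemoState`, and `getDemoResult` are assumptions made for illustration; the real `AgentAdapterSpec` types may pass different arguments (for example a stream context).

```typescript
// Simplified stand-ins for the spec types; the real handler signatures may differ.
type StreamConfig<State> =
  | { mode: "lines"; onLine: (line: string, state: State) => void; onEnd?: (state: State) => void }
  | { mode: "aggregate"; onChunk: (chunk: string, state: State) => void; onEnd?: (state: State) => void };

// Feed stdout chunks through a config. Narrowing on `mode` lets the compiler
// guarantee that `onLine` or `onChunk` exists in the branch that calls it.
function processStdout<State>(chunks: string[], config: StreamConfig<State>, state: State): State {
  if (config.mode === "lines") {
    let pending = "";
    for (const chunk of chunks) {
      pending += chunk;
      const lines = pending.split("\n");
      pending = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
      for (const line of lines) {
        if (line.trim()) config.onLine(line, state);
      }
    }
    if (pending.trim()) config.onLine(pending, state);
  } else {
    for (const chunk of chunks) config.onChunk(chunk, state);
  }
  config.onEnd?.(state);
  return state;
}

interface DemoState {
  text: string[];
}

// An NDJSON-style handler: each complete line is parsed as one JSON event.
const demoLinesConfig: StreamConfig<DemoState> = {
  mode: "lines",
  onLine: (line, state) => {
    const event = JSON.parse(line) as { text?: string };
    if (event.text) state.text.push(event.text);
  },
};

// Returns null, never an empty string, when the agent produced nothing.
function getDemoResult(state: DemoState): { result: string } | null {
  const result = state.text.join("\n");
  return result === "" ? null : { result };
}

const demoState = processStdout(
  ['{"text":"he', 'llo"}\n{"text":"world"}\n'],
  demoLinesConfig,
  { text: [] },
);
// demoState.text → ["hello", "world"]
```

Writing `{ mode: "lines", onChunk: ... }` would be rejected at compile time, which is the property the paragraph above relies on.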
For built-in adapters, register the factory in `src/adapters/registry.ts`. External custom adapters are loaded via the `adapters` field in `axis.config.json`: the runner dynamically imports the module and calls `registerAdapter()` before running any jobs.
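As a rough picture of a factory-based registry, consider the sketch below. The `registerAdapter` signature, the `AgentAdapter` members, `getAdapter`, and `createEchoAdapter` are all assumptions for illustration and are not taken from `src/adapters/registry.ts`.

```typescript
// Hypothetical minimal adapter shape; the real AgentAdapter has more members.
interface AgentAdapter {
  name: string;
  run(prompt: string): string;
}

type AdapterFactory = () => AgentAdapter;

// Registry keyed by adapter name; stores factories so each job gets a fresh adapter.
const adapterRegistry = new Map<string, AdapterFactory>();

// Assumed signature: the real registerAdapter() may take different arguments.
function registerAdapter(adapterName: string, factory: AdapterFactory): void {
  adapterRegistry.set(adapterName, factory);
}

function getAdapter(adapterName: string): AgentAdapter {
  const factory = adapterRegistry.get(adapterName);
  if (!factory) throw new Error(`Unknown adapter: ${adapterName}`);
  return factory();
}

// A plain factory function returning an object literal: no classes, no inheritance.
function createEchoAdapter(): AgentAdapter {
  return {
    name: "echo",
    run: (prompt) => `echo: ${prompt}`,
  };
}

registerAdapter("echo", createEchoAdapter);
const echoAdapter = getAdapter("echo");
// echoAdapter.run("hi") → "echo: hi"
```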
`claude-code` and `codex` declare `requiredEnv` (`ANTHROPIC_API_KEY`, `CODEX_API_KEY`) AND implement `hasLocalSession`. Pre-flight in `runner.ts` checks env vars first; if missing, it calls `hasLocalSession` and only throws if both signals are absent. API keys are the preferred path: they're explicit, work in CI, and don't bill against an individual subscription. The local-session fallback exists for laptop ergonomics: when a developer already has `claude login` / `codex login` set up, they shouldn't have to mint an API key just to try a scenario. The `prepare` hooks in those adapters materialize `~/.claude.json` / `~/.codex/auth.json` (and on macOS, the Keychain entry for Claude Code) into the isolated `CLAUDE_CONFIG_DIR` / `CODEX_HOME` so the CLI authenticates as if it were running normally. New adapters that wrap a CLI with both API-key and login flows should follow the same pattern.
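The pre-flight order described above (env vars first, local session second, throw only if both are absent) can be sketched as follows. `AuthSpec`, `preflightAuth`, and the synchronous, argument-free `hasLocalSession` signature are assumptions; the real check in `runner.ts` may be asynchronous and receive context.

```typescript
// Hypothetical slice of an adapter spec: only the auth-related fields.
interface AuthSpec {
  name: string;
  requiredEnv?: string[];
  hasLocalSession?: () => boolean; // assumed synchronous for this sketch
}

type AuthSource = "env" | "local-session";

function preflightAuth(spec: AuthSpec, env: Record<string, string | undefined>): AuthSource {
  const missing = (spec.requiredEnv ?? []).filter((key) => !env[key]);
  // Explicit API keys always win, so the local session is not even probed.
  if (missing.length === 0) return "env";
  if (spec.hasLocalSession && spec.hasLocalSession()) return "local-session";
  throw new Error(`${spec.name}: missing ${missing.join(", ")} and no local CLI session found`);
}

const claudeLikeSpec: AuthSpec = {
  name: "claude-code",
  requiredEnv: ["ANTHROPIC_API_KEY"],
  hasLocalSession: () => true, // pretend a local login exists
};

const withKey = preflightAuth(claudeLikeSpec, { ANTHROPIC_API_KEY: "sk-test" });
// withKey → "env"
const withoutKey = preflightAuth(claudeLikeSpec, {});
// withoutKey → "local-session"
```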
- `AgentMetadata.error` is the canonical error field for failed runs
- Runner checks both `exitCode !== 0` and `metadata.error` for failure status
- Friendly error classification in `src/ui/format.ts` via `friendlyError()`: maps common patterns (quota, rate limit, auth, timeout, network) to one-line messages
- Error display: `↳ friendly message` below failed rows in tables, `Error:` line in detail views
- Scoring callbacks in `cli.ts` preserve `"failed"` status; never overwrite to `"done"`
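The failure check and the pattern-based classification above might look roughly like this. The regular expressions and messages are illustrative guesses, not the patterns used by the real `friendlyError()`, and `friendlyErrorSketch` and `isFailedRun` are hypothetical names.

```typescript
// Illustrative patterns only; the real friendlyError() may match different strings.
const FRIENDLY_PATTERNS: Array<[RegExp, string]> = [
  [/quota/i, "Quota exceeded"],
  [/rate.?limit|\b429\b/i, "Rate limited by the provider"],
  [/unauthorized|invalid api key|\b401\b/i, "Authentication failed"],
  [/timed? ?out|ETIMEDOUT/i, "Agent timed out"],
  [/ECONNREFUSED|ENOTFOUND|network/i, "Network error"],
];

// Map a raw error to a one-line message; the first matching pattern wins.
function friendlyErrorSketch(raw: string): string {
  for (const [pattern, message] of FRIENDLY_PATTERNS) {
    if (pattern.test(raw)) return message;
  }
  // Fall back to the first line of the raw error so the output stays on one line.
  return raw.split("\n")[0] ?? raw;
}

// A run counts as failed if either signal is present.
function isFailedRun(exitCode: number, metadata: { error?: string }): boolean {
  return exitCode !== 0 || Boolean(metadata.error);
}

const friendly = friendlyErrorSketch("HTTP 429: rate limit reached for requests");
// friendly → "Rate limited by the provider"
```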
`--debug` enables raw output capture:
- `AgentInput.captureRawOutput` signals adapters to collect raw stdout lines
- `AgentOutput.rawOutput` carries the lines back to the runner
- Report writer strips `rawOutput` from scenario JSON and writes it as `{agent}.raw.ndjson`
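The last step, separating `rawOutput` from the scenario JSON, can be sketched as a pure function. `AgentOutputSketch` and `splitRawOutput` are hypothetical, and the actual file writing done by the report writer is omitted.

```typescript
// Hypothetical, trimmed-down output shape.
interface AgentOutputSketch {
  agent: string;
  result: string | null;
  rawOutput?: string[];
}

// Split one agent output into the object stored in the scenario JSON and an
// optional sidecar file holding the raw lines.
function splitRawOutput(output: AgentOutputSketch): {
  scenarioJson: Omit<AgentOutputSketch, "rawOutput">;
  rawFile?: { fileName: string; contents: string };
} {
  const { rawOutput, ...scenarioJson } = output;
  if (!rawOutput || rawOutput.length === 0) return { scenarioJson };
  return {
    scenarioJson,
    rawFile: {
      fileName: `${output.agent}.raw.ndjson`,
      contents: rawOutput.join("\n") + "\n",
    },
  };
}

const splitOutput = splitRawOutput({
  agent: "codex",
  result: "done",
  rawOutput: ['{"type":"start"}', '{"type":"end"}'],
});
// splitOutput.rawFile?.fileName → "codex.raw.ndjson"
```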
User-facing documentation lives in `src/docs-site/` (Astro), published at https://axis.run. All changes to the CLI, scoring system, or configuration schema must be reflected there; the docs site is canonical and must stay in sync with the implementation.
`README.md` is intentionally lean: tagline, quick start, link tree to the docs site, and the programmatic API surface. Don't expand it back into a full reference; update the docs site instead.
| Change Type | Where to update |
|---|---|
| New/modified CLI flags or commands | src/docs-site/src/pages/cli.astro |
| New/modified config fields | src/docs-site/src/pages/configuration.astro |
| New/modified scenario schema fields | src/docs-site/src/pages/configuration.astro |
| Scoring algorithm changes | src/docs-site/src/pages/scoring.astro |
| Adapter contract / built-in changes | src/docs-site/src/pages/running.astro |
| Report / baseline format changes | src/docs-site/src/pages/running.astro + cli.astro |
| New/modified public exports | README.md Programmatic API section (kept here, not in docs) |

    rm -rf dist && npm run build   # Always clean build: stale dist/ causes subtle issues
    npm test                       # vitest, all unit tests

Per-job temp layout (see `createWorkspace` in `runner.ts`):

    /tmp/axis-<rand>/
    ├── work/    ← agent cwd (only scenario-provided files; what the agent scans)
    └── home/    ← agent HOME: .codex/, .claude/, .gemini/, .qwen/, user-scoped skills, MCP config

- Adapter `isolationEnv` MUST place `*_HOME`-style paths under `home`, never `workspace`, so the agent doesn't see its own config when scanning files
- `HOME` is set to `home`. `AXIS_WORKSPACE` is `workspace`. Lifecycle scripts can read both
- Claude Code's MCP config is written to `home/.claude/mcp.json` and wired with `--mcp-config <path>` (no `.mcp.json` in cwd)
- Claude/Claude-SDK skills go to `CLAUDE_CONFIG_DIR/skills/` (under `home`); Gemini skills go to `GEMINI_CLI_HOME/skills/` (under `home`)
- Codex skills are the one exception: Codex only discovers skills under `.agents/skills/` in cwd, so they live in `workspace`. Scenarios opting into Codex skills accept this limited visibility
- Artifact capture walks `workspace` only; agent config never leaks into report artifacts
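The first rule above lends itself to a small sketch with a guard that could live in a unit test. The agent, its `EXAMPLE_AGENT_HOME` variable, and `assertOutsideWorkspace` are hypothetical; only the `({ workspace, home }) => Record<string, string>` signature comes from the spec table.

```typescript
interface IsolationPaths {
  workspace: string;
  home: string;
}

type IsolationEnv = (paths: IsolationPaths) => Record<string, string>;

// isolationEnv for a hypothetical agent that reads its config from EXAMPLE_AGENT_HOME.
const exampleIsolationEnv: IsolationEnv = ({ home }) => ({
  EXAMPLE_AGENT_HOME: `${home}/.example-agent`,
});

// Guard: no isolation variable may point at or inside the agent's workspace.
function assertOutsideWorkspace(env: Record<string, string>, workspace: string): void {
  for (const [key, value] of Object.entries(env)) {
    if (value === workspace || value.startsWith(`${workspace}/`)) {
      throw new Error(`${key} points into the agent workspace: ${value}`);
    }
  }
}

const jobPaths: IsolationPaths = {
  workspace: "/tmp/axis-abc123/work",
  home: "/tmp/axis-abc123/home",
};
const isolationVars = exampleIsolationEnv(jobPaths);
assertOutsideWorkspace(isolationVars, jobPaths.workspace); // passes: config lives under home
```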
- Always clean `dist/` before testing changes: stale JS in dist can mask TypeScript errors
- Use `getUTCHours()` for timestamp IDs; `getHours()` gives local time
- Ink renders async; need 100ms yield before unmount to flush final state
- Gemini CLI streams assistant messages as deltas (`delta: true`); adapter accumulates them
- Gemini `settings.json` must disable context discovery (`discoveryMaxDirs: 0`) or Gemini will scan the workspace tree before addressing the prompt, adding latency and unnecessary tool calls
- Runner emits initial `onJobUpdate` AFTER pre-flight to avoid ink cursor corruption
- The `close` event listener must be registered BEFORE readline to avoid missing it
- Live token counter uses `chars / 5` (intentionally conservative) so the UI never has to reverse; runner enforces monotonicity in `updateTokens`
- `setupOutput`/`teardownOutput` (captured from `$AXIS_OUTPUT`) live on `RunResult` but `scoreRunResult` returns a fresh `ScoredRunResult`; `cli.ts` re-propagates them onto the scored copy alongside `artifacts`, otherwise they vanish from the manifest
- Run-level `beforeAll`/`afterAll` hooks (on `AxisConfig`) fire from `cli.ts`, not the programmatic `run()` API. `beforeAll` runs before `initReport`; `afterAll` runs after `finalizeReport` so `$AXIS_REPORT_DIR/report.json` is on disk. Both abort the run on failure
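Two of the notes above, the conservative token estimate with monotonic updates and the UTC-based timestamp IDs, can be sketched as pure functions. The rounding choice, the `updateTokens` signature, and the ID format are assumptions for illustration; only `chars / 5` and the use of `getUTCHours()` come from the list.

```typescript
// Conservative live estimate: roughly one token per five characters (rounding is an assumption).
function estimateTokens(text: string): number {
  return Math.floor(text.length / 5);
}

// Monotonic update: the displayed count can only grow, so a lower estimate is ignored.
function updateTokens(current: number, next: number): number {
  return Math.max(current, next);
}

// Timestamp ID built from UTC components so it does not depend on the machine's time zone.
// The YYYYMMDD-HHMMSS layout is hypothetical.
function timestampId(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  return (
    `${date.getUTCFullYear()}${pad(date.getUTCMonth() + 1)}${pad(date.getUTCDate())}` +
    `-${pad(date.getUTCHours())}${pad(date.getUTCMinutes())}${pad(date.getUTCSeconds())}`
  );
}

const shownTokens = updateTokens(10, estimateTokens("hello world!"));
// estimateTokens("hello world!") → 2, so shownTokens stays at 10
const runId = timestampId(new Date(Date.UTC(2024, 0, 5, 3, 4, 9)));
// runId → "20240105-030409"
```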