🤖 feat: add in-place workspace support for CLI/benchmark sessions #472

ammar-agent · 2025-10-29T00:03:37Z

Enables cmux to work directly in provided directories without requiring git worktrees. This is essential for terminal-bench integration and agentSessionCli usage.

Problem

The terminal-bench harness (and agentSessionCli) needs to work in arbitrary directories, such as /app in benchmark containers. Previously, cmux assumed every workspace was a git worktree under ~/.cmux/src/<project>/<branch>, so sessions started anywhere else failed systematically:

RuntimeError: Working directory does not exist: /root/.cmux/src/app

Solution

Detect "in-place" workspaces (directories not under srcBaseDir) and store them directly without worktree reconstruction. Uses a simple sentinel: projectPath === name indicates in-place mode.
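A minimal sketch of the sentinel, for illustration only and not the code from this PR. WorkspaceRef, inPlaceRefFor, isInPlace, and resolveWorkspacePath are hypothetical names, the reconstruct callback stands in for runtime.getWorkspacePath() (whose real signature is not shown here), and POSIX-style paths are assumed.

```typescript
import * as path from "node:path";

// Hypothetical shape; the real cmux metadata may carry more fields.
interface WorkspaceRef {
  projectPath: string;
  name: string;
}

// True when `candidate` is `baseDir` itself or nested beneath it.
function isUnder(baseDir: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(baseDir), path.resolve(candidate));
  const escapes =
    rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return !escapes;
}

// Returns an in-place reference for paths outside srcBaseDir, or undefined
// when the path belongs to the regular worktree layout (handled elsewhere).
function inPlaceRefFor(
  workspacePath: string,
  srcBaseDir: string
): WorkspaceRef | undefined {
  const abs = path.resolve(workspacePath);
  if (isUnder(srcBaseDir, abs)) return undefined;
  // Sentinel: both fields hold the same absolute path.
  return { projectPath: abs, name: abs };
}

function isInPlace(ref: WorkspaceRef): boolean {
  return ref.projectPath === ref.name;
}

// In-place refs are used as-is; everything else goes through the
// caller-supplied reconstruction (assumed stand-in for the runtime lookup).
function resolveWorkspacePath(
  ref: WorkspaceRef,
  reconstruct: (projectPath: string, name: string) => string
): string {
  return isInPlace(ref) ? ref.projectPath : reconstruct(ref.projectPath, ref.name);
}
```

With these assumptions, inPlaceRefFor("/app", "/root/.cmux/src") yields a reference whose two fields are both "/app", and resolveWorkspacePath returns that path without ever calling the reconstruction callback.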

agentSession.ts: When workspacePath is outside ~/.cmux/src/, store it directly by setting both projectPath and name to the absolute path.

aiService.ts: Check for in-place mode (projectPath === name) and use the path directly instead of calling runtime.getWorkspacePath().

streamManager.ts: Made cleanup safer by running rm -rf from the parent directory instead of /, which limits the blast radius if the path is malformed.
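The cleanup change can be pictured roughly as below. This is a sketch under assumptions, not the streamManager.ts implementation: buildCleanupCommand and CleanupCommand are hypothetical names, and the function only builds the command description (working directory plus argv) rather than executing anything.

```typescript
import * as path from "node:path";

// Hypothetical description of the command to run; nothing is executed here.
interface CleanupCommand {
  cwd: string;    // directory the command runs from
  argv: string[]; // command and arguments
}

function buildCleanupCommand(targetPath: string): CleanupCommand {
  if (!path.isAbsolute(targetPath)) {
    throw new Error(`Refusing to clean up a relative path: ${targetPath}`);
  }
  const normalized = path.resolve(targetPath);
  const parent = path.dirname(normalized);
  const base = path.basename(normalized);
  // The filesystem root has no basename and is its own parent.
  if (base === "" || parent === normalized) {
    throw new Error("Refusing to clean up the filesystem root");
  }
  // Run from the parent and name only the final path component, so the
  // command can touch at most one entry inside that parent directory.
  return { cwd: parent, argv: ["rm", "-rf", "--", base] };
}
```

Because the argument is a bare directory name resolved against the parent, a malformed path can at worst remove a sibling entry rather than something addressed from the root.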

Testing

Ran terminal-bench harness with multiple tasks:

✅ Agent executes successfully in /app directory
✅ No "Working directory does not exist" errors
✅ Passed 2/3 tests in sanitize-git-repo task
✅ Cleanup works correctly with safer approach

Generated with cmux

Enables cmux to work directly in provided directories without requiring git worktrees, essential for terminal-bench integration and CLI usage.

Changes:
- agentSession: Detect in-place workspaces (not under srcBaseDir) and store path directly by setting projectPath === name as sentinel
- aiService: Check for in-place mode and use stored path instead of reconstructing via runtime.getWorkspacePath()
- streamManager: Fix cleanup safety by running rm -rf from parent directory instead of root (limits blast radius if path is malformed)

Before: Terminal-bench failed with 'Working directory does not exist'
After: Agents run successfully in task containers (e.g., /app)

Tested with terminal-bench harness running multiple tasks successfully.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

src/services/agentSession.ts

In-place workspaces (identified by projectPath === workspaceName) are direct workspace directories used by CLI/benchmark sessions, not git worktrees. Attempting to run 'git worktree remove' on them fails or attempts to remove the main checkout. This fix detects the in-place sentinel pattern and skips git worktree operations, allowing session cleanup without destructive filesystem operations. Resolves Codex review comment in PR #472.
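A rough sketch of the guard described above, assuming the same projectPath === workspaceName sentinel. planRemoval, RemovalPlan, and the worktreePath parameter are hypothetical names for illustration; only `git worktree remove <path>` is a real git command, and nothing is executed here.

```typescript
// Hypothetical workspace shape matching the sentinel in the commit message.
interface Workspace {
  projectPath: string;
  workspaceName: string;
}

type RemovalPlan =
  | { kind: "skip-in-place"; path: string }
  | { kind: "git-worktree-remove"; cwd: string; argv: string[] };

// Decide what cleanup should do without touching the filesystem.
// `worktreePath` is the already-resolved worktree directory (assumed input).
function planRemoval(ws: Workspace, worktreePath: string): RemovalPlan {
  if (ws.projectPath === ws.workspaceName) {
    // In-place workspace: a plain directory owned by the caller, not a
    // worktree, so no git worktree operation is planned for it.
    return { kind: "skip-in-place", path: ws.projectPath };
  }
  return {
    kind: "git-worktree-remove",
    cwd: ws.projectPath,
    argv: ["git", "worktree", "remove", worktreePath],
  };
}
```

Returning a plan instead of running the command keeps the decision easy to test; the caller decides whether and how to execute the git step.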

- Run full benchmark suite (~80 tasks) every night at midnight UTC
- Concurrency=4 is appropriate for full suite (60-90 min estimated)
- Timeout=180 min (3 hours) provides safety margin
- Use default fallbacks for scheduled runs (no inputs)
- Add unique artifact names with run_id to avoid conflicts
- Set 30-day retention for nightly benchmark artifacts

Use matrix strategy to run both models every night:
- anthropic:claude-sonnet-4-5 (high thinking)
- openai:gpt-5-codex (high thinking)

Matrix only applies to scheduled runs (cron), not manual workflow_dispatch. Artifacts are named uniquely per model to avoid conflicts. This enables direct comparison of model performance on the full 80-task suite.

Cleaner architecture:
- terminal-bench.yml: Reusable workflow (workflow_call + workflow_dispatch)
- nightly-terminal-bench.yml: Scheduled runner with matrix strategy

Benefits:
- Main workflow stays simple for manual use
- Nightly schedule logic isolated in dedicated file
- Easy to add more models to nightly runs
- Manual workflow_dispatch supports model/thinking overrides

Nightly runs both models at midnight UTC:
- anthropic:claude-sonnet-4-5 (high thinking)
- openai:gpt-5-codex (high thinking)

chatgpt-codex-connector bot reviewed Oct 29, 2025

src/services/agentSession.ts (resolved)

ammar-agent added 10 commits October 29, 2025 00:41

🤖 fix: upload actual benchmark results from runs/ directory

0d1b14e

The workflow was trying to upload terminal-bench-results/ which doesn't exist. Terminal-bench writes results to runs/ by default.

🤖 chore: add terminal-bench-results/ to .gitignore

5349001

Downloaded artifacts from terminal-bench CI runs should not be committed.

🤖 fix: automatically set upstream in wait_pr_checks.sh

f490ab7

When the current branch has no upstream, automatically run git push -u to set it instead of failing. This makes the script more user-friendly for new branches.

🤖 chore: format code

fa8a069

🤖 chore: format wait_pr_checks.sh

07362af

🤖 docs: clarify terminal-bench-core==0.1.1 is the full suite

5f09559

terminal-bench-core==0.1.1 contains ~80 tasks, which is the complete stable benchmark suite. The -head version is bleeding-edge dev.

ammario enabled auto-merge October 29, 2025 01:41

ammario added this pull request to the merge queue Oct 29, 2025

Merged via the queue into main with commit 8b6d39a Oct 29, 2025
13 checks passed

ammario deleted the tb-baseline branch October 29, 2025 01:54
