
Conversation

@ammar-agent
Collaborator

Enables cmux to work directly in provided directories without requiring git worktrees. This is essential for terminal-bench integration and agentSessionCli usage.

Problem

The terminal-bench harness (and agentSessionCli) needs to work in arbitrary directories such as /app in benchmark containers. Previously, cmux assumed every workspace was a git worktree under ~/.cmux/src/<project>/<branch>, which caused systematic failures:

RuntimeError: Working directory does not exist: /root/.cmux/src/app

Solution

Detect "in-place" workspaces (directories not under srcBaseDir) and store them directly without worktree reconstruction. Uses a simple sentinel: projectPath === name indicates in-place mode.

agentSession.ts: When workspacePath is outside ~/.cmux/src/, store it directly by setting both projectPath and name to the absolute path.

aiService.ts: Check for in-place mode (projectPath === name) and use the path directly instead of calling runtime.getWorkspacePath().
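
The sentinel check described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `WorkspaceMetadata` shape, the helper names (`isUnder`, `inPlaceMetadata`, `isInPlace`, `resolveWorkspacePath`), and the `getWorkspacePath` callback standing in for `runtime.getWorkspacePath()` are all assumptions made for the example.

```typescript
import * as path from "node:path";

// Hypothetical metadata shape; the real cmux type likely carries more fields.
interface WorkspaceMetadata {
  projectPath: string;
  name: string;
}

/** True when `candidate` is `baseDir` itself or nested somewhere below it. */
function isUnder(baseDir: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(baseDir), path.resolve(candidate));
  const escapes = rel === ".." || rel.startsWith(".." + path.sep);
  return !escapes && !path.isAbsolute(rel);
}

/**
 * Returns in-place metadata (projectPath === name === absolute path) when the
 * workspace lives outside srcBaseDir, or null when the existing worktree
 * handling should apply instead.
 */
function inPlaceMetadata(
  workspacePath: string,
  srcBaseDir: string
): WorkspaceMetadata | null {
  const abs = path.resolve(workspacePath);
  if (isUnder(srcBaseDir, abs)) {
    return null;
  }
  return { projectPath: abs, name: abs };
}

/** The sentinel: both fields hold the same absolute path. */
function isInPlace(meta: WorkspaceMetadata): boolean {
  return meta.projectPath === meta.name;
}

/**
 * Uses the stored path directly for in-place workspaces and defers to the
 * supplied reconstruction callback for worktree-backed ones.
 */
function resolveWorkspacePath(
  meta: WorkspaceMetadata,
  getWorkspacePath: (projectPath: string, name: string) => string
): string {
  return isInPlace(meta)
    ? meta.projectPath
    : getWorkspacePath(meta.projectPath, meta.name);
}
```

With these helpers, a session started in /app with srcBaseDir /root/.cmux/src would be stored as in-place and resolved back to /app without touching the worktree layout.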

streamManager.ts: Fixed cleanup safety by running rm -rf from the parent directory instead of / to limit blast radius if path is malformed.
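
One way to picture the safer cleanup is to split the target into a parent directory and a leaf name, then run the removal relative to the parent. The sketch below only builds the command and does not execute it. The `CleanupCommand` type and `buildCleanupCommand` function are hypothetical names for illustration, and the actual streamManager.ts change may be structured differently.

```typescript
import * as path from "node:path";

// Hypothetical description of a command to run: a working directory plus argv.
interface CleanupCommand {
  cwd: string;
  argv: string[];
}

/**
 * Builds an `rm -rf` invocation that runs from the target's parent directory
 * and names only the leaf entry, so a malformed path cannot reach beyond that
 * parent. Relative paths and the filesystem root are rejected outright.
 */
function buildCleanupCommand(targetDir: string): CleanupCommand {
  if (!path.isAbsolute(targetDir)) {
    throw new Error(`refusing to clean up relative path: ${targetDir}`);
  }
  const normalized = path.resolve(targetDir);
  const parent = path.dirname(normalized);
  const leaf = path.basename(normalized);
  if (leaf === "" || normalized === parent) {
    throw new Error("refusing to clean up the filesystem root");
  }
  // `--` ends option parsing, so a leaf that starts with `-` is treated as a name.
  return { cwd: parent, argv: ["rm", "-rf", "--", leaf] };
}
```

For a target of /tmp/streams/abc this yields a command that runs in /tmp/streams and removes only abc.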

Testing

Ran terminal-bench harness with multiple tasks:

  • ✅ Agent executes successfully in /app directory
  • ✅ No "Working directory does not exist" errors
  • ✅ Passed 2/3 tests in sanitize-git-repo task
  • ✅ Cleanup works correctly with safer approach

Generated with cmux

Enables cmux to work directly in provided directories without requiring
git worktrees, essential for terminal-bench integration and CLI usage.

Changes:
- agentSession: Detect in-place workspaces (not under srcBaseDir) and store
  path directly by setting projectPath === name as sentinel
- aiService: Check for in-place mode and use stored path instead of
  reconstructing via runtime.getWorkspacePath()
- streamManager: Fix cleanup safety by running rm -rf from parent directory
  instead of root (limits blast radius if path is malformed)

Before: Terminal-bench failed with 'Working directory does not exist'
After: Agents run successfully in task containers (e.g., /app)

Tested with terminal-bench harness running multiple tasks successfully.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

The workflow was trying to upload terminal-bench-results/ which doesn't exist.
Terminal-bench writes results to runs/ by default.

Downloaded artifacts from terminal-bench CI runs should not be committed.

In-place workspaces (identified by projectPath === workspaceName) are direct
workspace directories used by CLI/benchmark sessions, not git worktrees. Attempting
to run 'git worktree remove' on them fails or attempts to remove the main checkout.

This fix detects the in-place sentinel pattern and skips git worktree operations,
allowing session cleanup without destructive filesystem operations.
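
The skip described in this commit message might look roughly like the sketch below. It only plans the removal step rather than running it. The `WorkspaceMetadata` and `CleanupStep` types and the `planWorkspaceRemoval` function are hypothetical names, and running the git command from `projectPath` is an assumption about how the worktree branch works.

```typescript
// Hypothetical metadata shape; the real cmux type likely carries more fields.
interface WorkspaceMetadata {
  projectPath: string;
  name: string;
}

// Either nothing is done to the filesystem, or a git command is run.
type CleanupStep =
  | { kind: "skip"; reason: string }
  | { kind: "git"; cwd: string; argv: string[] };

/**
 * Plans workspace removal. In-place workspaces (projectPath === name) are not
 * git worktrees, so no `git worktree remove` is issued for them.
 */
function planWorkspaceRemoval(
  meta: WorkspaceMetadata,
  worktreePath: string
): CleanupStep {
  if (meta.projectPath === meta.name) {
    return {
      kind: "skip",
      reason: "in-place workspace: directory is left untouched",
    };
  }
  return {
    kind: "git",
    cwd: meta.projectPath, // assumption: command runs from the main checkout
    argv: ["git", "worktree", "remove", worktreePath],
  };
}
```

Returning a plan instead of executing it keeps the sentinel decision easy to test without touching a real repository.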

Resolves Codex review comment in PR #472.

When the current branch has no upstream, automatically run git push -u
to set it instead of failing. This makes the script more user-friendly
for new branches.

- Run full benchmark suite (~80 tasks) every night at midnight UTC
- Concurrency=4 is appropriate for full suite (60-90 min estimated)
- Timeout=180 min (3 hours) provides safety margin
- Use default fallbacks for scheduled runs (no inputs)
- Add unique artifact names with run_id to avoid conflicts
- Set 30-day retention for nightly benchmark artifacts

terminal-bench-core==0.1.1 contains ~80 tasks, which is the complete
stable benchmark suite. The -head version is bleeding-edge dev.

Use matrix strategy to run both models every night:
- anthropic:claude-sonnet-4-5 (high thinking)
- openai:gpt-5-codex (high thinking)

Matrix only applies to scheduled runs (cron), not manual workflow_dispatch.
Artifacts are named uniquely per model to avoid conflicts.

This enables direct comparison of model performance on the full 80-task suite.

Cleaner architecture:
- terminal-bench.yml: Reusable workflow (workflow_call + workflow_dispatch)
- nightly-terminal-bench.yml: Scheduled runner with matrix strategy

Benefits:
- Main workflow stays simple for manual use
- Nightly schedule logic isolated in dedicated file
- Easy to add more models to nightly runs
- Manual workflow_dispatch supports model/thinking overrides

Nightly runs both models at midnight UTC:
- anthropic:claude-sonnet-4-5 (high thinking)
- openai:gpt-5-codex (high thinking)

@ammario ammario enabled auto-merge October 29, 2025 01:41
@ammario ammario added this pull request to the merge queue Oct 29, 2025
Merged via the queue into main with commit 8b6d39a Oct 29, 2025
13 checks passed
@ammario ammario deleted the tb-baseline branch October 29, 2025 01:54