I spent considerable time studying and rebuilding the Apex2 Terminal Bench Agent, which achieved state-of-the-art performance (64.50% success rate) on Stanford's Terminal Bench Leaderboard. This document captures what I learned through the process of understanding and reconstructing this sophisticated system.
The original Apex2 by heartyguy represents a fundamental rethinking of agentic coding systems. What struck me most was how they achieved SOTA results not through complexity, but through strategic simplification and intelligent orchestration. This contradicts the common belief that more agents = better performance.
What I Learned: The single most important insight I gained is the value of understanding a task before attempting to execute it. Apex2 dedicates an entire phase to prediction, which initially seemed like overhead, but I now understand it's the foundation of efficiency.
The prediction phase extracts:
- Task Category: Understanding if it's ML, Security, Web Dev, etc. changes the entire approach
- Risk Level: Identifying irreversible operations early prevents catastrophic mistakes
- Key Files: Knowing which files matter focuses exploration efforts
- Multimodal Needs: Determining if images/docs need analysis avoids wasted API calls
- Complexity Estimation: Sets realistic expectations for execution steps
Key Insight: I learned that 2 minutes spent on prediction saves 20 minutes of trial-and-error execution. This is counter-intuitive because it feels like "doing nothing," but it's actually the highest-leverage activity.
Implementation Challenge: I struggled initially with how to extract this information reliably. The solution was using a structured prompt with explicit format requirements. The LLM is excellent at classification and extraction when you give it clear categories and examples.
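To make the structured-prompt idea concrete, here is a minimal sketch of how such a prediction step could be parsed. The prompt wording, the `Prediction` dataclass, and `parse_prediction` are my own illustrative names and not Apex2's actual prompt or code; only the five extracted fields come from the list above.

```python
from dataclasses import dataclass, field

# Hypothetical prompt: the fields mirror the list above, but the wording and
# output format are assumptions, not the prompt Apex2 actually uses.
PREDICTION_PROMPT = """Analyze the task and answer in exactly this format:
CATEGORY: <ml | security | web | other>
RISK: <low | medium | high>
KEY_FILES: <comma-separated paths, or none>
MULTIMODAL: <yes | no>
COMPLEXITY: <estimated number of execution steps>

Task: {task}
"""


@dataclass
class Prediction:
    category: str = "other"
    risk: str = "medium"
    key_files: list[str] = field(default_factory=list)
    multimodal: bool = False
    complexity: int = 0


def parse_prediction(reply: str) -> Prediction:
    """Parse 'KEY: value' lines; unknown or malformed lines are ignored."""
    pred = Prediction()
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, so not a field line
        key, value = key.strip().upper(), value.strip()
        if key == "CATEGORY":
            pred.category = value.lower()
        elif key == "RISK":
            pred.risk = value.lower()
        elif key == "KEY_FILES" and value.lower() != "none":
            pred.key_files = [p.strip() for p in value.split(",") if p.strip()]
        elif key == "MULTIMODAL":
            pred.multimodal = value.lower() == "yes"
        elif key == "COMPLEXITY" and value.isdigit():
            pred.complexity = int(value)
    return pred
```

Falling back to defaults for missing fields means a partially malformed reply still yields a usable prediction instead of aborting the run.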
What I Learned: Most agentic systems use parallelization for redundancy - running the same task multiple times hoping one succeeds. Apex2 uses parallelization for diversity - gathering different types of intelligence simultaneously.
The parallel phases are:
- Web Search: What do others know about this problem?
- Strategy Generation: What does the LLM inherently know?
- Environment Observation: What's actually present in the system?
- Exploration Agent: What unknowns can we test safely?
Key Insight: Each intelligence source provides a different perspective. Web search finds existing solutions, strategy generation extracts model knowledge, environment observation grounds you in reality, and exploration tests assumptions. Together, they create a complete picture.
What Surprised Me: I initially thought web search would be the most valuable. But I found that strategy generation (prompting the LLM to dump all its knowledge) was equally powerful. Modern LLMs have absorbed so much documentation and code that they "know" solutions to most problems - you just need to extract it properly.
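The four parallel phases can be sketched with `asyncio.gather`. This is only an illustration under my own assumptions: the four coroutines are hypothetical stand-ins for the real phases, and I don't know whether Apex2 uses asyncio, threads, or something else.

```python
import asyncio


# The four coroutines are placeholders; names and return values are hypothetical.
async def web_search(task: str) -> str:
    await asyncio.sleep(0)  # stands in for a network call
    return f"web results for: {task}"


async def generate_strategy(task: str) -> str:
    await asyncio.sleep(0)  # stands in for an LLM call
    return f"strategy for: {task}"


async def observe_environment(task: str) -> str:
    await asyncio.sleep(0)  # stands in for shell inspection
    return "environment snapshot"


async def explore(task: str) -> str:
    # Simulated failure, to show that one source failing does not sink the rest.
    raise RuntimeError("exploration sandbox unavailable")


async def gather_intelligence(task: str) -> dict[str, str]:
    sources = {
        "web_search": web_search,
        "strategy": generate_strategy,
        "environment": observe_environment,
        "exploration": explore,
    }
    # return_exceptions=True returns exceptions as results instead of raising.
    results = await asyncio.gather(
        *(fn(task) for fn in sources.values()), return_exceptions=True
    )
    return {
        name: f"[failed: {res}]" if isinstance(res, Exception) else res
        for name, res in zip(sources, results)
    }


intel = asyncio.run(gather_intelligence("fix failing pytest suite"))
```

Each source answers a different question, so a failed source is recorded and passed on rather than retried: the synthesis step can still work with the remaining three.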
What I Learned: This was probably the most unexpected discovery. Google's AI Overview (the AI-generated summary at the top of search results) is remarkably effective for technical problem-solving.
Why it works:
- Pre-synthesized: Google already did the hard work of combining multiple sources
- Context-aware: Understands the question and pulls relevant information
- Actionable: Often contains specific commands or approaches
- Quality filtered: Based on authoritative sources
Implementation Learning: I integrated SERP API to fetch Google results, specifically extracting the AI Overview section. In my tests, tasks that used AI Overview had noticeably better first-attempt success rates.
Key Insight: Sometimes the best way to solve a problem is to leverage existing synthesis work. Google AI Overview is essentially a free research assistant that's already analyzed millions of pages.
What I Learned: When generating search queries, Apex2 specifically aims for "low-frequency terms" - technical, specific phrases rather than generic tutorials.
Examples:
- Generic: "how to train neural network python"
- Low-frequency: "pytorch reduce learning rate on plateau scheduler"
Why This Works:
- Generic terms return beginner tutorials (not helpful for specific tasks)
- Low-frequency terms find exact solutions from people who faced the same issue
- GitHub/StackOverflow discussions contain actual working code
- Specific terms filter out noise
Implementation Challenge: I had to prompt the LLM to generate these specific queries. The key was instructing it to use technical jargon and target specific tools/versions. For example: "Generate queries using exact library names, function names, and error messages."
What I Learned: The synthesis phase was the most intellectually interesting part to rebuild. It's not just concatenating information - it's intelligent fusion of multiple intelligence sources into a coherent execution plan.
The synthesizer needs to:
- Assess: What worked? What failed? What's validated?
- Prioritize: Which approach is most likely to succeed?
- Integrate: How does web search confirm/contradict strategy?
- Ground: How does environment state affect the plan?
- Refine: What did exploration tests reveal?
Key Insight: I learned that synthesis is where the "intelligence" of the system really emerges. Each individual phase is relatively simple, but combining them intelligently creates emergent capability.
What I Got Wrong Initially: My first implementation just concatenated all the information and sent it to execution. Results were mediocre because the executor was overwhelmed. I learned that synthesis must distill information, not just aggregate it.
Solution: I added explicit synthesis prompts that ask:
- "What are the 3 most important insights?"
- "What's the validated approach based on all intelligence?"
- "What are the critical warnings from all sources?"
This forced distillation into actionable guidance.
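A rough sketch of how those three questions could be wired into a synthesis prompt. `build_synthesis_prompt`, the section layout, and the `max_chars` truncation are assumptions of mine; only the three questions are taken from the text above.

```python
SYNTHESIS_QUESTIONS = [
    "What are the 3 most important insights?",
    "What's the validated approach based on all intelligence?",
    "What are the critical warnings from all sources?",
]


def build_synthesis_prompt(task: str, intel: dict[str, str], max_chars: int = 2000) -> str:
    """Assemble a distillation prompt; each source is truncated to max_chars."""
    sections = []
    for name, text in intel.items():
        # Truncating long sources keeps any single one from flooding the prompt.
        body = text if len(text) <= max_chars else text[:max_chars] + " [truncated]"
        sections.append(f"## {name}\n{body}")
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(SYNTHESIS_QUESTIONS, 1))
    return (
        f"Task: {task}\n\n"
        + "\n\n".join(sections)
        + "\n\nAnswer concisely, in order:\n"
        + questions
    )
```

The executor then receives only the model's answers to these questions, not the raw intelligence, which is the distillation step my first implementation was missing.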
What I Learned: Different task categories require fundamentally different approaches. Apex2 has specialized guidance for each category:
ML Tasks: The killer insight is that training runs can exceed time limits. The solution: always test with minimal parameters first.
- Train with 1 epoch before full runs
- Use small batch sizes for validation
- Check data shapes before processing
- Verify GPU availability
Why This Matters: I learned that ML tasks fail not because of wrong algorithms, but because of timeouts and resource issues. Testing incrementally prevents 30-minute failed training runs.
Security Tasks: Many operations are IRREVERSIBLE. The approach: ground exact sequences before execution.
- Always backup before destructive operations
- Test on non-critical files first
- Verify exact command sequences
- Document all changes
Why This Matters: I learned that security tasks have asymmetric risk - one wrong command can break everything. The value of verification far exceeds the cost.
Key Insight: Generic prompting treats all tasks the same. Category-specific prompting embeds domain expertise into the system. This was a major "aha moment" for me.
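One simple way to embed this is a lookup table keyed by the predicted category. The sketch below reuses the ML and Security checklists from above; the `CATEGORY_GUIDANCE` structure and `guidance_for` helper are hypothetical names, not Apex2's implementation.

```python
# Checklists copied from the ML and Security sections above.
CATEGORY_GUIDANCE = {
    "ml": [
        "Train with 1 epoch before full runs",
        "Use small batch sizes for validation",
        "Check data shapes before processing",
        "Verify GPU availability",
    ],
    "security": [
        "Always backup before destructive operations",
        "Test on non-critical files first",
        "Verify exact command sequences",
        "Document all changes",
    ],
}


def guidance_for(category: str) -> str:
    """Return a prompt fragment for the category, or '' if none is defined."""
    rules = CATEGORY_GUIDANCE.get(category.lower(), [])
    if not rules:
        return ""
    return "Category guidance:\n" + "\n".join(f"- {rule}" for rule in rules)
```

The returned fragment can be appended to the execution prompt, so adding a new domain means adding one dictionary entry rather than rewriting the prompt.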
What I Learned: This seemed like a minor implementation detail but turned out to be crucial. LLMs often generate file creation commands that fail due to shell escaping issues.
Common Failures:
# WRONG - unquoted delimiter: the shell expands ${price} before Python ever sees it
cat << EOF > app.py
print(f"Total: ${price}")
EOF
# CORRECT - quoting the delimiter disables expansion in the body
cat << 'EOF' > app.py
print(f"Total: ${price}")
EOF
What I Implemented:
- Automatic detection of heredoc commands
- Adding quotes around EOF marker
- Escaping special characters ($, `, \)
- Preserving indentation
- Validation after creation
Key Insight: I learned that execution robustness is about handling the 20 edge cases that occur in real usage. Perfect prompting gets you 80% of the way, but handling heredocs, indentation, escaping, etc. gets you to 95%.
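The "add quotes around the EOF marker" step can be approximated with a small regex rewrite. This is a simplified sketch, not the code from my rebuild or from Apex2: `quote_heredocs` is an illustrative name, and the regex would also touch a bit-shift such as `$((1 << n))`, so a real implementation needs a proper shell tokenizer.

```python
import re

# Matches an unquoted heredoc operator such as `<< EOF` or `<<-EOF`.
# - (?<!<) skips here-strings (`<<<`).
# - An already quoted delimiter ('EOF' or "EOF") does not match, because the
#   delimiter must start with a letter or underscore.
_HEREDOC = re.compile(r"(?<!<)<<(-?)\s*([A-Za-z_][A-Za-z0-9_]*)")


def quote_heredocs(command: str) -> str:
    """Rewrite `<< EOF` as `<< 'EOF'` so the shell leaves the body untouched."""
    return _HEREDOC.sub(lambda m: f"<<{m.group(1)} '{m.group(2)}'", command)


fixed = quote_heredocs('cat << EOF > app.py\nprint(f"Total: ${price}")\nEOF')
```

Only the opening delimiter is quoted; the closing `EOF` line stays bare, which is what the shell expects. The rewrite is idempotent, so running it on an already fixed command changes nothing.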
What I Learned: The system doesn't try to be perfect - it plans for failures and recovers intelligently. This is a mindset shift from "prevent all errors" to "handle errors gracefully."
Recovery Prompts for common scenarios:
- Syntax Errors: Check quotes, escapes, heredoc format
- Import Errors: Verify package installation, check names
- File Not Found: Verify paths, check current directory
- Permission Denied: Check ownership, directory writability
Why This Works: Most execution errors fall into a small set of categories. Having specific recovery prompts for each category dramatically improves recovery rates.
Key Insight: I learned that robustness comes from recovery capability, not from preventing all failures. This is similar to how humans work - we make mistakes but recover quickly.
Implementation Learning: I created an error classification system that identifies error types from command output, then applies the appropriate recovery strategy. This was more effective than generic "try again" approaches.
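A stripped-down version of such a classifier might look like the following. The four categories and recovery hints come from the list above; the regular expressions, the `ERROR_RULES` table, and the fallback hint are assumptions for illustration.

```python
import re

# (category, pattern, recovery hint). The patterns are illustrative guesses at
# common shell and Python error messages, not an exhaustive list.
ERROR_RULES = [
    ("syntax_error",
     re.compile(r"SyntaxError|syntax error|unexpected EOF", re.I),
     "Check quotes, escapes, heredoc format"),
    ("import_error",
     re.compile(r"ModuleNotFoundError|ImportError|No module named", re.I),
     "Verify package installation, check names"),
    ("file_not_found",
     re.compile(r"No such file or directory|FileNotFoundError", re.I),
     "Verify paths, check current directory"),
    ("permission_denied",
     re.compile(r"Permission denied|PermissionError", re.I),
     "Check ownership, directory writability"),
]


def classify_error(output: str) -> tuple[str, str]:
    """Return (category, recovery hint) for the first rule that matches."""
    for name, pattern, hint in ERROR_RULES:
        if pattern.search(output):
            return name, hint
    # Hypothetical fallback for output that matches no known category.
    return "unknown", "Re-read the full output and retry with a smaller step"
```

The hint is then inserted into the recovery prompt, so the model gets a targeted next step instead of a bare "try again".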
What I Learned: LLMs can be overconfident about task completion. They'll sometimes claim success when:
- Tests failed but output was present
- Files were created but with errors
- Commands executed but with wrong results
Validation Checks:
- Parsing errors in output
- Execution errors in logs
- Incomplete test results (0 passed)
- Missing expected outputs
- Failed commands in history
Key Insight: I learned that explicit validation is essential. The system must actively look for failure indicators, not just assume success from command completion.
What Surprised Me: Even with validation, I had to add task-specific completion criteria. For example, "install" tasks should show "Successfully installed", "test" tasks should show "passed" or "OK".
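The validation checks and task-specific criteria can be combined in one function. In this sketch the completion markers follow the "install" and "test" examples above, while the failure patterns and the `validate_completion` signature are my own assumptions.

```python
import re

# Illustrative failure indicators; a real list would be longer.
FAILURE_PATTERNS = [
    r"Traceback \(most recent call last\)",
    r"\b0 passed\b",
    r"\b[1-9]\d* failed\b",
    r"\bcommand not found\b",
]

# Task-specific completion markers, following the examples in the text.
COMPLETION_MARKERS = {
    "install": ["Successfully installed"],
    "test": ["passed", "OK"],
}


def validate_completion(task_type: str, output: str, exit_codes: list[int]) -> list[str]:
    """Return a list of problems; an empty list means no failure indicator was found."""
    problems = []
    if any(code != 0 for code in exit_codes):
        problems.append("a command in the history exited non-zero")
    for pattern in FAILURE_PATTERNS:
        if re.search(pattern, output):
            problems.append(f"failure indicator matched: {pattern}")
    markers = COMPLETION_MARKERS.get(task_type, [])
    if markers and not any(marker in output for marker in markers):
        problems.append(f"none of the expected markers found: {markers}")
    return problems
```

Returning the list of problems, rather than a boolean, lets the agent feed the specific reasons back into the next recovery prompt.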
What I Learned: Apex2 uses only Claude Sonnet 4.5 (with a GPT-5 variant). This seemed limiting at first, but I learned it's actually a strength:
Benefits of Single Model:
- Consistency: No coordination overhead between models
- Simplicity: Easier debugging and iteration
- Cost Efficiency: Caching works better with one model
- Prompt Optimization: Deep optimization for one model's capabilities
Key Insight: I learned that model diversity can introduce as many problems as it solves. Different models have different failure modes, prompt preferences, and output formats. Sticking to one excellent model and optimizing deeply for it beats using multiple models superficially.
When This Matters: In multi-agent systems, coordination overhead is significant. Apex2 avoids this entirely by using one model for all phases. The "intelligence" comes from orchestration, not from model diversity.
This four-phase pattern is reusable beyond Apex2:
- Predict: Understand the task deeply before action
- Explore: Gather diverse intelligence in parallel
- Synthesize: Combine intelligence into actionable plan
- Execute: Run the plan with recovery capabilities
Why It Works: This pattern separates concerns cleanly. Prediction focuses on understanding, exploration on information gathering, synthesis on planning, and execution on action. Each phase can be optimized independently.
Where I'll Use This: I can apply this pattern to many domains, not just coding tasks. It fits any problem where upfront understanding improves execution efficiency.
Traditional approach: Try approach 1 → fails → try approach 2 → fails → try approach 3
Apex2 approach: Gather diverse intelligence in parallel → synthesize → execute optimally
Key Learning: Parallel diversity gives you multiple perspectives simultaneously, then synthesis picks the best path. This is more efficient than sequential trial-and-error.
Different tasks have different risk profiles. Prompts should reflect this:
- High Risk (Security/Production): Emphasize verification, backups, testing
- Medium Risk (Development): Balance speed and safety
- Low Risk (Read-only): Optimize for speed
Implementation: I added risk-level as a parameter throughout the system, affecting prompt content and execution strategy.
For complex operations, test progressively:
- Minimal Test: Smallest possible version (1 epoch, tiny dataset)
- Verify Success: Check outputs, metrics, errors
- Scale Up: Run full version only after validation
Why This Works: Failures are caught early when they're cheap. A 10-second test failure is better than a 30-minute training failure.
Where This Applies:
- ML training (small epochs first)
- Data processing (sample data first)
- Web deployment (test server first)
- Database operations (test queries first)
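The minimal-test, verify, scale-up sequence is easy to express as a small wrapper. `run_incrementally` and `fake_train` below are hypothetical helpers written for this note; the real work (training, a query, a deployment) is whatever callable you pass in.

```python
from typing import Callable


def run_incrementally(
    run: Callable[[int], dict],
    is_healthy: Callable[[dict], bool],
    full_scale: int,
    trial_scale: int = 1,
) -> dict:
    """Run a cheap trial first; launch the full run only if the trial passes."""
    trial = run(trial_scale)
    if not is_healthy(trial):
        raise RuntimeError(f"trial at scale {trial_scale} failed: {trial}")
    return run(full_scale)


# Usage with a stand-in "training" function that just records its calls.
calls = []


def fake_train(epochs: int) -> dict:
    calls.append(epochs)
    return {"epochs": epochs, "loss": 1.0 / (epochs + 1)}


result = run_incrementally(fake_train, lambda r: r["loss"] < 1.0, full_scale=20)
```

If the one-epoch trial fails the health check, the function raises before the expensive run ever starts, which is exactly the cheap-failure property described above.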
What I Learned:
- Caching: Extremely valuable for repeated context (system prompts, documentation)
- Temperature: Lower (0.3-0.5) for structured tasks, higher (0.6-0.7) for creative strategy
- Max Tokens: Different phases need different limits (prediction: 2000, strategy: 4000)
- Prompt Engineering: Claude responds well to structured formats with clear sections
Cost Optimization: Caching reduced costs by ~60% in my testing. The key is structuring prompts so common context (system instructions, task description) is cached.
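One way to keep these per-phase settings in a single place is a small config table. The token limits for prediction and strategy come from the list above and the temperatures sit inside the ranges given there; the `PhaseConfig` type and the default values are my own assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseConfig:
    temperature: float
    max_tokens: int


PHASE_CONFIGS = {
    # Structured extraction: lower temperature, 2000-token limit.
    "prediction": PhaseConfig(temperature=0.3, max_tokens=2000),
    # Creative strategy generation: higher temperature, 4000-token limit.
    "strategy": PhaseConfig(temperature=0.7, max_tokens=4000),
}

# Assumed fallback for phases whose settings are not listed above.
DEFAULT_CONFIG = PhaseConfig(temperature=0.5, max_tokens=2000)


def config_for(phase: str) -> PhaseConfig:
    return PHASE_CONFIGS.get(phase, DEFAULT_CONFIG)
```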
What I Learned:
- SERP API: More reliable than scraping, provides structured data
- Rate Limiting: Important to handle gracefully
- Content Extraction: BeautifulSoup works well for GitHub/StackOverflow
- Result Quality: First 3 results + AI Overview often sufficient
Challenge: Extracting actionable information from web content. Solution: Use Claude to synthesize search results into key insights and commands.
What I Learned:
- Subprocess: Use subprocess.check_output with a timeout for safety
- shell=True: Necessary for complex commands but a security risk (mitigate with validation)
- Error Handling: Capture both stdout and stderr
- Timeouts: Essential for long-running commands
- Working Directory: Always verify before executing
Security Learning: Never execute arbitrary commands without validation. I added checks for dangerous patterns (rm -rf, sudo without explicit permission).
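A minimal sketch of these points together. Note that it uses `subprocess.run` rather than `check_output`, because `run` captures stderr as well and does not raise on a non-zero exit code. The deny-list is deliberately tiny and illustrative, and `run_command` is a hypothetical helper name.

```python
import re
import subprocess
from typing import Optional

# Illustrative deny-list; a real system needs a far more careful policy.
DANGEROUS_PATTERNS = [
    re.compile(r"\brm\s+-(?:rf|fr)\b"),
    re.compile(r"\bsudo\b"),
]


def run_command(command: str, timeout: float = 30.0, cwd: Optional[str] = None) -> dict:
    """Run a shell command with a timeout, capturing stdout and stderr."""
    for pattern in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return {"ok": False, "blocked": True, "stdout": "",
                    "stderr": f"blocked by pattern: {pattern.pattern}"}
    try:
        proc = subprocess.run(
            command,
            shell=True,           # needed for pipes and redirects; hence the check above
            capture_output=True,  # collect both stdout and stderr
            text=True,
            timeout=timeout,
            cwd=cwd,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "blocked": False, "stdout": "",
                "stderr": f"timed out after {timeout}s"}
    return {"ok": proc.returncode == 0, "blocked": False,
            "stdout": proc.stdout, "stderr": proc.stderr}
```

Pattern matching like this is a safety net, not a sandbox: it catches obvious mistakes but is easy to bypass, so it complements isolation rather than replacing it.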
What I Learned:
- Package Managers: Check multiple (pip, npm, apt, brew)
- File Discovery: find command with limits to avoid slow scans
- Process Checking: ps aux filtered by relevant keywords
- Error Suppression: Redirect stderr to DEVNULL for cleaner output
Key Insight: Environment observation should be targeted, not exhaustive. Check what's likely to be relevant based on task prediction.
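Targeted observation can be as simple as checking only the tools relevant to the predicted category. The category-to-tool mapping below is invented for illustration; `shutil.which` is the standard-library way to test whether an executable is on the PATH.

```python
import shutil
from typing import Optional

# Hypothetical mapping from predicted category to the tools worth checking.
TOOLS_BY_CATEGORY = {
    "ml": ("python3", "pip", "nvidia-smi"),
    "web": ("node", "npm"),
}
# Package managers named in the list above, used when the category is unknown.
DEFAULT_TOOLS = ("pip", "npm", "apt", "brew")


def observe_environment(category: str) -> dict[str, Optional[str]]:
    """Map each relevant tool to its path, or None if it is not installed."""
    tools = TOOLS_BY_CATEGORY.get(category, DEFAULT_TOOLS)
    return {tool: shutil.which(tool) for tool in tools}
```

Because `shutil.which` only inspects the PATH, this check is fast and has no side effects, unlike running each tool with a version flag.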
My rebuild lacks comprehensive logging. In production, I'd add:
- Structured logging with levels
- Trace IDs for tracking execution flow
- Metrics collection (latency, success rate, token usage)
- Debug mode for verbose output
I'd add caching for:
- Prediction results for similar tasks
- Web search results (TTL-based)
- Environment observation (cache for N minutes)
- Strategy generation for task categories
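For the TTL-based items, a small in-memory cache would be enough to start with. This `TTLCache` class is a sketch I have not built into the rebuild; the injectable `clock` parameter is an assumption added to make expiry testable without sleeping.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return None
        return value


# Usage with a fake clock, so expiry can be shown without waiting.
now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.set("pytorch lr scheduler", "cached search result")
fresh = cache.get("pytorch lr scheduler")
now[0] = 10.0
stale = cache.get("pytorch lr scheduler")
```

Web search results and environment snapshots would get different TTLs, since a search answer stays valid far longer than a process list.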
For high-risk operations, I'd add:
- Confirmation prompts before destructive operations
- Manual approval for security tasks
- Review step before final execution
I'd build a more sophisticated error classifier:
- Machine learning-based classification
- Learned recovery strategies from past failures
- Confidence scores for recoveries
For complex tasks, I'd add:
- Streaming output during execution
- Progress indicators for long operations
- Ability to pause/resume
- Interactive debugging
The quality of synthesis determines execution success more than any other factor. Future work should focus on better synthesis techniques:
- Weighted combination based on source reliability
- Conflict resolution when sources disagree
- Confidence scoring for recommendations
Complex tasks should be decomposed into subtasks:
- Identify dependencies between subtasks
- Execute subtasks in optimal order
- Validate each subtask before proceeding
The system should learn from successes and failures:
- Build a knowledge base of successful strategies
- Track which recovery strategies work
- Identify task patterns that predict difficulty
Many tasks involve images, PDFs, or videos. Better multimodal support would:
- Analyze diagrams in documentation
- Extract information from screenshots
- Process video tutorials for steps
Apex2's success shows that one well-orchestrated model beats multiple poorly coordinated agents. This challenges the common belief that more agents = better results.
Why This Matters: Complexity has costs - coordination overhead, debugging difficulty, failure modes. Simple architectures with sophisticated orchestration often outperform complex multi-agent systems.
The difference between good and great agentic systems is often prompt quality. I learned that:
- Specificity matters - vague prompts get vague results
- Structure matters - formatted prompts get formatted outputs
- Context matters - right context dramatically improves results
- Examples matter - few-shot prompts outperform zero-shot
Category-specific prompting shows that domain expertise can be encoded into systems. This has broad implications:
- Expert knowledge can be captured in prompts
- Systems can provide specialized guidance per domain
- Generic systems can be enhanced with domain modules
Most systems focus on preventing errors. Apex2 focuses on recovering from errors. This is more pragmatic because:
- Perfect prevention is impossible
- Recovery is often easier than prevention
- Fast recovery enables bold strategies
- LLM Orchestration: Learned how to coordinate LLM calls effectively
- Prompt Engineering: Developed skills in structured prompting
- System Design: Understood how to architect multi-phase systems
- Error Handling: Learned comprehensive recovery strategies
- First Principles Thinking: Challenge assumptions about what "should" work
- Simplicity: Sometimes less is more
- Risk Management: Balance boldness with safety
- Empirical Validation: Test assumptions rather than trust intuition
The power of synthesis surprised me most. I expected web search or strategy generation to be the "magic bullet," but the real magic happens in synthesis - intelligently combining diverse information into an optimal plan.
This taught me that in AI systems, the "glue" between components often matters more than the components themselves.
Rebuilding Apex2 was an incredible learning experience. The architecture embodies several key principles:
- Understand before acting (Prediction)
- Gather diverse intelligence (Parallel exploration)
- Synthesize intelligently (Combination)
- Execute robustly (Recovery-capable)
- Validate thoroughly (Prevent false completion)
These principles apply far beyond coding agents. They're a framework for approaching complex problems in any domain.
The most important meta-learning: State-of-the-art results come from thoughtful architecture and sophisticated orchestration, not from complexity or model scaling alone.
Apex2 proves that a well-designed single-model system with intelligent information gathering and synthesis can outperform more complex multi-agent systems. This is a powerful lesson for the field of agentic AI.
I'm excited to apply these learnings to future projects and continue exploring the boundaries of what's possible with thoughtful LLM orchestration.
Based on: Apex2 Terminal Bench Agent by heartyguy
Rebuilt by: Francesco Giannicola
Date: 2025-11-02