MY_LEARNINGS.md

My Learnings from Rebuilding Apex2 Terminal Bench Agent

Introduction

I spent considerable time studying and rebuilding the Apex2 Terminal Bench Agent, which achieved state-of-the-art performance (64.50% success rate) on Stanford's Terminal Bench Leaderboard. This document captures what I learned through the process of understanding and reconstructing this sophisticated system.

The original Apex2 by heartyguy represents a fundamental rethinking of agentic coding systems. What struck me most was how they achieved SOTA results not through complexity, but through strategic simplification and intelligent orchestration. This contradicts the common belief that more agents = better performance.

Core Architecture Insights

1. The Power of Predictive Intelligence

What I Learned: The single most important insight I gained is the value of understanding a task before attempting to execute it. Apex2 dedicates an entire phase to prediction, which initially seemed like overhead, but I now understand it's the foundation of efficiency.

The prediction phase extracts:

Task Category: Understanding if it's ML, Security, Web Dev, etc. changes the entire approach
Risk Level: Identifying irreversible operations early prevents catastrophic mistakes
Key Files: Knowing which files matter focuses exploration efforts
Multimodal Needs: Determining if images/docs need analysis avoids wasted API calls
Complexity Estimation: Sets realistic expectations for execution steps

Key Insight: I learned that 2 minutes spent on prediction saves 20 minutes of trial-and-error execution. This is counter-intuitive because it feels like "doing nothing," but it's actually the highest-leverage activity.

Implementation Challenge: I struggled initially with how to extract this information reliably. The solution was using a structured prompt with explicit format requirements. The LLM is excellent at classification and extraction when you give it clear categories and examples.

2. Parallel Intelligence Gathering: Diversity Over Redundancy

What I Learned: Most agentic systems use parallelization for redundancy - running the same task multiple times hoping one succeeds. Apex2 uses parallelization for diversity - gathering different types of intelligence simultaneously.

The parallel phases are:

Web Search: What do others know about this problem?
Strategy Generation: What does the LLM inherently know?
Environment Observation: What's actually present in the system?
Exploration Agent: What unknowns can we test safely?

Key Insight: Each intelligence source provides a different perspective. Web search finds existing solutions, strategy generation extracts model knowledge, environment observation grounds you in reality, and exploration tests assumptions. Together, they create a complete picture.

What Surprised Me: I initially thought web search would be the most valuable. But I found that strategy generation (prompting the LLM to dump all its knowledge) was equally powerful. Modern LLMs have absorbed so much documentation and code that they "know" solutions to most problems - you just need to extract it properly.

3. The Google AI Overview Goldmine

What I Learned: This was probably the most unexpected discovery. Google's AI Overview (the AI-generated summary at the top of search results) is remarkably effective for technical problem-solving.

Why it works:

Pre-synthesized: Google already did the hard work of combining multiple sources
Context-aware: Understands the question and pulls relevant information
Actionable: Often contains specific commands or approaches
Quality filtered: Based on authoritative sources

Implementation Learning: I integrated SERP API to fetch Google results, specifically extracting the AI Overview section. In my tests, tasks that used AI Overview had noticeably better first-attempt success rates.

Key Insight: Sometimes the best way to solve a problem is to leverage existing synthesis work. Google AI Overview is essentially a free research assistant that's already analyzed millions of pages.

4. Low-Frequency Search Terms Beat Generic Queries

What I Learned: When generating search queries, Apex2 specifically aims for "low-frequency terms" - technical, specific phrases rather than generic tutorials.

Examples:

Generic: "how to train neural network python"
Low-frequency: "pytorch reduce learning rate on plateau scheduler"

Why This Works:

Generic terms return beginner tutorials (not helpful for specific tasks)
Low-frequency terms find exact solutions from people who faced the same issue
GitHub/StackOverflow discussions contain actual working code
Specific terms filter out noise

Implementation Challenge: I had to prompt the LLM to generate these specific queries. The key was instructing it to use technical jargon and target specific tools/versions. For example: "Generate queries using exact library names, function names, and error messages."

5. Strategy Synthesis: The Art of Combination

What I Learned: The synthesis phase was the most intellectually interesting part to rebuild. It's not just concatenating information - it's intelligent fusion of multiple intelligence sources into a coherent execution plan.

The synthesizer needs to:

Assess: What worked? What failed? What's validated?
Prioritize: Which approach is most likely to succeed?
Integrate: How does web search confirm/contradict strategy?
Ground: How does environment state affect the plan?
Refine: What did exploration tests reveal?

Key Insight: I learned that synthesis is where the "intelligence" of the system really emerges. Each individual phase is relatively simple, but combining them intelligently creates emergent capability.

What I Got Wrong Initially: My first implementation just concatenated all the information and sent it to execution. Results were mediocre because the executor was overwhelmed. I learned that synthesis must distill information, not just aggregate it.

Solution: I added explicit synthesis prompts that ask:

"What are the 3 most important insights?"
"What's the validated approach based on all intelligence?"
"What are the critical warnings from all sources?"

This forced distillation into actionable guidance.

6. Category-Specific Prompting: Context Matters

What I Learned: Different task categories require fundamentally different approaches. Apex2 has specialized guidance for each category:

ML Tasks: The killer insight is that training runs can exceed time limits. The solution: always test with minimal parameters first.

Train with 1 epoch before full runs
Use small batch sizes for validation
Check data shapes before processing
Verify GPU availability

Why This Matters: I learned that ML tasks fail not because of wrong algorithms, but because of timeouts and resource issues. Testing incrementally prevents 30-minute failed training runs.

Security Tasks: Many operations are IRREVERSIBLE. The approach: ground exact sequences before execution.

Always backup before destructive operations
Test on non-critical files first
Verify exact command sequences
Document all changes

Why This Matters: I learned that security tasks have assymetric risk - one wrong command can break everything. The value of verification far exceeds the cost.

Key Insight: Generic prompting treats all tasks the same. Category-specific prompting embeds domain expertise into the system. This was a major "aha moment" for me.

7. Heredoc Handling: Details Matter

What I Learned: This seemed like a minor implementation detail but turned out to be crucial. LLMs often generate file creation commands that fail due to shell escaping issues.

Common Failures:

# WRONG - variables get expanded
cat << EOF > app.py
print(f"Hello {name}")
EOF

# CORRECT - quotes prevent expansion
cat << 'EOF' > app.py
print(f"Hello {name}")
EOF

What I Implemented:

Automatic detection of heredoc commands
Adding quotes around EOF marker
Escaping special characters ($, `, )
Preserving indentation
Validation after creation

Key Insight: I learned that execution robustness is about handling the 20 edge cases that occur in real usage. Perfect prompting gets you 80% of the way, but handling heredocs, indentation, escaping, etc. gets you to 95%.

8. Recovery Strategies: Plan for Failure

What I Learned: The system doesn't try to be perfect - it plans for failures and recovers intelligently. This is a mindset shift from "prevent all errors" to "handle errors gracefully."

Recovery Prompts for common scenarios:

Syntax Errors: Check quotes, escapes, heredoc format
Import Errors: Verify package installation, check names
File Not Found: Verify paths, check current directory
Permission Denied: Check ownership, directory writability

Why This Works: Most execution errors fall into a small set of categories. Having specific recovery prompts for each category dramatically improves recovery rates.

Key Insight: I learned that robustness comes from recovery capability, not from preventing all failures. This is similar to how humans work - we make mistakes but recover quickly.

Implementation Learning: I created an error classification system that identifies error types from command output, then applies the appropriate recovery strategy. This was more effective than generic "try again" approaches.

9. Validation: Prevent False Completion

What I Learned: LLMs can be overconfident about task completion. They'll sometimes claim success when:

Tests failed but output was present
Files were created but with errors
Commands executed but with wrong results

Validation Checks:

Parsing errors in output
Execution errors in logs
Incomplete test results (0 passed)
Missing expected outputs
Failed commands in history

Key Insight: I learned that explicit validation is essential. The system must actively look for failure indicators, not just assume success from command completion.

What Surprised Me: Even with validation, I had to add task-specific completion criteria. For example, "install" tasks should show "Successfully installed", "test" tasks should show "passed" or "OK".

10. The Single-Model Philosophy

What I Learned: Apex2 uses only Claude Sonnet 4.5 (with GPT-5 variant). This seemed limiting at first, but I learned it's actually a strength:

Benefits of Single Model:

Consistency: No coordination overhead between models
Simplicity: Easier debugging and iteration
Cost Efficiency: Caching works better with one model
Prompt Optimization: Deep optimization for one model's capabilities

Key Insight: I learned that model diversity can introduce as many problems as it solves. Different models have different failure modes, prompt preferences, and output formats. Sticking to one excellent model and optimizing deeply for it beats using multiple models superficially.

When This Matters: In multi-agent systems, coordination overhead is significant. Apex2 avoids this entirely by using one model for all phases. The "intelligence" comes from orchestration, not from model diversity.

Architectural Patterns I Learned

Pattern 1: Predict → Explore → Synthesize → Execute

This four-phase pattern is reusable beyond Apex2:

Predict: Understand the task deeply before action
Explore: Gather diverse intelligence in parallel
Synthesize: Combine intelligence into actionable plan
Execute: Run the plan with recovery capabilities

Why It Works: This pattern separates concerns cleanly. Prediction focuses on understanding, exploration on information gathering, synthesis on planning, and execution on action. Each phase can be optimized independently.

Where I'll Use This: I can apply this pattern to many domains - not just coding tasks. Any problem where upfront understanding improves execution efficiency.

Pattern 2: Parallel Diversity Over Sequential Attempts

Traditional approach: Try approach 1 → fails → try approach 2 → fails → try approach 3

Apex2 approach: Gather diverse intelligence in parallel → synthesize → execute optimally

Key Learning: Parallel diversity gives you multiple perspectives simultaneously, then synthesis picks the best path. This is more efficient than sequential trial-and-error.

Pattern 3: Risk-Aware Prompting

Different tasks have different risk profiles. Prompts should reflect this:

High Risk (Security/Production): Emphasize verification, backups, testing
Medium Risk (Development): Balance speed and safety
Low Risk (Read-only): Optimize for speed

Implementation: I added risk-level as a parameter throughout the system, affecting prompt content and execution strategy.

Pattern 4: Progressive Testing

For complex operations, test progressively:

Minimal Test: Smallest possible version (1 epoch, tiny dataset)
Verify Success: Check outputs, metrics, errors
Scale Up: Run full version only after validation

Why This Works: Failures are caught early when they're cheap. A 10-second test failure is better than a 30-minute training failure.

Where This Applies:

ML training (small epochs first)
Data processing (sample data first)
Web deployment (test server first)
Database operations (test queries first)

Technical Implementation Learnings

Working with Anthropic API

What I Learned:

Caching: Extremely valuable for repeated context (system prompts, documentation)
Temperature: Lower (0.3-0.5) for structured tasks, higher (0.6-0.7) for creative strategy
Max Tokens: Different phases need different limits (prediction: 2000, strategy: 4000)
Prompt Engineering: Claude responds well to structured formats with clear sections

Cost Optimization: Caching reduced costs by ~60% in my testing. The key is structuring prompts so common context (system instructions, task description) is cached.

Web Search Integration

What I Learned:

SERP API: More reliable than scraping, provides structured data
Rate Limiting: Important to handle gracefully
Content Extraction: BeautifulSoup works well for GitHub/StackOverflow
Result Quality: First 3 results + AI Overview often sufficient

Challenge: Extracting actionable information from web content. Solution: Use Claude to synthesize search results into key insights and commands.

Shell Command Execution

What I Learned:

Subprocess: Use check_output with timeout for safety
Shell=True: Necessary for complex commands but security risk (mitigate with validation)
Error Handling: Capture both stdout and stderr
Timeouts: Essential for long-running commands
Working Directory: Always verify before executing

Security Learning: Never execute arbitrary commands without validation. I added checks for dangerous patterns (rm -rf, sudo without explicit permission).

Environment Observation

What I Learned:

Package Managers: Check multiple (pip, npm, apt, brew)
File Discovery: find command with limits to avoid slow scans
Process Checking: ps aux filtered by relevant keywords
Error Suppression: Redirect stderr to DEVNULL for cleaner output

Key Insight: Environment observation should be targeted, not exhaustive. Check what's likely to be relevant based on task prediction.

What I Would Do Differently

1. Add Logging and Observability

My rebuild lacks comprehensive logging. In production, I'd add:

Structured logging with levels
Trace IDs for tracking execution flow
Metrics collection (latency, success rate, token usage)
Debug mode for verbose output

2. Implement Caching Layer

I'd add caching for:

Prediction results for similar tasks
Web search results (TTL-based)
Environment observation (cache for N minutes)
Strategy generation for task categories

3. Add Human-in-the-Loop

For high-risk operations, I'd add:

Confirmation prompts before destructive operations
Manual approval for security tasks
Review step before final execution

4. Better Error Classification

I'd build a more sophisticated error classifier:

Machine learning-based classification
Learned recovery strategies from past failures
Confidence scores for recoveries

5. Interactive Mode

For complex tasks, I'd add:

Streaming output during execution
Progress indicators for long operations
Ability to pause/resume
Interactive debugging

Key Insights for Future Work

1. Intelligence Synthesis is the Bottleneck

The quality of synthesis determines execution success more than any other factor. Future work should focus on better synthesis techniques:

Weighted combination based on source reliability
Conflict resolution when sources disagree
Confidence scoring for recommendations

2. Task Decomposition Needs Improvement

Complex tasks should be decomposed into subtasks:

Identify dependencies between subtasks
Execute subtasks in optimal order
Validate each subtask before proceeding

3. Learning from Execution

The system should learn from successes and failures:

Build a knowledge base of successful strategies
Track which recovery strategies work
Identify task patterns that predict difficulty

4. Multimodal Integration

Many tasks involve images, PDFs, or videos. Better multimodal support would:

Analyze diagrams in documentation
Extract information from screenshots
Process video tutorials for steps

Broader Insights About Agentic Systems

1. Simplicity Beats Complexity

Apex2's success shows that one well-orchestrated model beats multiple poorly coordinated agents. This challenges the common belief that more agents = better results.

Why This Matters: Complexity has costs - coordination overhead, debugging difficulty, failure modes. Simple architectures with sophisticated orchestration often outperform complex multi-agent systems.

2. Prompting is an Underrated Skill

The difference between good and great agentic systems is often prompt quality. I learned that:

Specificity matters - vague prompts get vague results
Structure matters - formatted prompts get formatted outputs
Context matters - right context dramatically improves results
Examples matter - few-shot prompts outperform zero-shot

3. Domain Knowledge Can Be Encoded

Category-specific prompting shows that domain expertise can be encoded into systems. This has broad implications:

Expert knowledge can be captured in prompts
Systems can provide specialized guidance per domain
Generic systems can be enhanced with domain modules

4. Recovery Capability is Undervalued

Most systems focus on preventing errors. Apex2 focuses on recovering from errors. This is more pragmatic because:

Perfect prevention is impossible
Recovery is often easier than prevention
Fast recovery enables bold strategies

Personal Growth

Technical Growth

LLM Orchestration: Learned how to coordinate LLM calls effectively
Prompt Engineering: Developed skills in structured prompting
System Design: Understood how to architect multi-phase systems
Error Handling: Learned comprehensive recovery strategies

Conceptual Growth

First Principles Thinking: Challenge assumptions about what "should" work
Simplicity: Sometimes less is more
Risk Management: Balance boldness with safety
Empirical Validation: Test assumptions rather than trust intuition

What Surprised Me Most

The power of synthesis surprised me most. I expected web search or strategy generation to be the "magic bullet," but the real magic happens in synthesis - intelligently combining diverse information into an optimal plan.

This taught me that in AI systems, the "glue" between components often matters more than the components themselves.

Conclusion

Rebuilding Apex2 was an incredible learning experience. The architecture embodies several key principles:

Understand before acting (Prediction)
Gather diverse intelligence (Parallel exploration)
Synthesize intelligently (Combination)
Execute robustly (Recovery-capable)
Validate thoroughly (Prevent false completion)

These principles apply far beyond coding agents. They're a framework for approaching complex problems in any domain.

The most important meta-learning: State-of-the-art results come from thoughtful architecture and sophisticated orchestration, not from complexity or model scaling alone.

Apex2 proves that a well-designed single-model system with intelligent information gathering and synthesis can outperform more complex multi-agent systems. This is a powerful lesson for the field of agentic AI.

I'm excited to apply these learnings to future projects and continue exploring the boundaries of what's possible with thoughtful LLM orchestration.

Based on: Apex2 Terminal Bench Agent by heartyguy

Rebuilt by: Francesco Giannicola

Date: 2025-11-02

FilesExpand file tree

MY_LEARNINGS.md

Latest commit

History

MY_LEARNINGS.md

File metadata and controls

My Learnings from Rebuilding Apex2 Terminal Bench Agent

Introduction

Core Architecture Insights

1. The Power of Predictive Intelligence

2. Parallel Intelligence Gathering: Diversity Over Redundancy

3. The Google AI Overview Goldmine

4. Low-Frequency Search Terms Beat Generic Queries

5. Strategy Synthesis: The Art of Combination

6. Category-Specific Prompting: Context Matters

7. Heredoc Handling: Details Matter

8. Recovery Strategies: Plan for Failure

9. Validation: Prevent False Completion

10. The Single-Model Philosophy

Architectural Patterns I Learned

Pattern 1: Predict → Explore → Synthesize → Execute

Pattern 2: Parallel Diversity Over Sequential Attempts

Pattern 3: Risk-Aware Prompting

Pattern 4: Progressive Testing

Technical Implementation Learnings

Working with Anthropic API

Web Search Integration

Shell Command Execution

Environment Observation

What I Would Do Differently

1. Add Logging and Observability

2. Implement Caching Layer

3. Add Human-in-the-Loop

4. Better Error Classification

5. Interactive Mode

Key Insights for Future Work

1. Intelligence Synthesis is the Bottleneck

2. Task Decomposition Needs Improvement

3. Learning from Execution

4. Multimodal Integration

Broader Insights About Agentic Systems

1. Simplicity Beats Complexity

2. Prompting is an Underrated Skill

3. Domain Knowledge Can Be Encoded

4. Recovery Capability is Undervalued

Personal Growth

Technical Growth

Conceptual Growth

What Surprised Me Most

Conclusion