
Conversation

@rmarescu (Member) commented Feb 28, 2025

  • Add 2 new model options for config.ai.model for anthropic
    • claude-3-7-sonnet-20250219
    • claude-3-7-sonnet-latest
  • Add 2 new computer tools: computer_20250124, bash_20250124
  • Add support for new actions: triple_click, hold_key, left_mouse_down, left_mouse_up, wait, scroll

Note

This PR does not add support for reasoning
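
For reference, selecting one of the new models is just a config.ai.model change. A minimal sketch, assuming the usual shortest.config.ts shape (every field other than ai.model is illustrative here):

    // shortest.config.ts — sketch only; ai.model is the relevant part
    import type { ShortestConfig } from "@antiwork/shortest";

    export default {
      headless: false,
      baseUrl: "http://localhost:3000",
      testPattern: "**/*.test.ts",
      ai: {
        provider: "anthropic",
        apiKey: process.env.ANTHROPIC_API_KEY,
        model: "claude-3-7-sonnet-latest", // or "claude-3-7-sonnet-20250219"
      },
    } satisfies ShortestConfig;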

@vercel bot commented Feb 28, 2025

1 Skipped Deployment
shortest — ⬜️ Ignored — Updated (UTC): Mar 9, 2025 1:21am

@rmarescu (Member Author)

Adding support for Claude 3.7 is not as simple as I thought. Some things I've noticed so far:

  1. Vercel AI SDK (@ai-sdk/anthropic 1.1.12) doesn't yet support the new bash_20250124 tool that the new model requires. I'll open a PR with them unless someone else gets to it first (see the SDK sketch after this list).

    {
      "type": "error",
      "error": {
        "type": "invalid_request_error",
        "message": "'claude-3-7-sonnet-20250219' does not support tool types: bash_20241022, text_editor_20241022. Did you mean one of bash_20250124, computer_20250124, text_editor_20250124?"
      }
    }
  2. The new computer_20250124 tool comes with new actions (vs computer_20241022), like: hold_key, left_mouse_down, left_mouse_up, triple_click, scroll, wait.

  3. Our BrowserActionEnum only supports the existing actions (expected), plus our own custom actions (like navigate, check_email, etc.). It needs to support the new actions from computer_20250124.
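
For context on 1, this is roughly how the current 20241022 provider-defined tools are wired up through the AI SDK; the 3.7 models reject these and expect the _20250124 variants, which the SDK doesn't expose yet. A sketch using the existing API (the execute handlers are placeholders):

    import { generateText } from "ai";
    import { anthropic } from "@ai-sdk/anthropic";

    const result = await generateText({
      model: anthropic("claude-3-5-sonnet-20241022"),
      tools: {
        // Provider-defined tools currently available in @ai-sdk/anthropic 1.1.x.
        // claude-3-7-sonnet-* rejects these and asks for bash_20250124 /
        // computer_20250124 instead (see the error above).
        bash: anthropic.tools.bash_20241022({
          execute: async ({ command }) => runInSandbox(command), // placeholder
        }),
        computer: anthropic.tools.computer_20241022({
          displayWidthPx: 1920,
          displayHeightPx: 1080,
          execute: async ({ action, coordinate, text }) =>
            executeBrowserAction(action, coordinate, text), // placeholder
        }),
      },
      prompt: "Take a screenshot of the current page",
    });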

At a high level, we need some sort of tools registry to handle this (rough sketch after the list):

  • register all supported tools (from providers and our own)
  • AIClient uses a single interface to access the available tools, which are determined by the selected config.ai.model
  • provide a version-specific adapter for each tool that determines which actions are available
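
A rough TypeScript sketch of that registry (all names here are illustrative, not a final API):

    type ToolVersion =
      | "computer_20241022"
      | "bash_20241022"
      | "computer_20250124"
      | "bash_20250124";

    interface ToolAdapter {
      version: ToolVersion;
      // Actions this tool version exposes, including our custom ones.
      supportedActions: string[];
    }

    class ToolRegistry {
      private adapters = new Map<ToolVersion, ToolAdapter>();

      register(adapter: ToolAdapter): void {
        this.adapters.set(adapter.version, adapter);
      }

      // AIClient asks the registry which tools/actions apply to config.ai.model.
      toolsForModel(model: string): ToolAdapter[] {
        const versions: ToolVersion[] = model.includes("claude-3-7")
          ? ["computer_20250124", "bash_20250124"]
          : ["computer_20241022", "bash_20241022"];
        return versions
          .map((v) => this.adapters.get(v))
          .filter((a): a is ToolAdapter => a !== undefined);
      }
    }

    const createToolRegistry = (): ToolRegistry => {
      const registry = new ToolRegistry();
      registry.register({
        version: "computer_20250124",
        supportedActions: ["screenshot", "left_click", "triple_click", "hold_key", "scroll", "wait"],
      });
      // ...register the 20241022 variants and our custom actions the same way
      return registry;
    };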

@rmarescu (Member Author) commented Mar 1, 2025

For 1, I've created vercel/ai#5024.

rmarescu added a commit that referenced this pull request Mar 2, 2025
### What

Implement a comprehensive refactoring of the AI tools system  

* Create a modular tool system with a central ToolRegistry class
* Move tools from hardcoded implementation in AIClient to dedicated files
* Implement tool registration through createToolRegistry()
* Map Anthropic's computer actions to internal actions

### Why

In preparation for adding Claude 3.7.

Ref
#369 (comment)
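
As a hedged illustration of the action-mapping bullet above, the mapping presumably looks something like this (the enum is declared locally only to keep the sketch self-contained; WAIT and SCROLL appear in this PR's diff, the other member names are guesses):

    // Sketch: map Anthropic computer-use action names to internal actions.
    enum InternalActionEnum {
      LEFT_CLICK = "left_click",
      TRIPLE_CLICK = "triple_click",
      HOLD_KEY = "hold_key",
      SCROLL = "scroll",
      WAIT = "wait",
    }

    const anthropicActionToInternal: Record<string, InternalActionEnum> = {
      left_click: InternalActionEnum.LEFT_CLICK,
      triple_click: InternalActionEnum.TRIPLE_CLICK,
      hold_key: InternalActionEnum.HOLD_KEY,
      scroll: InternalActionEnum.SCROLL,
      wait: InternalActionEnum.WAIT,
    };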
@rmarescu rmarescu force-pushed the rmarescu/claude-3-7 branch from 6fa2132 to 392fc36 Compare March 6, 2025 04:17
@rmarescu rmarescu self-assigned this Mar 8, 2025
@rmarescu rmarescu added this to the v0.4.6 milestone Mar 8, 2025
@rmarescu rmarescu marked this pull request as ready for review March 9, 2025 02:41
const animationPromise = showClickAnimation(page, "left");

await Promise.all([
page.mouse.click(scaledX, scaledY, { delay: 200 }), // delay to match animation duration
@rmarescu (Member Author)
Not sure if that delay is necessary. Removing for now.

Comment on lines -218 to -219
case "middle_click":
case "double_click": {
@rmarescu (Member Author)
These didn't seem to work correctly before, as the actual button (middle, etc.) or the number of clicks was not passed to Playwright.
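
Presumably the fix is just to forward those options to Playwright, along these lines (a sketch, not the actual diff):

    import type { Page } from "playwright";

    // Sketch: pass the real button and click count through to Playwright.
    async function click(
      page: Page,
      x: number,
      y: number,
      button: "left" | "middle" | "right" = "left",
      clickCount = 1,
    ): Promise<void> {
      await page.mouse.click(x, y, { button, clickCount });
    }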

Comment on lines +378 to +380
const keys = Array.isArray(actions.keyboardShortcuts[keyText])
? actions.keyboardShortcuts[keyText]
: [actions.keyboardShortcuts[keyText] || input.text];
@rmarescu (Member Author)
Kept logic similar to the key action. Generally, I don't think this is needed much, as Playwright accepts a combo key (e.g. Ctrl+c). Leaving for now; it can be removed in the future.
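
For reference, Playwright handles the combined form in a single call, which is why the per-key lookup is mostly redundant (sketch):

    import type { Page } from "playwright";

    // Sketch: Playwright accepts modifier combos directly, e.g. "Control+C".
    async function pressShortcut(page: Page, combo: string): Promise<void> {
      await page.keyboard.press(combo);
    }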

}

case "sleep": {
case InternalActionEnum.WAIT:
@rmarescu (Member Author)
This is similar to sleep, except that the duration here is measured in seconds, while for sleep it is milliseconds. Some future refactoring can clean up and consolidate the logic.
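
In other words, something along these lines (a sketch; only the seconds-to-milliseconds conversion is the point):

    import type { Page } from "playwright";

    // Sketch: WAIT durations arrive in seconds, Playwright waits in milliseconds.
    async function waitAction(page: Page, seconds: number): Promise<string> {
      await page.waitForTimeout(seconds * 1000);
      return `Waited for ${seconds} second${seconds !== 1 ? "s" : ""}`;
    }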

output = `Waited for ${seconds} second${seconds !== 1 ? "s" : ""}`;
break;

case InternalActionEnum.SCROLL:
@rmarescu (Member Author)
This is very slow. scroll_amount is measured in clicks (possibly pixels). During some tests (e.g. scrolling to the bottom of the page), the value returned by the AI was 10-20. Hopefully they improve the functionality over time.
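
A sketch of the Playwright side, assuming a fixed clicks-to-pixels conversion (the constant and the direction handling are assumptions):

    import type { Page } from "playwright";

    // Sketch: Anthropic reports scroll_amount in "clicks"; convert to pixels
    // for Playwright's wheel API. PIXELS_PER_CLICK is an assumed constant.
    const PIXELS_PER_CLICK = 100;

    async function scroll(
      page: Page,
      direction: "up" | "down" | "left" | "right",
      clicks: number,
    ): Promise<void> {
      const amount = clicks * PIXELS_PER_CLICK;
      const [deltaX, deltaY] =
        direction === "up" ? [0, -amount]
        : direction === "down" ? [0, amount]
        : direction === "left" ? [-amount, 0]
        : [amount, 0];
      await page.mouse.wheel(deltaX, deltaY);
    }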

metadata: await this.getMetadata(),
metadata: browserMetadata,
};
this.log.trace("Screenshot details", metadata);
@rmarescu (Member Author)
This was redundant. Also, removed some extra metadata from the logs that was just noise.

@rmarescu (Member Author)
Unit tests for browser tools can be added in a separate PR.

@rmarescu (Member Author) commented Mar 9, 2025

I've done an analysis of 2 runs to compare the models, and Claude 3.7 is 4.6 times more expensive at first glance. For that reason, I'm not enabling it on our project and am keeping Claude 3.5 as the default model.

yt-claude-3-5-sonnet-20241022.txt
yt-claude-3-7-latest.txt

AI analysis

After comparing the two Shortest test results using Claude 3.5 Sonnet and Claude 3.7 Sonnet, I can identify several key differences in how they executed the same YouTube testing scenarios.

Main Differences

  1. Testing Execution Approach

    Claude 3.7: More efficient and direct in its approach, often taking direct actions with fewer intermediate steps
    Claude 3.5: More verbose in its approach, with more screenshots, more careful checks, and more redundant actions

  2. Token Usage

    Claude 3.7: Used significantly more tokens (1,253,265 tokens, ~$3.85)
    Claude 3.5: Used fewer tokens (267,581 tokens, ~$0.84)

  3. Test Duration

    Claude 3.7: Longer test duration (416.62 seconds)
    Claude 3.5: Shorter test duration (151.72 seconds)

  4. Decision Making

    Claude 3.7: Made more attempts to achieve the desired outcome, showing more persistence
    Claude 3.5: Often accepted initial results more readily

  5. Diagnostic Behaviors

    Claude 3.7: Performed more extensive monitoring via more frequent screenshots
    Claude 3.7: Made more detailed observations about the state of the page
    Claude 3.7: Employed more waiting strategies to allow actions to complete

Why Claude 3.7 Used More Tokens

  1. More Verbose Responses: Claude 3.7's messages were generally longer and more detailed about what it was observing and planning to do.
  2. More Total Actions: Claude 3.7 performed more actions per test (especially screenshots), which generated more conversation turns between the AI and the testing framework.
  3. More Diagnostic Messages: Claude 3.7 included more detailed analysis in its messages about what it observed on the screen and about the system's state.
  4. More Persistence: When encountering issues (especially with video quality settings in the second test), Claude 3.7 made more repeated attempts, which resulted in more message exchanges.
  5. More Thorough Verification: Claude 3.7 spent more time and messages verifying that actions had completed successfully.

The difference is particularly stark in the second test case (YouTube video playback settings) where Claude 3.7 used more than 1 million tokens ($3.15) compared to Claude 3.5's much more modest token usage ($0.61). This suggests that while Claude 3.7 may be more thorough and persistent, it's also significantly more expensive to run for these types of automated testing scenarios.

Cost Comparison

Claude 3.5 Sonnet (Total: ~$0.84)

Test 1 (Search for puppies): 61,558 tokens (~$0.20)
Test 2 (Change playback settings): 194,631 tokens (~$0.61)
Test 3 (Visit TED channel): 11,392 tokens (~$0.04)

Claude 3.7 Sonnet (Total: ~$3.85)

Test 1 (Search for puppies): 96,632 tokens (~$0.30)
Test 2 (Change playback settings): 1,032,598 tokens (~$3.15)
Test 3 (Visit TED channel): 124,035 tokens (~$0.39)

From this data, we can see that Claude 3.7 Sonnet is approximately 4.6 times more expensive than Claude 3.5 Sonnet for these test scenarios ($3.85 vs $0.84).
The most extreme difference is in Test 2 (changing YouTube playback settings), where Claude 3.7 used 5.3 times more tokens than Claude 3.5 ($3.15 vs $0.61). This suggests that Claude 3.7's thoroughness and persistence in handling complex interactions comes with a significantly higher cost.

@rmarescu rmarescu merged commit 17fd881 into main Mar 9, 2025
6 checks passed
@rmarescu rmarescu deleted the rmarescu/claude-3-7 branch March 9, 2025 02:50
@github-project-automation github-project-automation bot moved this to Done in Shortest Mar 9, 2025