
Conversation

@rmarescu (Member) commented Feb 28, 2025

  • Add 2 new model options for config.ai.model for anthropic
    • claude-3-7-sonnet-20250219
    • claude-3-7-sonnet-latest
  • Add 2 new computer tools: computer_20250124, bash_20250124
  • Add support for new actions: triple_click, hold_key, left_mouse_down, left_mouse_up, wait, scroll

Note

This PR does not add support for reasoning
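
For reference, selecting one of the new models is just a config.ai.model change. A minimal sketch, assuming the usual shortest.config.ts shape (every field other than ai.model is illustrative here):

    // shortest.config.ts — sketch only; ai.model is the relevant part
    import type { ShortestConfig } from "@antiwork/shortest";

    export default {
      headless: false,
      baseUrl: "http://localhost:3000",
      testPattern: "**/*.test.ts",
      ai: {
        provider: "anthropic",
        apiKey: process.env.ANTHROPIC_API_KEY,
        model: "claude-3-7-sonnet-latest", // or "claude-3-7-sonnet-20250219"
      },
    } satisfies ShortestConfig;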

@vercel bot commented Feb 28, 2025

1 Skipped Deployment
shortest — ⬜️ Ignored — Updated (UTC): Mar 9, 2025 1:21am

@rmarescu (Member Author)

Adding support for Claude 3.7 is not as simple as I thought. Some things I've noticed so far:

  1. Vercel AI SDK (@ai-sdk/anthropic 1.1.12) doesn't yet support the new bash_20250124 tool that the new model requires. I'll open a PR with them unless someone else gets to it first (see the SDK sketch after this list).

    {
      "type": "error",
      "error": {
        "type": "invalid_request_error",
        "message": "'claude-3-7-sonnet-20250219' does not support tool types: bash_20241022, text_editor_20241022. Did you mean one of bash_20250124, computer_20250124, text_editor_20250124?"
      }
    }
  2. The new computer_20250124 tool comes with new actions (vs computer_20241022), like: hold_key, left_mouse_down, left_mouse_up, triple_click, scroll, wait.

  3. Our BrowserActionEnum only supports the existing actions (expected), plus our own custom actions (like navigate, check_email, etc.). It needs to support the new actions from computer_20250124.
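
For context on 1, this is roughly how the current 20241022 provider-defined tools are wired up through the AI SDK; the 3.7 models reject these and expect the _20250124 variants, which the SDK doesn't expose yet. A sketch using the existing API (the execute handlers are placeholders):

    import { generateText } from "ai";
    import { anthropic } from "@ai-sdk/anthropic";

    const result = await generateText({
      model: anthropic("claude-3-5-sonnet-20241022"),
      tools: {
        // Provider-defined tools currently available in @ai-sdk/anthropic 1.1.x.
        // claude-3-7-sonnet-* rejects these and asks for bash_20250124 /
        // computer_20250124 instead (see the error above).
        bash: anthropic.tools.bash_20241022({
          execute: async ({ command }) => runInSandbox(command), // placeholder
        }),
        computer: anthropic.tools.computer_20241022({
          displayWidthPx: 1920,
          displayHeightPx: 1080,
          execute: async ({ action, coordinate, text }) =>
            executeBrowserAction(action, coordinate, text), // placeholder
        }),
      },
      prompt: "Take a screenshot of the current page",
    });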

At a high level, we need some sort of tools registry to handle this (rough sketch after the list):

  • register all supported tools (from providers and our own)
  • AIClient uses a single interface to access the available tools, which are determined by the selected config.ai.model
  • provide a version-specific adapter for each tool that determines which actions are available
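
A rough TypeScript sketch of that registry (all names here are illustrative, not a final API):

    type ToolVersion =
      | "computer_20241022"
      | "bash_20241022"
      | "computer_20250124"
      | "bash_20250124";

    interface ToolAdapter {
      version: ToolVersion;
      // Actions this tool version exposes, including our custom ones.
      supportedActions: string[];
    }

    class ToolRegistry {
      private adapters = new Map<ToolVersion, ToolAdapter>();

      register(adapter: ToolAdapter): void {
        this.adapters.set(adapter.version, adapter);
      }

      // AIClient asks the registry which tools/actions apply to config.ai.model.
      toolsForModel(model: string): ToolAdapter[] {
        const versions: ToolVersion[] = model.includes("claude-3-7")
          ? ["computer_20250124", "bash_20250124"]
          : ["computer_20241022", "bash_20241022"];
        return versions
          .map((v) => this.adapters.get(v))
          .filter((a): a is ToolAdapter => a !== undefined);
      }
    }

    const createToolRegistry = (): ToolRegistry => {
      const registry = new ToolRegistry();
      registry.register({
        version: "computer_20250124",
        supportedActions: ["screenshot", "left_click", "triple_click", "hold_key", "scroll", "wait"],
      });
      // ...register the 20241022 variants and our custom actions the same way
      return registry;
    };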

@rmarescu (Member Author) commented Mar 1, 2025

For 1, I've created vercel/ai#5024.

rmarescu added a commit that referenced this pull request Mar 2, 2025
### What

Implement a comprehensive refactoring of the AI tools system  

* Create a modular tool system with a central ToolRegistry class
* Move tools from hardcoded implementation in AIClient to dedicated files
* Implement tool registration through createToolRegistry()
* Map Anthropic's computer actions to internal actions

### Why

In preparation for adding Claude 3.7.

Ref
#369 (comment)
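
As a hedged illustration of the action-mapping bullet above, the mapping presumably looks something like this (the enum is declared locally only to keep the sketch self-contained; WAIT and SCROLL appear in this PR's diff, the other member names are guesses):

    // Sketch: map Anthropic computer-use action names to internal actions.
    enum InternalActionEnum {
      LEFT_CLICK = "left_click",
      TRIPLE_CLICK = "triple_click",
      HOLD_KEY = "hold_key",
      SCROLL = "scroll",
      WAIT = "wait",
    }

    const anthropicActionToInternal: Record<string, InternalActionEnum> = {
      left_click: InternalActionEnum.LEFT_CLICK,
      triple_click: InternalActionEnum.TRIPLE_CLICK,
      hold_key: InternalActionEnum.HOLD_KEY,
      scroll: InternalActionEnum.SCROLL,
      wait: InternalActionEnum.WAIT,
    };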
@rmarescu rmarescu force-pushed the rmarescu/claude-3-7 branch from 6fa2132 to 392fc36 Compare March 6, 2025 04:17
@rmarescu rmarescu self-assigned this Mar 8, 2025
@rmarescu rmarescu added this to the v0.4.6 milestone Mar 8, 2025
@rmarescu rmarescu marked this pull request as ready for review March 9, 2025 02:41
const animationPromise = showClickAnimation(page, "left");

await Promise.all([
page.mouse.click(scaledX, scaledY, { delay: 200 }), // delay to match animation duration
@rmarescu (Member Author)
Not sure if that delay is necessary. Removing for now.

Comment on lines -218 to -219
case "middle_click":
case "double_click": {
@rmarescu (Member Author)
These didn't seem to work correctly before, as the actual button (middle, etc.) or the number of clicks was not passed to Playwright.
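
Presumably the fix is just to forward those options to Playwright, along these lines (a sketch, not the actual diff):

    import type { Page } from "playwright";

    // Sketch: pass the real button and click count through to Playwright.
    async function click(
      page: Page,
      x: number,
      y: number,
      button: "left" | "middle" | "right" = "left",
      clickCount = 1,
    ): Promise<void> {
      await page.mouse.click(x, y, { button, clickCount });
    }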

Comment on lines +378 to +380
const keys = Array.isArray(actions.keyboardShortcuts[keyText])
? actions.keyboardShortcuts[keyText]
: [actions.keyboardShortcuts[keyText] || input.text];
@rmarescu (Member Author)
Kept logic similar to the key action. Generally, I don't think this is needed much, as Playwright accepts a combo key (e.g. Ctrl+c). Leaving for now; it can be removed in the future.
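
For reference, Playwright handles the combined form in a single call, which is why the per-key lookup is mostly redundant (sketch):

    import type { Page } from "playwright";

    // Sketch: Playwright accepts modifier combos directly, e.g. "Control+C".
    async function pressShortcut(page: Page, combo: string): Promise<void> {
      await page.keyboard.press(combo);
    }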

}

case "sleep": {
case InternalActionEnum.WAIT:
@rmarescu (Member Author)
This is similar to sleep, except that the duration here is measured in seconds, while for sleep it is milliseconds. Some future refactoring can clean up and consolidate the logic.
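
In other words, something along these lines (a sketch; only the seconds-to-milliseconds conversion is the point):

    import type { Page } from "playwright";

    // Sketch: WAIT durations arrive in seconds, Playwright waits in milliseconds.
    async function waitAction(page: Page, seconds: number): Promise<string> {
      await page.waitForTimeout(seconds * 1000);
      return `Waited for ${seconds} second${seconds !== 1 ? "s" : ""}`;
    }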

output = `Waited for ${seconds} second${seconds !== 1 ? "s" : ""}`;
break;

case InternalActionEnum.SCROLL:
@rmarescu (Member Author)
This is very slow. scroll_amount is measured in clicks (possibly pixels). During some tests (e.g. scrolling to the bottom of the page), the value returned by the AI was 10-20. Hopefully they improve the functionality over time.
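
A sketch of the Playwright side, assuming a fixed clicks-to-pixels conversion (the constant and the direction handling are assumptions):

    import type { Page } from "playwright";

    // Sketch: Anthropic reports scroll_amount in "clicks"; convert to pixels
    // for Playwright's wheel API. PIXELS_PER_CLICK is an assumed constant.
    const PIXELS_PER_CLICK = 100;

    async function scroll(
      page: Page,
      direction: "up" | "down" | "left" | "right",
      clicks: number,
    ): Promise<void> {
      const amount = clicks * PIXELS_PER_CLICK;
      const [deltaX, deltaY] =
        direction === "up" ? [0, -amount]
        : direction === "down" ? [0, amount]
        : direction === "left" ? [-amount, 0]
        : [amount, 0];
      await page.mouse.wheel(deltaX, deltaY);
    }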

metadata: await this.getMetadata(),
metadata: browserMetadata,
};
this.log.trace("Screenshot details", metadata);
@rmarescu (Member Author)
This was redundant. Also, removed some extra metadata from the logs that was just noise.

@rmarescu (Member Author)
Unit tests for browser tools can be added in a separate PR.

@rmarescu (Member Author) commented Mar 9, 2025

I've done an analysis of 2 runs to compare the models, and Claude 3.7 is 4.6 times more expensive at first glance. For that reason, I'm not enabling it on our project and am keeping Claude 3.5 as the default model.

yt-claude-3-5-sonnet-20241022.txt
yt-claude-3-7-latest.txt

AI analysis

After comparing the two Shortest test results using Claude 3.5 Sonnet and Claude 3.7 Sonnet, I can identify several key differences in how they executed the same YouTube testing scenarios.

Main Differences

  1. Testing Execution Approach

    Claude 3.7: More efficient and direct in its approach, often taking direct actions with fewer intermediate steps
    Claude 3.5: More verbose in its approach, with more screenshots, more careful checks, and more redundant actions

  2. Token Usage

    Claude 3.7: Used significantly more tokens (1,253,265 tokens, ~$3.85)
    Claude 3.5: Used fewer tokens (267,581 tokens, ~$0.84)

  3. Test Duration

    Claude 3.7: Longer test duration (416.62 seconds)
    Claude 3.5: Shorter test duration (151.72 seconds)

  4. Decision Making

    Claude 3.7: Made more attempts to achieve the desired outcome, showing more persistence
    Claude 3.5: Often accepted initial results more readily

  5. Diagnostic Behaviors

    Claude 3.7: Performed more extensive monitoring via more frequent screenshots
    Claude 3.7: Made more detailed observations about the state of the page
    Claude 3.7: Employed more waiting strategies to allow actions to complete

Why Claude 3.7 Used More Tokens

  1. More Verbose Responses: Claude 3.7's messages were generally longer and more detailed about what it was observing and planning to do.
  2. More Total Actions: Claude 3.7 performed more actions per test (especially screenshots), which generated more conversation turns between the AI and the testing framework.
  3. More Diagnostic Messages: Claude 3.7 included more detailed analysis in its messages about what it observed on the screen and about the system's state.
  4. More Persistence: When encountering issues (especially with video quality settings in the second test), Claude 3.7 made more repeated attempts, which resulted in more message exchanges.
  5. More Thorough Verification: Claude 3.7 spent more time and messages verifying that actions had completed successfully.

The difference is particularly stark in the second test case (YouTube video playback settings) where Claude 3.7 used more than 1 million tokens ($3.15) compared to Claude 3.5's much more modest token usage ($0.61). This suggests that while Claude 3.7 may be more thorough and persistent, it's also significantly more expensive to run for these types of automated testing scenarios.

Cost Comparison

Claude 3.5 Sonnet (Total: ~$0.84)

Test 1 (Search for puppies): 61,558 tokens (~$0.20)
Test 2 (Change playback settings): 194,631 tokens (~$0.61)
Test 3 (Visit TED channel): 11,392 tokens (~$0.04)

Claude 3.7 Sonnet (Total: ~$3.85)

Test 1 (Search for puppies): 96,632 tokens (~$0.30)
Test 2 (Change playback settings): 1,032,598 tokens (~$3.15)
Test 3 (Visit TED channel): 124,035 tokens (~$0.39)

From this data, we can see that Claude 3.7 Sonnet is approximately 4.6 times more expensive than Claude 3.5 Sonnet for these test scenarios ($3.85 vs $0.84).
The most extreme difference is in Test 2 (changing YouTube playback settings), where Claude 3.7 used 5.3 times more tokens than Claude 3.5 ($3.15 vs $0.61). This suggests that Claude 3.7's thoroughness and persistence in handling complex interactions comes with a significantly higher cost.

@rmarescu rmarescu merged commit 17fd881 into main Mar 9, 2025
6 checks passed
@rmarescu rmarescu deleted the rmarescu/claude-3-7 branch March 9, 2025 02:50
@github-project-automation github-project-automation bot moved this to Done in Shortest Mar 9, 2025