# feat: add claude-3-7-sonnet-20250219 #369
## Conversation
           Adding support for Claude 3.7 is not as simple as I thought. Some things I've noticed so far: 
 High-level, we need some sort of tools registry to handle this: 
  | 
    
| 
           For 1 I've created vercel/ai#5024  | 
    
### What

Implement a comprehensive refactoring of the AI tools system:

* Create a modular tool system with a central `ToolRegistry` class
* Move tools from the hardcoded implementation in `AIClient` to dedicated files
* Implement tool registration through `createToolRegistry()`
* Map Anthropic's computer actions to internal actions

### Why

In preparation for adding Claude 3.7. Ref #369 (comment)
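To make the description concrete, here is a minimal TypeScript sketch of what a central tool registry along these lines could look like. The names `ToolRegistry` and `createToolRegistry()` come from the list above; everything else (handler signature, tool names) is illustrative rather than the PR's actual code.

```ts
// A minimal sketch only; the actual ToolRegistry in this PR may be shaped differently.
type ToolHandler = (input: Record<string, unknown>) => Promise<string>;

class ToolRegistry {
  private tools = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.tools.set(name, handler);
  }

  async execute(name: string, input: Record<string, unknown>): Promise<string> {
    const handler = this.tools.get(name);
    if (!handler) throw new Error(`Unknown tool: ${name}`);
    return handler(input);
  }
}

// Tools register themselves here instead of being hardcoded inside AIClient.
function createToolRegistry(): ToolRegistry {
  const registry = new ToolRegistry();
  registry.register("computer", async (input) => {
    // The Anthropic computer-use action (e.g. "left_click") would be mapped
    // to an internal action and executed against the browser here.
    return `Executed ${String(input.action)}`;
  });
  registry.register("bash", async (input) => `Ran: ${String(input.command)}`);
  return registry;
}
```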
    const animationPromise = showClickAnimation(page, "left");

    await Promise.all([
      page.mouse.click(scaledX, scaledY, { delay: 200 }), // delay to match animation duration
Not sure if that delay is necessary. Removing for now.
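A minimal sketch of the simplified flow, assuming the delay is dropped and the animation simply runs alongside the click (`showClickAnimation`, `scaledX`, and `scaledY` are taken from the diff above; this is not the PR's final code):

```ts
// Sketch only: run the click animation and the actual click concurrently,
// without the fixed 200 ms delay that was tied to the animation duration.
await Promise.all([
  showClickAnimation(page, "left"),
  page.mouse.click(scaledX, scaledY),
]);
```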
| case "middle_click": | ||
| case "double_click": { | 
These didn't seem to work correctly before: the actual button (middle, etc.) and the number of clicks were not being passed to Playwright.
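For reference, Playwright's `mouse.click` accepts `button` and `clickCount` options, so a fix is expected to look roughly like this sketch (the helper name and signature are illustrative, not the PR's actual code):

```ts
// Sketch: pass the actual button and click count through to Playwright,
// instead of always issuing a single left click.
import type { Page } from "@playwright/test";

async function clickAt(page: Page, x: number, y: number, action: string) {
  switch (action) {
    case "middle_click":
      await page.mouse.click(x, y, { button: "middle" });
      break;
    case "double_click":
      await page.mouse.click(x, y, { clickCount: 2 });
      break;
    default:
      await page.mouse.click(x, y);
  }
}
```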
    const keys = Array.isArray(actions.keyboardShortcuts[keyText])
      ? actions.keyboardShortcuts[keyText]
      : [actions.keyboardShortcuts[keyText] || input.text];
Kept logic similar to the key action. Generally, I think this is not needed as much, since Playwright accepts a combined shortcut (e.g. `Ctrl+c`). Leaving it for now; it can be removed in the future.
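For context, Playwright's `keyboard.press` handles combined shortcuts natively, which is why the per-key mapping above may eventually become unnecessary. A hypothetical sketch:

```ts
// Sketch: Playwright accepts a combined shortcut string directly,
// e.g. "Control+C" or "Shift+ArrowDown", so no manual splitting into
// individual keys is required.
import type { Page } from "@playwright/test";

async function pressShortcut(page: Page, shortcut: string) {
  await page.keyboard.press(shortcut);
}
```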
    }

    case "sleep": {
    case InternalActionEnum.WAIT:
This is similar to sleep, except that the duration here is measured in seconds, while for sleep it is milliseconds. Some future refactoring can consolidate the two code paths.
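A sketch of how the wait case could normalize its unit before delegating (the helper name is hypothetical; Playwright's `waitForTimeout` takes milliseconds):

```ts
// Sketch: "wait" takes seconds, "sleep" takes milliseconds; both can share
// one code path once the unit is normalized to milliseconds.
import type { Page } from "@playwright/test";

async function wait(page: Page, seconds: number): Promise<string> {
  await page.waitForTimeout(seconds * 1000); // waitForTimeout expects milliseconds
  return `Waited for ${seconds} second${seconds !== 1 ? "s" : ""}`;
}
```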
    output = `Waited for ${seconds} second${seconds !== 1 ? "s" : ""}`;
    break;

    case InternalActionEnum.SCROLL:
This is very slow. scroll_amount is measured in clicks (possibly pixels). During some tests (e.g. scrolling to the bottom of the page), the value returned by the AI was 10-20. Hopefully they improve the functionality over time.
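For illustration, converting `scroll_amount` clicks into a pixel delta for Playwright's `mouse.wheel` could look like the sketch below; the 100 px-per-click factor is an assumption, not a value from this PR:

```ts
// Sketch: convert scroll "clicks" into a pixel delta for mouse.wheel.
// PIXELS_PER_CLICK is an assumed heuristic, not a value from this PR.
import type { Page } from "@playwright/test";

const PIXELS_PER_CLICK = 100;

async function scroll(page: Page, direction: "up" | "down", scrollAmount: number) {
  const deltaY = scrollAmount * PIXELS_PER_CLICK * (direction === "down" ? 1 : -1);
  await page.mouse.wheel(0, deltaY);
}
```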
    metadata: await this.getMetadata(),
    metadata: browserMetadata,
    };
    this.log.trace("Screenshot details", metadata);
This was redundant. Also, removed some extra metadata from the logs that was just noise.
Unit tests for browser tools can be added in a separate PR.
I've done an analysis between 2 runs to compare the models, and Claude 3.7 is 4.6 times more expensive at first glance. For that reason, I'm not enabling it on our project, while keeping Claude 3.5 as the default model.

Attachment: yt-claude-3-5-sonnet-20241022.txt

**AI analysis**

After comparing the two Shortest test results using Claude 3.5 Sonnet and Claude 3.7 Sonnet, I can identify several key differences in how they executed the same YouTube testing scenarios.

**Main Differences**

**Why Claude 3.7 Used More Tokens**

The difference is particularly stark in the second test case (YouTube video playback settings), where Claude 3.7 used more than 1 million tokens ($3.15) compared to Claude 3.5's much more modest token usage ($0.61). This suggests that while Claude 3.7 may be more thorough and persistent, it's also significantly more expensive to run for these types of automated testing scenarios.

**Cost Comparison**

Claude 3.5 Sonnet (Total: ~$0.84)
- Test 1 (Search for puppies): 61,558 tokens (…)

Claude 3.7 Sonnet (Total: ~$3.85)
- Test 1 (Search for puppies): 96,632 tokens (…)

From this data, we can see that Claude 3.7 Sonnet is approximately 4.6 times more expensive than Claude 3.5 Sonnet for these test scenarios ($3.85 vs $0.84).
| Provider | Model | Alias | Tool versions | New actions |
| --- | --- | --- | --- | --- |
| anthropic | claude-3-7-sonnet-20250219 | claude-3-7-sonnet-latest | computer_20250124, bash_20250124 | triple_click, hold_key, left_mouse_down, left_mouse_up, wait, scroll |

**Note**

This PR does not add support for reasoning.
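For illustration only, mapping the new `computer_20250124` actions to internal actions could look like the sketch below; `InternalActionEnum.WAIT` and `SCROLL` appear in the diff above, while the remaining enum members and the mapping shape are assumptions, not the PR's actual code:

```ts
// Sketch only: WAIT and SCROLL are seen in the diff; the other members and
// this exact mapping are assumptions for illustration.
enum InternalActionEnum {
  TRIPLE_CLICK = "triple_click",
  HOLD_KEY = "hold_key",
  LEFT_MOUSE_DOWN = "left_mouse_down",
  LEFT_MOUSE_UP = "left_mouse_up",
  WAIT = "wait",
  SCROLL = "scroll",
}

// Maps Anthropic computer_20250124 action names to internal actions.
const anthropicToInternal: Record<string, InternalActionEnum> = {
  triple_click: InternalActionEnum.TRIPLE_CLICK,
  hold_key: InternalActionEnum.HOLD_KEY,
  left_mouse_down: InternalActionEnum.LEFT_MOUSE_DOWN,
  left_mouse_up: InternalActionEnum.LEFT_MOUSE_UP,
  wait: InternalActionEnum.WAIT,
  scroll: InternalActionEnum.SCROLL,
};
```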