An AI Agent for Computer Use is an autonomous program that can reason about tasks, plan sequences of actions, and act on a computer or mobile device through clicks, keystrokes and other input events, command-line operations, and internal or external API calls. These agents combine perception, decision-making, and control capabilities to interact with digital interfaces and accomplish user-specified goals independently.
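The perceive-plan-act loop this definition describes can be sketched in a few lines of Python. This is an illustrative skeleton only: `ComputerUseAgent`, `next_action`, `observe`, and `execute` are hypothetical names, not the API of any framework listed below.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "key", or "done"
    payload: dict  # coordinates, text to type, etc.

class ComputerUseAgent:
    """Minimal perceive-plan-act loop (illustrative sketch, not a real framework)."""

    def __init__(self, model, executor):
        self.model = model        # decides the next action from goal + observation
        self.executor = executor  # performs clicks/keystrokes, returns a new observation

    def run(self, goal: str, max_steps: int = 10) -> list:
        history = []
        observation = self.executor.observe()  # e.g. a screenshot or DOM dump
        for _ in range(max_steps):
            action = self.model.next_action(goal, observation, history)
            if action.kind == "done":          # model signals task completion
                break
            observation = self.executor.execute(action)
            history.append(action)
        return history
```

In a real system the model would be a (V)LM call and the executor would drive the OS or a browser; the projects and papers below fill in those two components in many different ways.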
A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.
- Anthropic | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
- Bill Gates | AI is about to completely change how you use computers
- Ethan Mollick | When you give a Claude a mouse
- GUI Agents: A Survey (Dec. 2024)
  - General survey of GUI agents
- Large Language Model-Brained GUI Agents: A Survey (Nov. 2024)
  - Focus on LLM-based approaches
  - Website
- GUI Agents with Foundation Models: A Comprehensive Survey (Nov. 2024)
  - Comprehensive overview of foundation model-based GUI agents
- Large Action Models: From Inception to Implementation (Dec. 2024)
  - Comprehensive framework for developing LAMs that can perform real-world actions beyond language generation
  - Details key stages including data collection, model training, environment integration, grounding, and evaluation
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (Dec. 2024)
  - Novel reward-guided navigation approach
- SpiritSight Agent: Advanced GUI Agent with One Look (Dec. 2024)
  - Single-shot GUI interaction approach
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs (Dec. 2024)
  - Novel approach for automatic GUI functionality annotation
- Simulate Before Act: Model-Based Planning for Web Agents (Dec. 2024)
  - Novel model-based planning approach using LLM world models
- Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents (Dec. 2024)
  - Novel autonomous skill discovery framework for web agents
  - Code
- Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents (Dec. 2024)
  - Novel framework for contextualizing web pages to enhance LLM agent decision making
- Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL (Dec. 2024)
  - Novel value-based offline RL approach for training VLM device-control agents
- Magentic-One (Nov. 2024)
  - Multi-agent system with orchestrator-led coordination
  - Strong performance on GAIA, WebArena, and AssistantBench
- Agent Workflow Memory (Sep. 2024)
  - Novel workflow memory framework for agents
  - Code
- The Impact of Element Ordering on LM Agent Performance (Sep. 2024)
  - Novel study on element ordering's impact on agent performance
  - Code
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
  - Novel reasoning and learning framework
  - Website
- OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models (Aug. 2024)
  - Open platform for web-based agent deployment
  - Code
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
  - Hierarchical architecture with flexible DOM distillation
  - Novel denoising method for web navigation
- Apple Intelligence Foundation Language Models (Jul. 2024)
  - Vision-Language Model with Private Cloud Compute
  - Novel foundation model architecture
- Tree Search for Language Model Agents (Jul. 2024)
  - Multi-step reasoning and planning with best-first tree search
  - Novel approach for LLM-based agents
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
  - Novel reinforcement learning approach
  - Code
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
  - Multi-agent collaboration for mobile device operation
  - Code
- Octopus Series: On-device Language Models for Computer Control (Apr. 2024)
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-Based Web Navigating Agent (Apr. 2024)
  - Novel approach for real-world web navigation and bilingual benchmark
  - Code
- Cradle: Empowering Foundation Agents towards General Computer Control (Mar. 2024)
  - Focus on general computer control using Red Dead Redemption II as a case study
  - Code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
  - Novel Chain-of-Action-Thought framework for Android interaction
  - Code
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
  - Vision-language model for computer control
  - Code
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
  - Vision-Language Model for PC interaction
  - Code
- UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
  - Specialized for Windows OS interaction
  - Code
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation (Feb. 2024)
  - Novel comprehensive environment perception (CEP) approach for exhaustive GUI perception
  - Introduces conditional action prediction (CAP) for reliable action response
- Intention-in-Interaction (IN3): Tell Me More! (Feb. 2024)
  - Novel benchmark for evaluating user intention understanding in agent designs
  - Introduces model experts for robust user-agent interaction
- Dual-View Visual Contextualization for Web Navigation (Feb. 2024)
  - Novel approach for automatic web navigation with language instructions
  - Key: HTML elements, visual contextualization
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
  - Specialized for mobile UI and infographics understanding
  - Novel approach for visual interface comprehension
- GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024)
  - Demonstrates GPT-4V capabilities for web interaction
  - Code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
  - Visual perception for mobile device interaction
  - Code
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (Jan. 2024)
  - End-to-end approach for web interaction
  - Code
- CogAgent: A Visual Language Model for GUI Agents (Dec. 2023)
  - Works across PC and Android platforms
  - Code
- AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
  - Focused on smartphone interaction
  - Code
- LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
  - Novel approach to web navigation
  - Code
- AndroidEnv: A Reinforcement Learning Platform for Android (May 2021)
  - Reinforcement learning platform for Android interaction
  - Code
- OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
  - Novel vision-based screen parsing method for UI screenshots
  - Combines finetuned interactable icon detection and functional description models
  - Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
  - Mobile UI understanding
  - Code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024)
  - Advanced visual grounding techniques
  - Code
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (Oct. 2024)
  - Multimodal LLM for universal UI understanding across diverse platforms
  - Introduces adaptive gridding for high-resolution perception
  - Preprint
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
  - Universal approach to GUI interaction
  - Code
- OS-ATLAS: Foundation Action Model for Generalist GUI Agents (Oct. 2024)
  - Comprehensive action modeling
  - Code
- UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding (Dec. 2024)
  - Novel framework for building VLMs with strong UI element grounding capabilities
- Grounding Multimodal Large Language Model in GUI World (Dec. 2024)
  - Novel GUI grounding framework with automated data collection engine and lightweight grounding module
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Dec. 2024)
  - Novel interaction-driven approach for automated GUI trajectory synthesis
  - Introduces reverse task synthesis and trajectory reward model
  - Code
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (Dec. 2024)
  - Web tutorial-based trajectory synthesis
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (Jun. 2024)
  - Novel approach to continual learning from trajectories
- Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale (Sep. 2024)
  - Scalable demonstration generation
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following (Feb. 2024)
  - Multi-turn instruction dataset for web agents
  - Code
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation (Aug. 2024)
  - Chinese benchmark for agent evaluation
  - Code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? (Jul. 2024)
  - Benchmark for realistic and time-consuming web tasks
  - Code
- Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
  - Large-scale web interaction dataset
  - Code
- Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
  - Large-scale dataset for Android interaction
  - Real-world device control scenarios
- WebShop: Towards Scalable Real-World Web Interaction (Jul. 2022)
  - Dataset for grounded language agents in web interaction
  - Code
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
  - Mobile app UI dataset
  - Design-focused data collection
- A3: Android Agent Arena for Mobile GUI Agents (Jan. 2025)
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
  - Comprehensive evaluation framework
  - Code
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (May 2024)
  - Android-focused evaluation
  - Code
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
  - Evaluation in data science workflows
  - Code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
  - Mobile agent evaluation
  - Code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024)
  - Web-focused evaluation
  - Code
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (Sep. 2024)
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May 2023)
  - Mobile-focused evaluation framework
  - Code
- Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
  - Security analysis of computer agents
  - Code
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage (Sep. 2024)
  - Privacy and security analysis
- GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning (Jun. 2024)
  - Safety mechanisms for agents
- Framework for building AI agent systems; simplifies the creation of event-driven, distributed, scalable, and resilient agentic applications
- Autonomous GPT-4 agent with a task-automation focus
- Makes websites accessible for AI agents with vision + HTML extraction; supports multi-tab management and custom actions with LangChain integration
- macOS implementation with Claude integration
- Game automation (specialized use case)
- Ready-to-use implementation with a comprehensive toolset
- Advanced computer control
- Computer control agent with a task-automation focus
- AI web agent framework with a modular architecture
- macOS-specific tools with Anthropic integration
- Browser automation with GPT-4 Vision integration
- AI-first process automation with multimodal model integration
- Open-source UI interaction framework with cross-platform support
- General-purpose computer control framework; Python-based, extensible architecture
- Open Source Computer Use by E2B
  - Open-source implementation of computer control capabilities
  - Secure sandboxed environment for AI agents
- Computer control framework with vision-based automation
- AI web agent framework; automates browser-based workflows with LLMs using vision and HTML extraction
- Device operation toolkit; extensible agent framework
- Web page annotation tool with vision-language model support
- Windows inside a Docker container
- Secure desktop environment; agent testing platform
- Docker container for running virtual machines using QEMU
- Native UI automation in JavaScript/TypeScript
- Cross-platform GUI automation; Python-based control library
- Commercial computer control capability, integrated with Claude 3.5 models
- AI agents that can fully complete tasks in any web environment
Contributions are welcome! Please feel free to submit a Pull Request.