# Awesome-GUI-Agents
A curated list of papers and resources for GUI agents
Table of Contents:
🚀 Updates
- November 23, 2025: We summarized and analyzed the AAAI 2026 accepted papers on GUI agents. Check it out.
- November 14, 2025: We summarized and analyzed the ICLR 2026 papers on GUI agents. Check it out.
- November 8, 2025: We are happy to share that our two papers, GUI-G² and GUI-RC, were accepted by AAAI 2026.
- October 28, 2025: We summarized the GUI agent papers from ICLR 2026. Please refer to ICLR 2026.
- September 16, 2025: We released our new paper on GUI automation: UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning.
- August 18, 2025: We open-sourced our GUI-G2-3B and GUI-G2-7B models. Try them out.
- August 14, 2025: We released our new paper on GUI grounding: Test-Time Reinforcement Learning for GUI Grounding via Region Consistency.
- August 1, 2025: We will add a weekly section summarizing GUI agent research papers. Stay tuned!
- July 22, 2025: We released our new paper on GUI grounding: GUI-G²: Gaussian Reward Modeling for GUI Grounding. Check it out!
- April 22, 2025: We're excited to announce that our paper has been published and is now available on arXiv. We welcome your attention and feedback! Check it out. We also updated the GUI Agents List based on RL (R1 style).
- April 2, 2025: We have uploaded the paper to arXiv; please allow some time for it to appear. Meanwhile, we will keep updating this repo.
- March 24, 2025: We updated the repository and released our comprehensive survey on GUI agents.
If you'd like to include your paper, or need to update any details such as GitHub repo information or code URLs, please feel free to submit a pull request.
📚 2025-8-25 to 2025-8-29 📚
Weekly Paper List:
- Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
- WEBSIGHT: A Vision-First Architecture for Robust Web Agents
- PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration
- UItron: Foundational GUI Agent with Advanced Perception and Planning
📚 2025-8-18 to 2025-8-22 📚
Weekly Paper List:
- CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks
- COMPUTERRL: SCALING END-TO-END ONLINE REINFORCEMENT LEARNING FOR COMPUTER USE AGENTS
- Mobile-Agent-v3: Fundamental Agents for GUI Automation
- SWIRL: A STAGED WORKFLOW FOR INTERLEAVED REINFORCEMENT LEARNING IN MOBILE GUI CONTROL
📚 2025-8-11 to 2025-8-15 📚
Weekly Paper List:
- OPENCUA: Open Foundations for Computer-Use Agents
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
- WinSpot: A Windows GUI Grounding Benchmark with Multimodal Large Language Models
- InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
- UI-Venus Technical Report: Building High-performance UI Agents with RFT
We have divided GUI Agents into four modules: perception, exploration, planning, and interaction, as shown below:
We have dedicated a separate chapter to datasets and benchmarks for GUI Agents, with all content presented in chronological order.
UI-Venus-1.5 Technical Report
POINTS-GUI-G: GUI-Grounding Journey
OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution (Baidu)
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Fara-7B: An Efficient Agentic Model for Computer Use
Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Surfer 2: The Next Generation of Cross-Platform Computer-Use Agents [blog]
Holo1.5 - Open Foundation Models for Computer Use Agents [blog]
Agent S3: THE UNREASONABLE EFFECTIVENESS OF SCALING AGENTS FOR COMPUTER USE
Mano Technical Report
SCALECUA: SCALING OPEN-SOURCE COMPUTER USE AGENTS WITH CROSS-PLATFORM DATA
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
CODA: COORDINATING THE CEREBRUM AND CEREBELLUM FOR A DUAL-BRAIN COMPUTER USE AGENT WITH DECOUPLED REINFORCEMENT LEARNING
UItron: Foundational GUI Agent with Advanced Perception and Planning
Mobile-Agent-v3: Fundamental Agents for GUI Automation
UI-Venus Technical Report: Building High-performance UI Agents with RFT
OPENCUA: Open Foundations for Computer-Use Agents
MAGICGUI: A FOUNDATIONAL MOBILE GUI AGENT WITH SCALABLE DATA PIPELINE AND REINFORCEMENT FINE-TUNING
Magentic-UI: Towards Human-in-the-loop Agentic Systems
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
GTA1: GUI Test-time Scaling Agent
AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
1. ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
1. MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
2. A Generative Visual GUI World Model for App Agents
3. MobileDreamer: Generative Sketch World Model for GUI Agent
4. Generative Visual Code Mobile World Models
5. Code2World: A GUI World Model via Renderable Code Generation
2026-02-09: MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2026-02-06: UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
2025-12-24: EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
2025-10-28: MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
2025-10-13: AUTO-SCALING CONTINUOUS MEMORY FOR GUI AGENT
MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
EFFICIENT MULTI-TURN RL FOR GUI AGENTS VIA DECOUPLED TRAINING AND ADAPTIVE DATA CURATION
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
MOBILERL: ADVANCING MOBILE USE AGENTS WITH ADAPTIVE ONLINE REINFORCEMENT LEARNING
COMPUTERRL: SCALING END-TO-END ONLINE REINFORCEMENT LEARNING FOR COMPUTER USE AGENTS
Mobile-Agent-v3: Fundamental Agents for GUI Automation
MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games [Game]
MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
2025-09: MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
2025-08: UI-NEXUS: Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System
2025-07: MMBENCH-GUI: HIERARCHICAL MULTI-PLATFORM EVALUATION FRAMEWORK FOR GUI AGENTS

We will organize these points later:
1. AndroidWorld
2. AndroidControl
3. GUI-Odyssey
4. Amex
5. WebArena
6. WebSRC_v1.0
7. Mind2Web 2
8. ... stay tuned
1. ShowUI: One Vision-Language-Action Model for GUI Visual Agent
2. ShowUI-Aloha: Human-Taught GUI Agent
3. ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
1. AUTO-Explorer: Automated Data Collection for GUI Agent
2. GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning
2025-12: GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning [NeurIPS 2025]
2025-10: EFFICIENT MULTI-TURN RL FOR GUI AGENTS VIA DECOUPLED TRAINING AND ADAPTIVE DATA CURATION
2025-09: UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
2025-05: ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay
2025-05: WEBAGENT-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
1. Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning [AAAI 2025]
2. MEGA-GUI: MULTI-STAGE ENHANCED GROUNDING AGENTS FOR GUI ELEMENTS [2025-11-18]
3. DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [EMNLP 2025]
4. Visual Test-time Scaling for GUI Agent Grounding [ICCV 2025]
5. Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
6. Improved GUI Grounding via Iterative Narrowing
7. Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
8. MVP: Multiple View Prediction Improves GUI Grounding
9. Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
2025-6-02: ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging [drag datasets for CUA]
Using GUI Agent for Electronic Design Automation [CAD]
VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
WinSpot: A Windows GUI Grounding Benchmark with Multimodal Large Language Models
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction [ICML 2025]
ScreenSpot: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
ScreenSpot-V2: OS-ATLAS: Foundation Action Model for Generalist GUI Agents
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
CA-GUI: AgentCPM-GUI: An on-device GUI agent for operating Android apps, enhancing reasoning ability with reinforcement fine-tuning for efficient task execution.
Description: works that use videos to help GUI agents learn.
Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
Watch and Learn: Learning to Use Computers from Online Videos
1. Test-Time Reinforcement Learning for GUI Grounding via Region Consistency
2. GTA1: GUI Test-time Scaling Agent
3. DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [EMNLP 2025]
4. Improved GUI Grounding via Iterative Narrowing
5. Visual Test-time Scaling for GUI Agent Grounding
6. UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
7. GENERALIST SCANNER MEETS SPECIALIST LOCATOR: A SYNERGISTIC COARSE-TO-FINE FRAMEWORK FOR ROBUST GUI GROUNDING
2026-1-08: FOCUSUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
2025-10-24: Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Model
2025-10-01: GUI-KV: EFFICIENT GUI AGENTS VIA KV CACHE WITH SPATIO-TEMPORAL AWARENESS
2026-1-20: Continual GUI Agents
2026-1-12: From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation
2025-11-27: Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
2025-11-17: Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents
2025-11-12: GROUNDING COMPUTER USE AGENTS ON HUMAN DEMONSTRATIONS (open-sourced training data)
2025-11-3: HYPERCLICK: ADVANCING RELIABLE GUI GROUNDING VIA UNCERTAINTY CALIBRATION
2025-11-3: GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation
2025-10-24: UI-INS: ENHANCING GUI GROUNDING WITH MULTI-PERSPECTIVE INSTRUCTION-AS-REASONING
2025-10-22: AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification
2025-10-13: GUI-SHIFT: ENHANCING VLM-BASED GUI AGENTS THROUGH SELF-SUPERVISED REINFORCEMENT LEARNING
2025-10-04: GUI-SPOTLIGHT: ADAPTIVE ITERATIVE FOCUS REFINEMENT FOR ENHANCED GUI VISUAL GROUNDING
2025-9-30: Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
2025-9-30: UI-UG: A Unified MLLM for UI Understanding and Generation
2025-9-28: GUI-SHEPHERD: RELIABLE PROCESS REWARD AND VERIFICATION FOR LONG-SEQUENCE GUI TASKS
2025-9-28: EFFICIENT MULTI-TURN RL FOR GUI AGENTS VIA DECOUPLED TRAINING AND ADAPTIVE DATA CURATION
2025-9-25: LEARNING GUI GROUNDING WITH SPATIAL REASONING FROM VISUAL FEEDBACK
2025-9-23: Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent
2025-9-22: GUI-ARP: ENHANCING GROUNDING WITH ADAPTIVE REGION PERCEPTION FOR GUI AGENTS
2025-9-22: BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent (NeurIPS 2025)
2025-9-16: UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
2025-9-06: WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
2025-9-05: Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding
2025-8-28: SWIRL: A STAGED WORKFLOW FOR INTERLEAVED REINFORCEMENT LEARNING IN MOBILE GUI CONTROL
2025-8-20: CODA: COORDINATING THE CEREBRUM AND CEREBELLUM FOR A DUAL-BRAIN COMPUTER USE AGENT WITH DECOUPLED REINFORCEMENT LEARNING
2025-8-20: COMPUTERRL: SCALING END-TO-END ONLINE REINFORCEMENT LEARNING FOR COMPUTER USE AGENTS
2025-8-18: CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks
2025-8-11: InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization
2025-8-06: SEA: Self-Evolution Agent with Step-wise Reward for Computer Use
2025-8-06: NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks
2025-8-04: GuirlVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning
2025-7-22: GUI-G²: Gaussian Reward Modeling for GUI Grounding
2025-7-09: MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment
2025-7-09: GTA1: GUI Test-time Scaling Agent
2025-6-25: Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards
2025-6-13: LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
2025-6-06: Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation
2025-5-29: Grounded Reinforcement Learning for Visual Reasoning
2025-5-22: WEBAGENT-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
2025-5-21: GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
2025-5-18: Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning
2025-4-19: InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
2025-4-14: GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
2025-3-27: UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
2025-10: MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces
2025-10: ReInAgent: A Context-Aware GUI Agent Enabling Human-in-the-Loop Mobile Task Navigation
2025-10: IMPROVING GUI GROUNDING WITH EXPLICIT POSITION-TO-COORDINATE MAPPING
2025-10: PAL-UI: PLANNING WITH ACTIVE LOOK-BACK FOR VISION-BASED GUI AGENTS (SFT)
2025-10: AGENT-SCANKIT: UNRAVELING MEMORY AND REASONING OF MULTIMODAL AGENTS VIA SENSITIVITY PERTURBATIONS
2025-9: Log2Plan: An Adaptive GUI Automation Framework Integrated with Task Mining Approach
2025-9: Retrieval-augmented GUI Agents with Generative Guidelines
2025-9: Instruction Agent: Enhancing Agent with Expert Demonstration
2025-9: Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding
2025-8: SWIRL: A STAGED WORKFLOW FOR INTERLEAVED REINFORCEMENT LEARNING IN MOBILE GUI CONTROL
2025-8: PerPilot: Personalizing VLM-based Mobile Agents via Memory and Exploration
2025-8: WEBSIGHT: A Vision-First Architecture for Robust Web Agents
2025-8: Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
2025-8: You Don’t Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation
2025-8: Browsing Like Human: A Multimodal Web Agent with Experiential Fast-and-Slow Thinking [ACL 2025]
2025-8: Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement
2025-7: ZonUI-3B: A Lightweight Vision–Language Model for Cross-Resolution GUI Grounding
2025-7: Qwen-GUI-3B: A Lightweight Vision–Language Model for Cross-Resolution GUI Grounding
2025-6: Understanding GUI Agent Localization Biases through Logit Sharpness
2025-6: DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
2025-6: GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
2025-5: Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment
2025-5: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
2025-5: SpiritSight Agent: Advanced GUI Agent with One Look (CVPR 2025)
| Title & Time | Links |
|---|---|
| Less is More: Empowering GUI Agent with Context-Aware Simplification (2025-7, ICCV 2025) | Github Paper |
| Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills (2025-6) | Github Paper |
| GUI-Explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent (2025-5, ACL 2025) | Github Paper |
| MobileA^3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users (2025-5) | Github Paper |
| Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment (2025-3) | Github Paper |
| API Agents vs. GUI Agents: Divergence and Convergence (2025-3) | Github Paper |
| AppAgentX: Evolving GUI Agents as Proficient Smartphone Users (2025-3) | Github Paper |
| Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems (2025-3) | Github Paper |
| UI-TARS: Pioneering Automated GUI Interaction with Native Agents (2025-2) | Github Paper |
| Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks (2025-1) | Github Paper |
| InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (2025-1) | Github Paper |
| OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (2024-12) | Github Paper |
| PC Agent: While You Sleep, AI Works - A Cognitive Journey into Digital World (2024-12) | Github Paper |
| Aria-UI: Visual Grounding for GUI Instructions (2024-12) | Github Paper |
| Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining (2024-12) | Github Paper |
| AgentTrek: AGENT TRAJECTORY SYNTHESIS VIA GUIDING REPLAY WITH WEB TUTORIALS (2024-12) | Github Paper |
| AGUVIS: UNIFIED PURE VISION AGENTS FOR AUTONOMOUS GUI INTERACTION (2024-12) | Github Paper |
| Ponder & Press: Advancing Visual GUI Agent towards General Computer Control (2024-12) | Github Paper |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent (2024-11) | Github Paper |
| The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use (2024-11) | Github Paper |
| MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation (2024-10, NAACL 2025 Demo) | Github Paper |
| OS-ATLAS: A FOUNDATION ACTION MODEL FOR GENERALIST GUI AGENTS (2024-10) | Github Paper |
| AutoGLM: Autonomous Foundation Agents for GUIs (2024-10) | Github Paper |
| FERRET-UI 2: MASTERING UNIVERSAL USER INTERFACE UNDERSTANDING ACROSS PLATFORMS (2024-10) | Github Paper |
| AutoWebGLM: A Large Language Model-based Web Navigating Agent (2024-10) | Github Paper |
| AGENT S: AN OPEN AGENTIC FRAMEWORK THAT USES COMPUTERS LIKE A HUMAN (2024-10) | Github Paper |
| Navigating the Digital World as Humans Do: UNIVERSAL VISUAL GROUNDING FOR GUI AGENTS (2024-10) | Github Paper |
| MOBILEFLOW: A MULTIMODAL LLM FOR MOBILE GUI AGENT (2024-8) | Github Paper |
| AppAgent v2: Advanced Agent for Flexible Mobile Interactions (2024-8) | Github Paper |
| OmniParser for Pure Vision Based GUI Agent (2024-8) | Github Paper |
| OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop (2024-7) | Github Paper |
| Android in the Zoo: Chain-of-Action-Thought for GUI Agents (2024-7) | Github Paper |
| CRADLE: Empowering Foundation Agents Towards General Computer Control (2024-7) | Github Paper |
| VGA: Vision GUI Assistant - Minimizing Hallucinations through Image-Centric Fine-Tuning (2024-6) | Github Paper |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (2024-6) | Github Paper |
| Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (2024-6) | Github Paper |
| SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (2024-5) | Github Paper |
| UFO: A UI-Focused Agent for Windows OS Interaction (2024-5) | Github Paper |
| MOBILE-AGENT: AUTONOMOUS MULTI-MODAL MOBILE DEVICE AGENT WITH VISUAL PERCEPTION (2024-4) | Github Paper |
| WebArena: A REALISTIC WEB ENVIRONMENT FOR BUILDING AUTONOMOUS AGENTS (2024-4) | Github Paper |
| TRAINING A VISION LANGUAGE MODEL AS SMARTPHONE ASSISTANT (2024-4) | Github Paper |
| GPT-4V(ision) is a Generalist Web Agent, if Grounded (SeeAct) (2024-3) | Github Paper |
| AutoDroid: LLM-powered Task Automation in Android (2024-3) | Github Paper |
| SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (2024-2) | Github Paper |
| OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (2024-2) | Github Paper |
| Understanding the Weakness of Large Language Model Agents within a Complex Android Environment (2024-2) | Github Paper |
| MOBILEAGENT: ENHANCING MOBILE CONTROL VIA HUMAN-MACHINE INTERACTION AND SOP INTEGRATION (2024-1) | Github Paper |
| ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation (2024-1) | Github Paper |
| AppAgent: Multimodal Agents as Smartphone Users (2023-12) | Github Paper |
| CogAgent: A Visual Language Model for GUI Agents (2023-12) | Github Paper |
| MIND2WEB: Towards a Generalist Agent for the Web (2023-12) | Github Paper |
| Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V (2023-11) | Github Paper |
| META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI (2022-12) | Github Paper |
| UIBert: Learning Generic Multimodal Representations for UI Understanding (2021-8) | Github Paper |
| Title & Time | Links |
|---|---|
| UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis (2025-04) | Github Paper |
| Ui-vision: A desktop-centric gui benchmark for visual perception and interaction (2025-03) | Github Paper |
| Worldgui: Dynamic testing for comprehensive desktop gui automation (2025-02) | Github Paper |
| Screenspot-pro: Gui grounding for professional high-resolution computer use (2025-01) | Github Paper |
| Webwalker: Benchmarking llms in web traversal (2025-01) | Github Paper |
| A3: Android agent arena for mobile gui agents (2025-01) | Github Paper |
| Gui testing arena: A unified benchmark for advancing autonomous gui testing agent (2024-12) | Github Paper |
| Harnessing web page uis for text-rich visual understanding (2024-11) | Github Paper |
| On the effects of data scale on computer control agents (2024-11) | Github Paper |
| AndroidLab: Training and systematic benchmarking of android autonomous agents (2024-10) | Github Paper |
| Spa-bench: A comprehensive benchmark for smartphone agent evaluation (2024-10) | Github Paper |
| Read anywhere pointed: Layout-aware gui screen reading with tree-of-lens grounding (2024-10) | Github Paper |
| Crab: Cross-environment agent benchmark for multimodal language model agents (2024-10) | Github Paper |
| Androidworld: A dynamic benchmarking environment for autonomous agents (2024-10) | Github Paper |
| Benchmarking mobile device control agents across diverse configurations (2024-10) | Github Paper |
| Agentstudio: A toolkit for building general virtual agents (2024-10) | Github Paper |
| Weblinx: Real-world website navigation with multi-turn dialogue (2024-10) | Github Paper |
| Windows agent arena: Evaluating multi-modal os agents at scale (2024-09) | Github Paper |
| Understanding the weakness of large language model agents within a complex android environment (2024-09) | Github Paper |
| Llamatouch: A faithful and scalable testbed for mobile ui automation task evaluation (2024-08) | Github Paper |
| Amex: Android multi-annotation expo dataset for mobile gui agents (2024-07) | Github Paper |
| Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web (2024-07) | HuggingFace Paper |
| Spider2-v: How far are multimodal agents from automating data science and engineering workflows? (2024-07) | Github Paper |
| Webcanvas: Benchmarking web agents in online environments (2024-07) | HuggingFace Paper |
| Gui odyssey: A comprehensive dataset for cross-app gui navigation on mobile devices (2024-06) | Github Paper |
| Gui-world: A dataset for gui-oriented multimodal llm-based agents (2024-06) | Github Paper |
| Guicourse: From general vision language models to versatile gui agents (2024-06) | Github Paper |
| Mobileagentbench: An efficient and user-friendly benchmark for mobile llm agents (2024-06) | Github Paper |
| Videogui: A benchmark for gui automation from instructional videos (2024-06) | Github Paper |
| Visualwebarena: Evaluating multimodal agents on realistic visual web tasks (2024-06) | HomePage Paper |
| Mobile-env: An evaluation platform and benchmark for llm-gui interaction (2024-06) | Github Paper |
| Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments (2024-05) | Github Paper |
| Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? (2024-04) | Github Paper |
| Mmina: Benchmarking multihop multimodal internet agents (2024-04) | HomePage Paper |
| Webarena: A realistic web environment for building autonomous agents (2024-04) | HomePage Paper |
| Webvln: Vision-and-language navigation on websites (2024-03) | Github Paper |
| On the multi-turn instruction following for conversational web agents (2024-02) | Github Paper |
| Assistgui: Task-oriented desktop graphical user interface automation (2024-01) | Github Paper |
| Mind2web: Towards a generalist agent for the web (2023-12) | Github Paper |
| Androidinthewild: A large-scale dataset for android device control (2023-10) | Github Paper |
| Webshop: Towards scalable real-world web interaction with grounded language agents (2023-02) | Github Paper |
| Meta-gui: Towards multi-modal conversational agents on mobile gui (2022-11) | Github Paper |
| A dataset for interactive vision language navigation with unknown command feasibility (2022-08) | Github Paper |
| Websrc: A dataset for web-based structural reading comprehension (2021-11) | Github Paper |
| Mapping natural language instructions to mobile ui action sequences (2020-06) | Github Paper |
| Reinforcement learning on web interfaces using workflow-guided exploration (2018-02) | Github Paper |
| Rico: A mobile app dataset for building data-driven design applications (2017-10) | HomePage Paper |
| World of bits: An open-domain platform for web-based agents (2017-08) | Paper |