This repo provides the public preview of V-Droid (https://arxiv.org/abs/2503.15937), a verifier-driven mobile GUI agent. Unlike previous mobile agents that use Large Language Models (LLMs) as generators to directly produce actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: a discretized action space coupled with a prefilling-only workflow to accelerate verification, pair-wise progress preference training to significantly enhance the verifier's decision-making capability, and a scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves a low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision making.
- ✅ Paper link: https://arxiv.org/abs/2503.15937
- ✅ Model weights: https://huggingface.co/V-Droid/V-Droid-8B-0323
V-Droid in the following demos is hosted on 2x 4090 GPUs; the videos are presented without acceleration.
In V-Droid, we propose the verifier-driven approach and the corresponding workflow for GUI agents as follows:
- Extracting candidate actions from the UI and supplementing them with default actions;
- Constructing a verification prompt from the template for each candidate action;
- Scoring the candidates with the verifier in a batch, with prefix caching;
- Completing and executing the selected action;
- Updating the working memory.

For more details, please refer to our code. A minimal sketch of this loop is shown below.
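The following sketch illustrates the loop under assumed interfaces: `score_batch` and `execute` stand in for the verifier and the device controller, and all helper names are hypothetical, not the repo's actual APIs.

```python
# Minimal, illustrative sketch of the verifier-driven loop.
# All helper names and data shapes are hypothetical, not the repo's real APIs.

DEFAULT_ACTIONS = ["navigate_back", "scroll down", "wait", "mark task complete"]

def extract_candidate_actions(ui_state):
    # Placeholder: derive clickable/typable actions from the UI tree.
    return [f"click {e}" for e in ui_state["clickables"]]

def build_verification_prompt(task, memory, ui_state, action):
    # One shared template per step: candidates differ only in the action
    # suffix, so the serving engine can cache the common prompt prefix.
    return (f"Task: {task}\nHistory: {memory}\nScreen: {ui_state['text']}\n"
            f"Candidate action: {action}\nIs this action correct?")

def step(task, ui_state, memory, score_batch, execute):
    # 1) Extract actions from the UI and supplement default actions.
    candidates = extract_candidate_actions(ui_state) + DEFAULT_ACTIONS
    # 2) Build one verification prompt per candidate action.
    prompts = [build_verification_prompt(task, memory, ui_state, a) for a in candidates]
    # 3) Score all candidates in one prefilling-only batch.
    scores = score_batch(prompts)
    # 4) Complete and execute the highest-scoring action.
    best = candidates[scores.index(max(scores))]
    execute(best)
    # 5) Update the working memory.
    memory.append(best)
    return best
```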
Setup AndroidWorld Environment
- Download Android Studio here.
- Create an Android Virtual Device (AVD) by following these instructions. For hardware, select Pixel 6; for the system image, select Tiramisu (API Level 33); and name the AVD AndroidWorldAvd. Watch the setup video.
- Launch the Android Emulator from the command line, not through the Android Studio UI, with the `-grpc 8554` flag, which is needed for communication with the accessibility forwarding app:

  ```bash
  # Typically the emulator binary is located in ~/Android/Sdk/emulator/emulator or
  # ~/Library/Android/sdk/emulator/emulator
  EMULATOR_NAME=AndroidWorldAvd  # From the previous step
  ~/Library/Android/sdk/emulator/emulator -avd $EMULATOR_NAME -no-snapshot -grpc 8554
  ```
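  To confirm the emulator's gRPC endpoint is reachable before starting the agent, a quick check like the following can help (a sanity-check snippet assuming the default localhost:8554 endpoint; it is not part of the repo):

  ```python
  import socket

  # Check whether the emulator's gRPC port (8554) is accepting connections.
  try:
      socket.create_connection(("localhost", 8554), timeout=5).close()
      print("Emulator gRPC endpoint is reachable.")
  except OSError as err:
      print(f"Cannot reach localhost:8554 -- is the emulator running with -grpc 8554? ({err})")
  ```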
- [Optional] It's recommended to use conda, which you can download here:

  ```bash
  conda create -n android_world python=3.11.8
  conda activate android_world
  conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
  conda install -y numpy pandas
  ```
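  After creating the environment, a quick sanity check that PyTorch sees your GPUs can save debugging time later (an illustrative snippet, not part of the repo):

  ```python
  import torch

  # Verify the CUDA build of PyTorch is installed and the GPUs are visible.
  print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
  for i in range(torch.cuda.device_count()):
      print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
  ```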
- Install dependencies. Note: Python 3.11 or above is required.

  ```bash
  pip install -r requirements.txt
  ```
- Modify vLLM.
  Navigate to vllm/model_executor/layers/sampler.py and add the following at line 317:

  ```python
  for val, lst in zip(logits, sample_logprobs):
      for d in lst:
          for k in d.keys():
              d[k].logprob = val
  ```

  (See vllm-project/vllm#11397 for more explanation.)
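  With this patch in place, scoring a candidate action reduces to a single-token generation whose returned logprob entries carry the values taken from `logits`. The sketch below illustrates the idea; the model id, prompt, and the yes/no answer framing are assumptions for illustration, not the repo's exact setup:

  ```python
  from vllm import LLM, SamplingParams

  # Assumption: serving the verifier from this HF repo id directly; in practice
  # the LoRA adapter is applied to its base model (see the LoRA step below).
  llm = LLM(model="V-Droid/V-Droid-8B-0323")
  params = SamplingParams(max_tokens=1, logprobs=20)

  prompt = "<verification prompt for one candidate action>\nIs this action correct? Answer:"
  out = llm.generate([prompt], params)[0].outputs[0]

  # After the sampler.py patch, each entry's .logprob field holds the value
  # taken from `logits` rather than a normalized log-probability.
  for token_id, info in out.logprobs[0].items():
      print(token_id, info.decoded_token, info.logprob)
  ```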
- Add model provider APIs as environment variables.
  Three API providers are supported: the Gemini GCP service, OpenAI and its compatible APIs, and Azure OpenAI services. You may configure any of these based on your preference. These APIs are only used for building the working memory; V-Droid can also build the working memory without these third-party APIs.
  ```bash
  # Add to .bashrc.

  # Use the Gemini GCP service, which requires an API key.
  export GCP_API_KEY=

  # Use OpenAI-compatible APIs, including OpenAI, Qwen, and DeepSeek.
  export OPENAI_ENDPOINT=
  export OPENAI_MODEL_NAME=
  export OPENAI_API_VERSION=
  export OPENAI_API_KEY=

  # Use Azure OpenAI services.
  export AZURE_OPENAI_API_KEY=
  export AZURE_OPENAI_MODEL_NAME=
  export AZURE_OPENAI_API_VERSION=
  export AZURE_OPENAI_ENDPOINT=
  ```
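  Which provider gets used can then be inferred from whichever variables are set; the selection logic below is illustrative only, not the repo's actual code:

  ```python
  import os

  # Illustrative provider selection based on which env vars are configured.
  def pick_provider():
      if os.environ.get("AZURE_OPENAI_API_KEY"):
          return "azure-openai"
      if os.environ.get("OPENAI_API_KEY"):
          return "openai-compatible"
      if os.environ.get("GCP_API_KEY"):
          return "gemini"
      return None  # fall back: build the working memory without third-party APIs

  print(pick_provider())
  ```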
- Download the LoRA weights for the V-Droid model.
  The V-Droid model weights are available at https://huggingface.co/V-Droid/V-Droid-8B-0323.
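  One way to serve the adapter with vLLM's LoRA support is sketched below; the base-model id is an assumption (verify it against the adapter's config), and the prompt is a placeholder:

  ```python
  from huggingface_hub import snapshot_download
  from vllm import LLM, SamplingParams
  from vllm.lora.request import LoRARequest

  adapter_path = snapshot_download("V-Droid/V-Droid-8B-0323")
  # Assumption: Llama-3.1-8B-Instruct as the base model; check the adapter config.
  llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

  out = llm.generate(["<verification prompt>"],
                     SamplingParams(max_tokens=1, logprobs=20),
                     lora_request=LoRARequest("v-droid", 1, adapter_path))
  print(out[0].outputs[0].text)
  ```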
- Launch the emulator and run the evaluation tasks:

  ```bash
  emulator -avd AndroidWorldAvd -no-window -no-snapshot -grpc 8554
  bash main.sh
  ```
Training

You may use the following to train the LoRA module in V-Droid. We provide several training pairs to use.
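For intuition, a pair-wise progress preference record might look like the sketch below; the field names and format are illustrative assumptions, not the repo's actual training data schema.

```python
# Hypothetical pair-wise progress preference record (illustrative only;
# the repo's actual schema may differ).
pair = {
    "task": "Turn on Wi-Fi in Settings",
    "state": "<flattened UI representation of the current screen>",
    "chosen": {"action": "click 'Wi-Fi' toggle", "verdict": "Yes"},  # makes progress
    "rejected": {"action": "click 'Bluetooth'", "verdict": "No"},    # no progress
}
```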
To start training:

```bash
bash train.sh
```

If you use this repo, please cite our paper:
```bibtex
@article{dai2025advancingmobileguiagents,
title={Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment},
author={Gaole Dai and Shiqi Jiang and Ting Cao and Yuanchun Li and Yuqing Yang and Rui Tan and Mo Li and Lili Qiu},
year={2025},
eprint={2503.15937},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2503.15937},
}
```