📢 [Project Page] [V2 Blog Post] [Models V2] [Models V1.5] [HuggingFace Space Demo]
OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
- [2025/3] We support local logging of trajectories so that you can use OmniParser + OmniTool to build a training data pipeline for your favorite agent in your domain. [Documentation WIP]
- [2025/3] We are gradually adding multi-agent orchestration and improving the user interface in OmniTool for a better experience.
- [2025/2] We release OmniParser V2 checkpoints. Watch Video
- [2025/2] We introduce OmniTool: Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports the following large language models out of the box: OpenAI (4o/o1/o3-mini), DeepSeek (R1), Qwen (2.5VL), and Anthropic Computer Use. Watch Video
- [2025/1] V2 is coming. We achieve a new state-of-the-art result of 39.5% on the new grounding benchmark Screen Spot Pro with OmniParser V2 (will be released soon)! Read more details here.
- [2024/11] We release an updated version, OmniParser V1.5, which features 1) more fine-grained/small-icon detection and 2) prediction of whether each screen element is interactable or not. Examples are in demo.ipynb.
- [2024/10] OmniParser was the #1 trending model on the Hugging Face model hub (starting 10/29/2024).
- [2024/10] Feel free to check out our demo on Hugging Face Space! (stay tuned for OmniParser + Claude Computer Use)
- [2024/10] Both the interactive region detection model and the icon functional description model are released! Hugging Face models
- [2024/09] OmniParser achieves the best performance on Windows Agent Arena!
First clone the repo, and then install the environment:
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt
Ensure you have the V2 weights downloaded in the weights folder (make sure the caption weights folder is called icon_caption_florence). If not, download them with:
# download the model checkpoints to local directory OmniParser/weights/
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
mv weights/icon_caption weights/icon_caption_florence
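If you prefer to fetch the weights from Python instead of the CLI, a roughly equivalent sketch using the huggingface_hub client (same files, same rename step; huggingface_hub must be installed) is:
# Download the same checkpoint files with huggingface_hub and rename the caption folder.
from pathlib import Path
from huggingface_hub import hf_hub_download

files = [
    "icon_detect/train_args.yaml", "icon_detect/model.pt", "icon_detect/model.yaml",
    "icon_caption/config.json", "icon_caption/generation_config.json", "icon_caption/model.safetensors",
]
for f in files:
    hf_hub_download(repo_id="microsoft/OmniParser-v2.0", filename=f, local_dir="weights")

# The demos and server expect the caption weights under icon_caption_florence.
Path("weights/icon_caption").rename("weights/icon_caption_florence")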
Note
- When started without local weights, the FastAPI server auto-installs the YOLO detector, the Florence2 fine-tuned weights, and the Florence2 processor/tokenizer files and local custom code (configuration/modeling/processing) required by trust_remote_code. This allows fully local loading without additional network access at runtime.
We put together a few simple examples in the demo.ipynb.
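If you just want to sanity-check that the icon_detect weights load before opening the notebook, here is a minimal sketch using the ultralytics YOLO API (this exercises only the detector, not the full detection + captioning pipeline shown in demo.ipynb; the sample image path is an assumption):
# Quick check of the icon detection weights only (not the full OmniParser pipeline).
from ultralytics import YOLO

model = YOLO("weights/icon_detect/model.pt")  # fine-tuned interactable-region detector
results = model.predict(source="imgs/omni3.jpg", conf=0.05)  # low threshold, mirroring --BOX_TRESHOLD 0.05
print(f"detected {len(results[0].boxes)} candidate UI elements")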
To run the Gradio demo, simply run:
python gradio_demo.py
An HTTP API wrapper is available for running OmniParser as a service with a response format compatible with the Minecraft Action Model loop.
Run the server (models auto-install on first run if missing):
# with default model paths
python fastapi_server.py --device cuda --BOX_TRESHOLD 0.05 --host 0.0.0.0 --port 8510
# or with explicit model paths
python fastapi_server.py --som_model_path weights/icon_detect/model.pt --caption_model_name florence2 --caption_model_path weights/icon_caption_florence --device cuda --BOX_TRESHOLD 0.05 --host 0.0.0.0 --port 8510
Parse an image (multipart or JSON):
curl -F "image=@OmniParser/imgs/omni3.jpg" http://localhost:8510/parse
curl -X POST http://localhost:8510/parse \
-H "Content-Type: application/json" \
-d '{ "image_base64": "data:image/png;base64,..." }'
API endpoints
- POST /parse: parse an image; returns { annotated_image_base64, annotation_list }
- GET /health: readiness probe
- GET /config: current model configuration
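For reference, a small Python client sketch for the endpoints above, assuming the documented response fields and the requests library:
# Hypothetical client for the /parse endpoint; field names follow the API list above.
import base64
import requests

SERVER = "http://localhost:8510"

# Readiness check before sending work.
requests.get(f"{SERVER}/health", timeout=5).raise_for_status()

# Multipart upload of a screenshot (field name matches the curl example).
with open("imgs/omni3.jpg", "rb") as f:
    resp = requests.post(f"{SERVER}/parse", files={"image": f}, timeout=120)
resp.raise_for_status()
data = resp.json()

# Save the annotated screenshot and print the parsed elements.
b64 = data["annotated_image_base64"].split(",")[-1]  # strip a data: URI prefix if present
with open("annotated.png", "wb") as out:
    out.write(base64.b64decode(b64))
for element in data["annotation_list"]:
    print(element)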
Notes
- If the specified som_model_path or caption_model_path does not exist, the server downloads the appropriate files from microsoft/OmniParser-v2.0 via Hugging Face Hub and updates the effective paths in /config.
- Requires network access for first-run downloads.
If you prefer uv over conda/pip, the project is configured with optional dependency groups for CPU and GPU installs via pyproject.toml extras: cpu and gpu.
Prereqs
- Python 3.10+ and uv installed (pipx install uv, or see the uv docs).
Create and activate a virtualenv
uv venv
source .venv/bin/activate
Base install (no heavy frameworks)
uv sync
CPU stack
uv pip install '.[cpu]'
Optionally lock then sync
uv lock --extra cpu
uv sync
GPU stack (CUDA)
- Choose your CUDA tag: cu118, cu121, or cu122 (match your driver/runtime).
- Install the GPU extra with the proper indexes so PyTorch and Paddle use CUDA wheels:
uv pip install '.[gpu]' \
--extra-index-url https://download.pytorch.org/whl/cu121 \
-f https://www.paddlepaddle.org.cn/whl/cu121
Alternative two-step if you prefer to pin separately
uv pip install --extra-index-url https://download.pytorch.org/whl/cu121 torch torchvision
uv pip install -f https://www.paddlepaddle.org.cn/whl/cu121 paddlepaddle-gpu
uv pip install .
Verify GPU availability
uv run python -c "import torch; print('torch cuda:', torch.cuda.is_available())"
uv run python - <<'PY'
import paddle
print('paddle cuda:', paddle.device.is_compiled_with_cuda())
print('device:', paddle.device.get_device())
PY
Notes
- uv lock --extra gpu records the dependency set, but wheel indexes come from the install command; keep the PyTorch --extra-index-url and Paddle -f flags in your deployment docs/scripts.
- Replace cu121 in the URLs with your CUDA version as needed.
- Thanks to PR #332 for clarifying which dependencies to pin in the GPU install instructions.
For the model checkpoints on the Hugging Face model hub, please note that the icon_detect model is under the AGPL license, inherited from the original YOLO model, while icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in each model's folder: https://huggingface.co/microsoft/OmniParser.
Our technical report can be found here. If you find our work useful, please consider citing it:
@misc{lu2024omniparserpurevisionbased,
title={OmniParser for Pure Vision Based GUI Agent},
author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
year={2024},
eprint={2408.00203},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.00203},
}