ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Zoom Eye enables MLLMs to (a) answer the question directly when the visual information is adequate, (b) zoom in gradually for a closer examination, and (c) zoom out to the previous view and explore other regions if the desired information is not initially found.
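The following is a minimal, hypothetical sketch of this exploration loop in Python. The helper names (confidence_is_adequate, relevance_score, split_into_subviews, answer) are placeholders for illustration only, not the actual ZoomEye API.

# A minimal sketch of the zoom-in / zoom-out tree exploration described above.
# All helper functions here are hypothetical placeholders, not the real implementation.
def zoom_eye_search(mllm, view, question, max_depth=3):
    # (a) Answer directly if the current view already provides enough information.
    if max_depth == 0 or confidence_is_adequate(mllm, view, question):
        return answer(mllm, view, question)
    # (b) Zoom in: rank candidate sub-views and examine the most promising first.
    candidates = sorted(split_into_subviews(view),
                        key=lambda v: relevance_score(mllm, v, question),
                        reverse=True)
    for sub_view in candidates:
        result = zoom_eye_search(mllm, sub_view, question, max_depth - 1)
        if result is not None:
            return result
    # (c) Zoom out: nothing found in this branch; let the caller explore other regions.
    return None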
2025.08.28 We have released an updated version of our paper on arXiv, which includes the results of Qwen2.5-VL and InternVL2.5, along with a thorough comparison against a wide range of baselines.
2025.08.21 Zoom Eye has been accepted by the EMNLP 2025 Main Conference. We will release an updated version of the paper soon, which includes more comprehensive evaluations on various Multimodal Large Language Models (MLLMs) as well as a detailed ablation study. Stay tuned~
2025.01.01 We released the Project Page of ZoomEye; welcome to visit~
2025.01.01 We released the evaluation code for MME-RealWorld.
2024.11.30 We released the evaluation code for V* Bench and HR-Bench.
2024.11.25 We released the arXiv paper.
This project is built on top of LLaVA-NeXT. If you encounter unknown errors during installation, you can refer to the issues and solutions in that repository.
git clone https://github.com/om-ai-lab/ZoomEye.git
cd ZoomEye
conda create -n zoom_eye python=3.10 -y
conda activate zoom_eye
pip install --upgrade pip # Enable PEP 660 support.
pip install -e ".[train]"
In our work, we implement Zoom Eye with the LLaVA-v1.5 and LLaVA-OneVision (OV) series. You can download these checkpoints before running, or they will be downloaded automatically when the from_pretrained method in transformers is executed.
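For example, one way to pre-download a checkpoint is via huggingface_hub (a sketch; it assumes the checkpoint is hosted on the Hugging Face Hub under the repo id used in the demo command below):

# Optional: pre-download a checkpoint so the first run does not block on the download.
# The repo id below is the one used in the demo command further down.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="lmms-lab/llava-onevision-qwen2-7b-ov")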
The core evaluation data that will be used (including V* Bench and HR-Bench) has been packaged together, and the download link is provided here. After downloading, please unzip it; its path is referred to as <anno path>.
[Optional] If you want to evaluate Zoom Eye on the MME-RealWorld benchmark, you can follow the instructions in this repository to download the images and extract them to the <anno path>/mme-realworld directory. In addition, place the annotation_mme-realworld.json file from this link into <anno path>/mme-realworld.
The folder tree should look like this:
zoom_eye_data
├── hr-bench_4k
│   ├── annotation_hr-bench_4k.json
│   └── images/
│       ├── some.jpg
│       └── ...
├── hr-bench_8k
│   ├── annotation_hr-bench_8k.json
│   └── images/
│       ├── some.jpg
│       └── ...
├── vstar
│   ├── annotation_vstar.json
│   ├── direct_attributes/
│   │   ├── some.jpg
│   │   └── ...
│   └── relative_positions/
│       ├── some.jpg
│       └── ...
└── mme-realworld
    ├── annotation_mme-realworld.json
    ├── AutonomousDriving/
    ├── MME-HD-CN/
    ├── monitoring_images/
    ├── ocr_cc/
    ├── remote_sensing/
    └── ...
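As a quick sanity check (a sketch, not part of the repository), you can verify that the expected annotation files are in place before running the evaluation scripts:

# Verify that the expected annotation files exist under your <anno path>.
# mme-realworld is optional and will simply be reported as MISSING if you skipped it.
import os

anno_path = "zoom_eye_data"  # replace with your actual <anno path>
for bench in ["vstar", "hr-bench_4k", "hr-bench_8k", "mme-realworld"]:
    anno_file = os.path.join(anno_path, bench, f"annotation_{bench}.json")
    print(f"{anno_file}: {'found' if os.path.isfile(anno_file) else 'MISSING'}")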
We provide a demo file of Zoom Eye that accepts any input image-question pair.
python ZoomEye/demo.py \
--model-path lmms-lab/llava-onevision-qwen2-7b-ov \
--input_image demo/demo.jpg \
--question "What is the color of the soda can?"and the zoomed views of Zoom Eye will be saved into the demo folder.
We also provide a Gradio demo: run the script below and open http://127.0.0.1:7860/ in your browser.
python gdemo_gradio.py
To evaluate Zoom Eye on V* Bench:
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
vstar
# Get the result
python ZoomEye/eval/eval_results_vstar.py --answers-file ZoomEye/eval/answers/vstar/<mllm model base name>/merge.jsonl
The <mllm model> refers to one of the MLLM checkpoints mentioned above, and the <anno path> is the path of the evaluation data.
If you don't have a multi-GPU environment, you can set CUDA_VISIBLE_DEVICES=0.
To evaluate Zoom Eye on HR-Bench 4K:
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_4k
# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_4k/<mllm model base name>/merge.jsonl
To evaluate Zoom Eye on HR-Bench 8K:
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
hr-bench_8k
# Get the result
python ZoomEye/eval/eval_results_hr-bench.py --answers-file ZoomEye/eval/answers/hr-bench_8k/<mllm model base name>/merge.jsonl
To obtain the direct-answer results of the MLLM (answering without zooming), pass the --direct-answer flag:
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
python ZoomEye/eval/perform_zoom_eye.py \
--model-path <mllm model> \
--annotation_path <anno path> \
--benchmark <bench name> \
--direct-answer
# Get the result
python ZoomEye/eval/eval_results_{vstar/hr-bench}.py --answers-file ZoomEye/eval/answers/<bench name>/<mllm model base name>/direct_answer.jsonl
To evaluate Zoom Eye on MME-RealWorld:
# After executing this script, the results will be saved in the answers dir: ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash ZoomEye/eval/perform_zoom_eye.sh \
<mllm model> \
<anno path> \
mme-realworld
# Get the result
python ZoomEye/eval/eval_results_mme-realworld.py --answers-file ZoomEye/eval/answers/mme-realworld/<mllm model base name>/merge.jsonl
If you are intrigued by multimodal large language models and agent technologies, we invite you to delve deeper into our other research endeavors:
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer (EMNLP 2024)
GitHub Repository
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI 2024)
GitHub Repository
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
GitHub Repository
If you find this repository helpful to your research, please cite our paper:
@article{shen2024zoomeye,
title={ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration},
author={Shen, Haozhan and Zhao, Kangjia and Zhao, Tiancheng and Xu, Ruochen and Zhang, Zilun and Zhu, Mingwei and Yin, Jianwei},
journal={arXiv preprint arXiv:2411.16044},
year={2024}
}