# VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)*

Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
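
As a rough illustration of the look-ahead tree search described above, here is a minimal, self-contained sketch of inference-time search over reasoning states. The `propose_steps` and `score_state` functions stand in for LVLM calls, and the beam-style expansion is an illustrative assumption, not the paper's exact algorithm:

```python
# A minimal sketch of inference-time look-ahead tree search over reasoning states.
# `propose_steps` and `score_state` are placeholders for LVLM calls (assumptions).

def propose_steps(state: str, k: int = 3) -> list[str]:
    """Stand-in for sampling k candidate visual-textual next steps from an LVLM."""
    return [f"{state} -> step{i}" for i in range(k)]

def score_state(state: str) -> float:
    """Stand-in for a look-ahead value estimate (e.g. rollout-based) of a state."""
    return float(len(state))  # dummy heuristic for demonstration only

def lookahead_tree_search(root: str, depth: int, beam: int = 2) -> str:
    """Expand the tree level by level, keeping only the `beam` best states."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for s in frontier for c in propose_steps(s)]
        candidates.sort(key=score_state, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]  # highest-scoring reasoning trace

print(lookahead_tree_search("problem statement", depth=3))
```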

## Quick Start

1. Install the dependencies:

   ```bash
   conda create -n visuothink python==3.10 -y
   conda activate visuothink

   pip install -r requirements.txt
   ```
2. Set up the `config.py` file under `visual-navigation/` as follows:

   ```python
   import os

   # Maximum number of agent reasoning steps
   MAX_REPLY = 10

   # Run AutoGen code execution locally rather than in Docker
   os.environ["AUTOGEN_USE_DOCKER"] = "False"

   MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o")
   API_KEY = os.environ.get("OPENAI_API_KEY")

   llm_config = {
       "cache_seed": None,
       "config_list": [{"model": MODEL_NAME, "temperature": 0.0, "api_key": API_KEY}],
   }
   ```
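
   The `AUTOGEN_USE_DOCKER` variable and the shape of `llm_config` follow AutoGen's conventions, so this config is presumably consumed by AutoGen agents. A minimal sketch of that pattern, assuming AutoGen's `AssistantAgent`/`UserProxyAgent` API (the agent names and message below are illustrative, not the repo's actual code):

   ```python
   from autogen import AssistantAgent, UserProxyAgent

   # Assumes config.py is importable from the working directory.
   from config import MAX_REPLY, llm_config

   # Illustrative agent setup; the repo's actual agents live in run_task_nav.py.
   assistant = AssistantAgent(
       name="reasoner",
       llm_config=llm_config,
       max_consecutive_auto_reply=MAX_REPLY,
   )
   user_proxy = UserProxyAgent(
       name="executor",
       human_input_mode="NEVER",
       code_execution_config={"use_docker": False},
   )

   # Kick off a conversation; the proxy executes any code the assistant emits.
   user_proxy.initiate_chat(assistant, message="Solve the navigation task.")
   ```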
3. Run the visual navigation and geometry tasks (a sketch of how these flags might be parsed follows this list):

   - To run visual navigation with VisuoThink:

     ```bash
     python visual-navigation/run_task_nav.py --verbose --visual
     # alternatively, add the --tree_search flag to enable multimodal tree search
     ```

   - To run visual navigation with CoT with Executor:

     ```bash
     python visual-navigation/run_task_nav.py --verbose
     ```

   - To run visual navigation with CoT:

     ```bash
     python visual-navigation/run_task_nav.py --verbose --visual --run_tag cot
     ```

   - To run the geometry tasks with VisuoThink (w/o rollout search):

     ```bash
     python geometry/solver.py
     ```
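
For orientation, here is a minimal sketch of how these command-line flags could be wired up with `argparse`. The flag names come from the commands above, but the parser itself is an assumption, not the repository's actual code:

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Flag names mirror the commands above; help strings and defaults are assumptions.
    parser = argparse.ArgumentParser(description="Run a visual navigation task.")
    parser.add_argument("--verbose", action="store_true",
                        help="Print intermediate reasoning steps.")
    parser.add_argument("--visual", action="store_true",
                        help="Enable visual-textual (multimodal) reasoning.")
    parser.add_argument("--tree_search", action="store_true",
                        help="Enable multimodal tree search at inference time.")
    parser.add_argument("--run_tag", default=None,
                        help="Variant tag, e.g. 'cot' for the CoT baseline.")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)
```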

## Benchmarks

- **(4.27 News)** The Geomverse dataset has been released! See here for more details.

## Citation

Please consider citing our paper and starring this repo if you find them helpful. Thank you!

```bibtex
@misc{wang2025visuothinkempoweringlvlmreasoning,
      title={VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search},
      author={Yikun Wang and Siyin Wang and Qinyuan Cheng and Zhaoye Fei and Liang Ding and Qipeng Guo and Dacheng Tao and Xipeng Qiu},
      year={2025},
      eprint={2504.09130},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.09130},
}
```
