📄 This is our latest survey paper on compositional visual reasoning!
📚 View the survey paper
If you have related papers that should be included in this survey, feel free to open an issue. Contributions are welcome!
This stage focuses on methods where an LLM acts as the central reasoning engine, guided by textual prompts. Visual inputs are typically pre-processed or summarized into text (e.g., captions) before being fed to the LLM, so all reasoning happens in language.
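As a rough, hypothetical sketch of this caption-then-reason pattern (not taken from any specific paper below), the pipeline can be pictured as follows; `caption_image` and `call_llm` are placeholder stubs standing in for a captioning model and an LLM API:

```python
# Minimal sketch: the image is first summarized into text, and a text-only LLM
# then performs all of the reasoning over that summary.

def caption_image(image_path: str) -> str:
    # Placeholder: in practice a captioning/VQA model produces this summary.
    return "A man holds a red umbrella next to a food truck."

def call_llm(prompt: str) -> str:
    # Placeholder: in practice an LLM endpoint generates the chain of thought.
    return "The umbrella suggests rain, so ... Answer: it is likely raining."

def answer(image_path: str, question: str) -> str:
    caption = caption_image(image_path)  # vision -> text
    prompt = (
        "Image description: " + caption + "\n"
        "Question: " + question + "\n"
        "Reason step by step, then give the final answer."
    )
    return call_llm(prompt)  # all reasoning happens in the text domain

print(answer("street.jpg", "What is the weather probably like?"))
```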
- **DDCoT**: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models. NeurIPS 2023 / Code
- **IdealGPT**: Iteratively Decomposing Vision and Language Reasoning via Large Language Models. EMNLP 2023 / Code
- **Modeling Collaborator**: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use. CVPR 2024
- **ChatCaptioner**: ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions. TMLR 2023 / Code
- **Cola**: Large Language Models are Visual Reasoning Coordinators. NeurIPS 2023 / Code
- **MM-CoT**: Multimodal Chain-of-Thought Reasoning in Language Models. TMLR 2024 / Code
- **AdGPT**: Explore Meaningful Advertising with ChatGPT. ACM MM 2025 / Code
- **PromptCap**: Prompt-Guided Task-Aware Image Captioning. ICCV 2023 / Code
- **VCTP**: Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning. AAAI 2024 / Code
- **Finedefics**: Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models. ICLR 2025 / Code
In this tool-enhanced LLM paradigm, the LLM plans over external vision tools or programs. Reasoning typically involves two key components: generating an action (a tool call or program step) from the current state, and transitioning to a new state by executing that action.
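A minimal, illustrative act-and-observe loop under this state/action framing is sketched below; the tool names and the `plan_next_action` stub are hypothetical and do not correspond to any specific system's API:

```python
# Toy tool registry: each tool maps (state, argument) -> new state.
TOOLS = {
    "detect":  lambda state, arg: state + [f"detected({arg})"],
    "caption": lambda state, arg: state + ["caption(a dog on a sofa)"],
    "answer":  lambda state, arg: state + [f"final({arg})"],
}

def plan_next_action(state, question):
    # Placeholder for the LLM planner: maps the observations gathered so far
    # and the question to the next tool call.
    if not state:
        return ("caption", None)
    if len(state) == 1:
        return ("detect", "dog")
    return ("answer", "yes, there is a dog")

def run(question, max_steps=5):
    state = []  # accumulated observations
    for _ in range(max_steps):
        tool, arg = plan_next_action(state, question)  # generate action
        state = TOOLS[tool](state, arg)                # execute -> new state
        if tool == "answer":
            break
    return state[-1]

print(run("Is there a dog in the image?"))
```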
- **ViperGPT**: ViperGPT: Visual Inference via Python Execution for Reasoning. ICCV 2023 / Code
- **Chameleon**: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. NeurIPS 2023 / Code
- **Visprog**: Visual Programming: Compositional Visual Reasoning Without Training. CVPR 2023 / Code
- **Visual ChatGPT**: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv 2023 / Code
- **HuggingGPT**: HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. NeurIPS 2023 / Code
- **GPT4Tools**: GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. NeurIPS 2023 / Code
- **InternGPT**: InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language. arXiv 2023 / Code
- **VIoTGPT**: VIoTGPT: Learning to Schedule Vision Tools Towards Intelligent Video Internet of Things. AAAI 2025 / Code
- **MM-REACT**: MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. arXiv 2023 / Code
- **VisRep**: Self-training Large Language Models for Improved Visual Program Synthesis with Visual Reinforcement. CVPR 2024
- **CRAFT**: CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets. ICLR 2024 / Code
- **CLOVA**: CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update. CVPR 2024 / Code
- **HYDRA**: HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning. ECCV 2024 / Code
- **ContextualCoder**: ContextualCoder: Adaptive In-context Prompting for Programmatic Visual Question Answering. TMM 2025
- **ViUniT**: Visual Unit Tests for More Robust Visual Programming. CVPR 2025 / Code
- **SYNAPSE**: SYNAPSE: SYmbolic Neural-Aided Preference Synthesis Engine. AAAI 2025 / Code
- **NAVER**: NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning. ICCV 2025 / Code
- **DWIM**: DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning. ICCV 2025 / Code / Website
- **LEFT**: What’s Left? Concept Grounding with Logic-Enhanced Foundation Models. NeurIPS 2023 / Code
Tool-enhanced VLMs extend tool-enhanced LLMs by replacing the language model with a vision-language model, enabling direct visual interaction. Unlike tool-enhanced LLMs, planners here take raw images as input, reducing information loss and improving efficiency.
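A rough sketch of this difference is given below: the planner is a VLM that receives the raw image itself and only optionally invokes tools such as cropping. `vlm_plan` and `crop` are hypothetical stubs, not any specific system's interface:

```python
from typing import Optional, Tuple

def vlm_plan(image, question: str, evidence: list) -> Tuple[str, Optional[tuple]]:
    # Placeholder for the VLM planner: it sees pixels rather than a caption
    # and returns either a tool call or a decision to answer.
    if not evidence:
        return ("crop", (120, 40, 360, 280))  # zoom into a region of interest
    return ("answer", None)

def crop(image, box):
    # Placeholder tool: returns the cropped region as new visual evidence.
    return f"crop{box} of {image}"

def solve(image, question: str) -> str:
    evidence = []
    while True:
        action, arg = vlm_plan(image, question, evidence)
        if action == "crop":
            evidence.append(crop(image, arg))  # tool output fed back to the VLM
        else:
            return "answer grounded in " + ", ".join(evidence)

print(solve("photo.jpg", "What is written on the sign?"))
```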
- **Image-of-Thought**: Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models. arXiv 2024
- **P2G**: Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models. arXiv 2024
- **LLaVA-Plus**: LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents. ICLR 2024 / Code
- **VIREO**: From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis. EMNLP 2024 / Code
- **OpenThinkIMG**: OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning. arXiv 2025 / Code
- **SKETCHPAD**: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models. NeurIPS 2024 / Code
- **VisionLLM v2**: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. NeurIPS 2024 / Code
- **VITRON**: VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing. NeurIPS 2024 / Code
- **Syn**: Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA. CVPR 2024
- **VTool-R1**: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use. arXiv 2025 / Code
- **Self-Imagine**: Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination. arXiv 2024 / Code
- **Creative Agents**: Creative Agents: Empowering Agents with Imagination for Creative Tasks. UAI 2025 / Code
- **CoT-VLA**: CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models. CVPR 2025 / Website
These models carry out multi-step reasoning in a single forward pass without external tools, explicitly exposing intermediate reasoning and perceptual states before the final answer.
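To make the idea concrete, a toy sketch is shown below: the model is prompted (or trained) to emit its intermediate perception and reasoning before the answer, and the caller simply parses one generation. The tag names and the `vlm_generate` stub are hypothetical and do not reflect any particular model's output format:

```python
import re

def vlm_generate(image, question: str) -> str:
    # Placeholder for a single forward pass of a reasoning VLM.
    return (
        "<perception>two cats, one on a chair, one under the table</perception>"
        "<reasoning>the question asks for a count; 1 + 1 = 2</reasoning>"
        "<answer>2</answer>"
    )

def answer_with_trace(image, question: str):
    output = vlm_generate(image, question)  # no external tools involved
    trace = {
        tag: re.search(f"<{tag}>(.*?)</{tag}>", output).group(1)
        for tag in ("perception", "reasoning", "answer")
    }
    return trace["answer"], trace  # final answer plus the exposed intermediate states

ans, trace = answer_with_trace("cats.jpg", "How many cats are there?")
print(ans, trace["reasoning"])
```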
- **LLaVA-CoT**: LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. arXiv 2024 / Code
- **CCoT**: Compositional Chain-of-Thought Prompting for Large Multimodal Models. CVPR 2024 / Code
- **PaLI**: Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models. CVPR 2024
- **VoCoT**: VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models. NAACL 2025 / Code
- **MM-GCoT**: Grounded Chain-of-Thought for Multimodal Large Language Models. arXiv 2025 / Code
- **LLaVA-Aurora**: Perception Tokens Enhance Visual Reasoning in Multimodal Language Models. CVPR 2025 / Code
- **DeepPerception**: DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding. arXiv 2025 / Code
- **CoReS**: CoReS: Orchestrating the Dance of Reasoning and Segmentation. ECCV 2024 / Code
- **G1**: G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning. arXiv 2025 / Code
- **Vision-R1**: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv 2025 / Code
- **Ground-R1**: Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning. arXiv 2025
- **Griffon-R**: Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models. arXiv 2025 / Code
- **VISTAR**: Visually Interpretable Subtask Reasoning for Visual Question Answering. CVPR 2025 / Code
- **CoF**: Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL. arXiv 2025 / Code
- **LATTE**: LATTE: Learning to Think with Vision Specialists. EMNLP 2025 / Code
- **VisRL**: VisRL: Intention-Driven Visual Perception via Reinforced Reasoning. ICCV 2025 / Code
These models incorporate higher-order cognitive mechanisms such as planning, memory, visual operations, imagination, textual feedback, and visual evidence gathering.
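A highly simplified sketch of such a loop (planning, memory, and a zoom-style visual operation, in the spirit of guided visual search methods but not reproducing any single paper's algorithm) is given below; `propose_region`, `zoom`, and `is_sufficient` are hypothetical stubs:

```python
def propose_region(image, question, memory):
    # Placeholder planner: picks the next region to inspect, conditioned on
    # the evidence already stored in memory.
    return (0, 0, 256, 256) if not memory else (256, 0, 512, 256)

def zoom(image, box):
    # Placeholder visual operation: returns a higher-resolution crop.
    return f"zoomed{box}"

def is_sufficient(memory):
    # Placeholder check: stop once enough visual evidence has been gathered.
    return len(memory) >= 2

def agent(image, question, max_steps=4):
    memory = []  # persistent store of visual evidence
    for _ in range(max_steps):
        box = propose_region(image, question, memory)  # planning
        memory.append(zoom(image, box))                # visual operation
        if is_sufficient(memory):                      # feedback / verification
            break
    return f"answer derived from {len(memory)} inspected regions"

print(agent("scene.jpg", "Where are the keys?"))
```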
- **V\***: V\*: Guided Visual Search as a Core Mechanism in Multimodal LLMs. CVPR 2024 / Code
- **DC2**: Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models. AAAI 2025 / Code
- **ZoomEye**: ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration. arXiv 2024 / Code
- **FAST**: Visual Agents as Fast and Slow Thinkers. ICLR 2025 / Code
- **CogCoM**: CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations. ICLR 2025 / Code
- **GeReA**: GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering. arXiv 2024 / Code
- **Insight-V**: Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models. CVPR 2025 / Code
- **Argus**: Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought. CVPR 2025
- **Mirage**: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens. arXiv 2025 / Code
- **FOREWARN**: From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment. RSS 2025 / Code
- **VisCoT**: Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning. NeurIPS 2024 / Code
- **ViGoRL**: Grounded Reinforcement Learning for Visual Reasoning. arXiv 2025 / Code
- **SIFThinker**: SIFThinker: Spatially-Aware Image Focus for Visual Reasoning. arXiv 2025 / Code
If you find this work useful for your research, please consider citing it.
@misc{ke2025explainanswersurveycompositional,
title={Explain Before You Answer: A Survey on Compositional Visual Reasoning},
author={Fucai Ke and Joy Hsu and Zhixi Cai and Zixian Ma and Xin Zheng and Xindi Wu and Sukai Huang and Weiqing Wang and Pari Delir Haghighi and Gholamreza Haffari and Ranjay Krishna and Jiajun Wu and Hamid Rezatofighi},
year={2025},
eprint={2508.17298},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.17298},
}