👨‍💻 Awesome Compositional Visual Reasoning

📄 This is our latest survey paper on compositional visual reasoning!
📚 View the survey paper: https://arxiv.org/abs/2508.17298

Teaser Diagram


📬 Contribute to This List

If you have related papers that should be included in this survey, feel free to open an issue. Contributions are welcome!

📘 Stages Overview

Roadmap of Compositional Visual Reasoning Models

Timeline Diagram


Key Shift from Monolithic Reasoning to Compositional Reasoning

Relationship Diagram


🔷 Stage I: Prompt-Enhanced LLMs

This stage focuses on methods where an LLM acts as the central reasoning engine, guided by textual prompts. Visual inputs are typically pre-processed or summarized into text before being fed to the LLM.

Stage I Diagram
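
A minimal sketch of this caption-then-reason pattern, assuming two hypothetical stand-ins: `caption_image` (a captioner such as BLIP-2) and `call_llm` (a text-only LLM). Neither is the API of any paper listed below.

```python
# Hypothetical sketch of the Stage I pipeline: the image is reduced to text
# first, and a text-only LLM does all of the reasoning. `caption_image` and
# `call_llm` are stand-ins, not the API of any listed paper.

def caption_image(image_path: str) -> str:
    """Stand-in for a captioner (e.g., BLIP-2) that summarizes the image."""
    return "A man holds a red umbrella next to a yellow taxi."

def call_llm(prompt: str) -> str:
    """Stand-in for a text-only LLM completion call."""
    return "The caption says the umbrella is red. Answer: red"

def answer_with_llm(image_path: str, question: str) -> str:
    # 1) Pre-process the visual input into text before the LLM sees it.
    caption = caption_image(image_path)
    # 2) The LLM reasons purely over text: caption + question.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Reason step by step, then state the answer."
    )
    return call_llm(prompt)

print(answer_with_llm("street.jpg", "What color is the umbrella?"))
```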

🔹 Category 1: Task Decomposition Followed by Visual Perception

  • DDCoT
    Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
    NeurIPS 2023 / Code

  • IdealGPT
    Iteratively Decomposing Vision and Language Reasoning via Large Language Models
    EMNLP 2023 / Code

  • Modeling Collaborator
    Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
    CVPR 2024

  • ChatCaptioner
    ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
    TMLR 2023 / Code


🔹 Category 2: Perceptual Grounding Prior to Reasoning

  • Cola
    Large Language Models are Visual Reasoning Coordinators
    NeurIPS 2023 / Code

  • MM-CoT
    Multimodal Chain-of-Thought Reasoning in Language Models
    TMLR 2024 / Code

  • AdGPT
    Explore Meaningful Advertising with ChatGPT
    ACM MM 2025 / Code

  • PromptCap
    Prompt-Guided Task-Aware Image Captioning
    ICCV 2023 / Code

  • VCTP
    Visual Chain-of-Thought Prompting for Knowledge-Based Visual Reasoning
    AAAI 2024 / Code

  • Finedefics
    Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
    ICLR 2025 / Code


🔷 Stage II: Tool-Enhanced LLMs

In this stage, an LLM plans over external vision tools or generates executable programs. The paradigm typically involves two key components: generating actions from the current state, and transitioning between states by executing those actions.

Stage II/III Diagram
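
A minimal sketch of the action/state loop described above, with `propose_action` standing in for the LLM planner and `execute` for a vision-tool dispatcher; both are hypothetical placeholders, not the method of any single paper below.

```python
# Hypothetical sketch of the action/state loop: an LLM planner proposes the
# next tool call from the current state, and executing the call transitions
# the state. `propose_action` and `execute` are illustrative stand-ins.

from dataclasses import dataclass, field

@dataclass
class State:
    question: str
    observations: list = field(default_factory=list)  # tool outputs so far

def propose_action(state: State) -> dict:
    """Stand-in for the LLM planner: current state -> next action."""
    if not state.observations:
        return {"tool": "detect", "args": {"query": "umbrella"}}
    return {"tool": "finish", "args": {"answer": "yes, one umbrella"}}

def execute(action: dict) -> str:
    """Stand-in for dispatching a vision tool (detector, OCR, captioner...)."""
    return f"{action['tool']}({action['args']}) -> 1 region found"

def run(question: str, max_steps: int = 5) -> str:
    state = State(question)
    for _ in range(max_steps):
        action = propose_action(state)       # action from current state
        if action["tool"] == "finish":       # planner decides to stop
            return action["args"]["answer"]
        state.observations.append(execute(action))  # state transition
    return "no answer within step budget"

print(run("Is there an umbrella in the image?"))
```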

  • ViperGPT
    ViperGPT: Visual Inference via Python Execution for Reasoning
    ICCV 2023 / Code

  • Chameleon
    Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
    NeurIPS 2023 / Code

  • VisProg
    Visual Programming: Compositional Visual Reasoning Without Training
    CVPR 2023 / Code

  • Visual ChatGPT
    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
    arXiv 2023 / Code

  • HuggingGPT
    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
    NeurIPS 2023 / Code

  • GPT4Tools
    GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction
    NeurIPS 2023 / Code

  • InternGPT
    InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
    arXiv 2023 / Code

  • VIoTGPT
    VIoTGPT: Learning to Schedule Vision Tools Towards Intelligent Video Internet of Things
    AAAI 2025 / Code

  • MM-REACT
    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
    arXiv 2023 / Code

  • VisRep
    Self-training Large Language Models for Improved Visual Program Synthesis with Visual Reinforcement
    CVPR 2024

  • CRAFT
    CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
    ICLR 2024 / Code

  • CLOVA
    CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
    CVPR 2024 / Code

  • HYDRA
    HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
    ECCV 2024 / Code

  • ContextualCoder
    ContextualCoder: Adaptive In-context Prompting for Programmatic Visual Question Answering
    TMM 2025

  • ViUniT
    Visual Unit Tests for More Robust Visual Programming
    CVPR 2025 / Code

  • SYNAPSE
    SYNAPSE: SYmbolic Neural-Aided Preference Synthesis Engine
    AAAI 2025 / Code

  • NAVER
    NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
    ICCV 2025 / Code

  • DWIM
    DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
    ICCV 2025 / Code / Website

  • LEFT
    What’s Left? Concept Grounding with Logic-Enhanced Foundation Models
    NeurIPS 2023 / Code


🔷 Stage III: Tool-Enhanced VLMs

Tool-enhanced VLMs extend tool-enhanced LLMs by replacing the language model with a vision-language model, enabling direct visual interaction. Unlike tool-enhanced LLMs, the planner here takes raw images as input, reducing information loss and improving efficiency.
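
A minimal sketch of this pattern, where the hypothetical `vlm_plan` conditions on raw image bytes rather than a text summary; the toy tool registry is likewise an illustrative stand-in.

```python
# Hypothetical sketch of a tool-enhanced VLM: the planner itself is
# multimodal, so it conditions on raw image bytes instead of a lossy text
# summary. `vlm_plan` and the toy tool registry are illustrative stand-ins.

def vlm_plan(image: bytes, question: str, history: list) -> dict:
    """Stand-in for a VLM that looks at the image and picks the next tool."""
    if not history:
        return {"tool": "crop_and_zoom", "args": {"box": (10, 10, 120, 160)}}
    return {"tool": "answer", "args": {"text": "blue"}}

TOOLS = {
    # Toy placeholder: a real tool would return a cropped, upscaled image.
    "crop_and_zoom": lambda image, box: image[:64],
}

def solve(image: bytes, question: str, max_steps: int = 5) -> str:
    history = []
    for _ in range(max_steps):
        step = vlm_plan(image, question, history)  # raw pixels go in
        if step["tool"] == "answer":
            return step["args"]["text"]
        result = TOOLS[step["tool"]](image, **step["args"])
        history.append((step, result))
    return "no answer within step budget"

print(solve(b"fake-image-bytes", "What color is the car?"))
```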


  • Image-of-Thought
    Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models
    arXiv 2024

  • P2G
    Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
    arXiv 2024

  • LLaVA-PLUS
    LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
    ICLR 2024 / Code

  • VIREO
    From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
    EMNLP 2024 / Code

  • OpenThinkIMG
    OPENTHINKIMG: Learning to Think with Images via Visual Tool Reinforcement Learning
    arXiv 2025 / Code

  • SKETCHPAD
    Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
    NeurIPS 2024 / Code

  • VisionLLM v2
    VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
    NeurIPS 2024 / Code

  • VITRON
    VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
    NeurIPS 2024 / Code

  • Syn
    Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA
    CVPR 2024

  • VTool-R1
    VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
    arXiv 2025 / Code

  • Self-Imagine
    Self-Imagine: Effective Unimodal Reasoning with Multimodal Models using Self-Imagination
    arXiv 2024 / Code

  • Creative Agents
    Creative Agents: Empowering Agents with Imagination for Creative Tasks
    UAI 2025 / Code

  • CoT-VLA
    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
    CVPR 2025 / Website


🔷 Stage IV: Chain-of-Thought VLMs

These models carry out multi-step reasoning in a single forward pass without external tools, explicitly exposing intermediate reasoning and perceptual states before producing the final answer.


Stage IV Diagram
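
A minimal sketch of the single-pass pattern; `vlm_generate` is a hypothetical stand-in, and the `<think>`/`<answer>` tag scheme shown is one common convention, not a fixed standard across these papers.

```python
# Hypothetical sketch of the single-pass pattern: the VLM emits its
# intermediate reasoning and the final answer in one generation, which we
# then parse with simple tags. `vlm_generate` is an illustrative stand-in.

import re

def vlm_generate(image_path: str, question: str) -> str:
    """Stand-in for one forward pass of a CoT-trained VLM."""
    return (
        "<think>The sign is partially occluded; the visible letters on the "
        "left half read ST-O-P.</think><answer>stop sign</answer>"
    )

def answer(image_path: str, question: str) -> tuple[str, str]:
    out = vlm_generate(image_path, question)
    think = re.search(r"<think>(.*?)</think>", out, re.S)
    final = re.search(r"<answer>(.*?)</answer>", out, re.S)
    return (think.group(1) if think else "", final.group(1) if final else out)

reasoning, result = answer("scene.jpg", "What does the sign say?")
print("reasoning:", reasoning)
print("answer:", result)
```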

  • LLaVA-CoT
    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
    arXiv 2024 / Code

  • CCoT
    Compositional Chain-of-Thought Prompting for Large Multimodal Models
    CVPR 2024 / Code

  • VPD
    Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
    CVPR 2024

  • VoCoT
    VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models
    NAACL 2025 / Code

  • MM-GCoT
    Grounded Chain-of-Thought for Multimodal Large Language Models
    arXiv 2025 / Code

  • LLaVA-Aurora
    Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
    CVPR 2025 / Code

  • DeepPerception
    DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
    arXiv 2025 / Code

  • CoReS
    CoReS: Orchestrating the Dance of Reasoning and Segmentation
    ECCV 2024 / Code

  • G1
    G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
    arXiv 2025 / Code

  • Vision-R1
    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
    arXiv 2025 / Code

  • Ground-R1
    Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
    arXiv 2025

  • Griffon-R
    Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models
    arXiv 2025 / Code

  • VISTAR
    Visually Interpretable Subtask Reasoning for Visual Question Answering
    CVPR 2025 / Code

  • CoF
    Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL
    arXiv 2025 / Code

  • LATTE
    LATTE: Learning to Think with Vision Specialists
    EMNLP 2025 / Code

  • VisRL
    VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
    ICCV 2025 / Code


🔷 Stage V: Agentic VLMs

These models incorporate higher-order cognitive mechanisms such as planning, memory, visual operations, and imagination, and they ground their reasoning in both textual feedback and visual evidence.

Stage V Diagram
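
A minimal sketch of an agentic loop with explicit working memory and a reflection step; `plan`, `act`, and `reflect` are hypothetical stand-ins, not the interface of any system listed below.

```python
# Hypothetical sketch of an agentic loop with explicit working memory and a
# reflection step. `plan`, `act`, and `reflect` are illustrative stand-ins,
# not the interface of any listed system.

def plan(question: str, memory: list) -> str:
    """Stand-in planner: choose the next sub-goal given stored evidence."""
    return "zoom_on_plate" if not memory else "answer"

def act(sub_goal: str) -> str:
    """Stand-in executor: run a visual operation and return evidence."""
    return "high-res crop shows a license plate reading 7XKC221"

def reflect(memory: list) -> bool:
    """Stand-in critic: is the accumulated evidence sufficient to answer?"""
    return len(memory) >= 1

def agent(question: str, max_steps: int = 4) -> str:
    memory: list[str] = []                  # persistent working memory
    for _ in range(max_steps):
        sub_goal = plan(question, memory)   # planning over memory
        if sub_goal == "answer" and reflect(memory):
            return f"7XKC221 (evidence: {memory[-1]})"
        memory.append(act(sub_goal))        # store new visual evidence
    return "insufficient evidence"

print(agent("What is the car's plate number?"))
```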

  • V*
    V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
    CVPR 2024 / Code

  • DC2
    Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
    AAAI 2025 / Code

  • ZoomEye
    ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
    arXiv 2024 / Code

  • FAST
    Visual Agents as Fast and Slow Thinkers
    ICLR 2025 / Code

  • CogCoM
    CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations
    ICLR 2025 / Code

  • GeReA
    GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
    arXiv 2024 / Code

  • Insight-V
    Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
    CVPR 2025 / Code

  • Argus
    Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
    CVPR 2025

  • Mirage
    Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
    arXiv 2025 / Code

  • FOREWARN
    From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment
    RSS 2025 / Code

  • VisCoT
    Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
    NeurIPS 2024 / Code

  • ViGoRL
    Grounded Reinforcement Learning for Visual Reasoning
    arXiv 2025 / Code

  • SIFThinker
    SIFThinker: Spatially-Aware Image Focus for Visual Reasoning
    arXiv 2025 / Code

📖 Citation

If you find this work useful for your research, please consider citing it.

```bibtex
@misc{ke2025explainanswersurveycompositional,
      title={Explain Before You Answer: A Survey on Compositional Visual Reasoning},
      author={Fucai Ke and Joy Hsu and Zhixi Cai and Zixian Ma and Xin Zheng and Xindi Wu and Sukai Huang and Weiqing Wang and Pari Delir Haghighi and Gholamreza Haffari and Ranjay Krishna and Jiajun Wu and Hamid Rezatofighi},
      year={2025},
      eprint={2508.17298},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.17298},
}
```
