
🤖 Awesome-Agentic-MLLMs


👏 Welcome to the Awesome-Agentic-MLLMs repository! This curated collection features influential papers, codebases, datasets, benchmarks, and resources exploring the emerging field of agentic capabilities in Multimodal Large Language Models (MLLMs).

⭐ Feel free to star and fork this repository to stay updated with the latest advancements and contribute to the growing community.

We greatly appreciate everyone who submits an issue for any related work we may have missed; we'll review and address it in the next release!

🔔 News

  • Oct 14, 2025. We’re excited to introduce our survey paper on agentic MLLMs. Check it out on arXiv!
  • Oct 12, 2025. This repository curates and maintains an up-to-date list of papers on agentic MLLMs. Contributions and suggestions are warmly welcome!

🔗 Citation

If you find this survey helpful, please cite our work:

@article{yao2025survey,
  title={A Survey on Agentic Multimodal Large Language Models},
  author={Yao, Huanjin and Zhang, Ruifei and Huang, Jiaxing and Zhang, Jingyi and Wang, Yibo and Fang, Bo and Zhu, Ruolin and Jing, Yongcheng and Liu, Shunyu and Li, Guanbin and others},
  journal={arXiv preprint arXiv:2510.10991},
  year={2025}
}

🌍 Overview

We collect recent advances in Agentic MLLMs and categorize them into three core dimensions: (1) Agentic Internal Intelligence, which leverages reasoning, reflection, and memory to enable accurate long-horizon planning; (2) Agentic External Tool Invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (3) Agentic Environment Interaction, which situates models within virtual or physical environments, allowing them to perceive changes and incorporate feedback from the real world.
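
Read as a loop, these three dimensions compose naturally: the model reasons and reflects internally, optionally invokes an external tool, and acts in and perceives its environment before the next turn. Below is a minimal, illustrative Python sketch of such a loop; every name in it (Action, ToyEnv, run_tool, mllm_generate, agent_loop) is a hypothetical stand-in, not an API from any work listed here.

```python
# A minimal, illustrative sketch of the agentic loop described above.
# Every name here (Action, ToyEnv, run_tool, mllm_generate) is a hypothetical
# stand-in, not an API from any paper in this list.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                  # "tool" | "act" | "answer"
    name: str = ""
    args: dict = field(default_factory=dict)
    answer: str = ""

class ToyEnv:
    """(3) Environment interaction: stand-in for a virtual or physical environment."""
    def observe(self) -> str:
        return "initial observation"
    def step(self, args: dict) -> str:
        return f"environment feedback after acting with {args}"

def run_tool(name: str, args: dict) -> str:
    """(2) External tool invocation: search, code execution, image cropping, ..."""
    return f"result of tool '{name}' on {args}"

def mllm_generate(task: str, memory: list, observation: str) -> tuple[str, Action]:
    """(1) Internal intelligence: reason and reflect over task, memory, observation.
    A real system would call an MLLM here; this stub answers immediately."""
    thought = f"reasoning about '{task}' given '{observation}'"
    return thought, Action(kind="answer", answer="a final answer")

def agent_loop(task: str, env: ToyEnv, max_turns: int = 8):
    memory: list[str] = []               # agentic memory persists across turns
    observation = env.observe()
    for _ in range(max_turns):
        thought, action = mllm_generate(task, memory, observation)
        memory.append(thought)           # material for later reflection
        if action.kind == "tool":
            observation = run_tool(action.name, action.args)
        elif action.kind == "act":
            observation = env.step(action.args)
        else:                            # terminate with the final answer
            return action.answer
    return None

print(agent_loop("describe the scene", ToyEnv()))
```

A real agent would replace mllm_generate with an actual MLLM call that emits thoughts and structured actions, and ToyEnv with a GUI, web, or embodied environment.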

📒 Table of Contents

  • 🔔 News
  • 🔗 Citation
  • 🌍 Overview
  • 📄 Paper List
      • Foundational MLLMs
      • Agentic Internal Intelligence
      • Agentic External Tool Invocation
      • Agentic Environment Interaction
      • Agentic Training Framework

📄 Paper List

Foundational MLLMs

Dense MLLMs

Date Title Paper Code
2502 Qwen2.5-VL Technical Report Paper Code
2502 SmolVLM2: Bringing Video Understanding to Every Device Paper Code
2506 MiMo-VL Technical Report Paper Code
2507 Kwai Keye-VL Technical Report Paper Code
2509 SAIL-VL2 Technical Report Paper Code
2509 LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training Paper Code
2509 MiniCPM-V 4.5 Technical Report Paper Code

MoE MLLMs

Date Title Paper Code
2409 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper -
2412 DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Paper Code
2503 Kimi-VL Technical Report Paper Code
2506 ERNIE 4.5 Technical Report Paper Code
2507 Seed1.5-VL Technical Report Paper Code
2507 GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper Code
2507 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Paper Code
2508 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper Code
2509 Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action Paper Code

Agentic Internal Intelligence

Agentic Reasoning

Date Title Paper Code
2410 Improve Vision Language Model Chain-of-thought Reasoning Paper Code
2411 LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Paper Code
2412 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper Code
2503 Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper Code
2503 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization Paper Code
2503 MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning Paper Code
2503 Video-R1: Reinforcing Video Reasoning in MLLMs Paper Code
2504 SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement Paper Code
2504 NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper Code
2504 Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning Paper Code
2504 VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper Code
2505 SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward Paper Code
2505 R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO Paper Code
2505 EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning Paper Code
2505 Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models Paper Code
2506 GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning Paper Code
2506 WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning Paper Code
2506 APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization Paper Code
2507 Scaling RL to Long Videos Paper Code
2507 VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning Paper Code
2507 C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning Paper Code
2507 Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning Paper -
2508 StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models Paper -
2509 MAPO: Mixed Advantage Policy Optimization Paper -
2509 MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources Paper Code
2509 VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception Paper Code
2509 Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models Paper Code

Agentic Reflection

Date Title Paper Code
2410 ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents Paper Code
2411 Self-Corrected Multimodal Large Language Model for Robot Manipulation and Reflection Paper -
2411 Vision-Language Models Can Self-Improve Reasoning via Reflection Paper Code
2412 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper Code
2503 V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents Paper -
2504 MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding Paper -
2504 VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Paper Code
2505 Training-Free Reasoning and Reflection in MLLMs Paper Code
2506 SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning Paper Code
2507 Look-Back: Implicit Visual Re-focusing in MLLM Reasoning Paper Code
2509 Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards Paper Code
2510 SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models Paper Code

Agentic Memory

Date Title Paper Code
2305 MemoryBank: Enhancing Large Language Models with Long-Term Memory Paper Code
2307 MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Paper Code
2312 Empowering Working Memory for Large Language Model Agents Paper -
2402 LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper Code
2502 A-Mem: Agentic Memory for LLM Agents Paper Code
2503 In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents Paper -
2504 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory Paper Code
2506 A Walk to Remember: MLLM Memory-Driven Visual Navigation Paper -
2506 MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents Paper Code
2507 MemOS: A Memory OS for AI System Paper -
2507 MIRIX: Multi-Agent Memory System for LLM-Based Agents Paper Code
2508 Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning Paper -
2508 Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory Paper -
2508 MMS: Multiple Memory Systems for Enhancing the Long-term Memory of Agent Paper -

Agentic External Tool Invocation

Agentic Search for Information Retrieval

Date Title Paper Code
2502 OpenAI Deep Research: Introducing deep research Paper Code
2505 VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning Paper Code
2505 Visual Agentic Reinforcement Fine-Tuning Paper Code
2506 MMSearch-R1: Incentivizing LMMs to Search Paper Code
2508 Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning Paper Code
2508 M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation Paper -
2508 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent Paper Code
2510 DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search Paper -

Agentic Coding for Complex Computations

Date Title Paper Code
2501 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper Code
2504 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs Paper Code
2505 R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning Paper Code
2506 CoRT: Code-integrated Reasoning within Thinking Paper Code
2507 PyVision: Agentic Vision with Dynamic Tooling Paper Code
2508 rStar2-Agent: Agentic Reasoning Technical Report Paper Code
2508 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation Paper -
2509 Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use Paper Code

Agentic Visual Processing for Thinking with Image

Date Title Paper Code
2501 Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Paper Code
2505 Visual Planning: Let's Think Only with Images Paper Code
2505 Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO Paper Code
2505 GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning Paper Code
2505 DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning Paper Code
2505 VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought Paper -
2505 Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO Paper Code
2505 OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Paper Code
2505 Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL Paper Code
2505 Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper Code
2508 Simple o3: Towards Interleaved Vision-Language Reasoning Paper -
2508 Thyme: Think Beyond Images Paper Code
2509 Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search Paper Code

Agentic Environment Interaction

Agentic Virtual Interaction

Date Title Paper Code
2411 ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper Code
2501 UI-TARS: Pioneering Automated GUI Interaction with Native Agents Paper Code
2503 UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning Paper Code
2504 TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials Paper Code
2504 GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents Paper Code
2504 InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners Paper Code
2505 WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning Paper Code
2506 GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior Paper Code
2509 InfraMind: A Novel Exploration-based GUI Agentic Framework for Mission-critical Industrial Management Paper -
2509 UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning Paper Code

Agentic Physical Interaction

Date Title Paper Code
2406 OpenVLA: An Open-Source Vision-Language-Action Model Paper Code
2505 ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models Paper -
2506 Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning Paper Code
2506 VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning Paper Code
2507 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning Paper Code
2508 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation Paper Code
2508 MolmoAct: Action Reasoning Models that can Reason in Space Paper Code
2508 EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control Paper Code
2509 Nav-R1: Reasoning and Navigation in Embodied Scenes Paper Code
2509 Wall-x: Igniting VLMs toward the Embodied Space Paper Code
2509 VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search Paper -

Agentic Training Framework

Agentic CPT/SFT

Title Code
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models Code
ms-swift: SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) Code
Megatron-LM Code
Unsloth Code
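
These frameworks typically consume agentic trajectories serialized as multi-turn chat data with interleaved tool calls. As a rough illustration, here is one such training example in a ShareGPT-like layout; the exact field names and special tags vary by framework, so treat this schema as an assumption rather than any tool's canonical format.

```python
# Illustrative multi-turn agentic SFT example in a ShareGPT-like layout.
# Field names and tags are assumptions; consult each framework's docs for
# its actual data format.
import json

trajectory = {
    "images": ["chart.png"],  # multimodal input referenced by the first turn
    "conversations": [
        {"from": "human", "value": "<image> What was the 2024 revenue?"},
        {"from": "gpt", "value": "<think>The chart is too small to read; crop and zoom.</think>"
                                 "<tool_call>crop(x=120, y=80, w=200, h=160)</tool_call>"},
        {"from": "tool", "value": "<image> (cropped region)"},
        {"from": "gpt", "value": "<answer>Revenue was $3.2M in 2024.</answer>"},
    ],
}
print(json.dumps(trajectory, indent=2))
```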

Agentic RL

Title Code
verl: Volcano Engine Reinforcement Learning for LLMs Code
rLLM (DeepScaleR): Reinforcement Learning for Language Agents Code
RLFactory: Easy and Efficient RL Training Code
ROLL: Reinforcement Learning Optimization for Large-Scale Learning Code
RAGEN: Training Agents by Reinforcing Reasoning Code
SkyRL: A Modular Full-stack RL Library for LLMs Code
Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning Code
Multimodal-Search-R1: Incentivizing LMMs to Search Code
Visual Agentic Reinforcement Fine-Tuning Code
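
Many papers in this list (e.g., R1-VL, Share-GRPO, GRPO-CARE, Active-O3) train with GRPO-style reinforcement learning, and the frameworks above implement variants of it. As a point of reference, here is a minimal sketch of GRPO's group-relative advantage: sample a group of responses per prompt, score each with a reward (often rule-based), and normalize rewards within the group. This is an illustrative reimplementation, not code from any listed framework.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
# Each prompt gets a group of sampled responses; a response's advantage is
# its reward normalized against its own group's mean and std.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Rewards for one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts for one prompt, scored by a rule-based verifier (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [1.0, -1.0, -1.0, 1.0]; positive means better than the group average
```

In full GRPO these advantages weight a PPO-style clipped objective, typically with a KL penalty toward a reference model.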
