
🤖 Awesome-Agentic-MLLMs


👏 Welcome to the Awesome-Agentic-MLLMs repository! This curated collection features influential papers, codebases, datasets, benchmarks, and resources exploring the emerging field of agentic capabilities in Multimodal Large Language Models (MLLMs).

⭐ Feel free to star and fork this repository to stay updated with the latest advancements and contribute to the growing community.

We greatly appreciate everyone who submits an issue for any related work we may have missed; we'll review and address it in the next release!

🔔 News

  • Oct 14, 2025. We’re excited to introduce our survey paper on agentic MLLMs. Check it out on arXiv!
  • Oct 12, 2025. This repository curates and maintains an up-to-date list of papers on agentic MLLMs. Contributions and suggestions are warmly welcome!

🔗 Citation

If you find this survey helpful, please cite our work:

@article{yao2025survey,
  title={A Survey on Agentic Multimodal Large Language Models},
  author={Yao, Huanjin and Zhang, Ruifei and Huang, Jiaxing and Zhang, Jingyi and Wang, Yibo and Fang, Bo and Zhu, Ruolin and Jing, Yongcheng and Liu, Shunyu and Li, Guanbin and others},
  journal={arXiv preprint arXiv:2510.10991},
  year={2025}
}

🌍 Overview

We collect recent advances in Agentic MLLMs and categorize them into three core dimensions: (1) Agentic Internal Intelligence, which leverages reasoning, reflection, and memory to enable accurate long-horizon planning; (2) Agentic External Tool Invocation, whereby models proactively use various external tools to extend their problem-solving capabilities beyond their intrinsic knowledge; and (3) Agentic Environment Interaction, which situates models within virtual or physical environments, allowing them to perceive changes and incorporate feedback from the real world.
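
Read as a loop, these three dimensions compose naturally: the model reasons and reflects internally, optionally invokes an external tool, and acts in and perceives its environment before the next turn. Below is a minimal, illustrative Python sketch of such a loop; every name in it (Action, ToyEnv, run_tool, mllm_generate, agent_loop) is a hypothetical stand-in, not an API from any work listed here.

```python
# A minimal, illustrative sketch of the agentic loop described above.
# Every name here (Action, ToyEnv, run_tool, mllm_generate) is a hypothetical
# stand-in, not an API from any paper in this list.
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                  # "tool" | "act" | "answer"
    name: str = ""
    args: dict = field(default_factory=dict)
    answer: str = ""

class ToyEnv:
    """(3) Environment interaction: stand-in for a virtual or physical environment."""
    def observe(self) -> str:
        return "initial observation"
    def step(self, args: dict) -> str:
        return f"environment feedback after acting with {args}"

def run_tool(name: str, args: dict) -> str:
    """(2) External tool invocation: search, code execution, image cropping, ..."""
    return f"result of tool '{name}' on {args}"

def mllm_generate(task: str, memory: list, observation: str) -> tuple[str, Action]:
    """(1) Internal intelligence: reason and reflect over task, memory, observation.
    A real system would call an MLLM here; this stub answers immediately."""
    thought = f"reasoning about '{task}' given '{observation}'"
    return thought, Action(kind="answer", answer="a final answer")

def agent_loop(task: str, env: ToyEnv, max_turns: int = 8):
    memory: list[str] = []               # agentic memory persists across turns
    observation = env.observe()
    for _ in range(max_turns):
        thought, action = mllm_generate(task, memory, observation)
        memory.append(thought)           # material for later reflection
        if action.kind == "tool":
            observation = run_tool(action.name, action.args)
        elif action.kind == "act":
            observation = env.step(action.args)
        else:                            # terminate with the final answer
            return action.answer
    return None

print(agent_loop("describe the scene", ToyEnv()))
```

A real agent would replace mllm_generate with an actual MLLM call that emits thoughts and structured actions, and ToyEnv with a GUI, web, or embodied environment.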

📒 Table of Contents

  • 🔔 News
  • 🔗 Citation
  • 🌍 Overview
  • 📄 Paper List
      • Foundational MLLMs
      • Agentic Internal Intelligence
      • Agentic External Tool Invocation
      • Agentic Environment Interaction
      • Agentic Training Framework

📄 Paper List

Foundational MLLMs

Dense MLLMs

Date Title Paper Code
2502 Qwen2.5-VL Technical Report Paper Code
2502 SmolVLM2: Bringing Video Understanding to Every Device Paper Code
2506 MiMo-VL Technical Report Paper Code
2507 Kwai Keye-VL Technical Report Paper Code
2509 SAIL-VL2 Technical Report Paper Code
2509 LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training Paper Code
2509 MiniCPM-V 4.5 Technical Report Paper Code

MoE MLLMs

Date Title Paper Code
2409 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning Paper -
2412 DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Paper Code
2503 Kimi-VL Technical Report Paper Code
2506 ERNIE 4.5 Technical Report Paper Code
2507 Seed1.5-VL Technical Report Paper Code
2507 GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper Code
2507 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding Paper Code
2508 InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper Code
2509 Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action Paper Code

Agentic Internal Intelligence

Agentic Reasoning

Date Title Paper Code
2410 Improve Vision Language Model Chain-of-thought Reasoning Paper Code
2411 LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Paper Code
2412 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper Code
2503 Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models Paper Code
2503 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization Paper Code
2503 MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning Paper Code
2503 Video-R1: Reinforcing Video Reasoning in MLLMs Paper Code
2504 SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement Paper Code
2504 NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Paper Code
2504 Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning Paper Code
2504 VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Paper Code
2505 SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward Paper Code
2505 R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO Paper Code
2505 EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning Paper Code
2505 Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models Paper Code
2506 GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning Paper Code
2506 WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning Paper Code
2506 APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization Paper Code
2507 Scaling RL to Long Videos Paper Code
2507 VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning Paper Code
2507 C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning Paper Code
2507 Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning Paper -
2508 StructVRM: Aligning Multimodal Reasoning with Structured and Verifiable Reward Models Paper -
2509 MAPO: Mixed Advantage Policy Optimization Paper -
2509 MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources Paper Code
2509 VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception Paper Code
2509 Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models Paper Code

Agentic Reflection

Date Title Paper Code
2410 ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents Paper Code
2411 Self-Corrected Multimodal Large Language Model for Robot Manipulation and Reflection Paper -
2411 Vision-Language Models Can Self-Improve Reasoning via Reflection Paper Code
2412 Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search Paper Code
2503 V-Stylist: Video Stylization via Collaboration and Reflection of MLLM Agents Paper -
2504 MASR: Self-Reflective Reasoning through Multimodal Hierarchical Attention Focusing for Agent-based Video Understanding Paper -
2504 VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning Paper Code
2505 Training-Free Reasoning and Reflection in MLLMs Paper Code
2506 SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning Paper Code
2507 Look-Back: Implicit Visual Re-focusing in MLLM Reasoning Paper Code
2509 Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards Paper Code
2510 SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models Paper Code

Agentic Memory

Date Title Paper Code
2305 MemoryBank: Enhancing Large Language Models with Long-Term Memory Paper Code
2307 MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Paper Code
2312 Empowering Working Memory for Large Language Model Agents Paper -
2402 LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens Paper Code
2502 A-Mem: Agentic Memory for LLM Agents Paper Code
2503 In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents Paper -
2504 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory Paper Code
2506 A Walk to Remember: MLLM Memory-Driven Visual Navigation Paper -
2506 MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents Paper Code
2507 MemOS: A Memory OS for AI System Paper -
2507 MIRIX: Multi-Agent Memory System for LLM-Based Agents Paper Code
2508 Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning Paper -
2508 Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory Paper -
2508 MMS: Multiple Memory Systems for Enhancing the Long-term Memory of Agent Paper -

Agentic External Tool Invocation

Agentic Search for Information Retrieval

Date Title Paper Code
2502 OpenAI Deep Research: Introducing deep research Paper Code
2505 VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning Paper Code
2505 Visual Agentic Reinforcement Fine-Tuning Paper Code
2506 MMSearch-R1: Incentivizing LMMs to Search Paper Code
2508 Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning Paper Code
2508 M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation Paper -
2508 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent Paper Code
2510 DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search Paper -

Agentic Coding for Complex Computations

Date Title Paper Code
2501 rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper Code
2504 ReTool: Reinforcement Learning for Strategic Tool Use in LLMs Paper Code
2505 R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning Paper Code
2506 CoRT: Code-integrated Reasoning within Thinking Paper Code
2507 PyVision: Agentic Vision with Dynamic Tooling Paper Code
2508 rStar2-Agent: Agentic Reasoning Technical Report Paper Code
2508 Posterior-GRPO: Rewarding Reasoning Processes in Code Generation Paper -
2509 Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use Paper Code

Agentic Visual Processing for Thinking with Image

Date Title Paper Code
2501 Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step Paper Code
2505 Visual Planning: Let's Think Only with Images Paper Code
2505 Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO Paper Code
2505 GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning Paper Code
2505 DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning Paper Code
2505 VLM-R3: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought Paper -
2505 Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO Paper Code
2505 OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Paper Code
2505 Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL Paper Code
2505 Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning Paper Code
2508 Simple o3: Towards Interleaved Vision-Language Reasoning Paper -
2508 Thyme: Think Beyond Images Paper Code
2509 Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search Paper Code

Agentic Environment Interaction

Agentic Virtual Interaction

Date Title Paper Code
2411 ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper Code
2501 UI-TARS: Pioneering Automated GUI Interaction with Native Agents Paper Code
2503 UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning Paper Code
2504 TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials Paper Code
2504 GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents Paper Code
2504 InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners Paper Code
2505 WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning Paper Code
2506 GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior Paper Code
2509 InfraMind: A Novel Exploration-based GUI Agentic Framework for Mission-critical Industrial Management Paper -
2509 UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning Paper Code

Agentic Physical Interaction

Date Title Paper Code
2406 OpenVLA: An Open-Source Vision-Language-Action Model Paper Code
2505 ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models Paper -
2506 Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning Paper Code
2506 VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning Paper Code
2507 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning Paper Code
2508 Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation Paper Code
2508 MolmoAct: Action Reasoning Models that can Reason in Space Paper Code
2508 EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control Paper Code
2509 Nav-R1: Reasoning and Navigation in Embodied Scenes Paper Code
2509 Wall-x: Igniting VLMs toward the Embodied Space Paper Code
2509 VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search Paper -

Agentic Training Framework

Agentic CPT/SFT

Title Code
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models Code
ms-swift: SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) Code
Megatron-LM Code
Unsloth Code
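
These frameworks typically consume agentic trajectories serialized as multi-turn chat data with interleaved tool calls. As a rough illustration, here is one such training example in a ShareGPT-like layout; the exact field names and special tags vary by framework, so treat this schema as an assumption rather than any tool's canonical format.

```python
# Illustrative multi-turn agentic SFT example in a ShareGPT-like layout.
# Field names and tags are assumptions; consult each framework's docs for
# its actual data format.
import json

trajectory = {
    "images": ["chart.png"],  # multimodal input referenced by the first turn
    "conversations": [
        {"from": "human", "value": "<image> What was the 2024 revenue?"},
        {"from": "gpt", "value": "<think>The chart is too small to read; crop and zoom.</think>"
                                 "<tool_call>crop(x=120, y=80, w=200, h=160)</tool_call>"},
        {"from": "tool", "value": "<image> (cropped region)"},
        {"from": "gpt", "value": "<answer>Revenue was $3.2M in 2024.</answer>"},
    ],
}
print(json.dumps(trajectory, indent=2))
```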

Agentic RL

Title Code
verl: Volcano Engine Reinforcement Learning for LLMs Code
rLLM (DeepScaleR): Reinforcement Learning for Language Agents Code
RLFactory: Easy and Efficient RL Training Code
ROLL: Reinforcement Learning Optimization for Large-Scale Learning Code
RAGEN: Training Agents by Reinforcing Reasoning Code
SkyRL: A Modular Full-stack RL Library for LLMs Code
Search-R1: Train your LLMs to reason and call a search engine with reinforcement learning Code
Multimodal-Search-R1: Incentivizing LMMs to Search Code
Visual Agentic Reinforcement Fine-Tuning Code
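
Many papers in this list (e.g., R1-VL, Share-GRPO, GRPO-CARE, Active-O3) train with GRPO-style reinforcement learning, and the frameworks above implement variants of it. As a point of reference, here is a minimal sketch of GRPO's group-relative advantage: sample a group of responses per prompt, score each with a reward (often rule-based), and normalize rewards within the group. This is an illustrative reimplementation, not code from any listed framework.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
# Each prompt gets a group of sampled responses; a response's advantage is
# its reward normalized against its own group's mean and std.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Rewards for one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts for one prompt, scored by a rule-based verifier (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [1.0, -1.0, -1.0, 1.0]; positive means better than the group average
```

In full GRPO these advantages weight a PPO-style clipped objective, typically with a KL penalty toward a reference model.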
