ZhiYuan Feng¹*, Zhaolu Kang²*, Qijie Wang¹*, Zhiying Du³*, Jiongrui Yan⁴, Shi Shubin⁴, Chengbo Yuan¹, Huizhi Liang¹, Yu Deng⁵, Qixiu Li¹, Rushuai Yang⁶, Ruichuan An², Leqi Zheng¹, Weijie Wang⁷, Shawn Chen⁷, Sicheng Xu⁵, Yaobo Liang⁵, Jiaolong Yang⁵†, Baining Guo⁵
¹Tsinghua University, ²Peking University, ³Fudan University, ⁴Jilin University, ⁵Microsoft Research Asia, ⁶Hong Kong University of Science and Technology, ⁷Zhejiang University
(*Equal Contribution, †Corresponding Author)
- [2025.10] 📢📢 Paper and initial project release.
- Release the evaluation code
- Release the benchmark dataset on HuggingFace
Benchmark Overview: We introduce MV-RoboBench, a benchmark designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic scenes. It contains [Number] question-answer pairs collected from [Number] diverse robotic scenes and covers [Number] challenging tasks, such as [Task 1 Name], [Task 2 Name], and [Task 3 Name]. These tasks probe complementary aspects of 3D scene understanding, from establishing cross-view object correspondences to reasoning about relative spatial poses.
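To illustrate how the benchmark can be consumed once the dataset is released, the sketch below shows a minimal evaluation loop over its question-answer pairs. It is illustrative only: the HuggingFace dataset ID (`MV-RoboBench/MV-RoboBench`) and the field names (`images`, `question`, `options`, `answer`) are assumptions, and `query_vlm` is a hypothetical stub standing in for the model under evaluation; please refer to the released evaluation code for the actual interface.

```python
# Minimal evaluation sketch (illustrative only; dataset ID and field names are assumptions).
from datasets import load_dataset  # pip install datasets


def query_vlm(images, question, options):
    """Hypothetical stub: replace with a call to the VLM under test (e.g., GPT-4o)."""
    raise NotImplementedError


def evaluate(split: str = "test") -> float:
    # The dataset ID below is a placeholder, not the official identifier.
    ds = load_dataset("MV-RoboBench/MV-RoboBench", split=split)
    correct = 0
    for sample in ds:
        # Each sample is assumed to bundle the multi-view images with a
        # multiple-choice question; the prediction is matched against the answer key.
        prediction = query_vlm(sample["images"], sample["question"], sample["options"])
        correct += int(prediction == sample["answer"])
    return correct / len(ds)


if __name__ == "__main__":
    print(f"Overall accuracy: {evaluate():.3f}")
```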
📌 A Benchmark for Robotic Scenes: We introduce MV-RoboBench, a comprehensive benchmark designed to evaluate the spatial reasoning of Vision-Language Models in robotic scenes.
📊 Comprehensive Evaluation: We evaluate [Number] state-of-the-art VLMs, including GPT-4o and Claude 3, revealing a significant gap between their performance and human-level spatial reasoning.
🔍 Revealing Core Challenges: Our analysis pinpoints key failure modes for current models in robotic scene understanding, particularly in cross-view correspondence, relative pose estimation, and action planning.
For any questions or suggestions, please feel free to contact Zhiyuan Feng or the other authors.

