Whether at home, in the workplace, or elsewhere, the ability to perceive a space, remember its layout, and retrieve this spatial information to answer questions on demand is a key aspect of visual-spatial intelligence. Recent Multimodal LLMs can understand general videos, but can they "think spatially" when presented with a video recording of an environment? Can they build an accurate, implicit "cognitive map" that allows them to answer questions about a space? What are the strengths and limitations of using MLLMs to enhance spatial intelligence? We dig into these questions by setting up video data for MLLMs to watch, building a VQA benchmark to check their recall, and examining what the MLLMs actually remember and understand.
- 2025-08-05- ♥️ We release the meta information used in our VSIBench! Check out at- /data/meta_info!
- 2025-02-27- ♥️ Our paper- Thinking-in-Spaceis accepted by CVPR 2025 as Oral! See u in Nashville!
- 2024-12-19🚀 We released our VSI-Bench and corresponding evaluation code.
Overview: We introduce VSI-Bench, a benchmark designed to evaluate the visual-spatial intelligence of Multimodal LLMs (MLLMs). VSI-Bench comprises over 5,000 question-answer pairs derived from 288 egocentric videos sourced from the validation sets of public indoor 3D scene reconstruction datasets ScanNet, ScanNet++, and ARKitScenes.
VSI-Bench includes eight tasks categorized into three types: configurational, measurement estimation, and spatiotemporal. Iteratively refined for quality, VSI-Bench provides a foundational resource for studying the connection between MLLMs and 3D reconstruction.
Evaluation Setups: We benchmarked 15 video-supporting MLLMs from diverse model families. For proprietary models, we include Gemini-1.5 and GPT-4o. For open-source models, we evaluate models from InternVL2, ViLA, LongViLA, LongVA, LLaVA-OneVision, and LLaVA-NeXT-Video. All evaluations are conducted in zero-shot settings with default prompts and greedy decoding for reproducibility. Tasks are evaluated using either accuracy for Multiple-Choice Answer (MCA) tasks or our proposed Mean Relative Accuracy (MRA) for Numerical Answer (NA) tasks.
Our benchmark is hosted on HuggingFace. You can simply access the benchmark data using the following code.
# NOTE: pip install datasets
from datasets import load_dataset
vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
print(dataset)conda create --name vsibench python=3.10
conda activate vsibench
git clone [email protected]:vision-x-nyu/thinking-in-space.git
cd thinking-in-space
git submodule update --init --recursive
cd transformers && pip install -e . && cd ..
pip install -e .
pip install s2wrapper@git+https://github.com/bfshi/scaling_on_scales
pip install deepspeedWe provide a all-in-one evaluation scripts. You can simply run the following code to start your evaluation.
bash evaluate_all_in_one.sh --model all --num_processes 8 --benchmark vsibenchNote: The evaluation results for open-source models may differ slightly from our tables due to additional data refinement. We will update the tables and our paper soon.
We strive to maintain the highest quality in our benchmark, but some imperfections may persist. If you notice any, we encourage you to reach out and share your valuable feedback!
Our evaluation code is build upon lmms-eval. We acknowledge their team for providing this excellent toolkit for evaluating multimodal large language models.
If you find our paper and code useful in your research, please consider giving us a star ⭐ and citing our work 📝 :)
@article{yang2024think,
    title={{Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces}},
    author={Yang, Jihan and Yang, Shusheng and Gupta, Anjali and Han, Rilyn and Fei-Fei, Li and Xie, Saining},
    year={2024},
    journal={arXiv preprint arXiv:2412.14171},
}