This repository contains the official implementation of VisFactor, a novel benchmark that digitizes 20 vision-centric subtests from the Factor-Referenced Cognitive Test (FRCT), a well-established assessment from cognitive psychology. Our work systematically investigates the gap between human visual cognition and state-of-the-art Multimodal Large Language Models (MLLMs).
- Comprehensive Evaluation: 4 core domains of human visual cognition
  - Visualization and Spatial Processing: Mental rotation, spatial relations
  - Perceptual and Closure: Figure-ground discrimination, pattern completion
  - Memory: Visual working memory, recognition tasks
  - Reasoning: Abstract visual reasoning, analogical thinking
- Extensive Model Coverage: 20 frontier MLLMs from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families
- Rigorous Assessment: Based on the well-established psychometric FRCT
Note: This evaluation framework is built on VLMEvalKit. For detailed information beyond this guide, please refer to their repository. We extend our sincere gratitude to the VLMEvalKit team for their excellent work.
- Clone the repository and install dependencies

```bash
git clone https://github.com/CUHK-ARISE/VisFactor.git
cd VisFactor
pip install -r requirements.txt
```
- Download the VisFactor dataset

```bash
mkdir -p ~/LMUData
cd ~/LMUData  # Downloaded files will be placed here
```

Place `VisFactor.tsv` and `VisFactor_CoT.tsv` in this directory.
- Configure API credentials

```bash
cd VisFactor/
vim .env  # or use any text editor
```

Example `.env` configuration:

```
# OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_API_BASE=https://api.openai.com/v1

# Google
GOOGLE_API_KEY=your_google_key

# Other Services
STEPAI_API_KEY=your_stepai_key
REKA_API_KEY=your_reka_key
GLMV_API_KEY=your_glmv_key
SENSENOVA_API_KEY=your_sensenova_key
MOONSHOT_API_KEY=your_kimi_key
DOUBAO_VL_KEY=your_doubao_key

# Hunyuan-Vision
HUNYUAN_SECRET_KEY=your_hunyuan_key
HUNYUAN_SECRET_ID=your_hunyuan_id

# Deployment Services
CW_API_BASE=your_congwang_base
CW_API_KEY=your_congwang_key
LMDEPLOY_API_BASE=your_lmdeploy_base
```
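Before launching a run, it can help to sanity-check the setup. A minimal sketch, assuming your shell is in the `VisFactor/` directory and OpenAI is the provider you plan to use (the file names and variable names come from the steps above):

```bash
# Confirm the dataset files are in ~/LMUData, where the framework looks for them
ls ~/LMUData/VisFactor.tsv ~/LMUData/VisFactor_CoT.tsv

# Confirm the API key you intend to use is set in .env (OpenAI shown as an example)
grep -q '^OPENAI_API_KEY=' .env && echo "OpenAI key configured" || echo "OPENAI_API_KEY missing"
```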
Run the benchmark:

```bash
python3 run.py --data VisFactor --model GeminiPro2-5 --verbose
python3 run.py --data VisFactor_CoT --model GeminiPro2-5 --verbose
```

| Argument | Type | Default | Description |
|---|---|---|---|
| `--model` | `list[str]` | required | VLM names supported in VLMEvalKit (see `supported_VLM` in `vlmeval/config.py`) |
| `--mode` | `str` | `'all'` | Evaluation mode: `'all'` (inference + evaluation) or `'infer'` (inference only) |
| `--api-nproc` | `int` | `4` | Number of threads for API requests |
| `--work-dir` | `str` | `'.'` | Directory to save evaluation results |
| `--reuse` | flag | `False` | Use previously generated results if available |
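The flags above can be combined as needed. A hedged example, where the output directory name is arbitrary and `--mode infer` defers scoring to a later run:

```bash
# Inference only, 8 parallel API workers, results saved under ./outputs,
# reusing any predictions generated by a previous run in the same work directory
python3 run.py --data VisFactor --model GeminiPro2-5 \
    --mode infer --api-nproc 8 --work-dir ./outputs --reuse --verbose
```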
We also provide a script that automatically generates additional test cases for the following subtests: CF1-3, CS1-3, MA1, S1-2, SS3, and VZ1-2.
First, prepare some images:
```bash
mkdir visfactor/Collected_Figures
```

Place your images in this folder, then run the script to generate new questions:
```bash
cd visfactor
python3 generate_images.py
```

If you find VisFactor useful in your research, please cite our paper:
```bibtex
@article{huang2025human,
  title={Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs},
  author={Huang, Jen-Tse and Dai, Dasen and Huang, Jen-Yuan and Yuan, Youliang and Liu, Xiaoyuan and Wang, Wenxuan and Jiao, Wenxiang and He, Pinjia and Tu, Zhaopeng and Duan, Haodong},
  journal={arXiv preprint arXiv:2502.16435},
  year={2025}
}
```