This repository contains the official implementation of VisFactor, a novel benchmark that digitizes 20 vision-centric subtests from the Factor-Referenced Cognitive Test (FRCT), a well-established assessment from cognitive psychology. Our work systematically investigates the gap between human visual cognition and state-of-the-art Multimodal Large Language Models (MLLMs).
- Comprehensive Evaluation: 4 core domains of human visual cognition
  - Visualization and Spatial Processing: Mental rotation, spatial relations
  - Perceptual and Closure: Figure-ground discrimination, pattern completion
  - Memory: Visual working memory, recognition tasks
  - Reasoning: Abstract visual reasoning, analogical thinking
- Extensive Model Coverage: 20 frontier MLLMs from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families
- Rigorous Assessment: Based on the well-established psychometric FRCT
Note: This evaluation framework is built on VLMEvalKit. For detailed information beyond this guide, please refer to their repository. We extend our sincere gratitude to the VLMEvalKit team for their excellent work.
- Clone the repository and install dependencies

```bash
git clone https://github.com/CUHK-ARISE/VisFactor.git
cd VisFactor
pip install -r requirements.txt
```
- Download the VisFactor dataset

```bash
mkdir -p ~/LMUData
cd ~/LMUData  # Downloaded files will be placed here
```

Place `VisFactor.tsv` and `VisFactor_CoT.tsv` in this directory.
- Configure API credentials

```bash
cd VisFactor/
vim .env  # or use any text editor
```

Example `.env` configuration:

```
# OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_API_BASE=https://api.openai.com/v1

# Google
GOOGLE_API_KEY=your_google_key

# Other Services
STEPAI_API_KEY=your_stepai_key
REKA_API_KEY=your_reka_key
GLMV_API_KEY=your_glmv_key
SENSENOVA_API_KEY=your_sensenova_key
MOONSHOT_API_KEY=your_kimi_key
DOUBAO_VL_KEY=your_doubao_key

# Hunyuan-Vision
HUNYUAN_SECRET_KEY=your_hunyuan_key
HUNYUAN_SECRET_ID=your_hunyuan_id

# Deployment Services
CW_API_BASE=your_congwang_base
CW_API_KEY=your_congwang_key
LMDEPLOY_API_BASE=your_lmdeploy_base
```
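Before launching a run, it can help to sanity-check the setup. A minimal sketch, assuming your shell is in the `VisFactor/` directory and OpenAI is the provider you plan to use (the file names and variable names come from the steps above):

```bash
# Confirm the dataset files are in ~/LMUData, where the framework looks for them
ls ~/LMUData/VisFactor.tsv ~/LMUData/VisFactor_CoT.tsv

# Confirm the API key you intend to use is set in .env (OpenAI shown as an example)
grep -q '^OPENAI_API_KEY=' .env && echo "OpenAI key configured" || echo "OPENAI_API_KEY missing"
```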
Run the benchmark:

```bash
python3 run.py --data VisFactor --model GeminiPro2-5 --verbose
python3 run.py --data VisFactor_CoT --model GeminiPro2-5 --verbose
```

| Argument | Type | Default | Description |
|---|---|---|---|
| `--model` | `list[str]` | required | VLM names supported in VLMEvalKit (see `supported_VLM` in `vlmeval/config.py`) |
| `--mode` | `str` | `'all'` | Evaluation mode: `'all'` (inference + evaluation) or `'infer'` (inference only) |
| `--api-nproc` | `int` | `4` | Number of threads for API requests |
| `--work-dir` | `str` | `'.'` | Directory to save evaluation results |
| `--reuse` | flag | `False` | Use previously generated results if available |
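The flags above can be combined as needed. A hedged example, where the output directory name is arbitrary and `--mode infer` defers scoring to a later run:

```bash
# Inference only, 8 parallel API workers, results saved under ./outputs,
# reusing any predictions generated by a previous run in the same work directory
python3 run.py --data VisFactor --model GeminiPro2-5 \
    --mode infer --api-nproc 8 --work-dir ./outputs --reuse --verbose
```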
We also provide a script that automatically generates additional test cases for the following subtests: CF1-3, CS1-3, MA1, S1-2, SS3, and VZ1-2.
First, prepare some images:
```bash
mkdir visfactor/Collected_Figures
```

Place your images in this folder, then run the script to generate new questions:
```bash
cd visfactor
python3 generate_images.py
```

If you find VisFactor useful in your research, please cite our paper:
```bibtex
@article{huang2025human,
  title={Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs},
  author={Huang, Jen-Tse and Dai, Dasen and Huang, Jen-Yuan and Yuan, Youliang and Liu, Xiaoyuan and Wang, Wenxuan and Jiao, Wenxiang and He, Pinjia and Tu, Zhaopeng and Duan, Haodong},
  journal={arXiv preprint arXiv:2502.16435},
  year={2025}
}
```