
🎮 Play to Generalize:
Learning to Reason Through Game Play

ViGaL

arXiv · Website · HF Model: ViGaL · HF Dataset: Snake & Rotation



🎯 Overview

We propose a novel post-training paradigm, Visual Game Learning (ViGaL), where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games such as Snake and a Rotation puzzle significantly enhances its downstream performance on multimodal reasoning benchmarks such as MathVista, MathVerse, and MathVision, without seeing any worked solutions, equations, or diagrams during RL. Remarkably, the resulting model surpasses large-scale proprietary models and models tuned directly on visual math datasets. Ablation studies indicate that distinct games unlock complementary reasoning skills, leading to improved generalization when combined. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that effectively unlock generalizable multimodal reasoning abilities in MLLMs.
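
To make the idea of a rule-based, automatically verifiable reward concrete, here is a minimal illustrative sketch of how a Snake-style move could be scored purely from game rules. It is a toy example under assumed conventions, not the repository's actual environment or reward code; the grid encoding and the `snake_step_reward` function are hypothetical.

```python
# Illustrative sketch only: a rule-based reward for a Snake-style move,
# computed purely from the game state (no annotated solutions required).
# The state encoding and function name are hypothetical, not ViGaL's API.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def snake_step_reward(head, body, food, move, grid_size=10):
    """Score a single predicted move using only the game rules."""
    dr, dc = MOVES[move]
    new_head = (head[0] + dr, head[1] + dc)

    # Hitting a wall or the snake's own body is immediately penalized.
    out_of_bounds = not (0 <= new_head[0] < grid_size and 0 <= new_head[1] < grid_size)
    if out_of_bounds or new_head in body:
        return -1.0

    # Reaching the food earns the full reward.
    if new_head == food:
        return 1.0

    # Otherwise, give a small shaping reward for moving closer to the food
    # (Manhattan distance), which is verifiable directly from the rules.
    old_dist = abs(head[0] - food[0]) + abs(head[1] - food[1])
    new_dist = abs(new_head[0] - food[0]) + abs(new_head[1] - food[1])
    return 0.1 if new_dist < old_dist else -0.1

# Example: head at (4, 4), food at (2, 4) -> "up" moves closer and earns +0.1.
print(snake_step_reward(head=(4, 4), body=[(5, 4)], food=(2, 4), move="up"))
```

Because the reward is fully determined by the game rules, correctness can be checked automatically at scale, which is what makes such games attractive pretext tasks for RL post-training.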


📦 Installation

git clone https://github.com/yunfeixie233/ViGaL.git
cd ViGaL
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
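
After installing, a quick import check can confirm that the key dependencies from the commands above resolved correctly. The snippet is illustrative only and is not part of the repository; torch is assumed to be pulled in as a dependency of vllm and flash-attn.

```python
# Illustrative sanity check for the dependencies installed above.
import importlib

for pkg in ("torch", "vllm", "flash_attn"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg}: OK ({getattr(module, '__version__', 'unknown version')})")
    except ImportError as err:
        print(f"{pkg}: not importable ({err})")
```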

🤖 ViGaL Weights

Please see ViGaL Weights.

📂 Data Preparation

You can download our training data from ViGaL training data (link coming soon).

🌐 Train

  • For Snake game:

    sh examples/scripts/train_snake.sh
  • For Rotation game:

    sh examples/scripts/train_rotation.sh
  • For Snake and Rotation games:

    sh examples/scripts/train_snake_rotation.sh

🔭 Evaluation

  • For MathVista, MathVision, and MathVerse: We use the evaluation code in the eval/ directory (a sketch of the typical answer-matching step appears after this list).

  • For CLEVR+ and Geometry: Please implement the evaluation following Reason-RFT.

  • For MMMU validation set evaluation: Please implement the evaluation following Qwen2.5-VL.

  • For other general visual evaluation: Please implement the evaluation following VLMEvalKit.
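
For the math benchmarks above, scoring generally comes down to extracting the model's final answer and matching it against the ground truth. The sketch below illustrates that kind of answer matching; it is not the repository's actual eval/ code, and the regex and normalization choices are assumptions.

```python
# Illustrative sketch of answer extraction and matching for math-style benchmarks.
# This does not reproduce the repository's eval/ code.
import re

def extract_final_answer(response: str) -> str:
    """Return the last \\boxed{...} content if present, otherwise the last number."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()

def is_correct(prediction: str, ground_truth: str) -> bool:
    """Exact string match, with a numeric fallback when both sides parse as numbers."""
    pred, gt = extract_final_answer(prediction), ground_truth.strip()
    if pred == gt:
        return True
    try:
        return abs(float(pred) - float(gt)) < 1e-6
    except ValueError:
        return False

# Example: check a chain-of-thought response whose final boxed answer is 6.
print(is_correct("The area is 3 * 4 / 2, so \\boxed{6}.", "6"))  # True
```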

🎮 Results

Improvement on Unseen Reasoning Tasks

We evaluate ViGaL, trained only on games, on out-of-domain tasks that demand reasoning across mathematics, 3D understanding in CLEVR+, geometric problem solving, and multi-discipline understanding in the MMMU series. Here are our findings:

  • Zero‑shot generalization from gameplay to math reasoning and beyond. ViGaL outperforms models specifically fine‑tuned with RL on mathematical, spatial, and multi‑discipline reasoning tasks, showing remarkable generalization capabilities despite having no exposure to in‑domain training data during RL post‑training.
  • Blending both games leads to better generalization. Visual Game Learning shows promise as a new training paradigm that can enhance generalizable reasoning performance without requiring extensive collections of domain‑specific training data. Simply expanding the diversity of games during training leads to consistent performance scaling across various visual‑reasoning problems.
  • Preserving general visual capabilities while enhancing reasoning. Experiments on more general and comprehensive multimodal benchmarks show that our gameplay‑based approach enables math generalization without compromising other visual abilities.
| Model | Avg. | Math Avg. | MathVista | MathVerse | MathVision | Geometry Avg. | GeoMath | Geo3K | CLEVR+ Avg. | CLEVR-M | S-CLEVR | Multi-Discipline Avg. | MMMU (val) | MMMU-Pro (overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Model** | | | | | | | | | | | | | | |
| GPT-4o | 51.7 | 48.1 | 61.4 | 50.2 | 30.4 | 46.8 | 50.2 | 43.5 | 51.2 | 68.1 | 34.3 | 60.5 | 69.1 | 51.9 |
| Gemini-2.0-Flash | - | 56.4 | 73.4 | 54.6 | 41.3 | 54.4 | 55.3 | 53.5 | 46.3 | 64.9 | 27.6 | - | 71.9 | - |
| **General Multimodal Language Model** | | | | | | | | | | | | | | |
| InternVL2.5-8B | 51.5 | 41.2 | 64.4 | 39.5 | 19.7 | 55.2 | 63.0 | 47.3 | 64.4 | 93.5 | 35.3 | 45.2 | 56.0 | 34.3 |
| Llava-OV-7B | - | - | 63.2 | 26.2 | - | 60.7 | 77.6 | 43.7 | 49.4 | 69.7 | 29.1 | 36.5 | 48.8 | 24.1 |
| Qwen2.5-VL-7B | 48.3 | 47.7 | 68.0 | 49.0 | 26.0 | 44.8 | 44.0 | 45.6 | 54.9 | 74.6 | 35.2 | 45.7 | 54.3 | 37.0 |
| **Multimodal Reasoning Model Post-Trained on Qwen2.5-VL-7B** | | | | | | | | | | | | | | |
| R1-Onevision-7B | 47.3 | 46.8 | 64.1 | 46.4 | 29.9 | 35.0 | 45.4 | 24.5 | 65.1 | 75.5 | 54.7 | 42.3 | 51.9 | 32.6 |
| R1-VL-7B | 47.3 | 42.7 | 63.5 | 40.0 | 24.7 | 39.0 | 42.0 | 36.1 | 68.0 | 87.4 | 48.6 | 39.7 | 50.0 | 29.4 |
| MM-Eureka-Qwen-7B | 51.1 | 50.1 | 73.0 | 50.3 | 26.9 | 28.4 | 53.1 | 3.8 | 79.3 | 98.4 | 60.1 | 46.4 | 55.8 | 36.9 |
| Reason-RFT-Zero-7B | 52.5 | 38.1 | 60.7 | 35.3 | 18.3 | 54.9 | 55.0 | 54.8 | 76.2 | 99.4 | 53.0 | 40.9 | 51.2 | 30.6 |
| VLAA-Thinker-7B | 56.5 | 48.7 | 68.0 | 51.7 | 26.4 | 53.9 | 51.1 | 56.6 | 83.4 | 94.7 | 72.1 | 40.1 | 48.2 | 31.9 |
| OpenVLThinker-7B | 56.3 | 47.8 | 70.2 | 47.9 | 25.3 | 56.4 | 49.2 | 63.5 | 82.4 | 93.8 | 71.0 | 38.5 | 54.8 | 22.1 |
| ViGaL Snake | 58.3 | 49.4 | 70.7 | 51.1 | 26.5 | 55.0 | 49.9 | 60.0 | 82.6 | 92.6 | 72.6 | 46.2 | 55.8 | 36.6 |
| ViGaL Rotation | 58.4 | 49.3 | 71.2 | 50.4 | 26.3 | 57.9 | 51.7 | 64.1 | 80.7 | 93.0 | 68.3 | 45.9 | 54.1 | 37.7 |
| ViGaL Snake + Rotation | 59.3 | 50.6 | 71.9 | 52.4 | 27.5 | 57.1 | 51.0 | 63.3 | 81.7 | 91.9 | 71.4 | 47.7 | 58.0 | 37.4 |

Main results on multimodal reasoning benchmarks. We primarily compare with multimodal reasoning models post‑trained on math data based on Qwen2.5‑VL‑7B. CLEVR‑M denotes CLEVR‑Math, and S‑CLEVR stands for Super‑CLEVR. Results from reasoning models post‑trained with corresponding in‑domain data are de‑emphasized, while our ViGaL models remain exclusively post‑trained using visual games. Best scores of post‑trained models in each "Avg." column are highlighted in bold.

| Model | Avg. | General Avg. | MuirBench | CRPE (rel.) | Vision-Centric Avg. | MMVP | RealWorldQA | MMStar | BLINK (val) | MME-P | OCR & Chart Avg. | AI2D (w. M.) | SEED-Bench-2+ | DocVQA (val) | OCRBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Model** | | | | | | | | | | | | | | | |
| GPT-4o | 74.8 | 72.3 | 68.0 | 76.6 | 69.4 | - | 75.4 | 64.7 | 68.0 | 1614 | 82.6 | 84.6 | 72.0 | 91.1 | 736 |
| **General Multimodal Language Model** | | | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 72.4 | 68.0 | 59.6 | 76.4 | 65.8 | 74.3 | 68.5 | 63.9 | 56.4 | 1698 | 83.3 | 83.9 | 70.4 | 95.7 | 864 |
| **Multimodal Reasoning Model Post-Trained on Qwen2.5-VL-7B** | | | | | | | | | | | | | | | |
| R1-Onevision-7B | - | 66.8 | 46.3 | 87.3 | 56.5 | 61.3 | 58.0 | 57.8 | 48.7 | 1504 | - | - | - | - | - |
| R1-VL-7B | 67.4 | 63.3 | 54.1 | 72.4 | 59.6 | 70.3 | 61.4 | 55.6 | 51.0 | 1657 | 79.2 | 81.7 | 66.4 | 89.4 | 81.0 |
| MM-Eureka-Qwen-7B | 71.8 | 68.9 | 61.1 | 76.7 | 65.1 | 74.3 | 66.1 | 65.9 | 54.0 | 1626 | 81.5 | 84.3 | 68.2 | 92.0 | 87.0 |
| Reason-RFT-Zero-7B | 68.4 | 66.9 | 58.5 | 75.2 | 58.5 | 58.0 | 65.3 | 59.1 | 51.6 | 1653 | 79.8 | 83.3 | 68.0 | 88.1 | 82.0 |
| VLAA-Thinker-7B | 69.7 | 65.9 | 57.1 | 74.6 | 62.6 | 71.6 | 65.4 | 60.4 | 53.0 | 1593 | 80.6 | 83.4 | 67.4 | 90.9 | 84.5 |
| OpenVLThinker-7B | - | 64.3 | 52.8 | 75.8 | 50.4 | 32.3 | 60.2 | 59.1 | 49.9 | 1513 | - | - | - | - | - |
| ViGaL Snake + Rotation | 72.2 | 68.6 | 60.5 | 76.7 | 65.7 | 74.6 | 67.3 | 65.4 | 55.6 | 1685 | 82.2 | 84.8 | 69.1 | 92.7 | 86.6 |

Main results on multimodal language benchmarks targeting more general and comprehensive visual ability. We compare with models post‑trained on Qwen2.5‑VL‑7B. Best category averages are highlighted in bold. Note that MME-P is excluded from the vision‑centric category average due to scale differences.

📜 Citation

If you find ViGaL useful for your research and applications, please cite using this BibTeX:

@article{xie2025play,
  title     = {Play to Generalize: Learning to Reason Through Game Play},
  author    = {Xie, Yunfei and Ma, Yinsong and Lan, Shiyi and Yuille, Alan and Xiao, Junfei and Wei, Chen},
  journal   = {arXiv preprint arXiv:2506.08011},
  year      = {2025},
}

🎓 Acknowledgement

  • MM-EUREKA: Our codebase is built on top of MM-EUREKA.
