
🎮 Play to Generalize:
Learning to Reason Through Game Play

ViGaL

arXiv · Website · HF Model: ViGaL · HF Dataset: Snake & Rotation



🎯 Overview

We propose a novel post-training paradigm, Visual Game Learning (ViGaL), where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games such as Snake and a Rotation puzzle significantly enhances its downstream performance on multimodal reasoning benchmarks such as MathVista, MathVerse, and MathVision, without seeing any worked solutions, equations, or diagrams during RL. Remarkably, the resulting model surpasses large-scale proprietary models and models tuned directly on visual math datasets. Ablation studies indicate that distinct games unlock complementary reasoning skills, leading to improved generalization when combined. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that effectively unlock generalizable multimodal reasoning abilities in MLLMs.
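
To make the idea of a rule-based, automatically verifiable reward concrete, here is a minimal illustrative sketch of how a Snake-style move could be scored purely from game rules. It is a toy example under assumed conventions, not the repository's actual environment or reward code; the grid encoding and the `snake_step_reward` function are hypothetical.

```python
# Illustrative sketch only: a rule-based reward for a Snake-style move,
# computed purely from the game state (no annotated solutions required).
# The state encoding and function name are hypothetical, not ViGaL's API.

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def snake_step_reward(head, body, food, move, grid_size=10):
    """Score a single predicted move using only the game rules."""
    dr, dc = MOVES[move]
    new_head = (head[0] + dr, head[1] + dc)

    # Hitting a wall or the snake's own body is immediately penalized.
    out_of_bounds = not (0 <= new_head[0] < grid_size and 0 <= new_head[1] < grid_size)
    if out_of_bounds or new_head in body:
        return -1.0

    # Reaching the food earns the full reward.
    if new_head == food:
        return 1.0

    # Otherwise, give a small shaping reward for moving closer to the food
    # (Manhattan distance), which is verifiable directly from the rules.
    old_dist = abs(head[0] - food[0]) + abs(head[1] - food[1])
    new_dist = abs(new_head[0] - food[0]) + abs(new_head[1] - food[1])
    return 0.1 if new_dist < old_dist else -0.1

# Example: head at (4, 4), food at (2, 4) -> "up" moves closer and earns +0.1.
print(snake_step_reward(head=(4, 4), body=[(5, 4)], food=(2, 4), move="up"))
```

Because the reward is fully determined by the game rules, correctness can be checked automatically at scale, which is what makes such games attractive pretext tasks for RL post-training.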


📦 Installation

git clone https://github.com/yunfeixie233/ViGaL.git
cd ViGaL
pip install -e .[vllm]
pip install flash_attn --no-build-isolation
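
After installing, a quick import check can confirm that the key dependencies from the commands above resolved correctly. The snippet is illustrative only and is not part of the repository; torch is assumed to be pulled in as a dependency of vllm and flash-attn.

```python
# Illustrative sanity check for the dependencies installed above.
import importlib

for pkg in ("torch", "vllm", "flash_attn"):
    try:
        module = importlib.import_module(pkg)
        print(f"{pkg}: OK ({getattr(module, '__version__', 'unknown version')})")
    except ImportError as err:
        print(f"{pkg}: not importable ({err})")
```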

🤖 ViGaL Weights

Please see ViGaL Weights.

📂 Data Preparation

You can download our training data from ViGaL training data (link coming soon).

🌐 Train

  • For Snake game:

    sh examples/scripts/train_snake.sh
  • For Rotation game:

    sh examples/scripts/train_rotation.sh
  • For Snake and Rotation games:

    sh examples/scripts/train_snake_rotation.sh

🔭 Evaluation

  • For MathVista, MathVision, and MathVerse: We use the evaluation code in the eval/ directory (a sketch of the typical answer-matching step appears after this list).

  • For CLEVR+ and Geometry: Please implement the evaluation following Reason-RFT.

  • For MMMU validation set evaluation: Please implement the evaluation following Qwen2.5-VL.

  • For other general visual evaluation: Please implement the evaluation following VLMEvalKit.
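
For the math benchmarks above, scoring generally comes down to extracting the model's final answer and matching it against the ground truth. The sketch below illustrates that kind of answer matching; it is not the repository's actual eval/ code, and the regex and normalization choices are assumptions.

```python
# Illustrative sketch of answer extraction and matching for math-style benchmarks.
# This does not reproduce the repository's eval/ code.
import re

def extract_final_answer(response: str) -> str:
    """Return the last \\boxed{...} content if present, otherwise the last number."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed[-1].strip()
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()

def is_correct(prediction: str, ground_truth: str) -> bool:
    """Exact string match, with a numeric fallback when both sides parse as numbers."""
    pred, gt = extract_final_answer(prediction), ground_truth.strip()
    if pred == gt:
        return True
    try:
        return abs(float(pred) - float(gt)) < 1e-6
    except ValueError:
        return False

# Example: check a chain-of-thought response whose final boxed answer is 6.
print(is_correct("The area is 3 * 4 / 2, so \\boxed{6}.", "6"))  # True
```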

🎮 Results

Improvement on Unseen Reasoning Tasks

We evaluate ViGaL, trained only on games, on out-of-domain tasks that demand reasoning across mathematics, 3D understanding in CLEVR+, geometric problem solving, and multi-discipline understanding in the MMMU series. Here are our findings:

  • Zero‑shot generalization from gameplay to math reasoning and beyond. ViGaL outperforms models specifically fine‑tuned with RL on mathematical, spatial, and multi‑discipline reasoning tasks, showing remarkable generalization capabilities despite having no exposure to in‑domain training data during RL post‑training.
  • Blending both games leads to better generalization. Visual Game Learning shows promise as a new training paradigm that can enhance generalizable reasoning performance without requiring extensive collections of domain‑specific training data. Simply expanding the diversity of games during training leads to consistent performance scaling across various visual‑reasoning problems.
  • Preserving general visual capabilities while enhancing reasoning. Experiments on more general and comprehensive multimodal benchmarks show that our gameplay‑based approach enables math generalization without compromising other visual abilities.
| Model | Avg. | Math Avg. | MathVista | MathVerse | MathVision | Geometry Avg. | GeoMath | Geo3K | CLEVR+ Avg. | CLEVR-M | S-CLEVR | Multi-Discipline Avg. | MMMU (val) | MMMU-Pro (overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Model** | | | | | | | | | | | | | | |
| GPT-4o | 51.7 | 48.1 | 61.4 | 50.2 | 30.4 | 46.8 | 50.2 | 43.5 | 51.2 | 68.1 | 34.3 | 60.5 | 69.1 | 51.9 |
| Gemini-2.0-Flash | - | 56.4 | 73.4 | 54.6 | 41.3 | 54.4 | 55.3 | 53.5 | 46.3 | 64.9 | 27.6 | - | 71.9 | - |
| **General Multimodal Language Model** | | | | | | | | | | | | | | |
| InternVL2.5-8B | 51.5 | 41.2 | 64.4 | 39.5 | 19.7 | 55.2 | 63.0 | 47.3 | 64.4 | 93.5 | 35.3 | 45.2 | 56.0 | 34.3 |
| Llava-OV-7B | - | - | 63.2 | 26.2 | - | 60.7 | 77.6 | 43.7 | 49.4 | 69.7 | 29.1 | 36.5 | 48.8 | 24.1 |
| Qwen2.5-VL-7B | 48.3 | 47.7 | 68.0 | 49.0 | 26.0 | 44.8 | 44.0 | 45.6 | 54.9 | 74.6 | 35.2 | 45.7 | 54.3 | 37.0 |
| **Multimodal Reasoning Model Post-Trained on Qwen2.5-VL-7B** | | | | | | | | | | | | | | |
| R1-Onevision-7B | 47.3 | 46.8 | 64.1 | 46.4 | 29.9 | 35.0 | 45.4 | 24.5 | 65.1 | 75.5 | 54.7 | 42.3 | 51.9 | 32.6 |
| R1-VL-7B | 47.3 | 42.7 | 63.5 | 40.0 | 24.7 | 39.0 | 42.0 | 36.1 | 68.0 | 87.4 | 48.6 | 39.7 | 50.0 | 29.4 |
| MM-Eureka-Qwen-7B | 51.1 | 50.1 | 73.0 | 50.3 | 26.9 | 28.4 | 53.1 | 3.8 | 79.3 | 98.4 | 60.1 | 46.4 | 55.8 | 36.9 |
| Reason-RFT-Zero-7B | 52.5 | 38.1 | 60.7 | 35.3 | 18.3 | 54.9 | 55.0 | 54.8 | 76.2 | 99.4 | 53.0 | 40.9 | 51.2 | 30.6 |
| VLAA-Thinker-7B | 56.5 | 48.7 | 68.0 | 51.7 | 26.4 | 53.9 | 51.1 | 56.6 | 83.4 | 94.7 | 72.1 | 40.1 | 48.2 | 31.9 |
| OpenVLThinker-7B | 56.3 | 47.8 | 70.2 | 47.9 | 25.3 | 56.4 | 49.2 | 63.5 | 82.4 | 93.8 | 71.0 | 38.5 | 54.8 | 22.1 |
| ViGaL Snake | 58.3 | 49.4 | 70.7 | 51.1 | 26.5 | 55.0 | 49.9 | 60.0 | 82.6 | 92.6 | 72.6 | 46.2 | 55.8 | 36.6 |
| ViGaL Rotation | 58.4 | 49.3 | 71.2 | 50.4 | 26.3 | 57.9 | 51.7 | 64.1 | 80.7 | 93.0 | 68.3 | 45.9 | 54.1 | 37.7 |
| ViGaL Snake + Rotation | 59.3 | 50.6 | 71.9 | 52.4 | 27.5 | 57.1 | 51.0 | 63.3 | 81.7 | 91.9 | 71.4 | 47.7 | 58.0 | 37.4 |

Main results on multimodal reasoning benchmarks. We primarily compare with multimodal reasoning models post‑trained on math data based on Qwen2.5‑VL‑7B. CLEVR‑M denotes CLEVR‑Math, and S‑CLEVR stands for Super‑CLEVR. Results from reasoning models post‑trained with corresponding in‑domain data are de‑emphasized, while our ViGaL models remain exclusively post‑trained using visual games. Best scores of post‑trained models in each "Avg." column are highlighted in bold.

| Model | Avg. | General Avg. | MuirBench | CRPE (rel.) | Vision-Centric Avg. | MMVP | RealWorldQA | MMStar | BLINK (val) | MME-P | OCR & Chart Avg. | AI2D (w. M.) | SEED-Bench-2+ | DocVQA (val) | OCRBench |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Model** | | | | | | | | | | | | | | | |
| GPT-4o | 74.8 | 72.3 | 68.0 | 76.6 | 69.4 | - | 75.4 | 64.7 | 68.0 | 1614 | 82.6 | 84.6 | 72.0 | 91.1 | 736 |
| **General Multimodal Language Model** | | | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 72.4 | 68.0 | 59.6 | 76.4 | 65.8 | 74.3 | 68.5 | 63.9 | 56.4 | 1698 | 83.3 | 83.9 | 70.4 | 95.7 | 864 |
| **Multimodal Reasoning Model Post-Trained on Qwen2.5-VL-7B** | | | | | | | | | | | | | | | |
| R1-Onevision-7B | - | 66.8 | 46.3 | 87.3 | 56.5 | 61.3 | 58.0 | 57.8 | 48.7 | 1504 | - | - | - | - | - |
| R1-VL-7B | 67.4 | 63.3 | 54.1 | 72.4 | 59.6 | 70.3 | 61.4 | 55.6 | 51.0 | 1657 | 79.2 | 81.7 | 66.4 | 89.4 | 81.0 |
| MM-Eureka-Qwen-7B | 71.8 | 68.9 | 61.1 | 76.7 | 65.1 | 74.3 | 66.1 | 65.9 | 54.0 | 1626 | 81.5 | 84.3 | 68.2 | 92.0 | 87.0 |
| Reason-RFT-Zero-7B | 68.4 | 66.9 | 58.5 | 75.2 | 58.5 | 58.0 | 65.3 | 59.1 | 51.6 | 1653 | 79.8 | 83.3 | 68.0 | 88.1 | 82.0 |
| VLAA-Thinker-7B | 69.7 | 65.9 | 57.1 | 74.6 | 62.6 | 71.6 | 65.4 | 60.4 | 53.0 | 1593 | 80.6 | 83.4 | 67.4 | 90.9 | 84.5 |
| OpenVLThinker-7B | - | 64.3 | 52.8 | 75.8 | 50.4 | 32.3 | 60.2 | 59.1 | 49.9 | 1513 | - | - | - | - | - |
| ViGaL Snake + Rotation | 72.2 | 68.6 | 60.5 | 76.7 | 65.7 | 74.6 | 67.3 | 65.4 | 55.6 | 1685 | 82.2 | 84.8 | 69.1 | 92.7 | 86.6 |

Main results on multimodal language benchmarks targeting more general and comprehensive visual ability. We compare with models post‑trained on Qwen2.5‑VL‑7B. Best category averages are highlighted in bold. Note that MME-P is excluded from the vision‑centric category average due to scale differences.

📜 Citation

If you find ViGaL useful for your research and applications, please cite using this BibTeX:

@article{xie2025play,
  title     = {Play to Generalize: Learning to Reason Through Game Play},
  author    = {Xie, Yunfei and Ma, Yinsong and Lan, Shiyi and Yuille, Alan and Xiao, Junfei and Wei, Chen},
  journal   = {arXiv preprint arXiv:2506.08011},
  year      = {2025},
}

🎓 Acknowledgement

  • MM-EUREKA: Our codebase is built on top of MM-EUREKA.
