This is the official repository, which contains the training and inference code for ThinkMorph.
- [2025.12.22] The evaluation code for ThinkMorph is now accessible at VLMEvalKit_Thinkmorph.
- [2025.10.29] Our model checkpoint and training data are now accessible on Hugging Face.
- [2025.10.29] Our paper is now accessible on arXiv.
We present ThinkMorph, a unified model fine-tuned on ~24K high-quality interleaved reasoning traces across tasks, learning to generate progressive text–image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
Beyond strong performance on vision-centric benchmarks and robust out-of-domain generalization, ThinkMorph demonstrates emergent multimodal intelligence, including novel visual manipulation skills. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
1️⃣ Set up environment
git clone https://github.com/ThinkMorph/ThinkMorph.git
cd ThinkMorph
conda create -n thinkmorph python=3.10 -y
conda activate thinkmorph
pip install -r requirements.txt

2️⃣ Download checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/ThinkMorph-7B"
repo_id = "ThinkMorph/ThinkMorph-7B"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    cache_dir=cache_dir,
    local_dir=save_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
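After the download finishes, a quick sanity check is to list what was fetched. This is a minimal sketch, assuming the files landed in models/ThinkMorph-7B as configured above; the exact file names depend on the checkpoint layout on the Hub.

```python
from pathlib import Path

# Same save_dir as in the download snippet above.
ckpt_dir = Path("models/ThinkMorph-7B")

# Print every downloaded file with its size in MB.
for f in sorted(ckpt_dir.iterdir()):
    if f.is_file():
        print(f"{f.name:<45} {f.stat().st_size / 1e6:8.1f} MB")
```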
3️⃣ Use inference.ipynb to play with ThinkMorph!
We open-source the training data described in our paper, covering four tasks: Jigsaw Assembly, Spatial Navigation, Visual Search, and Chart Refocus. Typical examples of the four tasks are shown below. The training data can be downloaded from Hugging Face.
- Download the training dataset:

  from datasets import load_dataset

  # Jigsaw Assembly
  dataset = load_dataset("ThinkMorph/Jigsaw_Assembly", split="train")
  # Spatial Navigation
  dataset = load_dataset("ThinkMorph/Spatial_Navigation", split="train")
  # Visual Search
  dataset = load_dataset("ThinkMorph/Visual_Search", split="train")
  # Chart Refocus
  dataset = load_dataset("ThinkMorph/Chart_Refocus", split="train")
- Convert the downloaded dataset into a data format suitable for model training. For details on the data formats officially supported by Bagel, see Train. Based on Bagel's implementation, we modify the training code to support our interleaved data format; an easy-to-understand example of a parquet record is shown below, and a sketch of how such a file might be assembled follows this checklist:
{
"image_list": [problem_image_0, reasoning_image_0],
"instruction_list": [question],
"output_text_list": [f"<think>{resoning_thought_0}</think><image_start>",f"<image_end><think>{resoning_thought_1}</think><answer>{answer}</answer>"],
}
- Edit data/dataset_info.py with your own data path.
- Edit configs/example.yaml. Additionally, we provide example configuration files corresponding to the different training settings in data/configs.
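To make the record layout above concrete, here is a minimal sketch of assembling one training example and writing it to a parquet file with pandas (pyarrow engine). The question, thoughts, answer, and image paths are hypothetical placeholders, and storing images as raw PNG bytes is an assumption; only the three column names and the <think>/<image_start>/<image_end>/<answer> tags come from the example record above, so adapt the image encoding and file paths to your own data and to the training loader's expectations.

```python
import pandas as pd

# Hypothetical task data; replace with your own.
question = "Where does the red piece belong in the assembled jigsaw?"
reasoning_thought_0 = "I will assemble the pieces first to reveal the full picture."
reasoning_thought_1 = "In the assembled image, the red piece sits in the top-left corner."
answer = "top-left corner"

# Read the problem image and the intermediate reasoning image as raw bytes
# (assumed encoding; adjust to whatever the training loader expects).
with open("problem.png", "rb") as f:
    problem_image_0 = f.read()
with open("reasoning_step.png", "rb") as f:
    reasoning_image_0 = f.read()

record = {
    "image_list": [problem_image_0, reasoning_image_0],
    "instruction_list": [question],
    "output_text_list": [
        f"<think>{reasoning_thought_0}</think><image_start>",
        f"<image_end><think>{reasoning_thought_1}</think><answer>{answer}</answer>",
    ],
}

# One row per training example; write all rows to a parquet shard.
pd.DataFrame([record]).to_parquet("interleaved_reasoning_example.parquet")
```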
torchrun \
--nnodes=$num_nodes \
--node_rank=$node_rank \
--nproc_per_node=8 \
--master_addr=$master_addr \
--master_port=$master_port \
train/pretrain_unified_navit.py \
--dataset_config_file ./data/configs/interleaved_reasoning.yaml \
--model_path $model_path \
--layer_module Qwen2MoTDecoderLayer \
--finetune_from_hf True \
--auto_resume True \
--finetune-from-ema True \
--resume-from $model_path \
--results_dir $output_path \
--checkpoint_dir $ckpt_path \
--lr 1e-5 \
--num_worker 4 \
--max_latent_size 64 \
--max_num_tokens 32768 \
--mse_weight 1 \
--ce_weight 1 \
--total_steps 8000
You can replace the variables in the script with your own before running. More training scripts are provided in ./script.
See Bagel's TRAIN for more details.
Our evaluation code is open-sourced in VLMEvalKit_Thinkmorph. This repository provides evaluation support for the ThinkMorph model based on VLMEvalKit, and it also supports all the benchmarks evaluated in our paper, including VSP, VisPuzzle, ChartQA, VStar, BLINK-J, MMVP, SAT, BLINK, and CV-Bench.
| Model | Size | VSP | VisPuzzle | ChartQA | VStar | BLINK-J | MMVP | SAT | BLINK | CV-Bench |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | – | 33.50 | 43.75 | 76.34 | 61.78 | 72.67 | 84.67 | 28.00 | 60.28 | 75.61 |
| GPT-5 | – | 57.33 | 78.00 | 80.85 | 71.73 | 77.33 | 86.33 | 73.30 | 69.86 | 85.46 |
| Gemini 2.5 Flash | – | 59.33 | 47.00 | 83.79 | 70.68 | 66.00 | 80.33 | 56.00 | 67.49 | 85.07 |
| InternVL3.5 | 8B | 8.17 | 34.75 | 76.26 | 68.59 | 71.33 | 76.33 | 45.33 | 59.60 | 81.99 |
| InternVL3.5 | 38B | 20.16 | 36.50 | 80.44 | 76.96 | 80.67 | 80.33 | 49.33 | 62.65 | 85.96 |
| Qwen2.5-VL | 7B | 2.16 | 34.75 | 78.12 | 76.44 | 59.33 | 77.33 | 51.33 | 55.92 | 75.20 |
| Qwen2.5-VL | 72B | 41.83 | 40.00 | 82.03 | 85.86 | 61.33 | 82.00 | 64.67 | 61.91 | 82.54 |
| Janus-Pro | 7B | 0.00 | 33.50 | 43.08 | 38.22 | 50.67 | 63.33 | 22.00 | 38.51 | 67.83 |
| Chameleon | 7B | 0.83 | 30.50 | 5.74 | 28.27 | 0.67 | 47.67 | 10.67 | 16.52 | 36.52 |
| Bagel | 7B | 0.83* | 35.00* | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 |
| ThinkMorph | 7B | 75.83 | 79.00 | 78.10 | 67.02 | 72.00 | 80.33 | 52.67 | 60.07 | 80.82 |
| Δ (vs Bagel) | | +75.00 | +44.00 | +16.28 | +11.53 | +4.67 | +10.00 | +8.00 | +12.41 | +4.79 |
@article{gu2025thinkmorph,
title={ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning},
author={Gu, Jiawei and Hao, Yunzhuo and Wang, Huichen Will and Li, Linjie and Shieh, Michael Qizhe and Choi, Yejin and Krishna, Ranjay and Cheng, Yu},
journal={arXiv preprint arXiv:2510.27492},
year={2025}
}