This is the official repository, which contains the training and inference code for ThinkMorph.
- [2025.12.22] The evaluation code for ThinkMorph is now accessible at VLMEvalKit_Thinkmorph.
- [2025.10.29] Our model checkpoint and training data are now accessible on Hugging Face.
- [2025.10.29] Our paper is now accessible on arXiv.
We present ThinkMorph, a unified model fine-tuned on ~24K high-quality interleaved reasoning traces across tasks, learning to generate progressive text–image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic.
Beyond strong performance on vision-centric benchmarks and robust out-of-domain generalization, ThinkMorph demonstrates emergent multimodal intelligence, including novel visual manipulation skills. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
1️⃣ Set up environment
git clone https://github.com/ThinkMorph/ThinkMorph.git
cd ThinkMorph
conda create -n thinkmorph python=3.10 -y
conda activate thinkmorph
pip install -r requirements.txt

2️⃣ Download checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/ThinkMorph-7B"
repo_id = "ThinkMorph/ThinkMorph-7B"
cache_dir = save_dir + "/cache"

snapshot_download(
    repo_id=repo_id,
    cache_dir=cache_dir,
    local_dir=save_dir,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
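After the download finishes, a quick sanity check is to list what was fetched. This is a minimal sketch, assuming the files landed in models/ThinkMorph-7B as configured above; the exact file names depend on the checkpoint layout on the Hub.

```python
from pathlib import Path

# Same save_dir as in the download snippet above.
ckpt_dir = Path("models/ThinkMorph-7B")

# Print every downloaded file with its size in MB.
for f in sorted(ckpt_dir.iterdir()):
    if f.is_file():
        print(f"{f.name:<45} {f.stat().st_size / 1e6:8.1f} MB")
```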
3️⃣ Use inference.ipynb to play with ThinkMorph!
We open-source the training data described in our paper, covering four tasks: Jigsaw Assembly, Spatial Navigation, Visual Search, and Chart Refocus. Typical examples of the four tasks are shown below. The training data can be downloaded from Hugging Face.
- Download the training dataset:

  from datasets import load_dataset

  # Jigsaw Assembly
  dataset = load_dataset("ThinkMorph/Jigsaw_Assembly", split="train")
  # Spatial Navigation
  dataset = load_dataset("ThinkMorph/Spatial_Navigation", split="train")
  # Visual Search
  dataset = load_dataset("ThinkMorph/Visual_Search", split="train")
  # Chart Refocus
  dataset = load_dataset("ThinkMorph/Chart_Refocus", split="train")
- Convert the downloaded dataset into a data format suitable for model training. For details on the data formats officially supported by Bagel, see Train. Based on Bagel's implementation, we modify the training code to support our interleaved data format; an easy-to-understand example of a parquet record is shown below, and a sketch of how such a file might be assembled follows this checklist:
{
"image_list": [problem_image_0, reasoning_image_0],
"instruction_list": [question],
"output_text_list": [f"<think>{resoning_thought_0}</think><image_start>",f"<image_end><think>{resoning_thought_1}</think><answer>{answer}</answer>"],
}
- Edit data/dataset_info.py with your own data path.
- Edit configs/example.yaml. Additionally, we provide example configuration files corresponding to the different training settings in data/configs.
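To make the record layout above concrete, here is a minimal sketch of assembling one training example and writing it to a parquet file with pandas (pyarrow engine). The question, thoughts, answer, and image paths are hypothetical placeholders, and storing images as raw PNG bytes is an assumption; only the three column names and the <think>/<image_start>/<image_end>/<answer> tags come from the example record above, so adapt the image encoding and file paths to your own data and to the training loader's expectations.

```python
import pandas as pd

# Hypothetical task data; replace with your own.
question = "Where does the red piece belong in the assembled jigsaw?"
reasoning_thought_0 = "I will assemble the pieces first to reveal the full picture."
reasoning_thought_1 = "In the assembled image, the red piece sits in the top-left corner."
answer = "top-left corner"

# Read the problem image and the intermediate reasoning image as raw bytes
# (assumed encoding; adjust to whatever the training loader expects).
with open("problem.png", "rb") as f:
    problem_image_0 = f.read()
with open("reasoning_step.png", "rb") as f:
    reasoning_image_0 = f.read()

record = {
    "image_list": [problem_image_0, reasoning_image_0],
    "instruction_list": [question],
    "output_text_list": [
        f"<think>{reasoning_thought_0}</think><image_start>",
        f"<image_end><think>{reasoning_thought_1}</think><answer>{answer}</answer>",
    ],
}

# One row per training example; write all rows to a parquet shard.
pd.DataFrame([record]).to_parquet("interleaved_reasoning_example.parquet")
```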
torchrun \
--nnodes=$num_nodes \
--node_rank=$node_rank \
--nproc_per_node=8 \
--master_addr=$master_addr \
--master_port=$master_port \
train/pretrain_unified_navit.py \
--dataset_config_file ./data/configs/interleaved_reasoning.yaml \
--model_path $model_path \
--layer_module Qwen2MoTDecoderLayer \
--finetune_from_hf True \
--auto_resume True \
--finetune-from-ema True \
--resume-from $model_path \
--results_dir $output_path \
--checkpoint_dir $ckpt_path \
--lr 1e-5 \
--num_worker 4 \
--max_latent_size 64 \
--max_num_tokens 32768 \
--mse_weight 1 \
--ce_weight 1 \
--total_steps 8000
You can replace the variables in the script with your own before running. More training scripts are provided in ./script.
See Bagel's TRAIN for more details.
Our evaluation code is open-sourced in VLMEvalKit_Thinkmorph. This repository provides evaluation support for the ThinkMorph model based on VLMEvalKit, and it also supports all the benchmarks evaluated in our paper, including VSP, VisPuzzle, ChartQA, VStar, BLINK-J, MMVP, SAT, BLINK, and CV-Bench.
| Model | Size | VSP | VisPuzzle | ChartQA | VStar | BLINK-J | MMVP | SAT | BLINK | CV-Bench |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | – | 33.50 | 43.75 | 76.34 | 61.78 | 72.67 | 84.67 | 28.00 | 60.28 | 75.61 |
| GPT-5 | – | 57.33 | 78.00 | 80.85 | 71.73 | 77.33 | 86.33 | 73.30 | 69.86 | 85.46 |
| Gemini 2.5 Flash | – | 59.33 | 47.00 | 83.79 | 70.68 | 66.00 | 80.33 | 56.00 | 67.49 | 85.07 |
| InternVL3.5 | 8B | 8.17 | 34.75 | 76.26 | 68.59 | 71.33 | 76.33 | 45.33 | 59.60 | 81.99 |
| InternVL3.5 | 38B | 20.16 | 36.50 | 80.44 | 76.96 | 80.67 | 80.33 | 49.33 | 62.65 | 85.96 |
| Qwen2.5-VL | 7B | 2.16 | 34.75 | 78.12 | 76.44 | 59.33 | 77.33 | 51.33 | 55.92 | 75.20 |
| Qwen2.5-VL | 72B | 41.83 | 40.00 | 82.03 | 85.86 | 61.33 | 82.00 | 64.67 | 61.91 | 82.54 |
| Janus-Pro | 7B | 0.00 | 33.50 | 43.08 | 38.22 | 50.67 | 63.33 | 22.00 | 38.51 | 67.83 |
| Chameleon | 7B | 0.83 | 30.50 | 5.74 | 28.27 | 0.67 | 47.67 | 10.67 | 16.52 | 36.52 |
| Bagel | 7B | 0.83* | 35.00* | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 |
| ThinkMorph | 7B | 75.83 | 79.00 | 78.10 | 67.02 | 72.00 | 80.33 | 52.67 | 60.07 | 80.82 |
| Δ (vs Bagel) | | +75.00 | +44.00 | +16.28 | +11.53 | +4.67 | +10.00 | +8.00 | +12.41 | +4.79 |
@article{gu2025thinkmorph,
title={ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning},
author={Gu, Jiawei and Hao, Yunzhuo and Wang, Huichen Will and Li, Linjie and Shieh, Michael Qizhe and Choi, Yejin and Krishna, Ranjay and Cheng, Yu},
journal={arXiv preprint arXiv:2510.27492},
year={2025}
}