
MM-PRM logo

MM-PRM

📖 Paper | 📊 MM-K12 | 🤗 MM-PRM


MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision


🎯Overview

While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our code and data at MM-PRM.
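To illustrate the Best-of-N setup described above, here is a minimal sketch of how a PRM can rerank candidate reasoning paths. This is an illustration only, not the released implementation: the scoring callable, the use of the minimum step score as the aggregate, and all names are assumptions.

# Minimal sketch of Best-of-N reranking with a process reward model (PRM).
# Assumptions: `prm_score_steps` is a hypothetical function returning one
# score in [0, 1] per reasoning step; aggregating by the minimum step score
# is one common choice, not necessarily the one used in this repository.
from typing import Callable, List

def best_of_n(
    candidates: List[List[str]],                       # N candidate solutions, each a list of steps
    prm_score_steps: Callable[[List[str]], List[float]],
) -> List[str]:
    """Return the candidate whose weakest step is scored highest by the PRM."""
    def aggregate(steps: List[str]) -> float:
        scores = prm_score_steps(steps)
        return min(scores) if scores else 0.0
    return max(candidates, key=aggregate)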

🗞️ News

📊 MM-K12 Dataset

We have released the MM-K12 dataset at MM-K12.

🤖 Models

Case Study

Figure 1 | Qualitative example of MM-PRM accurately identifying erroneous steps in a multimodal reasoning process.

Performance

Figure 2 | Performance improvements across various benchmarks when applying MM-PRM to different models.

🏁 Getting Started

📦 Installation

git clone https://github.com/ModalMinds/MM-PRM.git
cd MM-PRM
pip install -r requirements.txt

# install flash-attn==2.3.6:

pip install flash-attn==2.3.6 --no-build-isolation

# Alternatively, you can compile from source:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git checkout v2.3.6
python setup.py install

📂 Data Pipeline

  1. Seed dataset preparation

    To begin, prepare a seed dataset consisting of verifiable problems. Each example should be formatted as a JSON object containing the following fields:

    [
        {
            "id": "unique identifier for the problem",
            "question": "problem statement",
            "correct_answer": "ground-truth final answer for evaluation and verification",
            "image_path": "/path/to/image.png"
        },
        ...
    ]

    This dataset will be used as input to the data pipeline to generate annotated solution trees with step-wise correctness labels.

    To enable parallel data generation, you need to split the seed dataset into smaller chunks.

    cd data_pipeline
    python process_json.py
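    For reference, a chunking step of this kind could look like the sketch below. This is an illustration, not the contents of process_json.py; the file names and chunk count are assumptions.

    # Hypothetical illustration of splitting the seed dataset into chunks for
    # parallel generation; process_json.py may differ in naming and options.
    import json

    NUM_CHUNKS = 8  # assumption: one chunk per worker

    with open("seed_dataset.json", "r", encoding="utf-8") as f:
        problems = json.load(f)

    for i in range(NUM_CHUNKS):
        chunk = problems[i::NUM_CHUNKS]  # round-robin split keeps chunks balanced
        with open(f"seed_chunk_{i}.json", "w", encoding="utf-8") as out:
            json.dump(chunk, out, ensure_ascii=False, indent=2)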
  2. API endpoint setup (Optional)

    The data generation process requires an API endpoint to automatically verify whether the final answer in a rollout is correct. You can deploy a model (e.g., Qwen2.5) locally to act as the answer judge.

    We recommend using vLLM to deploy a local model.

  3. Run data pipeline

    Once everything is set up, you can run the data pipeline to generate step-level supervision data.

    Before running, ensure that all necessary parameters are correctly set in the script or passed through the environment.

    sh run_data_pipeline.sh
  4. Sampling training data from annotated reasoning trees

    After generating annotated reasoning trees, you need to sample step-by-step solution paths from these trees to construct the training data for the Process Reward Model (PRM). This can be done using the script:

    python traverse.py

    The next step is to convert this data into the format required for PRM training. Use the following script to perform the formatting:

    python prm_data_format.py
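    To make the step-wise labels concrete, the sketch below shows one way soft labels can be derived from rollout statistics attached to each step: the label of a step is the fraction of rollouts through it that reached the correct final answer (a Monte Carlo estimate). The tree schema and field names here are hypothetical; traverse.py and prm_data_format.py may store things differently.

    # Hypothetical sketch of turning per-step rollout statistics into soft labels.
    # Field names ('correct_rollouts', 'total_rollouts') are assumptions.
    from typing import Dict, List

    def soft_labels(path: List[Dict]) -> List[float]:
        """path: list of step nodes along one sampled solution path."""
        labels = []
        for node in path:
            total = node.get("total_rollouts", 0)
            labels.append(node.get("correct_rollouts", 0) / total if total else 0.0)
        return labels

    # Example: a 3-step path where the last step rarely leads to a correct answer.
    example_path = [
        {"correct_rollouts": 14, "total_rollouts": 16},
        {"correct_rollouts": 9, "total_rollouts": 16},
        {"correct_rollouts": 1, "total_rollouts": 16},
    ]
    print(soft_labels(example_path))  # [0.875, 0.5625, 0.0625]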

🌐 Start PRM Training

Create a JSON file in internvl_chat/shell/data/

The format for the JSON file should be:

{
  "your-custom-prm_dataset": {
    "root": "/path/to/the/image/root",
    "annotation": "/path/to/the/jsonl/annotation",
    "data_augment": false,
    "repeat_time": 1,
    "length": "number of samples in the dataset"
  }
}
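If it helps, the "length" field can be filled by counting the lines in the JSONL annotation file. A small helper along these lines would do; the paths, dataset key, and output file name below are placeholders.

# Hypothetical helper that counts samples in a JSONL annotation file and writes
# the dataset meta entry expected by the training shell script. Paths are placeholders.
import json

annotation_path = "/path/to/the/jsonl/annotation"

with open(annotation_path, "r", encoding="utf-8") as f:
    num_samples = sum(1 for line in f if line.strip())

meta = {
    "your-custom-prm_dataset": {
        "root": "/path/to/the/image/root",
        "annotation": annotation_path,
        "data_augment": False,
        "repeat_time": 1,
        "length": num_samples,
    }
}

with open("internvl_chat/shell/data/prm_meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f, indent=2)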

Once the dataset configuration is in place, you can start training the PRM with:

GPUS=8 sh shell/internvl2.5/2nd_finetune/internvl2_5_38b_dynamic_res_2nd_finetune_full_prm.sh

📊 Evaluation

We provide our evaluation code in the eval/ directory.

⭐ Starchart

Star History Chart

🤝 Contribution

If you want to contribute, please feel free to make a pull request or create an issue.

Please refer to CONTRIBUTING.md before you dive in!

📬 Contact

If you have any questions or would like to engage with our community, feel free to scan the QR code below to join our WeChat group.

WeChat group QR code

🎓 Acknowledgements

We acknowledge the outstanding open-source contributions from OpenR and vLLM. We also extend our gratitude to InternVL for their open-source techniques and base models, which have enabled us to further our exploration.

📜 Citation

@article{du2025mmprm,
      title={MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision},
      author={Lingxiao Du and Fanqing Meng and Zongkai Liu and Zhixiang Zhou and Ping Luo and Qiaosheng Zhang and Wenqi Shao},
      year={2025},
      journal={arXiv preprint arXiv:2505.13427},
}
@article{meng2025mmeureka,
      title={MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning},
      author={Fanqing Meng and Lingxiao Du and Zongkai Liu and Zhixiang Zhou and Quanfeng Lu and Daocheng Fu and Tiancheng Han and Botian Shi and Wenhai Wang and Junjun He and Kaipeng Zhang and Ping Luo and Yu Qiao and Qiaosheng Zhang and Wenqi Shao},
      year={2025},
      journal={arXiv preprint arXiv:2503.07365},
}
@article{liu2025cpgd,
      title={CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models},
      author={Zongkai Liu and Fanqing Meng and Lingxiao Du and Zhixiang Zhou and Chao Yu and Wenqi Shao and Qiaosheng Zhang},
      year={2025},
      journal={arXiv preprint arXiv:2505.12504},
}
