🏗️ Quickstart | 📊 Datasets | 🏆 Leaderboard | 📝 Report | 🖊️ Citation
This repository is the official implementation of MDK12-Bench.
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
Pengfei Zhou*, Fanrui Zhang*, Xiaopeng Peng*, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You†, Kaipeng Zhang†
* Equal Contribution
† Corresponding Author
- [2025-04-09] The technical report of MDK12-Bench is released!
MDK12-Bench is a comprehensive benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs) across multiple disciplines. Our benchmark covers a diverse range of tasks that require high-level reasoning abilities, providing a robust platform for challenging and assessing state-of-the-art MLLMs. MDK12-Bench aims to push the boundaries of multimodal intelligence by offering standardized evaluation metrics and high-quality test cases that reflect real-world reasoning scenarios.
Please refer to MDK12EvalHub to get started quickly.
After setting up the environment following the instructions in the Handbook:
1. Run quick-inference-qwenvl.py for quick inference with the vLLM project (a minimal sketch is shown below).
2. Run judge.sh to apply our judge logic to the inference results and obtain the final performance score. Make sure the API key for the judge model is set in MDK12EvalHub/.env before running judge.sh.
3. Run count_all_acc_per_disc.py to summarize the performance on a single subset.
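The snippet below is a minimal sketch of what the inference step looks like with vLLM on a single multimodal question; the checkpoint name, prompt template, image path, and question text are illustrative placeholders, and quick-inference-qwenvl.py in MDK12EvalHub handles the full benchmark format and batching.

```python
# Minimal illustrative sketch of vLLM inference on one multimodal question.
# The model name, prompt template, and file paths are placeholders only;
# see quick-inference-qwenvl.py in MDK12EvalHub for the actual pipeline.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")   # hypothetical checkpoint
params = SamplingParams(temperature=0.0, max_tokens=512)

image = Image.open("question_image.png")        # placeholder image path
question = "Which option correctly explains the phenomenon shown in the figure?"
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    f"{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# vLLM accepts a dict with the text prompt plus multimodal inputs.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    params,
)
print(outputs[0].outputs[0].text)               # the model's answer / reasoning
```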
MDK12mini-easy.zip, MDK12mini-medium.zip, and MDK12mini-hard.zip contain the cleaned data of the MDK12mini set, split by difficulty level.
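A minimal sketch for unpacking the three archives locally, assuming they have been downloaded to the working directory (the target folder name is illustrative):

```python
# Extract the three MDK12mini difficulty splits into a local data/ folder.
import zipfile

for name in ["MDK12mini-easy.zip", "MDK12mini-medium.zip", "MDK12mini-hard.zip"]:
    with zipfile.ZipFile(name) as zf:
        zf.extractall("data/")   # illustrative target directory
```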
Accuracy scores from our official leaderboard can be previewed here!
If you find MDK12-Bench useful in your project or research, please cite our paper with the following BibTeX entry. Thanks!
@misc{zhou2025mdk12,
title={MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models},
author={Pengfei Zhou and Fanrui Zhang and Xiaopeng Peng and Zhaopan Xu and Jiaxin Ai and Yansheng Qiu and Chuanhao Li and Zhen Li and Ming Li and Yukang Feng and Jianwen Sun and Haoquan Zhang and Zizhen Li and Xiaofeng Mao and Wangbo Zhao and Kai Wang and Xiaojun Chang and Wenqi Shao and Yang You and Kaipeng Zhang},
year={2025},
eprint={2504.05782},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.05782},
}