MDK12-Bench

🏗️ Quickstart | 📊 Datasets | 🏆 Leaderboard | 📝 Report | 🖊️ Citation

This repository is the official implementation of MDK12-Bench.

MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models
Pengfei Zhou*, Fanrui Zhang*, Xiaopeng Peng*, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, Jianwen Sun, Haoquan Zhang, Zizhen Li, Xiaofeng Mao, Zekai Li, Wangbo Zhao, Kai Wang, Xiaojun Chang, Wenqi Shao, Yang You†, Kaipeng Zhang†
* Equal Contribution
† Corresponding Author

🆕 News

  • [2025-04-09] The technical report of MDK12-Bench is released!

📖 Introduction

MDK12-Bench is a comprehensive benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs) across multiple disciplines. Our benchmark covers a diverse range of tasks that require high-level reasoning abilities, providing a robust platform for challenging and assessing state-of-the-art MLLMs. MDK12-Bench aims to push the boundaries of multimodal intelligence by offering standardized evaluation metrics and high-quality test cases that reflect real-world reasoning scenarios.

🏗️ Quick Start

Please refer to MDK12EvalHub to get started.

After setting up the environment following the instructions in the Handbook, run quick-inference-qwenvl.py for fast inference with vLLM. Then run judge.sh to apply our judge logic, which evaluates the inference results and produces the final performance score. Make sure the API for the judge model is set in MDK12EvalHub/.env before you run judge.sh. Finally, use count_all_acc_per_disc.py to summarize the performance on one subset.
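The judge step reads the judge-model API settings from MDK12EvalHub/.env. A minimal sketch of that file, assuming an OpenAI-compatible judge endpoint; the variable names below are illustrative, so confirm the exact keys against the Handbook:

```shell
# MDK12EvalHub/.env -- illustrative only; check the Handbook for the
# exact variable names MDK12EvalHub expects.
OPENAI_API_KEY=sk-...                         # API key for the judge model
OPENAI_API_BASE=https://api.example.com/v1    # judge endpoint, if not the default
```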

📊 Datasets

MDK12mini-easy.zip, MDK12mini-medium.zip, and MDK12mini-hard.zip contain the fully cleaned data of the MDK12mini set, split by difficulty level.
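The MDK12mini subsets ship as zip archives, so samples can be read directly with the standard library. The sketch below builds a tiny in-memory archive standing in for one of the zips and reads its JSON records back out; the file layout and field names here are hypothetical, not the benchmark's actual schema:

```python
import io
import json
import zipfile

# Build a tiny in-memory archive standing in for MDK12mini-easy.zip
# (the member path and sample schema are illustrative only).
buf = io.BytesIO()
sample = {"question": "2 + 2 = ?", "answer": "4", "discipline": "math"}
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("samples/0001.json", json.dumps(sample))

# Read every JSON sample back out of the archive.
with zipfile.ZipFile(buf) as zf:
    records = [
        json.loads(zf.read(name))
        for name in zf.namelist()
        if name.endswith(".json")
    ]

print(records[0]["discipline"])  # math
```

For the real archives, replace the in-memory buffer with the downloaded zip path and adjust the member filter to the actual file layout.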

🏆 Leaderboard

A preview of the accuracy scores on our official leaderboard is shown below.

MDK12-Bench Preview

🖊️ Citation

If you find MDK12-Bench useful in your project or research, please cite our paper using the following BibTeX entry. Thanks!

@misc{zhou2025mdk12,
      title={MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in Multimodal Large Language Models}, 
      author={Pengfei Zhou and Fanrui Zhang and Xiaopeng Peng and Zhaopan Xu and Jiaxin Ai and Yansheng Qiu and Chuanhao Li and Zhen Li and Ming Li and Yukang Feng and Jianwen Sun and Haoquan Zhang and Zizhen Li and Xiaofeng Mao and Wangbo Zhao and Kai Wang and Xiaojun Chang and Wenqi Shao and Yang You and Kaipeng Zhang},
      year={2025},
      eprint={2504.05782},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.05782}, 
}
