3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark) is an open-source benchmark for evaluating large vision-language models (LVLMs) through simulated doctor-patient dialogues. A Doctor Agent interacts with a temperament-driven Patient Agent over medical images and structured complaints; an Assessor Agent, aligned with human expert judgments, then evaluates diagnostic and communication quality.
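Conceptually, each case runs as a simple loop between the agents. The sketch below is illustrative only: the agent interfaces (`open_consultation`, `reply`, `evaluate`, etc.) are hypothetical names, not the actual API of the classes in the `agents/` package.

```python
# Minimal sketch of the 3MDBench evaluation loop.
# All agent methods shown here are hypothetical; the real
# implementations live in the agents/ package and may differ.

def run_case(doctor, patient, assessor, image_path, complaint,
             temperament, max_turns=10):
    """Simulate one doctor-patient dialogue over a case and score it."""
    dialogue = []
    # The Doctor Agent sees the medical image and the structured complaint.
    doctor_msg = doctor.open_consultation(image_path, complaint)
    for _ in range(max_turns):
        dialogue.append(("doctor", doctor_msg))
        if doctor.has_final_diagnosis(doctor_msg):
            break  # the Doctor Agent committed to a diagnosis
        # The Patient Agent's replies are shaped by its temperament.
        patient_msg = patient.reply(doctor_msg, temperament=temperament)
        dialogue.append(("patient", patient_msg))
        doctor_msg = doctor.continue_consultation(patient_msg)
    # The Assessor Agent rates diagnostic and communication quality.
    return assessor.evaluate(dialogue)
```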
[2025-08-20] 3MDBench has been accepted to the EMNLP 2025 Main Conference 🎉
- Install dependencies from `requirements.txt`;
- Check out the 3MDBench dataset on Hugging Face!
- Go to the `scripts` folder;
- Run `run_dialogue.sh`, choosing one of the models used in the paper or implementing a custom one in the `agents/doctor_agent.py` file;
- Run `run_assessment.sh` to assess the generated dialogues, which will be stored in the `results/assessments` folder;
- Run `run_diagnoses_obtaining.sh` to extract the Doctor Agent's final diagnosis for each case, which will be stored in the `results/assessments/diags` folder;
- Explore `benchmarking/count_metrics.ipynb` to analyze the model's metrics. A typical end-to-end run is sketched below.
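Putting the steps together, a run might look like the following. The script names come from this repository, but any arguments are omitted here, so check each script for its actual interface.

```bash
# Illustrative end-to-end run; see each .sh file for its actual arguments.
pip install -r requirements.txt
cd scripts
bash run_dialogue.sh              # simulate doctor-patient dialogues
bash run_assessment.sh            # assess dialogues -> results/assessments
bash run_diagnoses_obtaining.sh   # extract diagnoses -> results/assessments/diags
```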
```bibtex
@misc{sviridov20253mdbenchmedicalmultimodalmultiagent,
      title={3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark},
      author={Ivan Sviridov and Amina Miftakhova and Artemiy Tereshchenko and Galina Zubkova and Pavel Blinov and Andrey Savchenko},
      year={2025},
      eprint={2504.13861},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2504.13861},
}
```
We appreciate your interest in our work! If you have any questions, please open an issue or contact Ivan at [email protected] or Amina at [email protected].