Kuicai Dong* · Yujing Chang* · Shijie Huang · Yasheng Wang · Ruiming Tang · Yong Liu
📖Paper | 🏠Homepage | 🤗Huggingface | 👉Github
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence integration and selection. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that combine text with relevant visual elements. Through large-scale experiments with 60 language/vision models and 14 retrieval systems, we identify persistent challenges in multimodal evidence handling. Key findings reveal proprietary vision-language models show moderate advantages over text-only models, while open-source alternatives trail significantly. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.
Download images.zip and unzip it into ./dataset/.
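If you prefer to script the unpacking step, here is a minimal sketch using only the standard library; it assumes images.zip has already been downloaded into the repository root:

```python
import zipfile
from pathlib import Path

archive = Path("images.zip")     # downloaded archive (assumed location)
target = Path("./dataset")
target.mkdir(parents=True, exist_ok=True)

with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)        # images end up under ./dataset/
```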
We support inference via the following API providers:
- For Google Gemini key, please visit https://ai.google.dev/gemini-api/docs/api-key
- For Anthropic key, please visit https://console.anthropic.com/settings/keys
- For OpenAI key, please visit https://platform.openai.com/api-keys
- For xAI key, please visit https://console.x.ai/
- For Alibaba Cloud Qwen key, please visit https://bailian.console.aliyun.com/?tab=api#/api
- For Deepinfra key, please visit https://deepinfra.com/dash/api_keys
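The scripts need the corresponding API key at runtime; a common pattern is to export it as an environment variable before calling inference_api.py. The variable names below are assumptions (typical SDK defaults) — check inference_api.py for the names it actually reads:

```python
import os

# Hypothetical variable names -- verify against inference_api.py.
os.environ["GOOGLE_API_KEY"] = "..."       # Google Gemini
os.environ["ANTHROPIC_API_KEY"] = "..."    # Anthropic
os.environ["OPENAI_API_KEY"] = "..."       # OpenAI
os.environ["XAI_API_KEY"] = "..."          # xAI
os.environ["DASHSCOPE_API_KEY"] = "..."    # Alibaba Cloud Qwen (DashScope/Bailian)
os.environ["DEEPINFRA_API_KEY"] = "..."    # Deepinfra
```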
You can run inference using the command:
python inference_api.py qwen3-32b --setting 20 --mode pure-text --no-enable-thinking
The model name, for example "qwen3-32b", is compulsory.
--setting parameter passes either 15 or 20 quotes for evaluation.
--mode parameter controls whether quotes are passed as pure-text or multimodal inputs.
--no-enable-thinking parameter disables the thinking process for Qwen3 models; it is not applicable to non-Qwen3 models.
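To run a full sweep over both quote settings and both input modes for one model, a small driver like the following can wrap the command above (a sketch; the value "multimodal" for --mode is assumed from the description above):

```python
import subprocess

MODEL = "qwen3-32b"

for setting in (15, 20):
    for mode in ("pure-text", "multimodal"):   # "multimodal" value is assumed
        cmd = [
            "python", "inference_api.py", MODEL,
            "--setting", str(setting),
            "--mode", mode,
            "--no-enable-thinking",            # only meaningful for Qwen3 models
        ]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```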
Environment requirements:
- Python 3.9
- PyTorch 2.1.2+cu121
- ms-swift
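A quick sanity check that your environment matches these requirements (a sketch only):

```python
import sys
from importlib.metadata import version

import torch

print(sys.version)                  # expect 3.9.x
print(torch.__version__)            # expect 2.1.2+cu121
print(torch.cuda.is_available())    # the CUDA build should be usable
print(version("ms-swift"))          # ms-swift installed for fine-tuning
```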
Download the relevant model and adapter checkpoints and unzip them into ./checkpoint/.
You can run inference using the command:
python inference_checkpoint.py Qwen2.5-7B-Instruct --setting 20 --lora Qwen2.5-7B-Instruct_lora
The model checkpoint ID (same as the Huggingface repo name), for example "Qwen2.5-7B-Instruct", is compulsory.
--setting parameter passes either 15 or 20 quotes for evaluation.
--lora parameter loads the pre-trained checkpoint with fine-tuned LoRA weights.
You can start model training using the command:
python train_swift_qwen.py Qwen2.5-7B-Instruct
The model checkpoint ID (same as the Huggingface repo name), for example "Qwen2.5-7B-Instruct", is compulsory.
--setting parameter passes either 15 or 20 quotes for evaluation.
LoRA weights will be saved to the path Qwen2.5-7B-Instruct_lora.
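For reference, the saved adapter is a LoRA checkpoint and, assuming it follows the standard PEFT layout, can also be attached to the base model directly (inference_checkpoint.py --lora is the supported route; this is only an illustrative sketch):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_path = "Qwen2.5-7B-Instruct_lora"   # adjust if the adapter lives under ./checkpoint/

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)   # attach the LoRA weights
model = model.merge_and_unload()                        # optional: merge adapters for faster inference
```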
You can evaluate the generated multimodal answers using the command:
python eval_llm_judge.py response/qwen3-4b_pure-text_response_quotes20.jsonl --setting 20
The path of the response jsonl generated by inference_api.py or inference_checkpoint.py, for example response/qwen3-4b_pure-text_response_quotes20.jsonl, is compulsory.
--setting parameter passes either 15 or 20 quotes for evaluation.
This will generate a new jsonl file with detailed qualitative scores from the LLM judge.
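To inspect the judge output programmatically, you can stream the jsonl and aggregate per-item scores. The score key below is a placeholder — open one record to see the actual fields written by eval_llm_judge.py:

```python
import json

path = "response/evaluation/qwen3-4b_pure-text_response_quotes20.jsonl"  # assumed output location

scores = []
with open(path, encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        if "llm_judge_score" in item:            # placeholder key name
            scores.append(float(item["llm_judge_score"]))

if scores:
    print(f"{len(scores)} items, mean LLM-judge score = {sum(scores) / len(scores):.3f}")
```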
This generates all scores: quote selection F1, BLEU, ROUGE-L, and LLM-as-Judge scores (if applicable).
You can evaluate the generated multimodal answers using the command:
python eval_all.py --path xxx/xxx.jsonl --setting 20 --path_judge xxx/xxx.jsonl
--path_response parameter is the path of the response jsonl generated by inference_api.py or inference_checkpoint.py, for example response/qwen3-4b_pure-text_response_quotes20.jsonl; it is compulsory.
--path_judge parameter is the path of the LLM-Judge scores generated by eval_llm_judge.py.
--setting parameter passes either 15 or 20 quotes for evaluation.
If --llm-judge is enabled but no related scores can be found, LLM-as-Judge scores will not be shown.
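For intuition, quote selection F1 treats the quotes a model cites and the gold evidence quotes as sets. A minimal, self-contained sketch (eval_all.py is the authoritative implementation):

```python
def quote_selection_f1(predicted_ids, gold_ids):
    """Set-based F1 between predicted and gold quote IDs."""
    pred, gold = set(predicted_ids), set(gold_ids)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(quote_selection_f1([1, 3, 5], [1, 2, 3]))   # 0.667: two of three cited quotes are correct
```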
We have released all our inference and LLM-Judge jsonl results for 30 open-source, 25 proprietary, and 5 fine-tuned (SFT) models:
- ./response/ for inference results
- ./response/evaluation/ for LLM-Judge scores
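A small helper to see which released runs are present locally; it assumes the filenames follow the pattern shown above (e.g. response/qwen3-4b_pure-text_response_quotes20.jsonl):

```python
import glob
import os

for path in sorted(glob.glob("response/*.jsonl")):
    name = os.path.basename(path)[:-len(".jsonl")]    # e.g. qwen3-4b_pure-text_response_quotes20
    try:
        model, mode, _, quotes = name.rsplit("_", 3)  # pattern assumed from the example above
    except ValueError:
        continue                                      # skip files with a different naming scheme
    print(f"model={model:<32} mode={mode:<12} {quotes}")
```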
You can reproduce the results in our paper by using the command:
python eval_all.py --model qwen3-4b --setting 20 --mode pure-text
--model parameter is the model identifier used in our paper, for example "qwen3-32b".
--setting parameter passes either 15 or 20 quotes for evaluation.
--mode parameter controls whether quotes are passed as pure-text or multimodal inputs.
This will generate a new jsonl file with detailed qualitative scores from the LLM judge.
model_dict = {
"Qwen2.5-3B-Inst":"qwen2.5-3b",
"Qwen2.5-3B-Inst-Fine-tuning":"qwen2.5-3b-ft",
"Llama3.2-3B-Inst":"llama3.2-3b",
"Qwen3-4B (think)":"qwen3-4b",
"Mistral-7B-Inst":"mistral-7b",
"Qwen2.5-7B-Inst":"qwen2.5-7b",
"Qwen2.5-7B-Inst-Fine-tuning":"qwen2.5-7b-ft",
"Llama3.1-8B-Inst":"llama3.1-8b",
"Qwen3-8B (think)":"qwen3-8b",
"Qwen2.5-14B-Inst":"qwen2.5-14b",
"Qwen2.5-14B-Inst-Fine-tuning":"qwen2.5-14b-ft",
"Qwen3-14B (think)":"qwen3-14b",
"Mistral-Small-24B-Inst":"mistral-small-24b",
"Qwen3-30B-A3B":"qwen3-30b-a3b",
"Qwen2.5-32B-Inst":"qwen2.5-32b",
"Qwen2.5-32B-Inst-Fine-tuning": "qwen2.5-32b-ft",
"Qwen3-32B (think)":"qwen3-32b",
"Mistral-8x7B-Inst":"mistral-8x7b",
"Llama3.3-70B-Inst":"llama3.3-70b",
"Qwen2.5-72B-Inst":"qwen2.5-72b",
"Qwen2.5-72B-Inst-Fine-tuning":"qwen2.5-72b-ft",
"Qwen3-235B-A22B":"qwen3-235b-a22b",
"Deepseek-V3":"deepseek-v3",
"Deepseek-R1":"deepseek-r1",
"Deepseek-R1-Distill-Qwen-32B":"deepseek-r1-distill-qwen-32b",
"Deepseek-R1-Distill-Llama-70B":"deepseek-r1-distill-llama-70b",
"Qwen-Plus":"qwen-plus",
"Qwen-Max":"qwen-max",
"Gemini-1.5-Pro":"gemini-1.5-pro",
"Gemini-2.0-Pro":"gemini-2.0-pro",
"Gemini-2.0-Flash":"gemini-2.0-flash",
"Gemini-2.0-Flash-Think":"gemini-2.0-flash-tk",
"Gemini-2.5-Flash":"gemini-2.5-flash",
"Gemini-2.5-Pro":"gemini-2.5-pro",
"Claude-3.5-Sonnet":"claude-3.5-sonnet",
"GPT-4-turbo":"gpt-4-turbo",
"GPT-4o-mini":"gpt-4o-mini",
"GPT-4o":"gpt-4o",
"GPT-o3-mini":"gpt-o3-mini",
"GPT-4.1-nano":"gpt-4.1-nano",
"GPT-4.1-mini":"gpt-4.1-mini",
"GPT-4.1":"gpt-4.1",
"Janus-Pro-7B":"janus-pro-7b",
"MiniCPM-o-2.6-8B":"minicpm-o-2.6-8b",
"InternVL2.5-8B":"internvl2.5-8b",
"InternVL3-8B":"internvl3-8b",
"InternVL3-9B":"internvl3-9b",
"InternVL3-14B":"internvl3-14b",
"InternVL2.5-26B":"internvl2.5-26b",
"InternVL2.5-38B":"internvl2.5-38b",
"InternVL3-38B":"internvl3-38b",
"InternVL2.5-78B":"internvl2.5-78b",
"InternVL3-78B":"internvl3-78b",
"Qwen2.5-VL-7B-Inst":"qwen2.5-vl-7b",
"Qwen2.5-VL-32B-Inst":"qwen2.5-vl-32b",
"Qwen2.5-VL-72B-Inst":"qwen2.5-vl-72b",
"Qwen-VL-Plus":"qwen-vl-plus",
"Qwen-VL-Max":"qwen-vl-max",
"Qwen-QVQ-Max":"qwen-qvq-max",
"Qwen-QwQ-Plus":"qwen-qwq-plus",
"Llama4-Scout-17Bx16E":"llama4-scout-17b-16e",
"Llama4-Mave-17Bx128E":"llama4-mave-17b-128e"
}

@misc{dong2025mmdocrag,
title={Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering},
author={Kuicai Dong and Yujing Chang and Shijie Huang and Yasheng Wang and Ruiming Tang and Yong Liu},
year={2025},
eprint={2505.16470},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2505.16470},
}
Usage and License Notices: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International. Usage should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use