
MMDocRAG: Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering

Kuicai Dong* · Yujing Chang* · Shijie Huang · Yasheng Wang · Ruiming Tang · Yong Liu

📖Paper | 🏠Homepage | 🤗Huggingface | 👉Github

Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence integration and selection. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that combine text with relevant visual elements. Through large-scale experiments with 60 language/vision models and 14 retrieval systems, we identify persistent challenges in multimodal evidence handling. Key findings reveal proprietary vision-language models show moderate advantages over text-only models, while open-source alternatives trail significantly. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems.


🛠️Dataset Usage

Download Image Quotes

Download images.zip and unzip it into ./dataset/.
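If you prefer to do this step in Python, here is a minimal sketch (assuming images.zip has already been downloaded to the repository root):

import zipfile
from pathlib import Path

# Extract the downloaded image quotes into ./dataset/ (target path taken from the instructions above).
target = Path("./dataset/")
target.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile("images.zip") as zf:
    zf.extractall(target)
    print(f"Extracted {len(zf.namelist())} files into {target}")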

1. For Inference using API

API Key Preparation

We support inference through multiple API providers.
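Before running inference, make sure the API key for your chosen provider is available. A minimal sanity check, assuming keys are read from environment variables (the variable names below are assumptions; check inference_api.py for the names it actually expects):

import os

# Hypothetical key names -- adjust to the providers you use.
for key in ("OPENAI_API_KEY", "DASHSCOPE_API_KEY"):
    status = "set" if os.environ.get(key) else "MISSING"
    print(f"{key}: {status}")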

Inference Command

You can infer using the command:

python inference_api.py qwen3-32b --setting 20 --mode pure-text --no-enable-thinking

The model name, for example "qwen3-32b", is a compulsory positional argument.

The --setting parameter controls whether 15 or 20 quotes are passed for evaluation.

The --mode parameter controls whether quotes are passed as pure-text or multimodal inputs.

The --no-enable-thinking parameter disables the thinking process for Qwen3 models; it is not applicable to non-Qwen3 models.
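To run several configurations in one go, the command above can be wrapped in a small driver script. A sketch, using the model name and flag values documented in this section:

import subprocess

# Sweep one model over both quote settings and both input modes.
# The mode values follow the description above; --no-enable-thinking only matters for Qwen3 models.
model = "qwen3-32b"
for setting in (15, 20):
    for mode in ("pure-text", "multimodal"):
        cmd = [
            "python", "inference_api.py", model,
            "--setting", str(setting),
            "--mode", mode,
            "--no-enable-thinking",
        ]
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)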

2. For Inference using Checkpoints

Environment

python 3.9
torch 2.1.2+cu121
ms-swift
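A quick check that your environment matches these versions (not part of the repository, just a convenience snippet):

import sys
import torch

print("python:", sys.version.split()[0])   # expect 3.9.x
print("torch :", torch.__version__)        # expect 2.1.2+cu121
print("cuda  :", torch.cuda.is_available())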

Download Checkpoints

Download the relevant model and adapter checkpoints and unzip them into ./checkpoint/.

Inference Command

You can infer using the command:

python inference_checkpoint.py Qwen2.5-7B-Instruct --setting 20 --lora Qwen2.5-7B-Instruct_lora

The model checkpoint ID (the same as the Hugging Face repo name), for example "Qwen2.5-7B-Instruct", is a compulsory positional argument.

The --setting parameter controls whether 15 or 20 quotes are passed for evaluation.

The --lora parameter loads a pre-trained checkpoint with fine-tuned LoRA weights.
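inference_checkpoint.py loads the adapter through ms-swift. For reference, the same base-model-plus-LoRA combination can be assembled with transformers and peft along these lines (a sketch; the adapter directory name follows the example above and the checkpoint layout is an assumption):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_dir = "./checkpoint/Qwen2.5-7B-Instruct_lora"   # assumed location after unzipping

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_dir)   # attach the fine-tuned LoRA weights
model.eval()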

3. For Finetuning Models

You can launch model training using the command:

python train_swift_qwen.py Qwen2.5-7B-Instruct

The model checkpoint ID (the same as the Hugging Face repo name), for example "Qwen2.5-7B-Instruct", is a compulsory positional argument.

The --setting parameter controls whether 15 or 20 quotes are passed during training.

LoRA weights will be saved to the path Qwen2.5-7B-Instruct_lora.
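The training hyperparameters are defined in train_swift_qwen.py, which drives fine-tuning through ms-swift. Conceptually, the LoRA setup resembles the following peft-style configuration (the values shown are illustrative, not the ones used in the paper):

from peft import LoraConfig

# Illustrative LoRA hyperparameters -- see train_swift_qwen.py for the actual settings.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)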

🔮Dataset Evaluation

1. LLM-as-Judge

You can evaluate the generated multimodal answer using the command:

python eval_llm_judge.py response/qwen3-4b_pure-text_response_quotes20.jsonl --setting 20

The path of the response jsonl generated by inference_api.py or inference_checkpoint.py, for example response/qwen3-4b_pure-text_response_quotes20.jsonl, is a compulsory positional argument.

The --setting parameter controls whether 15 or 20 quotes are passed for evaluation.

This will generate a new jsonl file with detailed qualitative scores from the LLM judge.
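To aggregate the per-example judge scores afterwards, something along these lines works (the file path and the "judge_scores" field name are hypothetical; inspect the generated jsonl for the actual keys):

import json
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
with open("response/evaluation/qwen3-4b_pure-text_judge_quotes20.jsonl") as f:   # assumed path
    for line in f:
        record = json.loads(line)
        for name, value in record.get("judge_scores", {}).items():   # hypothetical field name
            totals[name] += value
            counts[name] += 1

for name, total in totals.items():
    print(f"{name}: {total / counts[name]:.3f}")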

2. All Scores

This computes all scores: quote-selection F1, BLEU, ROUGE-L, and LLM-as-Judge scores (if applicable).

You can evaluate the generated multimodal answer using the command:

python eval_all.py --path xxx/xxx.jsonl --setting 20 --path_judge xxx/xxx.jsonl

The --path parameter is the path of the response jsonl generated by inference_api.py or inference_checkpoint.py, for example response/qwen3-4b_pure-text_response_quotes20.jsonl; it is compulsory.

The --path_judge parameter is the path of the LLM-Judge scores generated by eval_llm_judge.py.

The --setting parameter controls whether 15 or 20 quotes are passed for evaluation.

If --llm-judge is enabled but no related scores can be found, LLM-as-Judge scores will not be shown.
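For reference, the quote-selection F1 compares the quotes cited in the generated answer against the gold evidence quotes. A self-contained sketch of that computation (eval_all.py contains the authoritative implementation):

def quote_selection_f1(predicted_ids, gold_ids):
    # Set-level F1 between predicted and gold quote IDs.
    predicted, gold = set(predicted_ids), set(gold_ids)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision, recall = tp / len(predicted), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model cites quotes 2, 5, 9 while the gold evidence is 2, 5, 7.
print(quote_selection_f1([2, 5, 9], [2, 5, 7]))   # 0.667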

☀️Reproducing Our Results

We have released all our inference and LLM-Judge jsonl results for 30 open-source, 25 proprietary, and 5 fine-tuned (SFT) models.

  • ./response/ for inference
  • ./response/evaluation/ for LLM-Judge scores

You can reproduce the results in our paper by using the command:

python eval_all.py --model qwen3-4b --setting 20 --mode pure-text

The --model parameter is the model identifier used in our paper, for example "qwen3-32b".

The --setting parameter controls whether 15 or 20 quotes are passed for evaluation.

The --mode parameter controls whether quotes are passed as pure-text or multimodal inputs.

This will generate a new jsonl file with detailed qualitative scores from the LLM judge. The mapping between the model names used in the paper and the --model identifiers is:

model_dict = {
   "Qwen2.5-3B-Inst":"qwen2.5-3b",
   "Qwen2.5-3B-Inst-Fine-tuning":"qwen2.5-3b-ft",
   "Llama3.2-3B-Inst":"llama3.2-3b",
   "Qwen3-4B (think)":"qwen3-4b",
   "Mistral-7B-Inst":"mistral-7b",
   "Qwen2.5-7B-Inst":"qwen2.5-7b",
   "Qwen2.5-7B-Inst-Fine-tuning":"qwen2.5-7b-ft",
   "Llama3.1-8B-Inst":"llama3.1-8b",
   "Qwen3-8B (think)":"qwen3-8b",
   "Qwen2.5-14B-Inst":"qwen2.5-14b",
   "Qwen2.5-14B-Inst-Fine-tuning":"qwen2.5-14b-ft",
   "Qwen3-14B (think)":"qwen3-14b",
   "Mistral-Small-24B-Inst":"mistral-small-24b",
   "Qwen3-30B-A3B":"qwen3-30b-a3b",
   "Qwen2.5-32B-Inst":"qwen2.5-32b",
   "Qwen2.5-32B-Inst-Fine-tuning": "qwen2.5-32b-ft",
   "Qwen3-32B (think)":"qwen3-32b",
   "Mistral-8x7B-Inst":"mistral-8x7b",
   "Llama3.3-70B-Inst":"llama3.3-70b",
   "Qwen2.5-72B-Inst":"qwen2.5-72b",
   "Qwen2.5-72B-Inst-Fine-tuning":"qwen2.5-72b-ft",
   "Qwen3-235B-A22B":"qwen3-235b-a22b",
   "Deepseek-V3":"deepseek-v3",
   "Deepseek-R1":"deepseek-r1",
   "Deepseek-R1-Distill-Qwen-32B":"deepseek-r1-distill-qwen-32b",
   "Deepseek-R1-Distill-Llama-70B":"deepseek-r1-distill-llama-70b",
   "Qwen-Plus":"qwen-plus",
   "Qwen-Max":"qwen-max",
   "Gemini-1.5-Pro":"gemini-1.5-pro",
   "Gemini-2.0-Pro":"gemini-2.0-pro",
   "Gemini-2.0-Flash":"gemini-2.0-flash",
   "Gemini-2.0-Flash-Think":"gemini-2.0-flash-tk",
   "Gemini-2.5-Flash":"gemini-2.5-flash",
   "Gemini-2.5-Pro":"gemini-2.5-pro",
   "Claude-3.5-Sonnet":"claude-3.5-sonnet",
   "GPT-4-turbo":"gpt-4-turbo",
   "GPT-4o-mini":"gpt-4o-mini",
   "GPT-4o":"gpt-4o",
   "GPT-o3-mini":"gpt-o3-mini",
   "GPT-4.1-nano":"gpt-4.1-nano",
   "GPT-4.1-mini":"gpt-4.1-mini",
   "GPT-4.1":"gpt-4.1",
   "Janus-Pro-7B":"janus-pro-7b",
   "MiniCPM-o-2.6-8B":"minicpm-o-2.6-8b",
   "InternVL2.5-8B":"internvl2.5-8b",
   "InternVL3-8B":"internvl3-8b",
   "InternVL3-9B":"internvl3-9b",
   "InternVL3-14B":"internvl3-14b",
   "InternVL2.5-26B":"internvl2.5-26b",
   "InternVL2.5-38B":"internvl2.5-38b",
   "InternVL3-38B":"internvl3-38b",
   "InternVL2.5-78B":"internvl2.5-78b",
   "InternVL3-78B":"internvl3-78b",
   "Qwen2.5-VL-7B-Inst":"qwen2.5-vl-7b",
   "Qwen2.5-VL-32B-Inst":"qwen2.5-vl-32b",
   "Qwen2.5-VL-72B-Inst":"qwen2.5-vl-72b",
   "Qwen-VL-Plus":"qwen-vl-plus",
   "Qwen-VL-Max":"qwen-vl-max",
   "Qwen-QVQ-Max":"qwen-qvq-max",
   "Qwen-QwQ-Plus":"qwen-qwq-plus",
   "Llama4-Scout-17Bx16E":"llama4-scout-17b-16e",
   "Llama4-Mave-17Bx128E":"llama4-mave-17b-128e"
}
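To regenerate every row of the results table, the mapping above can drive a simple loop over eval_all.py (a sketch; vision-language entries may need --mode multimodal instead of pure-text):

import subprocess

# Reuse the model_dict mapping shown above to evaluate each released model.
for paper_name, model_id in model_dict.items():
    cmd = ["python", "eval_all.py", "--model", model_id, "--setting", "20", "--mode", "pure-text"]
    print(f"Evaluating {paper_name} ({model_id})")
    subprocess.run(cmd, check=True)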

💾Citation

@misc{dong2025mmdocrag,
      title={Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering}, 
      author={Kuicai Dong and Yujing Chang and Shijie Huang and Yasheng Wang and Ruiming Tang and Yong Liu},
      year={2025},
      eprint={2505.16470},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2505.16470}, 
}

📄 License

Usage and License Notices: the data and code are intended and licensed for research use only, under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. Use should also abide by the OpenAI terms of use: https://openai.com/policies/terms-of-use