This repository contains the code to evaluate models on M3SciQA, from the paper *M3SciQA: A Multi-Modal Multi-Document Scientific Benchmark for Evaluating Foundation Models*.
In the realm of foundation models for scientific research, current benchmarks predominantly focus on single-document, text-only tasks and fail to adequately represent the complex workflow of such research.
To get started, clone the repository:

```
git clone https://github.com/yale-nlp/M3SciQA
cd M3SciQA
```

Tour of the code base

```
.
├── data/
│ ├── locality.jsonl
│ ├── combined_test.jsonl
│ ├── combined_val.jsonl
│ ├── locality/
│ │ ├── 2310.04988/
│ │ │ └── HVI_figure.png
│ │ ├── 2310.05030/
│ │ │ └── diversity_score.png
│ │ └── ...
├── src/
│ ├── data_utils.py
│ ├── evaluate_detail.py
│ ├── evaluate_locality.py
│ ├── generate_detail.py
│ ├── generate_locality.py
│ ├── models_w_vision.py
│ ├── models_wo_vision.py
│ ├── README.md
├── results/
│ ├── locality_response/
│ ├── retrieval@1/
│ ├── retrieval@2/
│ ├── retrieval@3/
│ ├── retrieval@4/
│ ├── retrieval@5/
├── paper_cluster_S2_content.json
├── paper_cluster_S2.json
├── paper_full_content.json
├── retrieval_paper.json
├── README.md
└── .gitignore
```
- `data/` contains the locality-specific questions, the combined-question validation split, and the combined-question test split. Answers, explanations, and evidence for the test split are set to `null` to prevent the test data from leaking to the public.
- `data/locality/` contains all images used to compose the locality-specific questions.
- `results/` contains evaluation results under different settings.
- `src/generate_locality.py`: script for generating responses to locality-specific questions.
- `src/evaluate_locality.py`: script for evaluating responses to locality-specific questions.
- `src/generate_detail.py`: script for generating responses to detail-specific questions.
- `src/evaluate_detail.py`: script for evaluating responses to detail-specific questions.

For locality reasoning types, we use the mapping:

```json
{
  "1": "Comparison",
  "2": "Data Extraction",
  "3": "Location",
  "4": "Visual Understanding"
}
```

Generated responses for locality-specific questions (saved under `results/locality_response/`) follow the format:

```json
{
"question_anchor": ... <str>,
"reference_arxiv_id": ... <str>,
"reference_s2_id": ... <str>,
"response": ... <str>
}
```

The `response` field contains the model's output ranking.
For example:

```json
{"question_anchor": "Which large language model achieves a lower HVI score than OPT but a higher HVI score than Alpaca?",
"reference_arxiv_id": "2303.08774",
"reference_s2_id": "163b4d6a79a5b19af88b8585456363340d9efd04",
"response": "```json\n{\"ranking\":
[\"1827dd28ef866eaeb929ddf4bcfa492880aba4c7\", \"57e849d0de13ed5f91d086936296721d4ff75a75\", \"2b2591c151efc43e8836a5a6d17e44c04bb68260\", \"62b322b0bead56d6a252a2e24de499ea8385ad7f\", \"964bd39b546f0f6625ff3b9ef1083f797807ef2e\", \"597d9134ffc53d9c3ba58368d12a3e4d24893bf0\"
]}```"}
```
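Because the ranked list of S2 paper ids is embedded as a fenced JSON snippet inside the `response` string, it has to be parsed back out before scoring. Below is a minimal sketch of that step (not the repo's own parsing code), assuming the response wraps a JSON object with a `ranking` key as in the example above; the file path follows the `results/` layout shown earlier.

```python
import json
import re

def extract_ranking(response: str) -> list[str]:
    """Pull the ranked list of S2 paper ids out of a model response string.

    Assumes the ranking is a JSON object with a "ranking" key, possibly
    wrapped in a Markdown code fence as in the example above.
    """
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return []
    return json.loads(match.group(0)).get("ranking", [])

# Usage: where does the gold reference paper land in the predicted ranking?
with open("../results/locality_response/gpt_4_o.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

ranking = extract_ranking(record["response"])
gold = record["reference_s2_id"]
position = ranking.index(gold) + 1 if gold in ranking else None
print(f"Gold paper ranked at position: {position}")
```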
For example, to evaluate GPT-4o, run the following command:

```
cd src
python generate_locality.py --model gpt_4_o
```

For open-source models, we provide the code for Qwen2 VL 7B. You can modify this function to suit other models. To run it, go to the root folder and create a folder named `pretrained`. If you are in the `src/` folder:

```
cd ..
mkdir pretrained && cd src
python generate_locality.py --model qwen2vl_7b
```
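If you prefer to fetch the Qwen2 VL 7B weights ahead of time, they can be downloaded into `pretrained/` with `huggingface_hub`; this is only a sketch, and the local directory name below is an assumption, so match it to whatever path `src/models_w_vision.py` actually loads from.

```python
from huggingface_hub import snapshot_download

# Download the Qwen2-VL-7B-Instruct checkpoint into the local pretrained/ folder.
# The local_dir name is illustrative; align it with the path expected by the
# loading code in src/models_w_vision.py.
snapshot_download(
    repo_id="Qwen/Qwen2-VL-7B-Instruct",
    local_dir="pretrained/Qwen2-VL-7B-Instruct",
)
```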
Similarly, to calculate the MRR, NDCG@3, and Recall@3 of GPT-4o, run the following command:
```
python evaluate_locality.py --result_path ../results/locality_response/gpt_4_o.jsonl --k 3
```
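For reference, each of these metrics is computed per question from the position of the gold reference paper in the predicted ranking and then averaged over questions. The sketch below is a generic single-gold-paper implementation, not necessarily identical to `evaluate_locality.py`.

```python
import math

def retrieval_metrics(ranking: list[str], gold: str, k: int = 3) -> dict:
    """MRR, NDCG@k, and Recall@k for a single question with one gold paper.

    With a single relevant document, NDCG@k reduces to 1/log2(rank + 1) when
    the gold paper appears in the top k, and Recall@k is a hit indicator.
    """
    rank = ranking.index(gold) + 1 if gold in ranking else None
    mrr = 1.0 / rank if rank else 0.0
    ndcg = 1.0 / math.log2(rank + 1) if rank and rank <= k else 0.0
    recall = 1.0 if rank and rank <= k else 0.0
    return {"MRR": mrr, f"NDCG@{k}": ndcg, f"Recall@{k}": recall}

# Example: gold paper ranked third out of four candidates.
print(retrieval_metrics(["a", "b", "gold", "c"], gold="gold", k=3))
# -> {'MRR': 0.333..., 'NDCG@3': 0.5, 'Recall@3': 1.0}
```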
Generated responses for detail-specific questions (saved under `results/retrieval@{k}/`) follow the format:

```json
{
  "question": ... <str>,
  "answer": ... <str>,
  "response": ... <str>,
  "reference_reasoning_type": ... <str>
}
```

Parameters:
- `model`: the model that you want to evaluate
- `k`: the number of papers that you want to retrieve from the re-ranked paper list
- `chunk_length`: the chunk length that you want to pass into the model, for models with a short context length
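To make `chunk_length` concrete: the idea is to split a paper's full text into fixed-size pieces that fit a short context window. The sketch below chunks by characters; the actual splitting strategy in `generate_detail.py` may differ (e.g., token-based).

```python
def chunk_text(text: str, chunk_length: int = 15000) -> list[str]:
    """Split a paper's full text into consecutive chunks of at most
    `chunk_length` characters (an assumption; the repo may chunk by tokens)."""
    return [text[i : i + chunk_length] for i in range(0, len(text), chunk_length)]

# Example: a 40k-character paper with --chunk_length 15000 yields 3 chunks.
paper_text = "x" * 40_000
chunks = chunk_text(paper_text, 15000)
print([len(c) for c in chunks])  # [15000, 15000, 10000]
```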
For evaluating open-source models, we offer two methods: (1) using the TogetherAI API, and (2) accessing models directly from Hugging Face.
Currently, local execution is supported only for Qwen2 VL 7B, but you can easily modify the function to work with any other LLM available on Hugging Face.
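For method (1), a call through the TogetherAI API typically looks like the sketch below; the client usage and the model id are assumptions based on the current `together` Python SDK, not code from this repo.

```python
from together import Together  # pip install together

# Minimal sketch of method (1): querying an open-source model via the
# TogetherAI API. The client reads TOGETHER_API_KEY from the environment.
# The model id below is illustrative, not one used by this repo.
client = Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize the anchor figure in one sentence."}],
)
print(response.choices[0].message.content)
```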
For example, to use GPT-4 with `k = 3` and a chunk length of 15000, run the following command:
```
cd src
python generate_detail.py --model gpt_4 --k 3 --chunk_length 15000
```

To evaluate GPT-4's generated response, run the following command:
```
python evaluate_detail.py --result_path ../results/retrieval@3/gpt_4.jsonl
```
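As a quick sanity check before evaluation, the results file can be inspected directly. The sketch below just loads one detail-specific results file and tallies responses by reasoning type, using the field names from the format above; the path follows the example command.

```python
import json
from collections import Counter

# Inspect a detail-specific results file (path from the command above).
with open("../results/retrieval@3/gpt_4.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} responses loaded")
# Tally responses by locality reasoning type ("1"-"4", see the mapping above).
print(Counter(record["reference_reasoning_type"] for record in records))
```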