Retrieved sentences for each (question, answer option) pair in three multiple-choice science question answering datasets (i.e., ARC-Easy, ARC-Challenge, and OpenBookQA), drawn from the integrated reference corpus (IRC) plus the integrated external corpus (IEC) described in the paper Improving Question Answering with External Knowledge.
This is a re-implementation. As of the release date of this repository, the Allen Institute for Artificial Intelligence (AI2) does not allow third parties to redistribute the ARC Corpus. Therefore, we cannot directly release a resource containing the sentences retrieved from the ARC Corpus. Instead, for all such sentences, we provide pointers into the ARC Corpus, along with a script that fetches the retrieved sentences based on those pointers and your local copy of the corpus.
If you find this resource useful, please cite the following paper.
```
@inproceedings{pan2019improving,
  title={Improving Question Answering with External Knowledge},
  author={Pan, Xiaoman and Sun, Kai and Yu, Dian and Chen, Jianshu and
          Ji, Heng and Cardie, Claire and Yu, Dong},
  booktitle={Proceedings of the Workshop on Machine Reading for Question Answering},
  address={Hong Kong, China},
  url={https://arxiv.org/abs/1902.00993v2},
  year={2019}
}
```
Below are the detailed instructions.
- Clone this repository.
- Download `ARC-V1-Feb2018.zip` from AI2, unzip it, and copy `ARC_Corpus.txt` (in the unzipped folder `ARC-V1-Feb2018-2`) to the `data` folder. The CRC of `ARC_Corpus.txt` should be `8CFE08C6`; a checksum verification sketch appears after these instructions.
- Run `python3 gen.py` to generate `arc_challenge.json`, `arc_easy.json`, and `openbookqa.json`, which serve as input for the models IRC + IEC and IRC + IEC + MD in Table 5 of the paper. The format of these files is as follows.
```
{
    FileName-QuestionID: [
        retrieved sentences for the 1st option,
        retrieved sentences for the 2nd option,
        ...
    ],
    ...
}
```
File names and question IDs follow `ARC-V1-Feb2018.zip` and `OpenBookQA-V1-Sep2018.zip`. Within each entry, the retrieved sentences for an option are separated by `"\n"`.
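Since the corpus must be fetched separately, it is worth confirming that your local copy matches the one used here. Below is a minimal sketch for checking the checksum, assuming the CRC listed above is the standard CRC-32 used in ZIP archives and that `ARC_Corpus.txt` has already been copied to the `data` folder as described.

```python
import zlib

def crc32_of(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

# Should print 8CFE08C6 if the local copy matches (assuming CRC-32).
print(f"{crc32_of('data/ARC_Corpus.txt'):08X}")
```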
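To illustrate the format, here is a minimal sketch of how the generated files could be consumed; the choice of file and the variable names are arbitrary, and it assumes `gen.py` has already produced `arc_easy.json` in the working directory.

```python
import json

with open("arc_easy.json") as f:
    retrieved = json.load(f)

# Each key is "FileName-QuestionID"; each value is a list with one
# string per answer option, whose retrieved sentences are "\n"-separated.
key = next(iter(retrieved))
options = retrieved[key]
sentences = options[0].split("\n")  # sentences retrieved for the 1st option
print(key, len(options), len(sentences))
```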