This repository contains the code and the KI-QFS dataset described in the paper "Tackling Query-Focused Summarization as A Knowledge-Intensive Task: A Pilot Study". The paper was accepted at the GenIR@SIGIR 2023 workshop.
- data description
- relevance annotation
- code
The dataset is based on the DUC 2005-2007 datasets from NIST. Please request access to the DUC data from NIST before using our dataset. The dataset is located in dataset/. Please refer to the paper for more details.
We repurpose the DUC datasets for a knowledge-intensive task by splitting them into input-output pairs and a knowledge corpus.
We further divide the pairs into train, validation, and test splits, stored as kiqfs_pairs_train/val/test.jsonl. Each line of a *.jsonl file has the following format:
{
'id': 'D301I', # original id of each cluster on the DUC Datasets
'query': 'Nobel prizes are awarded each year for achievement...',
'summaries': ['s1', 's2', ..., 'sn'] # a list of summaries
}
For knowledge corpora, we consider three alternatives:
- Internal corpus
- External corpus
- Augmented corpus
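The pair files described above can be read one line at a time. The following is a minimal sketch, not part of the repository's code: `load_pairs` is a hypothetical helper, and it assumes each line is valid JSON with the `id`, `query`, and `summaries` fields (the format example above uses Python-style quotes and comments for illustration only).

```python
import json
from pathlib import Path


def load_pairs(path):
    """Read one KI-QFS pair file (JSON Lines) into a list of dicts.

    Hypothetical helper: assumes every non-empty line is a JSON object
    with the keys 'id', 'query', and 'summaries'.
    """
    pairs = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            pairs.append(json.loads(line))
    return pairs


# Example call (the path is an assumption based on the file names above):
# train_pairs = load_pairs("dataset/kiqfs_pairs_train.jsonl")
```

Each returned record keeps all reference summaries in its `summaries` list, so a training script can decide whether to use one or all of them per query.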
The internal corpus is kiqfs_internal_knowledge.json, which contains only documents from the DUC datasets. The data format is:
{
'D301I': [{'title': 'FT 02 NOV 94...', 'text': 'CRIME WITHOUT FRONTIERS By...'}, ...] # a list of documents in the cluster D301I,
... # all clusters
}
For the external corpus, we use the Wikipedia dump kilt_w100_title.tsv from the KILT benchmark. Please follow their instructions to download the data.
We also provide a processed version of the internal corpus, kiqfs_internal_w100_title.tsv, which has the same data format as kilt_w100_title.tsv.
For the augmented corpus, we simply combine the internal and external corpora.
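Besides the TSV version, the JSON form of the internal corpus can be inspected directly. The sketch below is illustrative only: the helper names are hypothetical, and it assumes kiqfs_internal_knowledge.json is valid JSON that maps each cluster id to a list of documents with `title` and `text` fields, as in the format shown earlier.

```python
import json


def load_internal_knowledge(path):
    """Load the internal corpus: a dict from cluster id to a list of documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def documents_per_cluster(knowledge):
    """Return the number of documents in each cluster."""
    return {cluster_id: len(docs) for cluster_id, docs in knowledge.items()}


# Example call (the path is an assumption based on the file name above):
# knowledge = load_internal_knowledge("dataset/kiqfs_internal_knowledge.json")
# print(documents_per_cluster(knowledge))
```

Because the cluster ids match the `id` field of the input-output pairs, the same key can be used to look up the source documents for a given query.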
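One way to combine the two corpora is to concatenate the TSV files. This is a sketch under stated assumptions, not the authors' script: `build_augmented_corpus` is a hypothetical helper, and the `has_header` flag assumes that both files either start with the same header row or have none. Check the actual files before relying on it.

```python
def build_augmented_corpus(internal_tsv, external_tsv, output_tsv, has_header=True):
    """Concatenate the external and internal TSV corpora into one file.

    Assumption: both inputs share the same column layout. When has_header
    is True, the header row is kept from the first file only.
    """
    with open(output_tsv, "w", encoding="utf-8", newline="") as out:
        for index, source in enumerate((external_tsv, internal_tsv)):
            with open(source, encoding="utf-8", newline="") as src:
                if has_header and index > 0:
                    next(src, None)  # skip the repeated header row
                for line in src:
                    # Make sure the last line of each file ends with a newline.
                    out.write(line if line.endswith("\n") else line + "\n")


# Example call (paths are assumptions based on the file names above):
# build_augmented_corpus(
#     "dataset/kiqfs_internal_w100_title.tsv",
#     "kilt_w100_title.tsv",
#     "kiqfs_augmented_w100_title.tsv",
# )
```

If the two files use overlapping passage ids, the ids would need to be made unique before indexing; the sketch leaves them unchanged.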
TODO
KI-QFS is MIT licensed. See the LICENSE file for details.