Code and dataset for the paper "Tackling Query-Focused Summarization as A Knowledge-Intensive Task: A Pilot Study"

wjzhang392/KI-QFS

KI-QFS

This repository contains the code and the KI-QFS dataset described in the paper "Tackling Query-Focused Summarization as A Knowledge-Intensive Task: A Pilot Study". The paper was accepted at the GenIR@SIGIR 2023 workshop.

Updates

  • data description
  • relevance annotation
  • codes

Data Description

The dataset is based on the DUC 2005-2007 datasets from NIST. Please request access to the DUC data from NIST before using our dataset. The dataset is located in dataset/. Please refer to the paper for more details about the dataset.

Data Structure

We repurpose the DUC datasets for a knowledge-intensive task, splitting them into input-output pairs and a knowledge corpus.
The pairs are further divided into train, validation, and test splits, stored as kiqfs_pairs_train/val/test.jsonl. Each line of a *.jsonl file has the following format:

{
    'id': 'D301I', # original id of each cluster on the DUC Datasets
    'query': 'Nobel prizes are awarded each year for achievement...',
    'summaries': ['s1', 's2', ..., 'sn'] # a list of summaries
}
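A minimal sketch of reading one of the pair files, assuming each line is a valid JSON object with the `id`, `query`, and `summaries` keys shown above. The helper name `load_pairs` is illustrative and is not part of this repository.

```python
import json


def load_pairs(path):
    """Read a KI-QFS pairs file (JSON Lines) into a list of dicts.

    Assumes one JSON object per line with 'id', 'query', and 'summaries'.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"id", "query", "summaries"} - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            pairs.append(record)
    return pairs
```

For example, `load_pairs("dataset/kiqfs_pairs_train.jsonl")` would return the training pairs, assuming the file sits directly under dataset/ with that name.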

For knowledge corpora, we consider three alternatives:

  • Internal corpus
  • External corpus
  • Augmented corpus

The internal corpus is kiqfs_internal_knowledge.json, which contains only documents from the DUC datasets. The data format is:

{
    'D301I': [{'title': 'FT 02 NOV 94...', 'text': 'CRIME WITHOUT FRONTIERS By...'}, ...], # a list of documents in the cluster D301I
    ... # all clusters
}
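A minimal sketch of looking up the documents that belong to one input-output pair, assuming the file is a single JSON object keyed by cluster id and that the `id` field of a pair matches one of those keys. Both helper names are illustrative, not part of this repository.

```python
import json


def load_internal_corpus(path):
    """Load the internal corpus: a dict mapping cluster id -> list of documents."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def documents_for_pair(corpus, pair):
    """Return the {'title', 'text'} documents of the cluster a pair came from.

    Assumes pair['id'] is a cluster id such as 'D301I'; returns an empty
    list when the cluster is not present in the corpus.
    """
    return corpus.get(pair["id"], [])
```

A pair whose `id` is missing from the corpus yields an empty list rather than an error, which keeps a loop over all pairs simple.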

For the external corpus, we use the Wikipedia dump kilt_w100_title.tsv from the KILT benchmark. Please follow their instructions to download the data.
We also provide a processed version of the internal corpus, kiqfs_internal_w100_title.tsv, which has the same data format as kilt_w100_title.tsv.
For the augmented corpus, we simply combine the two previous corpora.
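One way to build the augmented corpus is a plain concatenation of the two TSV files. The sketch below assumes that both files start with the same header row and hold one passage per line; this is an assumption about the file layout, not something this README confirms. The function name `combine_corpora` is illustrative, and passage ids are copied unchanged, so check them for collisions between the two files.

```python
def combine_corpora(corpus_paths, out_path):
    """Concatenate TSV corpora that share a header row into one file.

    Assumes every input file begins with the same header line.
    Returns the number of passage rows written (header excluded).
    """
    header = None
    rows_written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in corpus_paths:
            with open(path, encoding="utf-8") as src:
                first = src.readline()
                if not first:
                    continue  # skip empty files
                if header is None:
                    header = first.rstrip("\n")
                    out.write(header + "\n")
                elif first.rstrip("\n") != header:
                    raise ValueError(f"{path}: header differs from the first file")
                for line in src:
                    if line.strip():
                        out.write(line.rstrip("\n") + "\n")
                        rows_written += 1
    return rows_written
```

Working line by line avoids loading the Wikipedia dump into memory and leaves the passage text untouched.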

Relevance Annotation

TODO

License

KI-QFS is MIT licensed. See the LICENSE file for details.
