Generating Wikipedia by Summarizing Long Sequences

Presented by: Jinali Ghoghari
General Information

• Published date: 30 January 2018
• Number of citations: 688
• Dataset used: WikiSum
• Published by: Google Brain
ABSTRACT

• The paper treats generating Wikipedia articles as a multi-document summarization task.

[Diagram: multi-document summarization splits into extractive and abstractive approaches]

• First, extractive summarization is used to identify the main information in the input documents.
• Second, an abstractive model generates the article from that extraction.
• For the abstractive model, the authors introduce a decoder-only architecture that can work on much longer sequences than the standard encoder-decoder architecture.
• As a result, the model generates fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles.
Extractive vs. Abstractive

• Extractive: take sentences directly from the source to form the summary.
• Abstractive: generate new sentences that describe the source.
INTRODUCTION

Single-document, paraphrasing text summarization:
• Trained end-to-end to predict reference summaries.
• Requires a significant number of parallel article-summary pairs, since language understanding is a prerequisite to generating fluent summaries.
• Selects sentences or phrases from the input to form the summaries.

Multi-document, abstractive text summarization:
• A summary is distilled from a collection of related documents.
• English Wikipedia is treated as a supervised machine learning task for multi-document summarization.
• The goal is to abstractively generate the first section, or lead, of Wikipedia articles conditioned on reference text.
ABSTRACT

• Input
  • Articles cited as references in the Wikipedia article
  • Top 10 Google search results for the article
• Output
  • The Wikipedia article itself (its lead section)
• The dataset contains roughly 2 million article-reference pairs.

The Transformer architecture (Vaswani et al., 2017) is modified to consist only of a decoder, which performs better on long input sequences than recurrent neural network (RNN) and Transformer encoder-decoder models.
INPUT vs. OUTPUT

[Figure: example of the model's input documents and generated output]
Related Work

English Gigaword corpus (Graff & Cieri, 2003), a dataset used to train headline generation:
• Sentence paraphrasing rather than true summarization
• Only the first sentence of the article is used to predict the headline

Nallapati et al. (2016), another dataset used in neural abstractive summarization:
• A question-answering dataset of news articles paired with story highlights from Daily Mail and CNN
• An order of magnitude fewer parallel examples (310k vs. 3.8M) to learn from
• Standard seq2seq models with attention do less well on it
• It is unclear what the guidelines are for creating the story highlights
• There are significant stylistic differences between the two news publishers

Sauper & Barzilay (2009), tasks involving Wikipedia:
• Articles are generated extractively
• The Wikipedia articles are restricted to two categories
• The reference documents are obtained from a search engine

Our work:
• Trains neural abstractive models, but in the multi-document regime with Wikipedia
• The summaries (the Wikipedia lead) are multiple sentences and sometimes multiple paragraphs, written in a fairly uniform style
• Uses all article types
• Shows results with documents found only in the reference section of the Wikipedia articles
2.1 Related Work

• Input and output text are generally much larger, with significant variance depending on the article.
• ROUGE-1 recall: the proportion of unigrams (words) in the output that also occur in the input.
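As a rough illustration (not the official ROUGE toolkit), a minimal Python sketch of this unigram-recall measure, assuming a naive whitespace tokenizer:

```python
from collections import Counter

def rouge1_recall(output_text: str, input_text: str) -> float:
    """Proportion of unigrams in the output that also occur in the input
    (clipped counts, i.e. ROUGE-1 recall of the output w.r.t. the input)."""
    out_counts = Counter(output_text.lower().split())
    in_counts = Counter(input_text.lower().split())
    overlap = sum(min(c, in_counts[w]) for w, c in out_counts.items())
    total = sum(out_counts.values())
    return overlap / total if total else 0.0

# Example: 4 of the 5 output words also appear in the input -> 0.8
print(rouge1_recall("the cat sat on mats", "the cat sat on the mat"))
```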
English Wikipedia as a Multi-Document Summarization Dataset

• Cited sources: for each article, all text without markup is extracted from crawlable citation documents.
• Web search results: to expand the collection of reference documents, search results are crawled from the Google search engine, using the article section titles as queries; the top 10 result pages are collected for each.
• The Wikipedia article itself and Wikipedia clones are removed from the collected documents.
• The data is split 80/10/10 into train/development/test.


Methods and Models

How humans might summarize multiple long documents: first highlight pertinent information, then conditionally generate the summary based on the highlights.

The model follows the same two stages:
1) Coarsely select a subset of the input using extractive summarization.
2) Train an abstractive model that generates the Wikipedia text while conditioning on this extraction.
Extractive stage

1) Identity: as a trivial baseline extractor, we simply use the first L tokens of the input.

2) tf-idf: TF-IDF (term frequency - inverse document frequency) uses word frequencies to determine how relevant a word is to a given document. A non-trivial ranking is to treat the paragraphs as documents in a query-retrieval problem, where the query is the title of the article, T(a_i). We compute tf-idf for the query with respect to the documents, {p_ij}; that is, we sum, for each word in the query:

    N_w * log(N_d / N_dw)

where
• N_w: count of the word in the document
• N_d: total number of documents
• N_dw: total number of documents containing the word
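A minimal Python sketch of this tf-idf paragraph ranking, assuming naive whitespace tokenization (the paper's actual preprocessing differs):

```python
import math
from collections import Counter

def tfidf_rank(title, paragraphs):
    """Rank paragraphs (treated as documents) against the article title
    (treated as the query) by summing N_w * log(N_d / N_dw) over query words."""
    docs = [p.lower().split() for p in paragraphs]
    n_d = len(docs)
    # N_dw: number of paragraphs containing each word
    df = Counter(w for doc in docs for w in set(doc))
    query = title.lower().split()

    def score(doc_tokens):
        counts = Counter(doc_tokens)
        total = 0.0
        for w in query:
            n_w, n_dw = counts[w], df[w]
            if n_w and n_dw:
                total += n_w * math.log(n_d / n_dw)
            # words absent from the paragraph (or from all paragraphs) add 0
        return total

    # paragraph indices, best-scoring first
    return sorted(range(n_d), key=lambda i: score(docs[i]), reverse=True)

print(tfidf_rank("solar eclipse",
                 ["a solar eclipse occurs when ...",
                  "unrelated text about football"]))
```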
3) TextRank: a weighted graph is defined where text units are nodes and edges are weighted by a similarity measure based on word overlap. An algorithm similar to PageRank is then used to compute the ranking of the text units.
• Paragraphs are used as the text units.
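A rough sketch of the TextRank idea, using a simple shared-word-count similarity and a plain power iteration (the similarity function here is an illustrative choice, not necessarily the one used in the paper):

```python
def textrank(paragraphs, damping=0.85, iters=50):
    """Score paragraphs with a PageRank-style iteration over a graph whose
    edge weights are the number of word types shared by two paragraphs."""
    token_sets = [set(p.lower().split()) for p in paragraphs]
    n = len(paragraphs)
    # Symmetric similarity matrix based on word overlap (no self-edges).
    sim = [[len(token_sets[i] & token_sets[j]) if i != j else 0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_weight = sum(sim[j])
                if sim[j][i] and out_weight:
                    rank += sim[j][i] / out_weight * scores[j]
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores  # higher score = more central paragraph

paras = ["the cat sat on the mat",
         "a cat and a dog sat together",
         "quantum field theory"]
print(textrank(paras))
```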
4) SumBasic: word frequencies in the input text are used to assign scores to words, which are in turn used to score sentences. After selecting the best-scoring sentence, the words in it have their scores reduced, and the process is repeated until the desired summary length is reached.
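A minimal sketch of this loop, using one common SumBasic variant in which a word's probability is squared once it has been used:

```python
from collections import Counter

def sumbasic(sentences, max_sentences=3):
    """SumBasic-style selection: score units by average word probability,
    pick the best, then down-weight its words and repeat."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen, remaining = [], list(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        def score(i):
            toks = tokenized[i]
            return sum(prob[w] for w in toks) / len(toks) if toks else 0.0
        best = max(remaining, key=score)
        chosen.append(sentences[best])
        remaining.remove(best)
        # Reduce the scores of words already covered: probabilities are < 1,
        # so squaring them lowers the score of redundant sentences.
        for w in set(tokenized[best]):
            prob[w] **= 2
    return chosen

print(sumbasic(["the cat sat on the mat",
                "the cat chased the dog",
                "stocks fell sharply today"], max_sentences=2))
```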
5) Cheating: a cheating extractor ranks {p_ij} using recall of bigrams in the ground-truth text, i.e. the fraction of the ground-truth lead's bigrams that also appear in the paragraph.
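Bigram recall against the ground-truth lead could be sketched as follows (whitespace tokenization is an assumption):

```python
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def bigram_recall(paragraph: str, ground_truth: str) -> float:
    """Fraction of ground-truth bigrams that also appear in the paragraph."""
    p_bi = bigrams(paragraph.lower().split())
    gt_bi = bigrams(ground_truth.lower().split())
    return len(p_bi & gt_bi) / len(gt_bi) if gt_bi else 0.0

# Rank candidate paragraphs by how much of the ground truth they recall.
candidates = ["the eiffel tower is in paris", "paris is a city in france"]
truth = "the eiffel tower is a landmark in paris france"
print(sorted(candidates, key=lambda p: bigram_recall(p, truth), reverse=True))
```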
• Three non-trivial extractive methods (tf-idf, TextRank, SumBasic) are compared, along with the trivial (identity) and cheating methods, to assess the importance of this stage.
• For each article a_i, we create a ranked list of paragraphs, {p_i,R_i(j)}, occurring in (C_i, S_i), where R_i(j) is the rank of the j-th paragraph p_ij of (C_i, S_i). From this we select the first L tokens as input to the second, abstractive stage.
Abstractive Stage

• Derive the raw text input simply as the concatenation of the extracted paragraphs in rank order (most relevant at the beginning), prefixed with the article title.
• Encode the text with a subword vocabulary of size 32,000, yielding the tokenized input x_i.
• For various values of L in the experiments, the tokens are truncated to form the input sequence.
• Output: the same vocabulary and tokenization is used for the Wikipedia lead text.
• The abstractive models, W, learn to write the articles from this input.
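A minimal sketch of how this input might be assembled; `encode` stands in for the subword tokenizer and is a hypothetical callable, not the tensor2tensor API:

```python
def build_abstractive_input(title, ranked_paragraphs, encode, L):
    """Concatenate the title and the extracted paragraphs (most relevant first),
    tokenize, and truncate to the first L tokens."""
    raw_text = " ".join([title] + ranked_paragraphs)
    tokens = encode(raw_text)   # e.g. a 32,000-subword vocabulary in the paper
    return tokens[:L]           # truncate to the abstractive model's input length

# Toy usage: a whitespace "tokenizer" stands in for the real subword encoder.
toy_encode = lambda text: text.lower().split()
print(build_abstractive_input("Solar eclipse",
                              ["a solar eclipse occurs when ...",
                               "another extracted paragraph"],
                              toy_encode, L=8))
```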
Models

• Baseline models
  • seq2seq-attention: a standard LSTM encoder-decoder
  • Transformer encoder-decoder (T-ED): symmetric encoder and decoder modules
• Transformer decoder (T-D)
  • The suspicion is that, for monolingual text-to-text tasks, redundant information about language is re-learned in the encoder and the decoder.
  • The input and output sequences are therefore combined into a single "sentence" and the model is trained as a standard language model.
  • This roughly halves the number of model parameters for a given hyper-parameter set.
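One way to picture the decoder-only setup: input and output are joined into a single token sequence and the model learns to continue it. The separator and end-of-sequence token ids below are illustrative assumptions, not the paper's exact recipe:

```python
def make_lm_sequence(input_ids, output_ids, sep_id, eos_id):
    """Join extracted-input tokens and Wikipedia-lead tokens into one sequence;
    a decoder-only model is then trained as a standard language model on it."""
    return input_ids + [sep_id] + output_ids + [eos_id]

# At inference time the model is fed the input plus the separator and asked to
# continue the "sentence", i.e. to generate the lead section.
sequence = make_lm_sequence([11, 12, 13], [21, 22], sep_id=1, eos_id=2)
print(sequence)  # [11, 12, 13, 1, 21, 22, 2]
```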
TRANSFORMER DECODER WITH MEMORY-COMPRESSED ATTENTION (T-DMCA)

• Attention maps a query Q and a set of key (K) and value (V) pairs to an output.
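For reference, a NumPy sketch of the standard scaled dot-product attention from Vaswani et al. (2017), which the local and memory-compressed variants below modify:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V   # (n_queries, d_v)

Q = np.random.randn(4, 8)    # 4 queries of dimension 8
K = np.random.randn(10, 8)   # 10 key-value pairs
V = np.random.randn(10, 8)
print(attention(Q, K, V).shape)  # (4, 8)
```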
Local attention

• Divide the sequence tokens into blocks of similar length.
• Attention is performed within each block independently.
• Captures local information.
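A rough NumPy sketch of blocked local self-attention; the block size and the handling of the final partial block are illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_self_attention(Q, K, V, block_size=256):
    """Split the sequence into blocks of similar length and attend within each
    block independently, so every position only sees its own block."""
    n, d_k = Q.shape
    outputs = []
    for start in range(0, n, block_size):
        q = Q[start:start + block_size]
        k = K[start:start + block_size]
        v = V[start:start + block_size]
        scores = q @ k.T / np.sqrt(d_k)
        outputs.append(softmax(scores) @ v)
    return np.concatenate(outputs, axis=0)

x = np.random.randn(1000, 8)  # Q = K = V = x for self-attention
print(local_self_attention(x, x, x).shape)  # (1000, 8)
```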
Memory-compressed attention

• Reduce the number of keys and values by using a strided 1-D convolution.
• This divides the number of activations by a compression factor.
• It allows the model to exchange information globally across the sequence.

• For both local and memory-compressed attention, masking is added to prevent the queries from attending to future keys and values.
• The final architecture is a 5-layer network (LMLML) alternating between local-attention (L) layers and memory-compressed attention (M) layers (in Vaswani et al. (2017) it is 6 identical layers). In some experiments, a mixture-of-experts (MoE) layer (Shazeer et al., 2017) was also added to increase the network's capacity.
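A rough NumPy sketch of the compression step: keys and values are shortened with a strided 1-D convolution before attention. The kernel size, stride, random weights, and the omission of causal masking are simplifications for illustration:

```python
import numpy as np

def strided_conv1d(x, w, stride=3):
    """Strided 1-D convolution over the sequence axis: x is (n, d), w is (kernel, d, d)."""
    kernel = w.shape[0]
    out = []
    for start in range(0, x.shape[0] - kernel + 1, stride):
        window = x[start:start + kernel]               # (kernel, d)
        out.append(np.einsum("kd,kde->e", window, w))  # one compressed position
    return np.stack(out)                               # (~n / stride, d)

def memory_compressed_attention(Q, K, V, w_k, w_v, stride=3):
    """Compress K and V with a strided convolution, then attend as usual,
    dividing the number of key/value activations by the compression factor."""
    K_c = strided_conv1d(K, w_k, stride)
    V_c = strided_conv1d(V, w_v, stride)
    scores = Q @ K_c.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V_c

d, n = 8, 999
x = np.random.randn(n, d)
w_k = np.random.randn(3, d, d) * 0.1   # kernel 3, stride 3 -> ~3x fewer keys/values
w_v = np.random.randn(3, d, d) * 0.1
print(memory_compressed_attention(x, x, x, w_k, w_v).shape)  # (999, 8)
```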
5.2 MODEL TRAINING DETAILS AND DECODING

• For all abstractive model training, the open-source tensor2tensor library is used.
• The seq2seq baseline had a hidden size of 128 with 2 layers.
• For the Transformer encoder-decoder (T-ED), the hyper-parameter set transformer_base_v1 is used and the model is trained for 1 million steps. Models exhibited very little over-fitting and did not require early stopping.
• Decoding
  • During decoding, a beam search of size 4 and a length penalty α = 0.6 are used.
  • Decoding continues until an end-of-sequence token is reached.
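A standard choice for this length penalty (used in tensor2tensor's beam search, following Wu et al., 2016) divides a hypothesis's log-probability by ((5 + length)/6)^α. A small sketch of rescoring beam hypotheses with α = 0.6 (the scores are made up for illustration):

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Length-normalization term from Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(hypotheses, alpha=0.6):
    """Rank beam hypotheses by log-probability divided by the length penalty."""
    return sorted(hypotheses,
                  key=lambda h: h["logprob"] / length_penalty(len(h["tokens"]), alpha),
                  reverse=True)

# Illustrative hypotheses: without normalization the shorter one has the higher
# raw log-probability; with the penalty the longer hypothesis ranks first.
beams = [
    {"tokens": [5, 9, 2],          "logprob": -3.1},
    {"tokens": [5, 9, 7, 4, 8, 2], "logprob": -3.5},
]
print(rescore(beams)[0]["tokens"])  # [5, 9, 7, 4, 8, 2]
```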
Experiment

• Four main dimensions are varied in the experiments on generating Wikipedia lead sections:
  • Extractive method: SumBasic, TextRank, tf-idf, identity, cheating extractor
  • Input corpus: citations, search results, combined
  • Abstractive model input length L: values between 100 and 11000
  • Abstractive model architecture: seq2seq-att, T-ED, T-D, T-DMCA
• The gaps between the combined corpus and using only one of citations or search results are both significant, and their contributions are complementary (Table 3, row 5).
• Extractive-only is not enough (Figure 2):
  • The extractive methods perform roughly in line with each other in terms of ROUGE-1 F1.
  • The best abstractive model more than doubles this metric.
Smart extraction is critical for final abstractive performance (Table 3, rows 1-4):
• There is a significant gap between doing nothing (identity) and extractive summarization (tf-idf) (first vs. second row).
• The significant gap between tf-idf and the cheating extractor suggests that improving the extraction step could yield further significant gains.
• Unsurprisingly, the combined dataset performs best, but the gaps between it and using only citations or only search results are both significant and their contributions are complementary. In subsequent experiments, only the combined results are reported.
• The Transformer encoder-decoder (T-ED) architecture consistently improves in performance up to a best of around L = 500-1000, and it is unable to learn at L = 2000.
• This motivated the Transformer decoder (T-D), which could learn and keep improving up to L = 4000.
• With the T-DMCA modification, the model could be trained up to L = 11000 and continued to see improvements in performance.
• The MoE layer helped performance by adding capacity at high L.
Human Evaluation - Linguistic Quality
• Five different dimensions are assessed.
• As seen in the table, the T-DMCA model performs statistically better on all dimensions, except on non-redundancy where tf-idf does about as well.
• Occasionally, some repetition of phrases was observed, hurting non-redundancy and structure, but this was much rarer than with the other abstractive method, seq2seq.
• The biggest weakness of the extractive method compared with the best abstractive model was the lack of structure and coherence in the summaries.
Conclusion

• Generating Wikipedia can be approached as a multi-document summarization problem with a large, parallel dataset.
• A two-stage extractive-abstractive framework is proposed for carrying it out.
• The coarse extraction method used in the first stage appears to have a significant effect on final performance, suggesting further research on improving it.
• A new decoder-only sequence transduction model is introduced for the abstractive stage, capable of handling very long input-output examples.
• This model significantly outperforms traditional encoder-decoder architectures on long sequences, allowing the authors to condition on many reference documents and to generate coherent and informative Wikipedia articles.
Thank you
