Generating Wikipedia by Summarizing Long Sequences

Presented by: Jinali Ghoghari
General Information

• Published date: 30 January 2018
• Number of citations: 688
• Dataset used: WikiSum
• Published by: Google Brain
ABSTRACT

• The paper treats generating Wikipedia articles as a multi-document summarization task.

[Diagram: multi-document summarization splits into extractive and abstractive approaches]

• First, extractive summarization is used to identify the main information in the input documents.
• Second, an abstractive model generates the article from that extraction.
• For the abstractive model, the authors introduce a decoder-only architecture that can work on much longer sequences than the standard encoder-decoder architecture.
• As a result, the model generates fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles.
Extractive vs. Abstractive

• Extractive: take sentences directly from the source to form the summary.
• Abstractive: generate new sentences that describe the source.
INTRODUCTION

Single-document, paraphrasing text summarization:
• Trained end-to-end to predict reference summaries.
• Requires a significant number of parallel article-summary pairs, since language understanding is a prerequisite to generating fluent summaries.
• Selects sentences or phrases from the input to form the summaries.

Multi-document, abstractive text summarization:
• A summary is distilled from a collection of related documents.
• English Wikipedia is treated as a supervised machine learning task for multi-document summarization.
• The goal is to abstractively generate the first section, or lead, of Wikipedia articles conditioned on reference text.
ABSTRACT

• Input
  • Articles cited as references in the Wikipedia article
  • Top 10 Google search results for the article
• Output
  • The Wikipedia article itself (its lead section)
• The dataset contains roughly 2 million article-reference pairs.

The Transformer architecture (Vaswani et al., 2017) is modified to consist only of a decoder, which performs better on long input sequences than recurrent neural network (RNN) and Transformer encoder-decoder models.
INPUT vs. OUTPUT

[Figure: example of the model's input documents and generated output]
Related Work

English Gigaword corpus (Graff & Cieri, 2003), a dataset used to train headline generation:
• Sentence paraphrasing rather than true summarization
• Only the first sentence of the article is used to predict the headline

Nallapati et al. (2016), another dataset used in neural abstractive summarization:
• A question-answering dataset of news articles paired with story highlights from Daily Mail and CNN
• An order of magnitude fewer parallel examples (310k vs. 3.8M) to learn from
• Standard seq2seq models with attention do less well on it
• It is unclear what the guidelines are for creating the story highlights
• There are significant stylistic differences between the two news publishers

Sauper & Barzilay (2009), tasks involving Wikipedia:
• Articles are generated extractively
• The Wikipedia articles are restricted to two categories
• The reference documents are obtained from a search engine

Our work:
• Trains neural abstractive models, but in the multi-document regime with Wikipedia
• The summaries (the Wikipedia lead) are multiple sentences and sometimes multiple paragraphs, written in a fairly uniform style
• Uses all article types
• Shows results with documents found only in the reference section of the Wikipedia articles
2.1 Related Work

• Input and output text are generally much larger, with significant variance depending on the article.
• ROUGE-1 recall: the proportion of unigrams (words) in the output that also occur in the input.
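As a rough illustration (not the official ROUGE toolkit), a minimal Python sketch of this unigram-recall measure, assuming a naive whitespace tokenizer:

```python
from collections import Counter

def rouge1_recall(output_text: str, input_text: str) -> float:
    """Proportion of unigrams in the output that also occur in the input
    (clipped counts, i.e. ROUGE-1 recall of the output w.r.t. the input)."""
    out_counts = Counter(output_text.lower().split())
    in_counts = Counter(input_text.lower().split())
    overlap = sum(min(c, in_counts[w]) for w, c in out_counts.items())
    total = sum(out_counts.values())
    return overlap / total if total else 0.0

# Example: 4 of the 5 output words also appear in the input -> 0.8
print(rouge1_recall("the cat sat on mats", "the cat sat on the mat"))
```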
English Wikipedia as a Multi-Document Summarization Dataset

• Cited sources: for each article, all text without markup is extracted from crawlable citation documents.
• Web search results: to expand the collection of reference documents, search results are crawled from the Google search engine, using the article section titles as queries; the top 10 result pages are collected for each.
• The Wikipedia article itself and Wikipedia clones are removed from the collected documents.
• The data is split 80/10/10 into train/development/test.


Methods and Models

How humans might summarize multiple long documents: first highlight pertinent information, then conditionally generate the summary based on the highlights.

The model follows the same two stages:
1) Coarsely select a subset of the input using extractive summarization.
2) Train an abstractive model that generates the Wikipedia text while conditioning on this extraction.
Extractive stage

1) Identity: as a trivial baseline extractor, we simply use the first L tokens of the input.

2) tf-idf: TF-IDF (term frequency - inverse document frequency) uses word frequencies to determine how relevant a word is to a given document. A non-trivial ranking is to treat the paragraphs as documents in a query-retrieval problem, where the query is the title of the article, T(a_i). We compute tf-idf for the query with respect to the documents, {p_ij}; that is, we sum, for each word in the query:

    N_w * log(N_d / N_dw)

where
• N_w: count of the word in the document
• N_d: total number of documents
• N_dw: total number of documents containing the word
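A minimal Python sketch of this tf-idf paragraph ranking, assuming naive whitespace tokenization (the paper's actual preprocessing differs):

```python
import math
from collections import Counter

def tfidf_rank(title, paragraphs):
    """Rank paragraphs (treated as documents) against the article title
    (treated as the query) by summing N_w * log(N_d / N_dw) over query words."""
    docs = [p.lower().split() for p in paragraphs]
    n_d = len(docs)
    # N_dw: number of paragraphs containing each word
    df = Counter(w for doc in docs for w in set(doc))
    query = title.lower().split()

    def score(doc_tokens):
        counts = Counter(doc_tokens)
        total = 0.0
        for w in query:
            n_w, n_dw = counts[w], df[w]
            if n_w and n_dw:
                total += n_w * math.log(n_d / n_dw)
            # words absent from the paragraph (or from all paragraphs) add 0
        return total

    # paragraph indices, best-scoring first
    return sorted(range(n_d), key=lambda i: score(docs[i]), reverse=True)

print(tfidf_rank("solar eclipse",
                 ["a solar eclipse occurs when ...",
                  "unrelated text about football"]))
```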
3) TextRank: a weighted graph is defined where text units are nodes and edges are weighted by a similarity measure based on word overlap. An algorithm similar to PageRank is then used to compute the ranking of the text units.
• Paragraphs are used as the text units.
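A rough sketch of the TextRank idea, using a simple shared-word-count similarity and a plain power iteration (the similarity function here is an illustrative choice, not necessarily the one used in the paper):

```python
def textrank(paragraphs, damping=0.85, iters=50):
    """Score paragraphs with a PageRank-style iteration over a graph whose
    edge weights are the number of word types shared by two paragraphs."""
    token_sets = [set(p.lower().split()) for p in paragraphs]
    n = len(paragraphs)
    # Symmetric similarity matrix based on word overlap (no self-edges).
    sim = [[len(token_sets[i] & token_sets[j]) if i != j else 0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_weight = sum(sim[j])
                if sim[j][i] and out_weight:
                    rank += sim[j][i] / out_weight * scores[j]
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores  # higher score = more central paragraph

paras = ["the cat sat on the mat",
         "a cat and a dog sat together",
         "quantum field theory"]
print(textrank(paras))
```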
4) SumBasic: word frequencies in the input text are used to assign scores to words, which are in turn used to score sentences. After selecting the best-scoring sentence, the words in it have their scores reduced, and the process is repeated until the desired summary length is reached.
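A minimal sketch of this loop, using one common SumBasic variant in which a word's probability is squared once it has been used:

```python
from collections import Counter

def sumbasic(sentences, max_sentences=3):
    """SumBasic-style selection: score units by average word probability,
    pick the best, then down-weight its words and repeat."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokenized for w in toks)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen, remaining = [], list(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        def score(i):
            toks = tokenized[i]
            return sum(prob[w] for w in toks) / len(toks) if toks else 0.0
        best = max(remaining, key=score)
        chosen.append(sentences[best])
        remaining.remove(best)
        # Reduce the scores of words already covered: probabilities are < 1,
        # so squaring them lowers the score of redundant sentences.
        for w in set(tokenized[best]):
            prob[w] **= 2
    return chosen

print(sumbasic(["the cat sat on the mat",
                "the cat chased the dog",
                "stocks fell sharply today"], max_sentences=2))
```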
5) Cheating: a cheating extractor ranks {p_ij} using recall of bigrams in the ground-truth text, i.e. the fraction of the ground-truth lead's bigrams that also appear in the paragraph.
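Bigram recall against the ground-truth lead could be sketched as follows (whitespace tokenization is an assumption):

```python
def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def bigram_recall(paragraph: str, ground_truth: str) -> float:
    """Fraction of ground-truth bigrams that also appear in the paragraph."""
    p_bi = bigrams(paragraph.lower().split())
    gt_bi = bigrams(ground_truth.lower().split())
    return len(p_bi & gt_bi) / len(gt_bi) if gt_bi else 0.0

# Rank candidate paragraphs by how much of the ground truth they recall.
candidates = ["the eiffel tower is in paris", "paris is a city in france"]
truth = "the eiffel tower is a landmark in paris france"
print(sorted(candidates, key=lambda p: bigram_recall(p, truth), reverse=True))
```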
• Three non-trivial extractive methods (tf-idf, TextRank, SumBasic) are compared, along with the trivial (identity) and cheating methods, to assess the importance of this stage.
• For each article a_i, we create a ranked list of paragraphs, {p_i,R_i(j)}, occurring in (C_i, S_i), where R_i(j) is the rank of the j-th paragraph p_ij of (C_i, S_i). From this we select the first L tokens as input to the second, abstractive stage.
Abstractive Stage

• Derive the raw text input simply as the concatenation of the extracted paragraphs in rank order (most relevant at the beginning), prefixed with the article title.
• Encode the text with a subword vocabulary of size 32,000, yielding the tokenized input x_i.
• For various values of L in the experiments, the tokens are truncated to form the input sequence.
• Output: the same vocabulary and tokenization is used for the Wikipedia lead text.
• The abstractive models, W, learn to write the articles from this input.
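A minimal sketch of how this input might be assembled; `encode` stands in for the subword tokenizer and is a hypothetical callable, not the tensor2tensor API:

```python
def build_abstractive_input(title, ranked_paragraphs, encode, L):
    """Concatenate the title and the extracted paragraphs (most relevant first),
    tokenize, and truncate to the first L tokens."""
    raw_text = " ".join([title] + ranked_paragraphs)
    tokens = encode(raw_text)   # e.g. a 32,000-subword vocabulary in the paper
    return tokens[:L]           # truncate to the abstractive model's input length

# Toy usage: a whitespace "tokenizer" stands in for the real subword encoder.
toy_encode = lambda text: text.lower().split()
print(build_abstractive_input("Solar eclipse",
                              ["a solar eclipse occurs when ...",
                               "another extracted paragraph"],
                              toy_encode, L=8))
```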
Models

• Baseline models
  • seq2seq-attention: a standard LSTM encoder-decoder
  • Transformer encoder-decoder (T-ED): symmetric encoder and decoder modules
• Transformer decoder (T-D)
  • The suspicion is that, for monolingual text-to-text tasks, redundant information about language is re-learned in the encoder and the decoder.
  • The input and output sequences are therefore combined into a single "sentence" and the model is trained as a standard language model.
  • This roughly halves the number of model parameters for a given hyper-parameter set.
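One way to picture the decoder-only setup: input and output are joined into a single token sequence and the model learns to continue it. The separator and end-of-sequence token ids below are illustrative assumptions, not the paper's exact recipe:

```python
def make_lm_sequence(input_ids, output_ids, sep_id, eos_id):
    """Join extracted-input tokens and Wikipedia-lead tokens into one sequence;
    a decoder-only model is then trained as a standard language model on it."""
    return input_ids + [sep_id] + output_ids + [eos_id]

# At inference time the model is fed the input plus the separator and asked to
# continue the "sentence", i.e. to generate the lead section.
sequence = make_lm_sequence([11, 12, 13], [21, 22], sep_id=1, eos_id=2)
print(sequence)  # [11, 12, 13, 1, 21, 22, 2]
```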
TRANSFORMER DECODER WITH MEMORY-COMPRESSED ATTENTION (T-DMCA)

• Attention maps a query Q and a set of key (K) and value (V) pairs to an output.
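For reference, a NumPy sketch of the standard scaled dot-product attention from Vaswani et al. (2017), which the local and memory-compressed variants below modify:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V   # (n_queries, d_v)

Q = np.random.randn(4, 8)    # 4 queries of dimension 8
K = np.random.randn(10, 8)   # 10 key-value pairs
V = np.random.randn(10, 8)
print(attention(Q, K, V).shape)  # (4, 8)
```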
Local attention

• Divide the sequence tokens into blocks of similar length.
• Attention is performed within each block independently.
• Captures local information.
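A rough NumPy sketch of blocked local self-attention; the block size and the handling of the final partial block are illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_self_attention(Q, K, V, block_size=256):
    """Split the sequence into blocks of similar length and attend within each
    block independently, so every position only sees its own block."""
    n, d_k = Q.shape
    outputs = []
    for start in range(0, n, block_size):
        q = Q[start:start + block_size]
        k = K[start:start + block_size]
        v = V[start:start + block_size]
        scores = q @ k.T / np.sqrt(d_k)
        outputs.append(softmax(scores) @ v)
    return np.concatenate(outputs, axis=0)

x = np.random.randn(1000, 8)  # Q = K = V = x for self-attention
print(local_self_attention(x, x, x).shape)  # (1000, 8)
```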
Memory-compressed attention

• Reduce the number of keys and values by using a strided 1-D convolution.
• This divides the number of activations by a compression factor.
• It allows the model to exchange information globally across the sequence.

• For both local and memory-compressed attention, masking is added to prevent the queries from attending to future keys and values.
• The final architecture is a 5-layer network (LMLML) alternating between local-attention (L) layers and memory-compressed attention (M) layers (in Vaswani et al. (2017) it is 6 identical layers). In some experiments, a mixture-of-experts (MoE) layer (Shazeer et al., 2017) was also added to increase the network's capacity.
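A rough NumPy sketch of the compression step: keys and values are shortened with a strided 1-D convolution before attention. The kernel size, stride, random weights, and the omission of causal masking are simplifications for illustration:

```python
import numpy as np

def strided_conv1d(x, w, stride=3):
    """Strided 1-D convolution over the sequence axis: x is (n, d), w is (kernel, d, d)."""
    kernel = w.shape[0]
    out = []
    for start in range(0, x.shape[0] - kernel + 1, stride):
        window = x[start:start + kernel]               # (kernel, d)
        out.append(np.einsum("kd,kde->e", window, w))  # one compressed position
    return np.stack(out)                               # (~n / stride, d)

def memory_compressed_attention(Q, K, V, w_k, w_v, stride=3):
    """Compress K and V with a strided convolution, then attend as usual,
    dividing the number of key/value activations by the compression factor."""
    K_c = strided_conv1d(K, w_k, stride)
    V_c = strided_conv1d(V, w_v, stride)
    scores = Q @ K_c.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V_c

d, n = 8, 999
x = np.random.randn(n, d)
w_k = np.random.randn(3, d, d) * 0.1   # kernel 3, stride 3 -> ~3x fewer keys/values
w_v = np.random.randn(3, d, d) * 0.1
print(memory_compressed_attention(x, x, x, w_k, w_v).shape)  # (999, 8)
```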
5.2 MODEL TRAINING DETAILS AND DECODING

• For all abstractive model training, the open-source tensor2tensor library is used.
• The seq2seq baseline had a hidden size of 128 with 2 layers.
• For the Transformer encoder-decoder (T-ED), the hyper-parameter set transformer_base_v1 is used and the model is trained for 1 million steps. Models exhibited very little over-fitting and did not require early stopping.
• Decoding
  • During decoding, a beam search of size 4 and a length penalty α = 0.6 are used.
  • Decoding continues until an end-of-sequence token is reached.
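A standard choice for this length penalty (used in tensor2tensor's beam search, following Wu et al., 2016) divides a hypothesis's log-probability by ((5 + length)/6)^α. A small sketch of rescoring beam hypotheses with α = 0.6 (the scores are made up for illustration):

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """Length-normalization term from Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(hypotheses, alpha=0.6):
    """Rank beam hypotheses by log-probability divided by the length penalty."""
    return sorted(hypotheses,
                  key=lambda h: h["logprob"] / length_penalty(len(h["tokens"]), alpha),
                  reverse=True)

# Illustrative hypotheses: without normalization the shorter one has the higher
# raw log-probability; with the penalty the longer hypothesis ranks first.
beams = [
    {"tokens": [5, 9, 2],          "logprob": -3.1},
    {"tokens": [5, 9, 7, 4, 8, 2], "logprob": -3.5},
]
print(rescore(beams)[0]["tokens"])  # [5, 9, 7, 4, 8, 2]
```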
Experiment

• Four main dimensions are varied in the experiments on generating Wikipedia lead sections:
  • Extractive method: SumBasic, TextRank, tf-idf, identity, cheating extractor
  • Input corpus: citations, search results, combined
  • Abstractive model input length L: values between 100 and 11000
  • Abstractive model architecture: seq2seq-att, T-ED, T-D, T-DMCA
• The gaps between the combined corpus and using only one of citations or search results are both significant, and their contributions are complementary (Table 3, row 5).
• Extractive-only is not enough (Figure 2):
  • The extractive methods perform roughly in line with each other in terms of ROUGE-1 F1.
  • The best abstractive model more than doubles this metric.
Smart extraction is critical for final abstractive performance (Table 3, rows 1-4):
• There is a significant gap between doing nothing (identity) and extractive summarization (tf-idf) (first vs. second row).
• The significant gap between tf-idf and the cheating extractor suggests that improving the extraction step could yield further significant gains.
• Unsurprisingly, the combined dataset performs best, but the gaps between it and using only citations or only search results are both significant and their contributions are complementary. In subsequent experiments, only the combined results are reported.
• The Transformer encoder-decoder (T-ED) architecture consistently improves in performance up to a best of around L = 500-1000, and it is unable to learn at L = 2000.
• This motivated the Transformer decoder (T-D), which could learn and keep improving up to L = 4000.
• With the T-DMCA modification, the model could be trained up to L = 11000 and continued to see improvements in performance.
• The MoE layer helped performance by adding capacity at high L.
Human Evaluation - Linguistic Quality
• Five different dimensions are assessed.
• As seen in the table, the T-DMCA model performs statistically better on all dimensions, except on non-redundancy where tf-idf does about as well.
• Occasionally, some repetition of phrases was observed, hurting non-redundancy and structure, but this was much rarer than with the other abstractive method, seq2seq.
• The biggest weakness of the extractive method compared with the best abstractive model was the lack of structure and coherence in the summaries.
Conclusion

• Generating Wikipedia can be approached as a multi-document summarization problem with a large, parallel dataset.
• A two-stage extractive-abstractive framework is proposed for carrying it out.
• The coarse extraction method used in the first stage appears to have a significant effect on final performance, suggesting further research on improving it.
• A new decoder-only sequence transduction model is introduced for the abstractive stage, capable of handling very long input-output examples.
• This model significantly outperforms traditional encoder-decoder architectures on long sequences, allowing the authors to condition on many reference documents and to generate coherent and informative Wikipedia articles.
Thank you
