Abstractive Summarization
1 Introduction
Recent work in large-scale language models [19; 20; 5] has allowed pretrained contextual repre-
sentations to be easily adapted for a variety of downstream tasks, yielding improvements on many
benchmarks evaluating natural language understanding [27]. Less explored, however, has been the
effect of these pretrained representations on text production tasks, such as abstractive summarization,
where state of the art performance is still achieved with sequence-to-sequence (seq2seq) models [1; 6].
These sequence-to-sequence methods typically use an encoder and decoder model with separate
parameters to represent the input article and produce the output summary, and the most successful
solutions [1; 6; 22] use attention mechanisms that learn an alignment between encoder and decoder
states. Pretrained language models, however, do not learn the parameters for such a task-specific
alignment, making it challenging to integrate their learned representations into a summarization
architecture at a higher level of abstraction than the word embedding.
In this work, we adapt full transformer language models for abstractive summarization. Building on
the work of Liu et al. [12], who first proposed concatenating the input and output text into a single
sequence and encoding both with a common transformer, we use a language model as a summarizer (rather
than an encoder-decoder). With this approach, representations from a pretrained transformer language
model (in this case, GPT [20]) can be used to fully initialize the parameters of the summarization
model, allowing it to leverage the representational power of a model trained at much larger scale.
To accomplish this effectively, we outline two strategies for adapting pretrained representations for
abstractive summarization. In the first, we augment the input representation of the summarization
model by instantiating source embeddings that encode the token type of the text being read. This
change allows the model to recognize whether a given token belongs to the input article or the output
summary, thereby learning how to distinguish both types of text when encoding. In the second, we
Figure 1: Input representation of the summarization model. Each token of the concatenated sequence (article tokens $x^a_1, \ldots, x^a_M$, a delimiter token, summary tokens $x^s_1, \ldots, x^s_N$, and an end token) is represented by the sum of its word embedding, its position embedding ($p_1, \ldots, p_M$ for the article, $p_0$ for the delimiter, $p_1, \ldots, p_N$ for the summary, and $p_{N+1}$ for the end token), and a source embedding ($d^a$ for article tokens, $d^s$ for summary tokens).
introduce a domain-adaptive training procedure that fine-tunes the transformer toward understanding
general newswire text before training on the summarization end task directly, allowing the model
to learn the general structure and language distribution of newswire text before being fine-tuned to
produce summaries.
A comprehensive empirical study across three datasets, CNN/DailyMail [8], XSum [16], and News-
room [7], shows that transformer language models can be used to train abstractive summarizers,
producing summaries that are more concise and focused than state of the art baselines. Our investiga-
tion also empirically validates several observations about the abstractive summarization task. First,
echoing the results of Sun et al. [24], the most common summarization evaluation metric, ROUGE
[10], is highly sensitive to summary length, providing an advantage to methods that produce longer
summaries, either through learning or with minimum summary length constraints. Second, achieving
higher ROUGE scores is not strongly consistent with human assessments of abstractive summary
quality. Finally, despite being conceived as abstractive summarizers, most current state of the art
models are highly extractive, copying phrases and even sentences verbatim from the document.
2 Model
In this paper, we focus on a variant of the Transformer [26] that has been pretrained on a large corpus
of natural language stories: the GPT model [20]. As our architecture is practically identical to the
one proposed in Radford et al. [20], we point readers to that work for background on the architecture
of the model, and focus below on the enhancements to the input representation made in our approach.
Word Embedding First, each token $x_t$ in the concatenated sequence $X$ indexes a word embedding $e[x_t] \in \mathbb{R}^h$ from a joint vocabulary for the article and summary (and special tokens).
Position Embedding Second, since the transformer (a self-attention model) has no inherent notion of token order, a position embedding $p_t \in \mathbb{R}^h$ is initialized for each absolute position in the sequence [26]. The embedding for each position is added to the word embedding of the token occupying that position, augmenting the final representation of the input. For example, each token in the article is represented as $w^a_m = e[x^a_m] + p_m$. Once the delimiter token <D> is reached, the position counter is reset: the first token of the article, $x^a_1$, and the first token of the summary, $x^s_1$, both receive $p_1$ as a positional embedding to augment their representations.
Source Embedding Finally, because the transformer must recognize pragmatic differences between the text of the article it reads and the text of the summary it learns to produce, an additional source-specific embedding $d \in \mathbb{R}^h$ is initialized. The source embedding encodes whether a token is from the article portion ($d^a$) of the concatenated input or the summary portion ($d^s$). For any article token (Eq. 2) or summary token (Eq. 3), the final encoding is:

$w^a_m = e[x^a_m] + p_m + d^a \quad \forall m \in [1, M] \qquad (2)$
$w^s_n = e[x^s_n] + p_n + d^s \quad \forall n \in [1, N] \qquad (3)$
In contrast to the other embeddings in the model, the source embeddings are not pretrained, so they could dominate the pretrained word and position embeddings when summed (Eq. 2, 3). To avoid this, we normalize the random initialization of the source embeddings to have a norm equal to half of the average norm of the word embeddings.
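To make this input representation concrete, the following PyTorch-style sketch combines the three embeddings and applies the source-embedding rescaling described above. Module, argument, and variable names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SummarizationInputEmbedding(nn.Module):
    """Sketch of the input representation: word + position + source embeddings."""

    def __init__(self, vocab_size, max_positions, hidden_size, pretrained_word_emb=None):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_emb = nn.Embedding(max_positions, hidden_size)
        # Two source embeddings: index 0 = article (d^a), index 1 = summary (d^s).
        self.src_emb = nn.Embedding(2, hidden_size)
        if pretrained_word_emb is not None:
            self.word_emb.weight.data.copy_(pretrained_word_emb)
            # Rescale the randomly initialized source embeddings so that their norm
            # equals half of the average norm of the pretrained word embeddings.
            target_norm = 0.5 * pretrained_word_emb.norm(dim=-1).mean()
            with torch.no_grad():
                self.src_emb.weight.mul_(
                    target_norm / self.src_emb.weight.norm(dim=-1, keepdim=True))

    def forward(self, token_ids, position_ids, source_ids):
        # position_ids reset after the delimiter token <D>; source_ids are
        # 0 for article tokens and 1 for summary tokens.
        return self.word_emb(token_ids) + self.pos_emb(position_ids) + self.src_emb(source_ids)
```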
3 Training
The model is initialized with pretrained parameters from the GPT model [20] that was trained on
the BooksCorpus [28]. Following this initialization, we pursue two additional training procedures:
domain-adaptive training and end task training.
Despite the benefit of using pretrained representations from the GPT model to initialize a summarizer,
there is a language shift between the storybooks data on which the GPT model was trained and
the type of language found in newswire summarization datasets [8; 16; 7]. Additionally, there are
structural differences between how articles are written (usually expressing salient points early on,
followed by details later) and how stories unfold (less front-loading of key information).
To address this discrepancy, we propose domain-adaptive training (DAT) to adapt the transformer
summarization model to the language distribution of newswire text by maximizing the conditional
loglikelihood of the article tokens and summary tokens given all previous tokens in their concatenated
input representation (see Figure 1):
$\mathcal{L}_{dat} = -\sum_{m=1}^{M} \log P(x^a_m \mid \{x^a\}_{<m}) - \sum_{n=1}^{N} \log P(x^s_n \mid \{x^s\}_{<n}, \{x^a\}_1^M) \qquad (4)$
where $M$ is the length of the article, $N$ is the length of the summary, $\{x^a\}_{<m}$ is the set of all article tokens that precede $x^a_m$, $\{x^s\}_{<n}$ is the set of all summary tokens that precede $x^s_n$, and $\{x^a\}_1^M$ is the set of all article tokens. In this framework, the model is adapted to produce newswire-like language before being trained on the summarization end task, which focuses only on learning to produce summaries.
During end task training (ETT), the model is trained specifically to produce a summary given a document, constraining the loss function to maximize the conditional loglikelihood of producing only the correct summary tokens given the set of article tokens $\{x^a\}_1^M$:

$\mathcal{L}_{ett} = -\sum_{n=1}^{N} \log P(x^s_n \mid \{x^s\}_{<n}, \{x^a\}_1^M) \qquad (5)$

where $\{x^s\}_{<n}$ is the set of summary tokens that precede $x^s_n$.
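The two objectives differ only in which positions of the concatenated sequence contribute to the loss. The sketch below computes them as a masked token-level negative log-likelihood; the helper name and tensor layout are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, targets, loss_mask):
    """Summed negative log-likelihood over the positions selected by loss_mask.

    logits:    [batch, seq_len, vocab] next-token predictions from the transformer
    targets:   [batch, seq_len] gold token ids (inputs shifted left by one), dtype long
    loss_mask: [batch, seq_len] 1.0 where the loss applies, 0.0 elsewhere
    """
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="none").view_as(loss_mask)
    return (nll * loss_mask).sum()

# Domain-adaptive training (Eq. 4): loss over both article and summary positions.
#   L_dat = lm_loss(logits, targets, article_mask + summary_mask)
# End task training (Eq. 5): loss over summary positions only.
#   L_ett = lm_loss(logits, targets, summary_mask)
```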
Table 1: Comparison of summarization datasets with respect to dataset size, proportion of unique
n-grams, mean article length in words, and mean summary length in words.
Dataset | Split Size (Train / Validation / Test) | % Novel n-grams in Gold Summary (unigrams / bigrams / trigrams / 4-grams) | Mean # Words (Article / Summary)
Newsroom | 993,746 / 108,590 / 108,636 | 17.40 / 44.05 / 55.38 / 61.21 | 658.6 / 26.7
XSum | 204,045 / 11,332 / 11,334 | 34.88 / 78.78 / 92.03 / 96.80 | 431.1 / 23.3
CNN/DailyMail | 287,227 / 13,368 / 11,490 | 12.70 / 46.29 / 65.04 / 75.56 | 685.2 / 52.0
4 Experimental Setup
Datasets The CNN/Daily Mail dataset [8] consists of articles from CNN and Daily Mail. Each
article is associated with several descriptive bullet point highlights. Similar to previous work [15],
we concatenate the highlights to create a target summary for each article in the dataset and use the
same dataset splits. The Extreme Summarization (XSum) dataset [16] consists of ∼230k article
summary pairs taken from the BBC. Each summary is a single sentence long and is professionally
written (usually by the author), making the dataset exhibit more abstractive content than typical
summarization datasets, such as CNN/DailyMail [8]. The Newsroom dataset [7] consists of ∼1.2M
article summary pairs scraped from the Internet Archive. The articles come from a set of 38 publishers
and cover diverse topics. We provide statistics about each dataset in Table 1.
Data Preprocessing We use byte-pair encoding (BPE) for tokenization. For each summarization dataset, we tokenize each article and summary with the BPE, truncate articles to a maximum length of 512 tokens and summaries to a maximum length of 110 tokens, and then format each article-summary pair as outlined in Figure 1.
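As a concrete illustration of this format, the sketch below builds the token, position, and source indices for a single article-summary pair, following the layout in Figure 1 (positions reset after the delimiter; source indices distinguish article from summary). Function and special-token names are hypothetical, and grouping the delimiter and end tokens with the summary follows Figure 1 rather than the released code.

```python
def build_example(article_ids, summary_ids, delim_id, end_id,
                  max_article_len=512, max_summary_len=110):
    """Format one BPE-tokenized article/summary pair as in Figure 1 (illustrative)."""
    article_ids = article_ids[:max_article_len]
    summary_ids = summary_ids[:max_summary_len]
    token_ids = article_ids + [delim_id] + summary_ids + [end_id]
    # Positions: 1..M for the article, 0 for <D>, 1..N for the summary, N+1 for <E>.
    position_ids = (list(range(1, len(article_ids) + 1)) + [0]
                    + list(range(1, len(summary_ids) + 1)) + [len(summary_ids) + 1])
    # Source indices: 0 (article, d^a) vs. 1 (summary, d^s); the delimiter and end
    # tokens are grouped with the summary here, an assumption based on Figure 1.
    source_ids = [0] * len(article_ids) + [1] * (len(summary_ids) + 2)
    return token_ids, position_ids, source_ids
```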
Training Details All models were trained with a learning rate of 6.25 × 10−5 and a minibatch
size of 64. When domain-adaptive training (DAT) is used, we train for 10 epochs using DAT and
then for an additional 10 epochs using end task training (ETT). Without DAT, we train on the end
task for 20 epochs. Unless specified otherwise, the final model trained for each dataset uses both
domain-adaptive training and end task training. We did not tune hyperparameters. All models were
trained using the PyTorch package1 and the HuggingFace implementation of GPT.2 We trained each
model on 8 Tesla V100-SXM2 GPUs. Training for a total of 20 epochs took approximately 1 day of wall-clock time for the XSum and CNN/Daily Mail datasets, and 3 days for the Newsroom dataset. Our source
code is publicly available.3
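For clarity, the overall training schedule can be sketched as below. It reuses the lm_loss helper from the loss sketch above; the optimizer choice, batch attributes, and function names are illustrative assumptions rather than the exact released configuration.

```python
import torch

def train(model, batches, use_dat=True, epochs_per_phase=10, lr=6.25e-5):
    """Illustrative schedule: 10 epochs of DAT then 10 of ETT, or 20 epochs of ETT."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    phases = ["dat", "ett"] if use_dat else ["ett", "ett"]
    for phase in phases:
        for _ in range(epochs_per_phase):
            for batch in batches:
                logits = model(batch.token_ids, batch.position_ids, batch.source_ids)
                mask = (batch.article_mask + batch.summary_mask) if phase == "dat" \
                    else batch.summary_mask
                loss = lm_loss(logits, batch.targets, mask)  # see the loss sketch above
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```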
Generation We perform generation using beam search with a beam size of 3, applying the trigram trick [18] during the search. Each summary token is generated by decoding from the distribution produced by the model when processing the concatenation of the article tokens, the delimiter token, and any previously generated summary tokens.
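The trigram trick blocks beam hypotheses that would repeat an already-generated trigram. A simplified sketch of the rule, applied to each candidate token during beam search (not the exact decoding code):

```python
def repeats_trigram(prev_tokens, candidate):
    """Return True if appending `candidate` duplicates a trigram already in the hypothesis."""
    if len(prev_tokens) < 2:
        return False
    new_trigram = (prev_tokens[-2], prev_tokens[-1], candidate)
    seen = {tuple(prev_tokens[i:i + 3]) for i in range(len(prev_tokens) - 2)}
    return new_trigram in seen
```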
Evaluation We evaluate our system with common summarization metrics: ROUGE-1 (R-1), which measures unigram overlap between the generated and reference summaries, ROUGE-2 (R-2), a similar measure over bigrams, and ROUGE-L (R-L), which is based on the longest common subsequence between the two [11]. We also report summary length in terms of the number of tokens produced. For each dataset, we selected the model with the highest ROUGE-1 score on a 500-sample subset of the validation set for evaluation on the test set.
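The paper does not name a specific ROUGE implementation; as one example, the F1 variants reported in the tables can be computed with the rouge-score package (the sentences below are made-up placeholders):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="police say the suspect was arrested on tuesday",      # reference summary
    prediction="the suspect was arrested tuesday , police said")  # model output
print(scores["rougeL"].fmeasure)  # F1 variant, as reported in Table 2
```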
1 https://pytorch.org/
2 https://github.com/huggingface/pytorch-openai-transformer-lm
3 https://github.com/Andrew03/transformer-abstractive-summarization
5 Experiments
5.1 CNN/DailyMail
Baselines We report the results from various models previously trained and evaluated on the CNN/Daily Mail dataset. The PGen and PGen+Coverage models [22] consist of attentive RNN encoder-decoders that integrate the ability to directly copy from the article when generating tokens.
Pasunuru and Bansal [17] extend this work by adding policy gradient training with a mixture of
rewards that promote saliency and entailment. Bottom-up summarization and the Copy Transformer
[6] also extend See et al. [22] by using the copy mechanism to compress the article to only relevant
content before summarizing it. Chen and Bansal [2] also look at performing content selection, but
extract full sentences from the document with a novel extractor model. Finally, the DCA model
[1] uses multiple separate communicating encoders over different parts of the document to produce
representations that are more focused on salient details.
Table 2: ROUGE F1 results on the test set of CNN/Daily Mail. Best model results are bolded.
Model R-1 R-2 R-L Length (L)
PGen [22] 36.44 15.66 33.42 53.69
PGen + Coverage [22] 39.53 17.28 36.38 59.75
RougeSal + Ent RL [17] 40.43 18.00 37.10 -
Bottom-Up Summ [6] 41.22 18.68 38.34 55.25
CopyTransformer [6] 40.96 18.38 38.16 -
rnn-ext + RL [2] 41.47 18.72 37.76 77.44
DCA [1] 41.67 19.47 37.92 51.01
Transformer-LM 38.67 17.47 35.79 43.40
Transformer-SM 37.96 17.36 35.12 42.42
Automatic Metrics We report our results using automatic metrics in Table 2. On this dataset, our
main model, Transformer-SM, performs slightly worse than other state of the art models. We note
that our model tends to generate shorter summaries than the gold summaries (∼ 20% shorter), which
could lower ROUGE recall performance.
In Figure 2, we investigate the correlation of ROUGE-L scores with summary length, and note that a
minimum decoding length used by state-of-the-art algorithms places baseline generated summaries in
length bins of higher average ROUGE-L performance. When Transformer-SM produces summaries
in these same length bins (i.e., more than 30 tokens), its performance is only consistently beaten by
the DCA model, which was fine-tuned with RL.
Figure 2: Average ROUGE-L for summaries in different length bins. Scatter plots correspond to ROUGE-L scores for each bin, while solid lines correspond to the number of summaries in each bin.
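The per-bin analysis behind Figure 2 amounts to grouping (summary length, ROUGE-L) pairs by length and averaging; the sketch below illustrates this and is not the authors' plotting code.

```python
from collections import defaultdict

def rouge_l_by_length_bin(results, bin_width=10):
    """`results` is a list of (summary_length_in_tokens, rouge_l_score) pairs.
    Returns {bin_start: (mean ROUGE-L, number of summaries in the bin)}."""
    bins = defaultdict(list)
    for length, rouge_l in results:
        bins[(length // bin_width) * bin_width].append(rouge_l)
    return {start: (sum(v) / len(v), len(v)) for start, v in sorted(bins.items())}
```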
Table 3: Head-to-head comparison between test set outputs of (left) DCA [1] vs. Transformer-SM and (right) PGen+Cov [22] vs. Transformer-SM. Analyses are done on summaries for CNN/DailyMail.
Human Evaluation While ROUGE scores are negatively influenced by the shorter average length
of the summaries produced by our model, it is not clear that shorter summaries are correlated with
worse quality. To evaluate this hypothesis, we perform a human evaluation on 125 (article, summary)
pairs randomly sampled from the test set. The article and model-generated summaries were presented
to three workers from Amazon Mechanical Turk (AMT).
Each worker was presented two model-generated summaries, one produced by the Transformer-
SM model, and one from the DCA model [1] or the PGen+Cov model [22]. Workers were asked
to select the better summary for four different quality metrics from Celikyilmaz et al. [1]: non-
redundancy (fewer of the same ideas are repeated), coherence (ideas are expressed clearly), focus
(the main ideas of the document are shared while avoiding superfluous details), and overall.
The results are presented in Table 3. Interestingly, the summaries from Transformer-SM are consistently preferred by humans across all four evaluation dimensions compared to those from the DCA and
PGen+Coverage models, indicating that the Transformer-SM’s lower ROUGE scores observed in
Table 2 are not necessarily correlated with human judgments of quality.
Table 4: ROUGE-L precision (R-L P), recall (R-L R), and F1 (R-L F1) scores computed between generated summaries and input CNN/DailyMail articles after removing stop words.
Table 5: Ablation study of training schedules on CNN/DailyMail. (PT) Model initialized with pretrained weights; (DAT) model uses domain-adaptive training; (ETT) model trained on the end task.
Efficiency Due to the large improvements over the baseline models in the human evaluation cate-
gories of non-redundancy and focus, and the generally shorter summaries produced by Transformer-
SM, we investigate whether Transformer-SM is able to more efficiently express key ideas of the
document. To evaluate the efficiency of each model, we remove non-content words from the model-
generated summaries and articles, and compute the ROUGE score between them. This measure
serves as a proxy for the rate at which ideas expressed in the summary can be found in the document.
We report these results in Table 4 and observe that Transformer-SM achieves ROUGE-L recall scores comparable to the baselines when evaluated with respect to the article, despite producing summaries that are, on average, 27% shorter. Meanwhile, ROUGE-L precision is also very similar to that of the baseline models, indicating that the summaries of all models reflect a similar degree of information relevance.4 Combined with the results from Table 3, we conjecture that Transformer-SM is able to more efficiently express key ideas from the document. While other models may produce longer summaries that yield higher ROUGE performance (Table 2), the additional tokens may reflect redundant and unsalient information, which human evaluators penalize.
4 The high precision scores across all models confirm that, despite being conceived as abstractive generators, these models display highly extractive behavior.
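The content-word ROUGE measure in Table 4 can be sketched as follows, using NLTK's stop-word list and the rouge-score package as stand-ins, since the paper does not specify its exact tooling.

```python
# pip install nltk rouge-score   (and run nltk.download("stopwords") once)
from nltk.corpus import stopwords
from rouge_score import rouge_scorer

STOP = set(stopwords.words("english"))
scorer = rouge_scorer.RougeScorer(["rougeL"])

def content_word_rouge_l(article_text, summary_text):
    """ROUGE-L between summary and article after dropping stop words.
    Recall ~ how much article content the summary covers; precision ~ how much
    of the summary's content appears in the article."""
    strip = lambda text: " ".join(w for w in text.split() if w.lower() not in STOP)
    return scorer.score(target=strip(article_text), prediction=strip(summary_text))["rougeL"]
```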
Analysis of domain-adaptive training and source embeddings Our approach involved two strate-
gies for efficiently using transformer language models for abstractive summarization: domain-adaptive
training and source embeddings. To assess their individual impact, we evaluate multiple training
schedule permutations (e.g., various combinations of using pretrained representations from the GPT
model and using domain-adaptive training), as well as the impact of source embeddings. Our results
in Table 5 yield multiple interesting conclusions. First, in general, domain-adaptive training (+DAT in
Table 5) provides a clear improvement over training directly on the end task, irrespective of whether
pretrained representations are used. Similarly, using source embeddings (T-SM in Table 5) provides a consistent improvement over the T-LM ablation. Surprisingly, when pretrained initialization, DAT, and source embeddings are used in tandem, performance drops slightly compared to not using DAT or not using source embeddings. We note, however, that this observation does not hold for the XSum dataset (§5.2), and conjecture that the extractive nature of the CNN/DailyMail dataset may make the effects of these approaches redundant in this setting.
5.2 XSum
A study on the quality of abstractive summaries is best performed on the XSum dataset [16], which is
specifically designed with gold summaries that are less extractive than the other datasets (Table 1).
Results We report our results in Table 6. Our models significantly outperform the comparison baselines across all three variants of the ROUGE metric. Interestingly, the Transformer-SM achieves noticeable improvement over the Transformer-LM model, suggesting that both source embeddings and domain-adaptive training are helpful when target summaries are more abstractive. Examples of model-generated summaries from the XSum dataset illustrate the improvement over baselines qualitatively in Table 7. In support of results presented earlier, the model produces abstractive summaries that provide focused information about the main points of the articles.

Table 6: Comparison results on the XSum test set using the F1 variants of ROUGE
Model R-1 R-2 R-L
AttnS2S [16] 28.42 8.77 22.48
PGen [16] 29.70 9.21 23.24
PGen+Cov [16] 28.10 8.02 21.72
T-ConvS2S [16] 31.89 11.54 25.75
MMN [9] 32.00 12.10 26.00
Transformer-LM 36.03 14.57 29.20
Transformer-SM 36.73 14.93 29.66
5.3 Newsroom
Finally, we report the performance of our model on the Newsroom dataset [7], the largest of the
evaluation datasets. Due to the large cost of training, only the Transformer-SM model was evaluated.
Baselines As baselines, we report the performance of the models released by the authors of the Newsroom dataset [7]. These models include an attentive encoder-decoder (Attn-S2S) and a pointer-generator network (PGen). We also compare against C10110 [23], a complex encoder-decoder that uses LSTMs, encoder attention, intra-decoder attention, and pointer generation to produce summaries, as well as the Multi-Level Memory Network (MMN) [9] mentioned earlier, whose authors evaluated only on the abstractive subset of the Newsroom dataset.
Table 7: XSum samples for the baseline T-ConvS2S model and Transformer-SM along with the gold
summary. Articles are shortened for brevity. Capitalization was manually added for ease of reading.
Table 9: ROUGE F1 results on validation subsets and full validation set for Newsroom
Model | Extractive (R-1 / R-2 / R-L) | Mixed (R-1 / R-2 / R-L) | Abstractive (R-1 / R-2 / R-L) | Newsroom-D (R-1 / R-2 / R-L)
PGen [7] | 39.11 / 27.95 / 36.17 | 25.48 / 11.04 / 21.06 | 14.66 / 2.26 / 11.44 | 26.27 / 13.55 / 22.72
MMN [9] | - | - | 17.5 / 4.7 / 14.2 | -
Transformer-SM | 64.69 / 59.72 / 63.56 | 35.75 / 18.83 / 30.63 | 22.44 / 7.39 / 18.75 | 41.42 / 29.51 / 38.26
6 Related Work
Abstractive Summarization There has been a large variety of work exploring different methods
for neural abstractive document summarization. Attention mechanisms have been shown to improve a variety of models [14; 25; 3], and are one of the motivating factors for this work. Pointer-generator networks, introduced in See et al. [22], have been shown to increase summary veracity and inspired the tangential use of copy mechanisms in Transformers for document summarization by Gehrmann et al. [6]. Other works have also explored the use of reinforcement learning to directly optimize
summarization models on the ROUGE metric [17; 18; 2].
7 Conclusion
In this work, we introduce two approaches for effectively adapting pretrained language model repre-
sentations to abstractive summarization: domain-adaptive training, and source embeddings. We eval-
uate the effect of both approaches across three abstractive summarization testbeds: CNN/DailyMail,
XSum, and Newsroom, and achieve state of the art ROUGE-L results on two of them, while showing
superior human evaluation performance on the third. In the process, we show that the ROUGE-L metric often used for abstractive summarization evaluation is quite sensitive to summary length, making it exploitable by approaches that use heuristics to control summary length.
References
[1] Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating
agents for abstractive summarization. In NAACL.
[2] Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected
sentence rewriting. In ACL.
[3] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and
words. arXiv preprint arXiv:1603.07252.
[4] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training
of deep bidirectional transformers for language understanding. In NAACL.
[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive
summarization. In EMNLP.
[7] Max Grusky, Mor Naaman, and Yoav Artzi. 2019. Newsroom: A dataset of 1.3 million
summaries with diverse extractive strategies. In NAACL.
[8] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances
in Neural Information Processing Systems, pages 1693–1701.
[9] Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2018. Abstractive summarization of
reddit posts with multi-level memory networks. arXiv preprint arXiv:1811.00783.
[10] Chin-Yew Lin. 2004. Looking for a few good metrics: Automatic summarization evaluation-how
many samples are enough? In NTCIR.
[11] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summa-
rization Branches Out.
[12] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and
Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In ICLR.
[13] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in
translation: Contextualized word vectors. In NIPS.
[14] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural
network based sequence model for extractive summarization of documents. In Thirty-First
AAAI Conference on Artificial Intelligence.
[15] Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang.
2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In CoNLL.
[16] Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the
summary! topic-aware convolutional neural networks for extreme summarization. In EMNLP.
[17] Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with
saliency and entailment. In ACL.
[18] Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for
abstractive summarization. In ICLR.
[19] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
[20] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[21] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask learners.
[22] Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization
with pointer-generator networks. In ACL.
[23] Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K Reddy. 2018. Neural abstrac-
tive text summarization with sequence-to-sequence models. arXiv preprint arXiv:1812.02303.
[24] Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers
without target length? pitfalls, solutions and re-examination of the neural summarization
literature. In Proceedings of NAACL 2019 Workshop on Optimizing and Evaluating Neural
Language Generation (NeuralGen).
[25] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with
a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1171–
1181.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 5998–6008.
[27] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
2018. Glue: A multi-task benchmark and analysis platform for natural language understanding.
arXiv preprint 1804.07461.
[28] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual
explanations by watching movies and reading books. 2015 IEEE International Conference on
Computer Vision (ICCV), pages 19–27.
A Reproducibility
We provide additional details relevant to the experimental environment here.
Data Sources The CNN/Daily Mail dataset [8] consists of articles from CNN and Daily Mail.5 Each
article is associated with several descriptive bullet point highlights. Similar to previous work [15], we
concatenate the highlights to create a target summary for each article in the dataset. The Newsroom
dataset [7] consists of ∼1.2M article summary pairs scraped from the Internet Archive.6 The articles
come from a set of 38 publishers and cover diverse topics. Finally, the Extreme Summarization
(XSum) dataset [16] consists of ∼230k article summary pairs taken from the BBC. 7 Each summary
is a single sentence long and is professionally written (usually by the author). For all datasets, we
use the splits defined in the original works that proposed them. Because the datasets are too large to
provide as supplementary material, we provide pointers in the source code README for acquiring
them.
Hyperparameters Details about important hyperparameters can be found in Section 4 of the paper.
Additional training hyperparameters can be found as the default parameters in the training script of the
source code.8 Most hyperparameter values selected were the same ones suggested by previous work
on transformer language models [20]. The only hyperparameter we varied that is not measured as an
ablation (i.e., training schedules and whether to include source embeddings) was the initialization
of source embeddings (if they were included). For this hyperparameter, we explored three different
initializations: 1) initializing both source embeddings with zero vectors, 2) initializing both source
embeddings with values sampled from the standard normal distribution, and 3) initializing both source embeddings with values sampled from a normal distribution with mean 0 and standard deviation equal to half of the average norm of the pretrained word embeddings from the GPT language model. We use this last initialization in all reported experiments.
Experimental Process Each experiment was run as follows for any given model and dataset. First,
we trained the model as described in the paper. After every 1000 minibatches, we computed ROUGE on a random but persistent 500-example subset of the validation set. When the model's ROUGE-1 score stopped rising, we used the previous checkpoint to generate summaries for all articles in the test set. We used beam search with a beam width of 3 to decode summaries. We ran exactly one evaluation run for each result we include in our paper.
5 https://www.cnn.com; https://www.dailymail.co.uk
6 https://archive.org/
7 https://www.bbc.com/
8 https://github.com/Andrew03/transformer-abstractive-summarization