Abstractive Summarization
1 Introduction
Recent work in large-scale language models [19; 20; 5] has allowed pretrained contextual repre-
sentations to be easily adapted for a variety of downstream tasks, yielding improvements on many
benchmarks evaluating natural language understanding [27]. Less explored, however, has been the
effect of these pretrained representations on text production tasks, such as abstractive summarization,
where state of the art performance is still achieved with sequence-to-sequence (seq2seq) models [1; 6].
These sequence-to-sequence methods typically use an encoder and decoder model with separate
parameters to represent the input article and produce the output summary, and the most successful
solutions [1; 6; 22] use attention mechanisms that learn an alignment between encoder and decoder
states. Pretrained language models, however, do not learn the parameters for such a task-specific
alignment, making it challenging to integrate their learned representations into a summarization
architecture at a higher level of abstraction than the word embedding.
In this work, we adapt full transformer language models for abstractive summarization. Building on
the work of Liu et al. [12], who first proposed concatenating the input and output text into a single
sequence and encoding both with a common transformer, we use a language model as a summarizer (rather
than an encoder-decoder). With this approach, representations from a pretrained transformer language
model (in this case, GPT [20]) can be used to fully initialize the parameters of the summarization
model, allowing it to leverage the representational power of a model trained at much larger scale.
To accomplish this effectively, we outline two strategies for adapting pretrained representations for
abstractive summarization. In the first, we augment the input representation of the summarization
model by instantiating source embeddings that encode the token type of the text being read. This
change allows the model to recognize whether a given token belongs to the input article or the output
summary, thereby learning how to distinguish both types of text when encoding. In the second, we
Figure 1: Input representation of the summarization model. Each token of the concatenated sequence (article tokens $x^a_1, \ldots, x^a_M$, a delimiter token, summary tokens $x^s_1, \ldots, x^s_N$, and an end token) is represented by the sum of its word embedding, its position embedding ($p_1, \ldots, p_M$ for the article, $p_0$ for the delimiter, $p_1, \ldots, p_N$ for the summary, and $p_{N+1}$ for the end token), and a source embedding ($d^a$ for article tokens, $d^s$ for summary tokens).
introduce a domain-adaptive training procedure that fine-tunes the transformer toward understanding
general newswire text before training on the summarization end task directly, allowing the model
to learn the general structure and language distribution of newswire text before being fine-tuned to
produce summaries.
A comprehensive empirical study across three datasets, CNN/DailyMail [8], XSum [16], and News-
room [7], shows that transformer language models can be used to train abstractive summarizers,
producing summaries that are more concise and focused than state of the art baselines. Our investiga-
tion also empirically validates several observations about the abstractive summarization task. First,
echoing the results of Sun et al. [24], the most common summarization evaluation metric, ROUGE
[10], is highly sensitive to summary length, providing an advantage to methods that produce longer
summaries, either through learning or with minimum summary length constraints. Second, achieving
higher ROUGE scores is not strongly consistent with human assessments of abstractive summary
quality. Finally, despite being conceived as abstractive summarizers, most current state of the art
models are highly extractive, copying phrases and even sentences verbatim from the document.
2 Model
In this paper, we focus on a variant of the Transformer [26] that has been pretrained on a large corpus
of natural language stories: the GPT model [20]. As our architecture is practically identical to the
one proposed in Radford et al. [20], we point readers to that work for background on the architecture
of the model, and focus below on the enhancements to the input representation made in our approach.
Word Embedding First, each token $x_t$ in the concatenated sequence $X$ indexes a word embedding $e[x_t] \in \mathbb{R}^h$ from a joint vocabulary for the article and summary (and special tokens).
Position Embedding Second, since the transformer (a self-attention model) has no inherent notion of token order, a position embedding $p_t \in \mathbb{R}^h$ is initialized for each absolute position in the sequence [26]. The embedding for each position is added to the word embedding of the token occupying that position, augmenting the final representation of the input. For example, each token in the article is represented as $w^a_m = e[x^a_m] + p_m$. Once the delimiter token <D> is reached, the position counter is reset: the first token of the article, $x^a_1$, and the first token of the summary, $x^s_1$, both receive $p_1$ as a positional embedding to augment their representations.
Source Embedding Finally, because the transformer must recognize pragmatic differences between the text of the article it reads and the text of the summary it learns to produce, an additional source-specific embedding $d \in \mathbb{R}^h$ is initialized. The source embedding encodes whether a token is from the article portion ($d^a$) of the concatenated input or the summary portion ($d^s$). For any article token (Eq. 2) or summary token (Eq. 3), the final encoding is:

$w^a_m = e[x^a_m] + p_m + d^a \quad \forall m \in [1, M] \qquad (2)$
$w^s_n = e[x^s_n] + p_n + d^s \quad \forall n \in [1, N] \qquad (3)$
In contrast to the other embeddings in the model, the source embeddings are not pretrained, so they could dominate the pretrained word and position embeddings when summed (Eq. 2, 3). To avoid this, we normalize the random initialization of the source embeddings to have a norm equal to half of the average norm of the word embeddings.
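To make this input representation concrete, the following PyTorch-style sketch combines the three embeddings and applies the source-embedding rescaling described above. Module, argument, and variable names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SummarizationInputEmbedding(nn.Module):
    """Sketch of the input representation: word + position + source embeddings."""

    def __init__(self, vocab_size, max_positions, hidden_size, pretrained_word_emb=None):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_emb = nn.Embedding(max_positions, hidden_size)
        # Two source embeddings: index 0 = article (d^a), index 1 = summary (d^s).
        self.src_emb = nn.Embedding(2, hidden_size)
        if pretrained_word_emb is not None:
            self.word_emb.weight.data.copy_(pretrained_word_emb)
            # Rescale the randomly initialized source embeddings so that their norm
            # equals half of the average norm of the pretrained word embeddings.
            target_norm = 0.5 * pretrained_word_emb.norm(dim=-1).mean()
            with torch.no_grad():
                self.src_emb.weight.mul_(
                    target_norm / self.src_emb.weight.norm(dim=-1, keepdim=True))

    def forward(self, token_ids, position_ids, source_ids):
        # position_ids reset after the delimiter token <D>; source_ids are
        # 0 for article tokens and 1 for summary tokens.
        return self.word_emb(token_ids) + self.pos_emb(position_ids) + self.src_emb(source_ids)
```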
3 Training
The model is initialized with pretrained parameters from the GPT model [20] that was trained on
the BooksCorpus [28]. Following this initialization, we pursue two additional training procedures:
domain-adaptive training and end task training.
Despite the benefit of using pretrained representations from the GPT model to initialize a summarizer,
there is a language shift between the storybooks data on which the GPT model was trained and
the type of language found in newswire summarization datasets [8; 16; 7]. Additionally, there are
structural differences between how articles are written (usually expressing salient points early on,
followed by details later) and how stories unfold (less front-loading of key information).
To address this discrepancy, we propose domain-adaptive training (DAT) to adapt the transformer
summarization model to the language distribution of newswire text by maximizing the conditional
loglikelihood of the article tokens and summary tokens given all previous tokens in their concatenated
input representation (see Figure 1):
$\mathcal{L}_{dat} = -\sum_{m=1}^{M} \log P(x^a_m \mid \{x^a\}_{<m}) - \sum_{n=1}^{N} \log P(x^s_n \mid \{x^s\}_{<n}, \{x^a\}_1^M) \qquad (4)$
where $M$ is the length of the article, $N$ is the length of the summary, $\{x^a\}_{<m}$ is the set of all article tokens that precede $x^a_m$, $\{x^s\}_{<n}$ is the set of all summary tokens that precede $x^s_n$, and $\{x^a\}_1^M$ is the set of all article tokens. In this framework, the model is adapted to produce newswire-like language before being trained on the summarization end task, which focuses only on learning to produce summaries.
During end task training (ETT), the model is trained specifically to produce a summary given a document, constraining the loss function to maximize the conditional loglikelihood of producing only the correct summary tokens given the set of article tokens $\{x^a\}_1^M$:

$\mathcal{L}_{ett} = -\sum_{n=1}^{N} \log P(x^s_n \mid \{x^s\}_{<n}, \{x^a\}_1^M) \qquad (5)$

where $\{x^s\}_{<n}$ is the set of summary tokens that precede $x^s_n$.
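The two objectives differ only in which positions of the concatenated sequence contribute to the loss. The sketch below computes them as a masked token-level negative log-likelihood; the helper name and tensor layout are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, targets, loss_mask):
    """Summed negative log-likelihood over the positions selected by loss_mask.

    logits:    [batch, seq_len, vocab] next-token predictions from the transformer
    targets:   [batch, seq_len] gold token ids (inputs shifted left by one), dtype long
    loss_mask: [batch, seq_len] 1.0 where the loss applies, 0.0 elsewhere
    """
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="none").view_as(loss_mask)
    return (nll * loss_mask).sum()

# Domain-adaptive training (Eq. 4): loss over both article and summary positions.
#   L_dat = lm_loss(logits, targets, article_mask + summary_mask)
# End task training (Eq. 5): loss over summary positions only.
#   L_ett = lm_loss(logits, targets, summary_mask)
```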
Table 1: Comparison of summarization datasets with respect to dataset size, proportion of unique
n-grams, mean article length in words, and mean summary length in words.
Dataset | Split Size (Train / Validation / Test) | % Novel n-grams in Gold Summary (unigrams / bigrams / trigrams / 4-grams) | Mean # Words (Article / Summary)
Newsroom | 993,746 / 108,590 / 108,636 | 17.40 / 44.05 / 55.38 / 61.21 | 658.6 / 26.7
XSum | 204,045 / 11,332 / 11,334 | 34.88 / 78.78 / 92.03 / 96.80 | 431.1 / 23.3
CNN/DailyMail | 287,227 / 13,368 / 11,490 | 12.70 / 46.29 / 65.04 / 75.56 | 685.2 / 52.0
4 Experimental Setup
Datasets The CNN/Daily Mail dataset [8] consists of articles from CNN and Daily Mail. Each
article is associated with several descriptive bullet point highlights. Similar to previous work [15],
we concatenate the highlights to create a target summary for each article in the dataset and use the
same dataset splits. The Extreme Summarization (XSum) dataset [16] consists of ∼230k article
summary pairs taken from the BBC. Each summary is a single sentence long and is professionally
written (usually by the author), making the dataset exhibit more abstractive content than typical
summarization datasets, such as CNN/DailyMail [8]. The Newsroom dataset [7] consists of ∼1.2M
article summary pairs scraped from the Internet Archive. The articles come from a set of 38 publishers
and cover diverse topics. We provide statistics about each dataset in Table 1.
Data Preprocessing We use byte-pair encoding (BPE) for tokenization. For each summarization dataset, we tokenize each article and summary with the BPE, truncate articles to a maximum length of 512 tokens and summaries to a maximum length of 110 tokens, and then format each article-summary pair as outlined in Figure 1.
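As a concrete illustration of this format, the sketch below builds the token, position, and source indices for a single article-summary pair, following the layout in Figure 1 (positions reset after the delimiter; source indices distinguish article from summary). Function and special-token names are hypothetical, and grouping the delimiter and end tokens with the summary follows Figure 1 rather than the released code.

```python
def build_example(article_ids, summary_ids, delim_id, end_id,
                  max_article_len=512, max_summary_len=110):
    """Format one BPE-tokenized article/summary pair as in Figure 1 (illustrative)."""
    article_ids = article_ids[:max_article_len]
    summary_ids = summary_ids[:max_summary_len]
    token_ids = article_ids + [delim_id] + summary_ids + [end_id]
    # Positions: 1..M for the article, 0 for <D>, 1..N for the summary, N+1 for <E>.
    position_ids = (list(range(1, len(article_ids) + 1)) + [0]
                    + list(range(1, len(summary_ids) + 1)) + [len(summary_ids) + 1])
    # Source indices: 0 (article, d^a) vs. 1 (summary, d^s); the delimiter and end
    # tokens are grouped with the summary here, an assumption based on Figure 1.
    source_ids = [0] * len(article_ids) + [1] * (len(summary_ids) + 2)
    return token_ids, position_ids, source_ids
```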
Training Details All models were trained with a learning rate of 6.25 × 10−5 and a minibatch
size of 64. When domain-adaptive training (DAT) is used, we train for 10 epochs using DAT and
then for an additional 10 epochs using end task training (ETT). Without DAT, we train on the end
task for 20 epochs. Unless specified otherwise, the final model trained for each dataset uses both
domain-adaptive training and end task training. We did not tune hyperparameters. All models were
trained using the PyTorch package1 and the HuggingFace implementation of GPT.2 We trained each
model on 8 Tesla V100-SXM2 GPUs. Training for a total of 20 epochs took approximately 1 day of wall-clock time for the XSum and CNN/Daily Mail datasets, and 3 days for the Newsroom dataset. Our source
code is publicly available.3
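For clarity, the overall training schedule can be sketched as below. It reuses the lm_loss helper from the loss sketch above; the optimizer choice, batch attributes, and function names are illustrative assumptions rather than the exact released configuration.

```python
import torch

def train(model, batches, use_dat=True, epochs_per_phase=10, lr=6.25e-5):
    """Illustrative schedule: 10 epochs of DAT then 10 of ETT, or 20 epochs of ETT."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    phases = ["dat", "ett"] if use_dat else ["ett", "ett"]
    for phase in phases:
        for _ in range(epochs_per_phase):
            for batch in batches:
                logits = model(batch.token_ids, batch.position_ids, batch.source_ids)
                mask = (batch.article_mask + batch.summary_mask) if phase == "dat" \
                    else batch.summary_mask
                loss = lm_loss(logits, batch.targets, mask)  # see the loss sketch above
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```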
Generation We perform generation using beam search with a beam size of 3, applying the trigram trick [18] during the search. Each summary token is generated by decoding from the distribution produced by the model when processing the concatenation of the article tokens, the delimiter token, and any previously generated summary tokens.
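The trigram trick blocks beam hypotheses that would repeat an already-generated trigram. A simplified sketch of the rule, applied to each candidate token during beam search (not the exact decoding code):

```python
def repeats_trigram(prev_tokens, candidate):
    """Return True if appending `candidate` duplicates a trigram already in the hypothesis."""
    if len(prev_tokens) < 2:
        return False
    new_trigram = (prev_tokens[-2], prev_tokens[-1], candidate)
    seen = {tuple(prev_tokens[i:i + 3]) for i in range(len(prev_tokens) - 2)}
    return new_trigram in seen
```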
Evaluation We evaluate our system with common summarization metrics: ROUGE-1 (R-1), which measures unigram overlap between the generated and reference summaries, ROUGE-2 (R-2), a similar measure over bigrams, and ROUGE-L (R-L), which is based on the longest common subsequence between the two [11]. We also report summary length in terms of the number of tokens produced. For each dataset, we selected the model with the highest ROUGE-1 score on a 500-sample subset of the validation set for evaluation on the test set.
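The paper does not name a specific ROUGE implementation; as one example, the F1 variants reported in the tables can be computed with the rouge-score package (the sentences below are made-up placeholders):

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="police say the suspect was arrested on tuesday",      # reference summary
    prediction="the suspect was arrested tuesday , police said")  # model output
print(scores["rougeL"].fmeasure)  # F1 variant, as reported in Table 2
```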
1 https://pytorch.org/
2 https://github.com/huggingface/pytorch-openai-transformer-lm
3 https://github.com/Andrew03/transformer-abstractive-summarization
5 Experiments
5.1 CNN/DailyMail
Baselines We report the results from various models previously trained and evaluated on the CNN/Daily Mail dataset. The PGen and PGen+Coverage models [22] consist of attentive RNN encoder-decoders that integrate the ability to directly copy from the article when generating tokens.
Pasunuru and Bansal [17] extend this work by adding policy gradient training with a mixture of
rewards that promote saliency and entailment. Bottom-up summarization and the Copy Transformer
[6] also extend See et al. [22] by using the copy mechanism to compress the article to only relevant
content before summarizing it. Chen and Bansal [2] also look at performing content selection, but
extract full sentences from the document with a novel extractor model. Finally, the DCA model
[1] uses multiple separate communicating encoders over different parts of the document to produce
representations that are more focused on salient details.
Table 2: ROUGE F1 results on the test set of CNN/Daily Mail. Best model results are bolded.
Model R-1 R-2 R-L Length (L)
PGen [22] 36.44 15.66 33.42 53.69
PGen + Coverage [22] 39.53 17.28 36.38 59.75
RougeSal + Ent RL [17] 40.43 18.00 37.10 -
Bottom-Up Summ [6] 41.22 18.68 38.34 55.25
CopyTransformer [6] 40.96 18.38 38.16 -
rnn-ext + RL [2] 41.47 18.72 37.76 77.44
DCA [1] 41.67 19.47 37.92 51.01
Transformer-LM 38.67 17.47 35.79 43.40
Transformer-SM 37.96 17.36 35.12 42.42
Automatic Metrics We report our results using automatic metrics in Table 2. On this dataset, our
main model, Transformer-SM, performs slightly worse than other state of the art models. We note
that our model tends to generate shorter summaries than the gold summaries (∼ 20% shorter), which
could lower ROUGE recall performance.
In Figure 2, we investigate the correlation of ROUGE-L scores with summary length, and note that a
minimum decoding length used by state-of-the-art algorithms places baseline generated summaries in
length bins of higher average ROUGE-L performance. When Transformer-SM produces summaries
in these same length bins (i.e., more than 30 tokens), its performance is only consistently beaten by
the DCA model, which was fine-tuned with RL.
Figure 2: Average ROUGE-L for summaries in different length bins. Scatter plots correspond to ROUGE-L scores for each bin, while solid lines correspond to the number of summaries in each bin.
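The per-bin analysis behind Figure 2 amounts to grouping (summary length, ROUGE-L) pairs by length and averaging; the sketch below illustrates this and is not the authors' plotting code.

```python
from collections import defaultdict

def rouge_l_by_length_bin(results, bin_width=10):
    """`results` is a list of (summary_length_in_tokens, rouge_l_score) pairs.
    Returns {bin_start: (mean ROUGE-L, number of summaries in the bin)}."""
    bins = defaultdict(list)
    for length, rouge_l in results:
        bins[(length // bin_width) * bin_width].append(rouge_l)
    return {start: (sum(v) / len(v), len(v)) for start, v in sorted(bins.items())}
```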
Table 3: Head-to-head comparison between test set outputs of (left) DCA [1] vs. Transformer-SM and (right) PGen+Cov [22] vs. Transformer-SM. Analyses are done on summaries for CNN/DailyMail.
Human Evaluation While ROUGE scores are negatively influenced by the shorter average length
of the summaries produced by our model, it is not clear that shorter summaries are correlated with
worse quality. To evaluate this hypothesis, we perform a human evaluation on 125 (article, summary)
pairs randomly sampled from the test set. The article and model-generated summaries were presented
to three workers from Amazon Mechanical Turk (AMT).
Each worker was presented two model-generated summaries, one produced by the Transformer-
SM model, and one from the DCA model [1] or the PGen+Cov model [22]. Workers were asked
to select the better summary for four different quality metrics from Celikyilmaz et al. [1]: non-
redundancy (fewer of the same ideas are repeated), coherence (ideas are expressed clearly), focus
(the main ideas of the document are shared while avoiding superfluous details), and overall.
The results are presented in Table 3. Interestingly, the summaries from Transformer-SM are consistently preferred by humans across all four evaluation dimensions compared to those from the DCA and
PGen+Coverage models, indicating that the Transformer-SM’s lower ROUGE scores observed in
Table 2 are not necessarily correlated with human judgments of quality.
Table 4: ROUGE-L precision (R-L P), recall (R-L R), and F1 (R-L F1) scores computed between generated summaries and input CNN/DailyMail articles after removing stop words.
Table 5: Ablation study of training schedules on CNN/DailyMail. (PT) Model initialized with pretrained weights; (DAT) model uses domain-adaptive training; (ETT) model trained on the end task.
Efficiency Due to the large improvements over the baseline models in the human evaluation cate-
gories of non-redundancy and focus, and the generally shorter summaries produced by Transformer-
SM, we investigate whether Transformer-SM is able to more efficiently express key ideas of the
document. To evaluate the efficiency of each model, we remove non-content words from the model-
generated summaries and articles, and compute the ROUGE score between them. This measure
serves as a proxy for the rate at which ideas expressed in the summary can be found in the document.
We report these results in Table 4 and observe that Transformer-SM achieves ROUGE-L recall scores comparable to the baselines when evaluated with respect to the article, despite producing summaries that are, on average, 27% shorter. Meanwhile, ROUGE-L precision is also very similar to that of the baseline models, indicating that the summaries of all models reflect a similar degree of information relevance.4 Combined with the results from Table 3, we conjecture that Transformer-SM is able to more efficiently express key ideas from the document. While other models may produce longer summaries that yield higher ROUGE performance (Table 2), the additional tokens may reflect redundant and unsalient information, which human evaluators penalize.
4 The high precision scores across all models confirm that, despite being conceived as abstractive generators, these models display highly extractive behavior.
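The content-word ROUGE measure in Table 4 can be sketched as follows, using NLTK's stop-word list and the rouge-score package as stand-ins, since the paper does not specify its exact tooling.

```python
# pip install nltk rouge-score   (and run nltk.download("stopwords") once)
from nltk.corpus import stopwords
from rouge_score import rouge_scorer

STOP = set(stopwords.words("english"))
scorer = rouge_scorer.RougeScorer(["rougeL"])

def content_word_rouge_l(article_text, summary_text):
    """ROUGE-L between summary and article after dropping stop words.
    Recall ~ how much article content the summary covers; precision ~ how much
    of the summary's content appears in the article."""
    strip = lambda text: " ".join(w for w in text.split() if w.lower() not in STOP)
    return scorer.score(target=strip(article_text), prediction=strip(summary_text))["rougeL"]
```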
Analysis of domain-adaptive training and source embeddings Our approach involved two strate-
gies for efficiently using transformer language models for abstractive summarization: domain-adaptive
training and source embeddings. To assess their individual impact, we evaluate multiple training
schedule permutations (e.g., various combinations of using pretrained representations from the GPT
model and using domain-adaptive training), as well as the impact of source embeddings. Our results
in Table 5 yield multiple interesting conclusions. First, in general, domain-adaptive training (+DAT in
Table 5) provides a clear improvement over training directly on the end task, irrespective of whether
pretrained representations are used. Similarly, using source embeddings (T-SM in Table 5) provides a consistent improvement over the T-LM ablation. Surprisingly, when pretrained initialization, DAT, and source embeddings are used in tandem, performance drops slightly compared to not using DAT or not using source embeddings. We note, however, that this observation does not hold for the XSum dataset (§5.2), and conjecture that the extractive nature of the CNN/DailyMail dataset may make the effects of these approaches redundant in this setting.
5.2 XSum
A study on the quality of abstractive summaries is best performed on the XSum dataset [16], which is
specifically designed with gold summaries that are less extractive than the other datasets (Table 1).
Results We report our results in Table 6. Our models significantly outperform the comparison baselines across all three variants of the ROUGE metric. Interestingly, the Transformer-SM achieves noticeable improvement over the Transformer-LM model, suggesting that both source embeddings and domain-adaptive training are helpful when target summaries are more abstractive. Examples of model-generated summaries from the XSum dataset illustrate the improvement over baselines qualitatively in Table 7. In support of results presented earlier, the model produces abstractive summaries that provide focused information about the main points of the articles.

Table 6: Comparison results on the XSum test set using the F1 variants of ROUGE
Model R-1 R-2 R-L
AttnS2S [16] 28.42 8.77 22.48
PGen [16] 29.70 9.21 23.24
PGen+Cov [16] 28.10 8.02 21.72
T-ConvS2S [16] 31.89 11.54 25.75
MMN [9] 32.00 12.10 26.00
Transformer-LM 36.03 14.57 29.20
Transformer-SM 36.73 14.93 29.66
5.3 Newsroom
Finally, we report the performance of our model on the Newsroom dataset [7], the largest of the
evaluation datasets. Due to the large cost of training, only the Transformer-SM model was evaluated.
Baselines As baselines, we report the performance of the models released by the authors of the Newsroom dataset [7]. These models include an attentive encoder-decoder (Attn-S2S) and a pointer-generator network (PGen). We also compare against C10110 [23], a complex encoder-decoder that uses LSTMs, encoder attention, intra-decoder attention, and pointer generation to produce summaries, as well as the Multi-Level Memory Network (MMN) [9] mentioned earlier, whose authors evaluated only on the abstractive subset of the Newsroom dataset.
Table 7: XSum samples for the baseline T-ConvS2S model and Transformer-SM along with the gold
summary. Articles are shortened for brevity. Capitalization was manually added for ease of reading.
Table 9: ROUGE F1 results on validation subsets and full validation set for Newsroom
Model | Extractive (R-1 / R-2 / R-L) | Mixed (R-1 / R-2 / R-L) | Abstractive (R-1 / R-2 / R-L) | Newsroom-D (R-1 / R-2 / R-L)
PGen [7] | 39.11 / 27.95 / 36.17 | 25.48 / 11.04 / 21.06 | 14.66 / 2.26 / 11.44 | 26.27 / 13.55 / 22.72
MMN [9] | - | - | 17.5 / 4.7 / 14.2 | -
Transformer-SM | 64.69 / 59.72 / 63.56 | 35.75 / 18.83 / 30.63 | 22.44 / 7.39 / 18.75 | 41.42 / 29.51 / 38.26
6 Related Work
Abstractive Summarization There has been a large variety of work exploring different methods
for neural abstractive document summarization. Attention mechanisms have been shown to improve a variety of models [14; 25; 3], and are one of the motivating factors for this work. Pointer-generator networks, introduced in See et al. [22], have been shown to increase summary veracity and inspired the tangential use of copy mechanisms in Transformers for document summarization by Gehrmann et al. [6]. Other works have also explored the use of reinforcement learning to directly optimize
summarization models on the ROUGE metric [17; 18; 2].
7 Conclusion
In this work, we introduce two approaches for effectively adapting pretrained language model repre-
sentations to abstractive summarization: domain-adaptive training, and source embeddings. We eval-
uate the effect of both approaches across three abstractive summarization testbeds: CNN/DailyMail,
XSum, and Newsroom, and achieve state of the art ROUGE-L results on two of them, while showing
superior human evaluation performance on the third. In the process, we show that the ROUGE-L metric often used for abstractive summarization evaluation is quite sensitive to summary length, making it exploitable by approaches that use heuristics to control summary length.
References
[1] Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating
agents for abstractive summarization. In NAACL.
[2] Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected
sentence rewriting. In ACL.
[3] Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and
words. arXiv preprint arXiv:1603.07252.
[4] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In NIPS.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training
of deep bidirectional transformers for language understanding. In NAACL.
[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive
summarization. In EMNLP.
[7] Max Grusky, Mor Naaman, and Yoav Artzi. 2019. Newsroom: A dataset of 1.3 million
summaries with diverse extractive strategies. In NAACL.
[8] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances
in Neural Information Processing Systems, pages 1693–1701.
[9] Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2018. Abstractive summarization of
reddit posts with multi-level memory networks. arXiv preprint arXiv:1811.00783.
[10] Chin-Yew Lin. 2004. Looking for a few good metrics: Automatic summarization evaluation-how
many samples are enough? In NTCIR.
[11] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summa-
rization Branches Out.
[12] Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and
Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences. In ICLR.
[13] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in
translation: Contextualized word vectors. In NIPS.
[14] Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural
network based sequence model for extractive summarization of documents. In Thirty-First
AAAI Conference on Artificial Intelligence.
[15] Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang.
2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In CoNLL.
[16] Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the
summary! topic-aware convolutional neural networks for extreme summarization. In EMNLP.
[17] Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with
saliency and entailment. In ACL.
[18] Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for
abstractive summarization. In ICLR.
[19] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
[20] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[21] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.
Language models are unsupervised multitask learners.
[22] Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization
with pointer-generator networks. In ACL.
[23] Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K Reddy. 2018. Neural abstrac-
tive text summarization with sequence-to-sequence models. arXiv preprint arXiv:1812.02303.
[24] Simeng Sun, Ori Shapira, Ido Dagan, and Ani Nenkova. 2019. How to compare summarizers
without target length? pitfalls, solutions and re-examination of the neural summarization
literature. In Proceedings of NAACL 2019 Workshop on Optimizing and Evaluating Neural
Language Generation (NeuralGen).
[25] Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with
a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1171–
1181.
[26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural
Information Processing Systems, pages 5998–6008.
[27] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman.
2018. Glue: A multi-task benchmark and analysis platform for natural language understanding.
arXiv preprint 1804.07461.
[28] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual
explanations by watching movies and reading books. 2015 IEEE International Conference on
Computer Vision (ICCV), pages 19–27.
A Reproducibility
We provide additional details relevant to the experimental environment here.
Data Sources The CNN/Daily Mail dataset [8] consists of articles from CNN and Daily Mail.5 Each
article is associated with several descriptive bullet point highlights. Similar to previous work [15], we
concatenate the highlights to create a target summary for each article in the dataset. The Newsroom
dataset [7] consists of ∼1.2M article summary pairs scraped from the Internet Archive.6 The articles
come from a set of 38 publishers and cover diverse topics. Finally, the Extreme Summarization
(XSum) dataset [16] consists of ∼230k article summary pairs taken from the BBC. 7 Each summary
is a single sentence long and is professionally written (usually by the author). For all datasets, we
use the splits defined in the original works that proposed them. Because the datasets are too large to
provide as supplementary material, we provide pointers in the source code README for acquiring
them.
Hyperparameters Details about important hyperparameters can be found in Section 4 of the paper.
Additional training hyperparameters can be found as the default parameters in the training script of the
source code.8 Most hyperparameter values selected were the same ones suggested by previous work
on transformer language models [20]. The only hyperparameter we varied that is not measured as an
ablation (i.e., training schedules and whether to include source embeddings) was the initialization
of source embeddings (if they were included). For this hyperparameter, we explored three different
initializations: 1) initializing both source embeddings with zero vectors, 2) initializing both source
embeddings with values sampled from the standard normal distribution, and 3) initializing both source embeddings with values sampled from a normal distribution with mean 0 and standard deviation equal to half of the average norm of the pretrained word embeddings from the GPT language model. We use this last initialization in all reported experiments.
Experimental Process Each experiment was run as follows for any given model and dataset. First,
we trained the model as described in the paper. After every 1000 minibatches, we computed ROUGE on a random but persistent 500-example subset of the validation set. When the model's ROUGE-1 score stopped rising, we used the previous checkpoint to generate summaries for all articles in the test set. We used beam search with a beam width of 3 to decode summaries. We ran exactly one evaluation run for each result we include in our paper.
5 https://www.cnn.com; https://www.dailymail.co.uk
6 https://archive.org/
7 https://www.bbc.com/
8 https://github.com/Andrew03/transformer-abstractive-summarization