Longformer: The Long-Document Transformer
Iz Beltagy, Matt Peters, Arman Cohan
Long documents:
- Many NLP tasks require processing full documents
- e.g. QA, summarization, document classification
- Semantic Scholar: 90th-percentile paper length ≈ 7,000 tokens
Transformers: SOTA
- LSTMs scale to long documents but don't work as well
Long documents + Transformers is hard
- Self-attention is expensive: O(n^2)
Prior work
- Avoid working with long documents
- e.g. MRQA: skip the question if the answer is not in the first 800 tokens
- Chunk, extract, combine
- task and dataset specific
- Loses global context
Prior work
Longformer - sparse attention matrix
Avoid computing the full attention matrix
Tokens attend to each other following an "attention pattern"
Large receptive field with stacked layers
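A minimal sketch of the sliding-window pattern (not the paper's implementation): a boolean mask that lets each token attend only to a fixed-size neighbourhood. The function name and window size are illustrative.

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j only if |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    # keep only the band of width `window` around the diagonal
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

# Scores outside the band are set to -inf before the softmax. (This sketch still
# materialises an n x n mask; the real implementation never builds the full matrix.)
mask = sliding_window_mask(seq_len=4096, window=512)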
Longformer - local and global attention
- Global attention is user-defined based on the task
- Local attention - good token representations
- Global attention - flexibility to learn tasks
- LM: local representations
- Classification: aggregate the sequence into CLS
- QA: compare context and question tokens
- Note: indirect information flow is not enough; direct attention is needed.
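Continuing the sketch above, global attention can be illustrated by letting a few user-chosen positions attend to, and be attended by, every token; again purely illustrative, not the library code.

def add_global_attention(local_mask: torch.Tensor, global_idx: list) -> torch.Tensor:
    """Make the tokens at `global_idx` attend everywhere and be attended by everyone."""
    mask = local_mask.clone()
    mask[global_idx, :] = True   # global tokens see the whole sequence
    mask[:, global_idx] = True   # every token sees the global tokens
    return mask

# e.g. classification: make the CLS token (position 0) global
full_mask = add_global_attention(sliding_window_mask(4096, 512), [0])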
Implementation
● The required banded matrix multiplication is not supported in existing DL libraries
● Implementation 1: sliding chunks
○ Native PyTorch (easy to deploy and use)
○ Limited to the non-dilated case
○ Splits the sequence into chunks of size W with an overlap of size W/2, matrix-multiplies each chunk, then masks out the extra elements
○ Computes 2x more values than needed
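A rough sketch of the chunking idea (not the exact Longformer code): overlapping chunks are multiplied densely, and the out-of-window entries are masked afterwards. Shapes and names are assumptions.

import torch

def sliding_chunks_qk(q: torch.Tensor, k: torch.Tensor, w: int) -> torch.Tensor:
    """q, k: (seq_len, head_dim); seq_len assumed compatible with stride w // 2.

    Returns per-chunk score blocks of shape (n_chunks, w, w); the out-of-window
    entries in each block are the ~2x extra values that get masked out later.
    """
    half = w // 2
    q_chunks = q.unfold(0, w, half).transpose(1, 2)   # (n_chunks, w, head_dim)
    k_chunks = k.unfold(0, w, half).transpose(1, 2)
    # one dense (and well-optimised) matmul per chunk
    return torch.einsum('cwd,cvd->cwv', q_chunks, k_chunks)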
Implementation
● The required banded matrix multiplication is not supported in existing DL libraries
● Implementation 2: custom CUDA kernel
○ Implemented in TVM
■ Compiles Python code into CUDA C++
■ Easier to use than writing CUDA from scratch
○ Supports dilation
○ Only computes the non-zero elements (memory efficient)
○ A bit harder to deploy and use
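For reference, this is a sketch of the banded product such a kernel computes, written as a naive Python loop (in the spirit of the "loop" baseline on the next slide): only the 2w+1 diagonals around each query are produced, so memory stays linear in sequence length. Names and the output layout are illustrative.

import torch

def banded_qk_loop(q: torch.Tensor, k: torch.Tensor, w: int) -> torch.Tensor:
    """q, k: (seq_len, head_dim). Output: (seq_len, 2*w + 1).

    Row i holds the dot products of query i with the keys at offsets -w..+w;
    nothing outside the band is ever computed or stored.
    """
    seq_len = q.size(0)
    scores = q.new_full((seq_len, 2 * w + 1), float('-inf'))
    for i in range(seq_len):
        lo, hi = max(0, i - w), min(seq_len, i + w + 1)
        scores[i, lo - i + w: hi - i + w] = k[lo:hi] @ q[i]
    return scores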
Performance
loop is a naive implementation using loops
loop and cuda are the most memory efficient
chunks uses 2x more memory than cuda (but still linear)
chunks is faster than cuda because it uses NVIDIA's optimized matrix multiplication
Use chunks for pretraining/finetuning, use cuda for character-level LM
Evaluation (1) - Character Level Language Modeling
Metric: bpc (lower is better)

Dataset   Ours    SOTA (small model)
enwik8    0.999   1.02
text8     1.103   1.11

Large improvement
- Matches the SOTA result with the large model
- Requires a sliding window with dilation on a few heads
- Largest training seqlen is 23k; evaluated at seqlen 32k
- fp16 and gradient checkpointing are important
Character Level Language Modeling - Details
- Many different ways to configure window sizes and dilation
- increasing window size from 512 to 8192 across layers
- dilation on layers 6-12 on two heads only
- seqlen 32k
- Training in 5 phases; each phase doubles the window sizes and seqlen and halves the LR (see the sketch after this list)
- starting window sizes: 32 to 512
- starting seqlen 2048
- Doubling window size has the most impact on bpc
- At first the loss increases a lot, then drops quickly (relevant for global attention later)
- Keep self-attention in fp32 because the model learns to use large attention scores
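A toy sketch of the staged schedule described above. Only the doubling/halving pattern, the 2048 starting seqlen, and the 32-to-512 starting windows come from the slides; the learning rate and the exact per-layer window list are placeholders.

start_seqlen, start_lr = 2048, 3e-5            # start_lr is hypothetical
start_windows = [32, 64, 128, 256, 512]        # assumed per-layer-group windows

for phase in range(5):
    seqlen = start_seqlen * 2 ** phase          # 2048 ... 32768
    lr = start_lr * 0.5 ** phase                # halved every phase
    windows = [w * 2 ** phase for w in start_windows]   # 32..512 -> 512..8192
    print(f"phase {phase + 1}: seqlen={seqlen}, lr={lr:.1e}, windows={windows}")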
Pretraining and finetuning
Goal: a BERT-like model for long-document NLP tasks
Procedure to convert RoBERTa into Longformer
● Increase the size of the position embedding matrix; initialize it by repeatedly copying RoBERTa's 512 pretrained position embeddings (sketched below)
● Continue MLM pretraining on a corpus of long documents
● Copy the q, k, v linear projections to get separate projections for global attention
This procedure can easily be applied to other pretrained models to produce long-document versions of them
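A minimal sketch of the first step using the HuggingFace RoBERTa classes. The target length of 4096 and the variable names are illustrative; the real conversion also deals with RoBERTa's position-id offset and performs the other steps above.

import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained('roberta-base')
old_pos = roberta.embeddings.position_embeddings.weight.data   # pretrained 512-position table
max_pos, dim = 4096, old_pos.size(1)                           # 4096 is the target length here

# Initialize the longer table by tiling the pretrained embeddings.
new_pos = old_pos.new_empty(max_pos, dim)
for start in range(0, max_pos, old_pos.size(0)):
    end = min(start + old_pos.size(0), max_pos)
    new_pos[start:end] = old_pos[: end - start]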
Pretraining - MLM results
Finetuning on downstream tasks
- HotpotQA -- Multihop reasoning and evidence extraction
- TriviaQA -- QA dataset from long Wikipedia articles
- WikiHop -- QA requiring combining facts spread across multiple paragraphs
- Coref -- Coreference chains across the document
- IMDB -- Document classification
- Hyperpartisan news -- Document classification
Finetuning on downstream tasks - Global attention
Classification: on CLS token
QA: on all question tokens
Coref: local attention only
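As an example of these task-specific choices, this is roughly how global attention is specified with the released HuggingFace checkpoint; the question/document strings are placeholders.

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

# QA-style input: question + long context as a text pair (placeholder strings)
inputs = tokenizer("Who proposed Longformer?", "A long document ...", return_tensors='pt')

global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1   # classification: global attention on the CLS/<s> token
# for QA, set 1 on every question-token position instead

outputs = model(**inputs, global_attention_mask=global_attention_mask)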
Finetuning on downstream tasks - Results
Finetuning on downstream tasks - large model
SOTA on WikiHop and TriviaQA
Competitive HotpotQA result
- Ours is simpler
- Better-performing models use a GNN over entities
- Can we simulate it with global attention?
Finetuning on downstream tasks - ablation
Conclusion
We should consider tasks beyond LM to develop and evaluate long models
Usage
model = transformers.AutoModel.from_pretrained('allenai/longformer-base-4096')
Global attention is important and task specific
Follow our procedure to convert existing pretrained models into long-document versions
Future work
LongformerEncoderDecoder
Better pretraining, other objective functions
Longer sequences
Better encoding of document structure
Global attention instead of GNN
More tasks: summarization, generation, IE
Multi-document