Longformer: Efficient Long-Doc NLP

Longformer introduces a novel attention mechanism that allows Transformers to process long documents efficiently. It uses a sparse attention matrix where each token attends to nearby tokens within a fixed window, while also allowing for attention to all tokens through global attention. Longformer achieves state-of-the-art results on character language modeling benchmarks and outperforms baselines on long document tasks after pretraining and finetuning. Future work includes developing Longformer models for generation and multi-document tasks.

Uploaded by

Karl Gemayel
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
115 views22 pages

Longformer: Efficient Long-Doc NLP

Longformer introduces a novel attention mechanism that allows Transformers to process long documents efficiently. It uses a sparse attention matrix where each token attends to nearby tokens within a fixed window, while also allowing for attention to all tokens through global attention. Longformer achieves state-of-the-art results on character language modeling benchmarks and outperforms baselines on long document tasks after pretraining and finetuning. Future work includes developing Longformer models for generation and multi-document tasks.

Uploaded by

Karl Gemayel
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Longformer: The Long-Document Transformer

Iz Beltagy, Matt Peters, Arman Cohan

1
Long documents:
- Many NLP tasks require processing full documents
- e.g. QA, summarization, document classification
- Semantic Scholar: 90th-percentile paper length = 7,000 tokens

Transformers: SOTA
- LSTMs scale to long documents but don't work as well

Long documents + Transformers is hard
- Self-attention is expensive: O(n^2)
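
As a rough back-of-the-envelope illustration (numbers are ours, not from the slides): for a 4,096-token document, full self-attention scores every token pair, while a fixed window of 512 keeps the cost linear in sequence length.

# Rough cost comparison (illustrative numbers, not from the slides)
n = 4096   # sequence length
w = 512    # sliding-window size used by Longformer-style attention

full_pairs = n * n        # O(n^2): ~16.8M score entries per head per layer
windowed_pairs = n * w    # O(n * w): ~2.1M entries, linear in n for fixed w

print(f"full attention:     {full_pairs:,} entries")
print(f"windowed attention: {windowed_pairs:,} entries")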
2
Prior work
- Avoid working with long documents
  - e.g. MRQA: skip the question if the answer is not in the first 800 tokens
- Chunk, extract, combine
  - Task- and dataset-specific
  - Loses global context

3
Prior work

4
Longformer - sparse attention matrix
- Avoid computing the full attention matrix
- Tokens attend to each other following an "attention pattern"
- Large receptive field with stacked layers
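
A minimal sketch of the sliding-window pattern (our PyTorch illustration, not the paper's implementation): each token may attend only to neighbours within a fixed window, and stacking layers grows the receptive field.

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where token i may attend to token j iff |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = sliding_window_mask(seq_len=16, window=2)
print(mask.int())  # a banded matrix: ones on and near the diagonal, zeros elsewhere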
5
Longformer - local and global attention
- Global attention is user-defined based on the task
- Local attention - good token representations
- Global attention - flexibility to learn tasks
  - LM: local representations only
  - Classification: aggregate the sequence into CLS
  - QA: compare context and question tokens
- Note: indirect information flow is not enough; direct attention is needed.

6
Implementation
● The required banded matrix multiplication is not supported in existing DL libraries
● Implementation 1: sliding chunks (sketched below)
  ○ Native PyTorch (easy to deploy and use)
  ○ Limited to the non-dilated case
  ○ Splits the sequence into chunks of size W with overlap W/2, matrix-multiplies each chunk, then masks out the extra elements
  ○ Computes 2x more values than needed
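
A much-simplified sketch of the sliding-chunks idea (ours, not the actual Longformer code): compute W x W score blocks over overlapping chunks instead of the full n x n matrix; roughly 2x the needed values are computed and later masked.

import torch

def chunked_band_scores(q, k, w):
    # q, k: (seq_len, head_dim). Split into overlapping chunks (stride w/2) and
    # do one w x w matmul per chunk -- O(n * w * d) instead of O(n^2 * d).
    # The real implementation also re-assembles the blocks into a banded layout
    # and masks the positions that fall outside each token's window.
    step = w // 2
    q_chunks = q.unfold(0, w, step)   # (num_chunks, head_dim, w)
    k_chunks = k.unfold(0, w, step)   # (num_chunks, head_dim, w)
    return torch.einsum("cdi,cdj->cij", q_chunks, k_chunks)  # (num_chunks, w, w)

q, k = torch.randn(1024, 64), torch.randn(1024, 64)
print(chunked_band_scores(q, k, w=256).shape)  # torch.Size([7, 256, 256])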

7
Implementation
● The required banded matrix multiplication is not supported in existing DL libraries
● Implementation 2: custom CUDA kernel
  ○ Implemented in TVM
    ■ Compiles Python code into CUDA C++
    ■ Easier to use than writing CUDA from scratch
  ○ Supports dilation
  ○ Only computes the non-zero elements (memory efficient)
  ○ A bit harder to deploy and use

8
Performance
- loop is a naive implementation using loops
- loop and cuda are the most memory efficient
- chunks uses 2x more memory than cuda (but is still linear)
- chunks is faster than cuda because it uses NVIDIA's optimized matrix multiplication
- Use chunks for pretraining/finetuning, use cuda for char-LM

9
10
Evaluation (1) - Character-Level Language Modeling

Small models, metric: bpc (lower is better)
- enwik8: Ours 0.999 vs. prior SOTA 1.02
- text8: Ours 1.103 vs. prior SOTA 1.11
Large improvement over the small-model SOTA

- Match the SOTA result with the large model
- Requires sliding window with dilation on a few heads
- Largest train seqlen is 23k, evaluate on seqlen 32k
- fp16 and gradient checkpointing are important

11
Character-Level Language Modeling - Details
- Many different ways to configure window sizes and dilation
  - increasing window size from 512 to 8192
  - dilation on layers 6-12, on two heads only
  - seqlen 32k
- Training in 5 phases: each phase doubles the window size and seqlen and halves the LR (sketched below)
  - starting window sizes: 32 to 512
  - starting seqlen: 2048
- Doubling the window size has the most impact on bpc
- At first, the loss increases a lot, then drops quickly (relevant for global attention later)
- Keep self-attention in fp32 because the model learns to use large attention scores
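
A sketch of the staged schedule above (values taken from the slide; the starting learning rate is a placeholder, since the slide only gives the 0.5x decay per phase):

start_windows = (32, 512)   # smallest / largest per-layer window size in phase 1
start_seqlen = 2048
start_lr = 3e-5             # placeholder; the slide does not give the starting LR

for phase in range(5):      # 5 phases: 2x window size and seqlen, 0.5x LR each phase
    w_min, w_max = (w * 2 ** phase for w in start_windows)
    seqlen = start_seqlen * 2 ** phase
    lr = start_lr * 0.5 ** phase
    print(f"phase {phase + 1}: windows {w_min}-{w_max}, seqlen {seqlen}, lr {lr:.2e}")
# final phase: windows 512-8192, seqlen 32768, matching the numbers above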
12
Pretraining and finetuning

Goal: a BERT-like model for long-doc NLP tasks

Procedure to convert RoBERTa into Longformer

● Increase the size of the position embedding matrix; initialize it by copying the first 512 embeddings
● Continue MLM pretraining on a corpus of long docs
● Copy the q, k, v linear projections to get separate projections for global attention

This procedure can easily be applied to other models to convert them into Long versions (a sketch of the position-embedding step follows).
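
A hedged sketch of the position-embedding step in plain PyTorch (ours; the released conversion script also handles details such as RoBERTa's 2-position offset):

import torch
import torch.nn as nn

def extend_position_embeddings(old_emb: nn.Embedding, new_max_pos: int) -> nn.Embedding:
    # Build a longer position-embedding table, initialized by tiling the
    # pretrained 512-position embeddings until the new table is full.
    old_max_pos, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_max_pos, dim)
    with torch.no_grad():
        for start in range(0, new_max_pos, old_max_pos):
            end = min(start + old_max_pos, new_max_pos)
            new_emb.weight[start:end] = old_emb.weight[: end - start]
    return new_emb

# e.g. extend a 512-position table to 4096 positions before continued MLM pretraining
longer = extend_position_embeddings(nn.Embedding(512, 768), new_max_pos=4096)
print(longer.weight.shape)  # torch.Size([4096, 768])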

13
Pretraining - MLM results

14
Finetuning on downstream tasks
- HotpotQA -- multi-hop reasoning and evidence extraction
- TriviaQA -- QA dataset from long Wikipedia articles
- WikiHop -- QA requiring combining facts spread across multiple paragraphs
- Coref -- coreference chains across the document
- IMDB -- document classification
- Hyperpartisan news -- document classification

15
Finetuning on downstream tasks - Global attention

- Classification: global attention on the CLS token
- QA: global attention on all question tokens
- Coref: local attention only
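
A usage sketch with the Hugging Face Longformer (the global_attention_mask argument is part of that API; the token positions chosen here are illustrative):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# QA-style input: question + context packed into one sequence.
inputs = tokenizer("Who are the authors?", "Longformer: The Long-Document Transformer ...",
                   return_tensors="pt")

# 1 = global attention, 0 = local sliding-window attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1   # classification: CLS token only
# QA: also put global attention on all question tokens (everything up to the first separator).
first_sep = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0, 0]
global_attention_mask[:, : first_sep + 1] = 1
# Coref: leave the mask all zeros (local attention only).

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)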

16
Finetuning on downstream tasks - Results

17
Finetuning on downstream tasks - large model

SOTA on WikiHop and TriviaQA

Competitive HotpotQA result
- Ours is simpler
- Better models use a GNN over entities
- Can we simulate it with global attention?

18
Finetuning on downstream tasks - ablation

19
Conclusion
We should consider tasks beyond LM to develop and evaluate long-document models

Usage:
model = transformers.AutoModel.from_pretrained('allenai/longformer-base-4096')

Global attention is important and task-specific

Follow our procedure to convert existing pretrained models into Long versions

20
Future work
- LongformerEncoderDecoder
- Better pretraining, other objective functions
- Longer sequences
- Better encoding of document structure
- Global attention instead of GNNs
- More tasks: summarization, generation, IE
- Multi-document tasks
21
22
