Longformer: The Long-Document Transformer
Iz Beltagy, Matt Peters, Arman Cohan
Long documents:
- Many NLP tasks require processing full documents
- e.g. QA, summarization, document classification
- Semantic Scholar: 90th-percentile paper length ≈ 7,000 tokens
Transformers: SOTA
- LSTMs scale to long documents but don't work as well
Long documents + Transformers is hard
- Self-attention is expensive: O(n^2)
Prior work
- Avoid working with long documents
- e.g. MRQA: skip the question if the answer is not in the first 800 tokens
- Chunk, extract, combine
- task and dataset specific
- Loses global context
Prior work
Longformer - sparse attention matrix
Avoid computing the full attention matrix
Tokens attend to each other following an "attention pattern"
Large receptive field with stacked layers
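A minimal sketch of the sliding-window pattern (not the paper's implementation): a boolean mask that lets each token attend only to a fixed-size neighbourhood. The function name and window size are illustrative.

import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j only if |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    # keep only the band of width `window` around the diagonal
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

# Scores outside the band are set to -inf before the softmax. (This sketch still
# materialises an n x n mask; the real implementation never builds the full matrix.)
mask = sliding_window_mask(seq_len=4096, window=512)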
Longformer - local and global attention
- Global attention is user-defined based on the task
- Local attention - good token representations
- Global attention - flexibility to learn tasks
- LM: local representations
- Classification: aggregate the sequence into CLS
- QA: compare context and question tokens
- Note: indirect information flow is not enough; direct attention is needed.
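Continuing the sketch above, global attention can be illustrated by letting a few user-chosen positions attend to, and be attended by, every token; again purely illustrative, not the library code.

def add_global_attention(local_mask: torch.Tensor, global_idx: list) -> torch.Tensor:
    """Make the tokens at `global_idx` attend everywhere and be attended by everyone."""
    mask = local_mask.clone()
    mask[global_idx, :] = True   # global tokens see the whole sequence
    mask[:, global_idx] = True   # every token sees the global tokens
    return mask

# e.g. classification: make the CLS token (position 0) global
full_mask = add_global_attention(sliding_window_mask(4096, 512), [0])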
Implementation
● The required banded matrix multiplication is not supported in existing DL libraries
● Implementation 1: sliding chunks
○ Native PyTorch (easy to deploy and use)
○ Limited to the non-dilated case
○ Splits the sequence into chunks of size W with an overlap of size W/2, matrix-multiplies each chunk, then masks out the extra elements
○ Computes 2x more values than needed
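A rough sketch of the chunking idea (not the exact Longformer code): overlapping chunks are multiplied densely, and the out-of-window entries are masked afterwards. Shapes and names are assumptions.

import torch

def sliding_chunks_qk(q: torch.Tensor, k: torch.Tensor, w: int) -> torch.Tensor:
    """q, k: (seq_len, head_dim); seq_len assumed compatible with stride w // 2.

    Returns per-chunk score blocks of shape (n_chunks, w, w); the out-of-window
    entries in each block are the ~2x extra values that get masked out later.
    """
    half = w // 2
    q_chunks = q.unfold(0, w, half).transpose(1, 2)   # (n_chunks, w, head_dim)
    k_chunks = k.unfold(0, w, half).transpose(1, 2)
    # one dense (and well-optimised) matmul per chunk
    return torch.einsum('cwd,cvd->cwv', q_chunks, k_chunks)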
Implementation
● The required banded matrix multiplication is not supported in existing DL libraries
● Implementation 2: custom CUDA kernel
○ Implemented in TVM
■ Compiles Python code into CUDA C++
■ Easier to use than writing CUDA from scratch
○ Supports dilation
○ Only computes the non-zero elements (memory efficient)
○ A bit harder to deploy and use
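For reference, this is a sketch of the banded product such a kernel computes, written as a naive Python loop (in the spirit of the "loop" baseline on the next slide): only the 2w+1 diagonals around each query are produced, so memory stays linear in sequence length. Names and the output layout are illustrative.

import torch

def banded_qk_loop(q: torch.Tensor, k: torch.Tensor, w: int) -> torch.Tensor:
    """q, k: (seq_len, head_dim). Output: (seq_len, 2*w + 1).

    Row i holds the dot products of query i with the keys at offsets -w..+w;
    nothing outside the band is ever computed or stored.
    """
    seq_len = q.size(0)
    scores = q.new_full((seq_len, 2 * w + 1), float('-inf'))
    for i in range(seq_len):
        lo, hi = max(0, i - w), min(seq_len, i + w + 1)
        scores[i, lo - i + w: hi - i + w] = k[lo:hi] @ q[i]
    return scores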
Performance
loop is a naive implementation using loops
loop and cuda are the most memory efficient
chunks uses 2x more memory than cuda (but still linear)
chunks is faster than cuda because it uses NVIDIA's optimized matrix multiplication
Use chunks for pretraining/finetuning, use cuda for character-level LM
Evaluation (1) - Character Level Language Modeling
Metric: bpc (lower is better)

Dataset   Ours    SOTA (small model)
enwik8    0.999   1.02
text8     1.103   1.11

Large improvement
- Matches the SOTA result with the large model
- Requires a sliding window with dilation on a few heads
- Largest training seqlen is 23k; evaluated at seqlen 32k
- fp16 and gradient checkpointing are important
Character Level Language Modeling - Details
- Many different ways to configure window sizes and dilation
- increasing window size from 512 to 8192 across layers
- dilation on layers 6-12 on two heads only
- seqlen 32k
- Training in 5 phases; each phase doubles the window sizes and seqlen and halves the LR (see the sketch after this list)
- starting window sizes: 32 to 512
- starting seqlen 2048
- Doubling window size has the most impact on bpc
- At first the loss increases a lot, then drops quickly (relevant for global attention later)
- Keep self-attention in fp32 because the model learns to use large attention scores
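A toy sketch of the staged schedule described above. Only the doubling/halving pattern, the 2048 starting seqlen, and the 32-to-512 starting windows come from the slides; the learning rate and the exact per-layer window list are placeholders.

start_seqlen, start_lr = 2048, 3e-5            # start_lr is hypothetical
start_windows = [32, 64, 128, 256, 512]        # assumed per-layer-group windows

for phase in range(5):
    seqlen = start_seqlen * 2 ** phase          # 2048 ... 32768
    lr = start_lr * 0.5 ** phase                # halved every phase
    windows = [w * 2 ** phase for w in start_windows]   # 32..512 -> 512..8192
    print(f"phase {phase + 1}: seqlen={seqlen}, lr={lr:.1e}, windows={windows}")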
Pretraining and finetuning
Goal: a BERT-like model for long-document NLP tasks
Procedure to convert RoBERTa into Longformer
● Increase the size of the position embedding matrix; initialize it by repeatedly copying RoBERTa's 512 pretrained position embeddings (sketched below)
● Continue MLM pretraining on a corpus of long documents
● Copy the q, k, v linear projections to get separate projections for global attention
This procedure can easily be applied to other pretrained models to produce long-document versions of them
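A minimal sketch of the first step using the HuggingFace RoBERTa classes. The target length of 4096 and the variable names are illustrative; the real conversion also deals with RoBERTa's position-id offset and performs the other steps above.

import torch
from transformers import RobertaModel

roberta = RobertaModel.from_pretrained('roberta-base')
old_pos = roberta.embeddings.position_embeddings.weight.data   # pretrained 512-position table
max_pos, dim = 4096, old_pos.size(1)                           # 4096 is the target length here

# Initialize the longer table by tiling the pretrained embeddings.
new_pos = old_pos.new_empty(max_pos, dim)
for start in range(0, max_pos, old_pos.size(0)):
    end = min(start + old_pos.size(0), max_pos)
    new_pos[start:end] = old_pos[: end - start]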
Pretraining - MLM results
Finetuning on downstream tasks
- HotpotQA -- Multihop reasoning and evidence extraction
- TriviaQA -- QA dataset from long Wikipedia articles
- WikiHop -- QA requiring combining facts spread across multiple paragraphs
- Coref -- Coreference chains across the document
- IMDB -- Document classification
- Hyperpartisan news -- Document classification
Finetuning on downstream tasks - Global attention
Classification: on CLS token
QA: on all question tokens
Coref: local attention only
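As an example of these task-specific choices, this is roughly how global attention is specified with the released HuggingFace checkpoint; the question/document strings are placeholders.

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')

# QA-style input: question + long context as a text pair (placeholder strings)
inputs = tokenizer("Who proposed Longformer?", "A long document ...", return_tensors='pt')

global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1   # classification: global attention on the CLS/<s> token
# for QA, set 1 on every question-token position instead

outputs = model(**inputs, global_attention_mask=global_attention_mask)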
Finetuning on downstream tasks - Results
Finetuning on downstream tasks - large model
SOTA on WikiHop and TriviaQA
Competitive HotpotQA result
- Ours is simpler
- Better-performing models use a GNN over entities
- Can we simulate it with global attention?
Finetuning on downstream tasks - ablation
Conclusion
We should consider tasks beyond LM to develop and evaluate long models
Usage
model = transformers.AutoModel.from_pretrained('allenai/longformer-base-4096')
Global attention is important and task specific
Follow our procedure to convert existing pretrained models into long-document versions
Future work
LongformerEncoderDecoder
Better pretraining, other objective functions
Longer sequences
Better encoding of document structure
Global attention instead of GNN
More tasks: summarization, generation, IE
Multi-document