COMP 3361 Natural Language Processing
Lecture 12: LLM prompting, in-context learning,
scaling laws, emergent capacities
Spring 2024
Many materials from COS484@Princeton and CSE447@UW (Taylor Sorensen) with special thanks!
Announcements
• Final exam is scheduled for 9:30 - 11:30am on Wed, May 8 @ Rm 3, Library Ext.
• #assignment-2 due next week!
• Join #assignment-2 Slack channel for discussion
Lecture plan
• LLM pretraining objectives: recap
• LLM prompting and in-context learning
• Scaling laws of LLMs
• Emergent capacities of LLMs
Pretraining: training objectives?
• During pretraining, we have a large text corpus (no task labels)
• Key question: what labels or objectives are used to train vanilla Transformers?
Pretraining objectives
• BERT (encoder-only; Devlin et al., 2018): masked token prediction
• T5 (encoder-decoder; Raffel et al., 2019): denoising span-mask prediction
• Decoder-only (e.g., GPT): next token prediction
https://github.com/manueldeprada/Pretraining-T5-PyTorch-Lightning
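To make the decoder-only objective concrete, here is a minimal sketch of next-token prediction, assuming PyTorch; the tiny embedding + linear "model" and the random token IDs are placeholders, not an actual Transformer:

    # Minimal sketch of the next-token prediction (causal LM) objective.
    import torch
    import torch.nn.functional as F

    vocab_size, d_model = 100, 32
    token_ids = torch.randint(0, vocab_size, (1, 8))  # toy batch: 1 sequence of 8 tokens

    embed = torch.nn.Embedding(vocab_size, d_model)
    lm_head = torch.nn.Linear(d_model, vocab_size)

    hidden = embed(token_ids)      # stand-in for causally masked Transformer layers
    logits = lm_head(hidden)       # (batch, seq_len, vocab_size)

    # Shift by one: position t predicts token t+1; the last position has no target.
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, vocab_size),
        token_ids[:, 1:].reshape(-1),
    )
    print(loss.item())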
Evolution tree of pretrained LMs
[Figure: evolution tree of open-source vs. closed-source pretrained LMs; model sizes range from ~300 million to ~200 billion parameters, roughly 1000 times larger]
https://github.com/Mooler0410/LLMsPracticalGuide
https://mistral.ai/news/mistral-large/
From GPT-1 to GPT-2 to GPT-3
• All decoder-only Transformer-based language models
• Model size ↑, training corpora ↑
• GPT-2 (context size = 1024): trained on 40GB of Internet text
(Radford et al., 2019): Language Models are Unsupervised Multitask Learners
GPT-3: language models are few-shot learners
• GPT-2 → GPT-3: 1.5B → 175B (# of parameters), ~14B → 300B (# of tokens)
Context size = 2048
Training computation is measured in floating-point operations, or "FLOP". One FLOP represents a single arithmetic operation involving floating-point numbers, such as an addition, subtraction, multiplication, or division.
(Brown et al., 2020): Language Models are Few-Shot Learners
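As a back-of-the-envelope check (not from the slide), a widely used rule of thumb estimates training compute as roughly 6 x (number of parameters) x (number of training tokens) FLOP; plugging in the numbers above gives on the order of 10^20 FLOP for GPT-2 and 10^23 FLOP for GPT-3:

    # Rough training-compute estimate via the common "~6 * N_params * N_tokens"
    # rule of thumb (an approximation, not an exact operation count).
    def train_flop_estimate(n_params: float, n_tokens: float) -> float:
        return 6 * n_params * n_tokens

    gpt2 = train_flop_estimate(1.5e9, 14e9)    # GPT-2: 1.5B params, ~14B tokens
    gpt3 = train_flop_estimate(175e9, 300e9)   # GPT-3: 175B params, 300B tokens
    print(f"GPT-2 ~{gpt2:.1e} FLOP, GPT-3 ~{gpt3:.1e} FLOP")  # ~1.3e20 vs ~3.2e23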
Before GPT-3: Modern learning paradigm
• Pre-training + supervised training/fine-tuning
• First train a Transformer on a lot of general text using unsupervised learning. This is called pretraining.
• Then train the pretrained Transformer for a specific task using supervised learning. This is called fine-tuning.
Paradigm shift since GPT-3
• Before GPT-3, pre-training + supervised training/fine-tuning was the default way of training models like BERT/T5/GPT-2
• SST-2 has 67k examples; SQuAD has 88k (passage, answer, question) triples
• Fine-tuning requires computing the gradient and applying a parameter update on every example (or every K examples in a mini-batch); see the sketch below
• However, this is very expensive for the 175B GPT-3 model
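To see why this is costly at the 175B scale, here is a minimal sketch of the supervised fine-tuning loop, with a placeholder model and random toy data standing in for a pretrained Transformer and a labeled dataset (assuming PyTorch):

    # Each mini-batch needs a forward pass, a backward pass (a gradient for every
    # parameter), and an optimizer update over all parameters, which becomes
    # prohibitively expensive when the model has 175B parameters.
    import torch

    model = torch.nn.Linear(768, 2)  # placeholder for a pretrained Transformer + task head
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(3):                  # a few toy mini-batches
        x = torch.randn(16, 768)           # stand-in for encoded (sentence, label) examples
        y = torch.randint(0, 2, (16,))
        loss = loss_fn(model(x), y)
        loss.backward()                    # compute gradients
        optimizer.step()                   # update parameters
        optimizer.zero_grad()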
Latest learning paradigm shift since GPT-3
• Pre-training + prompting/in-context learning (no training in this step)
• First train a large (roughly 7B to 175B parameter) Transformer on a lot of general text using unsupervised learning. This is called large language model pretraining.
• Then directly use the pretrained large Transformer (no further fine-tuning/training) for any different task, given only a natural language description of the task or a few task (x, y) examples. This is called prompting/in-context learning.
GPT-3: few-shot in-context learning
• GPT-3 proposes an alternative: in-context learning
• This is just a forward pass, no gradient update at all!
• You only need to feed a small number of examples (e.g., 32)
• (On the other hand, you also can't feed too many examples at once, since the prompt is bounded by the context size)
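A minimal sketch of how a few-shot prompt is put together; the sentiment task, examples, and template below are illustrative placeholders, not GPT-3's exact format:

    # Few-shot in-context learning: (x, y) demonstrations are simply concatenated
    # into the prompt; the frozen LM then continues the text in one forward pass,
    # with no gradient updates.
    demos = [
        ("The movie was fantastic.", "positive"),
        ("I wasted two hours of my life.", "negative"),
        ("A beautiful, moving story.", "positive"),
    ]
    query = "The plot made no sense at all."

    prompt = "Classify the sentiment of each review.\n\n"
    for x, y in demos:
        prompt += f"Review: {x}\nSentiment: {y}\n\n"
    prompt += f"Review: {query}\nSentiment:"

    print(prompt)  # this string is the entire "training" signal the model sees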
GPT-3: task specifications
• DROP (a reading comprehension task)
• Unscrambling words
• Word in Context (WiC)
GPT-3’s in-context learning
http://ai.stanford.edu/blog/in-context-learning/
(Brown et al., 2020): Language Models are Few-Shot Learners
GPT-3’s scaling laws in performance
(Brown et al., 2020): Language Models are Few-Shot Learners
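For reference (not from this slide), trends like these are commonly summarized as power laws; Kaplan et al. (2020) fit pretraining loss as a power law in model size, roughly of the form

    L(N) \approx (N_c / N)^{\alpha_N}

where N is the number of model parameters and N_c, \alpha_N are empirically fitted constants. Scaling laws are covered in more detail later in this lecture.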
Chain-of-thought (CoT) prompting
(Wei et al., 2022): Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
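A minimal sketch contrasting a standard few-shot prompt with a chain-of-thought prompt, in the arithmetic word-problem style of Wei et al. (2022); the exact wording below is illustrative:

    # Standard few-shot prompt: the demonstration maps the question directly to the answer.
    standard_prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
        "How many tennis balls does he have now?\n"
        "A: The answer is 11.\n\n"
        "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
        "How many apples do they have?\n"
        "A:"
    )

    # Chain-of-thought prompt: the demonstration spells out intermediate reasoning
    # steps, which elicits step-by-step reasoning for the new question as well.
    cot_prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. "
        "How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
        "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
        "How many apples do they have?\n"
        "A:"
    )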
Why in-context learning with LLMs?
• Amazing zero/few-shot performance
  ◦ Save a lot of annotation! 🎉
• Easy to use without training
  ◦ Just talk to them! 👍
• One model for many NLP applications 😄
  ◦ No need to annotate and fine-tune for different tasks
But, again, they are sensitive to prompts! Need to design a good prompt or train a good example retriever! 😂
Okay, so bigger is better? Can you be more specific?
In-Context Learning, Scaling Laws, Emergent Capabilities