
Introduction to Large Language Model

Kun Yuan (袁 坤)

Feb 20, 2024


Contents

• Large language model (LLM)

• How to effectively train LLMs

• How to effectively use LLMs

• Course plans

Note: The main content of this lecture is summarized from two wonderful talks [1, 2] by Andrej Karpathy

[1] State of GPT


[2] The busy person’s intro to LLMs

Teaching assistants

白禹东 耿云腾 何雨桐 李佩津 刘梓豪

鲁可儿 宋奕龙 孙乾祐 王宇驰


PART 01

Large language model (LLM)


Large language model

• Meta's Llama 2 is probably the most powerful open-source LLM

• The weights, architecture, and paper were all released by Meta

• Neural network parameters + the code to run them; that's all you need

• No internet access needed; just one laptop
What are the model parameters?

• An LLM can be regarded as a magic function that maps a context to the next word

• The model parameters θ parameterize this magic function as a series of matrix-matrix (and matrix-vector) products

f("cat sat on a"; θ) = "mat"

• Given the model parameters θ, the LLM can predict the next word
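To make the "series of matrix products" concrete, here is a toy sketch (hypothetical random parameters, mean-pooled context instead of attention; a real LLM is vastly larger):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "a", "mat"]
rng = np.random.default_rng(0)

d = 8                                        # embedding dimension
theta = (rng.normal(size=(len(vocab), d)),   # token embeddings
         rng.normal(size=(d, len(vocab))))   # output projection

def f(context: str, theta) -> str:
    E, W = theta
    ids = [vocab.index(w) for w in context.split()]
    h = E[ids].mean(axis=0)             # crude context summary (real LLMs use attention)
    logits = h @ W                      # matrix-vector product
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]

print(f("cat sat on a", theta))         # with *trained* theta this would be "mat"
```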
LLMs can generate text in various styles

(Figure: sample generations styled as code, books, informational text, and Wikipedia articles)
How to get the weights? Train a deep neural network

• Use tremendous amounts of data and compute to obtain the valuable model parameters

• Very, very expensive; model weights are updated perhaps once a year or once every few years
How to make an LLM your personal copilot? Prompt engineering and finetuning

• Over 90% of my interactions with ChatGPT are …

• But we should use LLMs more frequently and smartly. They can be your personal copilot

• It is not easy to have your own LLM copilot. You need to know prompt engineering and finetuning
PART 02

ChatGPT Training Pipeline


ChatGPT training pipeline has 4 stages

Source: Andrej Karpathy, State of GPT


Pretraining

Pretraining accounts for 99% of the training time and resources

Source: Andrej Karpathy, State of GPT


Pretraining

Data collection

Data crawled from websites, of both high and low quality

High-quality data: the training data mixture used in the LLaMA model
Pretraining

Tokenization

Transform long texts into lists of integers
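For a concrete feel, here is the same idea with a real BPE tokenizer (assuming the tiktoken package is installed; the exact integers depend on the tokenizer used):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's BPE tokenizer
ids = enc.encode("The cat sat on the mat.")
print(ids)              # a short list of integers, one per token
print(enc.decode(ids))  # decoding round-trips to the original text
```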
Pretraining

Token and vocabulary

Sentence: "The cat sat on the mat. The cat is orange."

Tokens: ["The", "cat", "sat", "on", "the", "mat", ".", "The", "cat", "is", "orange", "."]

Vocabulary: {"The", "cat", "sat", "on", "the", "mat", ".", "is", "orange"}

A vocabulary is a set, so each element appears exactly once
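A one-line check of this in Python:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat", ".",
          "The", "cat", "is", "orange", "."]
vocab = set(tokens)              # a set keeps each element exactly once
print(len(tokens), len(vocab))   # 12 tokens, 9 unique vocabulary entries
```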

Pretraining

While GPT-3 is larger, LLaMA is trained on more tokens. In practice, LLaMA performs significantly better.

We cannot judge the power of an LLM only by its number of parameters; data also matters

It is still under debate whether one should increase model size or data size given a limited compute budget
Pretraining

• Effective representation learning

• Long-range dependency via attention (see the formula below)

• Parallelizable architecture

• Flexibility and adaptability

(In the recently popular Sora, Diffusion + Transformer is used)

Transformer architecture (will discuss it in later lectures)
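For reference, the operation behind the long-range dependency modeling above is scaled dot-product attention (a standard formula, shown here ahead of the later lectures):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$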
Pretraining

[Training Compute-Optimal Large Language Models]

Larger dataset + bigger model + longer training = better prediction accuracy

A very straightforward way to achieve a good LLM: all you need is MONEY!

Amazing representation power
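As a rough worked example: the cited paper (Chinchilla) suggests about 20 training tokens per parameter is compute-optimal, and a common estimate (an assumption here, not from the slides) is that training costs about 6·N·D FLOPs for N parameters and D tokens:

```python
# Rough compute-optimal sizing following the Chinchilla rule of thumb
# (~20 tokens per parameter) and the common ~6*N*D training-FLOPs estimate.
N = 70e9                 # model parameters (e.g., a 70B model)
D = 20 * N               # compute-optimal number of training tokens
flops = 6 * N * D        # approximate total training FLOPs
print(f"tokens: {D:.2e}, training FLOPs: {flops:.2e}")
```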
Pretraining

• Pretraining a base model is extremely expensive

• Several effective pretraining techniques (a minimal sketch of one follows this list):

§ 3D parallelism: data/model/tensor parallelism

§ Memory-efficient optimizers

§ Large-batch training

§ Mixed-precision training

• Will discuss them in later lectures
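As a taste, here is a minimal mixed-precision training loop in PyTorch (a generic sketch: `model`, `loader`, and the hyperparameters are assumed placeholders, not from the lecture):

```python
import torch

scaler = torch.cuda.amp.GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()   # scale loss to avoid fp16 underflow
    scaler.step(optimizer)          # unscales gradients, then steps
    scaler.update()                 # adjusts the scale factor over time
```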
Pretrained models provide strong transfer learning capabilities

A pretrained base model performs well after finetuning
Pretrained models provide strong transfer learning capabilities

• Pretraining + finetuning/prompting reshapes the AI industry

• A pretrained base model needs only a small amount of data to be adapted to downstream applications

• The cost of deploying AI in downstream applications decreases significantly

§ Obtain powerful base models from OpenAI/Google/Meta/GitHub

§ Collect a small amount of downstream data and use it to finetune the base model

§ No need for expensive investments of money and talent
Pretraining

Base models in the wild

Pretraining

LLaMA and BLOOM are popular open-source base models

• LLaMA: https://github.com/facebookresearch/llama

• BLOOM: https://huggingface.co/bigscience
Supervised Finetuning

A base model cannot be deployed directly; it is still far from being a smart assistant
Supervised Finetuning

Base models can be tricked into being AI assistants with prompting

We need to finetune the base model to make it chat like humans
Supervised Finetuning

Ask human contractors to respond to prompts and write high-quality, helpful, truthful, and harmless responses

Collect 10,000+ high-quality human-written responses

Finetune base models on this high-quality data
Supervised Finetuning

• Dataset: 10K–100K human-generated data pairs {(prompt, response)} (a hypothetical example follows this list)

• Training: repeat what we did in the "Pretraining" stage

• After the supervised finetuning stage, base models can chat like humans

• 1–100 GPUs; days of training; but it can still be very expensive due to the human-generated data

• To save money, some (or most) models use ChatGPT-generated data for finetuning
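A hypothetical data pair and how it is used (the format is illustrative, not a specific dataset's):

```python
# One hypothetical {(prompt, response)} pair; SFT simply continues
# next-token training on text assembled from such pairs.
pair = {
    "prompt": "Explain what a tokenizer does.",
    "response": "A tokenizer splits text into a list of integer token ids ...",
}
training_text = pair["prompt"] + "\n\n" + pair["response"]
# the same next-token-prediction loss from pretraining is applied to this text
```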
Reward modeling

• The SFT model performs like an "assistant", but is still not good enough

• To further improve it, one can ask human contractors to generate more data; effective but expensive

• Another way is to let the model learn what makes a response good, and how to generate good responses

• A reward model enables GPT to judge whether a given response is good or not

• The reward model is used in the reinforcement learning stage to reinforce good responses
Reward modeling

Dataset

• The SFT model generates different responses to the same prompt

• Ask contractors to rank the responses; much cheaper than writing responses from scratch

• Resulting dataset: {(prompt, response, reward)}
Reward modeling

• Given a prompt, the SFT model generates several responses, then makes a reward prediction (green) for each

• The predicted reward is supervised by the ground-truth reward

• After training, we obtain a reward model (RM) that can predict the reward of a generated response (a sketch of the typical ranking loss follows)
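A minimal sketch of the pairwise ranking loss commonly used to train reward models from ranked responses (Bradley-Terry style, as in InstructGPT; not stated on the slides):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # push the reward of the human-preferred response above the other one
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# r_chosen / r_rejected would come from the RM scoring two ranked responses
print(reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3])))
```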
Reinforcement learning

RL trains the model to generate responses that earn high reward scores
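In spirit, the update is a policy-gradient step; a heavily simplified sketch follows (production RLHF uses PPO with a KL penalty to the SFT model; all numbers here are hypothetical):

```python
import torch

logprob = torch.tensor(-5.0, requires_grad=True)  # log p(response | prompt)
reward, baseline = 0.9, 0.5    # RM score and a variance-reducing baseline

loss = -(reward - baseline) * logprob  # minimizing this raises p(high-reward responses)
loss.backward()
print(logprob.grad)   # negative gradient: a descent step increases logprob
```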
ChatGPT training pipeline

Source: Andrej Karpathy, State of GPT


Assistant models in the wild

A short summary

• We discussed the pipeline used to train ChatGPT

• SFT, RM, and RL are critical to transform GPT into ChatGPT

• SFT, RM, and RL are also critical to transform GPT into your own personalized assistant
PART 03

Use LLM Effectively As Your Personal Copilot


Understand how humans and LLMs work differently

• Humans can plan and reflect

• Humans can use tools

• Humans typically think more before answering
Understand how humans and LLMs work differently

• LLMs strip away all of these human behaviors; they simply predict the next token
Use prompts to help LLMs work like humans

• Chain of thought: break tasks up into multiple steps/stages (an example prompt follows)

(will discuss it in later lectures)
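A hypothetical chain-of-thought prompt; asking the model to reason step by step before answering often helps on multi-step problems:

```python
prompt = (
    "Q: A store had 23 apples, sold 9, and then received 12 more. "
    "How many apples does it have now?\n"
    "A: Let's think step by step."   # the cue that elicits stepwise reasoning
)
```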
Tree of thought

• Tree of thoughts: expand thoughts, evaluate them, and then go deeper

(will discuss it in later lectures)

• Finding simple and effective prompts is still a hot research topic
Prompt ensemble

Ask for reflection

Automatic prompt engineering (APE)

• Learn a good prompt automatically

[Large language models are human-level prompt engineers, 2023]

RAG empowered LLM

Retrieval-augmented generation (RAG) helps LLMs generate more precise, up-to-date, and personalized content
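A minimal sketch of the retrieve-then-generate loop (`embed`, `index.search`, and `llm` are hypothetical helpers standing in for an embedding model, a vector store, and an LLM API):

```python
def rag_answer(question: str, index, llm, k: int = 3) -> str:
    docs = index.search(embed(question), k=k)        # 1. retrieve relevant documents
    context = "\n\n".join(doc.text for doc in docs)  # 2. assemble them into context
    prompt = (
        f"Answer the question using only the context below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)                               # 3. generate, grounded in context
```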
RAG empowered LLM

(Figure: a RAG-powered Bing Copilot response compared with ChatGPT 3.5)
Tool use

Offload tasks that LLMs are not good at
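A toy illustration of the idea with a hypothetical calculator protocol (the "CALC:" marker is made up for this sketch):

```python
def run_with_tools(model_output: str) -> str:
    # if the model asks for the calculator tool, run it on the host side
    if model_output.startswith("CALC:"):
        expr = model_output[len("CALC:"):].strip()
        return str(eval(expr, {"__builtins__": {}}))  # toy calculator tool
    return model_output

print(run_with_tools("CALC: 123456789 * 987654321"))  # exact arithmetic via the tool
```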
Finetuning

SFT and RLHF both finetune the pretrained base model
LoRA: Low-rank adaptation

Finetuning injects additional weights into the base model:

W' = W + ΔW    (fine-tuned weight = base model weight + additional weight)

LoRA: low-rank adaptation

W' = W + BA,    B ∈ R^{d×r}, A ∈ R^{r×k}, r ≪ min(d, k)    (fine-tuned weight = base model weight + low-rank weight)
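A minimal LoRA sketch in PyTorch (shapes illustrative): freeze W and train only the low-rank factors B and A, i.e. r·(d+k) parameters instead of d·k:

```python
import torch

d, k, r = 1024, 1024, 8
W = torch.randn(d, k)                            # frozen base weight
B = torch.nn.Parameter(torch.zeros(d, r))        # B = 0 so W' starts equal to W
A = torch.nn.Parameter(0.01 * torch.randn(r, k)) # small random init

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    return x @ (W + B @ A).T                     # W' = W + BA

y = lora_forward(torch.randn(2, k))              # only B and A receive gradients
```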
LoRA: Low-rank adaptation

Light but powerful

LoRA: Low-rank adaptation

References

E. J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685

Q. Zhang et al., AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning, https://arxiv.org/abs/2303.10512
How to use LLMs effectively?

Recommendations from OpenAI
Use cases

Course plan

• 1. Preliminaries

§ Linear algebra; optimization

§ Machine learning; deep neural networks

§ Word embeddings; recurrent neural networks; Seq2Seq

§ Attention; Transformer

§ GPT
Course plan

• 2. LLM pretraining

§ SGD

§ Momentum SGD; Adaptive SGD; Adam

§ Large-batch training; mixed-precision training

§ Data parallelism; model parallelism; tensor parallelism

Course plan

• 3. Finetuning

§ Supervised finetuning

§ RLHF

§ Parameter efficient finetuning (PEFT), e.g., LoRA

Course plan

• 4. Prompt engineering

§ Chain of thought; tree of thought

§ Principles for generating high-quality prompts

§ Automatic prompt engineering
Course plan

• 5. Applications

§ LLM agent

§ LLM in decision intelligence

Grading policy

• Homework (~30%)

• Mid-term (~30%)

• Final project and presentation (~40%)

Thank you!

Kun Yuan homepage: https://kunyuan827.github.io/
