Fine-Tuning an LLM
What does pre-training mean?
What does fine-tuning mean?
How many parameters does BERT have?
Is BERT much smaller than GPT?
How was the BERT model pre-trained?
How does the MLM (masked language modelling) pre-training objective work?
How does the NSP (next sentence prediction) pre-training objective work?
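For the MLM objective, here is a minimal sketch of the masking step using Hugging Face's DataCollatorForLanguageModeling; the bert-base-uncased checkpoint and the example sentence are placeholders, not fixed choices.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# MLM: randomly select ~15% of the tokens, replace most of them with [MASK],
# and train the model to predict the original tokens at those positions.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = collator([tokenizer("The cat sat on the mat.")])
print(batch["input_ids"])  # selected tokens replaced by [MASK] (or a random token)
print(batch["labels"])     # original ids at the masked positions, -100 elsewhere
```

NSP, in contrast, feeds BERT a pair of sentences and trains it to predict whether the second sentence actually follows the first in the corpus.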
Pre-training is like a child learning to read and write his/her mother tongue.
Fine-tuning is like a student learning to use language to perform complex tasks in high school and college.
In-context learning is like a working professional trying to figure out his/her manager's instructions.
Zero-Shot vs Few-Shot
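To make the distinction concrete, here is a minimal sketch with made-up prompts; the sentiment task and its wording are only illustrative.

```python
# Zero-shot: the model gets only an instruction, with no solved examples.
zero_shot_prompt = """Classify the sentiment of the review as Positive or Negative.
Review: The battery dies within two hours.
Sentiment:"""

# Few-shot: the same instruction plus a few solved examples placed in the prompt.
# This is in-context learning: no model parameters are updated.
few_shot_prompt = """Classify the sentiment of the review as Positive or Negative.
Review: Absolutely loved the camera quality.
Sentiment: Positive
Review: The screen cracked after a week.
Sentiment: Negative
Review: The battery dies within two hours.
Sentiment:"""
```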
TEXT CLASSIFICATION
Classical NLP Approach
Requires Fine Tuning
Is only the classifier layer on top trained, or are the BERT parameters also updated during fine-tuning?
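Both options are possible; here is a minimal sketch using Hugging Face transformers, where the checkpoint name and the number of labels are placeholders.

```python
from transformers import AutoModelForSequenceClassification

# A classification head (an untrained linear layer) is added on top of BERT's
# pooled [CLS] representation.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Option 1: full fine-tuning -- all BERT parameters and the new classifier
# layer receive gradient updates (the setup used in the original BERT paper).

# Option 2: feature extraction -- freeze the BERT encoder and train only the
# classifier layer on top.
for param in model.bert.parameters():
    param.requires_grad = False
```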
NAMED ENTITY RECOGNITION
BERT NER: The B-I-O Notation
Yesterday  ,  Rohan  Sharma  traveled  to  Mumbai  .
O          O  B-PER  I-PER   O         O   B-LOC   O
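A minimal sketch of B-I-O tagging with a fine-tuned BERT model via the transformers pipeline; the dslim/bert-base-NER checkpoint is just one publicly available example, not a fixed choice.

```python
from transformers import pipeline

# Token classification: the model predicts one B-I-O tag per token, and the
# pipeline groups consecutive B-/I- tokens into entity spans.
ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

for entity in ner("Yesterday, Rohan Sharma traveled to Mumbai."):
    print(entity["entity_group"], entity["word"], f"{entity['score']:.3f}")
# Typically: PER for "Rohan Sharma" and LOC for "Mumbai"
```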
INFORMATION RETRIEVAL
SBERT Fine-Tuning
- The query is encoded as an embedding vector
- The documents in the database are also stored as embedding vectors
- Brute Force Approach:
  Take the dot product of the query vector with the embeddings of all the
  documents and return the document with the highest score
- Hierarchical Navigable Small World (HNSW):
  Build a layered graph over the document embedding vectors so that the
  nearest-neighbour search runs much faster (both approaches are sketched below)
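A minimal sketch of both search strategies, assuming the sentence-transformers and hnswlib libraries and the all-MiniLM-L6-v2 checkpoint; these library and model choices are assumptions, not requirements.

```python
import numpy as np
import hnswlib
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "Mumbai is the financial capital of India.",
    "BERT is pre-trained with masked language modelling.",
    "HNSW builds a layered graph for fast nearest-neighbour search.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)      # shape: (n_docs, dim)
query_emb = model.encode("How is BERT pre-trained?", normalize_embeddings=True)

# Brute force: dot product of the query with every document embedding.
scores = doc_emb @ query_emb
print("brute force:", docs[int(np.argmax(scores))])

# HNSW: index the document embeddings once, then query in sub-linear time.
index = hnswlib.Index(space="ip", dim=doc_emb.shape[1])      # "ip" = inner product
index.init_index(max_elements=len(docs), ef_construction=200, M=16)
index.add_items(doc_emb, np.arange(len(docs)))
labels, _ = index.knn_query(query_emb, k=1)
print("HNSW:", docs[int(labels[0][0])])
```

With normalized embeddings the dot product equals cosine similarity, so both approaches agree on the ranking; HNSW trades a small amount of accuracy for much faster search over large collections.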
QUESTION ANSWERING
How to fine-tune BIG models?
Quantization
- LLMs require a large amount of expensive GPU memory because they have a
  large number of parameters stored as high-precision floating-point numbers
- Quantization stores the weights at lower precision (e.g., 4-bit integers
  instead of 16-bit floats), which shrinks the memory footprint substantially,
  as the table below shows
Model        Original Size    Quantized Size (4-bit)
LLaMA 7B     13 GB            3.9 GB
LLaMA 13B    24 GB            7.8 GB
LLaMA 30B    60 GB            19.5 GB
LLaMA 65B    120 GB           38.5 GB
The NVIDIA A100 has 80 GB of memory and costs around INR 12-15 lakhs.
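A minimal sketch of 4-bit loading with bitsandbytes via transformers; the Llama-2-7b-hf checkpoint is a placeholder and requires access approval.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization: weights are stored in 4 bits, computation runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",    # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
# The 7B weights now occupy roughly 4 GB of GPU memory instead of ~13 GB in fp16.
```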
Distillation
- Transfer of knowledge from a larger “teacher” model to a smaller “student” model
- The smaller model stands in for the larger model on specific tasks
- The larger model learns the distribution from the data; the smaller model
  learns that distribution from the larger model (see the sketch below)
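A minimal sketch of a standard distillation loss; the temperature, the weighting factor, and the function name are illustrative choices, not taken from the source.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's output distribution,
    # softened by the temperature T (this is how the distribution is transferred).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# During training the teacher runs in eval mode under torch.no_grad();
# only the student's parameters are updated.
```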