LLMs for text classification and generation
INTRODUCTION TO LLMS IN PYTHON
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Loading a pre-trained LLM
Pipelines (pipeline()):
- Simple, high-level interface
- Automatic model and tokenizer selection
- More abstraction = less control
- Limited task flexibility

Auto classes (e.g. the AutoModel class):
- Flexibility, control, and customization
- Complexity: manual set-up
- Support very diverse language tasks
- Enable model fine-tuning
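Both routes can be sketched in a few lines; the checkpoint and example text here are illustrative, not from the slides:

from transformers import pipeline, AutoModel, AutoTokenizer

# High-level route: pipeline() selects a default model and tokenizer
classifier = pipeline("sentiment-analysis")
print(classifier("LLMs are remarkably versatile!"))

# Lower-level route: Auto classes expose the model and tokenizer directly
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")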
The AutoModel and AutoTokenizer classes
import torch  # needed later for torch.softmax()
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "I am an example sequence for text classification."

class SimpleClassifier(nn.Module):
    def __init__(self, input_size, num_classes):
        super(SimpleClassifier, self).__init__()
        self.fc = nn.Linear(input_size, num_classes)

    def forward(self, x):
        return self.fc(x)

- from_pretrained(): loads the pre-trained model weights and tokenizer specified by model_name
- model_name: a model checkpoint, i.e. a unique model version with specific architecture, configuration, and weights
- AutoModel does not provide a task-specific head, hence the SimpleClassifier defined above
inputs = tokenizer(
    text, return_tensors="pt", padding=True,
    truncation=True, max_length=64)
outputs = model(**inputs)
pooled_output = outputs.pooler_output

print("Hidden states size: ", outputs.last_hidden_state.shape)
print("Pooled output size: ", pooled_output.shape)

classifier_head = SimpleClassifier(
    pooled_output.size(-1), num_classes=2)
logits = classifier_head(pooled_output)
probs = torch.softmax(logits, dim=1)
print("Predicted Class Probabilities:", probs)

- Tokenize the inputs and get the model's hidden states in outputs
- pooler_output: high-level, aggregated representation of the sequence
- last_hidden_state: raw, unaggregated hidden states
- Forward pass through the classification head to obtain class probabilities

Hidden states size:  torch.Size([1, 11, 768])
Pooled output size:  torch.Size([1, 768])
Predicted Class Probabilities: tensor([[0.4334, 0.5666]], grad_fn=<SoftmaxBackward0>)
Auto class for text classification
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name)

text = "The quality of the product was just okay."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()
print(f"Predicted class index: {predicted_class + 1} star.")

- AutoModelForSequenceClassification class: provides a pre-configured model with a classification head; no need to manually add a model head
- outputs have already passed through the head's linear layer
- Access the raw class logits and return the "most likely" class

Predicted class index: 3 star.
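As a hypothetical follow-up, the logits can also be turned into probabilities and a human-readable label; id2label is read from the model's configuration (for this checkpoint, the labels are "1 star" through "5 stars"):

probs = torch.softmax(outputs.logits, dim=1)
label_id = torch.argmax(probs, dim=1).item()
print("Predicted label:", model.config.id2label[label_id])  # e.g. "3 stars"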
Auto class for text generation
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "This is a simple example for text generation,"
inputs = tokenizer.encode(
    prompt, return_tensors="pt")
output = model.generate(inputs, max_length=26)
generated_text = tokenizer.decode(
    output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)

- AutoModelForCausalLM class: pre-configured model for causal (auto-regressive) language generation, e.g. "gpt2", with a model head for next-word prediction
- generate() takes the prompt and generates tokens until the sequence, prompt included, reaches max_length tokens
- Raw outputs are decoded before printing the extended prompt with the generated text

Generated Text:
This is a simple example for text generation, but it's also
a good way to get a feel for how the text is generated.
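Greedy decoding is only the default; generate() also accepts standard sampling parameters. A sketch with illustrative values:

output = model.generate(
    inputs,
    max_new_tokens=20,  # generate 20 tokens beyond the prompt
    do_sample=True,     # sample instead of greedy decoding
    temperature=0.7,    # soften the next-token distribution
    top_k=50,           # sample only among the 50 most likely tokens
)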
Exploring a dataset for text classification
from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("imdb")
train_data = dataset["train"]
dataloader = DataLoader(train_data, batch_size=2, shuffle=True)

for batch in dataloader:
    for i in range(len(batch["text"])):
        print(f"Example {i + 1}:")
        print("Text:", batch["text"][i])
        print("Label:", batch["label"][i])

- load_dataset(): loads a dataset from the Hugging Face Hub; imdb is a review sentiment classification dataset
- DataLoader class: simplifies iterating, batch processing, and parallelism
- Iterating through review-sentiment examples

Example 1:
Text: Much worse than the original. It was actually *painf(...)
Label: tensor(0)
Example 2:
Text: I have to agree with Cal-37 it's a great movie, spec(...)
Label: tensor(1)
Exploring a dataset for text generation
from datasets import load_dataset

dataset = load_dataset("stanfordnlp/shp", "askculinary")
train_data = dataset["train"]

for i in range(5):
    example = train_data[i]
    print(f"Example {i + 1}:")
    print("Title:", example["post_id"])
    print("Paragraph:", example["history"])
    print()

- Using a dataset from the stanfordnlp catalogue, suitable for text generation and generative QA
- Display some text information in the data instances

Example 1:
Title: himc90
Paragraph: In an interview right before receiving the 2013
Nobel prize in physics, Peter Higgs stated that he (...)
Example 2 (...)
How text generation LLM training works
Input + target (labels) pairs:
- Input sequences: a segment of the text, e.g. "the cat is" from "the cat is sleeping on the mat"
- Target sequences: the same tokens shifted one position to the left, e.g. "cat is sleeping" (as sketched below)
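A minimal sketch of this shift; the token IDs are illustrative, and in practice Hugging Face causal LM models apply the shift internally when labels are passed as a copy of input_ids:

import torch

# Illustrative token IDs for "the cat is sleeping on the mat"
token_ids = torch.tensor([464, 3797, 318, 11029, 319, 262, 2603])

input_ids = token_ids[:-1]   # "the cat is sleeping on the"
target_ids = token_ids[1:]   # "cat is sleeping on the mat"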
Let's practice!
LLMs for text summarization and translation
INTRODUCTION TO LLMS IN PYTHON
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Inside text summarization
Goal: create a summarized version of a text, preserving important information
- Inputs: original text
- Target (labels): summarized text
- Extractive summarization: select, extract, and combine parts of the original text
- Abstractive summarization: generate a summary word by word (see the sketch below)
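A minimal illustration of the contrast (the example text and lead-1 heuristic are ours, not from the course): an extractive summarizer copies spans from the source, while an abstractive model such as the t5-small example later in this chapter writes new text word by word.

text = ("The Apple Watch detected a hard fall. It alerted emergency services "
        "with the rider's location. He reached the hospital quickly.")

# Naive extractive baseline ("lead-1"): keep the first sentence verbatim
extractive_summary = text.split(". ")[0] + "."
print(extractive_summary)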
Exploring a text summarization dataset
from datasets import load_dataset

dataset = load_dataset("ILSUM/ILSUM-1.0", "English")
print(f"Features: {dataset['train'].column_names}")

Features: ['id', 'Article', 'Heading', 'Summary']

Two main text attributes in each example:
- 'Article': the long text, used as the input sequence for the LLM
- 'Summary': the summarized text, used as the target (training label)

example = dataset["train"][21]
example['Article']

This is how an Apple Watch saved a man's life after detecting accident. It all started when Gabe Burdett was waiting for his father Bob at their pre-designated location for some mountain biking at the Riverside State Park when he received a text alert from his dad's Apple Watch, saying it had detected a "hard fall". Burdett, from city of Spokane in Washington State later received another update from the Watch, saying his father had reached Sacred Heart Medical Center. "We drove straight there but he was gone when we arrived. I get another (...)

example['Summary']

Dad flipped his bike at the bottom of Doomsday, hit his head and was knocked out until sometime during the ambulance ride. The watch had called 911 with his location and EMS had him scooped up and to the hospital in under a 1/2hr.
Loading a pre-trained LLM for summarization
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_ids = tokenizer.encode(
    "summarize: " + example["Article"],
    return_tensors="pt", max_length=512, truncation=True
)
summary_ids = model.generate(input_ids, max_length=150)
summary = tokenizer.decode(
    summary_ids[0], skip_special_tokens=True)

print("Original Text:")
print(example["Article"])
print("\nGenerated Summary:")
print(summary)

- Import and use AutoModelForSeq2SeqLM; load t5-small, versatile for various tasks
- Add a task-specific prefix to the input text: "summarize: "
- .generate() passes the tokenized input to the model
- .decode() post-processes the output token IDs back into text
Original Text:
This is how an Apple Watch saved a man's life after detecting accident. It all started when Gabe Burdett was waiting for his father (...)

Generated Summary:
a man was waiting for his father when he received a text alert from his dad's apple watch. the watch notified 911 with the location and within 30 minutes, emergency medical services took the injured Bob to the hospital. the watch notified 911 with the location and within 30 minutes, emergency medical services took the injured Bob to the hospital.

(Due to space limitations, only part of the original input text is shown.)
Inside language translation
Goal: produce a translated version of a text, conveying the same meaning and context
- Inputs: text in the source language
- Target (labels): translation in the target language
- Encode the source language sequence
- Decode into the target language sequence, using learned language patterns and associations
Exploring a language translation dataset
from datasets import load_dataset

dataset = load_dataset("techiaith/legislation-gov-uk_en-cy")
sample_data = dataset["train"]

input_example = sample_data.data['source'][0]
target_example = sample_data.data['target'][0]

print("Input (English):", input_example)
print("Target (Welsh):", target_example)

- Load the English-Welsh bilingual dataset (a Dataset object) and extract a training example
- source: English sequences; target: Welsh sequences

Input (English): 2 Regulations under section 1: supplementary
Target (Welsh): 2 Rheoliadau o dan adran 1: atodol
Loading a pre-trained LLM for translation
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-cy"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

input_seq = "2 Regulations under section 1: supplementary"
input_ids = tokenizer.encode(input_seq, return_tensors="pt")
translated_ids = model.generate(input_ids)
translated_text = tokenizer.decode(
    translated_ids[0], skip_special_tokens=True)
print("Predicted (Welsh):", translated_text)

- Import and use AutoModelForSeq2SeqLM; load the Helsinki-NLP model for English-Welsh translation
- Tokenize the English sequence (.encode()) and pass it to the model (.generate())
- Decode and print the Welsh translation

Predicted (Welsh):
2 Rheloiad o dan adran 1:aryary
Let's practice!
LLMs for question answering
INTRODUCTION TO LLMS IN PYTHON
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Types of question answering (QA) tasks
- Extractive QA (encoder-only architecture): the LLM extracts the answer to a question from a provided context
- Open generative QA (encoder-decoder): the LLM generates the answer based on a context
- Closed generative QA (decoder-only): the LLM fully generates the answer, with no context provided
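For extractive QA, the high-level pipeline() interface offers a quick first look before the Auto-class workflow below; the question and context here are illustrative:

from transformers import pipeline

qa = pipeline("question-answering",
              model="deepset/minilm-uncased-squad2")
result = qa(question="What is wasabi?",
            context="Wasabi is a pungent green condiment used in Japanese cuisine.")
print(result["answer"])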
Exploring a QA dataset
from datasets import load_dataset

mlqa = load_dataset(
    "xtreme", name="MLQA.en.en")
print(mlqa)

- Load the English subset of the xtreme dataset for extractive QA
- A DatasetDict object containing test and validation Dataset objects
- Relevant features: 'context', 'question', 'answers'

DatasetDict({
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11590
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1148
    })
})
INTRODUCTION TO LLMS IN PYTHON
Example instance in the dataset:

print("Question:", mlqa["test"]["question"][53])
print("Answer:", mlqa["test"]["answers"][53])
print("Context:", mlqa["test"]["context"][53])
Question: what is a kimchi?
Answer: {'answer_start': [271], 'text': ['a fermented, usually spicy vegetable dish']}
Context: Korean cuisine is largely based on rice, noodles, tofu, vegetables, fish and meats. Traditional Korean
meals are noted for the number of side dishes, banchan, which accompany steam-cooked short-grain rice. Every
meal is accompanied by numerous banchan. Kimchi, a fermented, usually spicy vegetable dish is commonly served
at every meal and is one of the best known Korean dishes. Korean cuisine usually involves heavy seasoning with
sesame oil, doenjang, a type of fermented soybean paste, soy sauce, salt, garlic, ginger, and gochujang, a hot
pepper paste. Other well-known dishes are Bulgogi, grilled marinated beef, Gimbap, and Tteokbokki , a spicy
snack consisting of rice cake seasoned with gochujang or a spicy chili paste.
Extractive QA: framing the problem
- Supervised learning: span classification
- Prediction result: the answer span given by [start position, end position]
- The answer span is obtained from the most likely raw outputs (logits)
Extractive QA: tokenizing inputs
from transformers import AutoTokenizer

model_ckp = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckp)

question = "How is the taste of wasabi?"
context = """Japanese cuisine captures the essence of \
a harmonious fusion between fresh ingredients and \
traditional culinary techniques, all heightened \
by the zesty taste of the aromatic green condiment \
known as wasabi."""

inputs = tokenizer(question, context,
                   return_tensors="pt")

Tokenization results:
- input_ids: integer token IDs
- attention_mask: Boolean mask
- token_type_ids: 0 = question, 1 = context
Extractive QA: loading and using model
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_ckp)

with torch.no_grad():
    outputs = model(**inputs)

start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits) + 1

answer_span = inputs["input_ids"][0][start_idx:end_idx]
answer = tokenizer.decode(answer_span)

- Custom model class: AutoModelForQuestionAnswering
- Inference on the example input: **inputs unpacks the tokenized inputs
- Raw output post-processing: start_logits and end_logits hold the answer start/end likelihoods per input token
- start_idx, end_idx: positions of the input tokens delimiting the answer span
Managing long context sequences
# example_qt, example_ct: a question-context pair, e.g. from the MLQA example above
long_exmp = tokenizer(example_qt, example_ct,
                      return_overflowing_tokens=True,
                      max_length=100, stride=25)

for idx, window in enumerate(long_exmp["input_ids"]):
    print("No. tokens in window", idx, ":", len(window))

No. tokens in window 0 : 100
No. tokens in window 1 : 100
[...]
No. tokens in window 8 : 50

Sliding window parameters:
- max_length: sliding window size
- stride: number of overlapping tokens between consecutive windows

for window in long_exmp["input_ids"]:
    print(tokenizer.decode(window), "\n")

[CLS] what is a kimchi? [SEP] Korean cuisine is l[...]
[CLS] what is a kimchi? [SEP] steam-cooked short-[...]
Let's practice!
LLM fine-tuning and transfer learning
INTRODUCTION TO LLMS IN PYTHON
Iván Palomares Carrascosa, PhD
Senior Data Science & AI Manager
Revisiting the LLM lifecycle
- Full fine-tuning: the entire model's weights are updated; more computationally expensive
- Partial fine-tuning: lower (body) layers are kept fixed; only task-specific layers (the head) are updated, as in the sketch below
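A minimal sketch of partial fine-tuning, freezing the model body so that only the head trains; the parameter names follow DistilBERT's layout and may differ for other models:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

for name, param in model.named_parameters():
    # Freeze every parameter outside the classification head
    if not name.startswith(("pre_classifier", "classifier")):
        param.requires_grad = False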
Demystifying transfer learning
- Transfer learning: a model trained on one task is adapted to a different but related task
- With pre-trained LLMs, this means fine-tuning on a smaller dataset for a specific task
- Zero-shot learning: performing tasks never "seen" during training (see the sketch below)
- One-shot, few-shot learning: adapting a model to a new task with only one or a few examples
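Zero-shot behaviour can be sketched with the zero-shot-classification pipeline; the checkpoint, example text, and labels are illustrative:

from transformers import pipeline

zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")
print(zero_shot("This phone's battery lasts two full days",
                candidate_labels=["electronics", "cooking", "sports"]))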
Fine-tuning a pre-trained Hugging Face LLM
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)

def tokenize_function(examples):
    return tokenizer(
        examples["text"], padding="max_length", truncation=True)

data = load_dataset("imdb")
tokenized_data = data.map(tokenize_function, batched=True)

- Load a BERT-based model for text classification and its associated tokenizer
- Tokenize the dataset used for fine-tuning: the IMDB reviews dataset
- truncation=True truncates input sequences beyond the model's max_length
- batched=True processes examples in batches rather than individually
Fine-tuning a pre-trained Hugging Face LLM
from transformers import Trainer, TrainingArguments TrainingArguments class: customize training
training_args = TrainingArguments(
settings
output_dir="./smaller_bert_finetuned",
per_device_train_batch_size=8,
Output directory, batch size per GPU,
num_train_epochs=3,
evaluation_strategy="steps", epochs, etc.
eval_steps=500,
save_steps=500,
logging_dir="./logs",
)
Trainer class: manage training and
trainer = Trainer(
validation loop
model=model,
args=training_args,
train_dataset=tokenized_datasets["train"], Specify model, training arguments, training
eval_dataset=tokenized_datasets["test"],
)
and validation sets
trainer.train()
trainer.train() : execute training loop
Inference and saving a fine-tuned LLM
example_input = tokenizer(
    "I am absolutely amazed with this new and revolutionary AI device",
    return_tensors="pt")
output = model(**example_input)
predicted_label = torch.argmax(output.logits, dim=1).item()
print("Predicted Label:", predicted_label)

Predicted Label: 0

model.save_pretrained("./my_bert_finetuned")
tokenizer.save_pretrained("./my_bert_finetuned")

- After fine-tuning, inference is performed as usual: tokenize the inputs, pass them to the LLM, obtain and post-process the outputs
- The fine-tuned model and tokenizer can be saved with .save_pretrained()
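The saved artifacts can later be reloaded with from_pretrained(); a minimal sketch using the directory saved above:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./my_bert_finetuned")
tokenizer = AutoTokenizer.from_pretrained("./my_bert_finetuned")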
Let's practice!