Getting Started With The Model Architecture of The Transformer
Transformer models now drive a wide range of NLP applications, including:
Language modeling
Chatbots
Personal assistants
Question answering
Text summarization
Speech-to-text
Sentiment analysis
Machine translation
The Transformer architecture marks a break from past approaches using Recurrent
Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
Input Embedding
The book then moves on to discuss input embeddings.
Positional Encoding
The book then moves on to discuss positional encoding.
The book then moves on to provide some concluding remarks before the end of the
chapter.
Summary
The book then provides a summary of the chapter.
Questions
The book then provides some questions for the reader to answer.
References
The book then provides a list of references.
NLP's Evolution
The Transformer's rise marks a significant shift in NLP, overcoming the limitations of earlier methods. Several key figures and concepts paved the way, from early language models to the emergence of attention.
Emergence of Attention
The concept of attention, which involves "peeking" at other tokens in a sequence, was
added to RNN and CNN models to improve their performance.
Multi-Head Attention
The Transformer model uses multi-head attention, running eight attention mechanisms in parallel. This lets each head learn different relationships between the tokens while keeping the computation parallelizable.
LayerNorm(x + Sublayer(x))

The output of each sub-layer is the layer-normalized sum of the sub-layer's input and output, LayerNorm(x + Sublayer(x)). The layers are structurally identical but learn different associations of the tokens in the sequence.
Constant Dimensionality
The output of every sub-layer has a constant dimension, denoted d_model, which equals 512 in the original Transformer architecture. This consistency optimizes calculations and information flow.
Input Embedding
The input embedding sub-layer converts input tokens into vectors of dimension d_model = 512.
Skip-Gram Model
Skip-gram: Focuses on a center word in a window of words and predicts
context words. For example, given the sentence "The black cat sat on the
couch and the brown dog slept on the rug," the word embeddings of
"black" and "brown" should be similar.
Cosine Similarity
Cosine similarity is used to verify the similarity between word embeddings.
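As a minimal illustration (with hypothetical toy vectors rather than real trained embeddings), cosine similarity can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 4-dimensional embeddings for "black" and "brown" (toy values)
black = np.array([[0.6, 0.1, 0.8, 0.3]])
brown = np.array([[0.5, 0.2, 0.9, 0.2]])

# Values close to 1 indicate similar embeddings
print(cosine_similarity(black, brown))
```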
Positional Encoding
Positional Encoding: Adds information about the position of a word in a
sequence to the word embedding vector.
The positional encoding vector pe(2) of the word "black" at position 2 could simply be added to its word embedding y1:

pc(black) = y1 + pe(2)

To ensure that the word embedding information is not lost, the word embedding vector is first scaled:

y1 * math.sqrt(d_model)

The final positional encoding vector is obtained by adding the scaled word embedding vector to the positional encoding vector:

pc(black) = y1 * math.sqrt(d_model) + pe(2)
The cosine similarity of the final positional encoding vectors reflects the combined
effect of word embedding and positional encoding:
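A minimal sketch of this combination, assuming the original Transformer's sine/cosine positional encoding and d_model = 512 (the embedding y1 below is a random stand-in for a real word embedding):

```python
import math
import numpy as np

d_model = 512

def positional_encoding(pos, d_model=512):
    # Original Transformer encoding: sine on even indices, cosine on odd indices
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

y1 = np.random.rand(d_model)               # stand-in word embedding for "black"
pe2 = positional_encoding(2)               # positional encoding for position 2
pc_black = y1 * math.sqrt(d_model) + pe2   # scaled embedding + position
print(pc_black.shape)                      # (512,)
```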
Each word is mapped to all other words to determine its fit in a sequence. For
example, in the sequence "The cat sat on the rug and it was dry-cleaned," the model
trains to determine if "it" relates to "cat" or "rug."
The d_model = 512 dimensions are split into 8 heads, each with d_k = 64 dimensions. This allows the 8 attention mechanisms to run in parallel, and their outputs are concatenated:

Z = (z0, z1, z2, z3, z4, z5, z6, z7)
Query vector (Q): dimension d_q = 64. Activated and trained when a word vector x_n seeks all of the key-value pairs of the other word vectors, including itself (self-attention).
Key vector (K): dimension d_k = 64. Trained to provide an attention value.
Value vector (V): dimension d_v = 64. Trained to provide another attention value.
import numpy as np
from scipy.special import softmax

# Step 1: the input — three token vectors of dimension 4 (toy example)
x = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]])
print(x)

Output:
[[1. 0. 1. 0.]
[0. 2. 0. 2.]
[1. 1. 1. 1.]]
print("w_query:\n", w_query)
print("w_key:\n", w_key)
print("w_value:\n", w_value)
Q = np.matmul(x, w_query)
K = np.matmul(x, w_key)
V = np.matmul(x, w_value)
print("Query:\n", Q)
print("Key:\n", K)
print("Value:\n", V)
Implement the scaled dot-product attention equation softmax(QKᵀ / √d_k). For this toy model, √d_k = √3, which is rounded down to 1.
k_d = 1
attention_scores = (Q @ K.transpose()) / k_d
print(attention_scores)
attention_scores[0] = softmax(attention_scores[0])
attention_scores[1] = softmax(attention_scores[1])
attention_scores[2] = softmax(attention_scores[2])
print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])
The complete attention equation is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
print("V[0]:\n", V[0])
print("V[1]:\n", V[1])
print("V[2]:\n", V[2])
The eight heads are concatenated and projected back to the model dimension:

MultiHead(output) = Concat(z0, z1, z2, z3, z4, z5, z6, z7) W0 = x, a vector of dimension d_model = 512
Post-Layer Normalization
Each attention sub-layer and each feedforward sub-layer of the Transformer is
followed by post-layer normalization (Post-LN). The Post-LN contains an add function
and a layer normalization process.
The input of LayerNorm is a vector v resulting from x + Sublayer(x). The dimension d_model = 512 is kept for every input and output of the Transformer, which standardizes all of the processes. The basic concept for v = x + Sublayer(x) can be defined by LayerNorm(v):

LayerNorm(v) = γ * (v − μ) / σ + β
| Variable | Definition |
| --- | --- |
| μ | Mean of v of dimension d: μ = (1/d) Σᵢ₌₁ᵈ vᵢ |
| σ | Standard deviation of v |
| γ | Scaling parameter |
| β | Bias vector |
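A minimal NumPy sketch of this formula (γ and β would be learned parameters in practice; here they are fixed for illustration):

```python
import numpy as np

def layer_norm(v, gamma=1.0, beta=0.0, eps=1e-6):
    # Normalize v to zero mean and unit variance, then scale and shift
    mu = v.mean()
    sigma = v.std()
    return gamma * (v - mu) / (sigma + eps) + beta

v = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(v))
```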
Decoder Stack
The decoder of the Transformer model is a stack of layers like the encoder stack. Each layer of the decoder has three sub-layers: a masked multi-head attention sub-layer, a multi-head attention sub-layer over the encoder's output, and a feedforward network.
Attention Layers
The Transformer is an auto-regressive model. It uses the previous output sequences
as an additional input. The multi-head attention layers of the decoder use the same
process as the encoder.
BERT Architecture
BERT introduces bidirectional attention to transformer models, utilizing only the
encoder blocks.
Encoder Stack
Dimensions of a head: z_A = d_model / A = 512 / 8 = 64 in the original Transformer.

BERT Models:

BERT_BASE: N = 12 encoder layers, d_model = H = 768, A = 12 heads

| Model | Layers (N) | Hidden size (H) | Heads (A) | Head dimension |
| --- | --- | --- | --- | --- |
| Original Transformer | 6 | 512 | 8 | 64 |
| BERT_BASE | 12 | 768 | 12 | 64 |
| BERT_LARGE | 24 | 1024 | 16 | 64 |
Example:
[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day[SEP]
Input Embeddings
Input embeddings are obtained by summing:
Token embeddings
Segment (sentence, phrase, word) embeddings
Positional encoding embeddings
1. Pretraining:
Defining the model's architecture.
Training the model on MLM and NSP tasks.
2. Fine-Tuning:
Initializing the downstream model with pretrained parameters.
Fine-tuning parameters for specific tasks.
3. Importing Modules:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
sentences = df.sentence.values
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values
MAX_LEN = 128
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
attention_masks = []
# Mark real tokens with 1.0 and padding tokens with 0.0
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
batch_size = 32
Batch Size
The batch size is set to 32.
batch_size = 32
This variable determines how many data samples are processed in each iteration of
training.
Data Loaders
DataLoaders are used to efficiently manage the data during training:
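A sketch of the standard PyTorch setup (it assumes the tensors created above and follows the usual BERT fine-tuning pattern; the exact arguments may differ from the original notebook):

```python
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Training data is sampled randomly; validation data is read sequentially
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
```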
from transformers import BertModel, BertConfig

configuration = BertConfig()
model = BertModel(configuration)
configuration = model.config
print(configuration)
Key Parameters

The BERT configuration includes several important parameters, such as vocab_size (30522), hidden_size (768), num_hidden_layers (12), num_attention_heads (12), intermediate_size (3072), and hidden_dropout_prob (0.1).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()
BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']
# Apply weight decay to all parameters except the bias and LayerNorm weights
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.1},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]
Weight Decay
The optimizer includes a weight decay rate to prevent overfitting. Parameters are
filtered to apply different weight decay rates:
Setting Hyperparameters
The hyperparameters are crucial for training.
Learning rate (lr) and warm-up rate (warmup) should be small initially and
gradually increased to avoid large gradients and overshooting.
Accuracy Measurement
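The notebook's accuracy helper compares the predicted class (the argmax over the logits) with the labels. A common implementation, reconstructed here as a sketch:

```python
import numpy as np

def flat_accuracy(preds, labels):
    # preds: logits of shape (batch, num_labels); labels: shape (batch,)
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
```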
Training Loop
Training Process
The training loop involves standard learning processes.
epochs = 4
train_loss_set = []

for _ in trange(epochs, desc="Epoch"):
    # ... training steps: forward pass, loss, backward pass, optimizer step ...

    # ... validation steps, executed for each validation batch ...
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1

print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))
The number of epochs is set to 4, with measurements for loss and accuracy at each
epoch. The train_loss_set stores loss and accuracy values for plotting.
Data Preparation
The program makes predictions using the holdout dataset. The data preparation
process from the training data is repeated:
Batch Predictions
The program runs batch predictions using the dataloader:
logits = logits['logits'].detach().cpu().numpy()
label_ids = b_labels.to('cpu').numpy()
predictions.append(logits)
true_labels.append(label_ids)
What is MCC?
The Matthews Correlation Coefficient (MCC) measures the quality of
binary classifications and can be modified into a multi-class correlation
coefficient.
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
MCC Implementation
MCC is imported from sklearn.metrics:

from sklearn.metrics import matthews_corrcoef

matthews_set = []
for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i], np.argmax(predictions[i], axis=1))
    matthews_set.append(matthews)
matthews_set
Aggregating Results
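The per-batch scores can then be aggregated into a single MCC over the whole holdout set. A sketch (variable names follow the code above):

```python
from sklearn.metrics import matthews_corrcoef
import numpy as np

# Flatten the per-batch predictions and labels, then score them as one set
flat_predictions = np.concatenate([np.argmax(p, axis=1) for p in predictions])
flat_true_labels = np.concatenate(true_labels)
print("MCC:", matthews_corrcoef(flat_true_labels, flat_predictions))
```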
Summary
Fine-tuning a pretrained model requires fewer resources than training from scratch.
Questions
1. BERT stands for Bidirectional Encoder Representations from Transformers.
(True)
2. BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning.
(True)
3. Fine-tuning a BERT model implies training parameters from scratch. (False)
4. BERT only pretrains using all downstream tasks. (False)
5. BERT pretrains with Masked Language Modeling (MLM). (True)
6. BERT pretrains with Next Sentence Prediction (NSP). (True)
7. BERT pretrains mathematical functions. (False)
8. A question-answer task is a downstream task. (True)
9. A BERT pretraining model does not require tokenization. (False)
10. Fine-tuning a BERT model takes less time than pretraining. (True)
Steps
KantaiBERT will be built in 15 steps. The first step is to load the dataset.
Three books by Immanuel Kant are compiled into a text file named kant.txt:
Training a Tokenizer
A tokenizer is trained using Hugging Face's ByteLevelBPETokenizer():
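A sketch of the training call, following the Hugging Face tokenizers API (the vocabulary size and special tokens shown here are the values typically used for this kind of RoBERTa-style model and are assumptions in these notes):

```python
from tokenizers import ByteLevelBPETokenizer

paths = ["kant.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
```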
Training produces two files: vocab.json, which contains the indices of the tokens, and merges.txt, which contains the merged substrings. The program first creates the KantaiBERT directory and then saves them:
import os
token_dir = '/content/KantaiBERT'
if not os.path.exists(token_dir):
    os.makedirs(token_dir)
tokenizer.save_model('KantaiBERT')
tokenizer = ByteLevelBPETokenizer(
"./KantaiBERT/vocab.json",
"./KantaiBERT/merges.txt",
)
!nvidia-smi
config = RobertaConfig(
vocab_size=52_000,
max_position_embeddings=512,
num_attention_heads=12,
num_hidden_layers=6,
type_vocab_size=1,
)
Re-creating the Tokenizer and Initializing the Model

This step imports RobertaTokenizer from the transformers library and initializes it with the tokenizer just trained and saved in "./KantaiBERT", while setting the maximum sequence length to 512.
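A minimal sketch of that initialization:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)
```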
The LEGO® type building blocks of transformers make it fun to analyze. For example,
you will note that dropout regularization is present throughout the sub layers.
The parameters are stored in a list, and you can examine their shapes and values:

LP = list(model.parameters())
lp = len(LP)
print(lp)

The output displays the number of parameter tensors, and iterating over the list displays each one, as shown in the excerpt output. The total number of parameters is calculated by taking all of the parameters in the model and adding them up; for example:
You will note that d_model = 768. There are 12 heads, so the dimension of each head is 768 / 12 = 64, again illustrating the LEGO-like concept of the building blocks of a transformer.
np_params = 0  # running total of parameters (renamed to avoid shadowing numpy's np)
for p in range(0, lp):
    PL2 = True
    try:
        L2 = len(LP[p][0])  # check if the tensor is 2D (a matrix)
    except:
        L2 = 1              # otherwise it is 1D (a vector)
        PL2 = False
    L1 = len(LP[p])
    L3 = L1 * L2            # number of elements in this tensor
    np_params += L3         # add to the running total
The parameters are matrices and vectors of different sizes; for example:
768 x 768
768 x 1
768
The output shows the number of parameters calculated for tensors in the model:
%%time
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
tokenizer=tokenizer,
file_path="./kant.txt",
block_size=128,
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)
training_args = TrainingArguments(
output_dir="./KantaiBERT",
overwrite_output_dir=True,
num_train_epochs=1,
per_device_train_batch_size=64,
save_steps=10_000,
save_total_limit=2,
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=dataset,
)
To start the training process, call the train() method on the trainer object.
%%time
trainer.train()
The output displays the training process in real time showing rate, epoch, and steps:
trainer.save_model("./KantaiBERT")
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT"
)
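The pipeline can then be queried with a masked sentence; the prompt below is an illustrative example:

```python
fill_mask("Human thinking involves human <mask>.")
```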
The output will likely change after each run because we are pretraining from scratch with a limited amount of data. However, the result is interesting because it shows the model beginning to form conceptual associations. The goal here was to see how to train a transformer model, and we can see that interesting human-like predictions can be obtained even with a small model and dataset.
1. Accuracy Score: the ratio of correct predictions to the total number of predictions.
2. F1-Score: the harmonic mean of precision and recall. It considers true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

Precision (P): the ratio of true positives to the total predicted positives:

Precision = TP / (TP + FP)

Recall (R): the ratio of true positives to the total actual positives:

Recall = TP / (TP + FN)

F1 = 2 × (Precision × Recall) / (Precision + Recall)
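These metrics are typically computed with scikit-learn; a minimal sketch with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # hypothetical gold labels
y_pred = [1, 0, 0, 1, 0, 1]   # hypothetical model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```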
Running a benchmark brings together three elements:

A model
A task (with its dataset)
A metric
SuperGLUE Benchmark
SuperGLUE is a benchmark designed to evaluate the performance of NLU models on
more difficult tasks.
SuperGLUE Tasks
The eight SuperGLUE tasks are presented in a ready-to-use list:
Requires the NLU model to choose the most plausible cause or effect
related to a given premise.
Premise: I knocked on my neighbor's door. What happened as a result?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his
Requires the model to read a premise and then examine a hypothesis built
on the premise.
The model must label the hypothesis as entailment, contradiction, or
neutral.
{"premise": "\"Did I ever tell you that's where Paul and I met?\"",
"hypothesis": "Susweca is where she and Paul met," "label": "entailment",
"idx": 77}
A question answering task where the model must select the correct
answer from multiple possible answers.
Involves filling in a blank in a query with the correct answer.
The sample contains four questions. To illustrate the task, we will just look into one of them. The model has to predict the correct labels. Notice how the information the model is asked to obtain is distributed throughout the text:
Query: The model must answer a question by finding the appropriate value for a
placeholder.
Answers:
"answers":[{"start":263,"end":271,"text":"Ashaninka"},{"start":601,"e
Examine a premise.
Examine a hypothesis.
Predict the label of entailment for the hypothesis.
"word": "place"
The model has to read two sentences containing the target word:
The training data specifies the sample index, the label value, and the start and
end indices for the target word in both sentences:
Text:
I poured water from the bottle into the cup until it was full.
"target":{"span2_index":
Example:
Classification = 1 for 'we yelled ourselves hoarse.'
Classification = 0 for 'we yelled ourselves.'
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
Paraphrase Recognition
This task involves determining whether two sentences in a sequence are paraphrases
of each other.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1)
For example, translating an English sentence with a pronoun into French, where
pronouns have grammatical genders, tests the Transformer's understanding of
the pronoun's reference.
Machine Translation
Machine translation is the process of reproducing human translation by
machine transductions and outputs.
The image illustrates the steps from the initial sentence to translate, through learning
parameters and machine transduction, to the final candidate translation.
1. Download the data: Use the French-English dataset from the European
Parliament Proceedings Parallel Corpus 1996-2011.
2. Load the data: Use standard Python libraries to load the raw text files.
import pickle
from pickle import dump
def load_doc(filename):
file = open(filename, mode='rt', encoding='utf-8')
text = file.read()
file.close()
return text
3. Split the data into sentences:

def to_sentences(doc):
    return doc.strip().split('\n')
4. Clean the data: Normalize the text, tokenize, convert to lowercase, remove punctuation, and filter out non-alphabetic tokens.

import re
import string
import unicodedata

def clean_lines(lines):
    cleaned = list()
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        line = line.split()
        line = [word.lower() for word in line]
        line = [word.translate(table) for word in line]
        line = [re_print.sub('', w) for w in line]
        line = [word for word in line if word.isalpha()]
        cleaned.append(' '.join(line))
    return cleaned
5. Save the cleaned data: Use pickle to serialize the cleaned data into files.
filename = 'English.pkl'
outfile = open(filename,'wb')
pickle.dump(cleanf,outfile)
outfile.close()
6. Create a vocabulary: Generate a frequency table for all words in the dataset.
7. Reduce vocabulary size: Remove infrequent words to avoid wasting the training
model's time.
Geometric Evaluations

BLEU performs a geometric evaluation of the candidate translation against the reference translations, combining the modified n-gram precisions p_n into a product:

P = ∏_{n=1}^{N} p_n
Chencherry Smoothing
Chen and Cherry (2014) introduced smoothing techniques to improve BLEU scores.
Smoothing is applied to softmax outputs in the Transformer.
Label Smoothing
Label smoothing introduces a value epsilon = ɛ to reduce overconfidence in
predictions.
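A minimal NumPy sketch of label smoothing applied to one-hot targets (epsilon = 0.1 is an illustrative value):

```python
import numpy as np

def label_smoothing(one_hot_targets, epsilon=0.1):
    # Keep 1 - epsilon on the true class and spread epsilon over all classes
    num_classes = one_hot_targets.shape[-1]
    return one_hot_targets * (1.0 - epsilon) + epsilon / num_classes

targets = np.array([[0.0, 1.0, 0.0, 0.0]])
print(label_smoothing(targets))   # [[0.025 0.925 0.025 0.025]]
```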
model = trax.models.Transformer(
input_vocab_size=33300,
d_model=512
)
With OpenAI's GPT-2 and GPT-3 models, we discover another way of assembling and training transformer models.

Questions

1. With OpenAI GPT-2 and GPT-3 models, we discover another way of assembling transformer models. (True/False)
2. There (True/False)
3. BLEU is the French word for blue and is the acronym of an NLP metric. (True/False)
4. Smoothing techniques enhance BERT. (True/False)
5. German-English is the same as English-German for machine translation. (True/False)
6. The original Transformer multi-head attention sub-layer has 2 heads. (True/False)
7. The original Transformer encoder has 6 layers. (True/False)
8. The original Transformer encoder has 6 layers but only 2 decoder layers. (True/False)
9. You can train transformers without decoders. (True/False)
To understand how such evolution happened, we will look first at one aspect
Before using GPT models, we need to stop and look at transformers from a project management perspective.

Reformer

Kitaev, Kaiser, and Levskaya (2020) designed the Reformer to address the attention and memory issues of the original Transformer.

The Reformer solves the attention issue with **Locality Sensitive Hashing (LSH)** attention, bucketing and chunking the sequence instead of attending over all of it.

LSH bucketing and chunking considerably reduce the computational cost of attention.

Schick and Schütze (2020) contend that a 223 million parameter transformer trained with Pattern-Exploiting Training (PET) can outperform the far larger GPT-3 on SuperGLUE.

> PET relies on the reformulation of training tasks to optimize the training process.

PET maps inputs to outputs via pattern-verbalizer pairs (PVPs). Each PVP contains a pattern that maps inputs to cloze-style questions and a verbalizer that maps outputs to tokens.

What will a project manager's decision be? We have seen the limits of the original Transformer and some ways around them. A project manager can:
* Refuse the limits of the original Transformer and tweak its architecture, as the Reformer's designers did.
* Use different training methods, such as PET or efficient knowledge distillation.
* Use a combination of these approaches.
* Design your own training methods and model architecture.
From the start, research teams, led by Radford et al., transitioned transformer models from fine-tuned, task-specific training toward task-agnostic models requiring little or no fine-tuning.

The goal was to generalize this concept to any type of downstream task once the model is pretrained.

* **Few-Shot (FS)**: The GPT is trained. When the model needs to make inferences, it is shown a few demonstrations of the downstream task as conditioning, but its weights are not updated.
* **One-Shot (1S)**: The trained GPT model is presented with only one demonstration of the downstream task, and no weight updates are allowed.
* **Zero-Shot (ZS)**: The trained GPT model is presented with no demonstration of the downstream task, only a description of it.

GPT models have the same structure as the decoder stacks of the original Transformer.

Radford et al. (2019) presented no less than four GPT models, and Brown et al. (2020) described no less than eight GPT-3 models.

This section will clone the OpenAI GPT-2 repository and download the 345M-parameter GPT-2 model.
Click on `src`, and you will see the Python files we need from OpenAI.

import os
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)
The expected output should confirm that TensorFlow 1.x is selected, specifically
version 1.15.2. If you encounter any TensorFlow errors, it's advisable to rerun the cell,
restart the VM, and rerun the cell again. This ensures the correct version is active, as
the default version on the VM is often tf.2.
import os
os.chdir("/content/gpt-2")
This image shows the file directory where the GPT-2 model is located. Inside the 345M
folder, you'll find the following files:
checkpoint
encoder.json
hparams.json
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
vocab.bpe
These files are crucial for the GPT-2 model to function correctly. The encoder.json and vocab.bpe files contain the tokenized vocabulary pairs. The checkpoint file records the trained checkpoint, and the trained parameters themselves are stored in the accompanying model.ckpt.data, model.ckpt.index, and model.ckpt.meta files.
!export PYTHONIOENCODING=UTF-8
import os
os.chdir("/content/gpt-2")
import json
import numpy as np
import tensorflow as tf
These steps are essential for defining and activating the model.
model_name: "345M"
seed: None
nsamples: 1
batch_size: 1
length: 300
temperature: 1
top_k: 0
This will prompt you to enter some context. For example, you can use a sentence by
Emmanuel Kant:
Observations
The entered context conditions the output generated by the model.
The model learns from the context without modifying its parameters.
Text completion is conditioned by transformer models without fine-tuning.
The grammatical structure of the output is usually convincing.
2. Upload Python Files: Upload the following files to Google Colaboratory using
the file manager:
train.py
load_dataset.py
encode.py
accumulate.py
memory_saving_gradients.py
These files can be sourced from N Shepperd's GitHub repository or the book's
GitHub repository.
3. Install Requirements:
import os
os.chdir("/content/gpt-2")
!pip3 install -r requirements.txt
!pip install toposort
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)
Restart the VM and rerun the cell to ensure you are running the VM with TensorFlow
1.x.
import os
os.chdir("/content/gpt-2")
!python download_model.py '117M'
import os
!cp /content/train.py /content/gpt-2/src/
!cp /content/load_dataset.py /content/gpt-2/src/
!cp /content/encode.py /content/gpt-2/src/
!cp /content/accumulate.py /content/gpt-2/src/
!cp /content/memory_saving_gradients.py /content/gpt-2/src/
import numpy as np
np.savez(args.out_file, *chunks)
import os
model_name = "117M"
import os
os.chdir("/content/gpt-2/src/")
!python train.py --model_name=117M
The training will continue until you manually stop it, with checkpoints saved in
/content/gpt-2/src/checkpoint/run1 after every 1,000 steps.
import os
run_dir = '/content/gpt-2/models/tgmodel'
if not os.path.exists(run_dir):
    os.makedirs(run_dir)
import os
!mv /content/gpt-2/models/117M /content/gpt-2/models/117M_OpenAI
!mv /content/gpt-2/models/tgmodel /content/gpt-2/models/117M
import os
!python generate_unconditional_samples.py --model_name=117M
import os
!python interactive_conditional_samples.py --model_name=117M
Enter a context, such as the Emmanuel Kant paragraph from before, and observe the
generated text.
Examples of T5 Prefixes
"translate English to German: + [sequence]" for translations
"cola sentence: + [sequence]" for The Corpus of Linguistic Acceptability (CoLA)
This unified input format leads to a transformer model that produces a result
sequence no matter which problem it has to solve in the Text-To-Text Transfer
Transformer (T5).
The T5 model unifies the input and output of many NLP tasks.
T5 Model Architecture
The T5 model utilizes the original Transformer model architecture.
Models: pretrained models ready to use or fine-tune.
Datasets: used for training and testing.
Metrics: used to evaluate model performance.
Base: Baseline model, similar to BERTBASE with 12 layers and ~220 million
parameters.
Small: Smaller model with 6 layers and ~60 million parameters.
3B and 11B: Use 24 layer encoders and decoders with ~2.8 and 11 billion
parameters, respectively.
This image demonstrates how to use a model directly from the Transformers library.
Here is how to import the T5-large conditional generation model and the tokenizer:
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
device = torch.device('cpu')
if display_architecture == True:
    print(model.config)

This will output the model's basic parameters, such as the number of heads and layers. The T5 transformer in this case has 16 heads and 24 layers.
{
"early_stopping": true,
"length_penalty": 2.0,
"num_beams": 4
}
The JSON snippet shows a configuration for a summarization task, including settings
for early stopping, length penalty, and beam search.
if display_architecture == True:
    print(model)
This allows you to examine the encoder and decoder stacks, attention sub-layers, and
feedforward sub-layers.
The image visually represents a PyTorch neural network architecture, detailing the
layers and their configurations, including self-attention mechanisms and feed-forward
networks.
Summarizing Documents
To create a summarization function:
def summarize(text, ml):
    # Remove line breaks and prepend the T5 summarization task prefix
    preprocess_text = text.strip().replace("\n", "")
    task_prefix = "summarize: " + preprocess_text
    input_ids = tokenizer.encode(task_prefix, return_tensors="pt").to(device)
    summary_ids = model.generate(input_ids,
                                 num_beams=4,
                                 no_repeat_ngram_size=2,
                                 min_length=30,
                                 max_length=ml,
                                 early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output
The summarize function preprocesses the input text, applies the T5 task prefix
"summarize: ", encodes the text to token IDs, generates a summary, and decodes the
output.
Summarization Examples
Example Usage with the Declaration of Independence:
text = """The United States Declaration of Independence was the first Etext
summary = summarize(text, 50)
print("Number of characters:",len(text))
print ("\n\nSummarized text: \n",summary)
Observations
T5 may sometimes shorten the input text instead of providing a comprehensive
summary. This highlights the challenges NLP models face with certain texts. To
improve results, consider using longer texts, different parameters, larger models, or
modifying the T5 model's structure.
Step 1: Preprocessing
Step 2: Post-processing
Word2Vec Tokenization
Polysemy is when a word can have several meanings.
Sometimes, pretrained tokenizers miscalculate word pairs because some word pairs
just don't fit together.
Word2Vec Tokenization
Let's start by tokenizing text from text.txt and training a Word2Vec model.
#@title Prerequisites
!pip install --upgrade gensim
import nltk
nltk.download('punkt')
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action = 'ignore')
The dataset (text.txt) contains the American Declaration of Rights, the Magna Carta,
and the works of Emmanuel Kant.
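A sketch of the training call that produces the configuration reported below (the parameter names follow Gensim 3.x, where the embedding size argument is `size`; the file name and hyperparameters are as described in the text):

```python
# Read and tokenize the dataset into a list of tokenized sentences
sample = open("text.txt", "r", encoding="utf-8")
s = sample.read()
f = s.replace("\n", " ")

data = []
for sent in sent_tokenize(f):
    data.append(word_tokenize(sent.lower()))

# Train a skip-gram (sg=1) Word2Vec model with 512-dimensional embeddings
model2 = gensim.models.Word2Vec(data, min_count=1, size=512, window=5, sg=1)
print(model2)
```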
Here, window=5 limits the distance between the current and the predicted word, and sg=1 selects the skip-gram training algorithm. The output shows the vocabulary size, an embedding dimensionality of 512, and a learning rate of 0.025:

Word2Vec(vocab=10816, size=512, alpha=0.025)
This function returns the cosine similarity value, which ranges between -1 and 1 (close word embeddings score near 1). If a word is unknown, it prints a KeyError message and returns 0.
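A sketch of such a similarity function, reconstructed under the assumptions above (model2 is the trained Word2Vec model; the exact message formatting is hypothetical):

```python
def similarity(word1, word2):
    # Look up both embeddings and compare them with cosine similarity.
    try:
        a = model2.wv[word1]
        b = model2.wv[word2]
    except KeyError as e:
        print("[unk] key error:", e)
        return 0
    return cosine_similarity([a], [b])[0][0]
```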
This case is acceptable, but results may vary based on the dataset's content, size, and
Gensim versions.
In this case, the word "corporations" is not in the dictionary, leading to an unknown
token [unk] and a similarity score of 0.
word1="etext";word2="declaration"
print("Similarity",similarity(word1,word2),word1,word2)
The cosine similarity exceeds 0.8. While seemingly good, "etext" refers to Project
Gutenberg's preface and could produce erroneous natural language inferences.
Rare words can be medical, legal, or engineering terms, slang, or words from older
texts. Managing rare words is crucial for applications beyond trivial uses.
word1="judiciaire";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)
word1="justiciar";word2="judge"
print("Similarity",similarity(word1,word2),word1,word2)
#Bill of Rights,V
text ="""No person shall be held to answer for a capital, or otherwise infa
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
Since the grammatical structure of the Bill of Rights is outdated, it helps to modernize
it:
text =""" A person must be indicted by a Grand Jury for a capital or infamo
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)
Key Takeaways
It doesn't matter if the dataset is enormous. If the tokenization process fails, even
partly, the transformer model we are running will miss critical tokens.
Questions
1. False: A tokenized dictionary does not contain every word that exists in a
language.
2. True
3. False: It is not always good to have obscene data in datasets.
4. True
5. False: A standard pretrained tokenizer does not contain the English vocabulary
of the past 700 years.
6. True
7. True
8. True
This format is sufficient for training a BERT model to identify and label roles in a
sentence.
This chapter is self-contained, allowing you to read through it or run the samples as
described.
Basic Samples
Sample 1
Consider the following sentence:
Did Bob really think he could prepare a meal for 50 people in only a few
hours?
The transformer identifies the verb "think". The raw output excerpt shows:
"verbs": [{"verb": "think", "description": "Did [ARG0: Bob] really] [V: thi
When run in the AllenNLP online tool, it provides a visual representation of the SRL
task.
The transformer then moved to the verb "prepare," labeled it, and analyzed the
context:
The transformer produced the following raw output for each verb:
think: Did [ARG0: Bob] [ARGM-ADV: really] [V: think] [ARG1: he could prepare a
could: Did Bob really think he [V: could] in only a few hours ?
prepare: Did Bob really think [ARG0: he] [ARGM-MOD: could] [V: prepare] a meal
Sample 2
Consider the sentence:
Mrs. And Mr. Tomaso went to Europe for vacation and visited Paris and
first went to visit the Eiffel Tower.
To test this, run the following code in the Sample 2 cell of the SRL.ipynb notebook:
!echo '{"sentence": "Mrs. And Mr. Tomaso went to Europe for vacation and vi
allennlp predict https://storage.googleapis.com/allennlp-public-models/ be
"verbs": [{"verb": "went", "description": "[ARG0: Mrs. and Tomaso] [V: went
It correctly identified the purpose of the trip as the modifier of the verb "went" and
associated "went" with "Europe." It also identified the verb "visit" as being related to
"Paris".
The transformer correctly split the sequence and produced an excellent result:
It found that "first" was a temporal modifier of the verb "went." The AllenNLP
interface provides the following output:
went: [ARG0: Mrs. and Mr. Tomaso] [V: went] [ARG4: to Europe] [ARGMPRP: for va
visited: [ARG0: and Tomaso] went to Europe for vacation and[V: visited] [ARG1:
went: [ARG0: Mrs. and Mr. Tomaso] went to Europe for vacation andEiffel Tower]
Sample 3
Consider the sentence:
John wanted to drink tea, Mary likes to drink coffee but Karim drank some
cool water and Faiza would like to drink tomato juice.
To test this, run the following code in the Sample 3 cell of the SRL.ipynb notebook:
!echo '{"sentence": "John wanted to drink tea, Mary likes to drink coffee b
allennlp predict https://storage.googleapis.com/allennlp-public-models/ be
When run on the AllenNLP online tool, the first representation is perfect, identifying
the verb "wanted" correctly:
The presence of "some cool water and" is not an argument of like, only "Faiza" is. The
AllenNLP output confirms the problem:
wanted: [ARG0: John] [V: wanted] [ARG1: to drink tea] , Mary likes drink coffe
drink: [ARG0: John] wanted to [V: drink] [ARG1: tea] , Mary likes drink coffee
likes: John wanted to drink tea , [ARG0: Mary] [V: likes] [ARG1: to drink coff
drank: John wanted to drink tea , Mary likes to drink coffee but [ARG0: Karim]
would: John wanted to drink tea , Mary likes to drink coffee but Karim drank s
like: John wanted to drink tea , Mary likes to drink coffee but Karim drank [A
drink: John wanted to drink tea , Mary likes to drink coffee but Karim drank s
One of the arguments for the verb "like" is "Karim drank some cool water and Faiza,"
which is confusing.
Difficult Samples
Sample 4
Consider the complex sentence:
!echo '{"sentence": "Alice, whose husband went jogging every Sunday, liked
allennlp predict https://storage.googleapis.com/allennlp-public-models/bert
[ARG0: Alice , whose husband went jogging every Sunday] , [V: liked]
The verb "jogging" was identified and related to "whose husband" with the temporal
modifier "every Sunday." The transformer then detects:
[ARG0: Alice , whose husband went jogging every Sunday] , [V: liked] [ARG1
The temporal modifier "in the meantime" was also identified. Finally, the transformer
identifies the last verb, "dancing":
went: [ARG1: whose husband] [V: went] [ARG2: jogging] [ARGM-TMP: every Sunday]
liked: [ARG0: Alice , whose husband went jogging every Sunday] , [V: liked] [A
go: [ARG0: Alice , whose husband went jogging every Sunday] , liked to [V: go]
dancing: Alice , whose husband went jogging every Sunday , liked to go to a [V
Sample 5
Consider the sentence:
The bright sun, the blue sky, the warm sand, the palm trees, everything
round off.
!echo '{"sentence": "The bright sun, the blue sky, the warm sand, the palm
allennlp predict https://storage.googleapis.com/allennlp-public-models/bert
"words": ["The", "bright", "sun", ",", "the", "blue", "sky", ",", "the", "w
!echo '{"sentence": "The bright sun, the blue sky, the warm sand, the palm
allennlp predict https://storage.googleapis.com/allennlp-public-models/bert
"verbs": [{"verb": "rounds", "description": "[ARG1: The bright sun, the blu
Sample 6
Consider the sentence:
Summary
SRL tasks are difficult for both humans and transformer models.
Transformers can reach human baselines.
A transformer trained with a "sentence + predicate" input can solve simple and
complex problems.
The limits are reached with rare verb forms.
The Allen Institute for AI has made many free AI resources available, emphasizing that explaining AI is essential.
Transformers will continue to improve NLP standardization through distributed
architecture and input formats.
NER Method
nlp_ner = pipeline("ner")
The traffic began to slow about five miles out of Los Angeles, making it
difficult to get onto Pioneer Boulevard. WBGO was playing some cool jazz,
and the weather was cool, making it rather pleasant to be making it out of
the city on this Friday afternoon. Nat King Cole was singing as Jo and Maria
slowly made their way out of Pasadena and drove toward Barstow. They
planned to get to Las Vegas early in the evening to have a nice dinner and
go see a show.
print(nlp_ner(sequence))
The output:
Templates like "Where is [I-LOC]?" or "Where is [I-LOC] located?" are then used to
generate questions automatically.
nlp_qa = pipeline('question-answering')
print("Question 1.", nlp_qa(context=sequence, question='Where is Pioneer Bo
print("Question 2.", nlp_qa(context=sequence, question='Where is Los Angele
The output shows the score, start, and end positions of the answer, as well as the
answer itself.
1. Easy Project:
Creating a website for an elementary school.
Displaying the answers to automatically generated questions on a
webpage.
Allowing a teacher to finalize a multiple-choice questionnaire.
2. Intermediate Project:
Encapsulating the transformer's automatic questions and answers in a
program.
Using an API to check and correct the answers automatically.
Storing wrong answers for further analysis.
3. Difficult Project:
Implementing an intermediate project in a chatbot with follow-up
questions.
Example: If the transformer identifies that Pioneer Boulevard is in Los
Angeles, the chatbot could ask, "near where in LA?"
The transformer's output may vary from one run to the next, but it is clear that the transformer faced issues with the person-entity questions. It is possible to see what went wrong and find an explanation. The sequence is run on AllenNLP in the Semantic Role Labeling section to obtain a visual representation.
Here is an example of how the sentence "they drove slowly toward Barstow" can be
broken down using semantic role labeling. This particular diagram focuses on the
verb "drove".
The diagram shows the verb "drove" with its arguments: "they" (ARG0), "slowly"
(manner), and "toward Barstow" (ARG1).
AllenNLP is used to rerun the sequence in the Semantic Role Labeling demo. The BERT-base model identifies the predicates in the sequence. For example: verbs={"began," "slow," "making"(1), "playing," "making"(2), "making"(3), "singing," "made," "drove," "planned," "go," "see"}
Coreference Resolution
Coreference resolution is introduced as a method to help models identify the main
subjects in a sequence. This can be added as a pretraining or postprocessing task.
Key Takeaways
Question-answering isn't as easy as it seems initially.
Designing a question generator is a productive solution.
NER and SRL are useful for finding and extracting content.
Implementing transformers requires well-prepared multi-task training and
heuristics implemented in classical code.
Review Questions
Introduction
Sentiment analysis relies on the principle of compositionality. This chapter explores how transformer models handle sentiment analysis, especially with complex sentences.
The output displays the architecture of the RoBERTa-large model, the output logits,
the tokens themselves, and the final output label.
def classify(sequence, M):
    # M=1 also prints the configuration of the underlying model
    nlp_cls = pipeline('sentiment-analysis')
    if M == 1:
        print(nlp_cls.model.config)
    return nlp_cls(sequence)
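A usage sketch with a hypothetical input sequence (it assumes the pipeline import from the transformers library shown earlier in the chapter):

```python
seq = "The battery on this phone doesn't last long at all."
print(classify(seq, 1))   # prints the model config, then the predicted label and score
```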
The image above depicts the Hugging Face website, a platform for natural language
processing and machine learning models. Hugging Face allows you to search for and
test various pretrained models. The default sort mode is based on the number of
downloads. Let's explore some transformer models for text classification.
Though the customer seemed unhappy, she was, in fact, satisfied but
thinking of something else at the time, which gave a false impression.
The image above shows the output of a complex sequence classification task. The
model may produce a false negative, which doesn't necessarily indicate a
malfunction. It might suggest the need for a different model or further training.
Though the customer seemed unhappy</s></s> she was, in fact satisfied thi
You can find this model on the Hugging Face website or implement it using the
following code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
Cognitive Dissonance
Cognitive Dissonance: The mental discomfort experienced when holding
conflicting beliefs, values, or attitudes.
This state arises when tensions build between contradictory thoughts, leading to
nervousness and agitation.