

Getting Started with the Model Architecture of the Transformer

Background of the Transformer


The Transformer architecture has revolutionized Natural Language Understanding
(NLU), a subset of Natural Language Processing (NLP). This shift is pivotal in the
expanding digital economy, where AI-driven language understanding supports
various applications, including:

Language modeling
Chatbots
Personal assistants
Question answering
Text summarization
Speech-to-text
Sentiment analysis
Machine translation

The Transformer architecture marks a break from past approaches using Recurrent
Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).

The Rise of the Transformer: Attention Is All You Need


The Transformer's architecture is revolutionary because it changed the way we think
about NLP and AI. It's also disruptive, as it progressively replaces NLP as it was
known before its arrival.

The chapter then works through the remaining components and topics in order:

The encoder stack: input embedding, positional encoding, the multi-head attention sub-layer, and the feedforward network sub-layer.
The decoder stack: output embedding and position encoding, the attention layers, the FFN sub-layer, Post-LN, and the linear layer.
Training and performance, followed by concluding remarks, a chapter summary, review questions, and references.

Overview of the Transformer Model Architecture
This study guide delves into the architecture of the Transformer model, a
groundbreaking innovation in Natural Language Processing (NLP). It covers the
historical context leading to the Transformer, its key components, and the mechanisms
that allow it to outperform previous models like Recurrent Neural Networks (RNNs)
and Convolutional Neural Networks (CNNs).

Background of the Transformer

NLP's Evolution
The Transformer's rise marks a significant shift in NLP, overcoming limitations of
earlier methods. Key figures and concepts that paved the way include:


Andrey Markov: Introduced the concept of stochastic processes, predicting the next element in a sequence based on the previous one. His work laid the foundation for Markov Decision Processes (MDPs), Markov Chains, and Markov Processes in AI.
Claude Shannon: Developed a probabilistic approach to sequence modeling, establishing a communication model with a source encoder, transmitter, and receiver decoder.
Alan Turing: Implemented early forms of artificial intelligence in the 1940s to decode encrypted messages.
Georgetown-IBM Experiment (1954): Attempted to translate Russian sentences into English using a rule system.
John Hopfield (1982): Introduced Recurrent Neural Networks (RNNs), inspired by W.A. Little's work.
Yann Le Cun (1980s-1990s): Designed Convolutional Neural Networks (CNNs) and LeNet-5, widely used for sequence transduction and modeling.

Limitations of RNNs and CNNs


Despite their prevalence, RNNs and CNNs have drawbacks:

High computational cost for long sequences.


Difficulties in capturing long-term dependencies.

Emergence of Attention
The concept of attention, which involves "peeking" at other tokens in a sequence, was
added to RNN and CNN models to improve their performance.

The Transformer's Impact


In December 2017, the Transformer model was introduced, revolutionizing the field
by achieving superior results on standard datasets.

The Rise of the Transformer: Attention Is All You Need


The Original Transformer Model


The original Transformer model, introduced by Vaswani et al., consists of a stack of 6
layers.

Encoder Stack: Processes inputs through attention and feedforward network sub-layers.
Decoder Stack: Processes target outputs through attention and feedforward network sub-layers.

The key innovation is the replacement of recurrence with attention mechanisms, which relate each word to all other words in the sequence.

Attention Mechanism: A "word-to-word" operation that determines the relationships between words in a sequence by running dot products between word vectors.

Multi-Head Attention
The Transformer model uses multi-head attention, running eight attention
mechanisms in parallel. This provides:

In-depth sequence analysis.


Reduced calculation operations by precluding recurrence.
Parallelization, reducing training time.
Diverse perspectives of the same input sequence.

The Encoder Stack


Each layer of the encoder stack consists of:

A multi-headed attention mechanism
A fully connected position-wise feedforward network

Residual connections surround each sub-layer, ensuring key information is preserved.

Residual Connection: Transports the unprocessed input of a sub-layer to a layer normalization function. The normalized output of each sub-layer is calculated as:


LayerNormalization (x + Sublayer(x))

The layers are structurally identical but learn different associations of tokens in the
sequence.

Constant Dimensionality
The output of every sub-layer has a constant dimension, denoted as d_model, which equals 512 in the original Transformer architecture. This consistency optimizes calculations and information flow.

Sub-Layers and Mechanisms

Input Embedding
The input embedding sub-layer converts input tokens into vectors of dimension d_model = 512.

1. Tokenization: A tokenizer transforms a sentence into tokens. For example, the sentence "the Transformer is an innovative NLP model!" becomes ['the', 'transform', 'er', 'is', 'an', 'innovative', 'n', 'l', 'p', 'model', '!'].
2. Integer Representation: Tokens are represented as integers. For example, the sentence "The cat slept on the couch. It was too tired to get up." becomes [1996, 4937, 7771, 2006, 1996, 6411, 1012, 2009, 2001, 2205, 5458, 2000, 2131, 2039, 1012].
3. Embedding: The Transformer uses a learned embedding sub-layer, such as the skip-gram architecture of word2vec.

Skip-Gram Model
Skip-gram: Focuses on a center word in a window of words and predicts
context words. For example, given the sentence "The black cat sat on the
couch and the brown dog slept on the rug," the word embeddings of
"black" and "brown" should be similar.

Example word embeddings:


black=[[-0.01206071 0.11632373 ... -0.04639162]]


brown=[[ 1.35794589e-02 -2.18823571e-02 ... -4.90022525e-02]]

Cosine Similarity
Cosine similarity is used to verify the similarity between word embeddings.

cosine_similarity(black, brown)= [[0.9998901]]
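As an illustration (this sketch is not from the book's notebook), such a check can be run with scikit-learn's cosine_similarity function; the low-dimensional vectors below are made-up stand-ins for the 512-dimensional word2vec embeddings:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical low-dimensional embeddings standing in for the word2vec vectors
# of "black" and "brown"; similar words point in similar directions.
black = np.array([[0.12, -0.05, 0.33, 0.07]])
brown = np.array([[0.11, -0.04, 0.35, 0.06]])

# cosine_similarity expects 2D arrays of shape (n_samples, n_features)
print(cosine_similarity(black, brown))  # close to 1.0 for near-parallel vectors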

Positional Encoding
Positional Encoding: Adds information about the position of a word in a
sequence to the word embedding vector.

Sine and Cosine Functions


The Transformer uses sine and cosine functions to generate positional encodings:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

pos is the position of the word in the sequence.
i is the index of the dimension in the word embedding vector.
d_model is the dimension of the embedding vector (e.g., 512).
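As a minimal illustration (not part of the original notebook), the two formulas can be evaluated for one position with numpy; positional_encoding is an assumed helper name:

import math
import numpy as np

def positional_encoding(pos, d_model=512):
    # Return the positional encoding vector of one position,
    # using the same exponent form as the code shown later in this section.
    pe = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** ((2 * i) / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe

print(positional_encoding(2)[:2])   # first values of PE(2):  [0.909..., -0.416...]
print(positional_encoding(10)[:2])  # first values of PE(10): [-0.544..., -0.839...]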

Example positional encoding vectors:

PE(2)= [[ 9.09297407e-01 -4.16146845e-01 ... 2.14921574e-08 1.00000000e+00]]


PE(10)= [[-5.44021130e-01 -8.39071512e-01 ... 1.07460785e-07 1.00000000e+00]]

Cosine Similarity of Positional Encoding


The cosine similarity between positional encodings of different positions is lower than
that of word embeddings:


cosine_similarity(pos(2), pos(10)) = [[0.8600013]]

Adding Positional Encoding to Word Embedding


The positional encoding vector is added to the word embedding vector:

pc(black)=y1+pe(2)

To ensure that the word embedding information is not lost, the word embedding
vector can be scaled:

y1*math.sqrt(dmodel)

The final positional encoding vector is obtained by adding the scaled word embedding
vector to the positional encoding vector:

import math
import numpy as np

d_model = 512
pos = 2                        # position of "black" in the sequence
# y is assumed to be the 1 x 512 word embedding vector of "black"
pe = np.zeros((1, d_model))    # positional encoding vector
pc = np.zeros((1, d_model))    # positional-encoded word embedding
for i in range(0, 512, 2):
    pe[0][i] = math.sin(pos / (10000 ** ((2 * i) / d_model)))
    pc[0][i] = (y[0][i] * math.sqrt(d_model)) + pe[0][i]
    pe[0][i + 1] = math.cos(pos / (10000 ** ((2 * i) / d_model)))
    pc[0][i + 1] = (y[0][i + 1] * math.sqrt(d_model)) + pe[0][i + 1]

Example final positional encoding vector:

pc(black)= [[ 9.09297407e-01 -4.16146845e-01 ... 2.14921574e-08 1.00000000e+00]]

The cosine similarity of the final positional encoding vectors reflects the combined effect of word embedding and positional encoding:

cosine_similarity(pc(black), pc(brown)) = [[0.9627094]]

Summary of Cosine Similarity


| Category | Cosine Similarity Value |
|---|---|
| Word Similarity | [[0.99987495]] |
| Positional Encoding Similarity | [[0.8600013]] |
| Final Positional Encoding Similarity | [[0.9627094]] |

Multi-Head Attention Sub-Layer


The output of positional encoding is the multi-head attention sub-layer. This sub-
layer contains eight heads and is followed by post-layer normalization, which adds
residual connections to the output and normalizes it.

Architecture of Multi-Head Attention


The input of the multi-head attention sub-layer of the first layer of the encoder stack is a vector that contains the embedding and the positional encoding of each word. The dimension of each word x_n in an input sequence is d_model = 512:

pe(x_n) = [d_1 = 9.09297407e-01, d_2 = 9.09297407e-01, ..., d_512 = 1.00000000e+00]

Each word is mapped to all other words to determine its fit in a sequence. For example, in the sequence "The cat sat on the rug and it was dry-cleaned," the model trains to determine whether "it" relates to "cat" or "rug."

Instead of using the d_model = 512 dimensions directly, a better approach is to divide these dimensions into 8 heads, each with d_k = 64 dimensions. This allows running the 8 "heads" in parallel, speeding up training and obtaining 8 different representation subspaces of how each word relates to another.

The output of each head is a matrix z_i with a shape of x × d_k. The output of the multi-attention heads is Z, defined as:

Z = (z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7)

Z must be concatenated so that the output is a single x × d_model matrix:

MultiHead(output) = Concat(z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7) = x, d_model

Each head h_n has three representations:


Query vector (Q): has a dimension of d_q = 64. It is activated and trained when a word vector x_n seeks all the key-value pairs of the other word vectors, including itself (self-attention).
Key vector (K): has a dimension of d_k = 64. It is trained to provide an attention value.
Value vector (V): has a dimension of d_v = 64. It is trained to provide another attention value.

Attention is defined as "Scaled Dot-Product Attention":

$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

To obtain Q, K, and V, the model is trained with the respective weight matrices Q_w, K_w, and V_w, which have d_k = 64 columns and d_model = 512 rows. For example, Q is obtained by a dot product between x and Q_w, resulting in a dimension of d_k = 64.

Multi-Head Attention Implementation in Python


To visualize the model in code, basic Python code using numpy and a softmax function
is used in 10 steps to run the key aspects of the attention mechanism.

Step 1: Represent the Input


The input is scaled down to d_model = 4 instead of d_model = 512 for easier visualization.

import numpy as np
from scipy.special import softmax

x = np.array([[1.0, 0.0, 1.0, 0.0],   # Input 1
              [0.0, 2.0, 0.0, 2.0],   # Input 2
              [1.0, 1.0, 1.0, 1.0]])  # Input 3
print(x)

Output:

[[1. 0. 1. 0.]
 [0. 2. 0. 2.]
 [1. 1. 1. 1.]]


Step 2: Initialize the Weight Matrices


Each input has three weight matrices: Q_w to train the queries, K_w to train the keys, and V_w to train the values. The matrices are scaled down to d_k = 3.

w_query = np.array([[1, 0, 1],
                    [1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 1]])

w_key = np.array([[0, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 0]])

w_value = np.array([[0, 2, 0],
                    [0, 3, 0],
                    [1, 0, 3],
                    [1, 1, 0]])

print("w_query:\n", w_query)
print("w_key:\n", w_key)
print("w_value:\n", w_value)

Step 3: Matrix Multiplication to Obtain Q, K, V


Multiply the input vectors by the weight matrices to obtain the query, key, and value
vectors.

Q = np.matmul(x, w_query)
K = np.matmul(x, w_key)
V = np.matmul(x, w_value)

print("Query:\n", Q)
print("Key:\n", K)
print("Value:\n", V)

Step 4: Scaled Attention Scores

Implement the scaled dot-product attention equation $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)$. For this scaled-down model, $\sqrt{d_k} = \sqrt{3} \approx 1.73$, which is rounded down to 1 (k_d = 1 in the code below).


k_d = 1
attention_scores = (Q @ K.transpose()) / k_d
print(attention_scores)

Step 5: Scaled Softmax Attention Scores for Each Vector


Apply a softmax function to each intermediate attention score.

attention_scores[0] = softmax(attention_scores[0])
attention_scores[1] = softmax(attention_scores[1])
attention_scores[2] = softmax(attention_scores[2])

print(attention_scores[0])
print(attention_scores[1])
print(attention_scores[2])

Step 6: The Final Attention Representations

Finalize the attention equation by plugging in V:

$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

print("V[0]:\n", V[0])
print("V[1]:\n", V[1])
print("V[2]:\n", V[2])

attention1 = attention_scores[0][0] * V[0]


attention2 = attention_scores[0][1] * V[1]
attention3 = attention_scores[0][2] * V[2]

print("Attention 1:\n", attention1)


print("Attention 2:\n", attention2)
print("Attention 3:\n", attention3)

Step 7: Summing Up the Results


Sum the 3 attention values of input #1 to obtain the first line of the output matrix.


attention_input1 = attention1 + attention2 + attention3


print(attention_input1)

Step 8: Steps 1 to 7 for All the Inputs

Assume the transformer produces the attention values of input #2 and input #3 using the same method as in Steps 1 to 7. We then assume that we have 3 attention values (one per input) with learned weights of 64 dimensions each:

attention_head1 = np.random.random((3, 64))


print(attention_head1)

Step 9: The Output of the Heads of the Attention Sub-Layer

Assume we have trained the 8 heads of the attention sub-layer. The transformer now has 3 output vectors (for the 3 input vectors that are words or word pieces) of 64 dimensions each:

z0h1 = np.random.random((3, 64))


z1h2 = np.random.random((3, 64))
z2h3 = np.random.random((3, 64))
z3h4 = np.random.random((3, 64))
z4h5 = np.random.random((3, 64))
z5h6 = np.random.random((3, 64))
z6h7 = np.random.random((3, 64))
z7h8 = np.random.random((3, 64))

print("shape of one head", z0h1.shape, "dimension of 8 heads", 64*8)

Step 10: Concatenation of the Output of the Heads

The Transformer concatenates the 8 elements of Z:

$MultiHead(output) = Concat(z_0, z_1, z_2, z_3, z_4, z_5, z_6, z_7)W^O = x, d_{model}$

output_attention = np.hstack((z0h1, z1h2, z2h3, z3h4, z4h5, z5h6, z6h7, z7h8))
print(output_attention)

Post-Layer Normalization

Each attention sub-layer and each feedforward sub-layer of the Transformer is followed by post-layer normalization (Post-LN). The Post-LN contains an add function and a layer normalization process.

Post-LN can be described as LayerNorm(x + Sublayer(x)), where:

Sublayer(x) is the sub-layer itself.
x is the information available at the input step of Sublayer(x).

The input of LayerNorm is a vector v resulting from x + Sublayer(x). d_model = 512 for every input and output of the Transformer, which standardizes all the processes. The basic concept for v = x + Sublayer(x) can be defined by LayerNorm(v):

$LayerNorm(v) = \gamma \frac{v - \mu}{\sigma} + \beta$

| Variable | Definition |
|---|---|
| $\mu$ | Mean of v of dimension d: $\mu = \frac{1}{d}\sum_{i=1}^{d} v_i$ |
| $\sigma$ | Standard deviation of v of dimension d: $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (v_i - \mu)^2$ |
| $\gamma$ | Scaling parameter |
| $\beta$ | Bias vector |
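As an illustration only (not the book's code), a minimal numpy sketch of this post-layer normalization, assuming gamma and beta are the learned scaling and bias parameters:

import numpy as np

def layer_norm(v, gamma=1.0, beta=0.0, eps=1e-6):
    # LayerNorm(v) = gamma * (v - mean) / std + beta, over the last dimension
    mu = v.mean(axis=-1, keepdims=True)
    sigma = v.std(axis=-1, keepdims=True)
    return gamma * (v - mu) / (sigma + eps) + beta

# x: sub-layer input, sublayer_x: sub-layer output; Post-LN normalizes their sum
x = np.random.random((3, 512))
sublayer_x = np.random.random((3, 512))
print(layer_norm(x + sublayer_x).shape)  # (3, 512)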

Feedforward Network Sub-Layer

The input of the FFN is the d_model = 512 output of the Post-LN of the previous sub-layer.

The FFNs in the encoder and decoder are fully connected.
The FFN is a position-wise network. Each position is processed separately and in an identical way.
The FFN contains two layers and applies a ReLU activation function.
The input and output of the FFN layers is d_model = 512, but the inner layer is larger, with d_ff = 2048.
The FFN can be viewed as performing two kernel size 1 convolutions.

The optimized and standardized FFN can be described as follows:

$FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$
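A minimal numpy sketch of this position-wise FFN (illustrative only; the random weights simply stand in for trained parameters with d_model = 512 and d_ff = 2048):

import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.random((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = np.random.random((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    # Position-wise FFN: ReLU(x W1 + b1) W2 + b2, applied identically at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.random((3, d_model))  # 3 positions, each of dimension d_model
print(ffn(x).shape)                 # (3, 512): the output keeps d_model = 512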

Decoder Stack
The layers of the decoder of the Transformer model are stacks of layers like the encoder layers. Each layer of the decoder stack contains three sub-layers: a multi-headed masked attention mechanism, a multi-headed attention mechanism, and a fully connected position-wise feedforward network.

Output Embedding and Position Encoding


The structure of the sub-layers of the decoder is mostly the same as the sub-layers of
the encoder. The output embedding layer and position encoding function are the same
as in the encoder stack.

Attention Layers
The Transformer is an auto-regressive model. It uses the previous output sequences
as an additional input. The multi-head attention layers of the decoder use the same
process as the encoder.

Training and Performance


The original Transformer was trained on a 4.5-million-sentence-pair English-German
dataset and a 36-million-sentence English-French dataset.


| Metric | Value |
|---|---|
| Training Time (Base Model) | 12 hours (100,000 steps) |
| Training Time (Big Model) | 3.5 days (300,000 steps) |
| BLEU Score (English-to-French) | 41.8 |

BLEU (Bilingual Evaluation Understudy) is an algorithm that evaluates


the quality of the results of machine translations.
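For illustration (not from the book), a candidate translation can be scored against a reference with NLTK's sentence-level BLEU implementation; the sentences below are made up:

from nltk.translate.bleu_score import sentence_bleu

# Hypothetical reference and candidate translations, already tokenized
reference = [["the", "cat", "slept", "on", "the", "couch"]]
candidate = ["the", "cat", "slept", "on", "a", "couch"]

# sentence_bleu takes a list of reference token lists and one candidate token list
print(sentence_bleu(reference, candidate))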

Hugging Face and Transformer Architectures


Hugging Face simplifies machine translation, allowing implementation in just three
lines of code.

!pip -qq install transformers
from transformers import pipeline
translator = pipeline("translation_en_to_fr")
print(translator("It is easy to translate languages with transformers", max_length=40))

Key Takeaways from Chapter 1


Transduction Expansion: Transformers enhance the ability to convert written
and oral sequences into meaningful representations.
Implementation Simplification: AI implementation is made easier, exemplified
by Hugging Face and Google Brain.
RNN/LSTM/CNN Removal: Transformers boldly remove recurrent networks for
transduction and sequence modeling.
Symmetrical Design: Standardized dimensions in the encoder and decoder
ensure seamless flow.
Parallelized Layers: Transformers introduce parallelized layers that reduce
training time.
Innovations: Positional encoding and masked multi-headed attention are key
innovations.
Flexibility: The Transformer architecture supports many variations, enabling
more powerful transduction and language modeling.

Chapter 1 Review Questions and Answers


1. NLP transduction can encode and decode text representations. (True)


2. Natural Language Understanding (NLU) is a subset of Natural Language
Processing (NLP). (True)
3. Language modeling algorithms generate probable sequences of words based on
input sequences. (True)
4. A transformer is a customized LSTM with a CNN layer. (False)
5. A transformer does not contain an LSTM or CNN layers. (True)
6. Attention examines all of the tokens in a sequence, not just the last one. (True)
7. A transformer uses a positional vector, not positional encoding. (False)
8. A transformer contains a feedforward network. (True)
9. The masked multi-headed attention component of the decoder of a transformer
prevents the algorithm parsing a given position from seeing the rest of a
sequence that is being processed. (True)
10. Transformers can analyze long-distance dependencies better than LSTMs. (True)

Fine-Tuning BERT Models


Chapter 2 explores Bidirectional Encoder Representations from Transformers
(BERT).

BERT Architecture
BERT introduces bidirectional attention to transformer models, utilizing only the
encoder blocks.

BERT adds a new piece to the Transformer building kit: a bidirectional


multihead attention sub-layer. When we humans are having problems
understanding a sentence, we do not just look at the past words. BERT,
like us, looks at all the words in the same sentence at the same time.

Encoder Stack

BERT does not use decoder layers. The masked tokens are in the attention layers of the encoder.

Original Transformer: N = 6 layers, d_model = 512, A = 8 heads. The dimension of a head is z_A = d_model / A = 512 / 8 = 64.

BERT Models:

BERT_BASE: N = 12 encoder layers, d_model = H = 768, A = 12 heads. The dimension of each head is z_A = 768 / 12 = 64. Output: output_multi-head_attention = {z_0, z_1, z_2, ..., z_11}.
BERT_LARGE: N = 24 encoder layers, d_model = H = 1024, A = 16 heads. The dimension of each head is z_A = 1024 / 16 = 64. Output: output_multi-head_attention = {z_0, z_1, z_2, ..., z_15}.

Model Size and Dimensions

| Model | Layers (N) | d_model / H | Attention Heads (A) | Head Dimensions (z_A) |
|---|---|---|---|---|
| Original Transformer | 6 | 512 | 8 | 64 |
| BERT_BASE | 12 | 768 | 12 | 64 |
| BERT_LARGE | 24 | 1024 | 16 | 64 |

Preparing the Pretraining Input Environment


BERT does not use a masked multi-head attention sub-layer. It uses two training
tasks:

1. Masked Language Modeling (MLM)


2. Next Sentence Prediction (NSP)

Masked Language Modeling (MLM)


BERT introduces bidirectional analysis by randomly masking a word in the sentence.

"The cat sat on it [MASK] it was a nice rug."

Input Token Masking Methods:


10%: No masking: "The cat sat on it [because] it was a nice rug."


10%: Replace with a random token: "The cat sat on it [often] it was a nice rug."
80%: Replace with [MASK] token: "The cat sat on it [MASK] it was a nice rug."
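A minimal sketch (not from the book) of how this 80/10/10 masking rule could be applied to a token list; the vocabulary and the mask_tokens helper are made up for illustration:

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # Randomly select tokens for MLM and apply the 80/10/10 replacement rule
    masked = []
    for token in tokens:
        if random.random() < mask_prob:              # token selected for prediction
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)                 # 10%: keep the original token
        else:
            masked.append(token)
    return masked

tokens = "the cat sat on it because it was a nice rug".split()
print(mask_tokens(tokens, vocab=["dog", "often", "blue"]))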

Next Sentence Prediction (NSP)


The input contains two sentences with added tokens:

[CLS]: Binary classification token at the beginning to predict if the second


sequence follows the first.
[SEP]: Separation token signaling the end of a sequence.

Example:

[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day[SEP]

Input Embeddings
Input embeddings are obtained by summing:

Token embeddings
Segment (sentence, phrase, word) embeddings
Positional encoding embeddings
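As a rough illustration (the names and shapes here are assumptions, not the book's code), the three embeddings listed above are simply summed element-wise for every token position:

import numpy as np

seq_len, hidden = 11, 768  # e.g., a short sequence with the BERT-BASE hidden size

# Hypothetical embedding lookups for one tokenized sentence pair
token_embeddings = np.random.random((seq_len, hidden))
segment_embeddings = np.random.random((seq_len, hidden))   # sentence A vs. sentence B
position_embeddings = np.random.random((seq_len, hidden))  # learned positions

input_embeddings = token_embeddings + segment_embeddings + position_embeddings
print(input_embeddings.shape)  # (11, 768)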

Input Embedding and Positional Encoding Summary

Words are broken into WordPiece tokens.


[MASK] tokens are randomly inserted for MLM.
[CLS] token is inserted at the beginning for classification.
[SEP] token separates sentences for NSP.
Sentence embedding is added for distinguishing sentences.
Positional encoding is learned (not sine-cosine).

BERT's Two-Step Framework


1. Pretraining:
Defining the model's architecture.
Training the model on MLM and NSP tasks.
2. Fine-Tuning:
Initializing the downstream model with pretrained parameters.
Fine-tuning parameters for specific tasks.

Fine-Tuning a BERT Model


This section details the steps to fine-tune a BERT model for Acceptability
Judgements using the Matthews Correlation Coefficient (MCC) for evaluation.

Steps for Fine-Tuning


1. Activating the GPU:

# Main menu -> Runtime -> Change Runtime Type -> GPU
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

2. Installing Hugging Face PyTorch Interface:

!pip install -q transformers

3. Importing Modules:


import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import AdamW, BertForSequenceClassification, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt

4. Specifying CUDA Device:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

5. Loading the Dataset:

df = pd.read_csv("in_domain_train.tsv", delimiter='\\t', header=None, names


df.shape

6. Creating Sentences and Labels:

sentences = df.sentence.values
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values

7. Activating the BERT Tokenizer:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print("Tokenize the first sentence:")
print(tokenized_texts[0])

8. Processing the Data:

MAX_LEN = 128
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

9. Creating Attention Masks:

attention_masks = []
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)

10. Splitting Data:

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=2018, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids, random_state=2018, test_size=0.1)

11. Converting to Torch Tensors:

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

12. Selecting Batch Size and Creating Iterator:

batch_size = 32

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

Data Preparation


Batch Size
The batch size is set to 32.

batch_size = 32

This variable determines how many data samples are processed in each iteration of
training.

Data Loaders
DataLoaders are used to efficiently manage the data during training:

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

TensorDataset: Wraps the input data and labels.


RandomSampler: Shuffles the training data.
SequentialSampler: Samples elements sequentially, useful for validation data.
DataLoader: Provides batches of data for training, optimizing memory usage by
loading data in smaller chunks.

BERT Model Configuration

Initializing BERT Configuration


The program initializes a BERT uncased configuration:


from transformers import BertModel, BertConfig

configuration = BertConfig()
model = BertModel(configuration)
configuration = model.config
print(configuration)

Key Parameters
The BERT configuration includes several important parameters:

attention_probs_dropout_prob: Dropout ratio applied to attention probabilities


(0.1).
hidden_act: Non-linear activation function in the encoder, using "gelu" (Gaussian
Error Linear Units).
hidden_dropout_prob: Dropout probability applied to fully connected layers in
embeddings, encoder, and pooler layers (0.1).
hidden_size: Dimension of the encoded layers and the pooler layer (768).
initializer_range: Standard deviation value for initializing weight matrices (0.02).
intermediate_size: Dimension of the feed-forward layer in the encoder (3072).
layer_norm_eps: Epsilon value for layer normalization layers (1e-12).
max_position_embeddings: Maximum sequence length the model uses (512).
model_type: Name of the model ("bert").
num_attention_heads: Number of attention heads (12).
num_hidden_layers: Number of hidden layers (12).
pad_token_id: ID of the padding token to avoid training on padding tokens (0).
type_vocab_size: Size of the token_type_ids, which identify the sequences (2).
For example, "the dog[SEP] The cat.[SEP]" can be represented with 6
token IDs: [0,0,0, 1,1,1].
vocab_size: Number of different tokens used by the model to represent the
input_ids (30522).

Loading the Pretrained BERT Model

Loading the Model


The program loads the pretrained BERT model:


model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

This loads a pre-trained BERT model specifically configured for sequence


classification tasks with two labels.

Exploring the Architecture


The architecture of the BERT model includes embeddings, encoders, and more, which
can be visualized in detail:

BertForSequenceClassification(
(bert): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
)


Optimizer Grouped Parameters

Initializing the Optimizer


Fine-tuning a model starts by initializing the pretrained model parameter values:

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']

optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.1},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
]

Weight Decay
The optimizer includes a weight decay rate to prevent overfitting. Parameters are
filtered to apply different weight decay rates:

Weight parameters receive a weight_decay_rate of 0.1.


Bias parameters receive a weight_decay_rate of 0.0.

Hyperparameters for the Training Loop

Setting Hyperparameters
The hyperparameters are crucial for training.

optimizer = BertAdam(optimizer_grouped_parameters, lr=2e-5, warmup=.1)

Learning rate (lr) and warm-up rate (warmup) should be small initially and
gradually increased to avoid large gradients and overshooting.

Accuracy Measurement


Defining Accuracy Function


An accuracy measurement function is defined to compare predictions to labels:

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

Training Loop

Training Process
The training loop involves standard learning processes.

epochs = 4
train_loss_set = []
for _ in trange(epochs, desc="Epoch"):
    # Training and evaluation steps here
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1
print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

The number of epochs is set to 4, with measurements for loss and accuracy at each
epoch. The train_loss_set stores loss and accuracy values for plotting.

Training Evaluation Graph


A plot of the training loss per batch (figure not reproduced here) shows that the training process went well and was efficient.

Prediction and Evaluation with the Holdout Dataset


Data Preparation
The program makes predictions using the holdout dataset. The data preparation
process from the training data is repeated:

df = pd.read_csv("out_of_domain_dev.tsv", delimiter='\\t', header=None, nam


sentences = df.sentence.values
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

Batch Predictions
The program runs batch predictions using the dataloader:

for batch in prediction_dataloader:
    batch = tuple(t.to(device) for t in batch)
    b_input_ids, b_input_mask, b_labels = batch
    with torch.no_grad():
        logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = logits['logits'].detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    predictions.append(logits)
    true_labels.append(label_ids)

Evaluating with the Matthews Correlation Coefficient

What is MCC?
The Matthews Correlation Coefficient (MCC) measures the quality of
binary classifications and can be modified into a multi-class correlation
coefficient.

The formula for MCC is:

$\phi = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

The value lies between -1 and +1:

+1: maximum positive value, a perfect prediction
-1: an inverse prediction
0: an average random prediction

MCC Implementation
MCC is imported from sklearn.metrics:

from sklearn.metrics import matthews_corrcoef

MCC values are calculated and stored:

matthews_set = []
for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i], np.argmax(predictions[i], axis=1).flatten())
    matthews_set.append(matthews)

Individual Batch Scores

Displaying Batch Scores


MCC values between -1 and +1 are produced:

matthews_set

Matthews Evaluation on the Whole Dataset

Aggregating Results


True values are aggregated for the whole dataset:

flat_predictions = [item for sublist in predictions for item in sublist]


flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
matthews_corrcoef(flat_true_labels, flat_predictions)

The output confirms a positive MCC, indicating a correlation: 0.45439842471680725.

Summary

BERT Architecture and Process


BERT uses bidirectional attention, attending to all tokens in a sequence
simultaneously. It operates in two steps:

1. Pretraining: Initial model training.


2. Fine-tuning: Adapting the model for specific tasks.

Fine-tuning a pretrained model requires fewer resources than training from scratch.

Questions
1. BERT stands for Bidirectional Encoder Representations from Transformers.
(True)
2. BERT is a two-step framework. Step 1 is pretraining. Step 2 is fine-tuning.
(True)
3. Fine-tuning a BERT model implies training parameters from scratch. (False)
4. BERT only pretrains using all downstream tasks. (False)
5. BERT pretrains with Masked Language Modeling (MLM). (True)
6. BERT pretrains with Next Sentence Predictions (NSP). (True)
7. BERT pretrains mathematical functions. (False)
8. A question-answer task is a downstream task. (True)
9. A BERT pretraining model does not require tokenization. (False)
10. Fine-tuning a BERT model takes less time than pretraining. (True)

Pretraining a RoBERTa Model from Scratch


RoBERTa and DistilBERT


The next chapter will focus on building a pretrained transformer model from scratch.
KantaiBERT, a RoBERTa-like model, will be created. The model will be trained using
masked language modeling with 6 layers, 12 heads, and 84,095,008 parameters. It
will use byte-level byte-pair encoding, similar to GPT-2.

Building KantaiBERT from Scratch

Steps
KantaiBERT will be built in 15 steps. The first step is to load the dataset.

Loading the Dataset


The works of Immanuel Kant are used for training.

Immanuel Kant (1724-1804) was a German philosopher.

Three books by Immanuel Kant are compiled into a text file named kant.txt:

The Critique of Pure Reason


The Critique of Practical Reason
Fundamental Principles of the Metaphysic of Morals

The dataset is downloaded automatically using curl:

!curl -L https://raw.githubusercontent.com/PacktPublishing/... --output "kant.txt"


Installing Hugging Face Transformers


Hugging Face Transformers and tokenizers are installed with pip; the installed versions can then be checked with:

!pip list | grep -E 'transformers|tokenizers'

Training a Tokenizer

A tokenizer is trained using Hugging Face's ByteLevelBPETokenizer():

This byte-level tokenizer breaks a string or word down into sub-strings or sub-words. The tokenizer can break words into minimal parts, so the chunks of strings classified as unknown (unk_token) with WordPiece-level encoding practically disappear.

files=paths is the path to the dataset.
vocab_size=52_000 is the size of the tokenizer's vocabulary.
min_frequency=2 is the minimum frequency threshold.
special_tokens=[...] lists the special tokens added to the vocabulary (such as <s>, <pad>, </s>, <unk>, and <mask>).

The tokenizer is trained to generate merge rules for the most frequent sub-string pairs; a sketch of the training call is shown below.
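A minimal sketch of how such a training run could look, assuming the Kant text file(s) sit in the current directory; the paths, vocabulary size, and special tokens follow the parameters listed above:

from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Collect the dataset file(s), e.g. kant.txt
paths = [str(x) for x in Path(".").glob("**/*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=paths,
                vocab_size=52_000,
                min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])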

Saving Files to Disk


The tokenizer generates two files: vocab.json, which contains the token indices of the vocabulary, and merges.txt, which contains the merged sub-string pairs. The program first creates the KantaiBERT directory and then saves the tokenizer model:

import os
token_dir = '/content/KantaiBERT'
if not os.path.exists(token_dir):
    os.makedirs(token_dir)
tokenizer.save_model('KantaiBERT')

Loading the Trained Tokenizer Files


The trained tokenized dataset can be loaded as follows:

from tokenizers.implementations import ByteLevelBPETokenizer


from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
"./KantaiBERT/vocab.json",
"./KantaiBERT/merges.txt",
)

The tokenizer can encode a sequence:

tokenizer.encode("The Critique of Pure Reason.").tokens
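The BertProcessing import above is typically used to post-process the encoded sequences; a hedged sketch of that step, assuming the special tokens trained earlier:

# Wrap every encoded sequence in RoBERTa-style start/end tokens and cap its length
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

print(tokenizer.encode("The Critique of Pure Reason.").tokens)  # now wrapped in <s> ... </s>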

Checking Resource Constraints


The program checks for GPU and CUDA availability:

!nvidia-smi

Defining the Configuration of the Model


The configuration of the model is defined using RobertaConfig:


from transformers import RobertaConfig

config = RobertaConfig(
vocab_size=52_000,
max_position_embeddings=512,
num_attention_heads=12,
num_hidden_layers=6,
type_vocab_size=1,
)
Re-creating the Tokenizer and Initializing the Model

Step 8: Re-creating the Tokenizer

To re-create the tokenizer in Transformers, you can load the trained tokenizer files as a pretrained tokenizer:

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)

This step involves importing the RobertaTokenizer from the transformers library and
initializing it with a pre-trained model, in this case, "./KantaiBERT", while setting the
maximum sequence length to 512.

Step 9: Initializing a Model From Scratch


To initialize a RoBERTa model from scratch, you can use the following code:

from transformers import RobertaForMaskedLM


model = RobertaForMaskedLM(config=config)

Here, RobertaForMaskedLM is imported from transformers, and the model is initialized


with a configuration (config) defined earlier.

The initialized model includes several key components:


RobertaEmbeddings: Converts tokens into embeddings.


word_embeddings: Embedding layer for words.
position_embeddings: Embedding layer for positional information.
token_type_embeddings: Embedding layer for token types.
LayerNorm: Normalizes the embeddings.
Dropout: Applies dropout regularization.
BertEncoder: Encodes the embeddings using multiple layers.
BertLayer: Each layer consists of attention and intermediate sub-layers.
BertAttention: Attention mechanism.
BertSelfAttention: Computes self-attention.
query, key, value: Linear layers for attention computation.
dropout: Dropout regularization.
BertSelfOutput: Combines and normalizes the attention output.
BertIntermediate: Intermediate layer with a dense connection.
BertOutput: Output layer.

The LEGO®-type building blocks of transformers make the model fun to analyze. For example, you will note that dropout regularization is present throughout the sub-layers.

Exploring the Parameters


The number of parameters in a transformer model can be substantial. For instance, a
transformer might have approximately 84,095,008 parameters.

The parameters are first collected in a list, LP, so that their shapes and values can be examined:

LP = list(model.parameters())
lp = len(LP)
print(lp)

The output displays all the parameters as shown in the following excerpt output:

Parameter containing:[ 0.0020, -0.0354, -0.0221, ..., 0.0220, -0.0060, -0.0032

The number of parameters is calculated by adding up the sizes of all the parameter tensors in the model, for example:

The vocabulary (52,000) x the dimensions (768)
The many vectors of size 1 x 768
The many other dimensions found in the model

You will note that d_model = 768. There are 12 heads, so the dimension of each head is 768 / 12 = 64, which again shows the LEGO® concept of the building blocks of a transformer.

Counting the Parameters


To count the parameters, iterate through the list of tensors, check whether each tensor is two-dimensional, multiply its dimensions, and add the result to a running total:

np=0
for p in range(0, lp):        # number of tensors
    PL2=True
    try:
        L2=len(LP[p][0])      # check if the tensor is 2D
    except:
        L2=1                  # the tensor is 1D
        PL2=False
    L1=len(LP[p])
    L3=L1*L2                  # number of parameters in this tensor
    np+=L3                    # add the parameters up at each step of the loop
    if PL2==True:
        print(p, L1, L2, L3)  # display the sizes of the 2D parameters
    if PL2==False:
        print(p, L1, L3)      # display the sizes of the 1D parameters
print(np)                     # total number of parameters of the model

The parameters are matrices and vectors of different sizes; for example:

768 x 768
768 x 1
768

The output shows the number of parameters calculated for tensors in the model:


0 52000 768 39936000


1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768

Building the Dataset and Initializing the Trainer

Step 10: Building the Dataset


To build the dataset, you can use LineByLineTextDataset from the transformers library.
This dataset reads text files line by line, tokenizes them, and prepares them for
training.

%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
tokenizer=tokenizer,
file_path="./kant.txt",
block_size=128,
)

tokenizer: The tokenizer object used to tokenize the text.


file_path: The path to the text file.
block_size: The maximum sequence length.

Step 11: Defining a Data Collator


A data collator is used to prepare batches of data for training. For masked language
modeling (MLM), DataCollatorForLanguageModeling is used to mask some of the
tokens in the input sequences.


from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

mlm: Whether to perform masked language modeling.


mlm_probability: The probability of masking each token.

Step 12: Initializing the Trainer


The trainer is responsible for managing the training loop. It takes the model, dataset,
data collator, and training arguments as input.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
output_dir="./KantaiBERT",
overwrite_output_dir=True,
num_train_epochs=1,
per_device_train_batch_size=64,
save_steps=10_000,
save_total_limit=2,
)

trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=dataset,
)

output_dir: The directory to save the trained model.


overwrite_output_dir: Whether to overwrite the output directory.
num_train_epochs: The number of training epochs.
per_device_train_batch_size: The batch size for training.
save_steps: How often to save the model.
save_total_limit: The maximum number of saved models to keep.

Training the Model


To start the training process, call the train() method on the trainer object.

%%time
trainer.train()

The output displays the training process in real time, showing the loss, learning rate, epoch, and steps:

Epoch: 100% 1/1
Iteration: 100% 2672/2672 [17:59<00:00, 2.47it/s]
{"loss": 5.6455852394104005, "learning_rate": 4.06437125748503e-05, "epoch": 0.187125...}
(the loss keeps decreasing at each logging step as the learning rate decays to 3.218562874251497e-06)
CPU times: user 11min 36s, sys: 6min 25s, total: 18min 2s
Wall time: 17min 59s
TrainOutput(global_step=2672, training_loss=4.7226536670130885)

Saving and Utilizing the Model

Step 14: Saving the Final Model


After training, the model (along with the tokenizer and configuration) is saved to disk.

trainer.save_model("./KantaiBERT")

Step 15: Language Modeling with the FillMaskPipeline


To use the trained model for language modeling, you can use the pipeline function
from the transformers library. This creates a pipeline that takes text as input and
returns the predicted masked tokens.


from transformers import pipeline

fill_mask = pipeline(
"fill-mask",
model="./KantaiBERT",
tokenizer="./KantaiBERT"
)

fill_mask("Human thinking involves human <mask>.")

The output will likely change after each run because we are pretraining from scratch with a limited amount of data. However, it is interesting because it introduces conceptual, human-like predictions, for example:

[{'score': ..., 'sequence': '<s> Human thinking involves human reason.</s>', 'token': ...}, ...]

The goal here was to see how to train a transformer model from scratch. We can see that interesting, human-like predictions can be obtained even with a limited dataset.

Transformers and Downstream NLP Tasks

The Quest to Outperform Human Baseline


The ultimate goal in Natural Language Understanding (NLU) is to surpass human
performance. This involves measuring how well machines can understand and
process language compared to human capabilities.

Measuring Transformer Performances


Measuring NLP tasks involves using accuracy scores to evaluate the performance of
transformers on benchmark tasks and datasets.

Key Measurement Methods


1. Accuracy Score:

A straightforward evaluation method that calculates a true or false value for each result, indicating whether the model's prediction matches the actual value.

2. F1-Score:

A more flexible metric, especially useful for datasets with uneven class distributions, because it balances precision and recall:

$F1\text{-}score = 2 \times \frac{precision \times recall}{precision + recall}$

Precision (P): the ratio of true positives to the total predicted positives: $Precision = \frac{TP}{TP + FP}$

Recall (R): the ratio of true positives to the total actual positives: $Recall = \frac{TP}{TP + FN}$

3. Matthews Correlation Coefficient (MCC):

Provides an excellent binary metric, especially for imbalanced datasets. It considers true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). A sketch showing how these three metrics can be computed is given after this list.
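For illustration (not from the book), all three metrics are available in scikit-learn; the labels below are made-up binary predictions:

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Hypothetical true labels and model predictions for a binary task
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))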

Benchmark Tasks and Datasets


To demonstrate the effectiveness of transformers, three prerequisites are required:

A model
A task
A metric

SuperGLUE Benchmark
SuperGLUE is a benchmark designed to evaluate the performance of NLU models on
more difficult tasks.

From GLUE to SuperGLUE


GLUE (General Language Understanding Evaluation): A benchmark designed


to encourage NLU models to solve a set of tasks.
SuperGLUE: Designed to address the limitations of GLUE by introducing more
challenging NLU tasks.

SuperGLUE Tasks
The eight SuperGLUE tasks are presented in a ready-to-use list:

Name: The name of the task.
Identifier: A unique identifier for the task.
Download: The download link to the dataset that drives the task.
More Info: Links to papers or websites describing the task.
Metric: The measurement score used to evaluate the model.

Defining the SuperGLUE Benchmark Tasks


The tasks can be either pretraining tasks or downstream tasks for fine-tuning. Multi-
task models demonstrate the versatility and thinking capabilities of transformers. The
Transformer model now leads in all of the GLUE and SuperGLUE tasks.


Notable SuperGLUE Tasks


1. Choice Plausible Answers (COPA):

Requires the NLU model to choose the most plausible cause or effect
related to a given premise.
Premise: I knocked on my neighbor's door. What happened as a result?
Alternative 1: My neighbor invited me in.
Alternative 2: My neighbor left his

2. BoolQ (Boolean Questions):

A question answering task where the model must answer a question


about a given passage with either "true" or "false".
{"question":"passage": "Windows Movie Maker -- Windows Movie Maker
(formerly as Windows Live Movie Maker in Windows 7) is a discontinued
video editing software by Microsoft. It is a part of Windows Essentials
software suite and offers the ability to create and edit videosas to publish
them on OneDrive, Facebook, Vimeo, YouTube, and Flickr.", "idx": 2, "label":
true}

3. Commitment Bank (CB):

Requires the model to read a premise and then examine a hypothesis built
on the premise.
The model must label the hypothesis as entailment, contradiction, or
neutral.
{"premise": "\"Did I ever tell you that's where Paul and I met?\"",
"hypothesis": "Susweca is where she and Paul met," "label": "entailment",
"idx": 77}

4. Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD):

A question answering task where the model must select the correct
answer from multiple possible answers.
Involves filling in a blank in a query with the correct answer.

The sample contains four questions. To illustrate the task, we will just look at one of them. The model has to predict the correct labels. Notice how the information the model is asked to obtain is distributed throughout the text:


"question": "When was Kayla Rolland shot?"


"answers": [
{"text": "February 17", "idx": 168, "label": 0},
{"text": "February 29", "idx": 169, "label": 1}
]

A flowchart (not reproduced here) represents the processes of human and machine perception. As illustrated, both humans and machines undergo processes of language acquisition, fine-tuning, and training, highlighting the similarities and differences in their approaches to understanding language.

For humans, transduction goes through a trial-and-error process. Transduction means that we take structures we perceive and represent them with patterns. We make representations of the world that we apply to our inductive thinking, and our inductive thinking relies on the quality of our transductions. Transformers go through the same process, but in a different way.


Reading Comprehension with Commonsense Reasoning

The Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) presents a challenging task, containing over 120,000 queries based on more than 70,000 news articles. The Transformer model must leverage commonsense reasoning to solve these problems.

Here's an example of a training sample:


Source: Daily Mail

Passage: Includes text and indications of entities.

A mountainous nation's Amazon region, but their settlements are so


that they now make up less than one per cent of Peru's 30 million
population. Ever since they battled rival tribes for territory and during
native rule in the rainforests of South America, the Ashaninka have
rarely known peace.\n@highlight\nThe Ashaninka tribe once shared
the Amazon with the like of the Incas hundreds of years
ago\n@highlight\nThey have been forced to share their land after
years of conflict forced rebels and drug dealers into the
forest\n@highlight\nDespite settling in valleys rich with valuable
coca, they live a poor pre-industrial existence

Entities: Indicated by start and end indices within the text.

"entities": [{"start": 2,"end": 9}, …,"start": 711,"end": 715}]

Query: The model must answer a question by finding the appropriate value for a
placeholder.

Innocence of youth: Many of the @placeholder's generations have turned their backs on tribal life and moved to cities where living conditions are better

Answers:

"answers":[{"start":263,"end":271,"text":"Ashaninka"},{"start":601,"e

Recognizing Textual Entailment (RTE)


In Recognizing Textual Entailment (RTE), the Transformer model must:

Examine a premise.
Examine a hypothesis.
Predict the label of entailment for the hypothesis.


RTE requires understanding and logic.

Words in Context (WiC)


The Words in Context (WiC) task requires the model to process an ambiguous word.

The target word is specified, for example:

"word": "place"

The model has to read two sentences containing the target word:

"sentence1": "Do yougroups."

The training data specifies the sample index, the label value, and the start and end indices of the target word in both sentences:

"idx": 0, "label": ..., "start1": 31, "start2": 27, "end1": 36, "end2": 32

Winograd Schema Challenge (WSC)


The Winograd Schema Challenge (WSC), named after Terry Winograd, tests the
model's ability to solve disambiguation problems, especially with pronouns.

The dataset contains sentences targeting slight pronoun differences.


The task is to determine if the pronoun is coreferent with a specified occupation
or participant.

Here's a sample from the training data:


Text:

I poured water from the bottle into the cup until it was full.

The model needs to identify the target:

"target":{"span2_index":

Determine if "it" refers to "the cup"

"span1_index": 7, "span1_text": "the cup", "span2_text": "it"},

The label indicates if the statement is true:

"idx": 4, "label": true

Running Downstream Tasks


A downstream task is a fine-tuned task that inherits the model and parameters from
a pre-trained Transformer model. If a task wasn't used to pre-train the model, it is
considered a downstream task.

Corpus of Linguistic Acceptability (CoLA)


The Corpus of Linguistic Acceptability (CoLA) evaluates an NLP model's ability to
judge the linguistic acceptability of a sentence, classifying sentences as grammatical
(1) or ungrammatical (0).

Example:
Classification = 1 for 'we yelled ourselves hoarse.'
Classification = 0 for 'we yelled ourselves.'

The following python code can be used to load the dataset:


df = pd.read_csv("in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

We can also load a pretrained BERT model:

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Stanford Sentiment TreeBank (SST-2)


The Stanford Sentiment TreeBank (SST-2) contains movie reviews used for
sentiment analysis, often in a binary classification task (positive or negative).

Here is some python code to perform SST-2 binary classification:

from transformers import pipeline

nlp = pipeline("sentiment-analysis")
print(nlp("If you sometimes like to go to the movies to have fun, Wasabi is a good place to start."), "If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.")
print(nlp("Effective but too-tepid biopic."), "Effective but too-tepid biopic.")

The above code will output:

[{'label': 'POSITIVE', 'score': 0.999825656414032}] If you sometimes like to go to the movies to have fun, Wasabi is a good place to start.

The SST-2 task is evaluated using the Accuracy metric.

Paraphrase Recognition
This task involves determining whether two sentences in a sequence are paraphrases
of each other.

Each pair of sentences has been annotated by humans to indicate equivalence.


Properties include paraphrase equivalence and semantic equivalence.

Here is a code snippet to show sequence classification for paraphrase:


from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]

sequence_A = "The DVD-CCA then appealed to the state Supreme Court."
sequence_B = "The DVD CCA appealed that decision to the U.S. Supreme Court."

paraphrase = tokenizer(sequence_A, sequence_B, truncation=True, padding='longest', return_tensors="tf")

paraphrase_classification_logits = model(paraphrase)[0]
paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]

print(sequence_B, "should be a paraphrase")
for i in range(len(classes)):
    print(f"{classes[i]}: {round(paraphrase_results[i] * 100)}%")

The MRPC task is measured with the F1/Accuracy score method.

Winograd Schemas and Translation


Winograd schemas can also be used in translation tasks to test a Transformer's
disambiguation abilities across languages.

For example, translating an English sentence with a pronoun into French, where
pronouns have grammatical genders, tests the Transformer's understanding of
the pronoun's reference.

Here is a python snippet to show winograd translation:

from transformers import pipeline


translator = pipeline("translation_en_to_fr")
print(translator("The car could not go in the garage because- max_length=40

Machine Translation
Machine translation is the process of reproducing human translation by
machine transductions and outputs.

Here's a simplified view of the machine translation process:


The image illustrates the steps from the initial sentence to translate, through learning
parameters and machine transduction, to the final candidate translation.

The process involves:

Learning how words relate to each other using parameters.


Using machine transduction to transfer learned parameters to sequences.
Choosing a candidate translation for a word or sequence.

Human translation involves building a cognitive representation of the sentence's
meaning and then transforming that meaning into another language.

Preprocessing a WMT Dataset


The Workshop on Machine Translation (WMT) provides datasets for machine
translation tasks.

Steps to preprocess a WMT dataset:


1. Download the data: Use the French-English dataset from the European
Parliament Proceedings Parallel Corpus 1996-2011.
2. Load the data: Use standard Python libraries to load the raw text files.

import pickle
from pickle import dump
def load_doc(filename):
file = open(filename, mode='rt', encoding='utf-8')
text = file.read()
file.close()
return text

3. Split into sentences: Divide the loaded document into sentences.

def to_sentences(doc):
    # split the loaded document into individual sentences (one per line)
    return doc.strip().split('\n')

4. Clean the data: Normalize the text, tokenize, convert to lowercase, remove
punctuation, and filter out non-alphabetic tokens.

import re
import string
import unicodedata

def clean_lines(lines):
    cleaned = list()
    # regex for filtering out non-printable characters
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        # normalize unicode characters and drop non-ASCII
        line = unicodedata.normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # tokenize on white space
        line = line.split()
        # convert to lowercase
        line = [word.lower() for word in line]
        # remove punctuation from each token
        line = [word.translate(table) for word in line]
        # remove non-printable characters from each token
        line = [re_print.sub('', w) for w in line]
        # filter out tokens with non-alphabetic characters
        line = [word for word in line if word.isalpha()]
        cleaned.append(' '.join(line))
    return cleaned

5. Save the cleaned data: Use pickle to serialize the cleaned data into files.


filename = 'English.pkl'
outfile = open(filename,'wb')
pickle.dump(cleanf,outfile)
outfile.close()

6. Create a vocabulary: Generate a frequency table for all words in the dataset.

from collections import Counter


def to_vocab(lines):
vocab = Counter()
for line in lines:
tokens = line.split()
vocab.update(tokens)
return vocab

7. Reduce vocabulary size: Remove infrequent words to avoid wasting the training
model's time.

def trim_vocab(vocab, min_occurance):
    tokens = [k for k, c in vocab.items() if c >= min_occurance]
    return set(tokens)

8. Handle OOV words: Replace Out-Of-Vocabulary (OOV) words with a special token (e.g., "unk").

def update_dataset(lines, vocab):
    new_lines = list()
    for line in lines:
        # replace every token that is not in the vocabulary with 'unk'
        new_tokens = [token if token in vocab else 'unk' for token in line.split()]
        new_line = ' '.join(new_tokens)
        new_lines.append(new_line)
    return new_lines

Evaluating Machine Translation with BLEU


BLEU (Bilingual Evaluation Understudy Score) is a method to evaluate candidate
translations produced by machine translation models.


Geometric Evaluations

BLEU combines the modified n-gram precisions $p_n$ geometrically with a brevity penalty $BP$:

$BLEU = BP \cdot \prod_{n=1}^{N} p_n^{w_n}$

where $w_n$ are the n-gram weights (typically uniform, $w_n = 1/N$).

Chencherry Smoothing
Chen and Cherry (2014) introduced smoothing techniques that improve BLEU scores, especially for short sentences. A related idea, label smoothing, is applied to the softmax outputs inside the Transformer itself.
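
For example, NLTK exposes the Chen-Cherry smoothing methods through its BLEU implementation. The reference and candidate tokens below are made up for illustration; this is a minimal sketch, not the book's code:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference token lists
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # candidate token list

chencherry = SmoothingFunction()
score_raw = sentence_bleu(reference, candidate)                          # no smoothing
score_smooth = sentence_bleu(reference, candidate,
                             smoothing_function=chencherry.method4)      # Chen-Cherry smoothing
print(score_raw, score_smooth)
```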

Label Smoothing
Label smoothing introduces a small value epsilon (ɛ) to reduce overconfidence in predictions:

Reduce the target value of the correct candidate from 1 to 1 − ɛ.

Distribute ɛ over the remaining candidates so the target distribution is no longer a hard one-hot vector.
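
A minimal NumPy sketch of the idea, with an arbitrary ɛ and a four-class one-hot target chosen purely for illustration:

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    # number of classes (the vocabulary size in a transformer's output layer)
    K = one_hot.shape[-1]
    # correct class gets 1 - epsilon; the other classes share epsilon uniformly
    return one_hot * (1.0 - epsilon) + (1.0 - one_hot) * (epsilon / (K - 1))

y = np.array([0., 0., 1., 0.])   # one-hot target
print(smooth_labels(y))          # e.g. [0.0333, 0.0333, 0.9, 0.0333]
```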

Translations with Trax


Trax is a library developed by Google Brain for end-to-end deep learning, making it
easier to implement translations.

!pip install -U trax

Create the original Transformer model:

model = trax.models.Transformer(
input_vocab_size=33300,
d_model=512
)

Finally, the program will de-tokenize and display the translation:


tokenized_translation = tokenized_translation[0][:-1]  # remove batch dimension and EOS token


translation = trax.data.detokenize(tokenized_translation,vocab_dir='gs://t
print("The sentence:",sentence)
print("The translation:",translation)
## GPT Models and Transformers

With OpenAI's GPT-2 and GPT-3 models, we discover another way of assembling transformer building blocks: decoder-only models that need little or no fine-tuning.

### True/False Questions Review

Here are some true/false questions to test your understanding:

1. With OpenAI GPT-2 and GPT-3 models, we discover another way of assembling transformer building blocks. (True/False)
2. There is no need to compare transformer models using the same datasets. (True/False)
3. BLEU is the French word for blue and is the acronym of an NLP metric. (True/False)
4. Smoothing techniques enhance BERT. (True/False)
5. German-English is the same as English-German for machine translation. (True/False)
6. The original Transformer multi-head attention sub-layer has 2 heads. (True/False)
7. The original Transformer encoder has 6 layers. (True/False)
8. The original Transformer encoder has 6 layers but only 2 decoder layers. (True/False)
9. You can train transformers without decoders. (True/False)

### The Rise of Transformer Models

The rise of **transformer models** that require little to no fine-tuning is the result of a rapid succession of breakthroughs:

* Vaswani et al. (2017) introduced the **Transformer**, which surpassed CNNs and RNNs on translation tasks.
* Radford et al. (2018) introduced **Generative Pre-Training (GPT)**, which can perform downstream tasks with fine-tuning.
* Devlin et al. (2019) perfected fine-tuning with the **BERT** model.
* Radford et al. (2019) went further, and Brown et al. (2020) defined a **zero-shot** approach to downstream tasks that requires no fine-tuning.
* Wang et al. (2019) created **GLUE** to benchmark NLP models. Transformers quickly reached the top of the leaderboard.
* Wang et al. (2019, 2020) rapidly created **SuperGLUE**, set the human baselines much higher, and transformers are beginning to surpass them.

### Evolution of Transformer Model Size

To understand how such evolution happened, we will look first at one aspect of it: the number of parameters.

**Table 6.1: The evolution of the number of parameters of transformers**

| Transformer Model | Authors | Parameters |
| :----------------------- | :------------------------- | :--------- |
| Transformer Base | Vaswani et al. (2017) | 65M |
| Transformer Big | Vaswani et al. (2017) | 213M |
| BERT-Large | Devlin et al. (2019) | 340M |
| GPT-2 | Radford et al. (2019) | 117M |
| GPT-2 | Radford et al. (2019) | 345M |
| GPT-2 | Radford et al. (2019) | 1.5B |
| GPT-3 | Brown et al. (2020) | 175B |

The size of the architecture evolved at the same time:

* The number of **layers** of a model went from 6 in the original Transformer to 96 in GPT-3.
* The number of **heads** of a layer went from 8 in the original Transformer model to 96 in GPT-3.
* The **context size** went from 512 tokens in the original Transformer model to 12,288 in GPT-3.

### Context Size and Maximum Path Length


The cornerstone of transformer models resides in the attention sub-layers.

**Table 6.2: Maximum path length**

| Layer Type | Maximum Path Length | Context Size |
| :------------- | :------------------ | :----------- |
| Self-Attention | O(1) | 12288 |
| Recurrent | O(n) | 12288 |

### Factors Influenced by Transformer Architecture

The flexible and optimized architecture of transformers has had an impact on several other factors:

* Vaswani et al. (2017) trained a state-of-the-art Transformer model on standard GPUs; Brown et al. (2020) needed supercomputer-scale resources to train GPT-3.
* Training large transformer models requires machine power that is available only to a small number of well-funded organizations.
* Designing the architecture of transformers requires highly qualified teams.

### Transformers, Reformers, PET, or GPT?

Before using GPT models, we need to stop and look at transformers from a project manager's perspective.

In this section, before using a GPT model, we will examine:

* The limits of the original Transformer model.


* The Reformer solution to the possible limits of the architecture of the original Transformer.
* The PET solution to training a model.

### Limits of the Original Transformer Architecture

The limits of the original Transformer architecture include attention and memory issues on long sequences, which lead to very costly training and inference.

### Reformer

Kitaev, Kaiser, and Levskaya (2020) designed the Reformer to address the attention and memory limits of the original Transformer architecture.

The Reformer solves the attention issue with **Locality Sensitive Hashing (LSH)** attention.

Figure: LSH attention heads.

As seen in the figure above, LSH bucketing and chunking considerably reduce the computational cost of the attention sub-layers.

### PET (Pattern-Exploiting Training)

Schick and Schütze (2020) contend that a 223 million parameter transformer trained with **PET** can outperform the 175 billion parameter GPT-3 on SuperGLUE.

> PET relies on the reformulation of training tasks to optimize the training process.

PET maps inputs to outputs via pattern-verbalizer pairs (PVPs). Each PVP contains:

* A pattern P that converts (maps) inputs to cloze questions containing masked tokens.
* A verbalizer v that maps each output y to a single token from the vocabulary T.

### Decision Time: Choosing the Right Approach

What will a project manager's decision be? We have seen the limits of the original Transformer architecture and some alternatives. The options include:

* Accept the limits of the original Transformer models, which requires huge machine resources.


* Refuse the limits of the original Transformer and tweak its architecture, as the Reformer does.
* Use different training methods, such as PET or efficient knowledge distillation.
* Use a combination of these approaches.
* Design your own training methods and model architecture.

The efficiency of each solution relies on:

* The human and machine resources to implement the project

### The Architecture of OpenAI GPT

Transformers went from training and fine-tuning to zero-shot models between 2017 and 2020.

#### From Fine-Tuning to Zero-Shot Models

From the start, research teams led by Radford et al. worked on transitioning transformers from fine-tuned models to few-shot and zero-shot models.

OpenAI opted for a 12-layer decoder-only transformer. The promising results encouraged the team to build progressively larger models.

The goal was to generalize this concept to any type of downstream task once the model was pre-trained.

#### Conditioning Approaches

* **Few-Shot (FS)**: The GPT model is trained. When the model needs to make inferences, it is presented with a few demonstrations of the downstream task as conditioning. No weight updates are made.
* **One-Shot (1S)**: The trained GPT model is presented with only one demonstration of the downstream task. No weight updates are made.
* **Zero-Shot (ZS)**: The trained GPT model is presented with no demonstration of the downstream task, only a description of the task.

#### Decoder-Layer-Only GPT Model

GPT models have the same structure as the decoder stacks of the original Transformer, without an encoder stack or the cross-attention sub-layer that attends to it.

Figure: the GPT decoder-only architecture.

Radford et al. (2019) presented no less than four GPT-2 models, and Brown et al. (2020) described the GPT-3 models, whose largest configuration has:

$N_p = 175.0B, N_l = 96, N_c = 12288, h = 96$

#### Text Completion

This section will clone the OpenAI GPT-2 repository and download the 345M parameter model.

##### Step 1: Activating the GPU

Activating the GPU is a prerequisite for reasonable performance and gives us access to the GPT-2 model's capabilities.

##### Step 2: Cloning the OpenAI GPT-2 Repository

We will clone OpenAI's GitHub directory on our VM:

!git clone https://github.com/openai/gpt-2.git


When the cloning is over, you should see the repository appear in the VM's file manager:

Figure: the cloned GPT-2 repository in the file manager.

Click on `src`, and you will see that the Python files we need from OpenAI to interact with the model are now installed.

##### Step 3: Installing the requirements

import os                      # import os after the VM restarts
os.chdir("/content/gpt-2")
!pip3 install -r requirements.txt

The requirements for this notebook are:

* Fire 0.1.3 to generate command-line interfaces (CLIs)


* regex 2017.4.5 for regex usage
* Requests 2.21.0, an HTTP library

##### Step 4: Checking the TensorFlow Version

The 345M parameter GPT-2 model we will use is implemented with TensorFlow.


## TensorFlow Version and GPT-2 Model Download

### Checking TensorFlow Version


To ensure compatibility, especially when using Colab, it's important to verify which TensorFlow version is active:

```python
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)
```

The expected output should confirm that TensorFlow 1.x is selected, specifically
version 1.15.2. If you encounter any TensorFlow errors, rerun the cell, restart the VM,
and then rerun the cell again. This ensures the correct version is active, as the default
version on the VM is TensorFlow 2.x.

Downloading the GPT-2 Model (345M Parameter)


A pre-trained GPT-2 model with 345M parameters can be downloaded for further
use. This involves importing the os module and changing the directory to the location
where GPT-2 is installed.


import os
os.chdir("/content/gpt-2")

The model directory path is /content/gpt-2/models/345M.

This image shows the file directory where the GPT-2 model is located. Inside the 345M
folder, you'll find the following files:

checkpoint
encoder.json
hparams.json
model.ckpt.data-00000-of-00001
model.ckpt.index
model.ckpt.meta
vocab.bpe

These files are crucial for the GPT-2 model to function correctly. The encoder.json and
vocab.bpe files contain tokenized vocabulary pairs. The checkpoint file stores the
trained parameters and is accompanied by:

model.ckpt.meta: Describes the graph structure of the model, including GraphDef


and SaverDef. This can be retrieved using
tf.train.import_meta_graph([path]+'model.ckpt.meta').

Intermediate Steps Before Activating the Model


Printing UTF Encoded Text


To ensure proper text encoding, especially when printing to the console, set the
PYTHONIOENCODING environment variable to UTF-8:

!export PYTHONIOENCODING=UTF-8

Project Source Code Directory


Navigate to the source code directory to interact with the GPT-2 model:

import os
os.chdir("/content/gpt-2")

Interactive Conditional Samples (src)


To prepare for interacting with the model, import necessary modules from the
/content/gpt-2/src/interactive_conditional_samples.py file:

import json
import numpy as np
import tensorflow as tf

These steps are essential for defining and activating the model.

Important Modules for Interacting with the Model
Before diving into the activation of the GPT-2 model, it's crucial to understand the
roles of three key modules located in /content/gpt-2/src:


model.py: Defines the model's structure.


sample.py: Generates samples and refines token meaning.
It uses a temperature variable to sharpen probability values, enhancing
higher probabilities and softening lower ones.
It can activate Top-k sampling, which sorts and filters the probability
distribution, excluding low-quality tokens.
It can also activate Top-p sampling, selecting high-probability words until
the sum of their probabilities exceeds a threshold p.
encoder.py: Encodes sample sequences using encoder.json and vocab.bpe,
containing both a BPE encoder and a text decoder.
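
To make the temperature, Top-k, and Top-p ideas described above concrete, here is an illustrative NumPy sketch of temperature scaling plus Top-k filtering; it is not OpenAI's actual sample.py code, and the logits are invented:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0):
    logits = np.array(logits, dtype=np.float64) / temperature   # temperature scaling
    if top_k > 0:
        cutoff = np.sort(logits)[-top_k]                        # keep only the k highest logits
        logits[logits < cutoff] = -1e10                         # mask the rest
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                        # softmax over remaining logits
    return np.random.choice(len(probs), p=probs)                # draw a token id

token_id = sample_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=2)
print(token_id)
```

Lower temperatures sharpen the distribution toward the most probable tokens; Top-k removes low-quality candidates before sampling.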

Interactive Conditional Samples Explained


The interactive_conditional_samples.py script initializes essential information for
interacting with the model:

Hyperparameters: Defined in model.py.


Sample sequence parameters: Defined in sample.py.
Encoding and decoding: Uses functions from encode.py.
Checkpoint data: Restored from the downloaded GPT-2 model (Step 5).

It also defines several parameters for generating text:

models_dir: The directory containing the models (e.g., '/content/gpt-2/models').


seed: A random integer for reproducible results.
nsamples: The number of samples to return. Set to 0 to generate samples
continuously.
batch_size: Affects memory and speed.
length: The number of tokens of generated text. If set to None, it relies on the
model's hyperparameters.
temperature: Controls the randomness of completions. High values yield more
random results, while low values produce more deterministic outputs.
top_k: Controls the number of tokens considered by Top-k sampling. A value of
0 means no restrictions; 40 is recommended.
top_p: Controls Top-p sampling.

For the interactive program, example parameters include:


model_name: "345M"
seed: None
nsamples: 1
batch_size: 1
length: 300
temperature: 1
top_k: 0

Interacting with GPT-2

Activating the Interactive Model


To interact with the GPT-2 model, use the interact_model function:

interact_model('345M', None, 1, 1, 300, 1, 0, '/content/gpt-2/models')

This will prompt you to enter some context. For example, you can use a sentence by
Emmanuel Kant:

Human reason, in one sphere of its cognition, is called upon to consider questions,
which it cannot decline, as they are presented by its own nature, but which it cannot
answer, as they transcend every faculty of the mind.

The output will be generated based on this context.

Observations
The entered context conditions the output generated by the model.
The model learns from the context without modifying its parameters.
Text completion is conditioned by transformer models without fine-tuning.
The grammatical structure of the output is usually convincing.

Training a GPT-2 Language Model

Setting Up the Training Environment


To train a GPT-2 model on a custom dataset, use the Training_OpenAI_GPT_2.ipynb


notebook. This involves several steps:

1. Activate GPU: Ensure GPU acceleration is enabled in Colab.

2. Upload Python Files: Upload the following files to Google Colaboratory using
the file manager:

train.py
load_dataset.py
encode.py
accumulate.py
memory_saving_gradients.py

These files can be sourced from N Shepperd's GitHub repository or the book's
GitHub repository.

3. Upload Dataset: Upload dset.txt to Google Colaboratory. This dataset can be replaced with your own customized inputs.

Initial Steps of the Training Process


Follow these initial steps:

1. Clone OpenAI's GPT-2 Repository:

!git clone https://github.com/openai/gpt-2

2. Install Requirements:

import os
os.chdir("/content/gpt-2")
!pip3 install -r requirements.txt
!pip install toposort

Checking TensorFlow Version


%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

Restart the VM and rerun the cell to ensure you are running the VM with TensorFlow
1.x.

Downloading the 117M Parameter GPT-2 Model

import os
os.chdir("/content/gpt-2")
!python download_model.py '117M'

Copying the N Shepperd Training Files

import os
!cp /content/train.py /content/gpt-2/src/
!cp /content/load_dataset.py /content/gpt-2/src/
!cp /content/encode.py /content/gpt-2/src/
!cp /content/accumulate.py /content/gpt-2/src/
!cp /content/memory_saving_gradients.py /content/gpt-2/src/

Encoding the Dataset


Before training, the dataset must be encoded using encoder.py. This involves loading
the dataset with load_dataset.py:

from load_dataset import load_dataset
import encoder

enc = encoder.get_encoder(args.model_name, models_dir)
chunks = load_dataset(enc, args.in_text, args.combine, encoding=args.encoding)

The encoded dataset is saved as a NumPy array in a .npz archive:


import numpy as np
np.savez(args.out_file, *chunks)

Run the following cell to encode the dataset:

import os
model_name = "117M"
!python /content/gpt-2/src/encode.py dset.txt out.npz

Training the Model


To train the GPT-2 117M model, run the following cell:

import os
os.chdir("/content/gpt-2/src/")
!python train.py --model_name=117M

The training will continue until you manually stop it, with checkpoints saved in
/content/gpt-2/src/checkpoint/run1 after every 1,000 steps.

Creating a Training Model Directory


Create a temporary directory for the model and copy the trained parameters:

import os
run_dir = '/content/gpt-2/models/tgmodel'
if not os.path.exists(run_dir):
os.makedirs(run_dir)

Copying Training Files


!cp /content/gpt-2/src/checkpoint/run1/checkpoint /content/gpt-2/models/tgmodel
!cp /content/gpt-2/src/checkpoint/run1/model-1000.data-00000-of-00001 /content/gpt-2/models/tgmodel
!cp /content/gpt-2/src/checkpoint/run1/model-1000.index /content/gpt-2/models/tgmodel
!cp /content/gpt-2/src/checkpoint/run1/model-1000.meta /content/gpt-2/models/tgmodel

Copying OpenAI GPT-2 117M Model Files

!cp /content/gpt-2/models/117M/hparams.json /content/gpt-2/models/tgmodel


!cp /content/gpt-2/models/117M/vocab.bpe /content/gpt-2/models/tgmodel

Renaming the Model Directories

import os
!mv /content/gpt-2/models/117M /content/gpt-2/models/117M_OpenAI
!mv /content/gpt-2/models/tgmodel /content/gpt-2/models/117M

Generating and Interacting with the Trained Model

Generating Unconditional Samples


To generate unconditional samples from the trained model, run:

import os
!python generate_unconditional_samples.py --model_name=117M

This will produce text without any context input.

Interactive Context and Completion Examples


To interact with the trained model using context, run:

import os
!python interactive_conditional_samples.py --model_name=117M

Enter a context, such as the Emmanuel Kant paragraph from before, and observe the
generated text.

Text-to-Text Format and T5 Model


The Google T5 team introduced a clever solution to standardize input formats for
various NLP tasks: adding a prefix to the input sequence. This prefix contains the
essence of the task the transformer needs to solve.

Examples of T5 Prefixes
"translate English to German: + [sequence]" for translations
"cola sentence: + [sequence]" for The Corpus of Linguistic Acceptability (CoLA)

This unified input format leads to a transformer model that produces a result
sequence no matter which problem it has to solve in the Text-To-Text Transfer
Transformer (T5).

The T5 model unifies the input and output of many NLP tasks.
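
As a minimal illustration (not the book's code; it assumes the publicly available `t5-small` checkpoint), the same model and the same `generate()` call can be pointed at different tasks just by changing the prefix:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# the prefix tells the model which task to solve; the rest of the pipeline is identical
for prompt in ["translate English to German: The house is wonderful.",
               "cola sentence: We yelled ourselves hoarse."]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```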

T5 Model Architecture
The T5 model utilizes the original Transformer model architecture.


Key components of the T5 model:

Self-attention: Order-independent, operates on sets.


Positional encoding: Uses relative position embeddings instead of adding
arbitrary positions to the input.
Positional embeddings: Shared and re-evaluated through all layers.

Text Summarization with T5


Let's explore how to use T5 for text summarization with Hugging Face's framework.
Hugging Face provides three primary resources: models, datasets, and metrics.

Hugging Face Resources


Hugging Face offers:

Models
Datasets: Used for training and testing.
Metrics

Available T5 Models on Hugging Face


Base: Baseline model, similar to BERTBASE with 12 layers and ~220 million
parameters.
Small: Smaller model with 6 layers and ~60 million parameters.
3B and 11B: Use 24 layer encoders and decoders with ~2.8 and 11 billion
parameters, respectively.

This image demonstrates how to use a model directly from the Transformers library.

Initializing the T5-Large Transformer Model


First, install the necessary libraries:

pip install transformers==4.0.0


pip install sentencepiece==0.1.94

Here is how to import the T5-large conditional generation model and the tokenizer:

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5-large')
tokenizer = T5Tokenizer.from_pretrained('t5-large')
device = torch.device('cpu')

Exploring the Architecture


If display_architecture is set to True, the configuration of the model will be displayed:


if (display_architecture==True):
print(model.config)

This will output the model's basic parameters, such as the number of heads and
layers. The T5 transformer in this case has 16 heads and 24 layers.

{
"early_stopping": true,
"length_penalty": 2.0,
"num_beams": 4
}

The JSON snippet shows a configuration for a summarization task, including settings
for early stopping, length penalty, and beam search.

Vocabulary size is also an important parameter, with "vocab_size": 32128.

Additionally, you can inspect the model's architecture:

if(display_architecture==True):
print(model)

This allows you to examine the encoder and decoder stacks, attention sub-layers, and
feedforward sub-layers.


The image visually represents a PyTorch neural network architecture, detailing the
layers and their configurations, including self-attention mechanisms and feed-forward
networks.

Summarizing Documents
To create a summarization function:

def summarize(text, ml):
    preprocess_text = text.strip().replace("\n", "")
    task_prefix = "summarize: " + preprocess_text
    input_ids = tokenizer.encode(task_prefix, return_tensors="pt").to(device)
    summary_ids = model.generate(input_ids,
                                 num_beams=4,
                                 no_repeat_ngram_size=2,
                                 min_length=30,
                                 max_length=ml,
                                 early_stopping=True)
    output = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return output

The summarize function preprocesses the input text, applies the T5 task prefix
"summarize: ", encodes the text to token IDs, generates a summary, and decodes the
output.


Summarization Examples
Example Usage with the Declaration of Independence:

text = """The United States Declaration of Independence was the first Etext
summary = summarize(text, 50)
print("Number of characters:",len(text))
print ("\n\nSummarized text: \n",summary)

Observations
T5 may sometimes shorten the input text instead of providing a comprehensive
summary. This highlights the challenges NLP models face with certain texts. To
improve results, consider using longer texts, different parameters, larger models, or
modifying the T5 model's structure.

Best Practices for Datasets and Tokenizers

Key Quality Controls for Datasets

Step 1: Preprocessing


Sentences with punctuation marks


Remove bad words
Remove code
Language detection
Removing references to discrimination
Logic check

Step 2: Post-processing

Check input text in real-time


Real-time messages
Language conversions

Continuous Human Quality Control


Human intervention remains essential. Train a transformer, implement it, control the
output, and feed significant results back into the training set to continuously improve
the model.

Word2Vec Tokenization
Polysemy is when a word can have several meanings.

Sometimes, pretrained tokenizers miscalculate word pairs because some word pairs
just don't fit together.


Tokenizing Text and Training Word2Vec Models
We will explore tokenizing text and training a Word2Vec model. Tokenizing datasets
that are irrelevant or have critical words missing can confuse embedding algorithms
and produce "poor results," especially in strategic AI projects.

Word2Vec Tokenization
Let's start by tokenizing text from text.txt and training a Word2Vec model.

First, we install the prerequisites:

#@title Pre-Requisistes
!pip install --upgrade gensim
import nltk
nltk.download('punkt')
import math
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
import gensim
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action = 'ignore')

The dataset (text.txt) contains the American Declaration of Rights, the Magna Carta,
and the works of Emmanuel Kant.


#@title Word2Vec Tokenization


#'text.txt' file
data = []
f = open('text.txt').read()
f = f.replace("\n", " ")
for i in sent_tokenize(f):
    temp = []
    for j in word_tokenize(i):
        temp.append(j.lower())
    data.append(temp)

# Creating the Skip-Gram model
model2 = gensim.models.Word2Vec(data, window=5, sg=1, size=512)
print(model2)

Here, window=5 limits the distance between the current and predicted word, and sg=1
uses the skip-gram training algorithm. The output shows a vocabulary size,
embedding dimensionality of 512, and a learning rate of 0.025:
Word2Vec(vocab=10816, size=512, alpha=0.025).

Cosine Similarity Function


Now, we'll create a cosine similarity function named similarity(word1, word2) to
compute the cosine similarity between two words.


#@title Cosine Similarity


def similarity(word1, word2):
    cosine = False  # default value
    try:
        a = model2[word1]
        cosine = True
    except KeyError:  # raised if the word is not in the vocabulary
        print(word1, ":[unk] key not found in dictionary")
        return 0      # False implied

    # compute cosine similarity
    if (cosine == True):
        b = model2[word2]
        dot = np.dot(a, b)
        norma = np.linalg.norm(a)
        normb = np.linalg.norm(b)
        cos = dot / (norma * normb)
        aa = a.reshape(1, 512)
        ba = b.reshape(1, 512)
        cos_lib = cosine_similarity(aa, ba)
    if (cosine == False):
        cos_lib = 0
    return cos_lib

This function returns the cosine similarity value. Cosine similarity ranges between -1 and 1; for the word pairs in these examples, the values fall between 0 and 1. If a word is unknown, the function prints a KeyError message and returns 0.

Case Studies: Cosine Similarity Examples


Let's explore several cases to understand how this function works.

Case 0: Words in Text and Dictionary

#@title Case 0: Words in text and dictionary


word1="freedom";word2="liberty"
print("Similarity",similarity(word1,word2),word1,word2)

This case is acceptable, but results may vary based on the dataset's content, size, and
Gensim versions.

Case 1: Words Not in Text or Dictionary


#@title Word(s) Case 1: Word not in text or dictionary


word1="corporations";word2="rights"
print("Similarity",similarity(word1,word2),word1,word2)

In this case, the word "corporations" is not in the dictionary, leading to an unknown
token [unk] and a similarity score of 0.

Case 2: Noisy Relationships

word1="etext";word2="declaration"
print("Similarity",similarity(word1,word2),word1,word2)

The cosine similarity exceeds 0.8. While seemingly good, "etext" refers to Project
Gutenberg's preface and could produce erroneous natural language inferences.

Case 3: Rare Words

#@title Case 3: Rare words


word1="justiciar";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)

Rare words can be medical, legal, or engineering terms, slang, or words from older
texts. Managing rare words is crucial for applications beyond trivial uses.

Case 4: Replacing Rare Words


We can replace rare words for specific tasks. For instance, tracing "justiciar" to its Norman French origin, related to "judiciaire":

word1="judiciaire";word2="judgement"
print("Similarity",similarity(word1,word2),word1,word2)


Or, using the modern replacement "judge":

word1="justiciar";word2="judge"
print("Similarity",similarity(word1,word2),word1,word2)

Case 5: Entailment Verification

#@title Case 5: Entailment


word1="pay";word2="debt"
print("Similarity",similarity(word1,word2),word1,word2)

Matching Pretrained Tokenizers with NLP Tasks


Let's examine how well pretrained tokenizers match with NLP tasks.

First, load the necessary files.

#@title Step 11: Generating Unconditional Samples


import os # import after runtime is restarted
os.chdir("/content/gpt-2/src")
!python generate_unconditional_samples.py --model_name '117M'

Next, control the tokenized data.

# @title Additional Tools : Controlling Tokenized Data


# Unzip
import zipfile
with zipfile.ZipFile('/content/gpt-2/src/out.npz', 'r') as zip_ref:
zip_ref.extractall('/content/gpt-2/src/')

# Load arr_0.npy which contains encoded dset


import numpy as np
f=np.load('/content/gpt-2/src/arr_0.npy')
print(f)
for i in range(0,10):
print(f[i])


# We first import encoder.json


import json
with open("/content/gpt-2/models/117M/encoder.json", "r") as read_file:
developer = json.load(read_file) #converts the encoded data into Python
i=0
for key, value in developer.items(): #we parse the decoded json data
i+=1
if(i>10):
break;
print(key, ":", value)

# @title Step 12: Interactive Context and Completion Examples


import os # import after runtime is restarted
os.chdir("/content/gpt-2/src")
!python interactive_conditional_samples.py --temperature 0.8 --top_k 50 --model_name '117M'

T5 Bill of Rights Sample


Next, we will summarize a sample from the Bill of Rights using the T5 transformer.

#Bill of Rights,V
text ="""No person shall be held to answer for a capital, or otherwise infa
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)

Since the grammatical structure of the Bill of Rights is outdated, it helps to modernize
it:

text =""" A person must be indicted by a Grand Jury for a capital or infamo
print("Number of characters:",len(text))
summary=summarize(text,50)
print ("\n\nSummarized text: \n",summary)

Key Takeaways


Pretraining transformer models on vast amounts of random web crawl data teaches the transformer language.
A transformer also needs to be trained on specific topics to become a specialist in that field.
AI specialists will be needed for quite some time!

It doesn't matter if the dataset is enormous. If the tokenization process fails, even
partly, the transformer model we are running will miss critical tokens.

Questions
1. False: A tokenized dictionary does not contain every word that exists in a
language.
2. True
3. False: It is not always good to have obscene data in datasets.
4. True
5. False: A standard pretrained tokenizer does not contain the English vocabulary
of the past 700 years.
6. True
7. True
8. True

BERT Model Configuration


You can directly access the BERT model configuration section of the chapter to
understand the usage parameters. These parameters define how the BERT model
operates, influencing its ability to process and understand language.

Here are some key parameters:


BertForMaskedLM: Specifies that the BERT model is used for masked language modeling.
attention_probs_dropout_prob: 0.1
hidden_act: "gelu"
hidden_dropout_prob: 0.1
hidden_size: 768
initializer_range: 0.02
intermediate_size: 3072
layer_norm_eps: 1e-12
max_position_embeddings: 512
model_type: "bert"
num_attention_heads: 12
num_hidden_layers: 12
pad_token_id: 0
type_vocab_size: 2
vocab_size: 30522

Predicate Identification Format


The input format for predicate identification, as defined by Shi and Lin, demonstrates
the advancements transformers have made in understanding language in a
standardized way:

[CLS] Marvin walked in the park.[SEP] walked [SEP]

[CLS] indicates the start of a classification task.


The first [SEP] marks the end of the sentence.
The second [SEP] is followed by the predicate.

This format is sufficient for training a BERT model to identify and label roles in a
sentence.
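
As a quick, illustrative check (it assumes the standard `bert-base-uncased` tokenizer from Hugging Face, which is not necessarily the exact checkpoint AllenNLP ships), passing the sentence and the predicate as a text pair reproduces this input format:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# sentence as the first segment, predicate as the second segment
encoded = tokenizer("Marvin walked in the park.", "walked")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# expected shape of the output (token splits may vary slightly):
# ['[CLS]', 'marvin', 'walked', 'in', 'the', 'park', '.', '[SEP]', 'walked', '[SEP]']
```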

Setting up the BERT SRL Environment


To set up the BERT Semantic Role Labeling (SRL) environment, follow these steps:

1. Open SRL.ipynb, install AllenNLP, and execute each sample.


2. Display the raw output of the SRL run.
3. Visualize the output.


This chapter is self-contained, allowing you to read through it or run the samples as
described.

Basic Samples

Sample 1
Consider the following sentence:

Did Bob really think he could prepare a meal for 50 people in only a few
hours?

The transformer identifies the verb "think". The raw output excerpt shows:

"verbs": [{"verb": "think", "description": "Did [ARG0: Bob] really] [V: thi

When run in the AllenNLP online tool, it provides a visual representation of the SRL
task.

Taking a closer look, the simple BERT-based transformer:

Identified the verb.


Detected an adverb and labeled it.

The transformer then moved to the verb "prepare," labeled it, and analyzed the
context:


The transformer:

Identified the noun and labeled it as an argument.


Correctly related both arguments to the verb "prepare."
Identified "in only a few hours" as a temporal modifier of "prepare."
Recognized "could" as a modal modifier indicating the verb's modality.

The AllenNLP text output summarizes the analysis:

think: Did [ARG0: Bob] [ARGM-ADV: really] [V: think] [ARG1: he could prepare a meal for 50 people in only a few hours] ?
could: Did Bob really think he [V: could] prepare a meal for 50 people in only a few hours ?
prepare: Did Bob really think [ARG0: he] [ARGM-MOD: could] [V: prepare] [ARG1: a meal for 50 people] [ARGM-TMP: in only a few hours] ?

Sample 2
Consider the sentence:

Mrs. and Mr. Tomaso went to Europe for vacation and visited Paris and first went to visit the Eiffel Tower.

To test this, run the following code in the Sample 2 cell of the SRL.ipynb notebook:

!echo '{"sentence": "Mrs. And Mr. Tomaso went to Europe for vacation and vi
allennlp predict https://storage.googleapis.com/allennlp-public-models/ be

The transformer correctly identified the verbs in the sentence:

"verbs": [{"verb": "went", "description": "[ARG0: Mrs. and Tomaso] [V: went


It correctly identified the purpose of the trip as the modifier of the verb "went" and
associated "went" with "Europe." It also identified the verb "visit" as being related to
"Paris".

The transformer correctly split the sequence and produced an excellent result:

It found that "first" was a temporal modifier of the verb "went." The AllenNLP
interface provides the following output:

went: [ARG0: Mrs. and Mr. Tomaso] [V: went] [ARG4: to Europe] [ARGM-PRP: for vacation] and visited Paris and first went to visit the Eiffel Tower
visited: [ARG0: Mrs. and Mr. Tomaso] went to Europe for vacation and [V: visited] [ARG1: Paris] and first went to visit the Eiffel Tower
went: [ARG0: Mrs. and Mr. Tomaso] went to Europe for vacation and visited Paris and [ARGM-TMP: first] [V: went] [ARG1: to visit the Eiffel Tower]

Sample 3
Consider the sentence:

John wanted to drink tea, Mary likes to drink coffee but Karim drank some
cool water and Faiza would like to drink tomato juice.

This sample contains the verb "drink" four times.


To test this, run the following code in the Sample 3 cell of the SRL.ipynb notebook:

!echo '{"sentence": "John wanted to drink tea, Mary likes to drink coffee b
allennlp predict https://storage.googleapis.com/allennlp-public-models/ be

The transformer's output contains the verbs:

"verbs": [{"verb": "wanted," "description": "[ARG0: John] [V: wanted] [ARG1


{"verb": "likes," "description": "John wanted to drink tea , [ARG0: Mary]
{"verb": "drank," "description": "John wanted to drink tea , Mary likes to
{"verb": "would," "description": "John wanted to drink tea , Mary likes to

When run on the AllenNLP online tool, the first representation is perfect, identifying
the verb "wanted" correctly:

However, the verb "like" has some issues:

The presence of "some cool water and" is not an argument of like, only "Faiza" is. The
AllenNLP output confirms the problem:


wanted: [ARG0: John] [V: wanted] [ARG1: to drink tea] , Mary likes drink coffe
drink: [ARG0: John] wanted to [V: drink] [ARG1: tea] , Mary likes drink coffee
likes: John wanted to drink tea , [ARG0: Mary] [V: likes] [ARG1: to drink coff
drank: John wanted to drink tea , Mary likes to drink coffee but [ARG0: Karim]
would: John wanted to drink tea , Mary likes to drink coffee but Karim drank s
like: John wanted to drink tea , Mary likes to drink coffee but Karim drank [A
drink: John wanted to drink tea , Mary likes to drink coffee but Karim drank s

One of the arguments for the verb "like" is "Karim drank some cool water and Faiza,"
which is confusing.

Difficult Samples

Sample 4
Consider the complex sentence:

Alice, whose husband went jogging every Sunday, liked to go to a dancing
class in the meantime.

To test this, run the following code:

!echo '{"sentence": "Alice, whose husband went jogging every Sunday, liked
allennlp predict https://storage.googleapis.com/allennlp-public-models/bert

The model correctly identifies "liked":

[ARG0: Alice , whose husband went jogging every Sunday] , [V: liked]

The transformer first identifies Alice's husband:


The verb "jogging" was identified and related to "whose husband" with the temporal
modifier "every Sunday." The transformer then detects:

The argument describing Alice is long but correct:

[ARG0: Alice , whose husband went jogging every Sunday] , [V: liked] [ARG1: to go to a dancing class in the meantime]

The temporal modifier "in the meantime" was also identified. Finally, the transformer
identifies the last verb, "dancing":

The AllenNLP output confirms the analysis:

went: [ARG1: whose husband] [V: went] [ARG2: jogging] [ARGM-TMP: every Sunday]
liked: [ARG0: Alice , whose husband went jogging every Sunday] , [V: liked] [ARG1: to go to a dancing class in the meantime]
go: [ARG0: Alice , whose husband went jogging every Sunday] , liked to [V: go] [ARG4: to a dancing class] [ARGM-TMP: in the meantime]
dancing: Alice , whose husband went jogging every Sunday , liked to go to a [V: dancing] class in the meantime

Sample 5
Consider the sentence:

The bright sun, the blue sky, the warm sand, the palm trees, everything
round off.

To test this, run the following code:


!echo '{"sentence": "The bright sun, the blue sky, the warm sand, the palm
allennlp predict https://storage.googleapis.com/allennlp-public-models/bert

The transformer fails to find the verb:

"words": ["The", "bright", "sun", ",", "the", "blue", "sky", ",", "the", "w

Let's change the sentence to the present tense:

!echo '{"sentence": "The bright sun, the blue sky, the warm sand, the palm
allennlp predict https://storage.googleapis.com/allennlp-public-models/bert

The transformer finds the predicate:

"verbs": [{"verb": "rounds", "description": "[ARG1: The bright sun, the blu

Here is the visual explanation:

Sample 6
Consider the sentence:

Now, ice pucks guys!


To test this, run the following code:

!echo '{"sentence": "Now, ice pucks guys!"}' | \


allennlp predict https://storage.googleapis.com/allennlp-public-models/bert

The transformer fails to find the verb:

"verbs": [], "words": ["Now", ",", "ice", "pucks", "guys", "!"]}

Summary
SRL tasks are difficult for both humans and transformer models.
Transformers can reach human baselines.
A transformer trained with a "sentence + predicate" input can solve simple and
complex problems.
The limits are reached with rare verb forms.
The Allen Institute for AI has made many free AI resources available, emphasizing that explaining AI is essential.
Transformers will continue to improve NLP standardization through distributed
architecture and input formats.

NER Method

Using NER to find questions


To use NER to help find ideas for good questions, initialize the pipeline with the NER
task to perform with the default model and tokenizer:

nlp_ner = pipeline("ner")

Using the sequence:


The traffic began to slow about five miles out of Los Angeles, making it
difficult to get onto Pioneer Boulevard. WBGO was playing some cool jazz,
and the weather was cool, making it rather pleasant to be making it out of
the city on this Friday afternoon. Nat King Cole was singing as Jo and Maria
slowly made their way out of Pasadena and drove toward Barstow. They
planned to get to Las Vegas early in the evening to have a nice dinner and
go see a show.

Run the nlp_ner cell in QA.ipynb:

print(nlp_ner(sequence))

The output:

[{'word': 'Los', 'score': 0.99, 'entity': 'I-LOC', 'index': 11}, {'word':

Hugging Face uses the following labels:

I-PER: Name of a person


I-ORG: Name of an organization
I-LOC: Name of a location

The result is correct. Barstow is also correctly identified.

Here is the AllenNLP representation of the sequence:


Question Answering with Transformers

Named Entity Recognition (NER) and Question Generation


The lecture explores techniques for generating questions automatically from text using Named Entity Recognition (NER). The example sequence contains locations such as "Pioneer Boulevard" (LOC), "Los Angeles" (LOC), "Barstow" (LOC), and "Las Vegas" (LOC), as well as persons such as "Nat King Cole," "Jo," and "Maria" (PER). The recognized entities are categorized into types such as LOC (location) and PER (person).

Questions are generated based on these entities.


Examples include questions related to locations and persons.

Applying Heuristics to NER Output


Heuristics are applied to the output of NER to create questions.

Heuristics are methods or rules of thumb used to solve problems or make


decisions, often based on experience or incomplete information.

The process involves:


1. Merging the locations: Combining adjacent location entities.


2. Applying a template: Using a predefined structure to form questions.

For example, the NER output might be:

I-LOC, Pioneer Boulevard


I-LOC, Los Angeles
I-LOC, LA
I-LOC, Las Vegas

Templates like "Where is [I-LOC]?" or "Where is [I-LOC] located?" are then used to
generate questions automatically.

Examples of Automatically Generated Questions


Based on the NER output and templates, questions are automatically created:

Where is Pioneer Boulevard?


Where is Los Angeles located?
Where is LA?
Where is Barstow?
Where is Las Vegas located?

Using Pipelines for Question Answering


The pipeline function from the transformers library can be used for question
answering. For example:

nlp_qa = pipeline('question-answering')
print("Question 1.", nlp_qa(context=sequence, question='Where is Pioneer Boulevard?'))
print("Question 2.", nlp_qa(context=sequence, question='Where is Los Angeles located?'))

The output shows the score, start, and end positions of the answer, as well as the
answer itself.

Project Management in Transformer Applications


The lecture discusses how to manage transformers together with hard-coded functions. Several project levels are examined:

1. Easy Project:
Creating a website for an elementary school.
Displaying the answers to automatically generated questions on a
webpage.
Allowing a teacher to finalize a multiple-choice questionnaire.
2. Intermediate Project:
Encapsulating the transformer's automatic questions and answers in a
program.
Using an API to check and correct the answers automatically.
Storing wrong answers for further analysis.
3. Difficult Project:
Implementing an intermediate project in a chatbot with follow-up
questions.
Example: If the transformer identifies that Pioneer Boulevard is in Los
Angeles, the chatbot could ask, "near where in LA?"

Addressing Transformer Mistakes


The lecture gives an example of where the transformer makes a mistake when trying
to answer: nlp_qa(context=sequence, question='Who drove to Las Vegas?')

The transformer is honest enough to vary from one calculation to the next but it is
clear that the transformer faced issues with the person entity questions. It is possible
to see what went wrong and find an explanation. The sequence is run on AllenNLP in
the Semantic Role Labeling section to obtain a visual representation.

Here is an example of how the sentence "they drove slowly toward Barstow" can be
broken down using semantic role labeling. This particular diagram focuses on the
verb "drove".


The diagram shows the verb "drove" with its arguments: "they" (ARG0), "slowly"
(manner), and "toward Barstow" (ARG1).

Question-Answering with ELECTRA


The lecture introduces the ELECTRA transformer model, which improves upon the
Masked Language Modeling pretraining method used in BERT.

ELECTRA uses a generator network to introduce plausible alternatives instead


of random tokens.
The model is trained as a discriminator to predict whether a masked token was
generated or original.

ELECTRA's architecture and hyperparameters are similar to BERT. The ELECTRA


model can be implemented using the following code:

nlp_qa = pipeline('question-answering', tokenizer='google/electra-small-generator', model='google/electra-small-generator')

Project Management Constraints and Solutions


The lecture discusses the challenges and options when the ELECTRA transformer model does not produce the expected results. There are three main options, among other solutions:

1. Train DistilBERT and ELECTRA or other models with additional datasets.


2. Try ready-to-use transformers, although they might not fit your need, such as
the Hugging Face model.
3. Use SRL to extract the predicates and their arguments.


Using Semantic Role Labeling (SRL) to Find Questions


Semantic Role Labeling (SRL) is used to extract predicates and their arguments,
which can then be used to generate questions automatically.

AllenNLP is used to rerun the sequence in the Semantic Role Labeling demo. The
BERT-base model identifies predicates in the sequence. For example: verbs={"began,"
"slow," "making"(1), "playing," "making"(2), "making"(3), "singing," "made," "drove,"
"planned," go," see"}

Here is an example of how SRL can be applied to the verb "slow":

Rules for Automatic Question Generation


The lecture outlines steps for an automatic question generator:

Run NER automatically.


Parse the results with classical code.
Generate entity-only questions.
Filter the results with rules.
Generate SRL-only questions using the NER results to determine the template
to use.

Examples of Applying SRL to Verbs


SRL is applied to verbs like "playing" to generate questions. For example, the SRL output for the sequence contains: "The traffic began to slow down on Pioneer Boulevard in Los Angeles, making it difficult to get out of the city ... [ARG0: WBGO] was [V: playing] [ARG1: some cool jazz]".
The following question is generated: What is playing?

The default pipeline provides a satisfactory answer:


{'answer': 'cool jazz,', 'end': 153, 'score': …, 'start': 143}

Coreference Resolution
Coreference resolution is introduced as a method to help models identify the main
subjects in a sequence. This can be added as a pretraining or postprocessing task.

Exploring Haystack with a RoBERTa Model


Haystack is introduced as a question-answering framework with interesting
functionality.

Key Takeaways
Question-answering isn't as easy as it seems initially.
Designing a question generator is a productive solution.
NER and SRL are useful for finding and extracting content.
Implementing transformers requires well-prepared multi-task training and
heuristics implemented in classical code.

Review Questions


1. A question generator is a bad way to produce questions. (True/False)


Answer: False
2. NER can recognize a location and label it as I-LOC. (True/False)
Answer: True
3. Implementing question answering does not require project management.
(True/False)
Answer: False

Sentiment Analysis with Transformers

Introduction
Sentiment analysis relies on the principle of compositionality. The lecture explores
how transformer models handle sentiment analysis, especially with complex
sentences.

Compositionality is the principle that the meaning of a complex expression


is determined by the meanings of its constituent parts and the rules that
combine them.

The Stanford Sentiment Treebank (SST)


The Stanford Sentiment Treebank (SST) is introduced as a resource for analyzing
complex sentences. It provides datasets with sentences that are challenging to
analyze.

For example, understanding a sentence that mentions Jacques Derrida requires


understanding compositionality:


Sentiment Analysis with RoBERTa-large


The lecture uses AllenNLP resources to run a RoBERTa-large transformer. The steps
include:

1. Installing allennlp and allennlp-models.


2. Running the RoBERTa-large model on a sample sentence.

The output displays the architecture of the RoBERTa-large model, the output logits,
the tokens themselves, and the final output label.

Predicting Customer Behavior with Sentiment Analysis


The lecture discusses using transformer models to predict customer behavior based
on sentiment analysis.

Sentiment Analysis with DistilBERT


The lecture guides the reader through setting up a DistilBERT model using the
transformers library in python.


1. Install the transformers library:

!pip install -q transformers


from transformers import pipeline

2. Create a classify function to perform sentiment analysis:

def classify(sequence, M):
    nlp_cls = pipeline('sentiment-analysis')
    if M == 1:
        print(nlp_cls.model.config)
    return nlp_cls(sequence)

The classify function outputs a sentiment analysis prediction:

[{'label': 'NEGATIVE', 'score': 0.9997098445892334}]

Here is an image of the config for the DistilBERT model:


Predicting Customer Behavior Based on Sentiment Analysis

Several conclusions can be drawn from this result to predict customer behavior by writing a function (a minimal sketch follows the list below) that would:

Store the predictions in the customer management system.


Count the number of times a customer complains about a service or product in a
given period (week, month, year).
Detect the products and services that keep occurring in negative feedback
messages.
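
Below is a minimal, hypothetical sketch of such a helper; the `record_feedback` function, the customer IDs, and the in-memory counter are all invented for illustration and would be replaced by a real customer management system:

```python
from collections import defaultdict
from transformers import pipeline

nlp_cls = pipeline('sentiment-analysis')          # same DistilBERT-based pipeline as above
negative_counts = defaultdict(int)                # customer_id -> number of negative messages

def record_feedback(customer_id, message):
    # store the prediction and count negative feedback per customer
    prediction = nlp_cls(message)[0]
    if prediction["label"] == "NEGATIVE":
        negative_counts[customer_id] += 1
    return prediction

record_feedback("C-001", "The delivery was late and the box was damaged.")
print(dict(negative_counts))
```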

Transformer Models for Text Classification

Hugging Face and Pretrained Models

The image above depicts the Hugging Face website, a platform for natural language
processing and machine learning models. Hugging Face allows you to search for and
test various pretrained models. The default sort mode is based on the number of
downloads. Let's explore some transformer models for text classification.

Analyzing Complex Sequences


When dealing with complex sequences, transformer models can sometimes produce
unexpected results. Consider the following sentence:


Though the customer seemed unhappy, she was, in fact, satisfied but
thinking of something else at the time, which gave a false impression.

This sentence is challenging for a transformer to analyze and requires logical compositionality.

The image above shows the output of a complex sequence classification task. The
model may produce a false negative, which doesn't necessarily indicate a
malfunction. It might suggest the need for a different model or further training.

Trying Different Models


Several models can be tested for sentiment analysis, including:

MiniLM-L12-H384-uncased: On this sequence, this model produces a cautious, nearly split score.

RoBERTa-large-mnli: Useful for entailment tasks, helping to determine


sequence relationships. Requires specific input formatting with sequence
splitting tokens:

Though the customer seemed unhappy</s></s> she was, in fact, satisfied but thinking of something else at the time, which gave a false impression.
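
A hedged sketch of this entailment check with the `roberta-large-mnli` checkpoint (the exact premise/hypothesis split below is an assumption made for illustration; the model's labels are CONTRADICTION, NEUTRAL, and ENTAILMENT):

```python
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

premise = "Though the customer seemed unhappy"
hypothesis = "she was, in fact, satisfied but thinking of something else at the time"

# the </s></s> separator joins the two sequences as described above
print(nli(f"{premise}</s></s>{hypothesis}"))
```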

Multilingual Sentiment Analysis


Transformer models also support multilingual sentiment analysis. For instance, the
nlptown/bert-base-multilingual-uncased-sentiment model can be used for sentiment
analysis in multiple languages, such as English and French.


English Example: Demonstrates sentiment analysis in English.


French Example: Demonstrates sentiment analysis in French.

You can find this model on the Hugging Face website or implement it using the
following code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-u
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-ba
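
The loaded model and tokenizer can then be wrapped in a pipeline. The example sentences below are invented for illustration; this particular model returns a star rating from "1 star" to "5 stars":

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("This was a great experience."))        # English
print(classifier("Ce produit est vraiment décevant."))   # French
```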

Key Considerations for Sentiment Analysis


Computational Resources: Sentiment analysis requires significant transformer
model training, powerful machines, and human resources.
Task Specificity: Models often need additional training for specific tasks, even if
pretrained.
Experimentation: Try different tasks on the same model or the same task on
different models to find the best fit.
Real-World Applications: Sentiment analysis can be used to improve customer
relations by detecting dissatisfaction and anticipating problems.

Cognitive Dissonance
Cognitive Dissonance: The mental discomfort experienced when holding
conflicting beliefs, values, or attitudes.

This state arises when tensions build between contradictory thoughts, leading to
nervousness and agitation.

Example Scenario: Consider conflicting information around COVID-19, such as the effectiveness of masks or the safety of vaccines. Conflicting information exacerbates cognitive dissonance.

