ASSIGNMENT-COM-802(B)
INDEX

S.No. | Question
01 | Analyze and Compare PixelCNN and PixelRNN in terms of model architecture, training complexity, and output quality.
02 | Analyze and Compare the architecture and use cases of LSTM and GRU. Which would you choose for a real-time text prediction system and why?
03 | Explain the concepts of self-attention and multi-head attention. How do they contribute to the success of Transformers?
04 | Describe the architecture of GPT (any version) and its training objective. How is it different from BERT?
05 | Describe the motivation behind Transformers. Why are they preferred over RNNs for language tasks?
Q1. Analyze and Compare PixelCNN and PixelRNN in terms of model architecture, training
complexity, and output quality.
Ans: PixelCNN is a type of autoregressive generative model developed by researchers at Google
DeepMind, primarily used for modeling images pixel by pixel. It was introduced as an improvement over
PixelRNN, offering a more parallelizable and computationally efficient approach while maintaining
high-quality image modeling.
PixelCNN models the joint distribution of pixel values in an image using a product of conditional
distributions:
P(x) = \prod_{i=1}^{n} P(x_i \mid x_1, x_2, \ldots, x_{i-1})
Here, each pixel xi is conditioned on all previous pixels (usually in raster scan order: left to right, top to
bottom).
The PixelCNN model treats an image as a sequence of dependent pixels. Using an autoregressive approach, it applies the chain rule to represent the joint distribution of pixels:
p(x) = p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{<i})
This means each pixel is conditionally dependent on the previous ones, with the first pixel modeled unconditionally. Inference is sequential: pixels are generated one at a time, each new pixel depending on the ones generated before it. This sequential generation contrasts with training, where the masked convolutions let the network evaluate all the conditionals in parallel across the entire image.
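To make the sequential inference concrete, here is a minimal sketch of a raster-order sampling loop in PyTorch. The `model` interface used here (returning 256-way logits per pixel and channel) is an assumption of this illustration, not the actual DeepMind implementation.

import torch

@torch.no_grad()
def sample_image(model, height, width, channels=1, device="cpu"):
    """Generate one image pixel by pixel in raster-scan order.

    `model` is assumed to map an image batch of shape (1, C, H, W) to
    logits of shape (1, 256, C, H, W): one 256-way distribution per
    pixel and channel. This interface is a simplifying assumption.
    """
    img = torch.zeros(1, channels, height, width, device=device)
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                logits = model(img)[0, :, c, y, x]      # 256 logits for this position
                probs = torch.softmax(logits, dim=0)
                img[0, c, y, x] = torch.multinomial(probs, 1).item() / 255.0
    return img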
Architecture
Figure 1.1 shows the PixelCNN model as implemented; h is the number of hidden channels and d is the number of output hidden channels.
Figure 1.1 (PixelCNN on the left, residual block on the right)
As illustrated, the first convolution layer uses a mask of type 'A' (masks are described below), meaning the center pixel of the mask is zeroed, so the model is guaranteed not to see the pixel it is about to predict. The reason is straightforward: if the pixel to be predicted were connected to the network, the easiest way to predict its value in the last layer would be to copy it through (think of setting the center weight to one and all others to zero). Zeroing the center pixel in the first-layer mask breaks this shortcut and forces the model to predict each pixel from the previously seen pixels only.
1. Masked Convolutions: Central to PixelCNN is the use of masked convolutions to ensure causal
modeling, i.e., a pixel is only influenced by pixels above it and to its left. Two types of masks
are used (a sketch of a masked convolution follows this list):
Mask A: used in the first layer, to prevent the layer from looking at the current pixel.
Mask B: used in subsequent layers, to allow the current pixel's features but still block future pixels.
Fig 1.2 Masking
2. Stack of Convolutional Layers:
Typically, a stack of masked 2D convolutions is used.
Each filter has a narrow receptive field; stacking more layers enlarges the overall receptive field.
3. Residual and Skip Connections: Added to enable deeper networks and improve training
stability and convergence.
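As a rough sketch of how such masks can be implemented (assuming a PyTorch-style setup and ignoring the per-channel R/G/B masking used in the full model):

import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """2D convolution whose kernel is masked so a pixel never sees
    itself (mask 'A') or any pixel below/right of it (both masks)."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones_like(self.weight)                      # (out, in, kH, kW)
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0   # center row: zero to the right (A also zeroes center)
        mask[:, :, kH // 2 + 1:, :] = 0                          # zero all rows below center
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask                            # re-apply mask before every conv
        return super().forward(x)

# Example: first layer uses mask 'A', later layers use mask 'B'.
first = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
later = MaskedConv2d("B", in_channels=64, out_channels=64, kernel_size=3, padding=1)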
PixelRNN is a type of autoregressive generative model designed for modeling images pixel by pixel. It
was introduced in the paper "Pixel Recurrent Neural Networks" by Aaron van den Oord et al., 2016.
PixelRNN learns the joint distribution of pixels in an image, using RNNs to model the dependencies
between pixels in a sequential manner. PixelRNN models the image as a sequence of pixels and predicts
each pixel conditioned on all previously generated pixels (usually in a raster-scan order: left to right, top
to bottom). For color images, the color channels (e.g., R, G, B) of each pixel are also predicted
sequentially.
Input Representation
● An image of size H × W with 3 color channels (RGB) is modeled as a sequence of H × W steps.
● Each pixel's channel values are modeled as discrete variables (typically 256 categories per channel).
Masked Convolutions
- To prevent the model from seeing future pixels during training, masking is used. PixelRNN ensures that the prediction of a pixel depends only on previously seen pixels.
Recurrent Layers
There are two main types of recurrent layers used in PixelRNN:
1. Row LSTM: A unidirectional LSTM that processes the image one row at a time. It can model dependencies along the row but has limited vertical context.
2. Diagonal BiLSTM: Processes the image along its diagonals, enabling better vertical and horizontal context sharing. It uses a clever reordering and reshaping of the image to allow for parallel computation.
(Fig 1.3)
Pixel-by-pixel Prediction: The model outputs a softmax distribution over 256 possible values for each color channel. Predictions are autoregressive: Red is predicted first, then Green conditioned on Red, and Blue conditioned on both.
Output
- For each pixel, the model outputs a distribution over possible values. The model can be sampled sequentially to generate new images.
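A hedged sketch of this channel-by-channel sampling for a single pixel; `channel_logits` is a hypothetical helper standing in for the network's per-channel output head, and is an assumption of this illustration:

import torch

def sample_pixel(channel_logits, img, y, x):
    """Sample R, then G given R, then B given R and G, for pixel (y, x).

    `channel_logits(img, y, x, c)` is a hypothetical callable returning a
    tensor of 256 logits for channel c, conditioned on `img` so far.
    """
    for c in range(3):                                   # 0 = R, 1 = G, 2 = B
        probs = torch.softmax(channel_logits(img, y, x, c), dim=-1)
        value = torch.multinomial(probs, num_samples=1)  # draw one of 256 levels
        img[c, y, x] = value.item() / 255.0              # write back so later channels see it
    return img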
Training Complexity Comparison

Aspect | PixelCNN | PixelRNN
Parallelization | Highly parallelizable: convolutions can be computed across the entire image simultaneously (except for the masked parts). | Limited parallelization: recurrent layers (LSTMs) are inherently sequential, especially across rows or diagonals.
Computation per Step | Lower: convolutions are efficient on GPUs. | Higher: LSTMs require more computation per step due to gating mechanisms.
Training Speed | Faster: better GPU utilization and batching. | Slower: RNN-based models are harder to batch efficiently.
Memory Usage | Lower: convolutions require less memory overhead per step. | Higher: RNNs maintain hidden states for each pixel/row/diagonal.
Gradient Propagation | Easier: shorter gradient paths in feedforward networks. | Harder: long sequences can lead to vanishing/exploding gradients.
Hardware Efficiency | GPU-optimized: well suited to modern accelerators. | Less efficient on GPUs due to sequential dependencies.
Implementation Simplicity | Simpler to implement and optimize using standard CNN frameworks. | More complex due to custom RNN operations and masking logic.
Table 1.1
Output Quality Comparison

Aspect | PixelCNN | PixelRNN
Modeling Capacity | Good, but limited by the receptive field shape. | Higher: RNNs can model long-range dependencies more naturally.
Dependency Modeling | Local (horizontal and vertical via masked convolutions). | Global (better at capturing pixel-to-pixel relationships across the image).
Image Sharpness | Sometimes produces slightly blurrier or less coherent textures. | Tends to generate sharper and more coherent images due to better context modeling.
Consistency Across Regions | Can struggle with globally coherent structures (e.g., large objects). | Better at modeling large structures (e.g., shapes, patterns).
Color and Texture Detail | Good, especially in later versions (e.g., Gated PixelCNN). | Slightly better at fine details due to sequential prediction of each pixel/channel.
Sample Diversity | High, though it may miss some rare structures. | High, with potentially better coverage of complex structures.
Performance on Benchmarks | Slightly worse (higher) NLL (negative log-likelihood) than the RNN variants in some comparisons. | Often achieves better log-likelihood due to better context modeling.
Table 1.2
Q2: Analyze and Compare the architecture and use cases of LSTM and GRU. Which
would you choose for a real-time text prediction system and why?
Ans: LSTM - Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997) and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem: remembering information for long periods of time is practically their default behavior, not something they struggle to learn. All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
Architecture -
The architecture of an LSTM (Long Short-Term Memory) network is a specialized type of Recurrent
Neural Network (RNN) designed to handle the vanishing gradient problem in traditional RNNs, making it
great for learning long-term dependencies.
Step-by-step architecture of an LSTM cell:
1. Input to the LSTM Cell
Each LSTM cell receives three inputs:
● x_t: the input at the current time step t.
● h_{t−1}: the hidden state from the previous time step.
● C_{t−1}: the cell state from the previous time step.
2. Forget Gate (f_t) - Decides what information to discard from the cell state.
Equation:
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
● σ is the sigmoid activation function.
● The output is a vector with values between 0 and 1.
● A value of 0 means "completely forget"; a value of 1 means "completely keep".
3. Input Gate (i_t) - The input gate adds useful information to the cell state. It first regulates the inputs h_{t−1} and x_t using a sigmoid function, which determines which values should be updated. It then creates a candidate vector using the tanh function, producing values between −1 and +1 from h_{t−1} and x_t. Finally, the gate values and the candidate vector are multiplied to obtain the useful information to be added to the cell state.
Equation:
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
4. Cell State Update (C_t) - Updates the cell state using the forget gate f_t, the input gate i_t, and the candidate vector C̃_t.
Equation:
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
5. Output Gate (o_t) - The output gate extracts useful information from the current cell state to produce the hidden state. It first creates a vector by applying the tanh function to the cell state, then filters it using a sigmoid function computed from the previous hidden state h_{t−1} and the current input x_t. Finally, the two are multiplied and passed on as the output (and as the hidden state for the next cell).
Equation:
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)
Figure 2.1
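The gate equations above can be written out directly. Below is a minimal, unoptimized LSTM cell sketch in PyTorch that mirrors them; the use of one linear layer per gate is an implementation choice of this illustration.

import torch
import torch.nn as nn

class SimpleLSTMCell(nn.Module):
    """One LSTM step, written to mirror the gate equations above."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map per gate, applied to the concatenation [h_{t-1}, x_t].
        self.W_f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
        self.W_i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.W_c = nn.Linear(input_size + hidden_size, hidden_size)  # candidate
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        f_t = torch.sigmoid(self.W_f(hx))        # what to forget from C_{t-1}
        i_t = torch.sigmoid(self.W_i(hx))        # what new information to admit
        c_tilde = torch.tanh(self.W_c(hx))       # candidate cell content
        c_t = f_t * c_prev + i_t * c_tilde       # cell state update
        o_t = torch.sigmoid(self.W_o(hx))        # what to expose as output
        h_t = o_t * torch.tanh(c_t)              # new hidden state
        return h_t, c_t

# One step with batch size 2, input size 8, hidden size 16:
cell = SimpleLSTMCell(8, 16)
h, c = torch.zeros(2, 16), torch.zeros(2, 16)
h, c = cell(torch.randn(2, 8), h, c)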
GRU (Gated Recurrent Unit) : is a type of recurrent neural network (RNN) designed to solve the
vanishing gradient problem and improve long-term dependencies in sequential data. It is similar to an
LSTM (Long Short-Term Memory) network but with a simpler architecture.
In a GRU, there are two main gates:
1. Update Gate: Determines how much of the previous memory should be carried forward.
2. Reset Gate: Decides how much of the past information to forget before updating the memory.
These gates allow the GRU to effectively control the flow of information and maintain important features
over time, making it particularly useful for tasks like natural language processing, speech recognition, and
time series forecasting. Unlike LSTMs, GRUs combine the forget and input gates into a single update
gate, which simplifies their implementation and computation while maintaining similar performance.
Step-by-step architecture of GRU:
1. Inputs to the GRU Cell - xt: current input and ht−1: previous hidden state
2. Update Gate (z_t) - Controls how the previous hidden state and the new candidate state are blended.
Equation:
z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)
If z_t is close to 0, most of the old hidden state is kept; if it is close to 1, the state is updated mostly with new information.
3. Reset Gate (r_t) - Controls how much of the previous hidden state to forget when computing the candidate state.
Equation:
r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)
4. Candidate Hidden State (h̃_t) - This is the new content that could be added to the hidden state.
Equation:
\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)
Here, r_t ⊙ h_{t−1} allows selective forgetting of the past.
5. Final Hidden State (h_t) - The final hidden state is a blend of the old state and the new candidate.
Equation:
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
If z_t is high, the result favors the new candidate h̃_t; if z_t is low, it retains more of h_{t−1}.
Figure 2.2
Comparison of Architecture

Aspect | LSTM | GRU
Gates | 3 gates: forget, input, output | 2 gates: update, reset
Memory | Uses a separate cell state and hidden state | Combines memory and hidden state into a single hidden state
Complexity | More complex: more parameters and operations | Simpler: fewer parameters, faster to train
Long-Term Memory | Better for longer sequences due to the explicit memory cell | Performs well, but may not retain very long dependencies as effectively
Training Time | Slower due to extra gates and operations | Faster due to fewer computations
Table 2.1
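The "fewer parameters" point can be checked directly with PyTorch's built-in layers: for the same input and hidden sizes, an LSTM has four gate blocks where a GRU has three, so the GRU is roughly 25% smaller. The sizes below are illustrative.

import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)

print("LSTM parameters:", count_params(lstm))  # 4 * 256 * (128 + 256 + 2) = 395,264
print("GRU parameters: ", count_params(gru))   # 3 * 256 * (128 + 256 + 2) = 296,448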
Comparison of Use Cases

Use Case | LSTM | GRU
Long Sequence Modeling | Excellent for very long sequences (e.g., long texts, music generation) | Good, but may struggle with extremely long dependencies
Real-Time Applications | Slower; more suited for offline or batch processing | Faster; better for real-time tasks like live speech recognition
Text Generation / Language Modeling | Popular choice due to strong memory capabilities | Competitive performance, especially when speed is more important
Speech Recognition | Used in systems requiring high accuracy | Widely used in mobile/embedded speech applications (e.g., Google Voice)
Time-Series Forecasting | Performs well on financial, weather, or sensor data with long patterns | Works well, especially when quick training and less data are factors
Hardware Constraints | Heavier on memory and computation | Lighter; better for deployment on edge devices (phones, IoT, etc.)
Table 2.2

For a real-time text prediction system, a GRU would generally be the better choice: it has fewer parameters, lower latency per step, and a smaller memory footprint, which matters when predictions must be produced as the user types, while its accuracy on typical text prediction workloads is close to that of an LSTM. An LSTM would be preferable only if the system had to model very long contexts and latency were not critical.
Q3: Explain the concepts of self-attention and multi-head attention. How do they
contribute to the success of Transformers?
Ans: Self Attention - Self-attention is a mechanism that allows a model to weigh the importance of
different words in a sequence when encoding each word. In other words, every word gets to "look at"
every other word and decide how much to pay attention to them.
In traditional sequence models (like RNNs), the understanding of a word heavily depends on its
neighbors. But language is more complex than that. Sometimes, the meaning of a word depends on
something far away in the sentence.
Example:
"The cat that the dog chased was scared."
When processing "was scared", it's helpful to know "the cat" is the subject — even though it's several
words away. Self-attention helps capture that long-range dependency.
Working - Consider a sentence of n words. Each word is represented as a vector (called an embedding). Self-attention computes a new representation for each word based on the entire sentence.
Each word vector is used to create three new vectors:
Query (Q) – The Query is a vector that represents the word (or token) currently being processed,
essentially asking the question: "How much should I pay attention to other words in the sequence?"
Each token in the sequence has its own query, which is derived from the input (usually via a learned
weight matrix).
Key (K) – The Key is another vector associated with each word in the sequence, representing how relevant each word is to the query. The key essentially holds the "signature" of each token that can be compared to the query to determine how similar or relevant it is to the current token's focus. Like the query, each token has its own key.
Value (V) – The Value is a vector that holds the actual information or content associated with the token. After comparing the queries to the keys, the values corresponding to the most relevant keys are weighted and used to produce the output representation for the query token. The values can be thought of as the content that gets passed along after attention is applied.
Then, for each word:
1. Compare the query of this word with the keys of all words (including itself).
2. Get attention scores (how much focus to place on each word).
3. Turn those scores into weights using softmax.
4. Multiply those weights by the value vectors of all words.
5. Sum them up: the weighted sum is the new representation of the word (a minimal sketch of these steps follows).
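A minimal sketch of these steps as scaled dot-product attention (the 1/\sqrt{d_k} scaling follows the Transformer paper; the toy dimensions and weight shapes are assumptions of this illustration):

import math
import torch

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # create queries, keys, values
    scores = Q @ K.T / math.sqrt(K.shape[-1])    # steps 1-2: compare each query with every key
    weights = torch.softmax(scores, dim=-1)      # step 3: attention weights per word
    return weights @ V                           # steps 4-5: weighted sum of value vectors

# Toy usage: 5 "words" with d_model = d_k = 8.
x = torch.randn(5, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)           # (5, 8): one new vector per word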
Figure 3.1
Multi-Head Attention - Multi-head attention extends the idea of self-attention by running multiple
self-attention operations in parallel, each with different parameters. This allows the model to focus on
different parts of the sequence simultaneously and capture various relationships between tokens at
different levels of abstraction.
Here’s how multi-head attention works:
1. Multiple Attention Heads: Instead of calculating a single set of query, key, and value vectors for
self-attention, multiple sets of queries, keys, and values are generated. Each set corresponds to a
different attention "head."
2. Independent Attention Computation: Each attention head performs its own attention
calculation (i.e., computes its own attention scores and weighted sum of values) independently.
3. Concatenation: The results of all attention heads are concatenated together into a single vector.
Figure 3.2
4. Linear Transformation: Finally, the concatenated output is passed through a linear
transformation to generate the final output of the multi-head attention mechanism (a short usage sketch follows).
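These four steps (multiple heads, independent attention, concatenation, final linear projection) are what PyTorch's built-in multi-head attention layer bundles together; a short usage sketch with illustrative dimensions:

import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8                 # 8 heads of size 64 / 8 = 8 each
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)            # batch of 2 sequences, 10 tokens each
# Self-attention: the same tensor plays query, key and value.
out, attn_weights = mha(x, x, x)
print(out.shape)                             # torch.Size([2, 10, 64])
print(attn_weights.shape)                    # torch.Size([2, 10, 10]), averaged over heads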
Contribution in success of Transformers
Self-attention and multi-head attention are fundamental to the success of transformers because they allow
the model to efficiently capture complex dependencies and relationships within input sequences,
regardless of their length. Unlike traditional RNNs and LSTMs, which process input sequentially and
struggle with long-range dependencies, transformers process the entire sequence at once, enabling faster
computation and better scalability. Self-attention allows each token in the sequence to dynamically focus
on relevant parts of the input, learning which tokens are important for understanding context. Multi-head
attention enhances this by allowing the model to attend to multiple aspects of the sequence
simultaneously, capturing a diverse set of relationships across the input. Together, these mechanisms
enable transformers to handle intricate patterns and contextual nuances in language, which is crucial for
tasks like machine translation, text generation, and question answering. Their parallelized nature also
leads to significant speed improvements, making transformers highly effective for large-scale tasks,
contributing to their widespread adoption and success in the field of natural language processing.
Q4: Describe the architecture of GPT (any version) and its training objective. How
is it different from BERT?
Ans: Architecture of GPT (Generative Pretrained Transformer)
GPT (Generative Pretrained Transformer) is a type of deep learning model based on the Transformer
architecture, which was first introduced by Vaswani et al. in 2017. This architecture revolutionized natural
language processing (NLP) because of its ability to capture long-range dependencies and handle
sequences of varying lengths. GPT specifically focuses on a decoder-only variant of the Transformer
architecture. Here’s a breakdown of its architecture:
1. Transformer Decoder Architecture
Layers: GPT consists of multiple stacked Transformer decoder layers. Each layer has two primary
components:
Self-Attention Mechanism: This allows the model to look at all previous tokens in the input sequence to
decide how much weight to give each token. The self-attention mechanism is what enables GPT to
understand long-range dependencies between words in a sentence.
Feed-Forward Neural Networks (FFNs): After the attention mechanism, the output is passed through a
series of fully connected layers (FFNs) that help capture more complex patterns in the data.
Residual Connections: Every Transformer layer includes residual connections around both the attention
and FFN sub-layers, allowing for more efficient training and preventing vanishing gradients.
Layer Normalization: Layer normalization is applied after each sub-layer (i.e., after the self-attention
and FFN layers) to stabilize training.
Positional Encoding: Since the Transformer architecture doesn’t inherently process sequences in order,
GPT adds positional encodings to the input embeddings to help the model understand the order of tokens.
2. Input Embeddings
Tokenization: GPT uses a tokenization process to convert text into numerical input. The input text is split into tokens (usually using methods like Byte Pair Encoding (BPE) or SentencePiece), which are then mapped to high-dimensional vectors (embeddings).
Embedding Layer: The embeddings are combined with positional encodings and then passed through the Transformer layers. This representation is learned during training.
3. Output Layer
Language Modeling Objective: The output from the final Transformer layer is passed through a linear
layer followed by a softmax layer to predict the probability distribution of the next token in the
sequence, given the preceding tokens.
Figure 4.1
Training Objective
GPT is trained using a causal language modeling objective, which is sometimes called autoregressive
language modeling. The main goal is to predict the next token in a sequence, given the tokens that
preceded it. Here's how it works:
1. Autoregressive Objective
GPT is trained to predict the probability distribution of the next token in the sequence, conditioned on the
previous tokens. For example, given a sequence of tokens like "The cat sat on the", GPT tries to predict
the next token (e.g., "mat").
Mathematically, the model maximizes the likelihood of the correct token at each position in the sequence:
L(\theta) = \sum_{t} \log P(w_t \mid w_1, w_2, \ldots, w_{t-1})
where w_t is the token at time step t, and the model predicts w_t based on the context of the previous tokens w_1, w_2, ..., w_{t−1}.
2. Unsupervised Pretraining
GPT is pretrained on vast amounts of text data in an unsupervised fashion. The model learns to predict the
next token by training on large corpora of text, such as books, websites, and other publicly available text
sources. This pretraining allows the model to learn a broad range of language patterns, grammar, facts, and
reasoning capabilities from the data.
3. Fine-tuning (Optional)
After pretraining, GPT can be fine-tuned on a smaller, more specific dataset for a particular task, such as
question answering, summarization, or translation. Fine-tuning typically involves training the model with
labeled data for a supervised learning task.
4. Optimization
The model is trained using stochastic gradient descent (SGD) or variants like Adam optimizer to
minimize the cross-entropy loss between the predicted probability distribution and the true token (next
word) in the sequence. This helps the model learn to predict text more accurately over time.
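A minimal sketch of this objective as code: next-token prediction is cross-entropy between the logits at position t and the token at position t+1. The `model` here is a hypothetical stand-in for a GPT-style decoder that returns logits of shape (batch, sequence length, vocabulary size).

import torch
import torch.nn.functional as F

def causal_lm_loss(model, tokens):
    """tokens: (batch, seq_len) integer token ids."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]      # predict token t+1 from tokens <= t
    logits = model(inputs)                               # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),             # flatten batch and time dimensions
        targets.reshape(-1),                             # the true "next" tokens
    )                                                    # mean negative log-likelihood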
Feature | GPT | BERT
Model Type | Decoder-only Transformer | Encoder-only Transformer
Architecture Direction | Unidirectional (left-to-right) | Bidirectional (considers context from both left and right)
Training Objective | Causal Language Modeling (predict the next token) | Masked Language Modeling (predict masked tokens) + Next Sentence Prediction
Context Usage | Only past tokens (causal attention) | Full sentence context (bidirectional attention)
Input Format | Single continuous text sequence | Can handle sentence pairs (e.g., for Q&A or sentence relationships)
Pretraining Tasks | Next word prediction | Masked Language Modeling (MLM), Next Sentence Prediction (NSP)
Popular Versions | GPT, GPT-2, GPT-3, GPT-4 | BERT, RoBERTa, DistilBERT, ALBERT
Output | Generates text/token sequences | Outputs contextual embeddings for each token
Best For | Text generation, summarization, creative tasks | Sentence classification, Q&A, NER, sentiment analysis
Table 4.1
Q5 : Describe the motivation behind Transformers. Why are they preferred over
RNNs for language tasks?
Ans: The motivation behind Transformers stems from the limitations of earlier sequence models like
Recurrent Neural Networks (RNNs) and their variants (like LSTMs and GRUs), especially when applied
to complex language tasks. Here’s a breakdown of the motivation and why Transformers are preferred.
Motivation
1. RNN Limitations:
Sequential Computation: RNNs process input one token at a time. This limits parallelization and
makes training slow.
Long-Term Dependencies: RNNs struggle to retain information over long sequences due to
vanishing or exploding gradients.
Fixed Memory Bottleneck: The hidden state must capture all prior information, which becomes
less effective for long contexts.
2. Need for Better Context Handling:
Language understanding often requires access to both nearby and distant words in a sentence
(e.g., subject-verb agreement, resolving ambiguity, etc.).
Traditional RNNs can miss these long-range dependencies, leading to poorer performance on
tasks like translation, summarization, and question answering.
Importance of Transformers:
Introduced in "Attention is All You Need" (Vaswani et al., 2017), Transformers address these limitations
through the self-attention mechanism, which brings several advantages:
1. Parallelization:
Unlike RNNs, Transformers process all tokens in a sequence simultaneously, not one-by-one. This
massively speeds up training and makes better use of modern hardware like GPUs and TPUs.
2. Self-Attention Mechanism:
Every token can directly “attend to” every other token in the sequence, regardless of position.
Allows the model to capture long-range dependencies more effectively than RNNs.
3. Scalability:
Transformers scale well with large data and model sizes, enabling the development of massive pre-trained
language models (e.g., BERT, GPT, T5).
4. Positional Encoding:
Since Transformers lack recurrence, they use positional encodings to retain information about token order,
allowing them to model sequence structure without sequential processing.
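For reference, the original Transformer uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^{2i/d_model}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}); a minimal sketch is below (many later models, including GPT, instead learn position embeddings).

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed positional encodings."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices
    angle = pos / torch.pow(10000.0, two_i / d_model)               # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                  # even indices: sine
    pe[:, 1::2] = torch.cos(angle)                                  # odd indices: cosine
    return pe

# These encodings are added to the token embeddings before the first layer.
pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)         # (50, 64)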
Feature | RNNs | Transformers
Processing | Sequential | Parallel
Handling Long Contexts | Weak | Strong via self-attention
Efficiency | Slow | Fast (especially during training)
Scalability | Limited | Highly scalable
Table 5.1