ASSIGNMENT II (MST II)
By
Sajid Khursheed Bhat
2021A1R175
8th Semester
Computer Science Engineering
Model Institute of Engineering & Technology (Autonomous)
(Permanently Affiliated to the University of Jammu, Accredited by NAAC with “A” Grade)
Jammu, India
2024
ASSIGNMENT II (MST II)
Subject Code: COM-801 (Generative AI)    Due Date: 22 May 2025
Question Number | Course Outcome | Bloom's Level | Maximum Marks | Marks Obtained
Q1 | CO 4 | Understanding | 4 |
Q2 | CO 4 | Analysing | 4 |
Q3 | CO 5 | Evaluating | 4 |
Q4 | CO 5 | Creating | 4 |
Q5 | CO 5 | Evaluating | 4 |
Total Marks: 20
Faculty Signature
Email: [email protected]
Assignment Objectives:
The assignment aims to deepen students' understanding of sequence modeling and language
generation using modern generative AI techniques. It focuses on the architecture and operational
mechanisms of Transformer models, emphasizing components such as self-attention, multi-head
attention, and positional encoding. Students will analyze the limitations of RNNs, LSTMs, and GRUs in
handling long-term dependencies and compare these with Transformer-based models. The assignment
also explores the design and application of pre-trained models like GPT for conditional text generation.
Additionally, it covers evaluation frameworks for generative models, introducing key metrics like
BLEU, ROUGE, METEOR, and Perplexity to assess model output quality across different NLP tasks.
Assignment Questions:
Q1 (BL: Understanding, CO 4, Marks: 4, Total: 4): Explain the key components of the Transformer architecture, including the role of self-attention and feed-forward layers. How does positional encoding contribute to the model?
Q2 (BL: Analysing, CO 4, Marks: 4, Total: 4): Compare RNN, LSTM, and GRU in terms of their ability to handle long-term dependencies. Where do they fall short in modeling context in language tasks?
Q3 (BL: Evaluating, CO 5, Marks: 4, Total: 4): Analyze and summarize how GPT and BERT differ in their model architecture and training objectives. Highlight practical use cases where each excels.
Q4 (BL: Creating, CO 5, Marks: 4, Total: 4): Draft a step-by-step pipeline for a conditional text generation system using a pre-trained transformer (like GPT). Mention how input prompts and decoding strategies affect output.
Q5 (BL: Evaluating, CO 5, Marks: 4, Total: 4): List and explain evaluation metrics used for generative text models (BLEU, ROUGE, METEOR, Perplexity). Which metric would you choose for summarization vs. dialogue generation tasks and why?
Question 1: Explain the key components of the Transformer architecture,
including the role of self-attention and feed-forward layers. How does positional
encoding contribute to the model?
The Transformer architecture represents a significant shift in how natural language processing
tasks are performed. Proposed by Vaswani et al. in 2017, it removed the need for recurrence or
convolution, which were previously central to sequence modelling. The Transformer is
composed of an encoder-decoder structure, although models like BERT and GPT typically use
only one of these components. In a standard Transformer, the encoder processes the input and
the decoder generates the output.
At the heart of this architecture is the self-attention mechanism, which enables the model to
weigh and relate all words in the input sequence simultaneously. For each token, the model
computes three vectors: Query (Q), Key (K), and Value (V). The attention score is calculated
as the dot product of the Query and Key, scaled and passed through a softmax function to
assign weights. These weights are applied to the Value vectors to generate a weighted sum.
This allows each word to "attend" to other words based on relevance. For instance, in the
sentence “The cat sat on the mat,” the word “sat” will strongly attend to “cat” and “mat” to
understand context.
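A minimal NumPy sketch of this scaled dot-product attention computation is given below; the random Q, K, and V matrices are illustrative stand-ins for the learned projections of token embeddings, not values from any real model.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # how strongly each query matches every key
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
    return weights @ V                                        # weighted sum of the value vectors

# toy example: 3 tokens with 4-dimensional Query/Key/Value vectors
np.random.seed(0)
Q, K, V = (np.random.randn(3, 4) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)            # (3, 4): one context-aware vector per token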
Multiple self-attention layers run in parallel (multi-head attention), allowing the model to learn
various semantic and syntactic relationships. After this, the output passes through a feed-forward
neural network which adds non-linearity and enables the model to learn complex
patterns. Each layer is wrapped with residual connections and layer normalization, which help
in gradient flow and training stability.
However, since Transformers do not process input sequentially like RNNs, they lack an
inherent sense of order. This is where positional encoding comes in. Positional encodings are
vectors added to input embeddings, derived from sine and cosine functions of different
frequencies. These encodings help the model understand the position of each word in a
sentence. Without positional encodings, “I love dogs” and “Dogs love I” would appear similar
to the model. Thus, positional encoding is a key enabler for preserving sequence information.
Following the self-attention block is the feed-forward layer, which is a fully connected neural
network applied independently to each position. It consists of two linear transformations with a
ReLU (or sometimes GELU) activation function in between. While self-attention enables the
model to exchange information between different positions in the sequence, the feed-forward
network provides a deeper transformation of each position’s representation, helping the model
generalize and learn complex patterns.
Positional encoding deserves a closer look. Because the model processes tokens in parallel, it
does not inherently know the position of a word in the sentence, so positional encodings are
added to the input embeddings to give the model information about the order of tokens. These
encodings are either fixed sinusoidal functions or learned positional vectors, added element-wise
to the word embeddings. This addition enables the model to distinguish between tokens based
not only on their identity but also on their position in the sequence. As a result, the model can
understand that in the sentence “He went home,” the word “home” comes after “went,” which is
crucial for understanding meaning.
In practice, positional encoding is a clever solution to the challenge of modeling sequential data
without using recurrence. The sinusoidal form has the advantage of being able to extrapolate to
longer sequences, while learned positional embeddings may adapt better to specific tasks.
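As an illustration, here is a small NumPy sketch of the sinusoidal form described above; the sequence length and embedding dimension are arbitrary example values.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # even dimensions use sin, odd dimensions use cos, at geometrically spaced frequencies
    positions = np.arange(max_len)[:, None]            # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)   # (50, 512): one vector per position, added element-wise to the word embeddings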
Another critical aspect of the Transformer architecture is the use of residual connections and
layer normalization. Each sub-layer in the Transformer is wrapped with a residual connection
followed by layer normalization. This means the output of each sub-layer is added to its input,
and then normalized. These techniques help stabilize training, speed up convergence, and allow
for the stacking of many layers without the vanishing gradient problem that typically affects
deep neural networks.
To illustrate this architecture simply, picture a single Transformer encoder layer: an input
embedding layer followed by positional encoding, then a self-attention block and a feed-forward
layer, each wrapped in residual connections and layer normalization. Stacking several such
layers produces a powerful architecture capable of handling complex language tasks.
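Below is a minimal PyTorch sketch of one such encoder layer, assuming PyTorch is installed; the dimensions (512-dimensional embeddings, 8 heads, a 2048-unit feed-forward layer) follow the original paper, and the random input tensor merely stands in for embedded, position-encoded tokens.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)     # self-attention: queries, keys, values all come from x
        x = self.norm1(x + attn_out)         # residual connection + layer normalization
        x = self.norm2(x + self.ff(x))       # position-wise feed-forward with its own residual
        return x

layer = EncoderLayer()
tokens = torch.randn(1, 10, 512)             # (batch, sequence length, embedding dimension)
print(layer(tokens).shape)                   # torch.Size([1, 10, 512])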
In summary, the Transformer model's key components—self-attention, multi-head attention,
feed-forward layers, and positional encodings—work together to enable powerful sequence
modeling without recurrence. The self-attention mechanism allows the model to focus on
relevant words regardless of their distance, while the feed-forward layers help refine these
representations. Positional encoding ensures the model captures word order, which is vital for
syntactic and semantic understanding. Together, these elements form the foundation of most
modern language models, including BERT, GPT, and many others that power applications
ranging from chatbots to translation systems.
Question 2: Compare RNN, LSTM, and GRU in terms of their ability to handle
long-term dependencies. Where do they fall short in modelling context in language
tasks?
Recurrent Neural Networks (RNNs) were among the first architectures used to model
sequential data such as text, speech, or time series. They process input one token at a time
while retaining a hidden state that captures prior information. However, RNNs struggle with
vanishing and exploding gradients during backpropagation through time (BPTT), especially
when sequences are long. This makes learning long-term dependencies extremely difficult,
which is a critical requirement for understanding language.
To overcome this, Long Short-Term Memory networks (LSTMs) were introduced. LSTMs add
a cell state and three gates—input, forget, and output—that regulate the flow of information.
The forget gate decides what to discard from the previous cell state, while the input gate
determines which new information to store. This gating system helps LSTMs maintain
information across longer sequences and mitigates the vanishing gradient issue.
Gated Recurrent Units (GRUs) are a simplified version of LSTMs. Instead of three gates,
GRUs use two: reset and update. The update gate decides how much past information to carry
forward, and the reset gate determines how much of the past to forget. GRUs are
computationally more efficient due to fewer parameters and are often preferred when training
resources or data are limited.
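A quick way to see the effect of the extra gates is to compare parameter counts of the three recurrent layers in PyTorch; the input and hidden sizes below are arbitrary example values.

import torch.nn as nn

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

# identical input and hidden sizes, so only the gating structure differs
for name, layer in [("RNN", nn.RNN(128, 256)),
                    ("LSTM", nn.LSTM(128, 256)),
                    ("GRU", nn.GRU(128, 256))]:
    print(name, count_parameters(layer))
# The LSTM has roughly 4x and the GRU roughly 3x the parameters of the vanilla RNN,
# reflecting their additional gates.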
Despite their advantages, both LSTM and GRU still process sequences sequentially, which
makes them slower to train compared to parallelizable architectures like Transformers.
Moreover, they are still limited in their ability to capture very long-range dependencies due to
their inherent step-by-step design. They may also suffer from memory compression, where too
much information gets squeezed into a fixed-size hidden state, causing loss of nuanced context.
In contrast, Transformers use attention mechanisms to directly relate any two words in a
sequence, regardless of their distance. This allows them to capture global dependencies and
bidirectional context more effectively. Therefore, while RNNs, LSTMs, and GRUs are good
for moderate-length sequences, they fall short when modeling long or complex textual data
compared to attention-based models.
Question 3: Analyse and summarize how GPT and BERT differ in their model
architecture and training objectives. Highlight practical use cases where each
excels.
Both GPT and BERT are derived from the Transformer architecture but are built and trained
with fundamentally different objectives and components. GPT (Generative Pre-trained
Transformer) is a decoder-only architecture trained in an autoregressive manner. This means it
learns to predict the next word in a sentence given the previous words. It processes data from
left to right, making it well-suited for tasks involving generation, such as text completion or
creative writing.
On the other hand, BERT (Bidirectional Encoder Representations from Transformers) uses
only the encoder part of the Transformer. It is trained using masked language modelling, where
random tokens in a sentence are masked, and the model learns to predict them using the context
from both left and right. This bidirectional nature allows BERT to better understand the full
context, making it ideal for tasks requiring deep comprehension, such as reading
comprehension, named entity recognition, and sentiment analysis.
Architecturally, GPT stacks decoder blocks with masked self-attention, preventing the model
from seeing future tokens. BERT uses encoder blocks with full self-attention, allowing every
token to attend to all others. As a result, GPT is inherently generative, while BERT is
discriminative and contextual.
In terms of use cases, GPT shines in creative writing, conversational agents, story generation,
and even code generation. It’s frequently used in chatbot backends for generating human-like
responses. BERT is ideal for classification tasks, such as spam detection or intent recognition,
and extractive question answering (e.g., identifying exact answers in a passage).
The training objectives also impact generalization. GPT is better at open-ended tasks, while
BERT is more precise and consistent in structured understanding. Hybrid models like T5 and
BART try to combine the best of both approaches.
Another key distinction between GPT and BERT lies in how they handle downstream
fine-tuning. BERT typically requires task-specific architecture augmentation for fine-tuning. For
instance, in sentence classification, a classification head is added on top of the [CLS] token
output. In question answering tasks, two additional layers are added to predict the start and end
tokens of the answer span. BERT’s versatility comes from its ability to be fine-tuned with
minimal data across a wide variety of supervised tasks.
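As a sketch of this fine-tuning setup, assuming the Hugging Face transformers library, the snippet below loads BERT with a freshly initialized two-class classification head on top of the pooled [CLS] representation; the example sentence and the number of labels are illustrative.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# attaches a randomly initialised classification head on top of the pooled [CLS] output;
# the head (and optionally the encoder) is then fine-tuned on labelled data
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This product is fantastic!", return_tensors="pt")
logits = model(**inputs).logits   # shape (1, 2); before fine-tuning these scores are essentially random
print(logits.shape)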
GPT, on the other hand, tends to perform well with few-shot or zero-shot learning, especially in
its later versions like GPT-2, GPT-3, and GPT-4. These models are so large and generalized
that they can adapt to new tasks with only a few examples given at inference time. This is
enabled by prompt engineering, where task instructions and examples are embedded into the
input prompt itself. For example, a prompt like: “Translate ‘Hello’ to French: Bonjour.
Translate ‘Thank you’ to French:” is enough for GPT to generate “Merci” without any
additional training. This makes GPT especially powerful in scenarios where labeled training
data is scarce or unavailable.
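A minimal sketch of this few-shot prompting idea, assuming the Hugging Face transformers library; the small public gpt2 checkpoint used here is only a stand-in and may not reliably produce "Merci", whereas larger GPT models handle such prompts far better.

from transformers import pipeline

# the task is demonstrated inside the prompt itself; no gradient updates are performed
generator = pipeline("text-generation", model="gpt2")
prompt = ("Translate 'Hello' to French: Bonjour.\n"
          "Translate 'Thank you' to French:")
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])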
Question 4: Draft a step-by-step pipeline for a conditional text generation system
using a pre-trained transformer (like GPT). Mention how input prompts and
decoding strategies affect output.
Conditional text generation refers to the process of producing coherent and relevant text based
on a specific input or condition. In this setup, the model does not generate text randomly but in
response to a prompt or context provided beforehand. This condition could be a sentence, a
phrase, a question, or even a structured instruction that guides the model toward a specific goal.
In recent years, pre-trained transformer models like GPT (Generative Pre-trained Transformer),
T5 (Text-to-Text Transfer Transformer), and BART have become the go-to choices for
implementing such systems. Among them, GPT is particularly popular due to its strong
performance in generating fluent and context-aware language across diverse tasks.
The process begins with identifying the task and the kind of output desired. Conditional
generation can be used for applications like storytelling, dialogue generation, summarization,
email drafting, code generation, and more. For example, if the objective is to create a product
review from a product description, the description becomes the input condition. This condition
is transformed into a prompt written in natural language. How the prompt is phrased plays a
crucial role in determining the quality of the output. For instance, a generic prompt such as
“Write” is vague and unhelpful. However, a prompt like “Write a short story about a dragon
who protects a village” provides a clear direction to the model. An even better approach might
be: “Story Prompt: A dragon guards a village. Continue the story:”, which sets the tone and
makes the model's task clear.
Once the input prompt is finalized, it is passed through a tokenizer, which breaks the text down
into smaller units called tokens. These tokens are then converted into numerical IDs, as the
transformer model can only process numerical data. Each token ID represents a sub-word or
character piece based on the model’s vocabulary. These token IDs are then fed into the
transformer’s input layer, initiating the generation process.
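For illustration, assuming the Hugging Face transformers library, tokenizing the story prompt above with the GPT-2 tokenizer looks roughly like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
prompt = "Story Prompt: A dragon guards a village. Continue the story:"
token_ids = tokenizer.encode(prompt)
print(tokenizer.convert_ids_to_tokens(token_ids))  # sub-word pieces from the model's vocabulary
print(token_ids)                                   # the numerical IDs actually fed to the transformer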
The text generation itself is an autoregressive process, meaning the model generates one token
at a time, using previously generated tokens as additional context. The quality and style of the
generated output depend largely on the decoding strategy chosen. The most basic method,
greedy decoding, always selects the token with the highest probability at each step. Although
this is fast and easy, it often leads to repetitive or overly simplistic text. To improve upon this,
beam search considers multiple possible continuations of a sentence and selects the most
promising path among them. While beam search provides more coherent results than greedy
decoding, it can still lack diversity.
To generate more creative and varied text, sampling-based methods are preferred. In top-k
sampling, the model selects the next token from a limited set of k most probable options,
introducing randomness while maintaining some control. Another popular method is top-p or
nucleus sampling, where the model dynamically selects tokens from the smallest possible set
whose combined probability exceeds a threshold p. This method is considered more flexible
and tends to produce more natural and diverse outputs. Additionally, temperature control can
be applied to influence the randomness of the model. A higher temperature (like 1.0 or above)
makes the output more unpredictable and creative, whereas a lower temperature (around 0.3 to
0.6) makes the text more focused and conservative.
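A hedged sketch of how these decoding strategies map onto the generate() method of the Hugging Face transformers library; the gpt2 checkpoint and the parameter values are illustrative choices, not recommendations.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Story Prompt: A dragon guards a village. Continue the story:",
                   return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=40, do_sample=False)               # greedy decoding
beam = model.generate(**inputs, max_new_tokens=40, num_beams=5, do_sample=False)    # beam search
sampled = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                         top_k=50, top_p=0.9, temperature=0.8)                      # top-k + nucleus sampling
print(tokenizer.decode(sampled[0], skip_special_tokens=True))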
After generating the token sequence, the model’s output is passed through a detokenizer, which
converts the numerical token IDs back into human-readable text. The final text may be
post-processed to remove special tokens, adjust punctuation, or format the response as needed. In
real-world systems, this output might be evaluated manually by users or automatically using
metrics like BLEU, ROUGE, METEOR, or newer ones like BERTScore or BLEURT.
To understand this pipeline better, consider an example where a company wants to automate
email replies. The incoming email serves as the condition. Suppose the user sends: “I would
like to schedule a meeting next week regarding the product update.” The prompt might then be
crafted as: “Email: I would like to schedule a meeting next week regarding the product update.
Reply:”, and the model could generate: “Thank you for reaching out. I’d be happy to schedule a
meeting. Please share your availability.” This is an excellent example of how conditional
generation can be practically applied.
Question 5: List and explain evaluation metrics used for generative text models
(BLEU, ROUGE, METEOR, Perplexity). Which metric would you choose for
summarization vs. dialogue generation tasks and why?
Evaluating generative models is challenging because multiple correct outputs can exist for a
single input. Therefore, evaluation relies on automatic metrics as well as human judgment.
Among automatic metrics, BLEU (Bilingual Evaluation Understudy) is a precision-based score
that measures n-gram overlap between generated and reference text. It is popular in machine
translation, but often penalizes valid paraphrases that do not use the same words.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is recall-based and widely used
in summarization. It measures how much content from the reference summary is retained in the
generated output. Variants like ROUGE-1, ROUGE-2, and ROUGE-L focus on unigram,
bigram, and longest common subsequence overlap, respectively.
METEOR improves over BLEU by accounting for synonyms, stemming, and word order. It
provides a more balanced and semantically aware evaluation, useful for tasks like dialogue
generation or paraphrasing.
Perplexity measures how well the model predicts the next word in a sequence. It is an internal
measure of fluency, where lower values indicate more confident predictions. However,
perplexity cannot assess content relevance or coherence and thus is insufficient on its own.
For summarization, ROUGE is preferred due to its emphasis on content coverage. For dialogue
generation, METEOR or human evaluation is better, as conversational quality depends more on
relevance, fluency, and diversity than word overlap.
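As a sketch, assuming the Hugging Face evaluate package (with its nltk and rouge_score dependencies) is installed, BLEU and ROUGE can be computed as follows; the candidate and reference sentences are made up for illustration.

import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["The dragon protected the village from the invaders."]
references = ["A dragon guarded the village against the attackers."]

# BLEU expects a list of reference strings per prediction; ROUGE accepts one string each
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))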
While traditional metrics like BLEU, ROUGE, and METEOR focus on surface-level text
similarity using n-gram overlaps, they often fall short when evaluating the semantic quality of
generated text. To address this, newer metrics like BERTScore and BLEURT have been
proposed, leveraging the power of pre-trained transformer models to assess meaning rather
than just word overlap.
BERTScore uses contextual embeddings from BERT or other similar models to compare each
token in the candidate sentence with tokens in the reference sentence. Instead of looking for
exact word matches, it computes cosine similarity between embeddings. This allows
BERTScore to reward semantically similar words, even if the exact wording differs. For
example, “The boy is running” and “The child is sprinting” would receive a high BERTScore
despite having no n-gram overlap. This metric is particularly useful for tasks like
summarization, paraphrasing, and dialogue systems where semantic fidelity is more important
than surface similarity.
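A minimal sketch of computing BERTScore through the same evaluate package, assuming its bert_score backend is installed; the sentence pair is the example from above.

import evaluate

bertscore = evaluate.load("bertscore")
result = bertscore.compute(
    predictions=["The boy is running."],
    references=["The child is sprinting."],
    lang="en",            # selects a default English model for the contextual embeddings
)
print(result["f1"])       # high despite zero n-gram overlap, because the embeddings are semantically close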
BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) goes one
step further. It is a learned evaluation metric trained on human judgment data. BLEURT
fine-tunes a BERT-like model to predict human evaluation scores. This enables it to detect nuances
like factual accuracy, grammar, fluency, and coherence better than rule-based metrics. Because
it mimics human scoring patterns, BLEURT has shown strong correlation with human
preferences in benchmark studies.
These modern metrics, however, are computationally intensive and may require GPU
acceleration. Despite this, they are becoming increasingly popular in research and industry
because they align better with human perception of text quality.
In conclusion, for summarization tasks, BERTScore or BLEURT can provide deeper insight
into the semantic adequacy of the output. For dialogue generation, where coherence and
contextual flow matter more than surface matching, BLEURT is a promising choice when the
computational resources to run it are available.