Unit 5 Short Notes
Recurrent Neural Networks: Introduction – Recursive Neural Networks – Bidirectional RNNs – Deep
Recurrent Networks – Applications: Image Generation, Image Compression, Natural Language
Processing. Complete Autoencoder, Regularized Autoencoder, Stochastic Encoders and Decoders,
Contractive Encoders.
Unfolding in Time: The feedback loop in the RNN is "unfolded" over time steps to make learning
possible using algorithms like backpropagation through time (BPTT).
o Inputs x1, x2, x3 are sequentially fed into the network.
o At each time step t, the hidden state ht is updated based on the current input xt and the
previous hidden state ht−1.
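A minimal NumPy sketch of this unfolding, assuming the usual tanh update (the names W_xh, W_hh, b_h and all sizes are illustrative, not taken from these notes):

```python
import numpy as np

def rnn_unroll(inputs, W_xh, W_hh, b_h):
    """Unfold a vanilla RNN over the input sequence; return hidden states h1..hT."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state h0
    hidden_states = []
    for x_t in inputs:
        # h_t depends on the current input x_t and the previous hidden state h_{t-1}
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Example: 3 time steps, 4-dim inputs, 5-dim hidden state
rng = np.random.default_rng(0)
xs = [rng.standard_normal(4) for _ in range(3)]
W_xh, W_hh, b_h = rng.standard_normal((5, 4)), rng.standard_normal((5, 5)), np.zeros(5)
hs = rnn_unroll(xs, W_xh, W_hh, b_h)
```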
Input vectors going into both forward and backward RNN layers.
Hidden states calculated at each time step in both directions.
Output is formed by combining both directions' hidden states.
Inputting a sequence
Dual Processing
Forward direction: Uses the current input and the previous hidden state (ht−1) to compute the
hidden state at time t.
Backward direction: Uses the current input and the next hidden state (ht+1) to compute the
hidden state at time t, going in reverse.
So, each time step's final hidden state combines context from both before and after the current step.
This mechanism gives the model memory: it can remember information from earlier or later steps.
A non-linear activation function
Applied to a weighted sum of:
o The hidden state at that step
o And some output-specific weights
The goal is to minimize the error between predicted and actual output
This is done using backpropagation through time (BPTT)
However, because forward and backward passes in a BRNN occur simultaneously, updating the weights
for the two processes may occur at the same time, which produces inaccurate outcomes. Thus, a BRNN is
trained so that the forward and backward passes are handled individually.
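A hedged sketch of this dual processing using PyTorch's built-in bidirectional flag (input size, hidden size, and sequence length are assumptions):

```python
import torch
import torch.nn as nn

# A bidirectional RNN: one pass reads the sequence left-to-right,
# the other right-to-left; their hidden states are concatenated per step.
birnn = nn.RNN(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(2, 7, 10)          # (batch, time steps, features)
outputs, h_n = birnn(x)

# Each time step's output holds both directions' hidden states:
# forward 16 dims + backward 16 dims = 32 dims.
print(outputs.shape)               # torch.Size([2, 7, 32])
```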
1. Sentiment Analysis
o Example: Understanding if a sentence expresses positive or negative emotion.
o Since the full sentence affects meaning, Bi-RNNs perform better by using both
directions to get the full sentiment.
2. Named Entity Recognition (NER)
o Task: Finding names of people, places, brands, etc., in a sentence.
o Bi-RNNs look before and after the target word to decide if it's a named entity.
o Example: In “Apple is releasing a new product,” the word “Apple” is identified as a
company.
3. Part-of-Speech Tagging
o Task: Labeling each word with its grammatical role (noun, verb, adjective, etc.).
o The same word can have different roles depending on context.
o Bi-RNNs use full sentence context to accurately tag each word.
4. Machine Translation
o Goal: Translate a sentence from one language to another.
o Bi-RNNs are used in the encoder part of encoder-decoder architectures.
o The encoder reads the sentence in both directions (forward and backward) to get full
context.
o This helps the decoder generate more accurate translations.
5. Speech Recognition
o Bi-RNNs help understand speech better by considering both what was said before and
what’s coming next.
o They analyze the audio signal in both directions to recognize words and meaning more
effectively.
o Useful in systems like Siri, Google Assistant, etc.
Disadvantages of Bi-RNNs
1. Computational Complexity
o Bi-RNNs double the processing by using both forward and backward passes.
o This leads to higher memory use and longer processing time, making them expensive to
run.
2. Long Training Time
o Because Bi-RNNs have more parameters than standard RNNs, they take longer to train.
o This is especially true for large datasets or deep Bi-RNN networks.
3. Difficulty in Parallelization
o Unlike models like Transformers, RNNs (including Bi-RNNs) process inputs sequentially.
o This makes it hard to run them in parallel, which slows down training and inference.
4. Overfitting
o With so many parameters, Bi-RNNs can overfit easily, especially on small datasets.
o Overfitting means the model does well on training data but poorly on new, unseen data.
5. Interpretability
o Since Bi-RNNs process data in both directions, it's hard to explain what's going on
inside.
o This makes it challenging to debug or understand the reasons behind specific
predictions.
Advantage | Disadvantage
Uses full context (past + future) | High computational cost
Better accuracy | Slow training
Handles variable-length sequences | Difficult to parallelize
Robust to noise | Risk of overfitting
Great for NLP tasks | Hard to interpret
RNNs: These networks handle sequential data, where each output depends on previous steps
(temporal dependency).
Applications: Ideal for tasks such as:
o Natural Language Processing (NLP)
o Time series prediction
o Speech recognition
Structure:
o DRNs go beyond traditional RNNs by stacking multiple recurrent layers, allowing the
network to learn more complex patterns.
o Each layer passes its output to the next layer, building hierarchical representations of
the data.
Advantages:
o Handle long-range dependencies better than shallow RNNs.
o Perform better in complex tasks like language modeling and machine translation.
Types of Recurrent Units Used in DRNs:
1. Vanilla RNNs:
o The simplest form.
o Compute output based only on the current input and previous hidden state.
o Can suffer from vanishing gradient problems in long sequences.
2. Long Short-Term Memory (LSTM):
o Designed to overcome the limitations of vanilla RNNs.
o Uses gating mechanisms (input, forget, output gates) to manage the flow of information.
o Good at learning long-term dependencies.
3. Gated Recurrent Units (GRUs):
o A simpler alternative to LSTMs.
o Combines the forget and input gates into a single update gate.
o More computationally efficient while retaining performance.
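A brief sketch of how these three unit types are instantiated in PyTorch; the interface is the same, only the internal gating differs (sizes are illustrative):

```python
import torch.nn as nn

# Same interface, different internals:
# no gates in the vanilla RNN, three gates in the LSTM, two in the GRU.
vanilla = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
lstm    = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru     = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
```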
Input Layer: Takes a sequence of inputs over time steps xt (e.g., words in a sentence, frames in a
video).
Multiple Hidden Layers: The network is “deep” because it contains more than one hidden
recurrent layer stacked vertically.
State Vectors: Hidden states h are maintained and passed from one time step to the next, and
between layers.
Output Layer: Produces outputs yt at each time step, based on the final hidden layer's output at
that time.
This is unfolded over time, showing how information flows through each layer and across time steps.
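A minimal sketch of this stacked structure in PyTorch, assuming LSTM units and three layers (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

# A deep RNN: three recurrent layers stacked vertically; each layer's hidden
# states at every time step become the next layer's inputs.
deep_rnn = nn.LSTM(input_size=10, hidden_size=32, num_layers=3, batch_first=True)

x = torch.randn(4, 20, 10)            # (batch, time steps, input features)
outputs, (h_n, c_n) = deep_rnn(x)

print(outputs.shape)                  # (4, 20, 32) - top layer's output at each step
print(h_n.shape)                      # (3, 4, 32)  - final hidden state of each layer
```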
1. Data Preparation: Clean and structure sequential input data (e.g., text, audio).
2. Model Architecture Design:
o Choose number of hidden layers L
o Set number of hidden units k per layer
o Decide between RNN, LSTM, or GRU units
3. Training the Model:
o Use backpropagation through time (BPTT)
o Optimize weights and biases using a loss function
4. Deployment: Apply the trained model to real tasks, e.g., sentiment analysis.
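An illustrative training step for the workflow above, assuming an LSTM-based model and a classification loss (data and sizes are placeholders); BPTT is simply backpropagation applied to the network after it has been unfolded over the time steps of the sequence:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
head = nn.Linear(16, 2)                      # e.g. two sentiment classes
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 12, 8)                    # (batch, time steps, features)
y = torch.randint(0, 2, (4,))                # one label per sequence

outputs, _ = model(x)
logits = head(outputs[:, -1, :])             # use the last time step's hidden state
loss = loss_fn(logits, y)

optimizer.zero_grad()
loss.backward()                              # gradients flow back through all time steps (BPTT)
optimizer.step()
```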
1. Data Preparation:
Decide on:
o Number of layers
o Number of hidden units
o Type of recurrent unit (e.g., LSTM, GRU)
Also decide how to manage input/output sequence lengths (e.g., padding or truncation).
Integrate into a real-time application (e.g., web or API) to classify sentiment on live data.
Components:
• Input Sequence:
• Embedding Layer:
• Recurrent Layers:
The core of an RNN. Processes data one step at a time, maintaining a memory of previous steps.
Stacked layers form a deep RNN, enabling more complex understanding.
Types of recurrent units:
o Vanilla RNNs
o LSTMs (Long Short-Term Memory)
o GRUs (Gated Recurrent Units)
• Output Layer:
Converts the final hidden state(s) into the desired output format.
Two common uses:
o Classification (e.g., softmax for sentiment analysis)
o Regression (e.g., predicting a value like temperature)
• Output (Prediction):
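A hedged sketch combining the components above (embedding layer, stacked recurrent layers, output layer) into one sentence-level classifier; the vocabulary size, dimensions, and class count are assumptions:

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=64, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # embedding layer
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=2,       # stacked recurrent layers
                           batch_first=True)
        self.output = nn.Linear(hidden_dim, num_classes)              # output layer (logits)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.rnn(embedded)
        logits = self.output(h_n[-1])               # final hidden state of the top layer
        return logits                               # softmax is applied inside the loss

model = SentimentRNN()
tokens = torch.randint(0, 10000, (3, 15))           # 3 sentences, 15 token ids each
print(model(tokens).shape)                          # torch.Size([3, 2])
```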
3. Increased Expressiveness:
With more layers, DRNs can model more complex and subtle patterns in sequential data.
Especially helpful for detecting nuances in long texts or speech.
What it means: Each layer in a DRN learns different levels of features from the input.
Why it matters: This allows the model to:
o Extract simple patterns in lower layers (e.g., word-level context).
o Capture complex patterns in higher layers (e.g., sentence meaning).
Use cases: Language modeling, speech recognition, and translation.
5. Transfer Learning
What it is: Reusing a model trained on one task for another, related task.
In DRNs: The model can be pre-trained on large datasets (like Wikipedia for language) and then
fine-tuned on a smaller, task-specific dataset.
Benefit: Saves time and improves performance when labeled data is limited.
What it is:
o During training, gradients are used to update weights.
o In deep RNNs, gradients may become:
Too small (vanish) → No learning happens.
Too large (explode) → Instability and bad updates.
Solution:
o Use LSTM or GRU units (they mitigate this issue).
o Apply techniques like gradient clipping and careful weight initialization.
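A small sketch of gradient clipping during one training step (the model, loss, and max-norm value are placeholders):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(2, 50, 8)                    # a long sequence where gradients can explode
outputs, _ = model(x)
loss = outputs.pow(2).mean()                 # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale gradients whose overall norm exceeds 1.0 before the weight update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```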
2. Computational Complexity
What it means: Deep RNNs require a lot of processing power and memory.
Why:
o Many layers and sequential operations are involved.
o Large datasets increase the computation time.
Impact: Harder to use on mobile devices or in real-time systems.
Explanation:
o Training requires running many iterations across large and complex datasets.
o DRNs process data step-by-step (sequentially), which is slower than models like CNNs.
Real-world issue: Training could take days or weeks, depending on dataset size and hardware.
4. Overfitting
What happens:
o Model learns training data too well, including the noise.
o Performs poorly on new, unseen data.
Why it happens:
o Too many parameters and insufficient data.
Solu ons:
o Use regularization techniques like:
Dropout
Weight decay
o Reduce model complexity or use more training data.
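A short sketch of two of these remedies, dropout between stacked recurrent layers and L2 weight decay in the optimizer (the rates are assumptions):

```python
import torch
import torch.nn as nn

# Dropout is applied between the stacked recurrent layers (requires num_layers > 1);
# weight_decay adds an L2 penalty on the weights during optimization.
model = nn.LSTM(input_size=16, hidden_size=32, num_layers=2,
                dropout=0.3, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```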
5. Difficulty in Interpretability:
o Every layer applies mathematical operations (e.g., tanh, ReLU) that distort the data.
o The deeper the network, the harder it is to understand what each layer is doing.
Sequence Dependency:
o In DRNs, each prediction depends on a sequence of past inputs, not just one.
o So you can't just say, “This word caused the sentiment to be positive.”
o The influence is spread across time, making it hard to pinpoint the exact cause.
Application: Image Compression
This covers an application of image compression using Recurrent Neural Networks (RNNs). It highlights
how deep learning has transformed image compression by reducing spatial redundancy between
adjacent pixels and reconstructing high-quality images. Traditional methods have grown
increasingly complex, but RNNs offer an alternative approach that leverages sequential
processing for effective encoding and compression.
The compression pipeline uses a neural network architecture designed for encoding and decoding data
with Recurrent Neural Networks (RNNs). It consists of two primary components:
1. Analysis-Encoder Network
2. Synthesis-Decoder Network
RNN Layers: Reverse the encoding process, reconstructing compressed data step by
step.
o RNN #6 (64 → 128)
o RNN #5 (128 → 256)
o RNN #4 (512 → 512)
Synthesis Block: Includes Inverse GDN (iGDN) and convolutional layers to refine and
reconstruct the image.
This design enhances compression efficiency while preserving image fidelity, making it useful
for low-bandwidth transmission, storage optimization, and real-time image processing.
This covers the training process for an image compression network using Recurrent Neural Networks
(RNNs), presenting key mathematical formulations and different strategies for applying RNNs in image
compression.
Training Process:
The decoder network reconstructs the compressed data with an adjustment factor.
These techniques enable adaptive compression, improving storage and transmission efficiency
while preserving image quality.
Lossy compression focuses on reducing file size by selectively discarding less critical image
details while maintaining perceptual quality.
RNN-based Models learn to prioritize essential features, ensuring that only the most
relevant aspects of the image are preserved.
Quantization reduces precision in pixel values, enabling efficient storage.
Entropy Coding further optimizes compression by encoding high-probability symbols
with shorter codes (e.g., Huffman coding).
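A toy sketch of the quantization idea in NumPy (the bit depth is an assumption, and real RNN codecs quantize learned latent codes rather than raw values):

```python
import numpy as np

def quantize(latent, bits=4):
    """Uniformly quantize values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits
    return np.round(latent * (levels - 1)).astype(np.int32)   # integer symbols to entropy-code

def dequantize(symbols, bits=4):
    levels = 2 ** bits
    return symbols.astype(np.float32) / (levels - 1)

latent = np.random.rand(8, 8)             # stand-in for an encoder's output
recovered = dequantize(quantize(latent))  # lossy: a small precision error remains
print(np.abs(latent - recovered).max())   # bounded by half a quantization step
```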
Recurrent Neural Networks (RNNs) are a fundamental deep learning architecture for processing
sequential data, making them particularly useful in Natural Language Processing (NLP). Here's a
breakdown of the key points:
Unfolding Over Time: The image shows how a single RNN unit repeats across different
time steps. Instead of treating inputs independently, the network maintains a hidden state that
carries contextual information forward.
Hidden States (h0, h1, …, hn+1): Each time step has its own hidden state, which is updated based on
previous states and new inputs. This mechanism allows the RNN to remember past information.
Inputs (x, x0, x1, …, xn+1): Each time step receives an input vector, which represents part of a
sequence (e.g., words in a sentence).
Outputs (o, o0, o1, …, on+1): The model generates an output for each time step, which can be used
for various NLP tasks like translation or sentiment analysis.
Weight Matrices (W,U): These control how inputs and hidden states interact, determining
how information flows through the network.
RNNs struggle with long-term dependencies due to issues like vanishing gradients. More
advanced models, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units
(GRU), improve on standard RNNs by better handling long-range dependencies.
Recurrent Neural Networks (RNNs) are powerful tools for Natural Language Processing
(NLP) due to their ability to handle sequential data while retaining memory from previous
computations. Here's a more structured explanation:
Advantages of RNNs
1. Context Awareness – Unlike traditional neural networks, RNNs can remember past
information and use it to influence current computations, making them effective for
processing sequences.
2. Handling Arbitrary-Length Inputs – They can model dependencies over long
sequences, which is essential for natural language tasks where previous words affect the
meaning of later words.
3. Recursive Computation – RNNs apply recursive transformations, enabling dynamic
processing of sequential data without needing a fixed input size.
2. Word-Level Classification
Example: Named Entity Recognition (NER) helps identify key entities like people,
places, and organizations in text.
3. Language Modeling
RNNs predict the next word in a sentence, helping with autocomplete, speech
recognition, and chatbots.
4. Semantic Matching
Helps in search engines and question-answer systems by linking queries with relevant
content.
5. Sentence-Level Classification
Used for sentiment analysis, classifying a sentence’s emotional tone (positive, negative,
neutral).
Sequence Modeling: Predicts the next word based on previous words, useful for text
generation, autocomplete, and speech recognition.
Machine Translation: Uses Seq2Seq architectures (encoder-decoder RNNs) to
translate text between languages.
Sentiment Analysis: Determines the emotion behind text, used in social media
monitoring and customer feedback analysis.
Named Entity Recognition (NER): Extracts important names and places from text.
Part-of-Speech Tagging: Identifies grammatical roles of words (e.g., noun, verb,
adjective).
Text Classification: Categorizes documents by topic (e.g., spam detection, news
classification).
Dialogue Systems: Powers chatbots by generating relevant conversational responses.
Complete Autoencoder:
A Complete Autoencoder is a type of artificial neural network designed for unsupervised learning,
which means it learns from data without labels.
Autoencoders are mainly used to learn efficient, compressed representations of data. Think of it as
teaching the model to understand the essence of the input data.
Why Autoencoders Are Important
They are part of deep learning models that automatically discover patterns in complex
data.
Autoencoders are versatile and have been successfully applied in image processing,
anomaly detection, noise reduction, and more.
They work well in cases where we don’t have labeled data but still want the model to
learn useful features.
1. Encoder:
o Takes the original input (like an image or a signal).
o Compresses it into a smaller, dense form called a latent representation or code.
o This process is similar to summarizing data.
2. Decoder:
o Takes that latent representation.
o Reconstructs the original input from it.
o The goal is to make the reconstructed output as close to the original input as
possible.
Original Input → Encoder (W) → Latent Space → Decoder (W′) → Reconstructed Output
1. Input Layer
This is where the original data (like an image or number) is fed into the autoencoder.
In the diagram, several input neurons (circles) each represent a part of the data
(e.g., pixel values in an image).
2. Encoder
This layer contains fewer neurons than the input, forcing the network to learn efficient
encoding.
3. Bottleneck (Code Layer)
It's the key point where the input data is summarized in a lower-dimensional space.
This is the "compressed representation" seen in the second diagram.
4. Decoder
The decoder takes this compressed data and reconstructs the original input.
It uses a different set of weights, labeled W′, to connect the hidden layer to the output layer.
Its job is to make the output as similar as possible to the input.
5. Output Layer
The final layer where the reconstructed version of the input is produced.
The network is trained to minimize the difference between the original input and this output.
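A minimal sketch of a complete autoencoder with this encoder-decoder layout (layer sizes and the MSE loss choice are assumptions):

```python
import torch
import torch.nn as nn

# A complete autoencoder for 28x28 images flattened to 784 values.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(                 # compresses the input to a latent code
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(                 # reconstructs the input from the code
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)                               # a batch of inputs in [0, 1]
loss = nn.functional.mse_loss(model(x), x)            # reconstruction error to minimize
```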
1. Vanilla Autoencoder
2. Sparse Autoencoder
3. Denoising Autoencoder
4. Variational Autoencoder
A probabilistic version.
Learns the distribution of data, not just the input itself.
Can generate new data by sampling from the latent space.
5. Contractive Autoencoder
Penalizes the model if small changes in input cause large changes in output.
Makes the learned representation stable and robust.
7. Convolutional Autoencoder
8. Recurrent Autoencoder
REGULARIZED AUTOENCODERS
Regularized autoencoders are an improved version of standard autoencoders. Their main goal
is to avoid overfitting — that is, to stop the model from simply memorizing the training data.
Instead, they help the model learn patterns that can generalize well to new, unseen data.
To achieve this, they add regularization — which means placing extra rules or constraints on
how the model learns. These constraints force the autoencoder to focus on the most important
and meaningful features of the data, rather than noise or unnecessary details.
By doing so, regularized autoencoders produce stronger and more useful data
representations. This makes them effective for tasks like:
In short, regularized autoencoders learn smarter, more general features of the data, making
them better suited for many real-world machine learning tasks.
This is the architecture of an autoencoder neural network, which is used for unsupervised learning tasks
like data compression, feature extraction, and denoising.
1. Input Layer
The encoder is the part of the network that compresses the input data into a lower-dimensional
form.
The circles labeled a1,a2,…,a6 in the first two hidden layers are encoding layers, where the model
learns a compact representation.
The encoder transforms the input x into a code (also called the latent space or bottleneck).
The decoder reconstructs the input from the bottleneck representation.
The layers mirror the encoder in reverse, transforming the compact code back into the original
format.
5. Output Layer
The final layer outputs x’1,x’2,…,x’6 which aim to closely match the original inputs.
The goal of the network is to minimize the difference between input and output — typically
using mean squared error (MSE) or similar loss functions.
A regularized autoencoder is structurally similar to a traditional autoencoder (like the one described
above), but it includes special techniques to prevent overfitting and help the model learn more
general and useful features.
1. Neuronal Arrangement
2. Activation Functions
Regularized autoencoders often use activation functions (like ReLU, Leaky ReLU, or sigmoid) that
work well with regularization.
These help improve training stability and performance.
1. L1 and L2 Regularization
2. Dropout
3. Batch Normalization
5. Contractive Regularization
Adds a penalty based on the Jacobian matrix of the encoder (i.e., how sensitive the output is to
small input changes).
Encourages the model to be less sensitive to small noise or variations in the input.
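A hedged sketch showing two of these regularization techniques applied to an autoencoder's code layer, dropout plus an L1 sparsity penalty (the penalty weight and dropout rate are assumptions):

```python
import torch
import torch.nn as nn

class RegularizedAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Dropout(0.2),                          # dropout regularization
            nn.Linear(128, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = RegularizedAutoencoder()
x = torch.rand(8, 784)
recon, code = model(x)
# Reconstruction error plus an L1 penalty on the code encourages sparse, general features.
loss = nn.functional.mse_loss(recon, x) + 1e-4 * code.abs().mean()
```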
--------------------------------------------------------------------------------------------------
Stochastic encoders and decoders are special parts of a neural network architecture used in
probabilistic models, particularly in Variational Autoencoders (VAEs).
The term "stochastic" means random or involving chance. So, unlike regular (deterministic)
encoders and decoders, these ones introduce randomness in the process.
In VAEs, we want the model to learn not just a single fixed representation of input data (like an
image or sentence), but a distribution (i.e., a range of possible representations with
probabilities).
This allows the model to capture uncertainty and generate variations of the input data.
The stochastic encoder maps input data (e.g., an image) to a probability distribution (often a
Gaussian), not a single point.
From this distribution, a latent vector (a set of features) is randomly sampled.
The stochastic decoder takes this sampled vector and reconstructs the data (or creates a new
variation).
Why is it useful?
This is your original data, like an image, a sentence, or any high-dimensional input.
3. Reparameterization Trick:
This reparameterization trick allows gradients to flow and makes training possible.
6. Output (X′):
In a Variational Autoencoder (VAE), the encoder is not like a regular autoencoder.
This is because we want to learn a distribution of possible latent representations, not just one.
This gives the model the ability to generate new variations of the input data.
The model uses the reparameterization trick to sample a latent variable z from that distribution.
The decoder takes the sampled latent vector z, not a fixed one.
Based on this randomly sampled z, it generates or reconstructs the output (which should look
like the original input).
Component | Purpose | Deterministic? | What it Outputs/Consumes
Stochastic Encoder | Learn a distribution of latent representations | No | Outputs μ and σ (parameters of a distribution)
Stochastic Decoder | Generate realistic data from a sampled latent code | No | Inputs a sampled z, produces output X′
This part shows a simplified pipeline of how the VAE learns and how its cost function is computed.
Step-by-step Flow:
1. Reconstruction Error:
Measures how closely the decoder's output X′ matches the original input X.
2. KL Divergence Term:
Ensures that the learned latent distribution q(z∣X) stays close to a prior distribution (usually a
standard normal).
This keeps the latent space well-structured and enables sampling.
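A compact sketch of a stochastic encoder/decoder with the reparameterization trick and the two cost terms described above (all dimensions and weightings are illustrative):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(input_dim, 128)
        self.mu = nn.Linear(128, latent_dim)          # mean of q(z|X)
        self.logvar = nn.Linear(128, latent_dim)      # log-variance of q(z|X)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                    # reparameterization trick:
        z = mu + eps * torch.exp(0.5 * logvar)        # z = mu + sigma * eps keeps gradients flowing
        return self.dec(z), mu, logvar

model = VAE()
x = torch.rand(8, 784)
recon, mu, logvar = model(x)
recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")     # reconstruction error
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())       # keeps q(z|X) near N(0, I)
loss = recon_loss + kl
```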
CONTRACTIVE AUTOENCODERS
Contractive Autoencoder?
A Contractive Autoencoder (CAE) is a type of autoencoder (a neural network used to learn compact,
compressed representations of data) with an extra twist: it includes a regularization term that makes
the learned features more robust to small changes in the input.
In real-world data, small changes (like noise or slight variations) shouldn't drastically change the internal
representation. We want the encoder to be stable and less sensitive to such small perturbations.
1. Autoencoder Structure:
o Like any autoencoder, it has:
Encoder: Maps input x to hidden representation h
Decoder: Reconstructs input from h back to x̂
2. Contractive Regularization:
o During training, a penalty is added to the loss function.
o This penalty is based on the Jacobian matrix, i.e., a matrix of partial derivatives of the
hidden layer with respect to the input.
o Specifically, we penalize the Frobenius norm (a measure of size) of this Jacobian.
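A rough sketch of that penalty computed with autograd (the penalty weight is an assumption, and practical implementations often use an analytic form of the Jacobian for efficiency):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(4, 784, requires_grad=True)
h = encoder(x)
recon = decoder(h)

# Squared Frobenius norm of the Jacobian dh/dx, accumulated one hidden unit at a time.
jacobian_penalty = 0.0
for j in range(h.shape[1]):
    grad_j = torch.autograd.grad(h[:, j].sum(), x, create_graph=True, retain_graph=True)[0]
    jacobian_penalty = jacobian_penalty + grad_j.pow(2).sum()

loss = nn.functional.mse_loss(recon, x) + 1e-3 * jacobian_penalty
```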
CAEs aim to learn invariant representations — this means the encoded features shouldn’t
change much even if the input changes in a small or unimportant way.
They are designed to ignore irrelevant noise or small transformations in the input data.
CAEs outperform regular and denoising autoencoders in learning more useful, stable
representations.
Instead of just adding noise (like in denoising AEs), CAEs mathematically control how much the
hidden representation changes when the input changes.
Regularization Term
Training Objective
Dimensionality Reduction: Like PCA, but nonlinear and more powerful.
Feature Learning: To extract robust features for other tasks like classification.
Denoising: Especially when input data has small random variations.
Corrupted Input: The original image (in this case, the digit "7") is perturbed by adding
noise, making the input slightly distorted.
Autoencoder Structure: The corrupted input passes through an encoder, which
compresses it into a lower-dimensional representation. The decoder then reconstructs the
image from this compressed form.
Reconstruction Loss: The goal is to compare the reconstructed output with the original
(uncorrupted) image and minimize the loss, ensuring the model can robustly recover
meaningful features despite noise.
Contractive Regularization: This method applies a penalty that encourages similar
inputs to map to similar feature representations, reducing sensitivity to variations in input.
Robustness to Noise: The encoder learns feature representations that remain stable even with
small variations in input data. This makes contractive autoencoders particularly useful in
denoising applications.
Feature Learning: The network extracts discriminative features from the data that can be
valuable for downstream tasks like classification, clustering, and anomaly detection.