Unit 2
Introduction to Deep Learning &
Architectures
Deep Learning vs. Machine Learning
• Both machine learning and deep learning have the potential to
transform a wide range of industries, including healthcare, finance,
retail, and transportation, by providing insights and automating
decision-making processes.
• Machine Learning:
• Machine learning is a subset of Artificial Intelligence (AI) that gives
a system the ability to learn and improve from experience without
being explicitly programmed for each task.
• Machine learning uses data to train models and produce accurate
results. It focuses on developing computer programs that access
data and use it to learn for themselves.
• Deep Learning:
• Deep Learning is a subset of Machine Learning built on artificial
neural networks, including variants such as recurrent neural networks.
• Its algorithms are built like other machine learning algorithms, but
they stack many more levels (layers) of processing.
• Together, these layers form an artificial neural network. In simpler
terms, the network is loosely modeled on the human brain, where
neurons are connected to one another; that is the core idea of deep
learning. It tackles complex problems by passing data through these
layers.
S. No. | Machine Learning | Deep Learning
1. | Machine Learning is a superset of Deep Learning. | Deep Learning is a subset of Machine Learning.
2. | Typically works with structured data, so its data representation is quite different from that of Deep Learning. | Uses artificial neural networks (ANNs) to represent data, which is quite different from Machine Learning.
3. | Typically trained on thousands of data points. | Big Data: millions of data points.
4. | Outputs: a numerical value, such as a classification or a score. | Outputs: anything from numerical values to free-form elements, such as free text and sound.
5. | Uses various types of automated algorithms that learn to model functions and predict future actions from data. | Uses a neural network that passes data through processing layers to interpret data features and relations.
6. | Algorithms are directed by data analysts to examine specific variables in data sets. | Algorithms are largely self-directed in their data analysis once they are put into production.
7. | Training can be performed using the CPU (Central Processing Unit). | A dedicated GPU (Graphics Processing Unit) is usually needed for training.
8. | More human intervention is involved in getting results. | Although more difficult to set up, deep learning requires less intervention once it is running.
9. | Involves training algorithms to identify patterns and relationships in data. | Uses complex neural networks with multiple layers to analyze more intricate patterns and relationships.
10. | Algorithms range from simple linear models to more complex models such as decision trees and random forests. | Algorithms are based on artificial neural networks that consist of multiple layers and nodes.
11. | Algorithms typically require less data than deep learning algorithms, but the quality of the data is more important. | Algorithms require large amounts of data to train the neural networks, but can learn and improve on their own as they process more data.
12. | Used for a wide range of applications, such as regression, classification, and clustering. | Mostly used for complex tasks such as image and speech recognition, natural language processing, and autonomous systems.
Feature Engineering
• Feature engineering is the process of transforming raw data into
features that are suitable for machine learning models. In other
words, it is the process of selecting, extracting, and transforming the
most relevant features from the available data to build more accurate
and efficient machine learning models.
• The success of machine learning models heavily depends on the
quality of the features used to train them.
• Feature engineering involves a set of techniques that enable us to
create new features by combining or transforming the existing ones.
• These techniques help to highlight the most important patterns and
relationships in the data, which in turn helps the machine learning
model to learn from the data more effectively.
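As a small illustration of creating new features by combining or transforming existing ones, here is a minimal Python sketch. The record fields (height_cm, weight_kg, income) and the derived features are hypothetical and chosen only for the example.

```python
import math

def engineer_features(record):
    """Return a copy of the raw record with derived features added."""
    height_m = record["height_cm"] / 100.0
    return {
        **record,
        # Combine two raw features into one more informative feature.
        "bmi": record["weight_kg"] / (height_m ** 2),
        # Transform a raw feature with a log to compress its range.
        "log_income": math.log1p(record["income"]),
    }

# Hypothetical raw record, invented for illustration.
raw = {"height_cm": 180.0, "weight_kg": 81.0, "income": 0.0}
features = engineer_features(raw)
```

A model trained on `features` sees the body-mass index directly instead of having to discover the height/weight relationship on its own.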
Representation Learning
• Representation Learning is a process in machine learning where
algorithms extract meaningful patterns from raw data to create
representations that are easier to understand and process.
• These representations can be designed for interpretability, reveal
hidden features, or be used for transfer learning.
• They are valuable across many fundamental machine learning tasks
like image classification and retrieval.
• Deep neural networks can be viewed as representation learning
models: they encode the input and project it into a different
subspace. These representations are then usually passed on to a
simple model, for instance a linear classifier, that makes the final
prediction.
• Imagine an engineer designing an ML algorithm to predict malignant cells
based on brain scans. To design the algorithm, the engineer has to rely
heavily on patient data, because that’s where all the answers are.
• Each observation or feature in that data describes the attributes of the
patient. The machine learning algorithm that predicts the outcome has to
learn how each feature correlates with the different outcomes: benign or
malignant.
• So in case of any noise or discrepancies in the data, the outcome can be
totally different, which is the problem with most machine learning
algorithms. Most machine learning algorithms have a superficial
understanding of the data.
• So what is the solution?
• Provide the machine with a more abstract representation of the data.
• For many tasks, it is impossible to know what features should be extracted. This is
where the idea of representation learning truly comes into view.
• In representation learning, the machine is provided with data and it learns the
representation by itself. It’s a method of finding a representation of the data –
the features, the distance function, the similarity function – that dictates how the
predictive model will perform.
• Representation learning works by reducing high-dimensional data into low-
dimensional data, making it easier to find patterns, anomalies, and also giving us
a better understanding of the behavior of the data altogether.
• It also reduces the complexity of the data, so the anomalies and noise are
reduced. This reduction in noise can be very useful for supervised learning
algorithms.
• Machine learning tasks such as classification frequently demand
input that is mathematically and computationally convenient to
process, which motivates representation learning.
• Real-world data, such as photos, video, and sensor data, has resisted
attempts to define specific features algorithmically.
• An alternative approach is to discover such features or
representations by examining the data, rather than depending on
explicit algorithms.
• Representation learning is a method of training a machine learning
model to discover and learn the most useful representations of input
data automatically.
• These representations, often known as “features”, are internal states
of the model that effectively summarize the input data and help the
algorithm understand the underlying patterns of this data better.
• Deep neural network models can learn complex, hierarchical
representations of data through multiple layers, e.g. CNNs, RNNs,
autoencoders, and Transformers.
• A good representation has three characteristics: information,
compactness, and generalization.
• Information: The representation encodes important features of the data
into a compressed form.
• Compactness:
• Low Dimensionality: Learned embedding representations from raw data
should be much smaller than the original input. This allows for efficient
storage and retrieval, and also discards noise from the data, allowing the
model to focus on relevant features and converge faster.
• Preserves Essential Information: Despite being lower-dimensional, the
representation retains important features. This balance between
dimensionality reduction and information preservation is essential.
• Generalization (Transfer Learning): The aim is to learn versatile
representations for transfer learning, starting with a pre-trained
model and then fine-tuning it for specific tasks requiring less data.
• Representation learning can be divided into:
• Supervised representation learning:
• Leverages Labeled Data: Uses labeled data. The labels guide the learning
algorithm about the desired outcome.
• Focuses on Specific Tasks: The learning process is tailored towards a
specific task, such as image classification or sentiment analysis. The learned
representations are optimized to perform well on that particular task.
• Examples: Training a Convolutional Neural Network (CNN) to classify
objects in images (e.g., dog, cat) using labeled image datasets, or a
Recurrent Neural Network (RNN) for sentiment analysis of text data
(positive, negative, neutral) with labeled reviews or sentences.
• Unsupervised Representation Learning:
• Without Labels: Works with unlabeled data. The algorithm identifies
patterns and relationships within the data itself.
• Focuses on Feature Extraction: The goal is to learn informative
representations that capture the underlying structure and essential
features of the data. These representations can then be used for various
downstream tasks (transfer learning).
• Examples: Training an autoencoder to compress and reconstruct images,
learning a compressed representation that captures the key features of the
image. Using Word2Vec or GloVe on a massive text corpus to learn word
embeddings, where words with similar meanings have similar
representations in a high-dimensional space. Using BERT to learn
contextual representations of words.
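The idea that words with similar meanings get similar representations can be sketched with cosine similarity. The three-dimensional vectors below are invented for illustration only; real Word2Vec or GloVe embeddings are learned from a corpus and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, hand-written for this example (not learned).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
```

With these toy vectors, `cat_dog` is larger than `cat_car`, which is the kind of structure a learned embedding space is expected to show.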
Width Vs. Depth of Neural Networks
• In deep learning, the terms "depth" and "width" refer to different
aspects of the architecture of a neural network:
• Depth:
• Definition: The depth of a neural network is the number of layers
it has. This count includes all layers that have learnable parameters,
such as convolutional layers, fully connected layers, and recurrent layers.
• Implication: A deeper network, meaning one with more layers, can capture
more complex patterns in the data. This is because each additional layer can
build on the representations learned by previous layers, allowing the network
to understand more intricate structures and hierarchies in the input data.
• Example: If a neural network has an input layer, 5 hidden layers, and an
output layer, it has 7 layers in total. Counting only the layers with
learnable parameters (the input layer has none), it is 6 layers deep.
• Width:
• Definition: The width of a neural network refers to the number of
units (neurons) in each layer. It typically refers to the layers with the
most neurons.
• Implication: A wider network, meaning one with more neurons per
layer, can learn more features in parallel. This can be beneficial for
capturing more detailed information about the data, especially if each
neuron can learn a distinct feature or pattern.
• Example: If a neural network layer has 512 neurons, it is said to be
512 units wide.
• Balance: Both depth and width need to be balanced based on the specific
problem and the amount of available data. Very deep or very wide networks can
lead to overfitting, especially with insufficient training data.
• Computational Cost: Increasing either depth or width increases the
computational cost and memory requirements of training and using the neural
network.
• Architecture Choice: Different tasks may benefit more from deeper networks
(e.g., image classification with convolutional neural networks) or wider networks
(e.g., natural language processing with transformer models).
• Examples of Architectures:
• Deep Networks: ResNet-50, VGG-16 (these networks have many layers).
• Wide Networks: Wide ResNet (these networks have fewer layers but more
neurons per layer compared to their deep counterparts).
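To make the depth/width trade-off and its computational cost concrete, the sketch below counts layers, units, and learnable parameters for fully connected networks. The two layer-size lists are hypothetical examples, and depth is counted as layers with learnable parameters (every layer except the input).

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

def depth(layer_sizes):
    # Layers with learnable parameters: all layers except the input layer.
    return len(layer_sizes) - 1

def width(layer_sizes):
    # Width taken here as the size of the largest hidden layer.
    return max(layer_sizes[1:-1])

# Hypothetical architectures: 64 inputs, 10 outputs.
deep_narrow = [64] + [32] * 5 + [10]   # 5 hidden layers of 32 units
shallow_wide = [64, 512, 10]           # 1 hidden layer of 512 units

deep_params = mlp_parameter_count(deep_narrow)
wide_params = mlp_parameter_count(shallow_wide)
```

Here the deeper network has 6 parameterized layers but far fewer parameters than the single 512-unit layer, which shows that depth and width contribute to cost in different ways.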
Activation Functions
• We use activation functions to propagate the output of one layer’s
nodes forward to the next layer (up to and including the output
layer).
• An activation function is a scalar-to-scalar function that yields the
neuron’s activation.
• We use activation functions for hidden neurons in a neural network
to introduce nonlinearity into the network’s modeling capabilities.
• It is simply the function applied to a node’s weighted input to
produce that node’s output. It is also known as a transfer function.
• It determines the output of a neuron, for example a yes/no decision,
by mapping the resulting values into a range such as 0 to 1 or -1 to 1
(depending upon the function).
• Activation functions can be broadly divided into 2 types:
• Linear Activation Function
• Non-linear Activation Functions
• Linear Activation Function: the function is a straight line, so its
output is not confined to any range.
• Equation: f(x) = x
• Range: (-infinity, infinity)
• It does not help the network model the complexity or the varied
parameters of the data usually fed to neural networks.
• Nonlinear activation functions are the most widely used activation
functions.
• They make it easier for the model to generalize or adapt to a variety
of data and to differentiate between outputs.
• The main terms needed to understand nonlinear functions are:
• Derivative or differential: the change along the y-axis with respect
to the change along the x-axis. It is also known as the slope.
• Monotonic function: a function that is either entirely non-increasing
or entirely non-decreasing.
• Imagine a neural network without the activation functions. In that
case, every neuron will only be performing a linear transformation on
the inputs using the weights and biases.
• Linear transformations make the neural network simpler, but such a
network would be less powerful and unable to learn complex patterns
from the data.
• A neural network without an activation function in deep learning is
essentially just a linear regression model.
• Thus we apply a non-linear transformation to the inputs of the neuron,
and this non-linearity in the network is introduced by an activation
function.
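The claim that a network without activation functions is just a linear model can be checked with a tiny sketch. It uses single-input neurons with hypothetical weights and biases; stacking two such layers gives exactly the same mapping as one linear layer.

```python
def linear_layer(x, w, b):
    """A single neuron with no activation: a purely linear (affine) map."""
    return w * x + b

# Hypothetical weights and biases for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    # Layer 2 applied to the output of layer 1, with no activation between.
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The same mapping written as ONE linear layer:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def one_layer(x):
    return linear_layer(x, w_combined, b_combined)
```

Because `two_layers` and `one_layer` agree for every input, adding more linear layers adds no modeling power; a non-linear activation between the layers is what breaks this equivalence.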
• Binary Step Function
• The first thing that comes to mind for an activation function is a
threshold-based classifier, i.e. deciding whether or not the neuron
should be activated based on the value from the linear
transformation.
• In other words, if the input to the activation function is greater than a
threshold, then the neuron is activated; otherwise it is deactivated, i.e.
its output is not passed on to the next hidden layer.
• The binary step function can be used as an activation function while
creating a binary classifier.
• As you can imagine, this function will not be useful when there are
multiple classes in the target variable. That is one of the limitations of
binary step function.
• Moreover, the gradient of the step function is zero, which hinders
the backpropagation process. That is, if you calculate the derivative
of f(x) with respect to x, it comes out to be 0 (everywhere except at
the threshold itself, where it is undefined).
• Gradients are calculated to update the weights and biases during the
backprop process. Since the gradient of the function is zero, the
weights and biases don’t update.
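The following minimal sketch shows the binary step function and why its zero gradient blocks learning. The single-weight update rule and the default threshold of 0 are simplifying assumptions made for the example.

```python
def binary_step(z, threshold=0.0):
    """Activate (1) if the input is greater than the threshold, else 0."""
    return 1.0 if z > threshold else 0.0

def binary_step_gradient(z):
    # The function is flat on both sides of the threshold, so the
    # derivative is 0 (it is undefined exactly at the threshold; 0 is
    # used there too in this sketch).
    return 0.0

def updated_weight(weight, x, upstream_gradient, learning_rate=0.1):
    """One gradient-descent step for a neuron computing step(weight * x)."""
    z = weight * x
    # Chain rule: dLoss/dweight = upstream * d(step)/dz * dz/dweight
    gradient = upstream_gradient * binary_step_gradient(z) * x
    return weight - learning_rate * gradient
```

Whatever the upstream error signal is, `updated_weight` returns the weight unchanged, which is the "weights and biases don't update" problem described above.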
• Linear Function
• We saw the problem with the step function: its gradient is zero,
because there is no component of x in the binary step function.
• Instead of a binary function, we can use a linear function. We can
define the function as f(x) = ax.
• Here the activation is proportional to the input. The variable ‘a’ in this
case can be any constant value.
• When we differentiate the function with respect to x, the result is the
coefficient of x, which is the constant a.
• Although the gradient here is not zero, it is a constant that does not
depend upon the input value x at all.
• This implies that the weights and biases will be updated during the
backpropagation process, but the updating factor will always be the same.
• In this scenario, the neural network will not really reduce the error,
since the gradient is the same for every iteration.
• The network will not be able to train well or capture the complex
patterns from the data. A linear function may still be suitable for
simple tasks where interpretability is highly desired.
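A short sketch of the linear activation f(x) = ax makes the constant-gradient point concrete. The value a = 4.0 is an arbitrary choice for illustration, and the gradient is estimated numerically rather than derived.

```python
def linear(x, a=4.0):
    """Linear activation: output is proportional to the input."""
    return a * x

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# The gradient is the same constant (a) at every input value.
gradients = [numerical_gradient(linear, x) for x in (-10.0, 0.0, 3.0, 100.0)]
```

Every entry of `gradients` is approximately 4.0, so the gradient carries no information about where on the curve the input lies.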
• Sigmoid Activation Function
• The next activation function in deep learning that we are going to
look at is the sigmoid. It is one of the most widely used non-linear
activation functions. Sigmoid maps its input to the range 0 to 1:
sigmoid(x) = 1 / (1 + e^(-x)).
• Unlike the binary step and linear functions, sigmoid is a non-linear
function. This means that when multiple neurons use sigmoid as their
activation function, the output is non-linear as well.
• It is a smooth S-shaped function and is continuously differentiable.
Its derivative is sigmoid(x) * (1 - sigmoid(x)).
• The gradient values are significant in the range -3 to 3, but the graph
gets much flatter in other regions.
• This implies that inputs greater than 3 or less than -3 have very
small gradients. As the gradient approaches zero, the network stops
learning effectively.
• Additionally, the sigmoid function is not symmetric around zero, so
the outputs of all the neurons have the same sign.
• This can be addressed by scaling the sigmoid function, which is exactly
what happens in the tanh function.
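The sigmoid and its derivative can be sketched in a few lines of Python. This is a plain scalar version for illustration; it does not guard against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    """Map any real input to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_gradient(x):
    """Derivative of sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Gradient is largest at 0 and shrinks quickly outside roughly [-3, 3].
gradient_at_0 = sigmoid_gradient(0.0)
gradient_at_3 = sigmoid_gradient(3.0)
gradient_at_10 = sigmoid_gradient(10.0)
```

Comparing `gradient_at_0`, `gradient_at_3`, and `gradient_at_10` shows how the gradient fades as the input moves away from zero, and every output of `sigmoid` is positive, which is the same-sign issue mentioned above.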
• Tanh
• The tanh function is very similar to the sigmoid function. The only
difference is that it is symmetric around the origin.
• The range of values in this case is from -1 to 1, so the inputs to the
next layers will not always be of the same sign. The tanh function is
defined as tanh(x) = 2 * sigmoid(2x) - 1.
• Apart from its range, all other properties of the tanh function are
the same as those of the sigmoid function.
• Similar to sigmoid, the tanh function is continuous and differentiable
at all points.
• The gradient of the tanh function is steeper than that of the
sigmoid function. How do we decide which activation function to
choose? Usually tanh is preferred over the sigmoid function since it
is zero-centered and the gradients are not restricted to move in a
certain direction.
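The relationship between tanh and a scaled sigmoid, and the steeper gradient of tanh, can be checked with this minimal sketch. The helper names are chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh written as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_gradient(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1.0 - s)
```

For any input, `tanh_from_sigmoid(x)` matches `math.tanh(x)` up to rounding, and at x = 0 the tanh gradient is 1.0 compared with 0.25 for sigmoid.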
• ReLU Activation Function
• The ReLU Activation function is another non-linear activation function
that has gained popularity in the deep learning domain.
• ReLU stands for Rectified Linear Unit. The main advantage of using
the ReLU function over other activation functions is that it does not
activate all the neurons at the same time.
• This means that a neuron is deactivated only if the output of
the linear transformation is less than 0.
• The function is f(x) = max(0, x): for negative input values the result
is zero, which means the neuron does not get activated.
• Since only a certain number of neurons are activated, the ReLU
function is far more computationally efficient when compared to the
sigmoid and tanh function.
• If you look at the negative side of the graph, you will notice that the
gradient value is zero.
• Due to this reason, during the backpropagation process, the weights
and biases for some neurons are not updated.
• This can create dead neurons which never get activated. This is taken
care of by the ‘Leaky’ ReLU function.
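A minimal scalar sketch of ReLU and its gradient follows. The gradient at exactly x = 0 is not defined mathematically; using 0 there is a common convention and an assumption of this example.

```python
def relu(x):
    """Rectified Linear Unit: pass positive inputs, zero out the rest."""
    return max(0.0, x)

def relu_gradient(x):
    # 1 for positive inputs, 0 for negative inputs (0 is used at x == 0).
    return 1.0 if x > 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu(x) for x in inputs]
gradients = [relu_gradient(x) for x in inputs]
```

The zeros in `gradients` for the negative inputs are what can leave a neuron "dead": if its input stays negative, its weights receive no update.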
• Leaky ReLU
• Leaky ReLU is an improved version of the ReLU function.
• As we saw, for the ReLU function the gradient is 0 for x<0, which
deactivates the neurons in that region.
• Leaky ReLU is defined to address this problem. Instead of defining the
function as 0 for negative values of x, we define it as an extremely
small linear component of x, for example f(x) = 0.01x for x<0 and
f(x) = x for x>=0.
• By making this small modification, the gradient on the left side of the
graph becomes non-zero, so we no longer encounter dead neurons in
that region. With the example slope above, the derivative of the Leaky
ReLU function is f'(x) = 1 for x>=0 and f'(x) = 0.01 for x<0.
• Since Leaky ReLU is a variant of ReLU, it can be implemented in Python
with only a small modification to the ReLU code.
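A minimal sketch of that modification is shown here. The slope of 0.01 for negative inputs is a commonly used default and is an assumption of this example, as is the parameter name `negative_slope`.

```python
def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but negative inputs keep a small linear component."""
    return x if x >= 0 else negative_slope * x

def leaky_relu_gradient(x, negative_slope=0.01):
    # The gradient is never zero, so neurons on the negative side can
    # still receive weight updates.
    return 1.0 if x >= 0 else negative_slope
```

For example, `leaky_relu(-5.0)` returns a small negative value instead of 0, and `leaky_relu_gradient(-5.0)` returns the small slope instead of 0.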
Unsupervised Training of Neural Networks
• Unsupervised learning is an intriguing area of machine learning that
reveals hidden structures and patterns in data without requiring
labelled samples. Because it investigates the underlying relationships
in data, it’s an effective tool for tasks like anomaly identification,
dimensionality reduction, and clustering.
• There are several uses for unsupervised learning in domains like
computer vision, natural language processing, and data analysis.
• Through self-sufficient data interpretation, it provides insightful
information that enhances decision-making and facilitates
comprehension of intricate data patterns.
• An unsupervised neural network is a type of artificial neural network (ANN)
used in unsupervised learning tasks.
• Unlike supervised neural networks, trained on labeled data with explicit
input-output pairs, unsupervised neural networks are trained on unlabeled
data.
• In unsupervised learning, the network is not guided by labels.
Instead, it is provided with unlabeled data sets (containing only
the input data) and left to discover the patterns in the data and build a
model from them.
• Here, it has to figure out how to arrange the data by exploiting the
separation between clusters within it. These neural networks aim to
discover patterns, structures, or representations within the data without
specific guidance.
• An unsupervised neural network has several components. They are:
• Encoder-Decoder: As the name suggests, it is used to encode and
decode the data.
• The encoder is responsible for transforming the input data into a lower-
dimensional representation on which the neural network works.
• The decoder takes the encoded representation and reconstructs the
input data from it. The parameters of both are learned during
the training of the network.
• Latent Space: It is the intermediate representation created by the encoder. It
contains the abstract representation or features that capture important
information about the data’s structure. It is also known as the
encoding.
• Training algorithm: Unsupervised neural network models use specific
training algorithms to learn their parameters. Common optimization
algorithms include stochastic gradient descent and Adam. The choice
depends on the type of model and loss function.
• Loss Function: It is a component common to all machine learning
models. It measures the difference between the model’s output and the
actual/measured output, and so quantifies how well the model fits the data.
• The primary goal of an autoencoder is to minimize the difference between
the input and the reconstructed output. The loss function quantifies this
difference. Common loss functions used for this purpose include Mean
Squared Error (MSE) for continuous data and Binary Cross-Entropy for
binary or normalized data.
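The two reconstruction losses named above can be sketched as plain Python functions. Inputs and reconstructions are given as lists of numbers; the clipping constant `eps` is an assumption added to avoid taking log(0).

```python
import math

def mean_squared_error(inputs, reconstruction):
    """Average squared difference, suited to continuous data."""
    return sum((x - r) ** 2 for x, r in zip(inputs, reconstruction)) / len(inputs)

def binary_cross_entropy(inputs, reconstruction, eps=1e-12):
    """Average cross-entropy, suited to binary or [0, 1]-normalized data."""
    total = 0.0
    for x, r in zip(inputs, reconstruction):
        r = min(max(r, eps), 1.0 - eps)  # keep log() arguments valid
        total += -(x * math.log(r) + (1.0 - x) * math.log(1.0 - r))
    return total / len(inputs)

original = [1.0, 0.0]
good_bce = binary_cross_entropy(original, [0.9, 0.1])
bad_bce = binary_cross_entropy(original, [0.1, 0.9])
```

A reconstruction close to the input gives a lower loss (`good_bce`) than one far from it (`bad_bce`), which is the signal training uses to improve the encoder and decoder.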
Autoencoder
• Autoencoders are a specialized class of algorithms that can learn efficient
representations of input data with no need for labels.
• It is a class of artificial neural networks designed for unsupervised learning.
The essential principle of an autoencoder is learning to compress and
effectively represent input data without specific labels.
• This is accomplished using a two-fold structure that consists of an encoder
and a decoder.
• The encoder transforms the input data into a reduced-dimensional
representation, which is often referred to as “latent space” or “encoding”.
From that representation, the decoder rebuilds the initial input. This
process of encoding and decoding forces the network to identify the
essential features and so learn meaningful patterns in the data.
• Autoencoders are a specific type of feedforward neural networks where
the input is the same as the output.
• They compress the input into a lower-dimensional code and then
reconstruct the output from this representation.
• The code is a compact “summary” or “compression” of the input, also
called the latent-space representation.
• An autoencoder consists of 3 components: encoder, code and decoder.
The encoder compresses the input and produces the code, the decoder
then reconstructs the input only using this code.
• To build an autoencoder we need 3 things: an encoding method, a decoding
method, and a loss function to compare the output with the target.
Architecture of Autoencoder in Deep Learning
• The general architecture of an autoencoder includes an encoder,
decoder, and bottleneck layer.
• Encoder
• The input layer takes the raw input data.
• The hidden layers progressively reduce the dimensionality of the
input, capturing important features and patterns. These layers
compose the encoder.
• The bottleneck layer (latent space) is the final hidden layer, where the
dimensionality is significantly reduced. This layer represents the
compressed encoding of the input data.
• Decoder
• The decoder takes the encoded representation from the bottleneck layer
and expands it back to the dimensionality of the original input.
• The hidden layers progressively increase the dimensionality and aim
to reconstruct the original input.
• The output layer produces the reconstructed output, which ideally
should be as close as possible to the input data.
• The loss function used during training is typically a reconstruction
loss, measuring the difference between the input and the
reconstructed output. Common choices include mean squared error
(MSE) for continuous data or binary cross-entropy for binary data.
• During training, the autoencoder learns to minimize the
reconstruction loss, forcing the network to capture the most
important features of the input data in the bottleneck layer.
• Consider a 256x256 pixel grayscale picture as an example.
• With this picture we quickly run into a problem: an image of 256x256
pixels corresponds to an input vector of 65,536 dimensions.
• If we used an image produced by a conventional cellphone camera, which
generates images of 4000 x 3000 pixels, we would have 12 million
dimensions to analyze.
• As you can see, the number of dimensions grows very quickly with image
size (it is the product of width and height). Returning to our example,
we don’t need to use all of the 65,536 dimensions to classify an emotion.
• A human identifies emotions from specific facial expressions and a few
key features, like the shape of the mouth and eyebrows.
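The encoder, bottleneck, decoder, and reconstruction loss described in this section can be sketched with NumPy. This is a deliberately simplified linear autoencoder (no activation functions, one encoder and one decoder layer) trained by plain gradient descent on synthetic data; the data, layer sizes, learning rate, and step count are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative assumption): 200 samples with 8 features that
# really depend on only 2 hidden factors, so a 2-unit bottleneck suffices.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = factors @ mixing

n_inputs, n_code = X.shape[1], 2
W_enc = rng.normal(scale=0.1, size=(n_inputs, n_code))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(n_code, n_inputs))  # decoder weights
learning_rate = 0.01

def reconstruct(X):
    code = X @ W_enc            # encoder: input -> bottleneck code
    return code, code @ W_dec   # decoder: code -> reconstructed input

def mse(X):
    _, X_hat = reconstruct(X)
    return float(np.mean((X - X_hat) ** 2))

initial_loss = mse(X)
for _ in range(2000):
    code, X_hat = reconstruct(X)
    # Gradient of the squared reconstruction error w.r.t. X_hat
    # (averaged over samples, up to a constant factor).
    error = (X_hat - X) / len(X)
    grad_dec = code.T @ error
    grad_enc = X.T @ (error @ W_dec.T)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc
final_loss = mse(X)

code, _ = reconstruct(X)  # 2-dimensional representation of each sample
```

After training, `final_loss` is lower than `initial_loss`, and `code` holds a 2-dimensional summary of each 8-dimensional input. A practical autoencoder would add non-linear activations and more layers, as described above.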
Boltzmann Machine
• Boltzmann Machine is a generative unsupervised model, which involves
learning a probability distribution from an original dataset and using it to
make inferences about never before seen data.
• A Boltzmann Machine has an input layer (also referred to as the visible layer)
and one or several hidden layers.
• Boltzmann Machines were first introduced by Geoffrey Hinton and Terry
Sejnowski in the 1980s. They are inspired by the concepts of statistical
mechanics and the Boltzmann distribution from physics.
• Boltzmann Machines have since evolved and have become an important
tool for various tasks, including unsupervised learning, dimensionality
reduction, feature learning, and generative modeling.
• A Boltzmann Machine uses a neural network whose neurons are connected not
only to neurons in other layers but also to neurons within the same layer.
• Everything is connected to everything. Connections are bidirectional: visible
neurons are connected to each other, and hidden neurons are also connected
to each other.
• A Boltzmann Machine treats all neurons alike; it doesn’t discriminate
between hidden and visible neurons. To the Boltzmann Machine, the whole
network is a single system, and it generates states of that system.
• There is no output node in this model, so unlike our other classifiers, we
cannot make this model learn a 1 or 0 target variable from the training
dataset by applying Stochastic Gradient Descent (SGD) or similar methods.
• Suppose you have a factory with many sophisticated machines, and
you are concerned about the safety of the workers and the factory.
• There are certain parameters that you monitor regularly, like the presence of
smoke, temperature, air quality, etc.
• If all these parameters are within a certain range, we can consider the
factory safe.
• You want a system that warns you as soon as there is a change in the
normal state of any of these parameters or of a combination of them.
• We cannot use supervised learning in this case, because we will not have
data for unusual or hazardous states. We are trying to detect
something that has not happened yet but can happen.
• We should be able to detect when the system is entering a hazardous state
even if we have never seen such a state before.
• This can be done by building a model of the normal state and noticing
when a new state differs from the normal states.
• A Boltzmann Machine can help here. It takes the training data as input and
adjusts its weights. From the input, it learns the possible relations
between all these parameters and how they influence each other. It
comes to model the normal state of the system (here, the factory).
• The Boltzmann Machine can then be used to monitor our factory and warn us
in case of any unusual state. It learns how the system works in
its normal state from a number of good examples.
• Boltzmann Machines are Energy-based Models (EBMs). Their most widely
used variant is the Restricted Boltzmann Machine (RBM).
• Since this machine has no output layer, the question arises of how
to identify and adjust the weights and how to measure whether our
prediction is accurate. All these questions have one answer: the
Restricted Boltzmann Machine.
Restricted Boltzmann Machine
• RBMs are a special type of Boltzmann Machine. They are called
‘Restricted’ because, unlike in the original Boltzmann Machine, no two
nodes in the same group (visible or hidden) are connected to each other.
• The RBM is trained using a process called contrastive divergence, which
approximates the gradient of the likelihood and is used with stochastic
gradient updates.
• During training, the network adjusts the weights of the connections
between the neurons in order to maximize the likelihood of the training
data. Once the RBM is trained, it can be used to generate new samples
from the learned probability distribution.
• RBM has found applications in a wide range of fields, including computer
vision, natural language processing, and speech recognition.
• It has also been used in combination with other neural network
architectures, such as deep belief networks and deep neural networks, to
improve their performance.
• An RBM works in two phases:
• 1st Phase: In this phase, we take the input layer and use the weights
and biases to activate the hidden layer. This process is called the
Feed Forward Pass. In the Feed Forward Pass we identify the positive
associations and negative associations.
• Feed Forward Equation:
• Positive Association — When the association between the visible unit and
the hidden unit is positive.
• Negative Association — When the association between the visible unit and
the hidden unit is negative.
• 2nd Phase: Since there is no output layer, instead of computing an
output we reconstruct the input layer from the activated hidden state.
• This process is called the Feed Backward Pass. We are simply tracing back to
the input layer through the activated hidden neurons. Once the input has
been reconstructed from the activated hidden state, we can
calculate the error and adjust the weights in this way:
• Feed Backward Equation:
• Error = Reconstructed Input Layer - Actual Input Layer
• Adjust Weight = Input * Error * Learning Rate (e.g., 0.1)
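• The two passes and the error can be sketched in a few lines of plain
Python. This is a minimal illustration, not the source's implementation: the
helper names (forward, backward), the layer sizes, the random initial weights
and the sigmoid activation are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(v, W, b_hidden):
    """Feed Forward Pass: visible layer -> hidden activation probabilities."""
    return [sigmoid(b_hidden[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_hidden))]


def backward(h, W, b_visible):
    """Feed Backward Pass: hidden layer -> reconstructed visible layer."""
    return [sigmoid(b_visible[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_visible))]


# Assumed toy setup: 4 visible units, 2 hidden units, small random weights.
rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
b_visible = [0.0] * n_visible
b_hidden = [0.0] * n_hidden

v = [1.0, 0.0, 1.0, 0.0]                  # actual input layer
h = forward(v, W, b_hidden)               # 1st phase
v_recon = backward(h, W, b_visible)       # 2nd phase
# Error = Reconstructed Input Layer - Actual Input Layer
error = [r - a for r, a in zip(v_recon, v)]
print(error)
```

• Because the reconstruction is a probability strictly between 0 and 1, the
error is negative wherever the actual input is 1 and positive wherever it is 0.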
• After completing these steps, we obtain the pattern that is responsible for
activating the hidden neurons. To understand how this works:
• Continuing with the image dataset example, an RBM can be trained to
learn the important features or patterns in the images. The visible units
represent the input pixels, and the hidden units capture higher-level
features. Once trained, the RBM can reconstruct images from hidden units’
activations and generate new images by sampling from the learned
distribution.
• Here are the key steps involved in training an RBM:
• Initialize the RBM: Set the initial weights and biases for the visible and
hidden units.
• Positive phase: Present the training data to the RBM’s visible units and
compute the activations of the hidden units. This step is called the positive
phase because it represents the propagation of information from the
visible units to the hidden units.
• Negative phase: Reconstruct the visible units from the computed
hidden activations and sample new hidden unit activations. This step
is called the negative phase because it represents the propagation of
information from the hidden units back to the visible units.
• Update weights and biases: Adjust the weights and biases of the
RBM based on the difference between the positive and negative
phase activities. The goal is to minimize the reconstruction error and
maximize the likelihood of the training data.
• Repeat steps: Iterate the positive and negative phases for a fixed
number of iterations or until convergence, adjusting the weights and
biases after each iteration.
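• The training steps above can be sketched as one-step contrastive
divergence (CD-1) in plain Python. This is a minimal sketch under stated
assumptions, not a reference implementation: the class name TinyRBM, the toy
two-pattern dataset, the learning rate of 0.1 and the 500 epochs are all
hypothetical choices, and the reconstruction uses probabilities rather than
sampled values for the visible units.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class TinyRBM:
    """Hypothetical minimal binary RBM trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.nv, self.nh = n_visible, n_hidden
        # Initialize the RBM: small random weights, zero biases.
        self.W = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.a = [0.0] * n_visible  # visible biases
        self.b = [0.0] * n_hidden   # hidden biases

    def hidden_probs(self, v):
        return [sigmoid(self.b[j] + sum(v[i] * self.W[i][j] for i in range(self.nv)))
                for j in range(self.nh)]

    def visible_probs(self, h):
        return [sigmoid(self.a[i] + sum(self.W[i][j] * h[j] for j in range(self.nh)))
                for i in range(self.nv)]

    def sample(self, probs):
        return [1.0 if self.rng.random() < p else 0.0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the training data.
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        # Negative phase: reconstruct the visible units, then recompute hidden.
        v1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(v1)
        # Update: difference between positive and negative phase statistics.
        for i in range(self.nv):
            for j in range(self.nh):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.a[i] += lr * (v0[i] - v1[i])
        for j in range(self.nh):
            self.b[j] += lr * (ph0[j] - ph1[j])

    def reconstruction_error(self, v):
        recon = self.visible_probs(self.hidden_probs(v))
        return sum((r - x) ** 2 for r, x in zip(recon, v)) / self.nv


# Assumed toy data: two complementary binary patterns.
data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
rbm = TinyRBM(n_visible=6, n_hidden=2)
err_before = sum(rbm.reconstruction_error(v) for v in data) / len(data)
for epoch in range(500):          # repeat for a fixed number of iterations
    for v in data:
        rbm.cd1_update(v)
err_after = sum(rbm.reconstruction_error(v) for v in data) / len(data)
print(f"reconstruction error: {err_before:.3f} -> {err_after:.3f}")
```

• Reconstruction error is used here only as a convenient progress signal; it
is not the quantity that contrastive divergence optimizes directly.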
• Imagine you're training a model to recognize handwritten digits (like the
digits 0-9).
• Initial State:
• You have a dataset of handwritten digits.
• Your model (an RBM) starts with random weights.
• Step-by-Step Process:
• Step 1: Positive Phase:
• You feed a digit image (e.g., a picture of the digit "2") into the model.
• The model activates certain neurons based on the input image, generating
hidden layer activations.
• This phase captures how the model sees the real data.
• Step 2: Reconstruct Phase:
• The model uses these hidden layer activations to reconstruct the
input image. It tries to regenerate the input digit image (e.g., a
reconstructed image that looks somewhat like "2").
• Step 3: Negative Phase:
• The model then takes the reconstructed image and goes through the
process again, producing another set of hidden layer activations and a
further reconstruction.
• This phase captures how the model sees its own predictions.
• Step 4: Contrastive Divergence Update:
• Compare the input-hidden co-activations from the positive phase (real
data) with those from the negative phase (model's reconstruction).
• Adjust the weights to reduce the difference between the two.
• This adjustment helps the model to improve its predictions.
• Iterate:
• Repeat the above steps for many iterations and for many images in your
dataset.
• Over time, the model's weights are fine-tuned to make the model better at
reconstructing images similar to the training data.
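• The update in Step 4 is commonly written as below. The notation is assumed
here, not taken from the source: η is the learning rate, and the angle
brackets denote averages of the product of visible unit v_i and hidden unit
h_j, measured in the positive phase (data) and the negative phase
(reconstruction).

```latex
\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}}
              - \langle v_i h_j \rangle_{\text{recon}} \right)
```

• When the reconstruction statistics match the data statistics, the two
terms cancel and the weights stop changing.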
• Types of RBM:
• There are mainly two types of Restricted Boltzmann Machine (RBM)
based on the types of variables they use:
• Binary RBM: In a binary RBM, the input and hidden units are binary
variables. Binary RBMs are often used in modeling binary data such as
images or text.
• Gaussian RBM: In a Gaussian RBM, the input units are continuous
variables that follow a Gaussian distribution, while the hidden units
are commonly kept binary. Gaussian RBMs are often used in modeling
continuous data such as audio signals or sensor data.
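• The two types differ mainly in how the visible units are modeled given the
hidden units. A sketch under common assumptions that the source does not
state (binary hidden units h_j, visible biases a_i, weights w_ij, and unit
variance for the Gaussian case):

```latex
\text{Binary RBM:} \quad
p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)

\text{Gaussian RBM:} \quad
v_i \mid h \sim \mathcal{N}\Big(a_i + \sum_j w_{ij} h_j,\; 1\Big)
```

• In practice, continuous inputs are usually standardized to zero mean and
unit variance before a Gaussian RBM of this form is trained.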
• Apart from these two types, there are also variations of RBMs such as:
• Deep Belief Network (DBN): A DBN is a type of generative model that
consists of multiple layers of RBMs. DBNs are often used in modeling high-
dimensional data such as images or videos.
• Convolutional RBM (CRBM): A CRBM is a type of RBM that is designed
specifically for processing images or other grid-like structures. In a CRBM,
the connections between the input and hidden units are local and shared,
which makes it possible to capture spatial relationships between the input
units.
• Temporal RBM (TRBM): A TRBM is a type of RBM that is designed for
processing temporal data such as time series or video frames. In a TRBM,
the hidden units are connected across time steps, which allows the
network to model temporal dependencies in the data.
Difference between Autoencoders & RBMs
• Autoencoders and Restricted Boltzmann Machines (RBMs) are both unsupervised
learning models used for feature learning and dimensionality reduction. However,
they have some key differences in terms of their architecture, training process,
and applications.
• 1. Architecture:
• Autoencoders: An autoencoder consists of an encoder network that maps the
input data to a lower-dimensional latent representation, and a decoder network
that reconstructs the input from the latent representation. The encoder and
decoder are typically neural networks.
• RBMs: RBMs are bipartite graphical models with visible and hidden units, forming
a two-layer architecture. The visible units correspond to the input data, and the
hidden units capture higher-level features. RBMs have undirected connections
between the visible and hidden units, without connections within the same layer.
• 2. Training Process:
• Autoencoders: Autoencoders are trained using an unsupervised learning
approach called reconstruction. The objective is to minimize the difference
between the input and the reconstructed output, typically using a loss
function such as mean squared error (MSE). Backpropagation and gradient
descent are commonly used for training autoencoders.
• RBMs: RBMs are trained using a procedure called Contrastive Divergence
(CD). The training process involves two phases: the positive phase and the
negative phase. In the positive phase, the RBM is presented with input
data, and the hidden units are activated based on the data. In the negative
phase, the RBM generates a reconstruction of the input by iterative
sampling from the hidden units and reconstructing the visible units. The
training objective is to minimize the difference between the positive and
negative phases.
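• For contrast with contrastive divergence, autoencoder training can be
sketched as plain gradient descent on the MSE reconstruction loss. This is a
minimal illustration under assumptions not found in the source: a single
sigmoid encoder layer and a single sigmoid decoder layer, a toy two-pattern
dataset, a learning rate of 1.0, and hypothetical names (encode, decode,
train_step).

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


rng = random.Random(0)
n_in, n_latent = 6, 2
W_enc = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_latent)]
b_enc = [0.0] * n_latent
W_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_latent)] for _ in range(n_in)]
b_dec = [0.0] * n_in


def encode(x):
    """Encoder: input -> lower-dimensional latent representation."""
    return [sigmoid(b_enc[j] + sum(W_enc[j][i] * x[i] for i in range(n_in)))
            for j in range(n_latent)]


def decode(h):
    """Decoder: latent representation -> reconstructed input."""
    return [sigmoid(b_dec[i] + sum(W_dec[i][j] * h[j] for j in range(n_latent)))
            for i in range(n_in)]


def mse(x):
    r = decode(encode(x))
    return sum((ri - xi) ** 2 for ri, xi in zip(r, x)) / n_in


def train_step(x, lr=1.0):
    h = encode(x)
    r = decode(h)
    # Backpropagation: gradient of the MSE loss at the decoder pre-activations,
    # then pushed back through the decoder weights to the encoder.
    d_dec = [2.0 * (r[i] - x[i]) / n_in * r[i] * (1.0 - r[i]) for i in range(n_in)]
    d_enc = [sum(d_dec[i] * W_dec[i][j] for i in range(n_in)) * h[j] * (1.0 - h[j])
             for j in range(n_latent)]
    # Gradient descent step on all parameters.
    for i in range(n_in):
        for j in range(n_latent):
            W_dec[i][j] -= lr * d_dec[i] * h[j]
        b_dec[i] -= lr * d_dec[i]
    for j in range(n_latent):
        for i in range(n_in):
            W_enc[j][i] -= lr * d_enc[j] * x[i]
        b_enc[j] -= lr * d_enc[j]


data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
loss_before = sum(mse(x) for x in data) / len(data)
for epoch in range(2000):
    for x in data:
        train_step(x)
loss_after = sum(mse(x) for x in data) / len(data)
print(f"MSE: {loss_before:.3f} -> {loss_after:.3f}")
```

• Unlike the RBM procedure, nothing here is sampled: the update is a
deterministic gradient of an explicit loss function.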
• 3. Generative vs. Reconstruction Models:
• Autoencoders: Autoencoders are primarily used as reconstruction models.
They aim to learn an efficient representation of the input data and
reconstruct it from the learned latent representation. Once trained,
autoencoders can generate reconstructions of the input, but they may not
be effective at generating entirely new samples.
• RBMs: RBMs are generative models capable of both reconstruction and
generation. They can learn the underlying distribution of the training data
and generate new samples by sampling from the learned distribution.
RBMs are probabilistic models based on the Boltzmann distribution,
allowing them to capture the statistical structure of the data and generate
diverse samples.
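• For reference, the Boltzmann distribution of a binary RBM is usually
defined through an energy function. The notation below is assumed rather than
taken from the source: a_i and b_j are the visible and hidden biases, w_ij
the weights, and Z the normalizing constant.

```latex
E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i \, w_{ij} \, h_j

p(v, h) = \frac{e^{-E(v, h)}}{Z},
\qquad
Z = \sum_{v, h} e^{-E(v, h)}
```

• Low-energy configurations are the most probable ones, so training shapes
the energy function to assign low energy to patterns that resemble the
training data.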
• 4. Applications:
• Autoencoders: Autoencoders have applications in various domains
such as image denoising, dimensionality reduction, anomaly
detection, and feature learning for supervised learning tasks.
• RBMs: RBMs have been widely used for collaborative filtering, feature
learning, topic modeling, and generative modeling tasks. They are
often utilized as building blocks for deep learning architectures, such
as Deep Belief Networks (DBNs).