Deep Learning Notes
Machine Learning Engineer
Create and improve computer programs that can learn from data, especially using complex algorithms called neural networks.
Research Scientist
Explore new ideas and develop better ways for computers to understand and process information, often publishing their findings in papers.
Data Scientist
Use deep learning to analyze large amounts of data and find patterns or make predictions.
AI Consultant
Advise companies on how to use deep learning to solve problems and improve their products or
operations.
Computer Vision Engineer
Develop technology that allows computers to see and understand images or videos, like recognizing faces or objects. Used in industries like automotive (self-driving cars), security, and healthcare (medical imaging).
Robotics Engineer
Build robots that can learn and make decisions using deep learning, for tasks like moving objects
or navigating spaces.
Freelance/Consulting Opportunities
UNIT 1: Machine Learning Basics
Learning
Learning in the context of machine learning and deep learning refers to how computers gain
knowledge and improve their ability to perform tasks without being explicitly programmed to do
so.
How it Works: We give the computer lots of examples of what we want it to do, and it
learns patterns from these examples. It gets smarter as it sees more examples.
It breaks down information into layers and learns from the simplest to the most complex
patterns. It can understand things like images, sounds, and texts in a way that’s more
similar to how humans do.
Why it Matters:
Automation: Helps computers automate tasks that normally require human intelligence,
like recognizing objects or understanding speech.
Improvement: Computers get better over time as they see more data and learn from it,
which improves their accuracy and efficiency.
In simple terms, learning in machine learning and deep learning means teaching computers to
learn from examples and improve their skills, making them more capable and intelligent in
various tasks.
Estimators
In everyday life, for example, if we try to guess how long it will take to get to a friend's house based on past trips, we are using a kind of estimator.
In machine learning, an estimator is a model or algorithm that learns from data to make
predictions or decisions.
For instance, if we want to predict whether an email is spam or not, we might use a
machine learning algorithm that has learned from past emails labeled as spam or not.
In summary, estimators are a fundamental tool in machine learning, enabling us to make informed decisions or predictions based on the patterns observed in data.
Bias, Variance
Bias and variance help us to understand the behavior and performance of models.
Bias: Bias measures how far off the predictions of a model are from the correct values. A high bias means the model is too simple and doesn’t capture the underlying patterns in the data. It tends to underfit the data.
Example: If a model predicts that all houses in a neighborhood have the same price, regardless
of their size or location, it has high bias.
Variance: Variance measures how much the predictions of a model vary for different training sets.
A high variance means the model is too sensitive to small fluctuations in the training data. It tends to overfit the data.
Example: If a model predicts vastly different house prices for the same house depending on the
training data it sees, it has high variance.
Bias: Error from erroneous assumptions in the learning algorithm. High bias can cause a
model to miss important patterns.
Variance: Error from sensitivity to small fluctuations in the training data. High variance
can cause a model to learn noise instead of the underlying relationships.
Impact on Models:
High Bias, Low Variance: Model is too simple and misses important patterns
(underfitting).
Low Bias, High Variance: Model is too complex and learns noise and random
fluctuations (overfitting).
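As a small illustration of this trade-off, the sketch below (assuming numpy and scikit-learn are available; the data and polynomial degrees are made up) fits a very simple and a very flexible model to the same noisy points. The simple one underfits (high bias) and the flexible one overfits (high variance).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=30)  # noisy samples

for degree in (1, 15):
    # degree 1 -> too simple (high bias), degree 15 -> too flexible (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    print(degree, model.score(x, y))  # training fit alone cannot reveal overfitting
```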
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is a method used in statistics and machine learning to find the most likely values of the parameters of a model given the data.
Likelihood: First, it calculates the likelihood function, which measures how likely the
observed data are for different values of the parameters.
Maximization: Then, it finds the values of the parameters that maximize this likelihood
function.
Example: If we have data about the heights of students and assume they follow a normal
distribution (bell curve), MLE would find the mean and standard deviation that make the
observed heights most probable.
Key Points:
Probability vs. Likelihood: Probability is used when the parameters are known and we
want to predict the data. Likelihood is used when the data are known and we want to
estimate the parameters.
Applications: Used in various fields, such as estimating parameters in neural networks,
and clustering algorithms.
In general, MLE helps us figure out the most likely values of the parameters of a model
that explain the data we observe.
It’s like trying to find the best guess for how something works based on the evidence we
have.
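For the student-height example, the MLE of a normal distribution's mean is the sample mean, and the MLE of its standard deviation divides by n rather than n - 1. A minimal numpy sketch with invented height values:

```python
import numpy as np

heights = np.array([160.0, 165.0, 170.0, 172.0, 168.0, 175.0])  # hypothetical student heights (cm)

mu_mle = heights.mean()            # MLE of the mean of a normal distribution
sigma_mle = heights.std(ddof=0)    # MLE of the standard deviation (divide by n, not n-1)
print(mu_mle, sigma_mle)
```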
Bayesian Statistics
Bayesian statistics is a way of thinking about and applying statistics that differs from traditional,
frequentist statistics.
Prior Belief: It starts with an initial belief (prior probability) about the likelihood of an
event or hypothesis being true based on existing knowledge or assumptions.
Data Collection: As new data is collected, Bayesian statistics updates this belief to a
posterior probability, taking into account both the prior belief and the new evidence.
Example: If we believe there’s a 50% chance of rain tomorrow based on historical data
(prior belief), and then we check the weather forecast (new data), Bayesian statistics
helps update our belief to a new probability (posterior probability) of rain tomorrow.
Applications:
Key Points:
Flexibility: Allows for the incorporation of prior beliefs and updates them with new data.
Interpretation: Results in probabilities that can be interpreted as degrees of belief rather
than just frequency of occurrence.
Complexity: Requires specifying prior distributions, which reflect initial beliefs and can
influence results.
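A worked sketch of the rain example using Bayes' rule (all the probabilities here are invented for illustration):

```python
# P(rain) prior, and how likely the forecast is to predict rain in each case (assumed numbers)
p_rain = 0.5                    # prior belief that it will rain
p_forecast_given_rain = 0.8     # forecast says rain when it actually rains
p_forecast_given_dry = 0.3      # forecast says rain even when it stays dry

# Bayes' rule: posterior = likelihood * prior / evidence
evidence = p_forecast_given_rain * p_rain + p_forecast_given_dry * (1 - p_rain)
posterior = p_forecast_given_rain * p_rain / evidence
print(posterior)  # about 0.73: belief in rain goes up after seeing the forecast
```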
Supervised Learning
Supervised learning is a type of machine learning algorithm that learns from labeled data.
Labeled data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, involves a supervisor acting as a teacher.
We teach or train the machine using well-labeled data, meaning the data is already tagged with the correct answer. After that, the machine is given a new set of examples (data) so that the supervised learning algorithm, which has analysed the training data (the set of training examples), can produce the correct outcome for this new data.
For example, a labeled dataset of images of Elephant, Camel and Cow would have each image tagged with either “Elephant”, “Camel”, or “Cow”.
Key Points:
Supervised learning involves training a machine from labeled data.
Labeled data consists of examples with the correct answer or classification.
The machine learns the relationship between inputs (the images) and outputs (their labels).
The trained machine can then make predictions on new, unlabeled data.
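A minimal supervised-learning sketch (assuming scikit-learn is available; the tiny animal dataset and its features are invented): labeled examples go in, a classifier learns the input-to-output mapping, and it then predicts a label for a new example.

```python
from sklearn.tree import DecisionTreeClassifier

# Each example: [approximate weight in kg, has a trunk (1 or 0)]; labels are the correct answers
X = [[4000, 1], [5500, 1], [600, 0], [450, 0], [700, 0]]
y = ["Elephant", "Elephant", "Cow", "Cow", "Camel"]   # labeled data

clf = DecisionTreeClassifier().fit(X, y)   # learn from labeled examples
print(clf.predict([[5000, 1]]))            # predict the label of a new, unseen example
```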
Unsupervised learning
Unsupervised learning is a type of machine learning that learns from unlabeled data. This
means that the data does not have any pre-existing labels or categories. The goal of
unsupervised learning is to discover patterns and relationships in the data without any explicit
guidance.
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided that means no training will be given to the
machine. Therefore the machine is restricted to find the hidden structure in unlabeled data by
itself.
Key Points
Unsupervised learning allows the model to discover patterns and relationships in unlabeled
data.
Clustering algorithms group similar data points together based on their inherent
characteristics.
Feature extraction captures essential information from the data, enabling the model to make
meaningful distinctions.
Label association assigns categories to the clusters based on the extracted patterns and
characteristics.
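A minimal clustering sketch (assuming scikit-learn; the 2-D points are invented): no labels are provided, and k-means groups the points purely by similarity.

```python
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two loose groups, but no labels are given to the algorithm
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]: points grouped by similarity alone
```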
UNIT 2: Deep Feedforward Networks
A Deep Feedforward Network is a type of artificial neural network designed to mimic the way humans learn.
Definition: It's a network of artificial neurons where information flows in one direction from
input to output.
Structure: Consists of multiple layers: an input layer, several hidden layers, and an output
layer.
Working
Input Layer: Takes in data (like images or numbers).
Hidden Layers: Transform the data through mathematical operations.
Output Layer: Produces the final result or prediction.
Training the Network
Learning: The network learns by adjusting connections (weights) between neurons based
on the difference between predicted and actual outputs.
Error Calculation: The error is calculated using a loss function (how far off the
predictions are from the actual values).
Back-Propagation: A method to adjust the weights by calculating the gradient of the loss
function, moving backward from output to input.
Feed-forward Networks
A feed-forward network is a type of artificial neural network designed to process and learn from
data in a specific way.
1. Input Layer: This is where data enters the network. Each node in this layer represents a
feature or piece of input data.
2. Hidden Layers: These layers are in between the input and output layers. Each layer
consists of nodes (also called neurons) that perform computations on the input data.
3. Output Layer: This layer produces the final output or prediction based on the
computations performed in the hidden layers.
Forward Propagation: The network processes data in a forward direction, from the
input layer through the hidden layers to the output layer. Each neuron in the hidden layers
receives input from neurons in the previous layer, applies a mathematical operation (like
a weighted sum), and passes the result to the next layer.
Activation Function: Neurons often use an activation function to introduce non-linearity
into the network, allowing it to learn more complex patterns in the data.
Learning in Feed-forward Networks
Training: The network learns by adjusting the weights (strengths of connections between
neurons) based on the difference between predicted and actual outputs. This process
involves:
o Loss Function: A measure of how well the network’s predictions match the
actual outputs.
o Gradient Descent: An optimization algorithm used to minimize the loss function
by adjusting the weights in the network.
Back-propagation: The technique used to calculate the gradient of the loss function with
respect to each weight in the network. This allows the network to update its weights in
the right direction to reduce prediction errors.
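A minimal numpy sketch of forward propagation through one hidden layer (the layer sizes, weights, and input are made up): each layer computes a weighted sum and applies a non-linear activation before passing the result on.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)           # activation introduces non-linearity

x = np.array([0.5, -1.2, 3.0])        # input features (made up)

W1 = np.random.randn(4, 3) * 0.1      # hidden layer: 4 neurons, each taking 3 inputs
b1 = np.zeros(4)
W2 = np.random.randn(1, 4) * 0.1      # output layer: 1 neuron
b2 = np.zeros(1)

h = relu(W1 @ x + b1)                 # hidden layer: weighted sum + activation
y_hat = W2 @ h + b2                   # output layer: final prediction
print(y_hat)
```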
Applications
Gradient-based Learning
Gradient-based learning is a method used to optimize the parameters (weights and biases) of a
machine learning model, such as a neural network, by minimizing a loss function. The goal is to
adjust these parameters in a way that reduces the difference between the model's predictions and
the actual target values.
Key Components
1. Loss Function:
o A function that measures how far off the model's predictions are from the actual
target values. It quantifies the error between predicted and actual outcomes.
2. Gradient Descent:
o An optimization algorithm used to minimize the loss function by iteratively
adjusting the model's parameters. It works by computing the gradient (derivative)
of the loss function with respect to each parameter.
o Gradient: Indicates the direction of the steepest increase of the function. In
gradient descent, we move in the opposite direction (downhill) to reduce the loss.
3. Back-propagation:
o A specific application of gradient-based learning used in neural networks.
o Backward Propagation of Errors: This technique calculates gradients starting
from the output layer, moving backward through the network. It determines how
much each neuron contributed to the error in prediction.
o Update Rule: The weights and biases are updated in small steps (controlled by a
learning rate) proportional to the negative gradient of the loss function.
Initialization: Start with initial values for the model's parameters (weights and biases).
Forward Pass: Input data is fed through the model, producing predictions.
Compute Loss: Compare the model's predictions with the actual target values using the
loss function.
Backward Pass (Back-propagation): Calculate the gradient of the loss function with
respect to each parameter using the chain rule of calculus. This tells us how much each
parameter contributed to the error.
Gradient Descent: Adjust the parameters in the direction that reduces the loss function,
using the computed gradients. This process is repeated iteratively until convergence
(when the loss is minimized sufficiently) or for a set number of epochs (iterations).
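A minimal gradient-descent sketch on a one-parameter model (the data and learning rate are invented): compute the loss, compute its gradient, step the parameter in the opposite direction, and repeat.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])      # roughly y = 2x (invented data)

w = 0.0                                  # initialization
lr = 0.01                                # learning rate

for step in range(200):
    y_hat = w * x                        # forward pass
    loss = np.mean((y_hat - y) ** 2)     # compute loss (mean squared error)
    grad = np.mean(2 * (y_hat - y) * x)  # backward pass: dLoss/dw
    w -= lr * grad                       # gradient descent update (move downhill)
print(w, loss)                           # w converges near 2
```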
Benefits and Applications
Efficiency: Allows complex models like neural networks to learn patterns in data
effectively.
Versatility: Applicable to a wide range of machine learning tasks, including regression,
classification, and deep learning.
Scalability: Scales well with large datasets and complex models due to its iterative
nature.
Hidden Units
Hidden units, also known as neurons or nodes, are essential components of artificial neural
networks.
Definition: Hidden units are nodes in the hidden layers of a neural network that perform
computations on the input data.
Purpose: They process the input received from the previous layer (which could be the
input layer or another hidden layer) and pass the transformed information to the next
layer or output layer.
Mathematical Operation: Each hidden unit performs a weighted sum of its inputs,
applies an activation function to the result, and then passes the output to the next layer.
Feature Extraction: Hidden units extract relevant features from the input data that are
useful for making predictions or classifications.
Example
Architecture Design
3. Data Flow: How data flows through the network during training and inference:
o Input Data: Initial data fed into the network.
o Forward Propagation: Passing data through layers to generate predictions.
o Backward Propagation: Calculating gradients and updating weights to minimize
errors (training).
Example
Importance
Performance: Determines how well the network learns from data and generalizes to
unseen examples.
Efficiency: Optimizes computational resources and training time.
Computational Graphs
Computational graphs play a crucial role in understanding how feedforward neural networks
process data.
1. Nodes (Operations):
o Input Nodes: Represent input data or variables.
o Operation Nodes: Represent mathematical operations performed on the input
data, such as addition, multiplication, or activation functions (like ReLU or
sigmoid).
o Output Nodes: Represent the final output of the network.
Input Layer: Input data (features) are represented as nodes in the computational graph.
Hidden Layers: Each layer's nodes perform operations (like matrix multiplications and
activations) based on weights and biases.
Output Layer: The final layer's nodes produce the output predictions based on the
transformed data from the last hidden layer.
Visualization: Provides a clear and structured way to understand how data flows through
the network and how computations are performed at each step.
Gradient Calculation: Simplifies the process of calculating gradients during
backpropagation by applying the chain rule of calculus sequentially through the graph.
Debugging and Optimization: Helps in debugging errors and optimizing the network
architecture by visualizing data and operation flows.
Example
d = a + b
e = b − c
Y = d * e
Here, we have three operations: addition, subtraction, and multiplication. To create a computational graph, we create a node for each operation along with its input variables. The direction of each arrow shows the direction in which the input is applied to other nodes.
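The same graph can also be evaluated directly in code (a toy sketch with invented input values): each line corresponds to one node, and earlier outputs feed later nodes exactly as the arrows indicate.

```python
a, b, c = 2.0, 3.0, 1.0   # input nodes (example values)

d = a + b                 # addition node
e = b - c                 # subtraction node
Y = d * e                 # multiplication node (final output)
print(Y)                  # 5 * 2 = 10
```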
Back-Propagation
Back-propagation is a method used in training artificial neural networks, particularly deep feed-
forward networks.
1. Neural Network Basics:
o A neural network is like a complex function that takes input data (like images or
numbers) and makes predictions or decisions based on that data.
o It has layers of neurons (nodes) that process the input data and pass it through the
network to produce an output.
2. Training a Neural Network:
o The goal of training is to make the network's predictions as accurate as possible.
o We do this by adjusting the "weights" of the connections between the neurons.
These weights determine how strongly one neuron affects another.
3. Forward Pass:
o When we input data into the network, it flows forward through the layers (from
input to output).
o The network makes a prediction based on the current weights.
4. Error Calculation:
o After the network makes a prediction, we compare it to the actual, correct output
(the "ground truth").
o The difference between the prediction and the actual output is called the error or
loss.
5. Back-Propagation:
o To reduce this error, we need to adjust the weights. This is where back-
propagation comes in.
o Back-propagation works by calculating the gradient of the error with respect to
each weight. This tells us how much the error would change if we slightly
adjusted each weight.
6. Gradient Descent:
o Using these gradients, we update the weights in a way that should reduce the
error. This process is called gradient descent.
o Essentially, we "nudge" the weights in the direction that decreases the error the
most.
7. Iterative Process:
o This process of forward pass, error calculation, back-propagation, and weight
adjustment is repeated many times.
o Each iteration helps the network learn and improve its predictions.
Example
Forward Pass: Show a picture, and the child guesses if it's a cat.
Error Calculation: Tell the child if the guess was right or wrong.
Back-Propagation: Explain why the guess was wrong (e.g., "this picture has stripes like
a tiger, not a cat").
Weight Adjustment: The child adjusts their mental criteria for what a cat looks like.
Repeat: Over time, with many examples, the child's guesses improve.
Back-propagation is the technical way neural networks learn from mistakes and get better at
making accurate predictions.
Regularization
Imagine we are trying to draw a line through a scatter plot of points to predict future points. If we
make the line too wavy and try to pass through every single point exactly, it will not be a good
predictor for new points. This is overfitting.
Regularization helps smooth out the model so it doesn't become too complex. Here are a few
ways regularization can be applied:
1. L1 Regularization (Lasso):
o Adds a penalty equal to the absolute value of the magnitude of the coefficients.
o Encourages the model to have simpler and more sparse coefficients, often driving
some coefficients to zero, effectively performing feature selection.
2. L2 Regularization (Ridge):
o Adds a penalty equal to the square of the magnitude of the coefficients.
o Encourages smaller coefficients overall, making the model less sensitive to the
noise in the data.
Example
Imagine we are trying to predict house prices based on various features like size, number of
rooms, location, etc.
Without Regularization: our model might place a huge emphasis on very specific
features, like the color of the front door, because it just so happened to correlate with
price in our training data.
With Regularization: our model will be more cautious and not put too much weight on
any single feature, unless it's genuinely important. It will focus on the general patterns,
like size and location, which are more likely to be useful for predicting new house prices.
Parameter Penalties
Parameter penalties are methods used in machine learning to prevent models from becoming too
complex and overfitting the training data. They work by adding an extra term to the model's loss
function that penalizes large or overly complex model parameters (weights).
When training a machine learning model, the goal is to make accurate predictions on new,
unseen data. However, if a model becomes too complex, it can learn not only the underlying
patterns in the training data but also the noise. This makes the model perform well on training
data but poorly on new data.
To prevent this, we add a penalty term to the loss function that discourages the model from
having overly large or complex weights. There are two common types of parameter penalties:
1. L1 Penalty (Lasso):
o Adds the sum of the absolute values of the weights to the loss function.
o Formula: Loss = Original Loss + λ ∑ |wᵢ|
o Effect: Encourages sparsity, meaning it drives some weights to zero, effectively
performing feature selection.
2. L2 Penalty (Ridge):
o Adds the sum of the squares of the weights to the loss function.
o Formula: Loss = Original Loss + λ ∑ wᵢ²
o Effect: Encourages smaller weights overall, making the model less sensitive to
individual data points and more robust.
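A minimal sketch of how these penalty terms are added to a loss (the weights, predictions, and λ are invented): the penalty grows with the size of the weights, so large weights are discouraged.

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])           # model weights (made up)
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.2, 0.9])

lam = 0.1                                # regularization strength (lambda)
original_loss = np.mean((y_pred - y_true) ** 2)

l1_loss = original_loss + lam * np.sum(np.abs(w))   # L1 (Lasso) penalty
l2_loss = original_loss + lam * np.sum(w ** 2)      # L2 (Ridge) penalty
print(original_loss, l1_loss, l2_loss)
```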
Example
Imagine we are predicting the price of a car based on features like age, mileage, horsepower, etc.
Without Penalty: The model might place a huge weight on horsepower if there's a strong
correlation in the training data, even if it's just due to random noise.
With Penalty: The penalty discourages overly large weights, leading the model to
consider all features more evenly and avoid relying too heavily on any single feature.
Data Augmentation
Data augmentation is a technique used in machine learning to increase the amount of training
data by making small modifications to the existing data. This helps improve the model's
performance and generalization without actually collecting new data.
More Data: Having more data usually helps the model learn better.
Prevent Overfitting: By showing the model varied versions of the data, it becomes less
likely to memorize specific details (noise) and more likely to understand general patterns.
Improve Generalization: The model becomes better at making accurate predictions on
new, unseen data.
Example
Imagine we have 100 pictures of apples to train a model to recognize apples. This is a small
dataset, and the model might not learn well. By using data augmentation, we can create many
more images from these 100 pictures:
Now, instead of just 100 images, we might have 1,000 varied images of apples. This helps the
model learn to recognize apples better.
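A minimal numpy sketch of this idea (a random array stands in for a real apple photo): simple flips, shifts, and brightness changes turn one original into several new training examples.

```python
import numpy as np

image = np.random.rand(32, 32, 3)            # stand-in for one apple photo

augmented = [
    np.fliplr(image),                        # horizontal flip
    np.flipud(image),                        # vertical flip
    np.roll(image, shift=3, axis=1),         # small horizontal shift
    np.clip(image * 1.2, 0, 1),              # brightness change
]
print(len(augmented), "extra training images from one original")
```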
Multi-task Learning
Multi-task learning is a machine learning approach where a model is trained to perform multiple
tasks simultaneously. Instead of training separate models for each task, a single model learns to
handle all the tasks together.
Why Use Multi-task Learning?
1. Shared Knowledge: Different tasks can share information and features, helping the
model learn better.
2. Efficiency: Training one model for multiple tasks is often faster and requires less
computational power than training separate models for each task.
3. Improved Performance: The model can become more robust and generalize better
because it learns from more diverse data.
In multi-task learning, the model has a shared part and task-specific parts:
1. Shared Layers: These layers learn common features from all tasks. For example,
recognizing edges and shapes in images is useful for both object detection and facial
recognition.
2. Task-Specific Layers: These layers focus on the details specific to each task. For
instance, one set of layers might specialize in identifying objects, while another set
specializes in recognizing faces.
Example
Imagine we are developing an app that needs to do both object detection and facial recognition
from images:
Object Detection: Identifying and labeling different objects in an image (e.g., cars, trees,
buildings).
Facial Recognition: Identifying and labeling faces in an image.
Shared Layers: The model learns common features like edges, textures, and basic
shapes.
Task-Specific Layers: One set of layers is fine-tuned to detect general objects, while
another set is fine-tuned to recognize faces.
Bagging
Bagging, short for "Bootstrap Aggregating," is a technique used in machine learning to improve
the stability and accuracy of algorithms. It's especially useful for reducing variance and
preventing overfitting.
Reduce Overfitting: By averaging multiple models, bagging helps prevent the model
from becoming too tailored to the training data.
Increase Stability: Combining the predictions of multiple models makes the final model
more robust and less sensitive to the specific quirks of the training data.
How Bagging Works
1. Bootstrapping: This involves creating multiple subsets of the original training data by
randomly sampling with replacement. This means some data points might appear more
than once in a subset, while others might not appear at all.
2. Training: Each subset is used to train a separate model. This means we end up with
multiple models, each trained on slightly different data.
3. Aggregating: The predictions from all the models are combined to make the final
prediction. For regression tasks (predicting a number), we might average the predictions.
For classification tasks (predicting a category), we might use majority voting.
Example
Original Data: we have a dataset with 1,000 houses and their prices.
Bootstrapping: we create 10 different subsets of this data, each containing 1,000 houses
but with some duplicates and some missing (because of random sampling with
replacement).
Training: we train 10 separate models, each on one of these subsets.
Aggregating: For a new house, each of the 10 models makes a price prediction. The final
predicted price is the average of these 10 predictions.
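A minimal bagging sketch with scikit-learn (the regression data is synthetic): several trees are trained on bootstrap samples and their predictions are averaged.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

# 10 trees, each trained on a bootstrap sample; their predictions are averaged
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=10, random_state=0)
model.fit(X, y)
print(model.predict(X[:1]))   # averaged prediction from the 10 trees for one new example
```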
Dropout
Dropout is a technique used while training neural networks: at each training step, a random subset of neurons is temporarily ignored ("dropped out").
1. Why Use Dropout?
o Prevent Overfitting: It keeps the model from becoming too tailored to the training data.
o Improve Generalization: By forcing the network to not rely too heavily on any particular neuron, it learns to be more robust and performs better on new, unseen data.
2. How Dropout Works:
o During Training: At each training step, randomly set some of the neurons (along
with their connections) to zero. This means they are temporarily removed from
the network.
o During Testing: All neurons are used, but their outputs are scaled by the fraction of neurons kept during training (1 minus the dropout rate) to balance the effect.
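A minimal numpy sketch of the training-time step (the activations and dropout rate are invented; this uses the common "inverted dropout" variant, which scales the kept neurons up during training instead of scaling at test time):

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([0.2, 1.5, 0.7, 2.0, 0.9])       # hidden-layer activations (made up)

drop_rate = 0.4
mask = rng.random(h.shape) >= drop_rate        # keep each neuron with probability 0.6
h_train = h * mask / (1 - drop_rate)           # inverted dropout: scale the kept neurons up
print(h_train)                                 # some activations are zeroed on this step
```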
Example
Imagine we are teaching a class of students, and we want to ensure they all understand the
material well:
Without Dropout: Every student is allowed to depend heavily on one smart student who
answers all questions.
With Dropout: Randomly, we ask different students to answer questions each time,
making everyone stay attentive and learn the material better.
Adversarial Training
Adversarial training is a method to make neural networks more robust by training them on
adversarial examples. These are inputs intentionally modified to confuse the model.
1. Why Use Adversarial Training?
o Improve Robustness: It helps the model to be less sensitive to small changes or
noise in the input data.
o Enhance Security: It makes the model more resistant to adversarial attacks where
someone tries to deceive the model.
2. How Adversarial Training Works:
o Create Adversarial Examples: Generate slightly altered versions of the training
data that are designed to fool the model.
o Train on Adversarial Examples: Include these adversarial examples in the
training process so the model learns to correctly classify them.
Example
Imagine training a guard dog to recognize intruders:
Without Adversarial Training: The dog learns to recognize intruders only in clear, straightforward situations.
With Adversarial Training: we also train the dog with people wearing disguises or in
different lighting conditions, making the dog better at recognizing intruders in various
situations.
Optimization
Optimization in machine learning refers to the process of adjusting the model's parameters (like
weights) to minimize the error (loss function) and improve performance. It's how the model
learns from the data.
Example
Imagine we are trying to find the lowest point in a hilly landscape (minimize the error):
Without Optimization: we might randomly walk around and hope to find the lowest
point.
With Optimization: we carefully step downhill each time, gradually getting closer to the
lowest point with each step.
UNIT 3: Convolutional Networks
CNNs are a type of deep learning model specifically designed to handle image and video data.
They are very good at recognizing patterns, shapes, and objects in images.
Convolution Operation
The convolution operation is the core process in Convolutional Neural Networks (CNNs), which
are widely used for image and video recognition. It helps detect important features in the input
data, such as edges, textures, and patterns.
Key Components
1. Filter (Kernel):
o A small matrix of numbers (e.g., 3x3 or 5x5).
o Think of it as a small window that looks at a part of the image.
2. Input Image:
o The image we want to process.
o It’s represented as a matrix of pixel values.
3. Feature Map:
o The output after applying the filter to the input image.
o Highlights important features detected by the filter.
How It Works
Example
1. Filter Values:
1 0 1
0 1 0
1 0 1
2. First Position:
o Place the filter at the top-left corner of the image.
o Multiply each filter value by the corresponding image value and add them up.
o Example calculation: (1·image[0][0] + 0·image[0][1] + 1·image[0][2] + ... + 1·image[2][2]).
3. Move the Filter:
o Slide the filter to the right by one pixel and repeat the calculation.
o Continue this process across the entire image.
4. Feature Map:
o After sliding the filter over the whole image, we get a new matrix (the feature
map) that highlights where the filter detected certain features.
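A minimal numpy sketch of this sliding-filter computation (the image and filter values are invented): at each position the filter and the image patch are multiplied element-wise and summed to fill one cell of the feature map.

```python
import numpy as np

image = np.array([[1, 2, 0, 1],
                  [0, 1, 3, 1],
                  [2, 1, 0, 0],
                  [1, 0, 1, 2]], dtype=float)

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)        # the 3x3 filter from the example

h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1))

for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        patch = image[i:i + kh, j:j + kw]            # region the filter currently covers
        feature_map[i, j] = np.sum(patch * kernel)   # multiply element-wise and sum
print(feature_map)
```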
Pooling
Pooling in convolutional neural networks (CNNs) is a technique used to simplify and reduce the
size of data. Imagine we have a large image, and we want to make it smaller while still keeping
the important features. Pooling helps with this by summarizing regions of the image.
1. Dividing into small parts: The image is divided into small sections, like 2x2 squares.
2. Choosing the most important information: From each small section, pooling picks the
most important value. This could be the highest value (max pooling) or the average of all
values in the section (average pooling).
3. Creating a smaller image: By repeating this process for the entire image, we get a
smaller version that retains the important features while reducing the amount of data to
process.
Types of Pooling Layers:
Max Pooling:
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after the max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
For example, if we have a 4x4 image and we apply 2x2 max pooling, we will end up with a 2x2
image where each value represents the highest value from each 2x2 section of the original image.
This makes the computation faster and helps the neural network focus on the most significant
parts of the image.
Average Pooling
Average pooling computes the average of the elements present in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in a patch.
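A minimal numpy sketch of 2x2 max and average pooling on a 4x4 feature map (the values are invented):

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 5, 7],
                 [1, 1, 3, 2]], dtype=float)

# Split the 4x4 map into a grid of 2x2 patches, then reduce each patch to one number
blocks = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)   # shape (2, 2, 2, 2): grid of 2x2 patches
max_pooled = blocks.max(axis=(2, 3))               # keep the strongest response per patch
avg_pooled = blocks.mean(axis=(2, 3))              # or the average response per patch
print(max_pooled)
print(avg_pooled)
```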
Basic Convolution Function
A basic convolution function in a convolutional neural network (CNN) is used to detect specific
features in an image, like edges, corners, or textures.
1. Kernel (Filter): Think of a small square grid of numbers, usually 3x3 or 5x5. This grid is
called a kernel or filter.
2. Sliding the Kernel: The kernel slides over the entire image, one pixel at a time, from left
to right and top to bottom. At each position, the kernel focuses on a small section of the
image.
3. Multiplying and Summing: At each position, the numbers in the kernel are multiplied
by the corresponding numbers in the small section of the image. Then, all these products
are added together to get a single number.
4. Creating a New Image: The single number obtained from the multiplication and
summing replaces the original central pixel of the image section. This process creates a
new image (called a feature map) that highlights specific features detected by the kernel.
For example, if the kernel is designed to detect vertical edges, sliding it over the image will
create a feature map where vertical edges are more prominent.
Convolution Algorithm
A convolution algorithm in a convolutional neural network (CNN) helps the network find
patterns in an image, like edges, textures, or shapes.
1. Start with an Image: Think of the image as a grid of numbers, where each number represents the brightness of a pixel.
2. Choose a Filter (Kernel): Select a small grid of numbers, usually 3x3 or 5x5. This small
grid is called a filter or kernel. Each filter is designed to detect a specific feature in the
image.
3. Place the Filter on the Image: Put the filter on the top-left corner of the image.
4. Multiply and Sum: Multiply each number in the filter by the corresponding number in
the image grid under the filter. Then, add up all these products to get a single number.
5. Record the Result: Write down the single number we got from step 4 in a new grid,
starting at the top-left corner.
6. Move the Filter: Slide the filter one pixel to the right and repeat steps 4 and 5. Continue
this until we reach the end of the row.
7. Continue Downwards: Move the filter one pixel down to the start of the next row, and
repeat steps 4 to 6. Keep doing this until the filter has covered the entire image.
8. Create a Feature Map: The new grid of numbers we have written down is called a
feature map. This map highlights the specific features detected by the filter.
9. Apply Multiple Filters: Usually, multiple filters are used to detect different features in
the image. Each filter produces its own feature map.
10. Combine Feature Maps: The combined feature maps are then used as input for the next
layers of the CNN, helping the network to understand and recognize more complex
patterns in the image.
Unsupervised Features:
1. Learning without Labels: Unsupervised learning means the model learns from data
without any labeled examples. It finds patterns and structures on its own.
2. Feature Extraction: In a convolutional neural network (CNN), the network learns to
identify features (like edges, textures, shapes) from images without being explicitly told
what to look for. For example, it might learn that certain patterns of pixels tend to appear
together and represent meaningful parts of the image.
3. Clustering and Patterns: The network groups similar patterns together and recognizes
common structures in the images. This can help in tasks like grouping similar images or
identifying anomalies.
Neuroscientific Inspirations:
1. Mimicking the Brain: Convolutional networks are inspired by how the human brain
processes visual information. Neuroscientists discovered that our visual cortex (part of
the brain) processes images in layers, detecting simple features first and then combining
them into more complex representations.
2. Receptive Fields: In the brain, neurons have receptive fields, meaning they respond to
specific regions of the visual field. Similarly, in CNNs, filters (kernels) act like receptive
fields, focusing on small regions of the image at a time to detect features.
3. Hierarchical Processing: Just like the brain processes visual information in stages,
CNNs have multiple layers. Early layers might detect simple features like edges, while
deeper layers detect more complex features like faces or objects.
4. Local Connectivity: In the brain, neurons in the visual cortex are locally connected,
meaning they only connect to a small region of the visual field. CNNs mimic this by
having each neuron (or filter) only look at a small part of the image at a time.
UNIT 4: Sequence Modelling
Sequence modeling in deep learning involves creating models that can process and make
predictions based on sequences of data. These sequences can be anything that has a specific
order, like sentences in a paragraph, time-series data, or DNA sequences.
Key Points:
1. Understanding Order: Sequence models take into account the order of the data,
meaning they understand that "cat" followed by "sat" is different from "sat" followed by
"cat".
2. Handling Variable Lengths: They can manage data sequences of different lengths,
making them versatile for various tasks.
3. Context Awareness: These models remember previous inputs to make better predictions
about future inputs. For example, in language processing, knowing the previous words
helps predict the next word.
Recurrent Neural Networks (RNNs)
A Recurrent Neural Network (RNN) is a type of artificial neural network designed to handle sequential data, like sentences or time series.
Key Points:
1. Sequential Data: RNNs are good at processing data where the order matters. They can
understand sequences, such as a sentence where the meaning depends on the order of
words.
2. Memory: RNNs have a kind of memory that allows them to remember what they've seen
before. This memory helps them make better predictions or decisions based on previous
information.
3. Loops: Inside an RNN, there's a loop that passes information from one step to the next.
This loop helps the network keep track of context over time.
Example:
Predicting the Next Word: If you give an RNN the beginning of a sentence, like "The
cat sat on the", it uses the words it has seen to guess the next word, like "mat". It does this
by remembering the sequence of words that came before.
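A minimal numpy sketch of the recurrence inside an RNN (the sizes, weights, and inputs are all made up): the same weights are applied at every step, and the hidden state carries the memory forward.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 4))   # input -> hidden weights
W_hh = rng.normal(scale=0.1, size=(8, 8))   # hidden -> hidden weights (the "loop")
b_h = np.zeros(8)

sequence = [rng.normal(size=4) for _ in range(5)]   # 5 time steps of 4-dimensional inputs
h = np.zeros(8)                                      # initial memory

for x_t in sequence:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)         # new state depends on the input AND the old state
print(h)   # final hidden state summarizes the whole sequence
```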
Bidirectional RNNs
Key Points:
1. Two Directions: Unlike regular RNNs that read the sequence in one direction (usually
left to right), bidirectional RNNs read it both ways (left to right and right to left).
2. Two Layers: They have two hidden layers for each time step: one for processing the
sequence from start to end (forward), and another for processing it from end to start
(backward).
3. Combined Output: The outputs of these two layers are combined, providing more
context and understanding of the sequence.
Example:
Understanding a Sentence: For the sentence "The cat sat on the mat," a standard RNN
would read from "The" to "mat." A bidirectional RNN would read it from "The" to "mat"
and simultaneously from "mat" to "The," giving it a better understanding of the whole
sentence.
Encoder-Decoder Architectures
Key Points:
1. Two Parts:
o Encoder: This part reads the entire input sequence (like a sentence in English)
and compresses it into a fixed-size summary called a context vector.
o Decoder: This part takes the context vector and generates the output sequence
(like the translated sentence in Spanish) one step at a time.
2. Context Vector: The encoder transforms the input sequence into a context vector, which
is a compact representation of the input sequence’s information.
3. Sequential Processing: The decoder uses the context vector to produce the output
sequence step by step, often predicting one word at a time.
Example:
Translating a Sentence: To translate "Hello, how are you?" from English to Spanish:
o Encoder: Reads the English sentence and creates a context vector summarizing
its meaning.
o Decoder: Uses the context vector to generate the Spanish sentence "Hola, ¿cómo
estás?" word by word.
Deep Recurrent Networks (DRNs)
A network is considered "deep" when it has many layers. In the case of a Deep Recurrent Network, there are multiple layers of recurrent neurons stacked on top of each other. This depth allows the network to learn more complex patterns and relationships in the data.
1. Handling Sequences: DRNs are great for tasks involving sequences, such as language
translation, speech recognition, and time-series forecasting.
2. Learning Long-Term Dependencies: Because they can remember information from
earlier in the sequence, they can understand context and long-term dependencies.
3. Complex Patterns: The deep structure allows the network to learn and represent very
complex patterns and features in the data.
1. Input Sequence: Data is fed into the network one step at a time.
2. Processing: Each layer processes the data, with recurrent connections allowing
information to flow through time steps.
3. Output: The final layer produces the output, which could be a prediction, a
classification, or any other desired result.
Example
Imagine trying to predict the next word in a sentence. A DRN would read the sentence word by
word, using the information from previous words to predict the next one. For example, in the
sentence "The cat sat on the...", the network uses the words "The cat sat on" to predict the word
"mat."
Recursive Neural Networks
A Recursive Neural Network (RecNN) is a type of neural network that processes data in a hierarchical or tree-like structure. Instead of processing data in a flat sequence (like a Recurrent Neural Network), RecNNs process data by breaking it down into smaller parts and combining them in a tree-like fashion.
1. Hierarchical Data: Recursive Neural Networks are designed to handle data that has a
hierarchical structure, such as sentences (which can be broken down into phrases, words,
and characters) or images (which can be broken down into parts, sub-parts, and so on).
2. Breaking Down Data: The network takes the input data and breaks it down into its
components. For example, a sentence can be broken down into smaller phrases and then
into individual words.
3. Combining Components: The network processes these components by combining them
recursively. It starts from the smallest components and works its way up, combining them
to form larger and larger structures.
4. Final Output: The final output is produced after all the components have been combined
and processed. This output could be a prediction, classification, or any other desired
result.
Example
Imagine trying to understand the meaning of a sentence: "The quick brown fox jumps over the
lazy dog."
1. Breaking Down: The sentence can be broken down into phrases: "The quick brown fox"
and "jumps over the lazy dog."
2. Further Breakdown: Each phrase can be broken down further into words: "The,"
"quick," "brown," "fox," etc.
3. Combining: The network processes each word, then combines them into phrases, and
finally combines the phrases to understand the entire sentence.
Why Use a Recursive Neural Network?
1. Hierarchical Data: RecNNs are perfect for data with a natural hierarchical structure, like
sentences or images.
2. Context Understanding: By processing data hierarchically, they can understand context
and relationships between components more effectively.
3. Complex Structures: They can handle complex structures and dependencies in data that
other types of neural networks might struggle with.
Echo State Networks (ESNs)
An Echo State Network is a special type of RNN that has a unique way of handling the learning process.
1. Reservoir: The core part of an ESN is a large, fixed, random network of neurons called
the "reservoir." This reservoir creates a dynamic system that processes input data and
maintains a memory of previous inputs.
2. Input and Output Connections: Only the connections from the input to the reservoir
and from the reservoir to the output are trained. The connections within the reservoir are
fixed and not trained.
3. Echo State Property: The reservoir has the "echo state property," which means that the
influence of any input gradually fades over time, like an echo.
1. Input Data: Data is fed into the network through the input layer.
2. Reservoir Processing: The input data is processed by the fixed, random connections
within the reservoir. The reservoir’s internal state evolves based on the current input and
its previous states.
3. Output Generation: The output is generated by the trained connections from the
reservoir to the output layer.
Why Use an Echo State Network?
1. Simpler Training: Only the connections to the reservoir and from the reservoir to the
output are trained, making the training process faster and simpler compared to other
RNNs.
2. Rich Dynamics: The random, fixed connections within the reservoir create complex and
rich dynamics, which can be useful for processing time-series data and other sequential
inputs.
3. Memory and Context: The reservoir maintains a memory of previous inputs, allowing
the network to use context from past data to make better predictions.
Example
Imagine trying to predict the next temperature reading based on past data. An ESN would take
the past temperature readings as input, process them through the reservoir, and use the reservoir's
dynamic state to predict the next temperature.
UNIT 5: Deep Generative Models
Deep generative models are advanced computer programs that can create new data that looks
similar to the data they were trained on.
Boltzmann Machines
Structure
Training Process
Applications
Optimization Problems: Solving complex problems where we need to find the best
solution out of many possible options.
Pattern Recognition: Recognizing patterns and structures in data, like understanding
images or sequences.
Key Points
Restricted Boltzmann Machines (RBMs)
Two Layers:
o Visible Layer: This is the layer we can see, which corresponds to the input data.
o Hidden Layer: This layer is hidden from view and helps the visible layer learn
patterns.
Connections: Each unit in the visible layer is connected to every unit in the hidden layer,
but there are no connections between units within the same layer.
Training Process
Data Input: We start with input data fed into the visible layer.
Activation: The visible units activate the hidden units.
Reconstruction: The hidden units try to reconstruct the input data, and the network
adjusts its weights based on how well it does.
Iteration: This process is repeated many times to improve the network's ability to learn
patterns.
Applications
Feature Learning: RBMs are great at finding useful features in data, which can then be
used for other tasks like classification.
Recommendation Systems: They can be used to recommend products, like movies or
books, by learning patterns in user preferences.
Pre-training Deep Networks: RBMs can be used to pre-train deeper networks, making
the learning process more efficient.
Key Points
RBMs are simpler and faster to train compared to general Boltzmann Machines because
of their restricted connections.
They are effective at finding hidden features and patterns in data.
RBMs serve as building blocks for more complex models like Deep Belief Networks
(DBNs).
Deep Belief Networks (DBNs)
Layers:
o Visible Layer: The first layer that directly interacts with the input data.
o Hidden Layers: Multiple layers stacked on top of each other. Each layer learns
from the layer below it.
RBMs: Each pair of layers in a DBN forms an RBM, which means each layer's visible
units are connected to the hidden units of the next layer.
Layer-by-Layer Training:
1. Train the First RBM: Start with the first visible layer and the first hidden layer.
Train this RBM to learn basic features of the input data.
2. Use Learned Features: Once the first RBM is trained, use its hidden layer as the
visible layer for the next RBM.
3. Train the Next RBM: Train the second RBM to learn features from the hidden
layer of the first RBM.
4. Repeat: Continue this process for all layers, training one RBM at a time.
Fine-Tuning: After training all the layers, the entire network can be fine-tuned using
supervised learning to improve its performance on specific tasks.
Training Process
Applications
Key Points
DBNs are powerful because they learn in layers, with each layer capturing increasingly
complex patterns.
They start by learning basic features and build up to more complex ones.
The combination of unsupervised pre-training and supervised fine-tuning makes them
effective for various tasks.
Deep Boltzmann Machines (DBMs)
Think of DBMs as an advanced version of Boltzmann Machines with many layers of hidden units. They learn complex patterns in data by capturing deeper and more detailed features.
Structure
Layers:
o Visible Layer: The layer we can see, which corresponds to the input data.
o Multiple Hidden Layers: Several layers of hidden units stacked on top of each
other. These hidden layers help the network learn more detailed patterns in the
data.
Probabilistic Learning: DBMs learn by adjusting the connections between units to find
the best way to represent the input data.
Energy Minimization: They aim to minimize the "energy" of the system, finding the
most efficient way to represent the data.
Layer-wise Training:
1. Train Each Layer: Start by training one layer at a time, similar to how we train
an RBM.
2. Iterative Process: Each layer learns to represent the data from the previous layer
better.
3. Fine-Tuning: After training all layers, the entire network is fine-tuned to improve
overall performance.
Training Process
Applications
Image and Speech Recognition: DBMs can recognize objects in images and transcribe
spoken words by learning detailed patterns.
Complex Pattern Recognition: They are used in applications that require understanding
complex data relationships, like natural language processing.
Feature Extraction: DBMs can extract useful features from data for other machine
learning tasks.
Key Points
DBMs are powerful because they can learn very complex patterns in data by using
multiple hidden layers.
They use a probabilistic approach, which helps them find the most efficient way to
represent the data.
Training DBMs is more complex than RBMs but allows for more detailed and accurate
pattern recognition.
Sigmoid Belief Networks (SBNs)
Imagine a network of units (like tiny decision-makers) where each unit's state depends on its parent units. They use a specific mathematical function called the sigmoid function to decide these states.
Structure
Directed Graph: The network is directed, meaning connections between units have a
direction (from parent to child).
Units:
o Visible Units: The ones we can see, representing the input data.
o Hidden Units: The ones we can't see, helping to learn patterns in the data.
How They Learn
Sigmoid Function: This is a special mathematical function that outputs a value between
0 and 1. Each unit in the network uses this function to determine its state based on the
states of its parent units.
Probabilistic Approach: SBNs use probabilities to decide the states of units, making
them robust for different types of data.
Variational Inference: A complex method used to estimate the parameters of the
network. It helps the network learn the best representation of the data.
Training Process
Data Input: Start with input data fed into the visible units.
State Determination: Each unit uses the sigmoid function to decide its state based on its
parent units.
Parameter Adjustment: The network adjusts its parameters to better represent the input
data.
Iteration: This process is repeated many times to improve the network's ability to learn
patterns.
Applications
Pattern Recognition: SBNs can recognize patterns in data, making them useful for tasks
like image and speech recognition.
Data Generation: They can generate new data samples similar to the input data.
Complex Inference: Suitable for tasks that require understanding complex relationships
in data.
Key Points
SBNs are directed networks where each unit's state depends on its parent units.
They use the sigmoid function to determine states probabilistically.
Variational inference is used to learn the best parameters for the network.
Directed Generative Networks
These networks are designed to generate new data by following a specific direction or sequence in the network, much like creating a story step-by-step.
Structure
Directed Connections: The connections between units in the network have a direction,
meaning information flows in one way (from input to output).
Layers: They usually consist of several layers of units (neurons), where each layer passes
information to the next.
Training Process
Data Input: The input data is fed into the first layer of the network.
Forward Pass: The data flows through the network layer by layer, generating an output.
Error Calculation: The error between the generated output and the actual output is
calculated.
Backpropagation: The network adjusts its weights to minimize this error.
Repetition: This process continues until the network can generate data that closely
matches the input data.
Applications
Image Generation: Creating new images that look similar to a set of training images.
Text Generation: Writing new text that follows the patterns and style of training text.
Audio Generation: Producing new sounds or music based on training audio.
Key Points
Directed generative networks use a directed approach, where data flows in one direction
from input to output.
They learn to generate new data by adjusting their internal connections to minimize
errors.
Backpropagation is a key method for training these networks.
Auto-encoders
Auto-encoders are a type of neural network designed to learn efficient representations of data.
They compress the input data into a smaller form and then try to recreate the original data from
this smaller form.
Structure
Encoder: The part of the network that compresses the input data into a smaller, more
efficient representation (called the latent space or code).
Decoder: The part of the network that takes the compressed representation and tries to
recreate the original input data.
Training Process:
1. Input Data: Start with input data (like images or text).
2. Compression: The encoder compresses this data into a smaller representation.
3. Reconstruction: The decoder tries to recreate the original data from this smaller
representation.
4. Error Calculation: The network calculates the difference between the original
input and the reconstructed data.
5. Adjustment: The network adjusts its weights to minimize this difference.
6. Iteration: This process is repeated many times until the network can accurately
recreate the input data.
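A minimal PyTorch sketch of this training loop (assuming torch is installed; the layer sizes and the random stand-in data are made up): the encoder compresses 784-dimensional inputs to a 32-dimensional code, the decoder reconstructs them, and the loss is the reconstruction error.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))   # compress to a 32-dim code
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))   # reconstruct the input

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # a batch of fake "images" (stand-in data)
for step in range(100):
    code = encoder(x)                # compression into the latent space
    x_hat = decoder(code)            # reconstruction from the code
    loss = loss_fn(x_hat, x)         # how far the reconstruction is from the input
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```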
Drawing Samples from Auto-encoders
Sampling:
1. Generate Code: Start by generating or selecting a code from the latent space.
This code can be a compressed version of existing data or a new one created from
scratch.
2. Decode: Pass this code through the decoder part of the auto-encoder.
3. Generate Data: The decoder transforms the code back into a full data sample
(like generating a new image, text, etc.).
Applications
Data Denoising: Removing noise from data by learning to reconstruct clean data from
noisy input.
Dimensionality Reduction: Reducing the number of features in data while preserving
important information.
Data Generation: Creating new data samples similar to the training data by sampling
from the latent space.
Key Points
Auto-encoders learn to compress and then recreate data, capturing important features in
the process.
Drawing samples involves generating new codes in the latent space and decoding them to
create new data.
They can be used for cleaning data, reducing its size, and generating new, similar data
samples.