
Machine Learning Notes

Uploaded by

prafulprasad911

Hebbian Learning Rule (Detailed 10-Mark Answer)

1. Introduction
Hebbian Learning is one of the earliest and simplest learning rules used to train artificial neural
networks.
It was introduced by Donald Hebb in 1949 through his book "The Organization of Behavior."
The basic idea behind Hebbian learning is:
"If two neurons are activated (fire) at the same time, the connection (synapse) between them gets
stronger."
Thus, it explains how connections between neurons are formed and strengthened based on
experience and activity.
This concept is summarized as:
"Neurons that fire together, wire together."
It is one of the earliest and simplest learning rules for neural networks. Its main
characteristics are:
- The network has one input layer (with many units) and one output layer (with a single unit).
- The rule updates the weights between neurons based on the input and output.
- It is used for pattern classification and association, and for linearly separable logic
functions such as AND and OR. XOR is not linearly separable, so a single-layer Hebb network
cannot learn it.
- It works best with bipolar data rather than binary data.
2. Hebbian Learning Principle
The principle of Hebbian learning is:
- When the input neuron and the output neuron are both active simultaneously, the weight
between them is increased.
- If they are not active together, no significant change happens to the weight.
Thus, the learning occurs locally at the connection between two neurons, depending on their
activities.
3. Mathematical Formulation
Let:
- x_i = input from the i-th neuron
- y = output of the neuron
- w_i = weight associated with input x_i
- η = small positive learning rate (controls the speed of learning)
The change in weight Δw_i is given by:
Δw_i = η × x_i × y
And the weight update rule is:
w_i^(new) = w_i^(old) + Δw_i
Thus, the weight increases proportionally to the product of the input and the output.
4. Algorithm for Hebbian Learning
Algorithm Steps:
1. Initialize the synaptic weights w_i to small random values or zeros.
2. For each training example:
- Present the input vector x to the neuron.
- Compute the output y given by the formula y = ∑ (w_i * x_i) (simple sum or activation
function can be used).
- Update each weight using the rule:
w_i = w_i + η × x_i × y
3. Repeat the process for all training patterns.
4. Stop when weights converge or after a fixed number of iterations.
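The algorithm steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name hebbian_train, the bipolar patterns, and the starting weights are all assumptions made for the example, and a fixed number of epochs is used as the stopping rule.

```python
def hebbian_train(patterns, weights, eta=0.1, epochs=1):
    """Apply the Hebbian update w_i <- w_i + eta * x_i * y for every pattern."""
    w = list(weights)  # copy so the caller's list is not modified
    for _ in range(epochs):
        for x in patterns:
            # Output as a simple weighted sum: y = sum(w_i * x_i)
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Each weight changes in proportion to (input * output)
            w = [wi + eta * xi * y for wi, xi in zip(w, x)]
    return w


# Hypothetical bipolar patterns and small initial weights (illustration only)
patterns = [[1, -1, 1], [1, 1, -1]]
w0 = [0.1, 0.1, 0.1]
w1 = hebbian_train(patterns, w0, eta=0.1, epochs=1)
print(w1)
```

Because the output here is the plain weighted sum, weights that start at exactly zero would never change, which is why the sketch starts from small non-zero values.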
5. Explanation of the Algorithm
- At the start, weights are small, meaning weak connections.
- When an input x is applied and output y is produced, the product x_i × y reflects how strongly
both neurons are active.
- A positive product leads to an increase in weight (strengthening the connection).
- Learning continues until the network properly encodes the patterns in the form of strengthened
weights.
Thus, Hebbian learning models association between neurons based on their co-activation.
6. Example
Given:
- Initial weight w = 0.2
- Learning rate η = 0.1
- Input x = 1
- Output y = 1
Then,
Δw = 0.1 × 1 × 1 = 0.1
w_new = 0.2 + 0.1 = 0.3
Thus, the weight increases, strengthening the synaptic connection.
Another Example (negative input):
If x = -1, y = 1, then:
Δw = 0.1 × (-1) × 1 = -0.1
w_new = w_old - 0.1
Thus, the weight decreases, weakening the connection.
7. Diagram (Optional)
You can draw a simple two-neuron diagram:
[Neuron A] --(weight w)--> [Neuron B]
When both are active, weight w increases.
8. Features of Hebbian Learning
- Unsupervised learning: No need for a teacher signal; based only on input-output correlation.
- Local learning rule: Weight change depends only on the activity of connected neurons.
- Biological relevance: Explains learning and memory in biological brains.
- Incremental learning: Weights gradually adapt over time.
9. Advantages
- Simple and easy to implement.
- Mimics real biological neural behavior.
- Good for pattern recognition, associative memory, and feature extraction.
10. Disadvantages
- Weights can grow indefinitely without bounds, leading to instability.
- No mechanism for forgetting old patterns.
- Sensitive to noise (random inputs may falsely strengthen connections).
- Requires additional mechanisms like normalization to prevent infinite growth.
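The unbounded-growth problem and the effect of normalization can be seen in a small experiment. The sketch below is an illustration under assumptions: the input vector, the learning rate, and the choice of rescaling the weight vector to unit length after each update are hypothetical, and other normalization schemes exist.

```python
import math


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))


def hebb_step(w, x, eta, normalize=False):
    """One Hebbian update, optionally followed by rescaling to unit length."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    w = [wi + eta * xi * y for wi, xi in zip(w, x)]
    if normalize:
        n = norm(w)
        if n > 0:
            w = [wi / n for wi in w]
    return w


x = [1.0, 1.0]        # hypothetical input presented repeatedly
plain = [0.1, 0.1]    # updated without normalization
bounded = [0.1, 0.1]  # updated with normalization
for _ in range(50):
    plain = hebb_step(plain, x, eta=0.1)
    bounded = hebb_step(bounded, x, eta=0.1, normalize=True)

print(norm(plain), norm(bounded))
```

Without normalization each step multiplies these weights by a constant factor greater than one, so their length grows geometrically; with normalization the length stays at 1.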
11. Applications
- Neuroscience modeling: Brain learning simulations.
- Pattern storage: Associative memory networks like Hopfield networks.
- Feature detection: Extraction of important input patterns.
- Competitive learning: Forming self-organizing maps (SOMs).
Summary
Hebbian learning strengthens the synaptic connection between two neurons based on their
simultaneous activation. It is an unsupervised, simple, biologically-inspired learning rule widely
used in neural network models.

2. Expectation Maximization Algorithm


The Expectation Maximization (EM) algorithm is an iterative method used for
finding maximum likelihood estimates of parameters in statistical models,
especially when the model involves latent (hidden) variables. It is commonly
used in scenarios where the data has missing or incomplete information,
such as in mixture models like Gaussian Mixture Models (GMMs).

EM is used in unsupervised machine learning to estimate unknown
parameters in statistical models, especially when some data is
missing or hidden.
It works in two steps:
1. E-step (Expectation Step): Estimates missing or hidden
values of latent variables using current parameter
estimates.
2. M-step (Maximization Step): Updates model parameters
to maximize the likelihood based on the estimated values
from the E-step.
This process repeats until the model reaches a stable solution.
Each iteration never decreases the likelihood of the observed data,
although the result may be a local rather than a global maximum.
EM is widely used in clustering (e.g., Gaussian Mixture Models)
and handling missing data.

Expectation-Maximization in EM Algorithm

By iteratively repeating these steps, the EM algorithm seeks to
maximize the likelihood of the observed data. It is commonly used
for clustering, where latent variables are inferred, and it has
applications in various fields, including machine learning, computer
vision, and natural language processing.
Key Terms in Expectation-Maximization (EM) Algorithm

The most commonly used terms in the Expectation-Maximization (EM) algorithm
are explained below:

Latent Variables: These are hidden or unmeasured variables that affect what
we can observe in the data. We can’t directly see them, but we can make
educated guesses about them based on the data we can see.

Likelihood: This refers to the probability of seeing the data we have, based
on certain assumptions or parameters. The EM algorithm tries to find the
best parameters that make the data most likely.

Log-Likelihood: This is just the natural log of the likelihood function. It’s used
to make calculations easier and measure how well the model fits the data.
The EM algorithm tries to maximize the log-likelihood to improve the model
fit.

Maximum Likelihood Estimation (MLE): This is a technique for estimating the
parameters of a model. It does this by finding the parameter values that
make the observed data most likely (maximizing the likelihood).

Posterior Probability: In Bayesian methods, this is the probability of an
unknown quantity given both prior knowledge and the observed data. In EM, the
E-step computes the posterior probability of the latent variables given the
observed data and the current parameter estimates.

Expectation (E) Step: In this step, the algorithm estimates the missing or
hidden information (latent variables) based on the observed data and current
parameters. It calculates probabilities for the hidden values given what we
can see.

Maximization (M) Step: This step updates the parameters by finding the
values that maximize the likelihood, based on the estimates from the E-step.
It often involves running optimization methods to get the best parameters.

Convergence: Convergence happens when the algorithm has reached a
stable point. This is checked by seeing if the changes in the model’s
parameters or the log-likelihood are small enough to stop the process.

How Expectation-Maximization (EM) Algorithm Works:


So far, we’ve discussed the key terms in the EM algorithm. Now, let’s dive
into how the EM algorithm works. Here’s a step-by-step breakdown of the
process:

[Figure: EM algorithm flowchart]

Initialization:

The algorithm starts with initial parameter values and assumes the observed
data comes from a specific model.

E-Step (Expectation Step):

Estimate the missing or hidden data based on the current parameters.

Calculate the posterior probability (responsibility) of each latent variable
given the observed data.

Compute the expected log-likelihood of the data under these responsibilities,
using the current parameter estimates.
M-Step (Maximization Step):

Update the model parameters by maximizing the expected log-likelihood
computed in the E-step.

This involves solving an optimization problem to find parameter values that
improve the model fit.

Convergence:

Check if the model parameters are stable (converging).

If the changes in log-likelihood or parameters are below a set threshold, stop.
If not, repeat the E-step and M-step until convergence is reached.
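The flow above (initialization, E-step, M-step, convergence check) can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under assumptions: the data points, the two-component setup, the starting parameters, and the helper names (gaussian_pdf, log_likelihood, em_gmm_1d) are all invented for the example, and the code has no safeguard against a variance collapsing to zero.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log-likelihood of the observed data under the mixture."""
    total = 0.0
    for x in data:
        mix = sum(w * gaussian_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances))
        total += math.log(mix)
    return total


def em_gmm_1d(data, weights, means, variances, tol=1e-8, max_iter=200):
    weights, means, variances = list(weights), list(means), list(variances)
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(max_iter):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate parameters from the responsibilities
        for j in range(len(means)):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj
        # Convergence check on the change in log-likelihood
        history.append(log_likelihood(data, weights, means, variances))
        if history[-1] - history[-2] < tol:
            break
    return weights, means, variances, history


# Hypothetical data: two well-separated groups around 1 and 5
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, history = em_gmm_1d(
    data, weights=[0.5, 0.5], means=[0.0, 6.0], variances=[1.0, 1.0])
print(means, weights)
```

On this toy data the estimated means settle near the centres of the two groups, and the recorded log-likelihood values do not decrease from one iteration to the next.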

3. Artificial Neural Networks

As you read this article, which organ in your body is thinking about it? It’s the
brain, of course! But do you know how the brain works? Well, it has neurons
or nerve cells that are the primary units of both the brain and the nervous
system. These neurons receive sensory input from the outside world, which
they process and then provide the output, which might act as the input to
the next neuron.

Each of these neurons is connected to other neurons in complex
arrangements at synapses. Now, are you wondering how this is related to
Artificial Neural Networks? Let’s check out what they are in detail and how
they learn information.

Artificial Neural Networks

Artificial Neural Networks contain artificial neurons, which are called units.
These units are arranged in a series of layers that together constitute the
whole Artificial Neural Network in a system. A layer can have only a dozen
units or millions of units, depending on how complex the network needs to be
to learn the hidden patterns in the dataset.
Commonly, an Artificial Neural Network has an input layer, an output layer,
as well as hidden layers. The input layer receives data from the outside
world, which the neural network needs to analyze or learn about. Then, this
data passes through one or multiple hidden layers that transform the input
into data that is valuable for the output layer. Finally, the output layer
provides an output in the form of a response of the Artificial Neural Networks
to the input data provided.

In the majority of neural networks, units are interconnected from one layer to
another. Each of these connections has weights that determine the influence
of one unit on another unit. As the data transfers from one unit to another,
the neural network learns more and more about the data, which eventually
results in an output from the output layer.

The structures and operations of human neurons serve as the basis for
artificial neural networks, which are also known as neural networks or neural nets.
The input layer of an artificial neural network is the first layer, and it receives
input from external sources and passes it to the hidden layer, which is the
second layer. In the hidden layer, each neuron receives input from the
previous layer's neurons, computes the weighted sum, and sends it to the
neurons in the next layer. The connections are weighted, meaning that the effect
of each input from the previous layer is scaled up or down by its own weight.
These weights are adjusted during the training process to improve model
performance.
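The weighted-sum step described above can be sketched as follows. The function name layer_forward and all weight values are hypothetical, chosen only to make the arithmetic easy to follow; no activation function is applied, so this shows the weighted sums alone.

```python
def layer_forward(inputs, weights):
    """Return one weighted sum per unit: sum_i weights[j][i] * inputs[i]."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit
x = [1.0, 2.0]
hidden_weights = [[0.5, -1.0],   # weights into hidden unit 1
                  [0.25, 0.75]]  # weights into hidden unit 2
output_weights = [[1.0, 2.0]]    # weights into the single output unit

hidden = layer_forward(x, hidden_weights)       # [-1.5, 1.75]
output = layer_forward(hidden, output_weights)  # [2.0]
print(hidden, output)
```

Each row of a weight list holds the weights of the connections entering one unit, so changing a single number changes how strongly one unit influences another.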

4. Biological Neuron (Simple Explanation):

A biological neuron is a specialized cell in the human body (especially in the
brain and nervous system) that is designed to receive, process, and transmit
information through electrical and chemical signals.
It is the basic building block of the brain, and it inspired the design of artificial
neurons in machine learning models like neural networks.

Structure of a Biological Neuron:

Dendrites:

These are branch-like structures.

They receive signals (electrical or chemical messages) from other neurons.

You can think of dendrites as the “input channels” of the neuron.

Cell Body (Soma):

This is the main part of the neuron.

It processes the incoming signals received from the dendrites.

If the total incoming signal is strong enough (reaches a threshold), the
neuron becomes activated and sends out a signal.

Axon:

This is a long, thin structure extending from the cell body.

Once the neuron is activated, the axon carries the output signal away from
the neuron.

Think of it like a “wire” that carries the processed message forward.

Axon Terminals (Synaptic Terminals):

The axon ends in smaller branches called terminals.

These terminals pass the signal to the next neuron through connections
called synapses.

Synapse:

It’s the gap between the axon terminal of one neuron and the dendrites of
the next neuron.
Signals cross the synapse using chemicals called neurotransmitters.

5. Difference between Artificial Neural Network and Biological Neural Network

Feature | Biological Neural Network | Artificial Neural Network (ANN)
--------|---------------------------|--------------------------------
Origin | Found naturally in human and animal brains | Created by humans using algorithms and computers
Basic Unit | Biological neuron | Artificial neuron (or node)
Signal Type | Electrochemical signals | Numerical values (real numbers)
Processing | Happens chemically and electrically inside the brain | Happens mathematically using calculations
Speed | Relatively slow (milliseconds) | Very fast (nanoseconds to microseconds)
Learning Mechanism | Adjusts through synaptic plasticity (chemical changes) | Adjusts through weight updates (e.g., backpropagation)
Structure | Complex, highly interconnected, dynamic | Simplified, usually layered (input, hidden, output)
Adaptability | Highly adaptable and self-organizing | Needs explicit programming and training
Fault Tolerance | Highly fault-tolerant (can still work with damaged neurons) | Less fault-tolerant; performance degrades if nodes fail
Energy Consumption | Very low (brain uses about 20 watts) | High (requires significant computational power)
Components | Dendrites, soma (cell body), axon, synapse | Inputs, weighted sum, activation function, output
Learning Examples | Learning through experience, memory, environment | Learning through algorithms like supervised learning, unsupervised learning

BOSSSPACEF

6. Neural Network Architecture


Neural network architecture refers to the structure and design of a neural
network. It defines how neurons (also called nodes) are organized in layers
and how these neurons are connected. Typically, a neural network has three
types of layers: an input layer, one or more hidden layers, and an output
layer. Each neuron in a layer can be connected to neurons in the next layer.
The architecture also specifies how data flows through the network and how
learning happens during training.

Single Layer Feedforward Network


A single layer feedforward network is the simplest type of neural network. It
has only two layers: an input layer and an output layer. The input layer feeds
the data into the output layer directly without any hidden layers in between.
The connections between the neurons move only in one direction, from the
input to the output. There are no loops or cycles. It is mainly used for simple
pattern recognition tasks where the relationship between inputs and outputs
is not very complex.

Multilayer Feedforward Network


A multilayer feedforward network has three or more layers: an input layer,
one or more hidden layers, and an output layer. Each neuron in one layer is
connected to every neuron in the next layer. Data flows strictly in one
direction from the input layer to the output layer through the hidden layers.
There are no cycles or feedback connections. Multilayer feedforward
networks can learn more complex patterns compared to single layer
networks and are trained using algorithms like backpropagation.

Single Node with Its Own Feedback


A single node with its own feedback is a very simple type of recurrent neural
network. It consists of just one neuron that sends its output back to itself as
an input in the next time step. This feedback loop allows the node to have a
memory of its previous output. This type of network is useful for simple time-
dependent tasks where past information needs to be considered along with
current input.
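A single node with its own feedback can be sketched as one unit whose previous output is fed back in at the next time step. The weights, the input sequence, and the use of tanh as the activation are assumptions made for this illustration.

```python
import math


def run_feedback_node(inputs, w_in, w_fb, y0=0.0):
    """Compute y_t = tanh(w_in * x_t + w_fb * y_(t-1)) for each time step."""
    y = y0
    outputs = []
    for x in inputs:
        y = math.tanh(w_in * x + w_fb * y)  # previous output feeds back in
        outputs.append(y)
    return outputs


# Hypothetical input: a single pulse followed by silence
pulse = [1.0, 0.0, 0.0]
with_memory = run_feedback_node(pulse, w_in=1.0, w_fb=0.5)
no_memory = run_feedback_node(pulse, w_in=1.0, w_fb=0.0)
print(with_memory)
print(no_memory)
```

With the feedback weight set to zero the output drops to zero as soon as the input does; with a non-zero feedback weight the node keeps a fading trace of the earlier pulse, which is the "memory" described above.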

Single Layer Recurrent Network


A single layer recurrent network has one layer of neurons where the neurons
are interconnected with feedback loops. Each neuron can send its output
back to itself or to other neurons in the same layer. This feedback allows the
network to maintain information over time, making it useful for tasks like
sequence prediction or time series analysis. Unlike feedforward networks, the
presence of loops means the network can process temporal information.

Multilayer Recurrent Network


A multilayer recurrent network consists of multiple layers of neurons, where
each layer can have recurrent connections. In addition to forward
connections from one layer to the next, neurons in the same layer or across
different layers can send feedback to previous neurons. This structure allows
the network to learn complex temporal patterns and dependencies.
Multilayer recurrent networks are used in advanced tasks like speech
recognition, language modeling, and video processing where understanding
sequences and long-term dependencies is important.

The five architectures fall into two broad categories:

- Feedforward Networks (no loops)
- Recurrent Networks (with feedback loops)

The five types are:

1. Single Layer Feedforward Network: input goes straight to output,
with only one layer of connections between input and output.
2. Multilayer Feedforward Network: input passes through one or
more hidden layers before reaching output.
3. Single Node with Its Own Feedback: just one neuron sending
feedback to itself (smallest recurrent network).
4. Single Layer Recurrent Network: one full layer of neurons with
feedback among themselves.
5. Multilayer Recurrent Network: multiple layers where neurons can
send feedback to earlier layers or to themselves.

Quick memory trick:

Feedforward (no feedback):
- Single Layer Feedforward
- Multilayer Feedforward

Recurrent (with feedback):
- Single Node Feedback
- Single Layer Recurrent
- Multilayer Recurrent

Or, in short:

Feedforward: Single, Multi | Recurrent: Single Node, Single Layer, Multi Layer

7. McCulloch-Pitts Neuron Architecture

The McCulloch-Pitts neuron model is one of the earliest and simplest models
of a biological neuron, proposed by Warren McCulloch and Walter Pitts in
1943. It is a mathematical model that represents a neuron as a simple device
that sums its inputs and produces an output based on a threshold. The
architecture consists of a set of input signals, a processing
unit, and an output. Each input to the neuron can either be 0 or 1,
representing the absence or presence of a signal. The inputs are fed into the
neuron, where they are summed. If the total sum of the inputs exceeds a
certain threshold value, the neuron “fires” and produces an output of 1.
Otherwise, it produces an output of 0. The model does not use adjustable
weights; all excitatory inputs are treated equally, and only the number of
active inputs matters in deciding whether the neuron fires or not. The
McCulloch-Pitts neuron is a binary model, meaning both its inputs and output
are binary (0 or 1). It is mainly used to model logical operations like AND,
OR, and NOT by setting appropriate thresholds and input combinations (NOT
additionally requires an inhibitory input). Although simple, this
model laid the foundation for the development of modern neural networks.
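The logic-gate use of the model can be sketched as below. Two conventions are assumed here and are not spelled out in the notes: the neuron fires when the sum of active inputs is greater than or equal to the threshold, and any active inhibitory input blocks firing outright. The function names are hypothetical.

```python
def mp_neuron(excitatory, threshold, inhibitory=()):
    """McCulloch-Pitts style unit with binary (0/1) inputs and output.

    Assumptions: fires when sum(excitatory) >= threshold, and any active
    inhibitory input forces the output to 0.
    """
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= threshold else 0


def AND(a, b):
    return mp_neuron([a, b], threshold=2)   # both inputs must be active


def OR(a, b):
    return mp_neuron([a, b], threshold=1)   # one active input is enough


def NOT(a):
    # No excitatory inputs and threshold 0: fires unless inhibited by a
    return mp_neuron([], threshold=0, inhibitory=[a])


for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))
```

Only the threshold differs between AND and OR, which is the sense in which the model realizes logic functions by choosing thresholds rather than learning weights.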

8. Activation Function

An activation function is a mathematical function used in neural networks to
decide whether a neuron should be activated or not. It is applied to the
output of each neuron after calculating the weighted sum of its inputs and
bias. The main role of the activation function is to introduce non-linearity into
the network, enabling it to learn complex data patterns and relationships.
Without activation functions, a neural network would simply behave like a
linear regression model, regardless of the number of layers, and would not
be able to solve problems that require complex mapping like image
recognition, natural language processing, etc.

When a neuron receives input data, it first computes a weighted sum of the
inputs and adds a bias term. This computed value is then passed through the
activation function to determine the neuron’s output. Depending on the
activation function used, the output can vary, which influences how the
network learns and makes predictions.

The different activation functions are:

Linear Activation Function:

The formula is f(x) = x.

The output is directly proportional to the input.

It behaves like a straight line passing through the origin.

However, it does not introduce any non-linearity, making it ineffective for
complex learning.

Binary Step Activation Function:

The formula is

f(x) = 1 if x ≥ 0

f(x) = 0 if x < 0

The output is either 0 or 1.

It is used for binary classification problems where the decision is yes/no.

However, it is not differentiable at x = 0 and its derivative is zero everywhere
else, which makes it unsuitable for backpropagation.

Bipolar Step Activation Function:

The formula is

f(x) = 1 if x ≥ 0

f(x) = -1 if x < 0

The output is either +1 or -1.

It is similar to the binary step function but swings between -1 and 1 instead
of 0 and 1.

Sigmoidal Activation Function (General Sigmoid):

The formula is

f(x) = 1 / (1 + e^(-x))

The output is between 0 and 1.

It is smooth, continuous, and differentiable, making it good for
backpropagation.

However, it can suffer from the vanishing gradient problem when inputs are
very high or very low.

Binary Sigmoidal Activation Function:

It uses the same formula as the general sigmoid function:

f(x) = 1 / (1 + e^(-x))

The output is between 0 and 1.

It is specifically termed “binary” because the output lies between 0 and 1,
making it suitable for binary classification.

Bipolar Sigmoidal Activation Function:

The formula is

f(x) = (2 / (1 + e^(-x))) - 1

The output lies between -1 and 1. This function is equal to tanh(x/2).

It is useful when outputs need to swing between negative and positive
values.

Ramp Activation Function:

The formula is

f(x) = 0 if x < 0

f(x) = x if 0 ≤ x ≤ 1

f(x) = 1 if x > 1

It increases linearly between 0 and 1, and the output is capped at 0 below and at 1 above.

It behaves like a ramp: flat at 0, linearly increasing, and flat again at 1.

Summary of output ranges:

Linear activation: Output range is from -∞ to +∞.

Binary step activation: Output is 0 or 1.


Bipolar step activation: Output is -1 or 1.

Sigmoidal (binary sigmoid) activation: Output is between 0 and 1.

Bipolar sigmoidal activation: Output is between -1 and 1.

Ramp activation: Output is between 0 and 1.
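The functions listed above translate directly into code. The sketch below follows the formulas as given in these notes; the function names are chosen for the example, and the sigmoid versions are only meant for moderate input values (math.exp can overflow for very large negative x).

```python
import math


def linear(x):
    return x


def binary_step(x):
    return 1 if x >= 0 else 0


def bipolar_step(x):
    return 1 if x >= 0 else -1


def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bipolar_sigmoid(x):
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def ramp(x):
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, binary_step(x), bipolar_step(x),
          round(binary_sigmoid(x), 4), round(bipolar_sigmoid(x), 4), ramp(x))
```

Printing the values side by side makes the output ranges in the summary easy to check: the step functions only ever take two values, the sigmoids stay strictly inside their ranges, and the ramp is clipped to the interval from 0 to 1.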

Binary Activation Function:

A binary activation function outputs only two possible values: 0 or 1. It is
used to make a hard decision about whether a neuron should “fire” (activate)
or not. If the input is greater than or equal to a certain threshold (usually 0),
the output is 1; otherwise, the output is 0.

Example:

If input ≥ 0, output = 1

If input < 0, output = 0

It is used for simple binary classification problems where only two classes
(yes/no, true/false) exist.

Bipolar Activation Function:

A bipolar activation function also gives two possible outputs, but instead of 0
and 1, it gives -1 and 1. This is useful when the model needs to handle both
positive and negative activations.

Example:

If input ≥ 0, output = 1

If input < 0, output = -1

It is similar to the binary step function but swings between -1 and 1, making
it useful for certain neural networks that benefit from having negative
outputs.
Continuous Activation Function:

A continuous activation function provides a smooth and gradual change in
output as the input changes. Unlike binary or bipolar step functions that
jump suddenly from one value to another, continuous functions like the
sigmoid, tanh, or ReLU (Rectified Linear Unit) change their output without jumps.

Example:

Sigmoid function: f(x) = 1 / (1 + e^(-x))

As the input varies, the output changes smoothly between 0 and 1 without any
sudden jumps.

Continuous activation functions are essential for training deep networks
because they are differentiable (ReLU everywhere except at x = 0), allowing the
use of gradient descent and backpropagation algorithms.

Ramp Activation Function:

The ramp function is a piecewise linear function that acts like a ramp or
slope. It behaves as follows:

If the input is less than 0, the output is 0.

If the input is between 0 and 1, the output is equal to the input itself.

If the input is greater than 1, the output stays at 1 (capped at 1).

Thus, the output gradually increases with the input between 0 and 1 but
becomes constant at 1 after that.
Ramp functions combine both linearity (increasing part) and saturation
(constant part), making them useful in situations where you want outputs
within a limited range.

9. Delta Learning Rule (LMS, Widrow-Hoff)

The Delta Learning Rule (LMS - Widrow-Hoff Rule)

The Delta Learning Rule, also known as the Least Mean Squares (LMS) rule or
Widrow-Hoff rule, is a widely used supervised learning rule for training
artificial neural networks. It is particularly effective for single-layer neural
networks and is the foundation of many neural network algorithms. The
primary objective of this rule is to adjust the weights of the network during
the training process to minimize the error between the actual output and the
desired output.

In essence, the Delta Learning Rule works by adjusting the weights of the
neurons based on the difference between the desired output (d) and the
actual output (y) of the neuron. This difference is referred to as the error, and
the weight updates are made proportional to this error. The idea is that the
weights of the neuron are updated in such a way that the network’s output is
gradually brought closer to the desired target.
Mathematical Formulation:

The core mathematical principle behind the Delta Learning Rule is the weight
update equation, which specifies how to adjust the weights based on the
input, the error, and the learning rate.

The weight update formula is:

Δw_i = η (d - y) x_i

Where:
Δw_i = the change in weight for the i-th input
η = learning rate (a small positive constant that controls the magnitude of
the weight update)
d = desired output (target output)
y = actual output produced by the neuron after applying the activation
function
x_i = input value for the i-th feature

The weight update rule essentially adjusts each weight in proportion to the
error (d - y) and the corresponding input value x_i. After the weights are
updated, the new weight w_i^(new) becomes:

w_i^(new) = w_i^(old) + Δw_i

How the Learning Process Works:

1. Input Presentation: During training, an input pattern x = (x_1, x_2, ..., x_n)
is presented to the network.
2. Weighted Sum: The weighted sum of the inputs is computed. This sum
is passed through the activation function to compute the actual output
y of the neuron:

y = f(∑_{i=1}^{n} w_i x_i)

where f is typically an activation function such as sigmoid, tanh, or even a
linear function.

3. Error Calculation: The error e = d - y is computed by subtracting the
actual output y from the desired output d.
4. Weight Update: The weights are updated based on the error. The
update is proportional to the input value and the error, with a constant
learning rate η.
5. Convergence: This process is repeated for multiple input-output pairs
(training examples). The weight adjustments continue until the error
across all examples is sufficiently small or until a predefined number of
training iterations (epochs) is completed.
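The five steps above can be sketched for a single neuron. This is a minimal illustration under assumptions: a linear activation f(net) = net is used, the function name train_delta is invented, and the three training pairs are hypothetical data generated from d = 2*x_1 - x_2.

```python
def train_delta(samples, n_inputs, eta=0.1, epochs=200):
    """Train one linear neuron with the update w_i <- w_i + eta * (d - y) * x_i."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, d in samples:
            # Steps 1-2: present the input and compute y = f(sum w_i x_i), f linear
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Step 3: error between desired and actual output
            err = d - y
            # Step 4: move each weight in proportion to error * input
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w


# Hypothetical training pairs (input vector, desired output)
samples = [([1.0, 0.0], 2.0),
           ([0.0, 1.0], -1.0),
           ([1.0, 1.0], 1.0)]
w = train_delta(samples, n_inputs=2, eta=0.1, epochs=200)
print(w)
```

On this toy data the weights approach 2 and -1, the values used to generate the targets. A fixed epoch count stands in for the stopping condition; checking the total error against a threshold would work as well.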

Flowchart

START
↓
Initialize Weights
↓
Present Input x
↓
Compute Output y
↓
Compute Error (d - y)
↓
Update Weights
↓
More Patterns?
→ YES: Repeat from "Present Input x"
→ NO: Check Stopping Condition
↓
STOP

10. Backpropagation in Neural Network

Backpropagation, also known as "Backward Propagation of Errors", is a method
used to train neural networks. Its goal is to reduce the difference between
the model’s predicted output and the actual output by adjusting the weights
and biases in the network.
It is used in deep learning to train artificial neural networks, particularly
feed-forward networks, and works iteratively to adjust weights and biases to
minimize the cost function.
In each epoch the model adapts these parameters reducing loss by following
the error gradient. Backpropagation often uses optimization algorithms like
gradient descent or stochastic gradient descent. The algorithm computes the
gradient using the chain rule from calculus allowing it to effectively navigate
complex layers in the neural network to minimize the cost function.

Working of Backpropagation Algorithm

The Backpropagation algorithm involves two main steps: the Forward Pass
and the Backward Pass.

How Does Forward Pass Work?

In the forward pass, the input data is fed into the input layer. These inputs,
combined with their respective weights, are passed to the hidden layers. For
example, in a network with two hidden layers (h1 and h2 as shown in Fig. (a)),
the output from h1 serves as the input to h2. Before applying an activation
function, a bias is added to the weighted inputs.

Each hidden layer computes the weighted sum (`a`) of the inputs then
applies an activation function like ReLU (Rectified Linear Unit) to obtain the
output (`o`). The output is passed to the next layer where an activation
function such as softmax converts the weighted outputs into probabilities for
classification.

How Does the Backward Pass Work?

In the backward pass, the error (the difference between the predicted and
actual output) is propagated back through the network to adjust the weights
and biases. One common method for error calculation is the Mean Squared
Error (MSE), given by:

MSE = (1/n) ∑ (Predicted Output - Actual Output)^2

where n is the number of values being averaged over. For a single output value
this reduces to the squared difference.

Once the error is calculated the network adjusts weights using gradients
which are computed with the chain rule. These gradients indicate how much
each weight and bias should be adjusted to minimize the error in the next
iteration. The backward pass continues layer by layer ensuring that the
network learns and improves its performance. The activation function
through its derivative plays a crucial role in computing these gradients
during backpropagation.
Or Read the answer below

Error Back Propagation Algorithm (EBPA) Overview:

The Error Back Propagation Algorithm is a supervised learning technique
used to train multilayer neural networks, especially multilayer perceptrons
(MLPs). It works by minimizing the error between the network’s predicted
output and the actual desired output. The key idea is to propagate the error
backward through the network and adjust the weights accordingly to reduce
the error.

The Error Back Propagation concept consists of two main phases:

In the Forward Pass, input patterns are fed into the input layer. The data is
passed through the hidden layer(s) and output layer. At each neuron, the
weighted sum of inputs is calculated, passed through an activation function
such as sigmoid, tanh, or ReLU to produce output. The output at the final
layer is compared with the desired output to compute the error.

In the Backward Pass, the error at the output layer is propagated back through the
network. The algorithm calculates how much each weight contributed to the
error using partial derivatives, known as gradients. Weights are updated in
proportion to the negative gradient of the error with respect to each weight
using the Gradient Descent technique. This process ensures that the network
learns from its mistakes by adjusting the weights in the right direction to
reduce the error for future inputs.
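
The two phases can be put together in a small training loop. The sketch below assumes one hidden layer with sigmoid activations, MSE as the error, full-batch gradient descent, and the XOR truth table as toy data; the learning rate, layer size, and variable names are illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data (assumption): the XOR truth table.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # hidden -> output
lr = 0.5                                        # learning rate (assumed)
losses = []

for _ in range(2000):
    # Forward pass: weighted sums followed by the activation function.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    err = out - y
    losses.append(float(np.mean(err ** 2)))     # MSE

    # Backward pass: chain rule, using sigmoid'(a) = s * (1 - s).
    d_out = (2.0 / len(X)) * err * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)

    # Gradient descent: step against the gradient of the error.
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(losses[0], losses[-1])
```

The error at the output (`d_out`) is computed first and then pushed back to the hidden layer (`d_h`), which is the "backward" direction the name refers to.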

11. Curse of Dimensionality and Principal Component Analysis

 Curse of Dimensionality:

The curse of dimensionality refers to the various problems that arise when
analyzing and organizing data in high-dimensional spaces. As the number of
dimensions increases, the volume of the space increases exponentially, and
data points become increasingly sparse. This sparsity makes it difficult to
compute meaningful distances or find patterns in the data, which negatively
affects the performance of machine learning algorithms. It also increases
computational complexity and the risk of overfitting.

When a dataset has too many features or dimensions, it causes several
issues. These include overfitting (the model fits the training data too closely
but fails on new data), slower computation, and decreased model accuracy.
Because the available data is spread so thinly across the space, machine
learning algorithms struggle to find reliable patterns or groupings.

 Why Dimensionality Reduction is Needed:


In high-dimensional datasets, operations like clustering, classification, or
visualization become complex and inefficient. To address these issues,
dimensionality reduction techniques are used to simplify the data while
losing as little important information as possible.

 Principal Component Analysis (PCA):


PCA is one of the most popular dimensionality reduction techniques. It was
introduced by Karl Pearson in 1901. PCA reduces the number of features by
transforming the original features into a new set of features called principal
components. These components are selected in such a way that they capture
the maximum variance (information) from the original data.

 Step 1 – Standardize the Data:


All features must be standardized to the same scale. This is done by
subtracting the mean and dividing by the standard deviation for each
feature. This ensures that features with large ranges (like income) don’t
dominate those with small ranges (like age).

 Step 2 – Compute the Covariance Matrix:

After standardization, a covariance matrix is computed to understand how
features vary together. If two features increase together, they have positive
covariance; if one increases while the other decreases, they have negative
covariance; and if there's no relationship, the covariance is close to zero.

 Step 3 – Compute Eigenvectors and Eigenvalues:


PCA calculates eigenvectors and eigenvalues from the covariance matrix.
The eigenvectors define the directions of the new feature space (principal
components), and the eigenvalues define their importance (how much
variance each component captures).

 Step 4 – Select Principal Components:


The eigenvectors are sorted in descending order based on their
corresponding eigenvalues. The top k components (those that together capture
most of the variance; a cumulative threshold such as 95% is a common choice)
are selected to form the new feature space.

 Step 5 – Transform the Data:


The original data is projected onto the selected principal components. This
results in a new dataset with fewer dimensions, but it still retains the most
significant information from the original data.
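
The five steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `pca`, the `variance_to_keep` parameter, and the synthetic three-feature dataset (where the third feature is almost the sum of the first two) are made up for the example.

```python
import numpy as np

def pca(X, variance_to_keep=0.95):
    """Return (projected data, sorted eigenvalues, number of components kept)."""
    # Step 1: standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized features.
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigen-decomposition (eigh is intended for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort by descending eigenvalue and keep the smallest k whose
    # cumulative share of the variance reaches the threshold.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, variance_to_keep)) + 1, len(eigvals))
    # Step 5: project the standardized data onto the top-k components.
    return Z @ eigvecs[:, :k], eigvals, k

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)   # nearly redundant feature
X = np.column_stack([x1, x2, x3])

projected, eigvals, k = pca(X)
print(k, projected.shape)
```

Because the third feature carries almost no new information, two components are enough to pass the 95% threshold in this toy dataset.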

 Key Points about PCA:


PCA is an unsupervised learning technique, meaning it doesn’t use output
labels. It is mainly used in exploratory data analysis, data visualization, noise
reduction, and to speed up machine learning models by reducing the number
of input features.

Here's a simple and clear explanation you can write if you're asked to explain
Dimensionality Reduction Technique in an exam:

Dimensionality reduction is a technique used to reduce the number of
input features or variables in a dataset while preserving as much important
information as possible. When datasets have too many features, it can lead
to problems like overfitting, increased computation time, and poor model
performance. This situation is known as the curse of dimensionality.

Dimensionality reduction helps in:


1. Simplifying models.
2. Reducing storage and processing requirements.
3. Improving the performance and accuracy of machine learning models.

There are two main types of dimensionality reduction:

 Feature selection, where only the most important features are
chosen and the rest are removed.
 Feature extraction, where new features are created from existing
ones to capture most of the useful information.

One of the most commonly used dimensionality reduction techniques is
Principal Component Analysis (PCA), which transforms the data into a
new coordinate system with fewer dimensions, while keeping the maximum
variance (i.e., important information) in the data.

12. Feature Selection Methods for Dimensionality Reduction

Feature selection is the process of selecting the most relevant features from
the dataset and removing the irrelevant or redundant ones. This helps
reduce the dimensionality of the data, improves model accuracy, reduces
overfitting, and makes computation faster.

There are three main types of feature selection methods:

Filter Methods

These methods evaluate the importance of each feature based on statistical
measures, independently of any machine learning model. They do not use
the model’s performance as a basis for selection. Instead, features are
ranked by score, and the top ones are selected. Common statistical
techniques used include correlation coefficient, chi-square test, and mutual
information. For example, if a feature has very low correlation with the
output, it may be removed using this method. Filter methods are fast and
suitable for high-dimensional datasets.
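
As a rough illustration of a filter method, the sketch below ranks features by the absolute value of their Pearson correlation with the target. The helper name `rank_by_correlation` and the synthetic data (feature 1 is pure noise by construction) are assumptions made for the example.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Score each feature by |Pearson correlation| with the target."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    order = np.argsort(scores)[::-1]   # best-scoring feature first
    return order, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Hypothetical target: depends on features 0 and 2, not on feature 1.
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

order, scores = rank_by_correlation(X, y)
print(order, scores.round(3))
```

No model is trained here; the ranking depends only on a statistic of the data, which is what makes filter methods fast.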

Wrapper Methods

Wrapper methods evaluate subsets of features using a machine learning
model and select the best subset based on model performance. These
methods are usually more accurate than filter methods but take more time to
compute. In wrapper methods, features are selected based on how well they
help the model perform. Techniques include forward selection, where
features are added one at a time; backward elimination, where features are
removed one at a time; and recursive feature elimination, where the least
important features are removed recursively. Since they involve training
models multiple times, they are computationally expensive.
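
Forward selection can be sketched as a greedy loop that refits a model for every candidate feature. The example below assumes ordinary least squares as the model and validation MSE as the performance measure; the helper names and the synthetic data are illustrative only.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares on the chosen columns and score on held-out data."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_va @ coef - y_va) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, max_features):
    selected, remaining = [], list(range(X_tr.shape[1]))
    best_so_far = np.inf
    while remaining and len(selected) < max_features:
        # Train one model per candidate feature added to the current subset.
        scores = {j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining}
        best = min(scores, key=scores.get)
        if scores[best] >= best_so_far:   # stop when nothing improves
            break
        best_so_far = scores[best]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Hypothetical target: only features 0 and 3 matter.
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected = forward_selection(X[:150], y[:150], X[150:], y[150:], max_features=2)
print(selected)
```

Each round trains as many models as there are remaining features, which shows why wrapper methods cost more than filter methods.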

Embedded Methods

These methods perform feature selection during the model training process.
They combine the speed of filter methods with the accuracy of wrapper
methods. A common technique is LASSO (Least Absolute Shrinkage and
Selection Operator), which uses L1 regularization to shrink some coefficients
to zero, effectively removing those features. Decision tree-based models like
Random Forests also rank feature importance during training and are often
used in embedded feature selection. These methods are efficient and model-
driven.

In summary, filter methods are fast but less accurate, wrapper methods are
accurate but slow, and embedded methods offer a balance between speed
and performance. Choosing the right method depends on the size of the
dataset, available computational resources, and the goal of the analysis.

13. Steps in Developing a Machine Learning Application and Define Machine Learning

Machine learning is a branch of artificial intelligence that enables
algorithms to uncover hidden patterns within datasets. It allows
them to make predictions on new, similar data without explicit
programming for each task.

Comprehensive Guide to Building a Machine Learning Model

Building a machine learning model involves several steps, from data
collection to model deployment. Here’s a structured guide to help you
through the process:

Step 1: Data Collection for Machine Learning

Data collection is a crucial step in the creation of a machine learning model,
as it lays the foundation for building accurate models. In this phase of
machine learning model development, relevant data is gathered from various
sources to train the machine learning model and enable it to make accurate
predictions. The first step in data collection is defining the problem and
understanding the requirements of the machine learning project. This usually
involves determining the type of data we need for our project like structured
or unstructured data, and identifying potential sources for gathering data.

Once the requirements are finalized, data can be collected from a variety of
sources such as databases, APIs, web scraping, and manual data entry. It is
crucial to ensure that the collected data is both relevant and accurate, as the
quality of the data directly impacts the generalization ability of our machine
learning model. In other words, the better the quality of the data, the better
the performance and reliability of our model in making predictions or
decisions.

Step 2: Data Preprocessing and Cleaning

Preprocessing and preparing data is an important step that involves
transforming raw data into a format that is suitable for training and testing
our models. This phase aims to clean the data (i.e., remove null values and
garbage values) and to normalize it, so that our machine learning models
achieve greater accuracy and performance.

As Clive Humby said, “Data is the new oil. It’s valuable, but if unrefined it
cannot be used.” This quote emphasizes the importance of refining data
before using it for analysis or modeling. Just like oil needs to be refined to
unlock its full potential, raw data must undergo preprocessing to enable its
effective utilization in ML tasks. Preprocessing typically involves
several steps, including handling missing values, encoding categorical
variables (i.e., converting them into numerical form), scaling numerical
features, and feature engineering. These steps help the model perform well
and generalize to unseen data, which in turn leads to more accurate
predictions.

Step 3: Selecting the Right Machine Learning Model

Selecting the right machine learning model plays a pivotal role in building a
successful model. With numerous algorithms and techniques readily
available, choosing the most suitable model for a given problem
significantly impacts the accuracy and performance of the result.

The process of selecting the right machine learning model involves several
considerations, some of which are:

Firstly, understanding the nature of the problem is essential. The task may
be classification, regression, clustering, or something else, and different
types of problems require different algorithms to build a predictive model.

Secondly, familiarizing yourself with a variety of machine learning algorithms
suitable for your problem type is crucial. Evaluate the complexity of each
algorithm and its interpretability. More complex models, such as deep learning
models, may improve performance but are harder to interpret.

Step 4: Training Your Machine Learning Model

In this phase of building a machine learning model, we have all the
necessary ingredients to train our model effectively. This involves utilizing
our prepared data to teach the model to recognize patterns and make
predictions based on the input features. During the training process, we
begin by feeding the preprocessed data into the selected machine-learning
algorithm. The algorithm then iteratively adjusts its internal parameters to
minimize the difference between its predictions and the actual target values
in the training data. This optimization process often employs techniques like
gradient descent.

As the model learns from the training data, it gradually improves its ability to
generalize to new or unseen data. This iterative learning process enables the
model to become more adept at making accurate predictions across a wide
range of scenarios.

Step 5: Evaluating Model Performance

Once you have trained your model, it’s time to assess its performance. There
are various metrics used to evaluate model performance, categorized based
on the type of task: regression/numerical or classification.

For regression tasks, common evaluation metrics are:

Mean Absolute Error (MAE): MAE is the average of the absolute differences
between predicted and actual values.

Mean Squared Error (MSE): MSE is the average of the squared differences
between predicted and actual values.

Root Mean Squared Error (RMSE): The square root of the MSE, which expresses
the typical error magnitude in the same units as the target.

R-squared (R²): It is the proportion of the variance in the dependent variable
that is predictable from the independent variables.
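
The four regression metrics above can be computed directly from their definitions. The sketch below uses only the Python standard library; the function name `regression_metrics` and the four sample values are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                     # R-squared
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

For these sample values the errors are -1, 0, 1, 0, so MAE and MSE are both 0.5 and R² is 1 - 2/20 = 0.9.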

For classification tasks, common evaluation metrics are:

Accuracy: Proportion of correctly classified instances out of the total
instances.

Precision: Proportion of true positive predictions among all positive
predictions.

Recall: Proportion of true positive predictions among all actual positive
instances.

F1-score: Harmonic mean of precision and recall, providing a balanced
measure of model performance.

Area Under the Receiver Operating Characteristic curve (AUC-ROC): Measure
of the model’s ability to distinguish between classes.

Confusion Matrix: A matrix that summarizes the performance of a
classification model, showing counts of true positive, true negative, false
positive, and false negative instances.
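
For a binary problem, accuracy, precision, recall, and F1-score all follow from the four confusion-matrix counts. The sketch below is a plain-Python illustration; the function names and the eight sample labels are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = classification_metrics(y_true, y_pred)
print(counts, scores)
```

Here the counts are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four scores come out to 0.75.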

Step 6: Tuning and Optimizing Your Model

Once the model has been trained, the next step is to optimize it further.
Tuning and optimizing helps the model maximize its performance and
generalization ability. This process involves fine-tuning hyperparameters,
selecting the best algorithm, and improving features through feature
engineering techniques. Hyperparameters are parameters that are set before
the training process begins and control the behavior of the machine learning
model. Examples include the learning rate and the regularization strength,
and they should be adjusted carefully.

Techniques such as grid search, randomized search, and cross-validation are
used to systematically explore the hyperparameter space and identify the best
combination of hyperparameters for the model. Overall, tuning and optimizing
the model involves a combination of careful selection of hyperparameters,
feature engineering, and other techniques to create a model that generalizes
well.
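
A grid search with k-fold cross-validation can be sketched without any ML library. The example below assumes ridge regression as the model and its regularization strength `alpha` as the single hyperparameter; the grid values, fold count, helper names, and synthetic data are all illustrative.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge solution: (X^T X + alpha * I)^-1 X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, alpha, k=5):
    """Average validation MSE over k folds for one hyperparameter value."""
    indices = np.arange(len(X))
    scores = []
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)      # everything not in this fold
        w = ridge_fit(X[train], y[train], alpha)
        scores.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.0, 0.0, 0.5])    # hypothetical coefficients
y = X @ true_w + rng.normal(scale=0.5, size=100)

# Grid search: evaluate every candidate and keep the best-scoring one.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
results = {alpha: cv_mse(X, y, alpha) for alpha in alphas}
best_alpha = min(results, key=results.get)
print(best_alpha, results)
```

Randomized search follows the same pattern but samples candidate values instead of enumerating a fixed grid.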

Step 7: Deploying the Model and Making Predictions

Deploying the model and making predictions is the final stage in the journey
of creating an ML model. Once a model has been trained and optimized, it is
time to integrate it into a production environment where it can provide real-time
predictions on new data.

During model deployment, it’s essential to ensure that the system can
handle high user loads, operate smoothly without crashes, and be easily
updated. Tools like Docker and Kubernetes help make this process easier by
packaging the model in a way that makes it easy to run on different
computers and manage efficiently. Once deployment is done our model is
ready to predict new data, which involves feeding unseen data into the
deployed model to enable real-time decision making.

14. Difference between Supervised and Unsupervised Learning

| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Definition | Learning from labeled data | Learning from unlabeled data |
| Data | Input has both features and labels | Input has only features, no labels |
| Objective | Predict target/output variable | Discover hidden patterns or structure |
| Types of Problems | Classification, Regression | Clustering, Dimensionality Reduction |
| Examples | Spam detection, House price prediction | Customer segmentation, Market basket analysis |
| Algorithms Used | Linear Regression, Decision Trees, SVM, Neural Networks | K-Means, Hierarchical Clustering, PCA |
| Accuracy Measurement | Accuracy, Precision, Recall, F1-score | Silhouette score, Davies–Bouldin index |
| Output | Predicted labels or values for new input | Groups/clusters or relationships in data |

15. Applications of SVD

| Application Area | Description |
|---|---|
| 🔍 Dimensionality Reduction (PCA) | SVD is used in Principal Component Analysis to reduce the number of features while preserving important variance. |
| 📥 Data Compression | In image compression, a truncated SVD approximates an image with fewer values (low-rank approximation). |
| 🎯 Latent Semantic Analysis (LSA) | In NLP, SVD helps identify relationships between words and documents. |
| 🧠 Recommender Systems | SVD is used to predict missing ratings in collaborative filtering (e.g., Netflix, Amazon recommendations). |
| 📊 Noise Reduction | By keeping only significant singular values, we can remove noise from data. |
| 🔍 Solving Linear Systems | SVD helps solve systems of linear equations, especially when the matrix is non-invertible or ill-conditioned. |
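
Several of the applications above (compression, noise reduction) rest on the same operation: keep only the largest singular values and rebuild the matrix. The sketch below illustrates that with NumPy; the helper name `low_rank_approx` and the synthetic rank-1 "signal plus noise" matrix are assumptions made for the example.

```python
import numpy as np

def low_rank_approx(A, k):
    """Rebuild A from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
signal = np.outer(rng.normal(size=20), rng.normal(size=15))   # rank-1 matrix
noisy = signal + 0.01 * rng.normal(size=(20, 15))             # add small noise

denoised = low_rank_approx(noisy, k=1)

# Storage: 20*15 = 300 values originally, versus 20 + 15 + 1 = 36 for rank 1.
print(np.linalg.norm(noisy - signal), np.linalg.norm(denoised - signal))
```

Dropping the small singular values removes most of the noise, so the rank-1 reconstruction ends up closer to the clean signal than the noisy input is.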

16. ✅ Support Vector Machine (SVM) – Explained

1. Definition:
A Support Vector Machine is a supervised learning algorithm used primarily for
classification, but it can also be used for regression and outlier detection.
2. Goal:
The goal of SVM is to find the best separating boundary (called a hyperplane) that
divides the data into classes with the maximum margin.
3. Hyperplane:
A hyperplane is a decision boundary that separates different classes. In 2D, it's a line; in
3D, it's a plane; and in higher dimensions, it's a hyperplane.
4. Margin:
The margin is the distance between the hyperplane and the nearest data points from
either class. SVM tries to maximize this margin to achieve better generalization.
5. Support Vectors:
The support vectors are the data points that lie closest to the hyperplane. These points
influence the position and orientation of the hyperplane and are critical to defining the
decision boundary.
6. Linear vs Non-linear Data:
For linearly separable data, SVM can find a straight hyperplane. For non-linear data,
SVM uses the kernel trick to map data into a higher-dimensional space where a linear
separator may exist.
7. Kernels:
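A kernel is a mathematical function that measures the similarity between two data
points as if they had been mapped into a higher-dimensional space, without computing
that mapping explicitly. Common kernels include: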
o Linear kernel
o Polynomial kernel
o Radial Basis Function (RBF or Gaussian) kernel
o Sigmoid kernel
8. Objective Function:
SVM tries to minimize classification error while maximizing the margin. For linearly
separable data, the optimization objective is:

Minimize: (1/2) * ||w||²

Subject to: yᵢ(w · xᵢ + b) ≥ 1 for all i

9. Soft Margin and C Parameter:
When perfect separation is not possible, SVM allows some misclassification using the
soft margin approach. The C parameter controls the trade-off between maximizing
margin and minimizing classification errors.
10. Applications:
SVM is widely used in real-world problems such as:
o Email spam detection
o Face recognition
o Handwritten digit classification
o Text categorization
o Disease classification in bioinformatics
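
The objective and constraint from points 8 and 9 can be checked numerically for a given hyperplane. The sketch below assumes the standard hinge-loss form of the soft-margin objective, (1/2)||w||² + C · Σ max(0, 1 − yᵢ(w · xᵢ + b)), which the notes describe only in words; the four data points and the values of `w`, `b`, and `C` are made up for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    margins = y * (X @ w + b)                  # y_i (w . x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)     # penalty for margin violations
    return 0.5 * float(w @ w) + C * float(hinge.sum())

# Toy, linearly separable data with labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Candidate hyperplane w . x + b = 0 (values chosen by hand).
w, b = np.array([0.5, 0.5]), -1.0

margins = y * (X @ w + b)
hard_margin_ok = bool((margins >= 1.0).all())  # constraint from point 8
margin_width = 2.0 / np.linalg.norm(w)         # distance between the two margins
objective = soft_margin_objective(w, b, X, y, C=1.0)
print(margins, hard_margin_ok, margin_width, objective)
```

The points whose margin equals exactly 1, here (2, 2) and (0, 0), are the support vectors. A larger `C` makes each violation more costly, which pushes the solution toward fewer misclassifications at the expense of a narrower margin.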
