
Module 2

Syllabus
Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation
Error – Multi-layer Perceptron in Practice – Examples of using the MLP –
Overview – Deriving Back-Propagation – Radial Basis Functions and Splines –
Concepts – RBF Network – Curse of Dimensionality – Interpolations and Basis
Functions – Support Vector Machines
Perceptron
It is an artificial neural network
It is the simplest possible neural network
It is a binary classifier with three main components:
1. input nodes/input layer
2. Weights and bias
3. Activation function
Types of Activation Functions

Step function
Multilayer Perceptron
•Single-layer networks can only create linear decision boundaries (planes).
•Multi-layer networks can create complex decision boundaries by transforming the input space
through non-linear activation functions in hidden layers.
•This allows the network to solve non-linearly separable problems by finding linear separations in
higher-dimensional spaces created by the hidden layers.
•Hidden layers consist of neurons that apply non-linear
transformations to the input data
XOR problem
Going Forward
Forward Pass (Recall)
•Purpose: To compute the predicted output of the network given the input data.
•Steps:
• Input Layer: The input data is fed into the network.
• Hidden Layers: Each neuron's output is calculated using the weighted sum of inputs plus a bias,
followed by an activation function.
• Output Layer: The final output is computed in the same manner, providing the network's prediction.

•Output: The network's prediction for the given input data.
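To make the forward pass concrete, here is a minimal NumPy sketch of a single-hidden-layer network. The layer sizes, random weights and sigmoid activation are illustrative assumptions, not values from these notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_pass(x, V, b_hidden, W, b_out):
    """One forward pass through a single-hidden-layer MLP."""
    # Hidden layer: weighted sum of inputs plus bias, then activation
    h = sigmoid(V @ x + b_hidden)
    # Output layer: same computation applied to the hidden activations
    y = sigmoid(W @ h + b_out)
    return y

# Example with 2 inputs, 3 hidden neurons, 1 output (random weights)
rng = np.random.default_rng(0)
V, b_hidden = rng.normal(size=(3, 2)), rng.normal(size=3)
W, b_out = rng.normal(size=(1, 3)), rng.normal(size=1)
print(forward_pass(np.array([1.0, 0.0]), V, b_hidden, W, b_out))
```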


Implementation of XOR
•Network Structure:
•Input layer: Nodes A and B
•Hidden layer: Nodes C and D
•Output layer: Node E
•For input (1, 0):
a. Hidden Layer:
•Node C: Input = -1×0.5 + 1×1 + 0×1 = 0.5. Result: C fires (output 1) as 0.5 > 0 (threshold)
•Node D: Input = -1×1 + 1×1 + 0×1 = 0. Result: D doesn't fire (output 0) as 0 ≤ 0 (threshold)
b. Output Layer:
•Node E: Input = -1×0.5 + 1×1 + 0×-1 = 0.5 Result: E fires (output 1) as 0.5 > 0 (threshold)
•XOR Function:
•E fires when A and B are different (1,0 or 0,1)
•E doesn't fire when A and B are the same (0,0 or 1,1)
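A small Python sketch of the threshold network above, using the bias input of -1 and the weights listed for nodes C, D and E, confirms the XOR behaviour. This is one possible weight assignment that realises XOR, mirroring the worked example.

```python
def step(x):
    # Threshold activation: fire (1) only when the weighted sum is positive
    return 1 if x > 0 else 0

def xor_net(a, b):
    # Hidden layer (bias input is -1, weights as in the worked example)
    c = step(-1 * 0.5 + a * 1 + b * 1)   # node C
    d = step(-1 * 1.0 + a * 1 + b * 1)   # node D
    # Output layer combines C and D
    e = step(-1 * 0.5 + c * 1 + d * -1)  # node E
    return e

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))
# Prints 0, 1, 1, 0: E fires exactly when A and B differ
```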
GOING BACKWARDS: BACK-PROPAGATION OF ERROR
Backward Pass (Weight Update)

•Purpose: To update the network's weights and biases to minimize the error between the predicted output and the actual target.

•Steps:
• Calculate Error: Determine the difference between the predicted output and the actual target.
• Compute Gradients: Calculate the gradient of the error with respect to each weight and bias in the network using the
chain rule. This involves:
• Output Layer: Compute the gradient of the error with respect to the output layer's inputs (derivative of the loss
function with respect to the network's output).
• Hidden Layers: Propagate the error back through the network, computing the gradient of the error with respect to the
inputs of each hidden layer neuron.
• Update Weights and Biases: Adjust each weight and bias by a small amount proportional to the negative of its
gradient (using the learning rate to control the size of the update).

Output: Updated weights and biases for the network, aimed at reducing the prediction error.
1. Error minimization in multi-layer perceptrons (MLPs) is more complex than in simple perceptrons
due to multiple layers of weights.

2. Determining which weights caused the error (credit assignment problem) is challenging in MLPs.

3. The simple error function used for perceptrons (Σ(yk - tk)) is inadequate for MLPs as positive and
negative errors can cancel out.

4. A sum-of-squares error function is introduced: E = (1/2) Σk (yk − tk)²

5. This new error function ensures all errors contribute positively to the total error.

6. The (1/2) factor in the error function simplifies differentiation.


•If we differentiate a function, we get its gradient, which tells us the direction along which the function increases and decreases the most. So if we differentiate the error function, we get the gradient of the error.

•The weights of the network are trained so that the error goes downhill until it reaches a local minimum, just
like a ball rolling under gravity
The Multi-layer Perceptron Algorithm
Introduction

The inputs are fed forward through the network, and the error is computed as the sum-of-squares difference between the network outputs and the targets.

The error is fed backwards through the network in order to


[Figure: network with L input nodes, M hidden nodes, and N output nodes]
• first update the second-layer weights

• and then afterwards, the first-layer weights

Initialisation

– initialise all weights to small random values

• Training – repeat:

∗ for each input vector:


Forwards phase:
•compute the activation of each neuron j in the hidden layer(s) as the weighted sum of its inputs passed through the activation function, h_j = g(Σ_i x_i v_ij), where g is the sigmoid
•the hidden activations then go forward to the output neurons, whose activations y_k = g(Σ_j h_j w_jk) are computed with the same activation function


Backward Phase
•compute the error (delta) term at each output neuron, δ_k = (yk - tk) yk(1 - yk), where (yk - tk) is the error term and yk(1 - yk) is the derivative of the sigmoid activation function
•propagate these deltas back through the network to get the hidden-layer deltas, then update the second-layer weights and afterwards the first-layer weights
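Putting the forwards and backward phases together, the sketch below trains a small MLP on XOR with NumPy. The sigmoid activation, sum-of-squares error, learning rate and number of epochs are assumptions for illustration, and convergence depends on the random initialisation (a different seed or more epochs may be needed).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR data; a bias input of -1 is appended to each input vector
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
X = np.hstack([X, -np.ones((X.shape[0], 1))])

rng = np.random.default_rng(1)
L, M, N = 2, 3, 1                              # input, hidden, output sizes
V = rng.normal(scale=0.5, size=(L + 1, M))     # first-layer weights (incl. bias)
W = rng.normal(scale=0.5, size=(M + 1, N))     # second-layer weights (incl. bias)
eta = 0.5                                      # learning rate

for epoch in range(20000):
    # Forwards phase
    H = sigmoid(X @ V)                              # hidden activations
    Hb = np.hstack([H, -np.ones((H.shape[0], 1))])  # append bias input
    Y = sigmoid(Hb @ W)                             # network outputs

    # Backward phase: deltas use (y - t) and the sigmoid derivative y(1 - y)
    delta_o = (Y - T) * Y * (1 - Y)
    delta_h = H * (1 - H) * (delta_o @ W[:-1].T)    # error propagated back

    # Update the second-layer weights first, then the first-layer weights
    W -= eta * Hb.T @ delta_o
    V -= eta * X.T @ delta_h

# Network outputs after training (should approach the XOR targets 0, 1, 1, 0)
H = sigmoid(X @ V)
print(np.round(sigmoid(np.hstack([H, -np.ones((4, 1))]) @ W), 2))
```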
Multilayer perceptron in practice
We are going to look in more detail at the choices that can be made about the network in order to use it for solving real problems.

i) Amount of training data

• Multi-layer perceptrons (MLPs) with one hidden layer contain a substantial number of adjustable parameters, specifically (L + 1) × M + (M + 1) × N weights, where L, M, and N represent the number of nodes in the input, hidden, and output layers, respectively.

• Training these networks involves setting these numerous weights through the back-propagation algorithm, which relies on
errors derived from training data. While more training data generally improves learning outcomes, it also increases
training time.

• There is no precise formula to determine the minimum required amount of data, as it varies depending on the problem.
ii) Number of hidden layers

First Hidden Layer: This layer typically combines sigmoid functions to create ridge-like or hill-shaped functions (as shown in Figures (a) and (b)). The outputs of individual neurons in this layer are not yet "bumps" but rather sigmoid-shaped curves or hills.

Second Hidden Layer (if present): This layer combines the outputs from the first hidden layer. It is at this stage that true "bump" functions can be formed (Figure (c)). The combination of hills from the first layer, when oriented properly (e.g., at 90° to each other), creates localized bump responses.

Output Layer: This is where the final addition of bumps typically occurs. The outputs from the previous layer (either the
first or second hidden layer) are combined linearly to approximate the desired function. If using two hidden layers, this
layer combines the bump functions to create the final output.
The effective learning at each layer
When to stop learning?
Training Process: The MLP is trained over multiple epochs (iterations over
the entire dataset). Weights are adjusted as the network makes errors in each
iteration.

Stopping Criteria: Simple methods like setting a fixed number of iterations or a minimum error threshold are not sufficient; these can lead to overfitting or underfitting.

Validation Set: A separate dataset used to monitor the network's generalization ability during training.

Error Curves: Training error typically decreases rapidly at first, then slows down. Validation error initially decreases but may start increasing at some point.

Early Stopping: The technique of stopping training when the validation error starts to increase.
Examples of MLP
Given Information:
•Input layer: x1 = 0.35, x2 = 0.7
•Hidden layer: h1, h2
•Output layer: o3
•Weights: w11 = 0.2, w21 = 0.2, w12 = 0.3, w22
= 0.3, w13 = 0.3, w23 = 0.9
•Activation function: Sigmoid
•Actual output (y) = 0.5
Steps:
1.Calculate inputs to hidden layer neurons: For h1: net_h1 = x1 * w11 + x2 * w21 = 0.35 * 0.2 + 0.7 *
0.2 = 0.07 + 0.14 = 0.21 For h2: net_h2 = x1 * w12 + x2 * w22 = 0.35 * 0.3 + 0.7 * 0.3 = 0.105 +
0.21 = 0.315
2.Apply sigmoid activation function to hidden layer: sigmoid(x) = 1 / (1 + e^-x) h1 = sigmoid(0.21) = 1
/ (1 + e^-0.21) ≈ 0.5523 h2 = sigmoid(0.315) = 1 / (1 + e^-0.315) ≈ 0.5781
3.Calculate input to output neuron: net_o3 = h1 * w13 + h2 * w23 = 0.5523 * 0.3 + 0.5781 * 0.9 =
0.1657 + 0.5203 = 0.686
4.Apply sigmoid activation to output neuron: o3 = sigmoid(0.686) = 1 / (1 + e^-0.686) ≈ 0.6651
5.Calculate error: Error = (y - o3)^2 / 2 = (0.5 - 0.6651)^2 / 2 ≈ 0.0136
Forward propagation output: 0.6651
Error: 0.0136
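The same forward-propagation calculation can be reproduced in a few lines of Python; the weights, inputs and target are exactly those given above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x1, x2 = 0.35, 0.7
w11, w21, w12, w22, w13, w23 = 0.2, 0.2, 0.3, 0.3, 0.3, 0.9
y = 0.5                                   # target output

h1 = sigmoid(x1 * w11 + x2 * w21)         # net_h1 = 0.21   -> h1 ≈ 0.5523
h2 = sigmoid(x1 * w12 + x2 * w22)         # net_h2 = 0.315  -> h2 ≈ 0.5781
o3 = sigmoid(h1 * w13 + h2 * w23)         # net_o3 ≈ 0.686  -> o3 ≈ 0.6651
error = 0.5 * (y - o3) ** 2               # ≈ 0.0136

print(round(o3, 4), round(error, 4))
```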
Deriving Back Propagation
Prerequisites:
1. d/dx (½x²) = x
2. Chain rule: dy/dx = (dy/dt)·(dt/dx)
3. dy/dx = 0 if y is not a function of x
The output of the neural network (the end of the forward phase of the algorithm) is a function of three
things:
• the current input (x)
• the activation function g(·) of the nodes of the network
• the weights of the network (v for the first layer and w for the second)
1. The Error of the network
The error function (for example, the sum-of-squares error, a scaled version of the mean squared error) for the neural network is defined as E(w) = (1/2) Σk (yk − tk)², with the sum running over k = 1, …, N,

where:
• yk is the output of the network for the k-th training example.
• tk is the target value for the k-th training example.
•N is the number of training examples.
Taking the Partial Derivative
To update the weights using gradient descent, we need to compute the gradient of the error function with
respect to the weights
Start with the error function:
We are going to use a gradient descent algorithm that adjusts each weight wικ for fixed
values of ι and κ, in the direction of the negative gradient of E(w)

Taking the partial derivative ∂E/∂wικ of the error with respect to each weight, the weight update rule is that we follow the gradient downhill, that is, in the direction of −∂E/∂wικ:

wικ ← wικ − η ∂E/∂wικ, where η is the learning rate
Requirement of an activation function
To effectively model a neuron in a neural network, an activation function should have the
following properties:
1.Differentiable:
Must be differentiable to compute gradients during backpropagation.

2.Saturation:
Should saturate at both ends of its range, allowing the neuron to either fire or not.

3.Rapid Transition:
Should change quickly in the middle of its range for sensitivity to input changes.
Derivation of activation function
The sigmoid function is defined as g(x) = 1 / (1 + e^(−x)); a useful property is that its derivative can be written in terms of the function itself, g′(x) = g(x)(1 − g(x)).
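A short numerical check (illustrative, not from the notes) that the sigmoid's derivative really is g(x)(1 − g(x)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare the analytic derivative g(x)(1 - g(x)) with a central-difference estimate
x, eps = 0.5, 1e-6
analytic = sigmoid(x) * (1 - sigmoid(x))
numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(analytic, numerical)   # both ≈ 0.2350
```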
Back Propagation Error
By the chain rule, the output of output layer neuron κ is yκ = g(Σj aj wjκ), where aj are the hidden-layer activations, wjκ the second-layer weights and g the sigmoid.

Error term or delta term at the output:
δo(κ) = (yκ − tκ) yκ(1 − yκ)

Applying the chain rule to the expression for the error at the output gives the weight update for the second-layer weights:
wjκ ← wjκ − η δo(κ) aj

Each hidden node contributes to the activation of all of the output nodes, and so we need to consider all of these contributions (with the relevant weights); the hidden-layer delta is
δh(j) = aj(1 − aj) Σκ wjκ δo(κ)

Now the weight update rule for the hidden-layer (first-layer) weights v is
vij ← vij − η δh(j) xi
Radial Basis Function
Receptive fields: A receptive field refers to the
specific region of the input space that a neuron or
node responds to

Neuron Firing Behavior: If the input x is near to the


center of a neuron's receptive field, that neuron will
fire more strongly. The "center" here refers to the
point in input space where the neuron is most
sensitive.

A Radial Basis Function is a real-valued function


whose value depends only on the distance from a fixed
point, called the center.
Gaussian Function Behavior:In the RBF network, each neuron's response is modeled by a
Gaussian function. The center of this Gaussian corresponds to the neuron's optimal input.
φ(x) = exp(−‖x − c‖² / (2σ²)), where c is the centre and σ controls the width of the receptive field

This graph illustrates a Gaussian Radial Basis Function:


1.The x-axis represents the input space.
2.The y-axis represents the output of the RBF, φ(x).
3.The red dashed line indicates the center (c) of the function.
4.The blue curve shows how the function's output decreases
symmetrically as the distance from the center increases.
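A minimal sketch of the Gaussian RBF above; the centre, width σ and sample inputs are assumed for illustration.

```python
import numpy as np

def gaussian_rbf(x, c, sigma=1.0):
    # Output depends only on the distance of x from the centre c
    return np.exp(-np.linalg.norm(x - c) ** 2 / (2 * sigma ** 2))

c = np.array([0.0])
for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, round(gaussian_rbf(np.array([x]), c), 4))
# Response is maximal (1.0) at the centre and falls off symmetrically with distance
```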
THE RADIAL BASIS FUNCTION (RBF)
NETWORK

Radial Basis Function (RBF) networks are a special category of feed-forward neural networks comprising three layers:
 Input Layer: Receives input data and passes it
to the hidden layer.
 Hidden Layer: The core computational layer
where RBF neurons process the data.
 Output Layer: Produces the network’s
predictions, suitable for classification or
regression tasks.
RBF Working
 Input Vector: The network receives an n-dimensional input vector that needs classification or
regression.
 RBF Neurons: Each neuron in the hidden layer represents a prototype vector from the training set. The
network computes the Euclidean distance between the input vector and each neuron’s center.
 Activation Function: The Euclidean distance is transformed using a Radial Basis Function (typically a
Gaussian function) to compute the neuron’s activation value. This value decreases exponentially as the
distance increases.
 Output Nodes: Each output node calculates a score based on a weighted sum of the activation values
from all RBF neurons. For classification, the category with the highest score is chosen.
The Radial Basis Function Algorithm
Step 1: Selecting the Centres: Centres can be picked at random from the training set, or by applying techniques such as k-means clustering.
K-Means Clustering: In this widely used centre-selection technique, the input data are grouped into k clusters and the centres of these clusters are employed as the centres for the RBF neurons.

Step 2: Calculate the activations of the RBF (hidden) nodes, typically a Gaussian of the distance between the input and each centre


Step 3: Train the output weights by either:
– using the Perceptron OR
– computing the pseudo-inverse of the activations of the RBF centres
Pseudo inverse calculation for weights
Activation Matrix G:
•For each input vector, the activations of all hidden nodes are computed.
•These activations are assembled into a matrix G.
•Outputs of the network can then be computed as y = GW, where W is the matrix of output weights.
•If all the outputs are correct, t = GW, so we want W = G⁻¹t.
•The matrix inverse is only defined if a matrix is square; if G is not square we can use the pseudo-inverse instead, giving W = G⁺t.
•The pseudo-inverse is defined as G⁺ = (GᵀG)⁻¹Gᵀ.
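A sketch of the pseudo-inverse training step in NumPy; the toy 1-D data, the choice of every third training point as a centre, and the Gaussian width are assumptions for illustration.

```python
import numpy as np

def gaussian_rbf(r, sigma=1.0):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Toy 1-D regression data; centres picked from the training set
X = np.linspace(0, 1, 20).reshape(-1, 1)
t = np.sin(2 * np.pi * X).ravel()
centres = X[::3]                           # every 3rd training point as a centre

# Activation matrix G: one row per input, one column per RBF centre
dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
G = gaussian_rbf(dists, sigma=0.2)

# Output weights from the pseudo-inverse: W = G+ t
W = np.linalg.pinv(G) @ t
y = G @ W                                   # network outputs on the training data
print(np.round(np.abs(y - t).max(), 3))     # maximum absolute training error
```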
Curse of Dimensionality
•The curse of dimensionality highlights the challenges of working with high-dimensional data. As the
number of dimensions increases, the volume of the unit hypersphere tends to zero, making it
harder to analyze and interpret the data effectively.

•This phenomenon underscores the importance of dimensionality reduction techniques and careful
feature selection in machine learning and data analysis.
Unit hypersphere
The unit hypersphere is a generalization of circles and spheres to higher dimensions. In a d-dimensional
space, a unit hypersphere is the set of all points that are exactly one unit distance away from a central point
(usually the origin).

•In 2 Dimensions (2D): A unit hypersphere is a circle with radius 1 centered at the origin (0, 0).

•In 3 Dimensions (3D): A unit hypersphere is a sphere with radius 1 centered at the origin (0, 0, 0).
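The shrinking volume can be checked directly from the standard formula V_d = π^(d/2) / Γ(d/2 + 1) for the volume of the unit hypersphere in d dimensions (the formula itself is a known result, not stated in the notes):

```python
import math

def unit_hypersphere_volume(d):
    # Volume of the unit hypersphere in d dimensions: pi^(d/2) / Gamma(d/2 + 1)
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in [1, 2, 3, 5, 10, 20, 50]:
    print(d, unit_hypersphere_volume(d))
# d = 2 -> 3.14159 (circle area), d = 3 -> 4.18879 (sphere volume),
# d = 20 -> ~0.026, d = 50 -> ~2e-13: the volume tends to zero as d grows
```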
How do higher dimensions affect ML?
•The curse of dimensionality affects our machine learning algorithms because as the number of
input dimensions increases, we need more data to help the algorithm generalize well.

•Since our algorithms classify data based on features, more features mean we need more data
points. Therefore, we must be selective about the information we provide to the algorithm,
which requires some prior understanding of the data. So we perform dimensionality reduction
techniques.
Dimensionality Reduction Techniques:
 Feature Selection
 Feature Extraction
 Data Preprocessing
 Handling Missing Values
Interpolation
Interpolation is a fundamental concept in numerical analysis and data science, used to estimate values between known data
points. Imagine you have a set of scattered points on a graph, and you want to draw a smooth line or curve that passes through
or near all these points. This process of filling in the gaps between known data points is called interpolation.

• Why interpolate?

1. Discrete data: In practice, we usually have access only to discrete data points, not the full continuous function.

2. Approximation: Interpolation allows us to estimate function values between known data points.

3. Computational efficiency: A simple representation (like piecewise linear) can be more efficient to compute with than a
complex underlying function.

4. Noise reduction: If the original data contains noise, interpolation can help smooth it out.
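As a simple illustration (with made-up data points), piecewise linear interpolation between known points can be done with NumPy's np.interp:

```python
import numpy as np

# Known (discrete) data points sampled from some underlying function
x_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_known = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

# Estimate values between the known points with piecewise linear interpolation
x_new = np.array([0.5, 1.5, 2.5, 3.5])
y_new = np.interp(x_new, x_known, y_known)
print(y_new)   # [0.4, 0.85, 0.5, -0.35]
```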
Fig 1: The true function we are trying to approximate.
Fig 2: The data points available from the function.
Fig 3: The step-function interpolation; the step shape denotes the output after interpolation.
Conclusion: Not a good interpolation.
Types
1. Step function interpolation (refer Fig 3)

2. Linear Interpolation with Derivative Matching
• Uses straight lines between data points
• Lines are not necessarily horizontal
• Slopes match the first derivative of the function at each point

3. Continuous Piecewise Linear Interpolation
• Linear segments between data points
• Lines meet at data points, ensuring continuity
• Continuous function, but may have discontinuous derivatives

4. Spline Interpolation

5. Radial Basis Function (RBF) interpolation


Spline
A spline is a mathematical function used for interpolation and smoothing of data.

Definition:
1. A spline is a piecewise polynomial function
2. It is composed of multiple polynomial segments joined together at points called knots

Characteristics:
1. Smooth: typically continuous up to a certain degree of derivative
2. Flexible: can approximate complex shapes while maintaining smoothness
3. Local control: changes in one segment don't significantly affect distant parts

Common types:
1. Linear splines: use first-degree polynomials (straight lines), f(x) = mx + c
2. Cubic splines: a powerful interpolation method used to create smooth curves through a set of data points.
• Piecewise cubic polynomials that connect data points
• Each segment is a cubic function f(x) = ax³ + bx² + cx + d, where a, b, c and d are coefficients determined so that the curve passes through the data points
• Use third-degree polynomials; very popular due to their balance of smoothness and computational efficiency
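A short cubic-spline sketch, assuming SciPy is available; the data points are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Data points (knots) to interpolate
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.sin(x)

# Fit piecewise cubic polynomials joined smoothly at the knots
cs = CubicSpline(x, y)

x_new = np.linspace(0, 4, 9)
print(np.round(cs(x_new), 3))       # spline estimates between the knots
print(np.round(np.sin(x_new), 3))   # true values, for comparison
```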
RBF interpolation
It is a method of interpolation that uses radial basis functions to approximate the underlying data.

Uses a combination of radial basis functions centered at each data point to construct the interpolating function.

RBF (Radial Basis Function) interpolation is a method for estimating unknown values in a dataset. It works by:

1.Placing an RBF at each known data point

2.Adjusting the height of each RBF

3.Summing all RBFs to create a smooth surface

An RBF is a function whose value depends only on the distance from its center. The most common RBF is the Gaussian
(bell-shaped) curve.
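A minimal sketch of Gaussian RBF interpolation: one RBF is centred at each known point and the weights are found by solving a linear system so the weighted sum passes through every point. The data and width are assumed for illustration.

```python
import numpy as np

def gaussian(r, sigma=0.5):
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Known data points; one Gaussian RBF is centred at each of them
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

# Solve Phi w = y so that the sum of weighted RBFs passes through every point
Phi = gaussian(np.abs(x[:, None] - x[None, :]))
w = np.linalg.solve(Phi, y)

def interpolate(x_new):
    return gaussian(np.abs(x_new[:, None] - x[None, :])) @ w

print(np.round(interpolate(x), 3))                      # reproduces y at known points
print(np.round(interpolate(np.array([0.5, 2.5])), 3))   # estimates in between
```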
Applications of Interpolation

 Image Processing
 Computer Graphics
 Numerical Analysis
 Signal Processing
 Mathematical Modeling
 Geographic Information Systems (GIS)
 Audio Processing
Support Vector Machine
Support Vector Machines (SVMs) are powerful and versatile machine learning algorithms used primarily for classification and
regression tasks.

Core Concept:
1. SVMs aim to find the optimal hyperplane that best separates different classes in the feature space.
2. For non-linearly separable data, SVMs use the "kernel trick" to map data into a higher-dimensional space where it
becomes linearly separable.

Key Components:
a. Hyperplane: The decision boundary that separates classes.
b. Margin: The distance between the hyperplane and the nearest data points (support vectors).
c. Support Vectors: The data points closest to the hyperplane that define the margin.

Objective:
Maximize the margin between classes to achieve the best generalization.


Which hyperplane gives the best separation of the classes? The separating hyperplane is defined by w·x + b = 0.

•For binary classification:


•w · x + b > 0 classifies as one class (often labeled +1)
•w · x + b < 0 classifies as the other class (often labeled -1)
Support vectors are the data points that are closest to the
hyperplane and influence its position.
Support Vector Machine Terminology
Hyperplane: The decision boundary that is used to separate the data points of different classes in a feature space. In the case of linear classification, it is a linear equation, i.e. w·x + b = 0.

Margin: The distance between the support vectors and the hyperplane. The main objective of the SVM algorithm is to maximize the margin; a wider margin indicates better classification performance.

Support Vectors: The closest data points to the hyperplane, which play a critical role in deciding the hyperplane and margin.

Kernel: The mathematical function used in SVM to map the original input data points into high-dimensional feature spaces, so that the hyperplane can be found easily even if the data points are not linearly separable in the original input space.
Steps to Determine a Linear Decision Boundary using SVM
1. Data Preparation: Prepare the dataset with features (X) and binary labels (y).
2. Solve the Optimization Problem: Objective: maximize the margin between classes. Constraints: ensure correct classification of the training points.
3. Identify Support Vectors: These points define the decision boundary.
4. Determine the Bias Term (b): Use any support vector (xₛ, yₛ): b = yₛ - w · xₛ.
5. Formulate the Decision Boundary: The linear decision boundary is given by the equation w · x + b = 0.
6. Classification of New Points: For a new point x, if w · x + b > 0 classify as the positive class; if w · x + b < 0 classify as the negative class.
7. Evaluate and Refine: Test the model and adjust if necessary.
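A sketch of these steps using scikit-learn's linear-kernel SVC, assuming scikit-learn is installed; the toy dataset is made up for illustration. The library solves the margin-maximisation problem and exposes w, b and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable dataset: features X and binary labels y
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Solve the margin-maximisation problem with a linear kernel
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]      # decision boundary: w . x + b = 0
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)

# Classify a new point by the sign of w . x + b
x_new = np.array([4, 4])
print("class:", clf.predict([x_new])[0], "score:", w @ x_new + b)
```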
Advantages and Disadvantages of SVM
Advantages:
•Effective in High-Dimensional Spaces: performs well even when the number of dimensions exceeds the number of samples.
•Memory Efficient: uses a subset of training points (support vectors) in the decision function.
•Versatile: different kernel functions can be specified for various decision functions.
•Works Well with a Clear Margin of Separation: highly effective when there is a clear margin of separation between classes.
•Robust to Overfitting: especially in high-dimensional spaces, due to regularization.

Disadvantages:
•Not Suitable for Large Datasets: training time can be high for large datasets.
•Sensitive to Noisy Data: performs poorly with overlapping classes.
•No Probabilistic Explanation: does not directly provide probability estimates.
•Kernel Selection Can Be Challenging: choosing the right kernel and tuning parameters can be complex.
•Interpretability Issues: especially with non-linear kernels, the model can be hard to interpret.
