Unit 4 IML Introduction to Machine Learning
UNIT-IV
Support Vector Machines (SVMs) are a powerful and versatile class of supervised machine
learning algorithms used for classification and regression tasks. At its core, SVM aims to find an
optimal hyperplane that best separates data points into distinct classes. This hyperplane is chosen
to maximize the margin between the two classes, meaning it should be positioned in such a way
that it maximally separates the closest data points from each class. These closest data points are
known as "support vectors." By focusing on support vectors, SVMs achieve better generalization
to new, unseen data.
Suppose you want to predict house prices based on features like the number of bedrooms,
square footage, and age of the house. If the relationship between these features and the
house price can be accurately represented by a linear equation such as
House Price = (Number of Bedrooms × w1) + (Square Footage × w2) + (Age of the House × w3) + b
then the dataset is considered linear. Linear regression is a common algorithm used for such problems.
If you were to plot the data points on a graph, they would form a clear straight line or a
hyperplane that approximates the relationship between the features and the target
variable.
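As a concrete illustration, here is a minimal scikit-learn sketch of fitting such a linear model; the feature values and prices are invented purely for demonstration and are not part of the original example.

```python
# A minimal sketch of fitting the linear house-price model above with
# scikit-learn. The feature values and prices are made-up illustrative data.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: number of bedrooms, square footage, age of the house (years)
X = np.array([[3, 1500, 10],
              [4, 2000, 5],
              [2, 900, 30],
              [5, 2600, 2]])
y = np.array([250_000, 340_000, 150_000, 450_000])  # prices

model = LinearRegression().fit(X, y)
print(model.coef_)       # learned weights w1, w2, w3
print(model.intercept_)  # learned bias b
print(model.predict([[3, 1600, 8]]))  # predicted price for a new house
```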
Suppose you want to predict a student's exam score based on the number of hours they
studied and the number of hours they slept the night before. The relationship between
these features and the exam score might not be linear.
It could be that studying more hours initially increases the score, but after a certain point,
further studying becomes less effective. Also, the quality of sleep might interact with the
effect of studying.
These relationships can be highly non-linear and are better captured by non-linear
models like decision trees, random forests, or support vector machines with non-linear
kernels.
In this case, plotting the data points would result in a curve or a complex shape that
doesn't fit a straight line or a simple plane.
Understanding whether your data is linear or non-linear is crucial for selecting the
appropriate machine learning model and algorithm. Linear regression, for example, is ideal for
linear datasets, while non-linear datasets require more advanced techniques to capture the
underlying patterns effectively.
1. Intuition:
At its core, SVM aims to find a hyperplane that maximizes the margin between two classes of
data points. This hyperplane is known as the decision boundary, and the margin represents the
distance between the boundary and the nearest data points from each class. The idea is to ensure
that the decision boundary has the greatest separation, making it robust to classify new, unseen
data accurately.
2. Linear Separation
For linearly separable data, SVM finds a hyperplane that separates the two classes with the
largest possible margin. In a two-dimensional feature space, this hyperplane is a straight line.
Here's a simplified diagram to illustrate this concept:
The solid line is the decision boundary, and the dashed lines represent the margin.
3. Support Vectors
Support vectors are the data points closest to the decision boundary. These points play a critical
role in defining the margin and the overall performance of the SVM. The margin is determined
by the distance between the support vectors and the decision boundary, as shown below:
4. Non-linear Separation
In many real-world scenarios, data may not be linearly separable. SVM can handle such cases by
mapping the data into a higher-dimensional space where it becomes linearly separable. This
transformation is done using a kernel function. Common kernel functions include the linear,
polynomial, and radial basis function (RBF) kernels. The following diagram demonstrates this
concept:
In this example, data is transformed into a 3D space where it becomes linearly separable.
In summary, the Support Vector Machine is a versatile machine learning algorithm that excels in both linear and
non-linear classification tasks. It aims to find the optimal hyperplane that maximizes the margin
between classes, and it uses support vectors to define this margin. By introducing kernel
functions, SVM can handle non-linear separations, making it a valuable tool in various
applications, from image classification to financial analysis.
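To illustrate these ideas, the following sketch (using scikit-learn with a synthetic "two moons" dataset) compares a linear kernel with an RBF kernel on data that is not linearly separable; the dataset and parameters are illustrative choices, not part of the original discussion.

```python
# A minimal sketch comparing a linear and an RBF-kernel SVM on a toy,
# non-linearly separable dataset (the "two moons" data from scikit-learn).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))
print("number of support vectors (RBF):", rbf_svm.n_support_)
```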
Advantages of SVMs:
They provide strong theoretical foundations, making them a preferred choice for many
researchers and practitioners.
They are effective with high-dimensional data, relatively resistant to overfitting, and can
handle both binary and multiclass classification problems.
Moreover, SVMs are also used in regression tasks, known as Support Vector Regression
(SVR), where they aim to fit a hyperplane that approximates the target values as closely
as possible.
In SVM, the linear discriminant function for binary classification can be represented as
f(x) = sign(w⋅x + b)
Where
f(x): The decision function that determines the class label of a data point x.
w: The weight vector that defines the orientation of the decision boundary.
b: The bias term that shifts the decision boundary away from the origin.
sign(.): The sign function, which assigns a class label based on the sign of the result.
Example
Let's illustrate this with a simple example. Suppose we have a 2D feature space for binary
classification, where class A is represented by Blue points and class B is represented by Black
points. The linear discriminant function is
f(x) = sign(w1⋅x1 + w2⋅x2 + b)
where w1 and w2 are the weights that define the orientation of the decision boundary and b is the bias.
Here's a diagram of the decision boundary
w1⋅x1 + w2⋅x2 + b = 0
that separates the two classes. Any data point x falling on one side of the decision boundary is classified as Class A, and on the other side as Class B.
SVM's objective is to find the optimal values of w and b that maximize the margin
between the two classes while minimizing classification errors.
The support vectors are the data points that are closest to the decision boundary, and they
are used to define the margin in an SVM.
The margin is the perpendicular distance from the decision boundary to the support
vectors.
SVM aims to maximize this margin while ensuring that data points are correctly
classified.
If the data points are not linearly separable, SVM can use kernel functions to map the
data to a higher-dimensional space where linear separation is possible.
Consider a binary classification problem where you have two classes: Class A and Class
B. The objective of SVM is to find a hyperplane that best separates these two classes
while maximizing the margin. The equation for this hyperplane is
w⋅x+b = 0
Where w is the weight vector, x is the input feature vector, and b is the bias term.
The margin is the distance between the hyperplane and the nearest data points from each
class. To maximize this margin, we want to find w and b such that the distance from the
hyperplane to the closest point in Class A and the closest point in Class B is maximized.
Mathematically, this can be represented as:
Margin = 2 / ||w||
Where ||w|| is the Euclidean norm (magnitude) of the weight vector w. The objective of SVM is
to maximize this margin.
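The margin formula can be checked numerically. The sketch below is a non-authoritative illustration: it fits scikit-learn's SVC with a linear kernel (and a large C to approximate a hard margin) on made-up, well-separated clusters and recovers 2/||w|| from the fitted weight vector.

```python
# A minimal sketch: fit a near-hard-margin linear SVM on linearly separable
# toy data and recover the margin width 2 / ||w|| from the learned weights.
import numpy as np
from sklearn.svm import SVC

# two well-separated clusters (Class A near the origin, Class B shifted)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print("w =", w, "b =", clf.intercept_[0])
print("margin width =", margin)
print("support vectors:", clf.support_vectors_)
```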
In the diagram, the decision hyperplane (the straight line) separates Class A from Class B. The
margin is the distance from the hyperplane to the closest data points from each class.
SVM's objective is to find the optimal w and b that maximize this margin while ensuring
that data points are correctly classified.
In this ideal scenario of linearly separable data, the support vectors are the data points
closest to the hyperplane, and they are used to define the margin.
SVM finds these support vectors and optimizes the margin by solving a constrained
optimization problem.
The large margin classifier provides a robust solution for linearly separable data, ensuring a
wider separation between classes and making it less sensitive to noise in the data.
Fig: Representing good and bad SVM classifier models in small and large margin cases
A large margin classifier in SVM for linearly separable data aims to find an optimal hyperplane
that maximizes the margin between two classes, ensuring a robust separation. Support vectors
define this margin, and SVM finds the best hyperplane by minimizing classification errors while
maximizing the margin, enhancing classification accuracy and robustness.
The linear soft margin classifier in SVM aims to find a hyperplane that best separates
overlapping classes, even when perfect separation isn't possible. It introduces a "slack
variable" (ξ) to account for classification errors. The objective function is modified as
follows
min_{w,b} (1/2)∥w∥² + C ∑ᵢ₌₁ⁿ ξᵢ
Subject to: yᵢ(w⋅xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i
Where
ξᵢ: The slack variable for the i-th data point, measuring how far it violates the margin.
C: A hyperparameter that controls the trade-off between maximizing the margin and minimizing the misclassification error.
The decision hyperplane (a straight line) attempts to separate the classes, but due to
overlapping, some data points may lie on the wrong side.
The slack variables (ξᵢ) allow some misclassifications while still trying to maximize the
margin. The parameter C controls the balance between maximizing the margin (small C) and
minimizing classification errors (large C).
It helps SVM adapt to overlapping classes and create a margin that balances the trade-off
between classification accuracy and margin size.
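A small experiment can make the role of C concrete. The following sketch, assuming scikit-learn and synthetic overlapping clusters, fits a linear soft-margin SVM for several values of C and reports the resulting margin width and number of support vectors.

```python
# A minimal sketch of how the soft-margin parameter C changes the fit on
# overlapping classes: small C tolerates more slack (wider margin), large C
# penalizes misclassification more heavily (narrower margin).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1.0, (50, 2)), rng.normal(2, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)   # overlapping classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width={margin:.2f}, "
          f"support vectors={clf.n_support_.sum()}")
```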
In SVM, the kernel function plays a central role. The kernel function, denoted as K(x, y),
takes two input data points x and y and returns a measure of similarity between them.
It implicitly maps the data into a higher-dimensional feature space where linear
separation might be possible.
The equation for SVM's decision boundary in the feature space is:
f(x) = sign( ∑ᵢ₌₁ⁿ αᵢ yᵢ K(xᵢ, x) + b )
Where
αᵢ: The learned coefficient for the i-th training point; it is non-zero only for support vectors.
yᵢ: The class label (+1 or −1) of the i-th training point.
K(xᵢ, x): The kernel function that maps xᵢ and x into the feature space.
b: The bias term.
Consider a simple 2D dataset where Class A (Green points) and Class B (blue points) are
not linearly separable in the original feature space:
In this diagram, it's evident that a straight-line decision boundary cannot separate the
classes effectively in the original 2D space.
Now, by using a kernel function, we implicitly map this data to a higher-dimensional
feature space, often referred to as a "kernel-induced feature space." Let's say we use a
radial basis function (RBF) kernel:
K(x, y) = exp(−γ∥x − y∥²)
This RBF kernel implicitly maps the data to a higher-dimensional space where the classes
might become linearly separable.
Fig: Non-linearly separable data shown in the original 2D space and in a 3D kernel space
In this new feature space, the data points might be linearly separable with the right choice
of kernel and kernel parameters, enabling SVM to find an optimal decision boundary that
maximizes the margin between classes.
The transformation into the kernel-induced feature space is implicit and doesn't require explicit
calculation of the transformed feature vectors. It allows SVM to handle non-linearly separable
data effectively.
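To make the kernel idea concrete, the sketch below (an illustrative example, assuming γ = 0.5) evaluates the RBF kernel directly in the original 2D space and checks the value against scikit-learn's rbf_kernel; no explicit high-dimensional feature vectors are ever formed.

```python
# A minimal sketch of the kernel idea: the RBF kernel value
# K(x, y) = exp(-gamma * ||x - y||^2) is computed directly in the original
# space, without ever building the high-dimensional feature vectors.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf(x, y))                              # manual computation
print(rbf_kernel([x], [y], gamma=0.5)[0, 0])  # same value via scikit-learn
```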
Perceptron Algorithm:
The Perceptron algorithm is a simple supervised learning algorithm used for binary
classification. It's designed to find a linear decision boundary that separates data points of two
classes. Here's an explanation of the Perceptron algorithm with suitable diagrams:
1. Initialize the weights (w) and bias (b) to small random values or zeros.
2. For each data point (x) in the training dataset, compute the predicted class label (ŷ) using the
following formula
ŷ =sign (w⋅x+b)
Here, w represents the weight vector, x is the feature vector of the data point, and sign (.) is a
function that returns +1 for values greater than or equal to zero and -1 for values less than zero.
3. Compare the predicted class label (ŷ) to the true class label (y) of the data point. If they don't
match, update the weights and bias as follows:
wᵢ = wᵢ + α⋅(y − ŷ)⋅xᵢ
b = b + α⋅(y − ŷ)
Here, (α) is the learning rate, and (y − ŷ) is the classification error. These updates help the
Perceptron adjust the decision boundary to classify the data points correctly.
4. Repeat the above steps for a fixed number of iterations or until the algorithm converges,
meaning no more misclassifications occur.
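The four steps above translate almost line for line into code. The following is a minimal NumPy sketch of the perceptron update rule on a tiny, made-up linearly separable dataset with labels +1 and −1.

```python
# A minimal NumPy sketch of the perceptron algorithm described above,
# trained on a tiny linearly separable dataset (labels +1 / -1).
import numpy as np

X = np.array([[2.0, 3.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)        # weights
b = 0.0                # bias
alpha = 0.1            # learning rate

for epoch in range(20):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = 1 if np.dot(w, xi) + b >= 0 else -1   # sign(w.x + b)
        if y_hat != yi:
            w += alpha * (yi - y_hat) * xi            # weight update
            b += alpha * (yi - y_hat)                 # bias update
            errors += 1
    if errors == 0:     # converged: no misclassifications in a full pass
        break

print("w =", w, "b =", b)
```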
Fig: Perceptron
Let's illustrate the Perceptron algorithm with a simple 2D dataset and a linear decision boundary
Suppose you have a 2D feature space with two classes, Class 1 (labeled as -1) and Class 2
(labeled as +1), and you want to separate them using the Perceptron algorithm.
In the initial state, the decision boundary (a straight line) is randomly placed. The
Perceptron algorithm starts making predictions and adjusting the decision boundary based
on classification errors.
As it iterates through the data points, it gradually shifts the decision boundary to correctly
classify the points into their respective classes.
The process continues until no misclassifications remain or a maximum number of
iterations is reached. The final decision boundary separates the two classes effectively.
The Perceptron algorithm finds a linear decision boundary that minimizes classification
errors and correctly classifies the data points based on the training data. It's a basic
algorithm suitable for linearly separable datasets and serves as the foundation for more
complex neural networks.
The key difference between the Perceptron and SVM is that SVM aims to find the optimal
hyperplane that maximizes the margin, whereas the Perceptron algorithm doesn't consider
margin maximization. SVM is a more sophisticated and powerful classification algorithm,
especially suitable for scenarios where data may not be perfectly separable and a margin is
essential for generalization and reducing overfitting.
Linear regression using Support Vector Machines (SVM) is a variation of SVM designed for
regression tasks. It aims to find a linear relationship between input features and a continuous
target variable.
In linear regression using SVM, the goal is to find a linear function that best
approximates the relationship between input features and the target variable. This linear
function is represented as:
f(x) = w⋅x+b
Where
f(x): The predicted target variable.
w: The weight vector.
x: The input feature vector.
b: The bias term.
The linear regression objective is to minimize the mean squared error (MSE) between the
predictions and the true target values
min_{w,b} (1/2)∥w∥² + C ∑ᵢ₌₁ⁿ (yᵢ − f(xᵢ))²
Where
C: A regularization parameter controlling the trade-off between fitting the data and
keeping the model simple.
The target variable (y) is represented on the vertical axis, and the input features (x) are on
the horizontal axis.
The linear function f(x) = w.x + b is the best-fitting line that minimizes the mean squared
error by adjusting the weight vector (w) and the bias term (b).
This linear model can be used for regression tasks to predict continuous target variables
based on input features.
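For illustration, the sketch below fits a linear SVR with scikit-learn on invented one-dimensional data. Note that scikit-learn's SVR minimizes an epsilon-insensitive loss rather than the squared error written above, but the fitted model has the same linear form f(x) = w⋅x + b.

```python
# A minimal sketch of linear regression with an SVM using scikit-learn's SVR.
# The data below is made up for illustration (roughly y = 2x).
import numpy as np
from sklearn.svm import SVR

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

reg = SVR(kernel="linear", C=10.0).fit(X, y)
print("w =", reg.coef_, "b =", reg.intercept_)
print(reg.predict([[6.0]]))     # prediction for a new input
```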
Non-linear regression by Support Vector Machines (SVM) uses the principles of SVM to model
non-linear relationships between input features and a continuous target variable. The key idea is
to use kernel functions to implicitly map the data into a higher-dimensional space, where a linear
regression model can be applied effectively
In non-linear regression using SVM, the goal is to find a non-linear function that best fits
the relationship between input features and the target variable.
Unlike linear regression, which assumes a linear relationship, non-linear regression
allows for more complex, non-linear patterns.
The non-linear regression objective is to minimize the mean squared error (MSE)
between the predictions and the true target values:
min_{w,b} (1/2)∥w∥² + C ∑ᵢ₌₁ⁿ (yᵢ − f(xᵢ))²
Where
K(xi, x): The kernel function that implicitly maps xi and x into a higher-dimensional
feature space.
The target variable (y) is represented on the vertical axis, and the input features (x) are on
the horizontal axis.
The non-linear function f(x) = ∑ᵢ₌₁ⁿ αᵢ K(xᵢ, x) + b captures non-linear relationships
between input features and the target variable by implicitly mapping the data into a
higher-dimensional feature space using the kernel function.
The model can then make non-linear predictions based on the input features.
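As a sketch of non-linear SVR, the example below uses synthetic sine-shaped data with an RBF kernel and illustrative values of C and γ; the fitted model follows a curve that no straight line could fit.

```python
# A minimal sketch of non-linear regression with an RBF-kernel SVR on a
# sine-shaped target. The data is synthetic and purely illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)   # noisy non-linear target

reg = SVR(kernel="rbf", C=10.0, gamma=0.5).fit(X, y)
print(reg.predict([[1.5], [3.0], [4.5]]))        # follows the sine shape
print("number of support vectors:", len(reg.support_))
```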
Learning with Neural Networks toward Cognitive Machines:
Neural networks, particularly deep learning models like convolutional neural networks (CNNs)
and recurrent neural networks (RNNs), serve as the foundation for cognitive machine learning.
These models are capable of handling complex data, learning patterns, and making predictions.
3. Reinforcement Learning:
Cognitive machines also incorporate reinforcement learning, enabling them to learn through
interactions with their environment. Agents learn by receiving rewards and penalties based on
their actions, enabling them to make decisions and adapt over time.
4. Transfer Learning:
To mimic cognitive abilities, neural networks use transfer learning. Pre-trained models are
fine-tuned for specific tasks, which is akin to humans applying knowledge learned in one context
to solve related problems.
5. Multimodal Data Processing:
Cognitive machines process data from various sources (text, images, audio) simultaneously,
fostering a more comprehensive understanding of the environment. They can analyze multiple
data modalities to make informed decisions.
6. Memory and Reasoning:
Cognitive machines integrate memory networks and reasoning modules, enabling them to store
and retrieve information and perform logical reasoning. This allows them to solve problems by
considering context and past experiences.
7. Natural Language Processing:
Cognitive machines excel in natural language processing tasks. They can understand and
generate human-like text and engage in meaningful conversations, making them highly
interactive and adaptive.
8. Contextual Awareness:
These machines have contextual awareness, recognizing the importance of the context in which
they operate. They can adapt their behavior, decisions, and responses based on the current
situation.
9. Continuous Learning:
Cognitive machines don't stop learning after initial training. They engage in continuous
learning and self-improvement, allowing them to adapt to changing conditions and acquire new
knowledge over time.
The ultimate goal of learning with neural networks toward cognitive machines is to create
systems that replicate and augment human-like cognition. They mimic human problem-solving,
decision-making, creativity, and adaptability.
In summary, learning with neural networks toward cognitive machines involves a holistic
approach to developing intelligent systems. By combining various learning techniques, these
machines can process complex data, reason, understand language, adapt to changing situations,
and replicate cognitive functions, bringing us closer to creating intelligent systems that emulate
human cognition and understanding.
Neuron Models:
Let us discuss two neuron models
1. Biological neuron
2. Artificial neuron
1. Biological neuron:
Neuron Structure:
Cell Body (Soma): The cell body contains the nucleus and other organelles.
Dendrites: These are the branched extensions that receive signals from other neurons.
Axon: The axon is a long, slender extension that transmits signals to other neurons or
cells.
Synapses:
Neurons communicate with each other through synapses, which are small gaps between
the axon of one neuron and the dendrites of another. Neurotransmitters are released at the
synapse to transmit signals.
Action Potential:
Neurons transmit electrical signals in the form of action potentials. An action potential is
a brief change in the neuron's electrical charge, leading to the propagation of a signal
along the axon.
Resting Potential:
Neurons maintain a resting potential, which is a difference in electrical charge across the
cell membrane. It is around −70 millivolts and is essential for neural signaling.
When the electrical charge inside the neuron reaches a certain threshold, an action
potential is initiated. This action potential travels down the axon and signals the release
of neurotransmitters at the synapse.
Neural Networks:
Neurons are interconnected in complex networks. These networks allow for information
processing, learning, and memory formation.
2. Artificial Neuron:
Inputs and Weights:
The neuron receives inputs (x1, x2, ..., xn), and each input is multiplied by an associated weight (w1, w2, ..., wn).
Summation (Σ):
The weighted inputs are summed together, typically with a bias term (b), to compute the net input:
z = w1⋅x1 + w2⋅x2 + ... + wn⋅xn + b
Activation Function:
The net input z is passed through an activation function to produce the neuron's output.
Output (y):
The result of the activation function is the output of the artificial neuron. It represents the
neuron's response to the input signals.
In the biological brain, a huge number of neurons are interconnected to form the network and
perform advanced intelligent activities. The artificial neural network is built by neuron models.
Many different types of artificial neural networks have been proposed, just as there are many
theories on how biological neural processing works. We may classify the organization of neural
networks into two types:
1. Single-layer neural networks
2. Multi-layer neural networks
A single-layer neural network, also known as a single-layer Perceptron, is the simplest neural
network architecture. It consists of an input layer, which directly connects to an output layer,
without any hidden layers. Single-layer networks are mainly used for binary classification
problems or linearly separable tasks.
The output is computed from the weighted sum of the input features:
z = ∑ᵢ₌₁ⁿ wᵢ⋅xᵢ + b
Where:
wᵢ are the weights, xᵢ are the input features, and b is the bias.
A step function, also known as the Heaviside step function, is often used as the activation
function. It outputs 1 if the weighted sum z is greater than or equal to 0, and 0 otherwise
f(z) = 1 if z ≥ 0
0 if z < 0
In the diagram, input features (x1, x2... xn) are connected to the weighted sum calculation,
followed by the activation function (step function), which produces a binary output (0 or
1).
This single-layer neural network can make binary decisions based on the weighted sum
of its input features, which is often used for linearly separable classification problems.
Single-layer networks are limited in their capability compared to more complex neural
architectures like multi-layer perceptrons (MLPs) or deep neural networks.
They can only solve problems that are linearly separable and cannot capture complex
non-linear relationships in data. While simple, they are foundational in understanding
neural networks and are a starting point for more sophisticated architectures. To handle
more complex tasks, deeper neural networks with hidden layers are employed.
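The forward computation of such a single-layer network can be written in a few lines. The sketch below is illustrative only; the weights, bias, and inputs are arbitrary assumed values.

```python
# A minimal NumPy sketch of a single-layer network's forward computation:
# weighted sum of inputs plus bias, followed by the Heaviside step function.
import numpy as np

def step(z):
    return 1 if z >= 0 else 0

x = np.array([0.8, 0.2, -0.5])       # input features x1, x2, x3
w = np.array([0.4, -0.6, 0.3])       # weights w1, w2, w3
b = 0.1                              # bias

z = np.dot(w, x) + b                 # weighted sum
print("z =", z, "output =", step(z)) # binary output (0 or 1)
```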
Multi-layer neural networks, often referred to as multi-layer perceptrons (MLPs), are a type of
artificial neural network with multiple layers of interconnected neurons. These networks are
designed to handle more complex tasks by introducing hidden layers between the input and
output layers.
The weighted sum for each neuron in a hidden layer is calculated as follows:
zⱼ = ∑ᵢ₌₁ⁿ wᵢⱼ⋅xᵢ + bⱼ
Where
wᵢⱼ is the weight connecting input i to neuron j in the hidden layer, xᵢ is the i-th input feature, and bⱼ is the bias of neuron j.
Common activation functions for hidden layers include the sigmoid, ReLU, or tanh functions.
The weighted sum for each neuron in the output layer is calculated similarly to the hidden layer:
zₖ = ∑ⱼ₌₁ᵐ w′ₖⱼ⋅f(zⱼ) + b′ₖ
Where
w′ₖⱼ is the weight connecting neuron j in the hidden layer to neuron k in the output layer, f(zⱼ) is the activation of hidden neuron j, and b′ₖ is the bias of output neuron k.
The activation function in the output layer depends on the type of problem. For binary
classification, you might use a sigmoid function. For multiclass classification, a softmax
function is common.
In this diagram, input features (x1, x2... xn) are connected to the weighted sum
calculations in the hidden layer, followed by the activation function for the hidden layer. The
output of the hidden layer is then connected to the weighted sum calculations in the output layer,
followed by the activation function for the output layer. This network structure allows multi-
layer neural networks to capture complex relationships and solve a wide range of tasks, including
classification, regression, and more.
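A single forward pass through such a network is easy to write out. The following NumPy sketch uses one hidden layer with a sigmoid activation; the weights, biases, and inputs are arbitrary illustrative numbers, not values from the text.

```python
# A minimal NumPy sketch of one forward pass through a multi-layer network
# with a single hidden layer, using the sigmoid activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 0.3])          # input features (x1, x2, x3)

W_hidden = np.array([[0.2, -0.4, 0.1],  # weights w_ij of the hidden layer
                     [0.7,  0.3, -0.5]])
b_hidden = np.array([0.1, -0.2])        # hidden-layer biases b_j

W_out = np.array([[0.6, -0.8]])         # weights w'_kj of the output layer
b_out = np.array([0.05])                # output-layer bias b'_k

h = sigmoid(W_hidden @ x + b_hidden)    # z_j = sum_i w_ij x_i + b_j, then f(z_j)
y = sigmoid(W_out @ h + b_out)          # z_k = sum_j w'_kj f(z_j) + b'_k
print("hidden activations:", h)
print("network output:", y)
```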
Linear Neuron:
Inputs (x1, x2... xn): A linear neuron takes multiple input values (x1, x2... xn). Each input is
associated with a weight (w1, w2... wn), which represents the importance of that input.
Weighted Sum (z): The weighted sum of inputs is computed as
z = w1⋅x1 + w2⋅x2 + ... + wn⋅xn
Threshold (θ): The weighted sum is compared to a threshold (θ) to produce the output.
Output (y): If the weighted sum z is greater than or equal to the threshold θ, the neuron's
output is 1. Otherwise, the output is 0.
y(z) = 1 if z ≥ θ
0 if z < θ
A linear neuron can be used for binary classification, where it acts as a simple decision-maker,
and the weights and threshold are adjusted to make correct classifications.
The Widrow-Hoff learning rule, also known as the delta rule or the LMS (Least Mean Squares)
algorithm, is a supervised learning algorithm used to adjust the weights of a linear neuron to
minimize the error in classification or regression tasks. It updates the weights based on the
prediction error and the input values. The update rule for the i-th weight is as follows:
wᵢ = wᵢ + α⋅(d − y)⋅xᵢ
Where
α: is the learning rate, controlling the step size for weight updates.
d: is the desired (target) output, y is the actual output, and xᵢ is the i-th input value.
The learning rule adjusts the weights in the direction that reduces the error. It continues to
update the weights in an iterative process until the error is minimized or converges to a
satisfactory level.
The Widrow-Hoff learning rule is a foundational concept in machine learning and neural
networks, providing a mechanism for training linear neurons to make accurate binary
classifications or predictions in a supervised learning context.
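The rule can be sketched in a few lines of NumPy. In the example below the desired outputs follow the invented relationship d = x1 + 2·x2, so the weights should converge toward (1, 2); the data and learning rate are assumptions for illustration.

```python
# A minimal NumPy sketch of the Widrow-Hoff (LMS) rule: the weights are
# nudged by alpha * error * input after every example.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.0]])
d = np.array([5.0, 4.0, 9.0, 6.0])      # desired outputs (here d = x1 + 2*x2)

w = np.zeros(2)
b = 0.0
alpha = 0.05                            # learning rate

for epoch in range(200):
    for xi, di in zip(X, d):
        y = np.dot(w, xi) + b           # linear neuron output
        error = di - y                  # prediction error
        w += alpha * error * xi         # Widrow-Hoff weight update
        b += alpha * error              # bias update

print("w =", w, "b =", b)               # approaches w = [1, 2], b = 0
```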
The Error Correction Delta Rule, often referred to simply as the Delta Rule or the Delta Learning
Rule, is a supervised learning algorithm used to adjust the weights of artificial neurons in a
neural network, specifically in the context of supervised learning tasks. The primary goal of this
rule is to minimize the error between the actual output of the neuron and the desired target
output.
Actual Output (Y): This is the output produced by the artificial neuron or network based
on the current set of weights and inputs.
Desired Target Output (D): This is the expected or correct output for the given input. It's
provided during the training phase.
Error (E):The error is the difference between the actual output and the desired target
output:
E=D-Y
The goal of the Error Correction Delta Rule is to adjust the weights to minimize the error (E).
The update for the i-th weight wᵢ is given by
wᵢ = wᵢ + α⋅E⋅xᵢ
Where
α: is the learning rate, controlling the step size for weight updates.
xᵢ: is the i-th input value.
Calculate the error (E) by taking the difference between the desired target output (D) and
the actual output (Y).
Adjust each weight (wi) based on the weight update rule, considering the learning rate
(α).
This weight adjustment process is repeated iteratively for multiple data points during the training
process until the error converges to a satisfactory level, meaning that the difference between the
desired and actual outputs is minimized.
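As a worked illustration with assumed values: suppose the desired output is D = 1, the actual output is Y = 0.6, the input on the i-th connection is xᵢ = 2, and the learning rate is α = 0.1. Then the error is E = D − Y = 0.4, and the weight change is Δwᵢ = α⋅E⋅xᵢ = 0.1 × 0.4 × 2 = 0.08, so wᵢ increases by 0.08 on this step.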
The Error Correction Delta Rule is a foundational concept in supervised learning for
neural networks. It's used to train the network by iteratively adjusting the weights to make the
network's predictions more accurate and aligned with the desired target outputs. The choice of
the learning rate is crucial, as it affects the speed and stability of the learning process.
Y.Naga Prasanthi
Assistant Professor
Department of ECE
DIET