SUPPORT VECTOR
MACHINES
VISUALS INTRO FOR SVM
[Overview diagram]
• Support Vector Machine → Classification, Regression
• Support Vector Classifier → Kernel Method
• Types of SVM → Linear SVM, Non-Linear SVM
• Linearly separable feature space – the clusters of the two classes are located far apart.
• Non-linear or non-separable features – the clusters of the two classes overlap, with little distance between them.
• Outlier/noise in feature space – a distantly located sample.
• Non-linear transformation – mapping the data to a higher-dimensional feature space.
Glossary terms
◼ Support vectors: the sample points that lie closest to both classes.
◼ Margin: the distance between the points and the dividing line (decision boundary). Types: the functional margin, denoted γ̂, and the geometric margin γ, the normalized margin (γ = γ̂ / ||w||; both are defined formally later).
◼ Objective of SVM: SVM is also called a large-margin classifier, and it aims at maximizing the margin between the two classes. In other words, the SVM model tries to enlarge the distance between the two classes by optimizing a well-discriminating decision boundary.
◼ Kernels: used to solve non-separable data classification. Kernels transform features to a higher dimension and optimize the hyperplane.
[Figures: a linear SVM and a non-linear SVM]
Support Vector Machine
• SVM is a linear classifier that can be viewed as an extension of the Perceptron, developed by Rosenblatt in 1958.
• The Perceptron guarantees finding a separating hyperplane if one exists; the SVM finds the maximum-margin separating hyperplane.
Objectives/Aim of SVM
• SVM maximizes the separation between the sample points and the decision boundary.
• For each sample point we measure its distance to the decision boundary.
• The distance of the closest sample point to the decision boundary is called the margin.
• To identify the best decision boundary we can shift the decision line until the margins on both sides become equal.
• Width of the band around the decision line = 2 × margin.
• Among all possible decision boundaries, we choose the line for which the margin width is largest.
• The margin is given by the minimum distance of a training instance from the decision surface.
• Consider a decision line/surface and find the sample points that lie closest to it – those points are the support vectors.
• There are at least 2 support vectors, and typically only a few.
• These support vectors are the crucial points: they determine the best decision boundary.
• Based on this best decision boundary, a given test input is classified (see the sketch below).
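A minimal sketch of these ideas in Python (scikit-learn; the toy data and the large-C setting are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters: class -1 and class +1 (made-up toy data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # very large C ~ near-hard margin

# The few points lying closest to the boundary are the support vectors.
print("support vectors:\n", clf.support_vectors_)

# For a linear SVM the margin is 1/||w||, so the band width is 2/||w||.
w = clf.coef_[0]
print("margin:", 1.0 / np.linalg.norm(w))
print("band width (2 x margin):", 2.0 / np.linalg.norm(w))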
SVM training process
• Consider a binary classification problem.
• Target labels: y = {+, −}.
• The training data consist of n features.
[Figure: decision boundary with the margin hyperplanes H1 and H2]
SVM with hyperplane - dimension
LET'S ANALYZE 1-D DATA FOR CLASSIFICATION
An introduction to the conceptual idea of the SVM classifier.
• A mix of data pertaining to two classes:
  Class 0 → red label
  Class 1 → green label
• Assume a threshold that separates class 0 from class 1.
• Is the threshold obvious for classification?
• We need a strategy for choosing an appropriate threshold.
• Margin → the metric for choosing the threshold.
• Margin → should be maximum.
Noise / Outlier
• Is the threshold still obvious in the presence of an outlier in the data?
• Here a purely maximum-margin SVM becomes "stupid" – a single outlier drags the boundary.
• During SVM training it is sometimes better not to let a few outliers update the model parameters: allowing a mistake is good (like the 5-Star chocolate ad).
• Permitting minimal mistakes makes for a robust classifier.
• This kind of SVM is called soft margin.
• The margin is not hard anymore → there is a loophole, so the margin is "soft".
Soft Margin SVM Classifier
• Advantage: the margin is maximum for the soft-margin classifier, at the cost of a few misclassifications (see the sketch below).
[Figure: 2-D data with support vectors and the soft-margin SVM decision boundary with maximum margin, including one misclassification]
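A small sketch of this hard-versus-soft trade-off in Python (scikit-learn; the 1-D values and the C settings are made up for illustration):

import numpy as np
from sklearn.svm import SVC

# Two 1-D classes plus one class-1 outlier sitting right next to class 0.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [3.5]])
y = np.array([0, 0, 0, 1, 1, 1, 1])  # last point is the outlier

for C in (1e6, 0.1):  # near-hard margin vs. soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0][0], clf.intercept_[0]
    print(f"C={C:g}: threshold x={-b / w:.2f}, margin={1.0 / abs(w):.2f}, "
          f"training errors={np.sum(clf.predict(X) != y)}")
# Large C yields zero training errors but a razor-thin margin squeezed
# against the outlier; small C accepts one error and keeps the margin wide.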
LINEAR DISCRIMINANT CLASSIFICATION
Who are support vectors?
• A support vector is a sample that is most likely to be incorrectly classified, or a sample close to the boundary.
Data set
[Figures: separable data vs. non-separable data]
Hinge loss function
• Hinge loss helps ensure that data points are not just classified correctly but are also far from the decision boundary, giving a clear distinction between classes (i.e., a wide margin).
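The formula itself was on the original slide as an image; below is a minimal sketch of the standard hinge loss, assuming labels y ∈ {−1, +1} and scores f(x) = w·x + b:

import numpy as np

# Hinge loss: max(0, 1 - y * f(x)). Zero only when a point is correctly
# classified AND outside the margin (y * f(x) >= 1); points inside the
# margin or on the wrong side are penalized linearly.
def hinge_loss(y, scores):
    return np.maximum(0.0, 1.0 - y * scores)

y = np.array([+1, +1, -1, -1])
scores = np.array([2.0, 0.5, -3.0, 0.2])  # f(x) = w.x + b for each point
print(hinge_loss(y, scores))  # -> [0.   0.5  0.   1.2]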
Gradient descent to maximize margin
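The slide's derivation is not reproduced here; below is a minimal sketch of (sub)gradient descent on the regularized hinge objective λ/2·||w||² + (1/n)·Σᵢ max(0, 1 − yᵢ(w·xᵢ + b)). The λ, learning rate, and data are illustrative assumptions.

import numpy as np

def train_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on lam/2 * ||w||^2 + mean hinge loss.

    Minimizing ||w|| widens the margin; the hinge term penalizes
    margin violations, so together they pull toward a maximum-margin fit.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        scores = X @ w + b
        active = (y * scores) < 1                       # margin violators
        grad_w = lam * w - (y[active] @ X[active]) / n  # subgradient w.r.t. w
        grad_b = -np.sum(y[active]) / n                 # subgradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 5.0], [6.0, 5.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = train_svm(X, y)
print("w =", w, "b =", b, "margin ~", 1.0 / np.linalg.norm(w))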
Why is the constraint yᵢ(w·xᵢ + b) ≥ 1 used?
WHY IS ½||w||² TO BE MINIMIZED?
Reason → it leads to the maximum margin. But how? The margin width is 2/||w||, so minimizing ||w|| (equivalently ½||w||²) maximizes the margin – exactly what a large-margin classifier needs.
So we need to measure the margin.
• What kind of measurement scale are we going to use?
How to determine the support vectors?
Optimization problem – the goal of SVM:
1. Maximize the margin from the support vectors.
2. Maximize the minimum distance.
1. The optimization objective starts from the concepts of functional and geometric margin; the final optimization objective is then derived.
2. The functional margin of a hyperplane w.r.t. the i-th training example is defined as:
   γ̂ᵢ = yᵢ(w·xᵢ + b)
3. The functional margin of a hyperplane w.r.t. the entire dataset is defined as:
   γ̂ = minᵢ γ̂ᵢ
4. The geometric margin of a hyperplane w.r.t. the i-th training example is the functional margin normalized by ||w||:
   γᵢ = γ̂ᵢ / ||w||
[Figure: A is the i-th training example; AB is the geometric margin of the hyperplane w.r.t. A]
Discrimination function
• We need the weight parameters w1, w2.
• Geometric & functional margin γ.
CASE I: Solution for a perfectly separable dataset
Training process: using the SVM objective function, we determine the functional and geometric distances to learn the optimal decision boundary with maximum margin from the training sample points.
Goal for testing/prediction: the product of the predicted label and the actual label should yield a positive value if the test point is correctly classified, and vice versa.
Testing process: every test sample point belonging to class "+" should lie above the main decision boundary; "−" test points lie below it.
Primal form of SVM:
   min over w, b:  ½||w||²   s.t.  yᵢ(w·xᵢ + b) ≥ 1 for all i
What are we going to optimize, and how?
The margin is obtained by subtracting the distances of the two margin hyperplanes, D1 − D2, which gives a band of width 2/||w||. The maximum-margin classifier (MMC) problem is therefore formulated as
   max over w, b:  2/||w||   s.t.  yᵢ(w·xᵢ + b) ≥ 1 for all i
WHY IS yᵢ(w·xᵢ + b) ≥ 1 CONSIDERED AS THE CONSTRAINT OF THE OPTIMIZATION?
Reason → it states that every training point is correctly classified (and lies on or outside the margin).
Hard margin versus soft margin (same dataset)
By being lenient with slack variables, the maximum margin is achieved.
HARD MARGIN (strict classifier): define an optimal hyperplane by maximizing the margin, without any misclassification.
SOFT MARGIN (lenient classifier): extend the above definition to non-linearly separable problems by adding a penalty term for misclassifications.
Linear, Soft-Margin SVM Classifier
Hard margin versus soft margin
1. Given a (training) dataset consisting of positive and negative class instances.
2. The objective is to find a maximum-margin classifier, in terms of a hyperplane (the vector w and scalar b) that separates the positive and negative instances in the training dataset.
3. If the dataset is noisy (with some overlap between positive and negative samples), there will be some error in classifying them with the hyperplane.
4. In the latter case the objective is to minimize the classification errors along with maximizing the margin, and the problem becomes a soft-margin SVM (as opposed to the hard-margin SVM without slack variables).
5. A slack variable per training point is introduced to include the classification errors (for the misclassified points in the training dataset) in the objective; this can also be thought of as adding regularization.
Hard margin:
1. Does not require guessing the cost parameter (requires no parameters at all).
Soft margin:
1. Always has a solution.
2. More robust to outliers.
3. Smoother surfaces (in the non-linear case).
Linear, Hard-Margin SVM Formulation
• Find the w, b that solve:
   Optimization function:  min over w, b:  ½||w||²
   Constraint:  yᵢ(w·xᵢ + b) ≥ 1, ∀xᵢ
• The problem is convex, so there is a unique global minimum value (when feasible).
• There is also a unique minimizer, i.e., the w and b values that attain that minimum.
• It is not solvable if the data are not linearly separable.
• It is a quadratic programming problem.
• It is very efficient computationally with modern constrained-optimization engines (handles thousands of constraints and training instances).
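As a sketch, this quadratic program can be solved directly on a toy dataset (here scipy's general-purpose SLSQP solver stands in for a dedicated QP engine; the data are made up):

import numpy as np
from scipy.optimize import minimize

# Hard-margin primal: min 1/2 * ||w||^2  s.t.  y_i * (w.x_i + b) >= 1.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(z):                      # z = (w1, w2, b)
    return 0.5 * (z[0] ** 2 + z[1] ** 2)

constraints = [{"type": "ineq",        # "ineq" means fun(z) >= 0
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin width =", 2.0 / np.linalg.norm(w))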
Linear, Soft-Margin SVMs
   Optimization function:  min over w, b, ξ:  ½||w||² + C Σᵢ ξᵢ
   Constraints:  yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ, ∀xᵢ;  ξᵢ ≥ 0
• Used if the dataset comprises non-separable features; the ξᵢ are slack variables.
• The algorithm tries to keep each ξᵢ at zero while maximizing the margin.
• Note: the algorithm does not minimize the number of misclassifications but the sum of distances from the margin hyperplanes.
• Other formulations use ξᵢ² instead.
• C is used to penalize misclassifications.
• As C → ∞, we get closer to the hard-margin solution.
Soft Margin: Formulating the Optimization Problem
New primal form for soft SVM:
   min over w, b, ξ:  ½||w||² + C Σᵢ ξᵢ
   Constraint becomes:  yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ, ∀xᵢ;  ξᵢ ≥ 0
• The objective function penalizes misclassified instances and those within the margin.
• C trades off the margin width and misclassifications.
[Figure: Var1 vs. Var2 scatter with the hyperplanes w·x + b = −1, 0, +1, margin width 1/||w|| on each side, and slacks ξᵢ for points inside the margin]
1. The optimization problem (the primal form above) is quadratic in nature, since it has a quadratic objective with linear constraints.
2. It is easier to solve the optimization problem in the dual space rather than the primal space, since there are fewer variables.
3. Hence the optimization problem is often solved in the dual space by converting the minimization into a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian (using the Lagrange multiplier method) and then applying the Karush-Kuhn-Tucker (KKT) conditions for a saddle point.
Dual form formulation:
Dual Problem: Re-expresses the optimization entirely in terms of these multipliers, with
the key data interaction appearing as an inner product.
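For reference, the standard dual of the soft-margin problem (consistent with the primal form above) is:
   max over α:  Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢαⱼ yᵢyⱼ (xᵢ·xⱼ)
   s.t.  Σᵢ αᵢyᵢ = 0  and  0 ≤ αᵢ ≤ C
The training data enter only through the inner products xᵢ·xⱼ, which is exactly what lets a kernel K(xᵢ, xⱼ) replace them later.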
Applications
• Face detection
• Handwriting recognition
• Bioinformatics
• Text classification
LINEAR SVM NUMERICAL PROBLEMS
[Figure: training points with the support vectors highlighted]
Using the Lagrange multiplier method to solve the linear SVM.
Multiclass classification
Advantages and Disadvantages
Advantages:
• Simplified calculation and an efficient algorithm.
• Works efficiently with a relatively large number of features without computational blow-up.
• Avoids overfitting.
Disadvantages:
• The kernel method may end up overfitting.
• Choosing the optimal kernel may take much computation time.
Transformed input data to higher dimension
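A minimal sketch of this transformation, assuming the illustrative map x → (x, x²) on made-up 1-D data:

import numpy as np
from sklearn.svm import SVC

# Class 0 is sandwiched between the two class-1 groups, so no single
# threshold separates them in 1-D...
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# ...but after mapping x -> (x, x^2), class 1 sits high on the parabola
# and class 0 sits low, so a straight line separates them.
X2 = np.column_stack([x, x ** 2])
clf = SVC(kernel="linear", C=1e6).fit(X2, y)
print("training accuracy in 2-D:", clf.score(X2, y))  # 1.0: now separable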
[Figure: kernel similarity – similar features vs. non-similar features]
NON-LINEAR SVM NUMERICAL EXAMPLE
• Mapping is done to a higher dimension.
• Identify the support vectors.
Discrimination function
• We need the weight parameters w1, w2.
• Geometric & functional margin γ.
FIND THE BEST HYPERPLANE – Numerical example
Radial Basis Kernel
Polynomial Kernel
• (a·b + r)^d
• Let r = ½ and d = 2:
  (a·b + ½)² = (a·b + ½)(a·b + ½)
            = a²b² + a·b + ¼
            = a·b + a²b² + ¼            [reordering the terms]
            = (a, a², ½) · (b, b², ½)   [a dot product]
The dot product gives the high-dimensional coordinates for the data.
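A quick numeric check of this identity in Python (the values are arbitrary):

import numpy as np

a, b = 1.7, -0.4
kernel_value = (a * b + 0.5) ** 2           # (a.b + 1/2)^2, computed directly

phi = lambda x: np.array([x, x ** 2, 0.5])  # the mapping (x, x^2, 1/2)
dot_value = phi(a) @ phi(b)                 # dot product in the mapped space

# Both print ~0.0324: the kernel returns the high-dimensional dot product
# without ever constructing the mapped coordinates.
print(kernel_value, dot_value)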
Radial Kernel
• The radial kernel finds support vector classifiers in infinite dimensions, so it is not possible to directly visualize what it does.
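For reference, the radial basis function (RBF) kernel is commonly written as K(a, b) = exp(−γ‖a − b‖²), with γ > 0 controlling its width; its Taylor expansion contains polynomial terms of every degree, which is why the implicit feature space is infinite-dimensional.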