SUPPORT VECTOR MACHINES
A Visual Introduction to SVM
Dr. S. Sridevi, ASP/SCOPE, VIT University Chennai

Overview
• Support Vector Machine / Support Vector Classifier
• Types of SVM: Linear SVM, Non-Linear SVM
• Tasks: Classification and Regression
• Kernel Method

• Linearly separable feature space – clusters of the two classes are located far apart.
• Non-linear or non-separable features – clusters of the two classes overlap, with little distance between them.
• Outlier/noise in feature space – a distantly located sample.
• Non-linear transformation of the data to a higher-dimensional feature space (used to handle the non-separable case).

Glossary terms
◼ Support vectors: the sample points of both classes that lie closest to the decision boundary.
◼ Margin: the distance between these points and the dividing line (decision boundary). Two types:
   – Functional margin: yᵢ (w · xᵢ + b).
   – Geometric margin: the functional margin normalized by ‖w‖.
◼ Objective of SVM: SVM is also called a large-margin classifier; it aims to maximize the margin between the two classes. Equivalently, the SVM model tries to enlarge the distance between the two classes by optimizing a well-discriminating decision boundary.
◼ Kernels: used to solve non-separable data classification. Kernels transform features to a higher dimension, where the hyperplane is then optimized.

[Figures: a linear SVM and a non-linear SVM decision boundary]
Support Vector Machine
• SVM is a linear classifier that can be viewed as an extension of the Perceptron developed by Rosenblatt in 1958.
• The Perceptron is guaranteed to find a separating hyperplane if one exists; the SVM finds the maximum-margin separating hyperplane.
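As a concrete illustration (a minimal sketch, assuming scikit-learn is available; the toy data and parameter values are invented for the example), a linear SVM can be fit and its maximum-margin hyperplane inspected as follows:

```python
# Minimal sketch (toy data invented for illustration): fit a linear SVM and
# inspect the maximum-margin hyperplane it finds.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],      # class -1
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # weight vector of the separating hyperplane
b = clf.intercept_[0]               # bias term
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))
```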

Objectives/Aim of SVM
• SVM maximizes the separation between the sample points and the decision boundary.
• For each sample point we measure its distance to the decision boundary.
• The distance of the closest sample point to the decision boundary is called the margin.
• To identify the best decision boundary, we can shift the decision line until the margin toward each class becomes equal.
• Width of the band around the decision line = 2 × margin.
• Among all possible decision boundaries, we choose the line for which the margin width is largest (see the sketch after this list).
• The margin is given by the minimum distance of a training instance from the decision surface.
• Given a decision line/surface, the sample points that lie closest to it are the support vectors.
• There are at least two support vectors, and typically only a few.
• These support vectors are the crucial points that determine the best decision boundary.
• Based on this best decision boundary, a given test input is classified.
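A small numeric sketch of the idea above (the data and the candidate boundaries are invented purely for illustration): the margin of a candidate boundary is the distance of the closest training point to it, and the preferred boundary is the one whose margin is largest.

```python
# Illustrative sketch: the margin of a boundary w.x + b = 0 is the minimum
# distance of any training point to it; keep the candidate with the
# largest margin.  Data and candidate boundaries are made up.
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 5.0], [6.0, 5.5]])
y = np.array([-1, -1, 1, 1])

def margin(w, b, X, y):
    # signed distance of each point; positive when correctly classified
    dist = y * (X @ w + b) / np.linalg.norm(w)
    return dist.min()                 # distance of the closest point

candidates = [
    (np.array([1.0, 1.0]), -7.0),     # hypothetical boundary 1
    (np.array([1.0, 0.5]), -5.0),     # hypothetical boundary 2
]
best = max(candidates, key=lambda wb: margin(*wb, X, y))
print("best (w, b):", best, "margin:", margin(*best, X, y))
```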

SVM training process
• Consider a binary classification problem.
• Target labels y = {+, −}.
• The training data consists of n features.

[Figure: separating boundary with the margin hyperplanes H1 and H2]

SVM with hyperplane – dimension

[Figure slides illustrating the separating hyperplane for data of different dimensions]

LET'S ANALYZE 1-D DATA FOR CLASSIFICATION
(an introduction to the conceptual idea of the SVM classifier)

• We have a mix of data belonging to two classes:
   – Class 0 → red label
   – Class 1 → green label
• Assume a threshold that switches between class 0 and class 1.
• Is the threshold obvious to choose? We need a strategy to pick an appropriate threshold.
• Margin → the metric used to choose the threshold.
• Margin → should be as large as possible.

Noise / outliers
• Is the threshold still obvious in the presence of an outlier in the data?
• A hard-margin SVM behaves poorly here.
• During SVM training it can be better not to update the model parameters for a few outliers; allowing a small number of mistakes is acceptable (like the 5-Star chocolate ad).
• Permitting minimal mistakes makes the classifier more robust.
• This kind of SVM is called soft margin: the margin is not hard anymore, there is a loophole, so the margin is "soft".

Soft-Margin SVM Classifier
• Advantage: the margin is maximized for the soft-margin classifier while a few misclassifications are tolerated.

[Figure: 2-D data with support vectors, the maximum-margin decision boundary of the soft-margin SVM model, and a misclassified point]

LINEAR DISCRIMINANT CLASSIFICATION



Who are the support vectors?

• A support vector is a sample that is most likely to be misclassified, or a sample that lies close to the boundary.

Data set

[Figure: a separable data set versus a non-separable data set]

Hinge loss function

• Hinge loss helps ensure that data points are not just classified correctly but are also far from the decision boundary, giving a clear distinction between classes (i.e., a wide margin).
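A tiny sketch of the hinge loss (values invented for illustration): the loss is zero only when a point is both correctly classified and outside the margin, i.e. when yᵢ (w · xᵢ + b) ≥ 1.

```python
# Per-sample hinge loss for a linear scorer f(x) = w.x + b (illustrative values).
import numpy as np

def hinge_loss(w, b, X, y):
    scores = X @ w + b
    return np.maximum(0.0, 1.0 - y * scores)

w, b = np.array([1.0, -1.0]), 0.0
X = np.array([[1.3, 1.0], [3.0, 1.0], [0.5, 1.0]])
y = np.array([+1, +1, -1])
print(hinge_loss(w, b, X, y))   # approximately [0.7, 0.0, 0.5]
```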


Gradient descent to maximize margin


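The slides derive the update rule in the accompanying figures; as a hedged stand-in, here is a generic sub-gradient descent sketch on the regularized hinge-loss objective (the exact update on the slides may differ, and the data and hyper-parameters below are illustrative):

```python
# Sub-gradient descent on
#   J(w, b) = (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i (w.x_i + b))
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0                                   # margin violators
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 5.0], [6.0, 5.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = train_linear_svm(X, y)
print(w, b, y * (X @ w + b))   # margins should end up positive on this toy data
```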

Why is this constraint used?

WHY IS THIS OBJECTIVE TO BE OPTIMIZED?
Reason → it leads to the maximum margin. But how?
(Recall that SVM is a large-margin classifier.)

So we need to measure the margin.

• What kind of measurement scale are we going to use?

How do we determine the support vectors?

Optimization problem – the goal of SVM:
1. Maximize the margin from the support vectors.
2. Maximize the minimum distance.

1. The optimization objective starts from the concepts of functional and geometric margin; the final optimization objective is then derived.
2. Functional margin of a hyperplane w.r.t. the i-th training example:
      yᵢ (w · xᵢ + b)
3. Functional margin of a hyperplane w.r.t. the entire dataset: the minimum of the per-example functional margins,
      minᵢ yᵢ (w · xᵢ + b)
4. Geometric margin of a hyperplane w.r.t. the i-th training example, i.e. the functional margin normalized by ‖w‖:
      yᵢ (w · xᵢ + b) / ‖w‖

[Figure: A is the i-th training example; AB is the geometric margin of the hyperplane w.r.t. A]
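A small numeric sketch of these definitions (the hyperplane and data are invented for illustration):

```python
# Functional and geometric margins for a fixed hyperplane (w, b).
import numpy as np

w, b = np.array([1.0, 1.0]), -7.0
X = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 5.0], [6.0, 5.5]])
y = np.array([-1, -1, 1, 1])

functional = y * (X @ w + b)                  # per-example functional margin
geometric = functional / np.linalg.norm(w)    # normalized by ||w||

print("functional margins:", functional, " dataset minimum:", functional.min())
print("geometric margins :", geometric, " dataset minimum:", geometric.min())
```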

Discrimination function
• We need the weight parameters w1, w2.
• Geometric & functional margin γ.

Optimization problem – the goal of SVM:
1. Maximize the margin from the support vectors.
2. Maximize the minimum distance.

CASE I: Solution for a perfectly separable dataset

Training process: using the SVM objective function, the functional and geometric distances are computed from the training sample points to learn the optimal decision boundary with maximum margin.

Goal for testing/prediction (once the training phase is done): the product of the predicted label and the actual label should yield a positive value when the test point is correctly classified, and a negative value otherwise.

Testing process: every test sample point belonging to the "+ve" class should lie above the main decision boundary, and every "−ve" test point should lie below it.

[Equation figure: Primal Form of SVM]

What are we going to optimize, and how?

• The margin width is obtained by subtracting the two distances, D1 − D2 (the distance between the two margin hyperplanes, which equals 2/‖w‖).
• The problem of the maximum-margin classifier (MMC) is therefore formulated as:
      max 2/‖w‖

WHY IS yᵢ (w · xᵢ + b) ≥ 1 CONSIDERED AS THE CONSTRAINT OF THE OPTIMIZATION?
Reason → it enforces correct classification: every training point must lie on the correct side of its margin hyperplane.

Hard margin versus soft margin (same dataset)

• By being lenient with slack variables, the maximum margin is achieved.
• HARD MARGIN (strict classifier): define an optimal hyperplane by maximizing the margin, without allowing any misclassification.
• SOFT MARGIN (lenient classifier): extend the above definition to non-linearly separable problems by adding a penalty term for misclassifications.

Linear, Soft-Margin SVM Classifier



Hard margin versus soft margin

1. We are given a (training) dataset consisting of positive and negative class instances.
2. The objective is to find a maximum-margin classifier, in terms of a hyperplane (the vector w and the bias b), that separates the positive and negative instances in the training dataset.
3. If the dataset is noisy (with some overlap between positive and negative samples), there will be some error in classifying them with the hyperplane.
4. In the latter case the objective is to minimize the classification errors along with maximizing the margin, and the problem becomes a soft-margin SVM (as opposed to the hard-margin SVM, which has no slack variables).
5. A slack variable per training point is introduced to include the classification errors (for the misclassified points in the training dataset) in the objective; this can also be thought of as adding regularization.

Hard margin:
1. Does not require guessing the cost parameter (requires no parameters at all).

Soft margin:
1. Always has a solution.
2. More robust to outliers.
3. Gives smoother surfaces (in the non-linear case).

Linear, Hard-Margin SVM Formulation

• Find w, b that solve:

      Optimization function:   min ½ ‖w‖²
      Constraint:              yᵢ (w · xᵢ + b) ≥ 1   for all xᵢ

• The problem is convex, so there is a unique global minimum value (when feasible).
• There is also a unique minimizer, i.e. the w and b values that attain the minimum.
• The problem is not solvable if the data is not linearly separable.
• It is a Quadratic Programming problem.
• Very efficient computationally with modern constraint-optimization engines (handles thousands of constraints and training instances).
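Because this is a small quadratic program, it can be solved directly with a generic convex solver. A hedged sketch follows, assuming the cvxpy package is available (the data is invented; in practice scikit-learn's SVC with a very large C gives an equivalent result):

```python
# Hard-margin primal as a QP:  min 0.5*||w||^2  s.t.  y_i (w.x_i + b) >= 1
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 1.5], [5.0, 5.0], [6.0, 5.5]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
print("margin width =", 2.0 / np.linalg.norm(w.value))
```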
Linear, Soft-Margin SVMs

• If the dataset comprises non-separable features, a slack variable ξᵢ is introduced for each training point.

      Optimization function:   min ½ ‖w‖² + C Σᵢ ξᵢ
      Constraints:             yᵢ (w · xᵢ + b) ≥ 1 − ξᵢ   for all xᵢ
                               ξᵢ ≥ 0

• The algorithm tries to keep ξᵢ at zero while maximizing the margin.
• Note: the algorithm does not minimize the number of misclassifications, but the sum of distances from the margin hyperplanes.
• Other formulations use ξᵢ² instead.
• C is used to penalize, and hence decrease, the misclassifications.
• As C → ∞, we get closer to the hard-margin solution.
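The role of C can be seen on a small example (a sketch with invented data, assuming scikit-learn is available): a small C tolerates margin violations in exchange for a wider margin, while a very large C approaches the hard-margin behaviour.

```python
# Effect of C on a linear soft-margin SVM (toy data where the classes nearly touch).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 2.0],     # class -1
              [2.5, 2.5], [5.0, 5.0], [6.0, 5.5]])    # class +1 (first point lies close to class -1)
y = np.array([-1, -1, -1, 1, 1, 1])

for C in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:<8} margin width={2 / np.linalg.norm(w):.3f}  "
          f"training accuracy={clf.score(X, y):.2f}")
```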

Soft Margin: Formulating the Optimization Problem

New primal form for the soft-margin SVM:

      min ½ ‖w‖² + C Σᵢ ξᵢ
      subject to:  yᵢ (w · xᵢ + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0   for all xᵢ

• The objective function penalizes misclassified instances and those within the margin.
• C trades off the margin width against the misclassifications.

[Figure: variables Var1 and Var2 with the hyperplanes w · x + b = 1, w · x + b = 0, and w · x + b = −1; the slack ξᵢ of a violating point and the half-margin 1/‖w‖ are marked]

1. The optimization problem (primal form) is quadratic in nature: it has a quadratic objective with linear constraints.
2. It is easier to solve the optimization problem in the dual space rather than the primal space, since there are fewer variables to handle.
3. Hence the optimization problem is often solved in the dual space by converting the minimization into a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian (using the Lagrange multiplier method) and then applying the Karush–Kuhn–Tucker (KKT) conditions for a saddle point.

New primal form (for reference):  min ½ ‖w‖² + C Σᵢ ξᵢ   s.t.  yᵢ (w · xᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0
(the objective penalizes misclassified instances and those within the margin)

→ Dual form formulation

Dual Problem: Re-expresses the optimization entirely in terms of these multipliers, with
the key data interaction appearing as an inner product.
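For reference, the standard textbook statement of this dual (implied by the Lagrangian/KKT derivation above, not transcribed from the slides) is:

      max over α:   Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ yᵢ yⱼ (xᵢ · xⱼ)
      subject to:   0 ≤ αᵢ ≤ C   and   Σᵢ αᵢ yᵢ = 0

The optimal weights are recovered as w = Σᵢ αᵢ yᵢ xᵢ, and only the support vectors have αᵢ > 0. The inner product xᵢ · xⱼ is exactly the term that a kernel K(xᵢ, xⱼ) replaces in the non-linear case.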

Applications
• Face detection
• Handwriting recognition
• Bioinformatics
• Text classification

LINEAR SVM – NUMERICAL PROBLEMS

[Figure: worked example with the support vectors marked]

Using the Lagrange multiplier method to solve the linear SVM

Multiclass classification
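The basic SVM is a binary classifier, so multiclass problems are usually decomposed into several binary ones (one-vs-one or one-vs-rest). A brief sketch, assuming scikit-learn is available:

```python
# Multiclass SVM via decomposition into binary problems.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # 3-class example dataset

ovo = SVC(kernel="linear")                       # SVC uses one-vs-one internally
ovr = OneVsRestClassifier(SVC(kernel="linear"))  # explicit one-vs-rest wrapper

print("one-vs-one  accuracy:", ovo.fit(X, y).score(X, y))
print("one-vs-rest accuracy:", ovr.fit(X, y).score(X, y))
```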

Advantages and Disadvantages

Advantages:
• Simplified calculation and an efficient algorithm.
• Works efficiently with a relatively large number of features without high computational complexity.
• Avoids overfitting.

Disadvantages:
• The kernel method may still end up overfitting.
• Choosing the optimal kernel may take much computation time.

Transformed input data to higher dimension
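A small illustrative sketch of this idea (the particular mapping and dataset are chosen only for the example, assuming scikit-learn is available): ring-shaped 2-D data is not linearly separable, but after lifting it to 3-D with the extra coordinate x1² + x2², a linear classifier separates it easily.

```python
# Lifting non-separable 2-D data to 3-D with phi(x1, x2) = (x1, x2, x1^2 + x2^2).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X3 = np.c_[X, (X ** 2).sum(axis=1)]            # explicit lift to 3-D

flat = LinearSVC().fit(X, y).score(X, y)       # linear model in the original 2-D space
lifted = LinearSVC().fit(X3, y).score(X3, y)   # linear model after the lift
print("accuracy in 2-D:", flat, " accuracy after lifting to 3-D:", lifted)
```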



[Figure: kernel similarity, showing similar features versus non-similar features]
NON-LINEAR SVM – NUMERICAL EXAMPLE
• The mapping is done to a higher dimension.
• Identify the support vectors.

Discrimination function
• We need the weight parameters w1, w2.
• Geometric & functional margin γ.

FIND THE BEST HYPERPLANE – numerical example

Radial Basis Kernel



Polynomial Kernel
• K(a, b) = (a × b + r)^d
• Let r = ½ and d = 2:
• (a × b + ½)² = (a × b + ½)(a × b + ½)
•             = a²b² + ab + ¼
•             = ab + a²b² + ¼            [reordering the terms]
•             = (a, a², ½) · (b, b², ½)   [a dot product]

The dot product gives the high-dimensional coordinates for the data.
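A quick numerical check of this identity (illustrative values only):

```python
# Verify that (a*b + 1/2)^2 equals the dot product of the mapped features (x, x^2, 1/2).
import numpy as np

def kernel(a, b):
    return (a * b + 0.5) ** 2

def phi(x):
    return np.array([x, x ** 2, 0.5])

a, b = 3.0, -1.5
print(kernel(a, b))        # (a*b + 1/2)^2
print(phi(a) @ phi(b))     # a*b + a^2*b^2 + 1/4  -> same value (16.0 here)
```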
Radial Kernel
• The radial (RBF) kernel finds support vector classifiers in infinite dimensions, so it is not possible to visualize directly what it does.
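A brief sketch of the radial (RBF) kernel, K(a, b) = exp(−γ ‖a − b‖²), with illustrative parameter values (scikit-learn assumed for the classifier part): the similarity is close to 1 for nearby points and decays toward 0 with distance, and the kernelized SVM handles data that no linear boundary can separate.

```python
# RBF kernel as a distance-based similarity, plus an RBF-kernel SVM on ring data.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

print(rbf(np.array([1.0, 1.0]), np.array([1.1, 0.9])))   # nearby points  -> close to 1
print(rbf(np.array([1.0, 1.0]), np.array([5.0, 5.0])))   # distant points -> close to 0

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```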
