Lecture 2: The SVM classifier
C19 Machine Learning
Hilary 2015
Review of linear classifiers
Linear separability
Perceptron
Support Vector Machine (SVM) classifier
Wide margin
Cost function
Slack variables
Loss functions revisited
Optimization
A. Zisserman
Binary Classification
Given training data $(\mathbf{x}_i, y_i)$ for $i = 1 \dots N$, with $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, learn a classifier $f(\mathbf{x})$ such that
$$f(\mathbf{x}_i) \begin{cases} \ge 0 & y_i = +1 \\ < 0 & y_i = -1 \end{cases}$$
i.e. $y_i f(\mathbf{x}_i) > 0$ for a correct classification.
Linear separability
[Figure: examples of data that is linearly separable and data that is not linearly separable]
Linear classifiers
A linear classifier has the form
$$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$
[Figure: a 2D example with axes $X_1, X_2$: the line $f(\mathbf{x}) = 0$ separates the regions $f(\mathbf{x}) > 0$ and $f(\mathbf{x}) < 0$]
in 2D the discriminant is a line
$\mathbf{w}$ is the normal to the line, and $b$ the bias
$\mathbf{w}$ is known as the weight vector
Linear classifiers
A linear classifier has the form
$$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$
[Figure: a plane $f(\mathbf{x}) = 0$ separating two classes of points in 3D]
in 3D the discriminant is a plane, and in nD it is a hyperplane
For a K-NN classifier it was necessary to 'carry' the training data
For a linear classifier, the training data is used to learn $\mathbf{w}$ and then discarded
Only $\mathbf{w}$ is needed for classifying new data
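As a minimal illustration (Python/NumPy, not from the slides; the weight vector, bias and test points are made-up values), classifying a new point needs only w and b:

import numpy as np

def predict(w, b, x):
    """Linear classifier: return +1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# assumed 2D weight vector and bias
w = np.array([1.0, -2.0])
b = 0.5
print(predict(w, b, np.array([3.0, 1.0])))   # w.x + b = 1.5  -> +1
print(predict(w, b, np.array([0.0, 2.0])))   # w.x + b = -3.5 -> -1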
The Perceptron Classifier
Given linearly separable data $\mathbf{x}_i$ labelled into two categories $y_i \in \{-1, 1\}$,
find a weight vector $\mathbf{w}$ such that the discriminant function
$$f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i + b$$
separates the categories for $i = 1, \dots, N$
how can we find this separating hyperplane?
The Perceptron Algorithm
Write the classifier as $f(\mathbf{x}_i) = \tilde{\mathbf{w}}^\top \mathbf{x}_i + w_0 = \mathbf{w}^\top \tilde{\mathbf{x}}_i$
where $\mathbf{w} = (\tilde{\mathbf{w}}, w_0)$ and $\tilde{\mathbf{x}}_i = (\mathbf{x}_i, 1)$
Initialize $\mathbf{w} = 0$
Cycle through the data points $\{\mathbf{x}_i, y_i\}$
if $\mathbf{x}_i$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + y_i\,\mathbf{x}_i$
Until all the data is correctly classified
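A minimal Python/NumPy sketch of this algorithm (the toy data, epoch limit and function name are assumptions for illustration):

import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron on augmented inputs; returns w = (w_tilde, w0)."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 to absorb the bias
    w = np.zeros(X_aug.shape[1])                       # initialize w = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X_aug, y):                   # cycle through the data points
            if yi * np.dot(w, xi) <= 0:                # misclassified point
                w = w + yi * xi                        # update w
                errors += 1
        if errors == 0:                                # all data correctly classified
            break
    return w

# toy linearly separable data (assumed for illustration)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))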
For example in 2D
Initialize $\mathbf{w} = 0$
Cycle through the data points $\{\mathbf{x}_i, y_i\}$
if $\mathbf{x}_i$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + y_i\,\mathbf{x}_i$
Until all the data is correctly classified
[Figure: the weight vector $\mathbf{w}$ and decision boundary before and after an update on a misclassified point $\mathbf{x}_i$ (axes $X_1, X_2$)]
NB after convergence $\mathbf{w} = \sum_i^N \alpha_i \mathbf{x}_i$
Perceptron example
[Figure: the separating line found by the perceptron on a 2D dataset]
if the data is linearly separable, then the algorithm will converge
convergence can be slow
separating line close to training data
we would prefer a larger margin for generalization
What is the best w?
maximum margin solution: most stable under perturbations of the inputs
Support Vector Machine
linearly separable data
[Figure: the maximum margin hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$, at distance $\frac{b}{\|\mathbf{w}\|}$ from the origin, with the support vectors marked]
$$f(\mathbf{x}) = \sum_i \alpha_i\, y_i \left(\mathbf{x}_i^\top \mathbf{x}\right) + b$$
where the sum is over the support vectors
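As a small illustrative sketch (Python/NumPy; the coefficients, support vectors and labels below are made-up values), the classifier can be evaluated directly from this sum:

import numpy as np

def svm_decision(x, support_x, support_y, alpha, b):
    """f(x) = sum_i alpha_i y_i (x_i . x) + b, summed over the support vectors."""
    return np.sum(alpha * support_y * (support_x @ x)) + b

# toy usage with assumed values
sv_x = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_y = np.array([1, -1])
alpha = np.array([0.5, 0.5])
print(svm_decision(np.array([2.0, 0.0]), sv_x, sv_y, alpha, b=0.0))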
SVM sketch derivation
Since $\mathbf{w}^\top\mathbf{x} + b = 0$ and $c(\mathbf{w}^\top\mathbf{x} + b) = 0$ define the same plane, we have the freedom to choose the normalization of $\mathbf{w}$
Choose the normalization such that $\mathbf{w}^\top\mathbf{x}_+ + b = +1$ and $\mathbf{w}^\top\mathbf{x}_- + b = -1$ for the positive and negative support vectors respectively
Then the margin is given by
$$\frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot \left(\mathbf{x}_+ - \mathbf{x}_-\right) = \frac{\mathbf{w}^\top\left(\mathbf{x}_+ - \mathbf{x}_-\right)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$
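A quick numerical check of this margin formula (the weight vector and support vectors are assumed values, chosen so that the normalization above holds):

import numpy as np

w = np.array([1.0, 1.0])          # assumed weight vector
b = -1.0
x_pos = np.array([1.0, 1.0])      # w.x_pos + b = +1
x_neg = np.array([0.0, 0.0])      # w.x_neg + b = -1

margin = np.dot(w / np.linalg.norm(w), x_pos - x_neg)
print(margin, 2 / np.linalg.norm(w))   # both equal sqrt(2) ~ 1.414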
Support Vector Machine
linearly separable data
[Figure: the hyperplanes $\mathbf{w}^\top\mathbf{x} + b = -1, 0, +1$ with the support vectors lying on the margin planes; Margin $= \frac{2}{\|\mathbf{w}\|}$]
SVM Optimization
Learning the SVM can be formulated as an optimization:
$$\max_{\mathbf{w}} \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad \mathbf{w}^\top\mathbf{x}_i + b \ \begin{cases} \ge 1 & \text{if } y_i = +1 \\ \le -1 & \text{if } y_i = -1 \end{cases} \quad \text{for } i = 1 \dots N$$
Or equivalently
$$\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i\left(\mathbf{w}^\top\mathbf{x}_i + b\right) \ge 1 \ \text{for } i = 1 \dots N$$
This is a quadratic optimization problem subject to linear constraints and there is a unique minimum
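In practice this quadratic program is handed to a solver. As a hedged sketch (not part of the lecture), scikit-learn's SVC with a linear kernel and a very large C approximates the hard margin solution; the toy data is assumed:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)     # very large C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))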
Linear separability again: What is the best w?
the points can be linearly separated but there is a very narrow margin
but possibly the large margin solution is better, even though one constraint is violated
In general there is a trade-off between the margin and the number of mistakes on the training data
Introduce slack variables $\xi_i \ge 0$
[Figure: soft margin: hyperplanes $\mathbf{w}^\top\mathbf{x} + b = -1, 0, +1$, Margin $= \frac{2}{\|\mathbf{w}\|}$; the support vectors have $\xi = 0$; a margin-violating point lies a distance $\frac{\xi_i}{\|\mathbf{w}\|}$ inside the margin; a misclassified point lies on the wrong side of the hyperplane]
for $0 < \xi \le 1$ the point is between the margin and the correct side of the hyperplane. This is a margin violation
for $\xi > 1$ the point is misclassified
Soft margin solution
The optimization problem becomes
$$\min_{\mathbf{w}\in\mathbb{R}^d,\ \xi_i\in\mathbb{R}^+} \|\mathbf{w}\|^2 + C\sum_i^N \xi_i \quad \text{subject to} \quad y_i\left(\mathbf{w}^\top\mathbf{x}_i + b\right) \ge 1 - \xi_i \ \text{for } i = 1 \dots N$$
Every constraint can be satisfied if $\xi_i$ is sufficiently large
C is a regularization parameter:
small C allows constraints to be easily ignored: large margin
large C makes constraints hard to ignore: narrow margin
C = $\infty$ enforces all constraints: hard margin
This is still a quadratic optimization problem and there is a unique minimum. Note, there is only one parameter, C.
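A hedged illustration of the role of C using scikit-learn on assumed toy data with one outlier: small C tolerates the margin violation and keeps a wide margin, large C does not:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],      # positive cluster
               rng.randn(20, 2) - [2, 2],      # negative cluster
               [[1.5, 1.5]]])                  # one outlier labelled negative
y = np.array([1] * 20 + [-1] * 20 + [-1])

for C in [0.1, 1000.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print("C = %g: margin = %.3f, support vectors = %d" % (C, margin, len(clf.support_)))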
[Figure: a 2D dataset (feature x vs feature y) that is linearly separable, but only with a narrow margin]
C = Infinity: hard margin
C = 10: soft margin
[Figure: the hard margin (C = $\infty$) and soft margin (C = 10) decision boundaries and margins on this dataset]
Application: Pedestrian detection in Computer Vision
Objective: detect (localize) standing humans in an image
cf face detection with a sliding window classifier
reduces object detection to binary classification:
does an image window contain a person or not?
Method: the HOG detector
Training data and features
Positive data 1208 positive window examples
Negative data 1218 negative window examples (initially)
Feature: histogram of oriented gradients (HOG)
[Figure: an image of a pedestrian, the dominant gradient direction in each cell, and the resulting HOG descriptor; each cell is represented by a histogram of frequency vs orientation]
tile the window into 8 x 8 pixel cells
each cell represented by a HOG
Feature vector dimension = 16 x 8 (for tiling) x 8 (orientations) = 1024
[Figure: averaged positive examples]
Algorithm
Training (Learning)
Represent each example window by a HOG feature vector $\mathbf{x}_i \in \mathbb{R}^d$, with d = 1024
Train an SVM classifier
Testing (Detection)
Sliding window classifier
$$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$
Dalal and Triggs, CVPR 2005
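A sketch of the detection step in Python, assuming scikit-image's HOG implementation and an already-trained linear SVM (w, b); the window size, cell size, stride, threshold and HOG parameters are illustrative assumptions, not the exact Dalal and Triggs settings:

import numpy as np
from skimage.feature import hog

def detect(image, w, b, win_h=128, win_w=64, stride=8, threshold=0.0):
    """Slide a window over a grayscale image and score each position with f(x) = w.x + b."""
    detections = []
    H, W = image.shape
    for r in range(0, H - win_h + 1, stride):
        for c in range(0, W - win_w + 1, stride):
            window = image[r:r + win_h, c:c + win_w]
            x = hog(window, orientations=8, pixels_per_cell=(8, 8),
                    cells_per_block=(1, 1))        # HOG feature vector (here 16 x 8 x 8 = 1024-d)
            score = np.dot(w, x) + b               # linear SVM score; w must match the HOG dimension
            if score > threshold:
                detections.append((r, c, score))
    return detections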
Learned model
$$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$$
[Figure: visualization of the learned weight vector as a HOG template. Slide from Deva Ramanan]
Optimization
Learning an SVM has been formulated as a constrained optimization problem over $\mathbf{w}$ and $\xi_i$
$$\min_{\mathbf{w}\in\mathbb{R}^d,\ \xi_i\in\mathbb{R}^+} \|\mathbf{w}\|^2 + C\sum_i^N \xi_i \quad \text{subject to} \quad y_i\left(\mathbf{w}^\top\mathbf{x}_i + b\right) \ge 1 - \xi_i \ \text{for } i = 1 \dots N$$
The constraint $y_i\left(\mathbf{w}^\top\mathbf{x}_i + b\right) \ge 1 - \xi_i$ can be written more concisely as
$$y_i f(\mathbf{x}_i) \ge 1 - \xi_i$$
which, together with $\xi_i \ge 0$, is equivalent to
$$\xi_i = \max\left(0, 1 - y_i f(\mathbf{x}_i)\right)$$
Hence the learning problem is equivalent to the unconstrained optimization problem over $\mathbf{w}$
$$\min_{\mathbf{w}\in\mathbb{R}^d} \underbrace{\|\mathbf{w}\|^2}_{\text{regularization}} + \underbrace{C\sum_i^N \max\left(0, 1 - y_i f(\mathbf{x}_i)\right)}_{\text{loss function}}$$
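This objective transcribes directly into code. A minimal NumPy sketch (the toy data in the usage example is assumed):

import numpy as np

def svm_objective(w, b, X, y, C):
    """Regularized hinge-loss objective: ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i))."""
    scores = X @ w + b                         # f(x_i) for all i
    hinge = np.maximum(0.0, 1.0 - y * scores)  # per-point hinge loss
    return np.dot(w, w) + C * hinge.sum()

# toy usage (assumed data)
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=10.0))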
Loss function
$$\min_{\mathbf{w}\in\mathbb{R}^d} \|\mathbf{w}\|^2 + C\sum_i^N \max\left(0, 1 - y_i f(\mathbf{x}_i)\right)$$
[Figure: the hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$, its margin, the support vectors, and the points contributing to the loss]
Points are in three categories:
1. $y_i f(\mathbf{x}_i) > 1$: point is outside the margin. No contribution to the loss
2. $y_i f(\mathbf{x}_i) = 1$: point is on the margin. No contribution to the loss. As in the hard margin case.
3. $y_i f(\mathbf{x}_i) < 1$: point violates the margin constraint. Contributes to the loss
Loss functions
[Figure: loss plotted against $y_i f(\mathbf{x}_i)$: the 0-1 loss and the hinge loss]
SVM uses the hinge loss $\max\left(0, 1 - y_i f(\mathbf{x}_i)\right)$
an approximation to the 0-1 loss
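A small numerical comparison of the two losses as functions of $y_i f(\mathbf{x}_i)$ (Python, not from the slides):

import numpy as np

z = np.linspace(-2, 2, 9)              # z = y_i * f(x_i)
hinge = np.maximum(0.0, 1.0 - z)       # SVM hinge loss
zero_one = (z < 0).astype(float)       # 0-1 loss: 1 if misclassified, 0 otherwise
for zi, h, o in zip(z, hinge, zero_one):
    print("%5.1f  hinge=%.1f  0-1=%.0f" % (zi, h, o))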
Optimization continued
$$\min_{\mathbf{w}\in\mathbb{R}^d} C\sum_i^N \max\left(0, 1 - y_i f(\mathbf{x}_i)\right) + \|\mathbf{w}\|^2$$
[Figure: a non-convex function with a local minimum and a global minimum]
Does this cost function have a unique solution?
Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?
If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)
Convex functions
[Figure: examples of convex and non-convex functions]
A non-negative sum of convex functions is convex
SVM:
$$\min_{\mathbf{w}\in\mathbb{R}^d} C\sum_i^N \max\left(0, 1 - y_i f(\mathbf{x}_i)\right) + \|\mathbf{w}\|^2$$
is convex
Gradient (or steepest) descent algorithm for SVM
To minimize a cost function $\mathcal{C}(\mathbf{w})$ use the iterative update
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \nabla_{\mathbf{w}} \mathcal{C}(\mathbf{w}_t)$$
where $\eta$ is the learning rate.
First, rewrite the optimization problem as an average
$$\min_{\mathbf{w}} \mathcal{C}(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{N}\sum_i^N \max\left(0, 1 - y_i f(\mathbf{x}_i)\right) = \frac{1}{N}\sum_i^N \left(\frac{\lambda}{2}\|\mathbf{w}\|^2 + \max\left(0, 1 - y_i f(\mathbf{x}_i)\right)\right)$$
(with $\lambda = 2/(NC)$ up to an overall scale of the problem) and $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$
Because the hinge loss is not differentiable, a sub-gradient is computed
Sub-gradient for hinge loss
$$\mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w}) = \max\left(0, 1 - y_i f(\mathbf{x}_i)\right), \qquad f(\mathbf{x}_i) = \mathbf{w}^\top\mathbf{x}_i + b$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -y_i\,\mathbf{x}_i \quad \text{if } y_i f(\mathbf{x}_i) < 1, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \quad \text{otherwise}$$
[Figure: the hinge loss plotted against $y_i f(\mathbf{x}_i)$, with the sub-gradient in each region]
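The same sub-gradient written as a small Python function (the variable names are illustrative):

import numpy as np

def hinge_subgradient(w, b, xi, yi):
    """Sub-gradient of max(0, 1 - y_i f(x_i)) with respect to w."""
    if yi * (np.dot(w, xi) + b) < 1:
        return -yi * xi      # margin violated: dL/dw = -y_i x_i
    return np.zeros_like(w)  # otherwise the sub-gradient is 0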
Sub-gradient descent algorithm for SVM
$$\mathcal{C}(\mathbf{w}) = \frac{1}{N}\sum_i^N \left(\frac{\lambda}{2}\|\mathbf{w}\|^2 + \mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w})\right)$$
The iterative update is
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta \nabla_{\mathbf{w}_t} \mathcal{C}(\mathbf{w}_t) = \mathbf{w}_t - \eta \frac{1}{N}\sum_i^N \left(\lambda \mathbf{w}_t + \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w}_t)\right)$$
where $\eta$ is the learning rate.
Then each iteration t involves cycling through the training data with the updates:
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta\left(\lambda \mathbf{w}_t - y_i \mathbf{x}_i\right) \quad \text{if } y_i f(\mathbf{x}_i) < 1$$
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta\,\lambda\,\mathbf{w}_t \quad \text{otherwise}$$
In the Pegasos algorithm the learning rate is set at $\eta_t = \frac{1}{\lambda t}$
Pegasos Stochastic Gradient Descent Algorithm
Randomly sample from the training data
[Figure: the objective (energy) decreasing over a few hundred iterations, and the resulting decision boundary and margins on a 2D dataset]
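A hedged Python sketch of a Pegasos-style stochastic sub-gradient update, with the bias absorbed into an augmented feature (a simplification; the toy data, regularization strength and iteration count are assumptions):

import numpy as np

def pegasos(X, y, lam=0.1, iterations=1000, seed=0):
    """Stochastic sub-gradient descent for the SVM with learning rate 1/(lam*t)."""
    rng = np.random.RandomState(seed)
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 to absorb the bias
    w = np.zeros(X_aug.shape[1])
    for t in range(1, iterations + 1):
        i = rng.randint(len(y))                        # randomly sample a training point
        eta = 1.0 / (lam * t)                          # Pegasos learning rate
        if y[i] * np.dot(w, X_aug[i]) < 1:             # margin violated
            w = w - eta * (lam * w - y[i] * X_aug[i])
        else:
            w = w - eta * lam * w
    return w[:-1], w[-1]                               # (w, b)

# toy usage (assumed data)
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = pegasos(X, y)
print("w =", w, "b =", b)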
Background reading and more
Next lecture: we will see that the SVM can be expressed as a sum over the support vectors:
$$f(\mathbf{x}) = \sum_{i \in \text{support vectors}} \alpha_i\, y_i \left(\mathbf{x}_i^\top \mathbf{x}\right) + b$$
On web page:
http://www.robots.ox.ac.uk/~az/lectures/ml
links to SVM tutorials and video lectures
MATLAB SVM demo