
Support Vector Machine

Jayaraj P B
Outline
1. Finite Dimensional Vector Space
2. Hyperplane
3. SVM – overview
4. Mathematical formulation of the SVM problem
Linear Separators
Binary classification can be viewed as the task of separating classes in feature space:

wᵀx + b = 0
wᵀx + b > 0
wᵀx + b < 0

f(x) = sign(wᵀx + b)
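As a quick sketch of this decision rule in code (a minimal NumPy example; the weight vector w and bias b below are arbitrary values chosen for illustration):

import numpy as np

# Hypothetical separator for a 2-D feature space.
w = np.array([2.0, -1.0])
b = 0.5

def predict(X):
    """Return +1 or -1 depending on which side of the hyperplane each row of X lies."""
    return np.sign(X @ w + b)

X = np.array([[1.0, 0.0],   # w.x + b = +2.5 -> class +1
              [0.0, 3.0]])  # w.x + b = -2.5 -> class -1
print(predict(X))           # [ 1. -1.]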
Linear Separators

Which of the linear separators is optimal?


Optimal Hyperplane
SVM
Support vector machine (SVM) is a supervised learning algorithm used for classification.

It selects a small number of boundary instances, called support vectors, and builds a discriminant function that separates the training examples with a wide margin.

This function is then used for prediction. Support vectors play a key role in the SVM algorithm.

To find the support vectors we use Lagrange duality.

Example
SVM can be understood with an example.

Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, such a model can be built using the SVM algorithm.

We first train the model with many images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it on this strange creature.

The SVM creates a decision boundary between the two classes (cat and dog) using the extreme cases of each class, the support vectors.

On the basis of the support vectors, it will classify the new example as a cat.

Consider the diagram below:
Maximum Margin
Classification Margin
Distance from example xᵢ to the separator is r = |wᵀxᵢ + b| / ‖w‖.

Examples closest to the hyperplane are support vectors.

Margin ρ of the separator is the distance between the support vectors of the two classes.
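A minimal NumPy sketch of these quantities (w, b, and the two points are arbitrary values chosen so that each point sits on a margin boundary):

import numpy as np

# Hypothetical separator for a 2-D problem.
w = np.array([1.0, 1.0])
b = -3.0

def distance_to_separator(x):
    """Geometric distance r = |w.x + b| / ||w|| from a point to the hyperplane."""
    return abs(w @ x + b) / np.linalg.norm(w)

x_pos = np.array([2.0, 2.0])   # w.x + b = +1 (support vector of the + class)
x_neg = np.array([1.0, 1.0])   # w.x + b = -1 (support vector of the - class)

rho = distance_to_separator(x_pos) + distance_to_separator(x_neg)
print(rho, 2 / np.linalg.norm(w))   # both equal 2/||w|| ≈ 1.414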
Hyperplane
There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary for classifying the data points.

This best boundary is known as the hyperplane of the SVM.

The dimension of the hyperplane depends on the number of features in the dataset: with 2 features (as shown in the image), the hyperplane is a straight line.

With 3 features, the hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.
Maximum Margin Classification
Maximizing the margin is good according to intuition and PAC theory.

It implies that only the support vectors matter; the other training examples can be ignored.
Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect its position are termed support vectors. Since these vectors "support" the hyperplane, they are called support vectors.
Hyperplanes in 2D and 3D feature space
The goal of the SVM algorithm is to find the orientation of the separating hyperplane such that it separates the data with the widest gap. By finding such a margin, we obtain both the weight vector w and the offset b.
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be divided into two classes by a single straight line, the data is termed linearly separable and the classifier used is called a linear SVM classifier.

Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be separated by a straight line, the data is termed non-linear and the classifier used is called a non-linear SVM classifier. A short sketch contrasting the two follows.
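A minimal sketch contrasting the two, using scikit-learn's SVC (assumes scikit-learn is installed; the ring-shaped toy data is made up for illustration):

import numpy as np
from sklearn.svm import SVC

# Toy data: class 1 is an inner ring, class 0 an outer ring,
# so no straight line can separate them.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(rng.random(200) < 0.5, 1.0, 3.0)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
y = (radii == 1.0).astype(int)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear SVM accuracy:", linear_svm.score(X, y))  # poor: data is not linearly separable
print("RBF SVM accuracy:", rbf_svm.score(X, y))        # near 1.0 on this data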
Derivation
Maximum Margin

We have to maximize the margin while making sure that all the training data points are on the correct side of the separating hyperplane. This formulation is known as the primal problem.
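In its standard form, with labels yᵢ ∈ {−1, +1}, the hard-margin primal is:

$$
\begin{aligned}
\min_{w,\,b} \quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to} \quad & y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
$$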
We can construct a Lagrangian for this optimization problem.

There will be a Lagrange multiplier corresponding to each constraint.

The Lagrangian L will be:
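Written out in standard form (with multipliers αᵢ ≥ 0):

$$
L(w, b, \alpha) = \tfrac{1}{2}\,\lVert w \rVert^{2} \;-\; \sum_{i=1}^{n} \alpha_i \,\bigl[\, y_i\,(w^{\top} x_i + b) - 1 \,\bigr]
$$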


We can apply the Karush-Kuhn-Tucker (KKT) conditions to the above constrained quadratic programming problem. The KKT conditions are as follows:
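In standard form, they are:

$$
\begin{aligned}
& \nabla_{w} L = 0, \qquad \frac{\partial L}{\partial b} = 0 && \text{(stationarity)} \\
& y_i\,(w^{\top} x_i + b) - 1 \ge 0 && \text{(primal feasibility)} \\
& \alpha_i \ge 0 && \text{(dual feasibility)} \\
& \alpha_i \,\bigl[\, y_i\,(w^{\top} x_i + b) - 1 \,\bigr] = 0 && \text{(complementary slackness)}
\end{aligned}
$$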
We now apply the stationarity conditions to the Lagrangian L, first with respect to w and then with respect to b.
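These conditions, and the dual problem obtained by substituting them back into L, take the standard form:

$$
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
$$

Substituting these into L gives the hard-margin dual:

$$
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i \;-\; \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^{\top} x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
$$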
Soft margin SVM
In the previous section we assumed that the data points were perfectly separable by a hyperplane. However, in most real-world situations that is not the case.

So in soft margin SVM we allow some misclassifications, penalizing the points that fall on the wrong side of the decision boundary, and we minimize this penalty.

Now the support vectors need not lie exactly on the margin; they can fall inside the margin or even beyond the margin hyperplane of the other class.
Soft Margin Classification
What if the training set is not linearly separable?
Slack variables ξᵢ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.

We add a penalty parameter, C, that controls how heavily training errors are penalized.

With the inclusion of the slack variables, our optimization problem becomes:
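In its standard form, with slack variables ξᵢ and penalty parameter C:

$$
\begin{aligned}
\min_{w,\,b,\,\xi} \quad & \tfrac{1}{2}\,\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}
$$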
Going through the same steps as in the hard-margin case, solving the Lagrangian and applying the KKT conditions, we obtain the following dual problem:
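In standard form, it differs from the hard-margin dual only in the box constraint on the multipliers:

$$
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i \;-\; \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^{\top} x_j
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0
$$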
Kernels
Soft margin SVM can improve performance when the data points are not linearly separable; we can improve performance further by mapping the data points to a higher-dimensional feature space using a mapping function.

The hard-margin dual optimization problem involves the data only through the inner product between xᵢ and xⱼ.

This allows us to make the SVM non-linear.

Instead of computing inner products in the original input space x, we compute them in a new feature space φ(x). Given a feature mapping φ, the kernel function is K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ), so φ never has to be evaluated explicitly.
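A minimal NumPy sketch of this idea, using the RBF kernel K(xᵢ, xⱼ) = exp(−γ‖xᵢ − xⱼ‖²) as an example kernel (γ and the data points are arbitrary illustrative values):

import numpy as np

def rbf_gram_matrix(X, gamma=0.5):
    """Gram matrix K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = rbf_gram_matrix(X)
print(K.shape)   # (3, 3) -- the dual problem only needs these pairwise values
print(K[0, 1])   # exp(-0.5 * 1.0) ≈ 0.6065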
SVM - Pros
• Effective on datasets with many features, such as financial or medical data.

• Effective in cases where the number of features is greater than the number of data points.

• Uses a subset of the training points (the support vectors) in the decision function, which makes it memory efficient.

• Different kernel functions can be specified for the decision function. You can use common kernels, but it is also possible to specify custom kernels.
Cons
• If the number of features is much larger than the number of data points, avoiding over-fitting when choosing the kernel function and regularization term is crucial.

• SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

• Works best on small sample sets because of its high training time.

• Since SVMs can use many different kernels, it is important to know about a few of them.
SVM applications
• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992
and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of
classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors
(e.g. graphs, sequences, relational data) by designing kernel
functions for such data.
• SVM techniques have been extended to a number of tasks such as
regression [Vapnik et al. ’97], principal component analysis
[Schölkopf et al. ’99], etc.
• Most popular optimization algorithms for SVMs use decomposition
to hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and
[Joachims ’99]
• Tuning SVMs remains a black art: selecting a specific kernel and its
parameters is usually done by trial and error.
