
Machine Learning

Chapter 05
Support Vector Machines
Prepared by: Ziad Doughan
Email: [email protected]



Introduction

A Support Vector Machine (SVM) is a powerful Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection.
It is one of the most popular models in Machine Learning, particularly well suited for classification of complex small- or medium-sized datasets.
SVMs are known for their robustness, good generalization ability, and unique global optimum solution.
SVMs are probably the most popular ML approach for supervised learning, yet their principle is very simple.



Introduction

SVM is a sparse (light) technique:
• It requires all the training data during the training phase.
• But it only requires a subset of instances (the support vectors) for future prediction.



Introduction

SVM is a kernel technique:
• It maps the data into a higher-dimensional space before solving the ML task.



Introduction

SVM is a maximum margin separator:
• The hyperplane separating two classes needs to be situated at a maximum distance from the different classes.



Multi-Class Classification Using ANN



Multi-Class Classification Using SVM



ANN vs. SVM Graphically

ANNs use non-linear activation functions, so they can draw complex boundaries while keeping the data unchanged. SVMs only draw straight lines, but they first transform the data to a higher-dimensional space.



SVM From Logistic Regression Perspective

Consider a classification task with two classes c1 and c2.
Suppose you are using logistic regression with the hypothesis hθ(x) = g(θᵀx), where:

$$g(z) = \frac{1}{1 + e^{-z}}$$

So, it predicts y = 1 if hθ(x) ≥ 0.5, i.e. θᵀx ≥ 0.
And it predicts y = 0 if hθ(x) < 0.5, i.e. θᵀx < 0.
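As a minimal sketch of this decision rule (NumPy assumed; theta and the sample inputs are illustrative values, not from the chapter):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    # h_theta(x) >= 0.5 exactly when theta^T x >= 0, so the sigmoid
    # itself is not needed to make the prediction.
    return 1 if theta @ x >= 0.0 else 0

theta = np.array([1.0, -2.0])
print(predict(theta, np.array([3.0, 1.0])))  # theta^T x = 1  -> predicts 1
print(predict(theta, np.array([1.0, 2.0])))  # theta^T x = -3 -> predicts 0
```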



SVM From Logistic Regression Perspective

Logistic Regression vs. SVM cost

(Figure: cost₁, the SVM surrogate for y = 1, and cost₀, the surrogate for y = 0, plotted against the logistic regression cost.)



SVM From Logistic Regression Perspective

Logistic Regression cost function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

SVM cost function, where cost₁ and cost₀ replace the two logistic terms:

$$J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

With the SVM hypothesis:

$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 1 \\ 0 & \text{if } \theta^T x \le -1 \end{cases}$$
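A minimal numerical comparison of the two per-example losses (NumPy assumed; cost1 and cost0 are written here as the usual hinge-style surrogates, an assumption since the chapter only shows them as plots):

```python
import numpy as np

def logistic_loss(z, y):
    # Per-example logistic regression loss as a function of z = theta^T x.
    h = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

def cost1(z):
    # SVM surrogate for y = 1: zero once z >= 1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # SVM surrogate for y = 0: zero once z <= -1, linear penalty otherwise.
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3.0, 3.0, 7)
print(logistic_loss(z, 1))  # smooth, never exactly zero
print(cost1(z))             # exactly zero for z >= 1
```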



Linear SVM Classifier

Suppose we have the dataset below. We can use the dataset for training as:
• Xᵢ, i ∈ [1, n₁] with class 1 labels
• Xⱼ, j ∈ [1, n₂] with class 2 labels



Linear SVM Classifier

Our target is to design a model that will distinguish between these two classes.
Then, given an unknown instance, the model will be able to predict its class label.



Linear SVM Classifier

SVMs are sensitive to feature scales.
In the left plot, the vertical scale is much larger than the horizontal scale.
After feature scaling, the decision boundary in the right plot looks much better.
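A hedged scikit-learn sketch (the library and the toy values are assumptions) showing scaling applied before fitting a linear SVM:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One feature is on a much larger scale than the other.
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 500.0], [4.0, 1500.0]])
y = np.array([0, 0, 1, 1])

# Scaling first keeps the large-scale feature from dominating the margin.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)).fit(X, y)
print(clf.predict([[2.5, 1200.0]]))
```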



What is the goal of the SVM?

The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.
An SVM needs training data, thus it is a supervised learning algorithm.
Also, SVM is a classification algorithm used to predict whether something belongs to a particular class.



What is a Separating Hyperplane?

The word hyperplane is a generalization of a plane:
• In one dimension, it is called a point.
• In two dimensions, it is a line.
• In three dimensions, it is a plane.
• In more dimensions, you can call it a hyperplane.



What is the Optimal Separating Hyperplane?

We can compute the distance between the hyperplane and the closest data point. We double it to get what is called the margin.
The margin is a no man's land: there will never be any data point inside the margin.
The optimal hyperplane is the one with the biggest margin.
Objective: maximize the margin on the training data.
When the data is noisy, we use a soft margin classifier.



What is a Vector?

A vector has a norm (magnitude) and a direction.
Example: X has two coordinates (x₁, x₂) = (3, 4).

The magnitude of X is: $\|X\| = \sqrt{x_1^2 + x_2^2} = \sqrt{3^2 + 4^2} = 5$

The direction of X is the unit vector: $u = \left(\frac{x_1}{\|X\|}, \frac{x_2}{\|X\|}\right) = \left(\frac{3}{5}, \frac{4}{5}\right) = (0.6,\ 0.8)$
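A quick NumPy check of this example:

```python
import numpy as np

X = np.array([3.0, 4.0])
magnitude = np.linalg.norm(X)   # 5.0
u = X / magnitude               # direction: [0.6, 0.8]
print(magnitude, u)
```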



The Dot Product of Two Vectors

The dot product of two vectors X and Y (a.k.a. inner or scalar product) is:

$$X \cdot Y = \|X\|\,\|Y\|\cos(\theta)$$

Suppose we have two perpendicular vectors W = (−b, −a, 1) and X = (1, x, y). Then:

$$W \cdot X = \|W\|\,\|X\|\cos(90°) = 0$$
$$y - ax - b = 0$$
$$y = ax + b \quad \text{(line equation)}$$
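A quick numeric check (the values a = 2, b = 1 and the point (3, 7), which lies on the line y = 2x + 1, are illustrative assumptions):

```python
import numpy as np

a, b = 2.0, 1.0
W = np.array([-b, -a, 1.0])    # (-b, -a, 1) as on the slide
X = np.array([1.0, 3.0, 7.0])  # (1, x, y) with y = a*x + b

print(W @ X)                   # 0.0: the two vectors are perpendicular
```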



Distance From a Point To a Hyperplane

(Figure: the point a = (3, 4) is projected onto the direction of the vector w = (2, 1).)



Distance From a Point To a Hyperplane

$$\|w\| = \sqrt{2^2 + 1^2} = \sqrt{5}, \qquad u = \left(\tfrac{2}{\sqrt{5}}, \tfrac{1}{\sqrt{5}}\right)$$

$$p = (u \cdot a)\,u = \left(\tfrac{2}{\sqrt{5}} \times 3 + \tfrac{1}{\sqrt{5}} \times 4\right)u = \tfrac{10}{\sqrt{5}}\left(\tfrac{2}{\sqrt{5}}, \tfrac{1}{\sqrt{5}}\right) = (4,\ 2)$$

$$\|p\| = \sqrt{4^2 + 2^2} = 2\sqrt{5}, \qquad \text{margin} = 2\|p\| = 4\sqrt{5}$$
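A small NumPy check of this projection (w = (2, 1) and a = (3, 4) as in the figure):

```python
import numpy as np

w = np.array([2.0, 1.0])        # normal direction from the figure
a = np.array([3.0, 4.0])        # point being projected

u = w / np.linalg.norm(w)       # unit vector along w
p = (u @ a) * u                 # projection of a onto w

print(p)                        # [4. 2.]
print(np.linalg.norm(p))        # 2*sqrt(5) ~ 4.472
print(2 * np.linalg.norm(p))    # margin = 4*sqrt(5) ~ 8.944
```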
How Can We Find The Biggest Margin?

Step 1: You have a dataset D, and you want to classify it.

$$D = \left\{ (x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\} \right\}_{i=1}^{n}$$

Step 2: You need to select two hyperplanes separating the data with no points between them.
Any hyperplane H0 can be written as the set of points x satisfying w⋅x + b = 0.
We can get two hyperplanes H1 and H2 which separate the data and have the following equations:
w⋅x + b = 1 and w⋅x + b = −1



How Can We Find The Biggest Margin?

Now we want to be sure that H1 and H2 have no points between them.
So, for each point xi either:
• w⋅xi + b ≥ 1 for xi having the class 1, or
• w⋅xi + b ≤ −1 for xi having the class −1.
Combining both constraints:
yi⋅(w⋅xi + b) ≥ 1 for all 1 ≤ i ≤ n
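A small helper (a sketch, not taken from the slides) that checks this combined constraint on a training set:

```python
import numpy as np

def satisfies_margin(w, b, X, y):
    # True only if y_i * (w . x_i + b) >= 1 for every training point,
    # i.e. no point lies strictly between the hyperplanes H1 and H2.
    return bool(np.all(y * (X @ w + b) >= 1.0))

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(satisfies_margin(np.array([0.5, 0.5]), 0.0, X, y))  # True
print(satisfies_margin(np.array([0.1, 0.1]), 0.0, X, y))  # False: points fall inside the margin
```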



How Can We Find The Biggest Margin?

Now we want to be sure that H1 and H2 have no points between them.
(Figure: the two hyperplanes, with the region w⋅xi + b ≤ −1 on one side and w⋅xi + b ≥ 1 on the other.)
So, for each point xi either:
• w⋅xi + b ≥ 1 for xi having the class 1, or
• w⋅xi + b ≤ −1 for xi having the class −1.



SVM Problem Formulation

Let x be a point on the hyperplane w⋅x + b = −1 and let z = x + k be the point obtained by moving a distance m (the margin width) along the unit normal w/‖w‖, so that z lies on w⋅z + b = 1:

$$w \cdot z + b = 1$$
$$w \cdot (x + k) + b = 1$$
$$w \cdot \left(x + m\frac{w}{\|w\|}\right) + b = 1$$
$$w \cdot x + m\frac{\|w\|^2}{\|w\|} + b = 1$$
$$w \cdot x + b = 1 - m\|w\|$$
$$-1 = 1 - m\|w\|$$
$$m = \frac{2}{\|w\|}$$

To maximize m, minimize ‖w‖.



SVM Problem Formulation

This leads to an optimization problem that minimizes the objective function:

$$J(w) = \frac{1}{2}\|w\|^2$$

subject to the constraints:

$$y_i(w \cdot x_i + b) \ge 1, \quad \text{for any } i = 1, \ldots, n$$

The Lagrangian function of the problem is formed by augmenting the objective function with a weighted sum of the constraints:

$$\mathcal{L}(w, b, \lambda) = \frac{1}{2}w^T w - \sum_{i=1}^{n}\lambda_i\left[y_i(w^T x_i + b) - 1\right]$$



Dual Lagrangian - General Approach

The dual Lagrangian provides a way to simplify the search for a solution of quadratic optimization under inequality constraints.
The idea is to transform the problem into a max(min) problem under complementary slackness:

$$\lambda_i \times g(x_i) = 0 \quad \text{(Karush-Kuhn-Tucker (KKT) dual complementarity condition)}$$

The Lagrange multipliers ($\lambda_i$) are called dual parameters.



Linear SVM

The closest vectors (or points) from each class to the classifier
are known as Support Vectors.
Once w is computed, b is determined from Complementary
Slackness conditions.
The optimal hyperplane classifier of a support vector machine
is unique.
However, the resulting Lagrange multipliers are not unique.



Problem Is Solved Subject To KKT Constraints

Primal constraints:
$$1 - y_i(w^T x_i + b) \le 0, \quad \forall\, i = 1, \ldots, n$$
Dual constraints:
$$\lambda_i \ge 0, \quad \forall\, i = 1, \ldots, n$$
Complementary slackness:
$$\lambda_i\left[y_i(w^T x_i + b) - 1\right] = 0, \quad \forall\, i = 1, \ldots, n$$
The gradient of the Lagrangian w.r.t. the primal variables is zero:
$$\nabla\mathcal{L}(w, b, \lambda) = \begin{pmatrix} w - \sum_{i=1}^{n}\lambda_i y_i x_i \\ -\sum_{i=1}^{n}\lambda_i y_i \end{pmatrix} = 0$$



Problem Is Solved Subject To KKT Constraints

The solution of the previous problem is:

$$w = \sum_{i=1}^{n}\lambda_i y_i x_i, \qquad \sum_{i=1}^{n}\lambda_i y_i = 0$$

The dual problem of SVM optimization is to find:

$$\max_{\lambda}\ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j\, x_i \cdot x_j$$

Subject to: $\sum_{i=1}^{n}\lambda_i y_i = 0$ and $\lambda_i \ge 0,\ \forall\, i = 1, \ldots, n$
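A hedged sketch (assuming NumPy and SciPy are available, with made-up toy data) that solves this dual numerically and then recovers w and b:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values only).
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [2.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

K = (X @ X.T) * np.outer(y, y)   # Gram matrix weighted by the labels

def neg_dual(lam):
    # Negative dual objective: we minimize it instead of maximizing the dual.
    return 0.5 * lam @ K @ lam - lam.sum()

n = len(y)
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0.0, None)] * n,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])

lam = res.x
w = (lam * y) @ X                # w = sum_i lambda_i y_i x_i
sv = int(np.argmax(lam))         # index of a support vector (largest multiplier)
b = y[sv] - w @ X[sv]            # b from complementary slackness
print(w, b)
```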



How to Determine The Classifier Equation?

Case Study: Assume the solution of the dual Lagrangian for a given dataset (x1, x2, y) leaves only two non-zero multipliers, λ = 65.5261 for the support vector (0.3858, 0.4687) with y = 1 and λ = 65.5261 for the support vector (0.4871, 0.6110) with y = −1. Then:

$$w_1 = \sum_{i=1}^{8}\lambda_i y_i x_{i1} = 65.5261 \times 1 \times 0.3858 + 65.5261 \times (-1) \times 0.4871$$

$$w_2 = \sum_{i=1}^{8}\lambda_i y_i x_{i2} = 65.5261 \times 1 \times 0.4687 + 65.5261 \times (-1) \times 0.6110$$



How to Determine The Classifier Equation?

$$w_1 = -6.64, \qquad w_2 = -9.32$$

The bias term b can be computed from each support vector:

$$b_1 = 1 - w \cdot x^{(1)} = 1 - (-6.64 \times 0.3858) - (-9.32 \times 0.4687) \approx 7.93$$

$$b_2 = -1 - w \cdot x^{(2)} = -1 - (-6.64 \times 0.4871) - (-9.32 \times 0.6110) \approx 7.93$$

Averaging these values, we obtain b = 7.93.
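A quick NumPy check of these numbers, using the two support vectors and multipliers from the case study:

```python
import numpy as np

lam = np.array([65.5261, 65.5261])
y = np.array([1.0, -1.0])
X_sv = np.array([[0.3858, 0.4687],
                 [0.4871, 0.6110]])

w = (lam * y) @ X_sv        # -> approximately [-6.64, -9.32]
b = np.mean(y - X_sv @ w)   # average of b over the support vectors -> approximately 7.93
print(w, b)
```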



How to Determine The Classifier Equation?

With w = (−6.64, −9.32) and b = 7.93, the classifier is the hyperplane w⋅x + b = 0; a new instance x is labeled by the sign of −6.64 x₁ − 9.32 x₂ + 7.93.


Soft Margin SVM

When the data is not completely separable, SVM will not search for a hard margin that classifies all the data.
Instead, SVM provides a soft margin that classifies most of the data correctly while allowing the model to misclassify a few points.



Slack Variables

Slack variables ξᵢ (pronounced "xi of i") are added to the SVM objective to allow some misclassifications.
The problem in primal form now becomes a minimization of the following objective function:

$$J(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

subject to two constraints:

$$y_i(w^T x_i + b) \ge 1 - \xi_i, \quad \forall\, i = 1, \ldots, n$$
$$\xi_i \ge 0, \quad \forall\, i = 1, \ldots, n$$



Regularization Term C
$$J(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$

The regularization term C is a parameter that varies depending on the optimization goal:
• As C is increased, a tighter margin is obtained, and more focus is given to minimizing the number of misclassifications. (Lower Bias, Higher Variance)
• As C is decreased, more violations are allowed in order to maximize the margin between the two classes. (Higher Bias, Lower Variance)
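A small scikit-learn experiment (a sketch with made-up Gaussian blobs) illustrating the trade-off: a smaller C tolerates more margin violations, so more training points typically end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.r_[rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.5, 1.0, (20, 2))]
y = np.r_[np.zeros(20), np.ones(20)]

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ counts the support vectors per class.
    print(C, clf.n_support_)
```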



Regularization Term C

For C1 < C2, fewer training points are within the margin for C2 than for C1, but C1 has a wider margin.



Soft Margin Dual Problem

In dual form, the soft margin SVM formulation is:

$$\max_{\lambda}\ \sum_{i=1}^{n}\lambda_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j y_i y_j\, x_i \cdot x_j$$

Subject to: $\sum_{i=1}^{n}\lambda_i y_i = 0$ and $0 \le \lambda_i \le C,\ \forall\, i = 1, \ldots, n$

The soft margin dual problem is equivalent to the hard margin dual problem with bounded dual variables.
Adding slack variables does not affect the complexity of solving the dual problem, but it makes the solution more robust.



Non-Linear SVM and Kernel

A kernel is the central component of an algorithm, system, or application.
Its role is to bridge between applications and the actual data processing (e.g., an OS kernel).
So, the kernel's responsibility is to manage the system's resources between different components.



Kernel Techniques in SVM

The kernel maps data into a higher-dimensional feature space, where each coordinate corresponds to one feature of the data items, transforming the data into a set of points in a Euclidean space.
In that space, many methods can be used to find relations in the data.
The mapping can be quite general, thus not necessarily linear.
Working with these relations through the kernel function, without explicitly computing the mapping, is accordingly called the Kernel Trick.



Kernel Techniques in SVM

When the problem is not linearly separable, soft margin SVM cannot find an efficient and robust separating hyperplane.
For that, the kernel takes the data to a higher-dimensional space, referred to as the kernel space, where it will be linearly separable.

(Figure: a Gaussian RBF mapping that makes the data linearly separable in the kernel space.)
Kernel Techniques in SVM

Gaussian RBF (Radial Basis Function) Kernel:
The radius of the base is σ, and the vector l (the landmark) is the center of the data.
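A minimal sketch of this similarity measure (the helper name and default σ are assumptions; the formula matches the RBF kernel listed on the next slides):

```python
import numpy as np

def gaussian_rbf(x, landmark, sigma=1.0):
    # Similarity of x to the landmark l: 1 at the landmark, decaying with distance.
    return np.exp(-np.sum((x - landmark) ** 2) / sigma ** 2)

l = np.array([1.0, 1.0])
print(gaussian_rbf(np.array([1.0, 1.0]), l))  # 1.0 at the landmark
print(gaussian_rbf(np.array([3.0, 1.0]), l))  # ~0.018 two units away
```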



Kernel Techniques in SVM

Polynomial Kernel:



Popular Kernel Functions

Polynomial function:
$$K(x, y) = (a x^T y + c)^q, \quad q > 0$$
Hyperbolic Tangent (sigmoid):
$$K(x, y) = \tanh(\beta x^T y + \gamma)$$
Gaussian Radial Basis Function (RBF):
$$K(x, y) = e^{-\frac{\|x - y\|^2}{\sigma^2}}$$
Laplacian Radial Basis Function:
$$K(x, y) = e^{-\frac{\|x - y\|}{\sigma}}$$
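A NumPy sketch of these four kernels (the default parameter values are illustrative assumptions):

```python
import numpy as np

def polynomial(x, y, a=1.0, c=1.0, q=3):
    return (a * (x @ y) + c) ** q

def sigmoid_kernel(x, y, beta=1.0, gamma=0.0):
    return np.tanh(beta * (x @ y) + gamma)

def gaussian_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def laplacian_rbf(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

u, v = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial(u, v), sigmoid_kernel(u, v), gaussian_rbf(u, v), laplacian_rbf(u, v))
```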



Popular Kernel Functions



Kernel Parameters

The choice of kernel parameters affects the distribution of the data in the kernel space.
Example: If σ is large, features vary smoothly. (Higher Bias, Lower Variance)
If σ is small, features vary rapidly. (Lower Bias, Higher Variance)
Thus, a suitable parameter value is essential for transforming the data to a linearly separable representation.
A grid search is performed to find the most suitable values for the kernel parameters.
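A hedged grid-search sketch with scikit-learn (the toy ring-shaped data and the parameter grid are assumptions; in scikit-learn's RBF kernel, gamma plays the role of 1/σ²):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # non-linear, ring-like labels

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print(search.best_params_)
```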



Kernel SVM Model

The primal formulation of the kernel SVM is:

$$\min_{w,\,\xi}\ \frac{1}{2}w^T w + C\sum_{i=1}^{n}\xi_i$$

Subject to: $y_i\left(w^T\varphi(x_i) + w_0\right) \ge 1 - \xi_i, \quad \forall\, i = 1, \ldots, n$
$\xi_i \ge 0, \quad \forall\, i = 1, \ldots, n$

Where φ is such that: $K(x_i, x_j) = \varphi(x_i)\cdot\varphi(x_j)$
Kernel functions need to satisfy Mercer's Theorem.



Solution of Optimization Problem

The SVM solution should satisfy the KKT conditions below:
• $w = \sum_{i=1}^{n}\lambda_i y_i \varphi(x_i)$
• $\sum_{i=1}^{n}\lambda_i y_i = 0$
• $C - \mu_i - \lambda_i = 0, \quad \forall\, i = 1, \ldots, n$
• $\lambda_i\left[y_i\left(w^T\varphi(x_i) + w_0\right) - 1 + \xi_i\right] = 0, \quad \forall\, i = 1, \ldots, n$
• $\mu_i\,\xi_i = 0, \quad \forall\, i = 1, \ldots, n$
• $\mu_i,\ \xi_i \ge 0, \quad \forall\, i = 1, \ldots, n$



Multiclass SVM

A multi-class SVM can be solved by one of the strategies below:
• One versus one (OVO): we find hyperplanes separating every combination of 2 classes. During testing, the model assigns a label using a majority vote.
• One versus all (OVA): we find hyperplanes separating each class from the remaining classes. During testing, the model assigns a label based on a multi-stage classification.



One Versus One (OVO)

One vs. One: for an N-class problem, train N·(N−1)/2 binary classifier models.
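A brief scikit-learn sketch (the three-class toy points are made up); note that scikit-learn's SVC trains the N·(N−1)/2 OVO classifiers internally:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5],
              [0, 5], [1, 6], [0, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# For N = 3 classes, OVO trains N*(N-1)/2 = 3 binary classifiers,
# while an OVA scheme would train N = 3 instead.
clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5], [0.5, 5.5]]))
```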



One Versus All (OVA)

One vs. All: for an N-class problem, train N binary classifier models.



End of Chapter 05
