Machine Learning
Chapter 05
Support Vector Machines
Prepared by: Ziad Doughan
Email: [email protected]
Introduction
A Support Vector Machine (SVM) is a powerful Machine
Learning model, capable of performing linear or nonlinear
classification, regression, and even outlier detection.
It is one of the most popular models in Machine Learning,
particularly well suited for the classification of complex
small- or medium-sized datasets.
SVMs are known for their robustness, good generalization
ability, and unique global optimum solution.
SVMs are probably the most popular ML approach for
supervised learning, yet their principle is very simple.
Introduction
SVM is a sparse (light) technique:
It requires all the training data during the training phase,
but it only requires a subset of the instances for future
prediction.
Introduction
SVM is a kernel technique:
It maps the data into a higher-dimensional space before
solving the ML task.
Introduction
SVM is a maximum margin separator:
The hyperplane separating two classes needs to be situated
at a maximum distance from the different classes.
Multi-Class Classification Using ANN
Multi-Class Classification Using SVM
ANN vs. SVM Graphically
ANNs use non-linear activation functions, so they can draw
complex boundaries while keeping the data unchanged.
SVMs only draw straight lines (hyperplanes), but they first
transform the data to a higher-dimensional space.
SVM From Logistic Regression Perspective
Consider a classification task with two classes c1 & c2.
Suppose you are using logistic regression with a hypothesis
hθ(x) = g(θᵀx), where:
g(z) = 1 / (1 + e⁻ᶻ)
So, it predicts "y = 1 if hθ(x) ≥ 0.5, thus θᵀx ≥ 0"
And it predicts "y = 0 if hθ(x) < 0.5, thus θᵀx < 0"
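As a minimal sketch of this decision rule (the θ and x values below are made up for illustration):
```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(theta, x):
    # Predict y = 1 when h_theta(x) >= 0.5, i.e. when theta^T x >= 0
    return 1 if sigmoid(np.dot(theta, x)) >= 0.5 else 0

theta = np.array([-1.0, 2.0])      # hypothetical parameters
x = np.array([1.0, 0.8])           # hypothetical instance (first entry is the bias feature)
print(predict_logistic(theta, x))  # theta^T x = 0.6 >= 0, so it prints 1
```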
SVM From Logistic Regression Perspective
Figure: Logistic Regression vs. SVM cost; the cost₁ curve applies when y = 1 and the cost₀ curve when y = 0.
SVM From Logistic Regression Perspective
Logistic Regression cost function:
J(θ) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾·log hθ(x⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − hθ(x⁽ⁱ⁾)) ] + (λ/2m) Σⱼ₌₁ⁿ θⱼ²
SVM cost function:
J(θ) = C·Σᵢ₌₁ᵐ [ y⁽ⁱ⁾·cost₁(θᵀx⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·cost₀(θᵀx⁽ⁱ⁾) ] + ½ Σⱼ₌₁ⁿ θⱼ²
With the SVM hypothesis:
hθ(x) = 1 if θᵀx ≥ 1, and hθ(x) = 0 if θᵀx ≤ −1
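To make the comparison concrete, here is a small NumPy sketch of the SVM cost, assuming the usual piecewise-linear (hinge-style) shapes for cost₁ and cost₀ shown in the cost plots:
```python
import numpy as np

def cost1(z):
    # Cost used when y = 1: zero once theta^T x >= 1 (assumed hinge-style shape)
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Cost used when y = 0: zero once theta^T x <= -1 (assumed hinge-style shape)
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, C):
    # J(theta) = C * sum_i [ y_i cost1(theta^T x_i) + (1 - y_i) cost0(theta^T x_i) ]
    #            + 1/2 * sum_j theta_j^2   (theta_0, the bias, is not regularized)
    z = X @ theta
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)
    return data_term + reg_term
```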
Linear SVM Classifier
Suppose we have the below dataset:
We can use the dataset for training as:
• Xᵢ , i ∈ [1, n₁] with class 1 labels
• Xⱼ , j ∈ [1, n₂] with class 2 labels
Linear SVM Classifier
Our target is to design a model that will distinguish between
these two classes.
Then, given an unknown instance, the model will be able to
predict its class label.
Linear SVM Classifier
SVMs are sensitive to the feature scales.
In the left plot, the vertical scale is much larger than the
horizontal scale.
After feature scaling, the decision boundary in the right plot
looks much better.
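A sketch of this preprocessing step, assuming scikit-learn is available; the toy numbers only illustrate mixed feature scales:
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy data: the second feature has a much larger scale than the first
X = [[1.0, 5000.0], [2.0, 4000.0], [3.0, 1000.0], [4.0, 500.0]]
y = [0, 0, 1, 1]

# Scale the features, then fit a linear SVM on the scaled data
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X, y)
print(clf.predict([[2.5, 2000.0]]))
```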
What is the goal of the SVM?
The goal of a support vector machine is to find the optimal
separating hyperplane which maximizes the margin of the
training data.
An SVM needs training data; thus, it is a supervised learning
algorithm.
Also, SVM is a classification algorithm used to predict if
something belongs to a particular class.
What is a Separating Hyperplane?
The word Hyperplane is a
generalization of a plane:
• In one dimension, it is called
a point.
• In two dimensions, it is a
line.
• In three dimensions, it is a
plane.
• In more dimensions you can
call it a hyperplane.
What is the Optimal Separating Hyperplane?
We can compute the distance between the hyperplane and
the closest data point. We double it to get what is called
the margin.
The margin is a no man's land: there will never be any data
point inside the margin.
The optimal hyperplane is the one with the biggest margin.
Objective: to maximize the margin on the training data.
When the data is noisy, we use a soft margin classifier.
What is a Vector?
A vector has a norm (magnitude) and a direction.
Example: X has two coordinates (x₁, x₂).
The magnitude of X is: ‖X‖ = √(x₁² + x₂²)
‖X‖ = √(3² + 4²) = 5
The direction of X is the unit vector: u = (x₁/‖X‖ , x₂/‖X‖)
u = (3/5 , 4/5) = (0.6, 0.8)
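The same computation in NumPy (a minimal sketch):
```python
import numpy as np

x = np.array([3.0, 4.0])
magnitude = np.linalg.norm(x)   # sqrt(3^2 + 4^2) = 5.0
direction = x / magnitude       # unit vector u = (0.6, 0.8)
print(magnitude, direction)
```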
The Dot Product of Two Vectors
The dot product of 2 vectors X and Y (aka inner or scalar
product) is:
X⋅Y = ‖X‖ ‖Y‖ cos(θ)
Suppose we have 2 perpendicular vectors W = (−b, −a, 1) and
X = (1, x, y). Then:
W⋅X = ‖W‖ ‖X‖ cos(θ) = ‖W‖ ‖X‖ cos(90°) = 0
So y − ax − b = 0, i.e.:
y = ax + b (the line equation)
Distance From a Point To a Hyperplane
Figure: a hyperplane with normal vector w = (2, 1) and the point a = (3, 4), whose distance to the hyperplane is computed next.
Distance From a Point To a Hyperplane
‖w‖ = √(2² + 1²) = √5
u = (2/√5 , 1/√5)
p = (u⋅a)·u
p = (2/√5 × 3 + 1/√5 × 4)·u = (10/√5)·u
p = (10/√5)·(2/√5 , 1/√5) = (4, 2)
‖p‖ = √(4² + 2²) = 2√5 , and margin = 2‖p‖ = 4√5
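The projection above can be reproduced with a few lines of NumPy (a sketch of the same numbers):
```python
import numpy as np

w = np.array([2.0, 1.0])        # normal vector of the hyperplane
a = np.array([3.0, 4.0])        # the data point
u = w / np.linalg.norm(w)       # unit vector in the direction of w

p = np.dot(u, a) * u            # projection of a onto w -> (4.0, 2.0)
distance = np.linalg.norm(p)    # 2*sqrt(5) ~ 4.47
margin = 2 * distance           # 4*sqrt(5) ~ 8.94
print(p, distance, margin)
```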
How Can We Find The Biggest Margin?
Step 1: You have a dataset D, and you want to classify it.
D = { (xᵢ , yᵢ) | xᵢ ∈ ℝᵖ , yᵢ ∈ {−1, 1} } , i = 1, …, n
Step 2: You need to select two hyperplanes separating the
data with no points between them.
Any hyperplane H0 can be written as the set of points x
satisfying w⋅x + b = 0.
We can get two hyperplanes H1 and H2 which separate the
data and have the following equations:
w⋅x + b = 1 and w⋅x + b = -1
How Can We Find The Biggest Margin?
Now we want to be sure that H1 and H2 have no points
between them.
So, for each point xi either:
• w⋅xi + b ≥ 1 for xi having the class 1, or
• w⋅xi + b ≤ −1 for xi having the class −1.
Combining both constraints:
yi⋅(w⋅xi + b) ≥ 1 for all 1 ≤ i ≤ n
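A quick sketch of checking this combined constraint (the hyperplane and the two points below are made up):
```python
import numpy as np

w, b = np.array([2.0, 1.0]), -4.0        # hypothetical separating hyperplane
X = np.array([[3.0, 1.0], [0.5, 0.5]])   # one point from each class
y = np.array([1, -1])

# y_i * (w . x_i + b) >= 1 must hold for every training point
print(y * (X @ w + b) >= 1)              # [ True  True ]
```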
How Can We Find The Biggest Margin?
Figure: the separating hyperplane with the two margin hyperplanes H1 and H2; points of class 1 satisfy w⋅xᵢ + b ≥ 1, points of class −1 satisfy w⋅xᵢ + b ≤ −1, so no data points lie between H1 and H2.
SVM Problem Formulation
Let x be a point on the hyperplane w⋅x + b = −1 and let z = x + k be the closest point on the hyperplane w⋅z + b = 1, where k = m·w/‖w‖ and m is the margin width:
w⋅(x + m·w/‖w‖) + b = 1
w⋅x + m·‖w‖²/‖w‖ + b = 1
w⋅x + b = 1 − m‖w‖
−1 = 1 − m‖w‖
m = 2/‖w‖
To maximize m, minimize ‖w‖.
SVM Problem Formulation
This leads to an optimization problem that minimizes the
objective function:
J(w) = ½ wᵀw , subject to the constraints:
yi⋅( w⋅xi + b ) ≥ 1 , for any i = 1, … , n
The Lagrangian function of the problem is formed by
augmenting the objective function with a weighted sum of
the constraints:
ℒ(w, b, λ) = ½ wᵀw − Σᵢ₌₁ⁿ λᵢ [yᵢ(wᵀxᵢ + b) − 1]
Dual Lagrangian - General Approach
The dual Lagrangian provides a way to simplify the search for
the solution of a quadratic optimization under inequality
constraints.
The idea is to transform the problem into a max(min) problem
under complementary slackness:
λᵢ · g(xᵢ) = 0 , the Karush-Kuhn-Tucker (KKT) dual
complementarity condition.
The Lagrange multipliers (λᵢ) are called dual parameters.
Linear SVM
The closest vectors (or points) from each class to the classifier
are known as Support Vectors.
Once w is computed, b is determined from Complementary
Slackness conditions.
The optimal hyperplane classifier of a support vector machine
is unique.
However, the resulting Lagrange multipliers are not unique.
Problem Is Solved Subject To KKT Constraints
Primal constraints:
1 − yᵢ(wᵀxᵢ + b) ≤ 0 , ∀ i = 1, …, n
Dual constraints:
𝜆𝑖 ≥ 0 , ∀ 𝑖 = 1, … , 𝑛
Complementarity slackness:
λᵢ [yᵢ(wᵀxᵢ + b) − 1] = 0 , ∀ i = 1, …, n
Gradient of the Lagrangian w.r.t primal variables is zero:
∂ℒ/∂w = w − Σᵢ₌₁ⁿ λᵢ yᵢ xᵢ = 0  and  ∂ℒ/∂b = −Σᵢ₌₁ⁿ λᵢ yᵢ = 0
Problem Is Solved Subject To KKT Constraints
The solution of the previous problem is:
w = Σᵢ₌₁ⁿ λᵢ yᵢ xᵢ  and  Σᵢ₌₁ⁿ λᵢ yᵢ = 0
The dual problem of SVM optimization is to find:
max over λ of:  Σᵢ₌₁ⁿ λᵢ − ½ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ λᵢ λⱼ yᵢ yⱼ (xᵢ⋅xⱼ)
Subject to: Σᵢ₌₁ⁿ λᵢ yᵢ = 0 and λᵢ ≥ 0 , ∀ i = 1, …, n
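As an illustration only, this dual can be handed to a generic constrained optimizer. The sketch below assumes SciPy's SLSQP solver and a tiny made-up dataset; real SVM libraries use specialized quadratic-programming solvers (e.g., SMO) instead:
```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i . x_j)

def neg_dual(lam):
    # Negative dual objective: minimizing it maximizes the dual
    return 0.5 * lam @ G @ lam - lam.sum()

constraints = ({'type': 'eq', 'fun': lambda lam: lam @ y},)  # sum_i lambda_i y_i = 0
bounds = [(0.0, None)] * len(y)                              # lambda_i >= 0
res = minimize(neg_dual, np.zeros(len(y)), method='SLSQP',
               bounds=bounds, constraints=constraints)

lam = res.x
w = ((lam * y)[:, None] * X).sum(axis=0)    # w = sum_i lambda_i y_i x_i
sv = lam > 1e-6                             # support vectors: nonzero multipliers
b = np.mean(y[sv] - X[sv] @ w)              # from y_i (w . x_i + b) = 1
print(w, b)
```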
How to Determine The Classifier Equation?
Case Study - Assume the solution to the dual Lagrangian for a
given dataset (x₁, x₂, y) is as follows, where only the two
support vectors have nonzero multipliers λᵢ = 65.5261:
w₁ = Σᵢ₌₁⁸ λᵢ yᵢ xᵢ,₁ = 65.5261 × 1 × 0.3858 + 65.5261 × (−1) × 0.4871
w₂ = Σᵢ₌₁⁸ λᵢ yᵢ xᵢ,₂ = 65.5261 × 1 × 0.4687 + 65.5261 × (−1) × 0.6110
How to Determine The Classifier Equation?
w₁ = −6.64 , w₂ = −9.32
The bias term b can be computed from each support vector:
b₁ = 1 − w⋅x⁽¹⁾ = 1 − [(−6.64 × 0.3858) + (−9.32 × 0.4687)] = 7.9300
b₂ = −1 − w⋅x⁽²⁾ = −1 − [(−6.64 × 0.4871) + (−9.32 × 0.6110)] = 7.9289
Averaging these values, we obtain b = 7.93.
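The case-study numbers can be checked directly in NumPy (a sketch using only the two support vectors and the multiplier given above):
```python
import numpy as np

lam = 65.5261
x_pos = np.array([0.3858, 0.4687])   # support vector with y = +1
x_neg = np.array([0.4871, 0.6110])   # support vector with y = -1

w = lam * x_pos - lam * x_neg        # w = sum_i lambda_i y_i x_i ~ (-6.64, -9.32)
b1 = 1 - np.dot(w, x_pos)            # ~ 7.9300
b2 = -1 - np.dot(w, x_neg)           # ~ 7.9289
b = (b1 + b2) / 2                    # ~ 7.93

def classify(x):
    # Final classifier: sign(w . x + b)
    return np.sign(np.dot(w, x) + b)

print(w, b, classify(np.array([0.2, 0.3])))
```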
How to Determine The Classifier Equation?
Figure: the resulting decision boundary −6.64 x₁ − 9.32 x₂ + 7.93 = 0 separating the two classes.
Soft Margin SVM
When the data is not completely separable, SVM will not
search for a hard margin that classifies all the data.
Instead, SVM provides a soft margin that classifies most of the
data correctly while allowing a few points to be misclassified.
Slack Variables
Slack variables ξᵢ (the Greek letter xi) are added to the SVM
objective to allow some misclassifications.
The problem in primal form now becomes a minimization of
the following objective function:
J(w, ξ) = ½ ‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
subject to two constraints:
yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ , ∀ i = 1, …, n
ξᵢ ≥ 0 , ∀ i = 1, …, n
Regularization Term C
J(w, ξ) = ½ ‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
The regularization term C is a parameter that varies
depending on the optimization goal:
As C is increased, a tighter margin is obtained, and more focus
is given to minimizing the number of misclassifications. (Lower
Bias, Higher Variance)
As C is decreased, more violations are allowed in order to
maximize the margin between the two classes. (Higher Bias,
Lower Variance)
Regularization Term C
For C1 < C2, fewer training points fall within the margin for C2
than for C1, but C1 gives a wider margin.
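A sketch of this trade-off, assuming scikit-learn; the synthetic blobs below simply give a slightly overlapping two-class problem:
```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.5, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # margin width = 2 / ||w||
    print(f"C={C}: margin={margin:.2f}, support vectors={clf.n_support_.sum()}")
```
The small C gives a wider margin with more violations; the large C gives a tighter margin that tries harder to classify every training point correctly.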
Soft Margin Dual Problem
In dual form, the soft margin SVM formulation is:
max over λ of:  Σᵢ₌₁ⁿ λᵢ − ½ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ λᵢ λⱼ yᵢ yⱼ (xᵢ⋅xⱼ)
Subject to: Σᵢ₌₁ⁿ λᵢ yᵢ = 0 and 0 ≤ λᵢ ≤ C , ∀ i = 1, …, n
Soft margin dual problem is equivalent to hard margin dual
problem with bounded dual variables.
Adding slack variables does not affect the complexity of
solving the dual problem but makes the solution more robust.
Non-Linear SVM and Kernel
A kernel is the central component of an algorithm, system, or
application.
Its role is to bridge between applications and the actual data
processing (e.g., an operating system's kernel).
So, the kernel's responsibility is to manage the system's
resources between different components.
Kernel Techniques in SVM
The kernel maps the data items into a higher-dimensional
feature space, where each coordinate corresponds to one
feature of the data, transforming the data into a set of points
in a Euclidean space.
In that space, many methods can be used to find relations in
the data.
The mapping can be quite general, thus not necessarily linear.
Finding relations in this way, without explicitly computing the
mapping, is accordingly called the Kernel Trick.
Kernel Techniques in SVM
When the problem is not linearly separable, a soft margin SVM
cannot find an efficient and robust separating hyperplane.
For that, the kernel takes the data to a higher-dimensional
space, referred to as the kernel space, where it will be linearly
separable.
Figure: a Gaussian RBF mapping that makes the data linearly separable in the kernel space.
Kernel Techniques in SVM
Gaussian RBF (Radial Basis Function) Kernel:
The radius of the base is σ, and the vector l (the landmark) is
the center of the data.
Kernel Techniques in SVM
Polynomial Kernel:
Popular Kernel Functions
Polynomial function:
K(x, y) = (a·xᵀy + c)^q , q > 0
Hyperbolic Tangent (sigmoid):
K(x, y) = tanh(β·xᵀy + γ)
Gaussian Radial Basis Function (RBF):
K(x, y) = exp(−‖x − y‖² / σ²)
Laplacian Radial Basis Function:
K(x, y) = exp(−‖x − y‖ / σ)
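These four kernels are straightforward to write in NumPy; the parameter values below (a, c, q, β, γ, σ) are arbitrary defaults chosen for illustration:
```python
import numpy as np

def polynomial_kernel(x, y, a=1.0, c=1.0, q=3):
    # K(x, y) = (a * x.y + c)^q , q > 0
    return (a * np.dot(x, y) + c) ** q

def sigmoid_kernel(x, y, beta=0.5, gamma=-1.0):
    # K(x, y) = tanh(beta * x.y + gamma)
    return np.tanh(beta * np.dot(x, y) + gamma)

def gaussian_rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / sigma^2)
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def laplacian_rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y|| / sigma)
    return np.exp(-np.linalg.norm(x - y) / sigma)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(polynomial_kernel(x, y), gaussian_rbf_kernel(x, y))
```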
Kernel Parameters
The choice of kernel parameters affects the distribution of
data in the kernel space.
Example: If σ is large, features vary smoothly. (Higher Bias,
Lower Variance)
If σ is small, features vary rapidly. (Lower Bias, Higher Variance)
Thus, a suitable parameter value is essential for transforming
the data into a linearly separable representation.
Grid search is performed to find the most suitable values for
the kernel parameters.
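A sketch of such a grid search with scikit-learn (assumed available). Note that scikit-learn parameterizes the RBF kernel with gamma, which plays the role of 1/σ² here; the parameter grid is arbitrary:
```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A small non-linearly-separable toy problem
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```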
Kernel SVM Model
Primal formulation of the kernel SVM is:
min over w, ξ of:  ½ wᵀw + C Σᵢ₌₁ⁿ ξᵢ
Subject to: yᵢ(wᵀφ(xᵢ) + w₀) ≥ 1 − ξᵢ , ∀ i = 1, …, n
and ξᵢ ≥ 0 , ∀ i = 1, …, n
Where φ is such that: K(xᵢ, xⱼ) = φ(xᵢ)⋅φ(xⱼ)
Kernel functions need to satisfy Mercer’s Theorem.
Solution of Optimization Problem
The SVM solution should satisfy the KKT conditions below:
• w = Σᵢ₌₁ⁿ λᵢ yᵢ φ(xᵢ)
• Σᵢ₌₁ⁿ λᵢ yᵢ = 0
• C − μᵢ − λᵢ = 0 , ∀ i = 1, …, n
• λᵢ [yᵢ(wᵀφ(xᵢ) + w₀) − 1 + ξᵢ] = 0 , ∀ i = 1, …, n
• μᵢ ξᵢ = 0 , ∀ i = 1, …, n
• μᵢ ≥ 0 , ξᵢ ≥ 0 , ∀ i = 1, …, n
Multiclass SVM
A Multi-class SVM can be solved by one of the below
strategies:
• One versus one (OVO): we find hyper-planes separating
combinations of 2 classes. During testing, the model
assigns a label using majority vote.
• One versus all (OVA): we find hyper-planes separating
each class from remaining classes. During testing, the
model assigns a label based on a multi-stage classification.
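Both strategies can be sketched with scikit-learn's wrappers around a binary linear SVM (the iris dataset is used only because it has N = 3 classes):
```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                  # N = 3 classes

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # trains N*(N-1)/2 = 3 binary models
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)   # trains N = 3 binary models

print(len(ovo.estimators_), len(ova.estimators_))  # 3 and 3
```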
One Versus One (OVO)
One vs. One: for N classes, train N·(N−1)/2 binary classifier models.
One Versus All (OVA)
One vs. All: for N classes, train N binary classifier models.
End of Chapter 05