UET
Since 2004
ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3405 - Machine Learning
Lecture 5: Classification (P3) - SVM
Duc-Trong Le & Hoang Van Xiem
Hanoi, 03/2024
Outline
● Problem and Intuition
● Formulation of Linear SVM
○ Hard Margin SVM
○ Soft Margin SVM
○ Primal/dual Problems
● Nonlinear SVM with Kernel
○ Kernel Tricks
○ SVM with Kernel
● Multi-class classification
Recap: Bayes Theorem & Decision Boundary
$$\underbrace{P(y \mid \mathbf{x})}_{\text{Posterior}} = \frac{\overbrace{P(\mathbf{x} \mid y)}^{\text{Likelihood}}\ \overbrace{P(y)}^{\text{Prior}}}{P(\mathbf{x})}$$
The decision boundary is the set of points where the two classes have equal posterior probability.
History
● SVMs were introduced in COLT-92 by Boser, Guyon & Vapnik, and have become rather popular since
● Theoretically well-motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s
● Empirically good performance: successful applications in many fields
(bioinformatics, text, image recognition, . . . )
● Centralized website: www.kernel-machines.org
Problem Setting
● Problem Setting
○ Training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, $\mathbf{x}_i \in \mathbb{R}^d$
○ For two-class (binary) classification, $y_i \in \{-1, +1\}$
● Goal
○ To find an optimal linear hyperplane (decision boundary) $\mathbf{w}^\top\mathbf{x} + b = 0$ that separates all the data
Intuition
● One possible solution
● Another possible solution
● Too many other possible solutions
● Which one is better than the others? How do we define “better”?
Intuition: Maximum Margin
● Intuition of “Margin”
○ The margin of a linear classifier is the width by which the boundary could be increased before hitting a data point.
● Idea of SVM
○ Find the separating hyperplane that maximizes the margin
Support Vector Machines (SVM)
[Figure: the maximum-margin hyperplane; the training points lying on the margin are the support vectors.]
SVM: Optimization Formulation (1)
● From Margin to Norm
○ Margin: distance between the two hyperplanes $\mathbf{w}^\top\mathbf{x} + b = 1$ and $\mathbf{w}^\top\mathbf{x} + b = -1$, i.e., $\frac{2}{\|\mathbf{w}\|}$
○ Maximizing the margin is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$
● Constraints
○ Separation with margin, i.e., $\mathbf{w}^\top\mathbf{x}_i + b \ge +1$ if $y_i = +1$ and $\mathbf{w}^\top\mathbf{x}_i + b \le -1$ if $y_i = -1$
○ Simplified as the equivalent constraint $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1$ for all $i$
SVM: Optimization Formulation (2)
● SVM as a Quadratic Programming (QP) problem (a solver sketch follows)
$$\min_{\mathbf{w},\, b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1,\ i = 1, \dots, N$$
○ Convex problem with a unique minimum
○ Quadratic objective function
○ Linear inequality constraints
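A minimal sketch of solving this QP in practice, assuming scikit-learn (a library choice not made by the slides); a very large C approximates the hard-margin problem:
```python
# Sketch: approximating the hard-margin QP with scikit-learn (library
# choice is an assumption; any QP solver would do). A very large C
# effectively forbids margin violations on linearly separable data.
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable dataset in R^2 (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [6.0, 5.0], [7.0, 7.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e10)  # C -> infinity ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```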
Linearly Non-separable Cases
● What if the data are not linearly separable?
● In such a case, the hard-margin SVM cannot be applied directly
Soft Margin SVM
● Standard Linear SVM (soft margin)
○ Introduce slack variables $\xi_i \ge 0$
○ Relax the constraints to $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i$
○ Penalize the relaxation in the objective
Primal Problem:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$$
C is a regularization parameter: the soft-margin SVM trades off maximizing the margin against minimizing the misclassification error (a sketch of this trade-off follows).
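A short sketch of the trade-off controlled by C, again assuming scikit-learn; the dataset and parameter values are arbitrary illustrations:
```python
# Sketch (assumes scikit-learn): varying C in a soft-margin linear SVM.
# Small C tolerates violations (wider margin, more support vectors);
# large C penalizes them heavily (narrower margin).
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping blobs -> not perfectly separable
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:6}: margin={margin:.3f}, #support vectors={len(clf.support_)}")
```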
Linearly Non-separable Case
● Re-written as an unconstrained optimization (a minimal solver sketch follows):
$$\min_{\mathbf{w},\, b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\max\big(0,\ 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big)$$
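A minimal NumPy sketch minimizing this unconstrained objective by subgradient descent (illustrative only; the step size, epoch count, and toy data are arbitrary choices, and production solvers such as SMO work differently):
```python
# Sketch: subgradient descent on
#   (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                    # points violating the margin
        # Subgradient of the objective w.r.t. w and b
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage on two Gaussian clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```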
Linearly Non-separable Case
Both models minimize (Training Error + Model Complexity):
● Support Vector Machine: $C\sum_{i}\max\big(0,\ 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big) + \frac{1}{2}\|\mathbf{w}\|^2$ (hinge loss + margin term)
● Regularized logistic regression: $\sum_{i}\log\big(1 + e^{-y_i(\mathbf{w}^\top\mathbf{x}_i + b)}\big) + \lambda\|\mathbf{w}\|^2$ (log loss + $\ell_2$ penalty)
Choice of Parameter C:
● Large C: lower bias, higher variance
● Small C: higher bias, lower variance
Dual Form of SVM
● Lagrangian dual of the soft-margin primal:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, \mathbf{x}_i^\top\mathbf{x}_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \sum_{i=1}^{N}\alpha_i y_i = 0$$
● Optimal solution: $\mathbf{w}^* = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i$; the points with $\alpha_i > 0$ are the support vectors
https://www.quora.com/What-is-primal-and-dual-formulation-in-SVM
Suppose we’re in 1 dimension
● What would SVMs do with this data? Not a big surprise: the maximum-margin separator is simply the point midway between the two classes.
Harder 1-dimensional Dataset
● [Figures: a 1-D dataset with one class nested inside the other is not linearly separable; mapping each point $x_k$ to $(x_k, x_k^2)$ makes the two classes separable by a line.]
SVM: Nonlinear Case
● Limitation of linear SVM
○ Linear SVM classifiers are too restricted for complex classification tasks where the data are not linearly separable in the input space
● Basic Idea of Nonlinear SVM
○ Map the data into a richer feature space that includes nonlinear features, then construct a linear hyperplane in that space (in the same way as before; a sketch follows)
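A tiny NumPy/scikit-learn sketch of this idea on the 1-D example from the earlier slides, using the explicit map $x \mapsto (x, x^2)$ (the data values are made up for illustration):
```python
# Sketch: the classes -1 | +1 | -1 are not separable on the line,
# but mapping x -> (x, x^2) makes them separable by a linear boundary.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0])
y = np.array([-1, -1, 1, 1, 1, -1, -1])    # inner points vs outer points

Z = np.column_stack([x, x ** 2])           # explicit feature map phi(x)
clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print("training accuracy in feature space:", clf.score(Z, y))  # expect 1.0
```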
SVM: Nonlinear Case
● First, define a feature mapping $\phi: \mathbf{x} \mapsto \phi(\mathbf{x})$
● Then learn a hyperplane $\mathbf{w}^\top\phi(\mathbf{x}) + b = 0$ in the feature space
● Almost the same primal form as linear SVM:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\phi(\mathbf{x}_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0$$
SVM: Nonlinear Case
● The dual problem
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, \phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \sum_{i=1}^{N}\alpha_i y_i = 0$$
● The optimal solution: $\mathbf{w}^* = \sum_{i=1}^{N}\alpha_i y_i\, \phi(\mathbf{x}_i)$
How to choose the feature mapping?
• Polynomial mapping: include all monomials of the input features up to degree d
• Example: for $\mathbf{x} = (x_1, x_2) \in \mathbb{R}^2$ and $d = 2$, $\phi(\mathbf{x}) = \big(1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2\big)$
• Problem of using an explicit feature mapping:
• The dimensionality of $\phi(\mathbf{x})$ can be very large (it grows as $O(n^d)$ for $n$ input features), making $\mathbf{w}$ hard to represent explicitly in memory, and hard for the QP to solve
Kernel Tricks
• Idea: Replace the dot product with a kernel function $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top\phi(\mathbf{x}')$
• Not all functions are kernel functions
• A function can be a kernel if it is (a numeric check follows)
○ Symmetric: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x}', \mathbf{x})$
○ Positive semi-definite (PSD): the “Gram matrix” $K$ defined by $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ is PSD (i.e., $\mathbf{z}^\top K \mathbf{z} \ge 0$ for all $\mathbf{z}$)
• Benefits
○ Efficiency: computing the kernel is often cheaper than computing $\phi(\mathbf{x})$, $\phi(\mathbf{x}')$ and their dot product
○ Flexibility: various kernel functions can be chosen, as long as the existence of $\phi$ is guaranteed (Mercer’s condition)
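A quick NumPy check of the two conditions for the RBF kernel (random data; purely illustrative):
```python
# Sketch: numerically checking symmetry and positive semi-definiteness
# of the Gram matrix K_ij = k(x_i, x_j) for the RBF kernel.
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, X)

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to round-off
```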
Kernel Functions
• Linear Kernel: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}'$
• Polynomial Kernel (degree d): $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top\mathbf{x}' + 1)^d$
• Gaussian / RBF Kernel: $k(\mathbf{x}, \mathbf{x}') = \exp\big(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\big)$
Kernel Functions
● Example: for $\mathbf{x}, \mathbf{z} \in \mathbb{R}^2$, the degree-2 polynomial kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z})^2$ expands as
$$(x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \phi(\mathbf{x})^\top\phi(\mathbf{z}), \quad \phi(\mathbf{x}) = \big(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\big)$$
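A numeric check of this identity in NumPy (the concrete vectors are arbitrary):
```python
# Check: (x^T z)^2 equals phi(x)^T phi(z) with
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)        # kernel trick: one dot product, then square
print(phi(x) @ phi(z))     # explicit mapping: same value
```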
Gaussian/RBF Kernel
● The RBF kernel is an inner product in an infinite-dimensional feature space. Assume $x \in \mathbb{R}$ and $\sigma = 1$; by the Taylor expansion of $e^{xz}$,
$$k(x, z) = e^{-\frac{(x - z)^2}{2}} = e^{-\frac{x^2}{2}} e^{-\frac{z^2}{2}} \sum_{k=0}^{\infty}\frac{(xz)^k}{k!} = \phi(x)^\top\phi(z), \quad \phi(x) = e^{-\frac{x^2}{2}}\Big(1,\ x,\ \tfrac{x^2}{\sqrt{2!}},\ \tfrac{x^3}{\sqrt{3!}},\ \dots\Big)$$
Nonlinear SVM with Kernel (1)
● Introduces nonlinearity into the model while staying computationally efficient
● The dual form
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \sum_{i=1}^{N}\alpha_i y_i = 0$$
● The decision function (reproduced in the sketch below)
$$f(\mathbf{x}) = \operatorname{sign}\Big(\sum_{i=1}^{N}\alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b\Big)$$
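A sketch reproducing this decision function from a trained scikit-learn model (library and dataset are assumptions; SVC stores the products $\alpha_i y_i$ in its dual_coef_ attribute):
```python
# Sketch: train an RBF-kernel SVM, then recompute
#   f(x) = sum_i alpha_i y_i k(x_i, x) + b
# from the stored support vectors and compare with the library.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
gamma = 2.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_test = X[:5]
sq = ((clf.support_vectors_[:, None, :] - x_test[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                         # k(sv_i, x) for each test x
f_manual = clf.dual_coef_ @ K + clf.intercept_  # sum_i alpha_i y_i k + b

print(np.allclose(f_manual.ravel(), clf.decision_function(x_test)))  # True
```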
Nonlinear SVM with Kernel (2)–(3)
[Figures only]
Nonlinear SVM with Kernel (4)
● The inner product in the feature space (a similarity score) is computed implicitly
● Any linear classification method can easily be extended to a nonlinear feature space (e.g., kernelized logistic regression)
● Non-vectorial data can be used as well (as long as the kernel matrix is PSD)
● Questions:
○ Which kernel to use? How to set its parameters?
○ One kernel for each feature type, or one for all?
Curse of Kernelization
● Challenge
○ Training kernel classifiers is often much more computationally expensive
○ For kernel SVM, typical QP solvers need O(N³) time; even faster solvers (e.g., SMO) typically need at least O(N²)
○ In contrast, linear classifiers can be trained much faster, typically in linear time O(N)
● Question
○ How to train kernel machines on large-scale datasets?
Kernel Approximation
● Our goal
○ To construct a new representation $\mathbf{z}(\mathbf{x})$ so that $\mathbf{z}(\mathbf{x})^\top\mathbf{z}(\mathbf{x}') \approx k(\mathbf{x}, \mathbf{x}')$
● Linear model
○ The hypothesis can be rewritten as $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{z}(\mathbf{x})$, where $\mathbf{z}(\mathbf{x})$ is the new finite-dimensional representation
○ Then apply linear classifiers on the new representation $\mathbf{z}$
● Two methods (both sketched below)
○ Kernel Functional Approximation: the (random) Fourier method
○ Kernel Matrix Approximation: the Nyström method
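A sketch of both routes using scikit-learn's kernel_approximation module (RBFSampler implements random Fourier features, Nystroem the Nyström method; the library choice and all parameter values are illustrative assumptions):
```python
# Sketch: approximate the RBF kernel with a finite representation z(x),
# then train a fast linear SVM on z(x) instead of a kernel SVM.
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.svm import LinearSVC, SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

for mapper in (RBFSampler(gamma=2.0, n_components=300, random_state=0),
               Nystroem(gamma=2.0, n_components=100, random_state=0)):
    Z = mapper.fit_transform(X)                        # new representation z(x)
    acc = LinearSVC(C=1.0, max_iter=10000).fit(Z, y).score(Z, y)
    print(type(mapper).__name__, "train accuracy:", round(acc, 3))

print("exact kernel SVM:", SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y))
```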
Multi-class Classification
● Consider k classes
● One-against-the-rest: train k binary SVMs:
○ 1st class vs. classes 2, …, k
○ 2nd class vs. classes 1, 3, …, k
○ …
● This yields k decision functions $f_j(\mathbf{x}) = \mathbf{w}_j^\top\mathbf{x} + b_j$, $j = 1, \dots, k$
Multi-class Classification
● Prediction: $\hat{y} = \arg\max_{j}\ \big(\mathbf{w}_j^\top\mathbf{x} + b_j\big)$
● Reason: if $\mathbf{x}$ belongs to the 1st class, then we should have $\mathbf{w}_1^\top\mathbf{x} + b_1 \ge +1$ and $\mathbf{w}_j^\top\mathbf{x} + b_j \le -1$ for $j \ne 1$
Multi-class Classification
● One-against-one: train k(k − 1)/2 binary SVMs, one per pair of classes:
(1,2), (1,3), . . . , (1,k), (2,3), (2,4), . . . , (k−1,k)
● Example: with k = 4 classes, 6 binary SVMs are trained
Multi-class Classification
● For a test point, evaluate all binary SVMs
● Select the class with the largest vote
● Decision values may be used as well, e.g., for tie-breaking (see the sketch below)
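A short scikit-learn sketch contrasting the two strategies (SVC implements one-against-one voting internally; LinearSVC uses one-against-the-rest; the dataset is just an example):
```python
# Sketch: multi-class SVMs on a 3-class problem.
from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)                   # k = 3 classes

ovo = SVC(kernel="linear",
          decision_function_shape="ovo").fit(X, y)  # k(k-1)/2 pairwise voters
ovr = LinearSVC(max_iter=10000).fit(X, y)           # k one-vs-rest scores

print("one-vs-one accuracy: ", ovo.score(X, y))
print("one-vs-rest accuracy:", ovr.score(X, y))
# Pairwise decision values, shape (n_samples, k(k-1)/2) = (150, 3) here
print(ovo.decision_function(X).shape)
```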
Multi-class Classification
● There are many other methods
● A comparison in [Hsu and Lin, 2002]
● Accuracy is similar for many problems
● But 1-against-1 is fastest for training
● Assume the cost of one SVM optimization with n data points is O(n^q), q > 1 (a worked estimate follows)
● 1 vs. all
○ k problems, each with N data points
● 1 vs. 1
○ k(k − 1)/2 problems, each with 2N/k data points on average
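To make the comparison concrete, a worked estimate (the exponent q ≈ 2 is an illustrative assumption, not a figure from the slides):
```latex
% Assumed cost model: one binary SVM on n examples costs O(n^q), q \approx 2.
\begin{align*}
\text{1 vs. all:} &\quad k \cdot O(N^q) \approx O(k N^2) \\
\text{1 vs. 1:}   &\quad \frac{k(k-1)}{2} \cdot O\!\Big(\big(\tfrac{2N}{k}\big)^q\Big)
                   \approx \frac{k(k-1)}{2} \cdot \frac{4N^2}{k^2} \approx O(2N^2)
\end{align*}
```
So for superlinear solvers the one-against-one total cost is roughly independent of k, consistent with the observation that it is fastest for training.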
Chih-Wei Hsu and Chih-Jen Lin, “A comparison of methods for multiclass support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, March 2002.
Summary
● Problem and Intuition
● Formulation of Linear SVM
○ Hard Margin SVM
○ Soft Margin SVM
○ Primal/dual Problems
● Nonlinear SVM with Kernel
○ Kernel Tricks
○ SVM with Kernel
● Multi-class classification
UET
Since 2004
ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
Thank you
Email me
[email protected]