Support vector machines (SVMs)
Dr. Saifullah Khalid
[email protected]
Slides Credit: Mostly based on the UofT Intro to Machine Learning course
Sequence
• Support vector machine (SVM)
• Optimal separating hyperplanes
• Non-separable data
• Kernel method
• Dual formulation of SVM
• From inner products to kernels
Separating Hyperplane?
Support Vector Machine (SVM)
[Figure: a separating hyperplane between two classes, with the support vectors labeled and the margin ("street") to maximize indicated]
• SVMs maximize the margin (the "street") around the separating hyperplane.
• The decision function is fully specified by a (usually very small) subset of the training samples, the support vectors.
Support Vectors
Three support vectors v1, v2, v3, rather than just the three circled points at the tails of those vectors; d denotes half of the street 'width'.
Optimal Separating Hyperplane
• Optimal separating hyperplane: a hyperplane that separates two classes and maximizes the distance to the closest point from either class, i.e., maximizes the margin of the classifier.
• Intuitively, ensuring that a classifier is not too close to any data points leads to better generalization on the test data.
Geometry of Points and Planes
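A minimal sketch of the key fact, assuming the hyperplane is written as wᵀx + b = 0 as in the rest of these slides: the signed distance of a point x' to the hyperplane is
\frac{w^\top x' + b}{\|w\|_2},
so a point with target t' ∈ {−1, +1} that is classified correctly lies at distance t'(wᵀx' + b)/‖w‖₂ from the boundary; this quantity is the geometric margin of that point.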
Maximizing Margin as an Optimization Problem
Algebraic max-margin objective:
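A minimal sketch of this objective, assuming N training examples with targets tᵢ ∈ {−1, +1}:
\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad t_i\,(w^\top x_i + b) \ge 1, \quad i = 1, \dots, N.
Minimizing ‖w‖₂ under these constraints is equivalent to maximizing the margin 1/‖w‖₂, so the closest points end up exactly on the margin boundaries.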
• This is a quadratic program: a quadratic objective with linear inequality constraints.
• The important training examples are the ones with algebraic margin 1; they are called support vectors.
• Hence, this algorithm is called the (hard-margin) Support Vector Machine (SVM).
• SVM-like algorithms are often called max-margin or large-margin methods.
Non-Separable Data Points
• How can we apply the max-margin principle if the data are not linearly separable?
Maximizing Margin for Non-Separable Data Points
Main idea:
• Allow some points to be within the margin or even misclassified; we represent this with slack variables ξᵢ.
• But constrain or penalize the total amount of slack.
Maximizing Margin for Non-Separable Data Points
• Soft-margin SVM objective:
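A sketch of one standard way to write it, with slack variables ξᵢ and penalty hyperparameter γ consistent with the bullets below:
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + \gamma \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad t_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \quad i = 1, \dots, N.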
  • γ is a hyperparameter that trades off the margin with the amount of slack.
    ► For γ = 0, we'll get w = 0. (Why?)
    ► As γ → ∞, we recover the hard-margin objective.
  • Note: it is also possible to constrain ∑ᵢ ξᵢ instead of penalizing it.
From Margin Violation to Hinge Loss
Let's simplify the soft-margin constraints by eliminating ξᵢ.
Recall: tᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, for all i = 1, …, N.
• We would like to find the smallest slack variable ξᵢ that satisfies both ξᵢ ≥ 1 − tᵢ(wᵀxᵢ + b) and ξᵢ ≥ 0.
• Case 1: 1 − tᵢ(wᵀxᵢ + b) ≤ 0. The smallest non-negative ξᵢ that satisfies the constraints is ξᵢ = 0.
• Case 2: 1 − tᵢ(wᵀxᵢ + b) > 0. The smallest ξᵢ that satisfies the constraints is ξᵢ = 1 − tᵢ(wᵀxᵢ + b).
• Hence, ξᵢ = max{0, 1 − tᵢ(wᵀxᵢ + b)}.
• Therefore, the slack penalty can be written as
  ∑ᵢ ξᵢ = ∑ᵢ max{0, 1 − tᵢ(wᵀxᵢ + b)},  with the sums running over i = 1, …, N.
From Margin Violation to Hinge Loss
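Substituting ξᵢ = max{0, 1 − tᵢ(wᵀxᵢ + b)} back in gives an unconstrained training problem; a sketch, under the γ convention used above:
\min_{w,\,b}\ \tfrac{1}{2}\|w\|_2^2 + \gamma \sum_{i=1}^{N} \max\{0,\ 1 - t_i\,(w^\top x_i + b)\}.
The term max{0, 1 − z} is the hinge loss, so the soft-margin SVM amounts to hinge loss on the training data plus an L2 penalty on w.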
Kernel Methods, or the Kernel Trick
Nonlinear Decision Boundaries
• SV classifier: a margin-maximizing linear classifier.
• Linear models are restrictive.
• Q: How can we get nonlinear decision boundaries?
• Feature mapping x → φ(x).
• Q: How do we find good features?
Feature Maps
• For a quadratic decision boundary, what feature mapping do we need?
• One possibility (ignore the √2 factors for now) is sketched after this list.
• We have dim φ(x) = O(d²); in high dimensions, the computational cost might be large.
• Can we avoid the high computational cost?
• Let us take a closer look at the SVM.
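One such mapping, as a sketch for a two-dimensional input x = (x₁, x₂):
\varphi(x) = \big(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\big)^\top, \qquad \varphi(x)^\top \varphi(z) = (1 + x^\top z)^2.
The √2 factors are what let the O(d²)-dimensional inner product collapse to a simple function of xᵀz; this observation is exactly what the kernel trick exploits.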
From Primal to Dual Formulation of SVM
• Recall that the SVM is defined using the following constrained optimization problem:
• We can instead solve a dual optimization problem to obtain w.
  ► We do not derive it here in detail. The basic idea is to form the following Lagrangian, find w as a function of α (and the other variables), and express the Lagrangian only in terms of the dual variables:
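One way to write this Lagrangian, sketched for the hard-margin case with one dual variable αᵢ ≥ 0 per constraint:
\mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|_2^2 - \sum_{i=1}^{N} \alpha_i \big[\, t_i\,(w^\top x_i + b) - 1 \,\big].
Setting ∂L/∂w = 0 gives w = ∑ᵢ αᵢ tᵢ xᵢ, and ∂L/∂b = 0 gives ∑ᵢ αᵢ tᵢ = 0; substituting these back leaves an objective in α alone.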
From Primal to Dual Formulation of SVM
• Primal Optimization Problem:
• Dual Optimization Problem:
• The weights become:
  which is a function of the dual variables αᵢ, i = 1, …, N.
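A sketch of these quantities for the soft-margin problem with penalty γ (the hard-margin case simply drops the upper bound on αᵢ):
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j\, t_i t_j\, x_i^\top x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le \gamma,\ \ \sum_{i=1}^{N} \alpha_i t_i = 0,
with the weights recovered as
w = \sum_{i=1}^{N} \alpha_i\, t_i\, x_i.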
From Primal to Dual Formulation of SVM
• Dual Optimization Problem:
• The weights become:
• The non-zero dual variables αᵢ correspond to observations that satisfy tᵢ(wᵀxᵢ + b) = 1 − ξᵢ; these are the support vectors.
• Observation: the input data only appear in the form of inner products xᵢᵀxⱼ.
SVM in Feature Space
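A sketch of how the dual looks after a feature map φ: every xᵢ is replaced by φ(xᵢ), giving
\max_{\alpha}\ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, t_i t_j\, \varphi(x_i)^\top \varphi(x_j), \qquad y(x) = \operatorname{sign}\Big( \sum_{i=1}^{N} \alpha_i\, t_i\, \varphi(x_i)^\top \varphi(x) + b \Big).
Both training and prediction touch the features only through inner products φ(xᵢ)ᵀφ(xⱼ), never through φ(x) itself.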
From Inner Products to Kernels
Kernels
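A kernel is a function k(x, z) = φ(x)ᵀφ(z) for some feature map φ, evaluated without constructing φ explicitly. A few standard examples, as a sketch (the degree d and width σ are design choices):
k(x, z) = x^\top z \ \ \text{(linear)}, \qquad k(x, z) = (1 + x^\top z)^d \ \ \text{(polynomial of degree } d\text{)}, \qquad k(x, z) = \exp\!\big( -\|x - z\|_2^2 / 2\sigma^2 \big) \ \ \text{(Gaussian / RBF)}.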
Kernelizing SVM
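As a concrete illustration, a minimal NumPy sketch of the kernelized decision rule sign(∑ᵢ αᵢ tᵢ k(xᵢ, x) + b). It assumes the dual variables alpha, targets t_train, and bias b have already been obtained by solving the dual; the variable names and the choice of an RBF kernel here are illustrative, not tied to any particular solver.

import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # Gaussian (RBF) kernel matrix: K[i, j] = exp(-||X1[i] - X2[j]||^2 / (2 sigma^2))
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def svm_predict(X_new, X_train, t_train, alpha, b, kernel=rbf_kernel):
    # Kernelized SVM prediction: sign( sum_i alpha_i * t_i * k(x_i, x) + b ).
    # Only the support vectors (alpha_i > 0) actually contribute to the sum.
    K = kernel(X_train, X_new)            # shape (N_train, N_new)
    scores = (alpha * t_train) @ K + b    # decision value for each new point
    return np.sign(scores)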
Example: Linear SVM
• Solid line: decision boundary; dashed lines: the +1/−1 margin boundaries; purple: the Bayes-optimal boundary.
• Solid dots: support vectors on the margin.
Example: Degree-4 Polynomial Kernel SVM
Example: Gaussian Kernel SVM
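A hedged sketch of how classifiers like the ones in these examples could be fit with scikit-learn. The dataset and the hyperparameter values (C, degree, gamma) are illustrative assumptions, not the settings used to produce the original figures.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Illustrative two-class data; the original figures use a different dataset.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Linear, degree-4 polynomial, and Gaussian (RBF) kernel SVMs.
models = {
    "linear": SVC(kernel="linear", C=1.0),
    "poly-4": SVC(kernel="poly", degree=4, coef0=1.0, C=1.0),
    "rbf":    SVC(kernel="rbf", gamma="scale", C=1.0),
}

for name, clf in models.items():
    clf.fit(X, y)
    # The support vectors are the training points with non-zero dual variables.
    print(name,
          "support vectors:", clf.support_vectors_.shape[0],
          "training accuracy:", clf.score(X, y))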