Machine Learning
Support Vector Machine
Lecturer: Duc Dung Nguyen, PhD.
Contact:
[email protected]
Faculty of Computer Science and Engineering
Ho Chi Minh City University of Technology
Contents
1. Analytical Geometry
2. Maximum Margin Classifiers
3. Lagrange Multipliers
4. Non-linearly Separable Data
5. Soft-margin
Analytical Geometry
Maximum Margin Classifiers
Maximum margin classifiers
• Assume that the data are linearly separable
• Decision boundary equation:
y(x) = w·x + b
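As a minimal sketch (not from the slides), the decision function can be evaluated directly; the weight vector and bias below are arbitrary placeholders, not learned values:

import numpy as np

w = np.array([2.0, -1.0])    # hypothetical weight vector
b = 0.5                      # hypothetical bias

def y(x):
    return np.dot(w, x) + b  # y(x) = w·x + b

print(y(np.array([1.0, 1.0])))    # 1.5  -> classified as +1
print(y(np.array([-1.0, 1.0])))   # -2.5 -> classified as -1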
Maximum margin classifiers
• Margin: the smallest distance between the decision boundary and any of the samples.
Maximum margin classifiers
• Support vectors: the samples lying on the two margin boundaries (the samples closest to the decision boundary).
Maximum margin classifiers
• Rescale w and b so that y(x) = +1 or −1 at the support vectors.
Maximum margin classifiers
• Signed distance between the decision boundary and a sample x_n:
y(x_n) / ||w||
• Absolute distance between the decision boundary and a sample x_n:
t_n·y(x_n) / ||w||
where t_n = +1 iff y(x_n) > 0 and t_n = −1 iff y(x_n) < 0
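A small sketch of the two distances; the hyperplane parameters below are placeholders, not learned values:

import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical bias

def signed_distance(x):
    return (np.dot(w, x) + b) / np.linalg.norm(w)        # y(x) / ||w||

def absolute_distance(x, t):
    return t * (np.dot(w, x) + b) / np.linalg.norm(w)    # t·y(x) / ||w||, t in {+1, -1}

x = np.array([1.0, 1.0])
print(signed_distance(x), absolute_distance(x, +1))      # equal, since x lies on the +1 side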
Maximum margin classifiers
• Maximum margin:
arg max_{w,b} { (1/||w||) · min_n [ t_n·(w·x_n + b) ] }
with the constraint:
t_n·(w·x_n + b) ≥ 1
Maximum margin classifiers
• To be optimized:
arg min_{w,b} (1/2)·||w||^2
with the constraint:
t_n·(w·x_n + b) ≥ 1
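This is a quadratic program. As an illustrative sketch (not a method prescribed by the slides), it can be handed to a general-purpose constrained optimizer such as SciPy's SLSQP; the tiny data set below is assumed for illustration:

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed data, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                      # p = (w1, w2, b)
    w = p[:2]
    return 0.5 * np.dot(w, w)          # (1/2)·||w||^2

# One inequality constraint per sample: t_n·(w·x_n + b) − 1 ≥ 0
constraints = [{"type": "ineq",
                "fun": (lambda p, i=i: t[i] * (np.dot(p[:2], X[i]) + p[2]) - 1.0)}
               for i in range(len(X))]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
print("w =", res.x[:2], "b =", res.x[2])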
Lagrange Multipliers
Optimization using Lagrange multipliers
Joseph-Louis Lagrange (born 25 January 1736; died 10 April 1813 in Paris), also
reported as Giuseppe Luigi Lagrange, was an Italian Enlightenment-era
mathematician and astronomer. He made significant contributions to the fields of
analysis, number theory, and both classical and celestial mechanics.
Optimization using Lagrange multipliers
• Problem:
arg max_x f(x)
with the constraint:
g(x) = 0
Optimization using Lagrange multipliers
• Solution is the stationary point of the Lagrange function:
L(x, λ) = f(x) + λ·g(x)
such that:
∂L(x, λ)/∂x_n = ∂f(x)/∂x_n + λ·∂g(x)/∂x_n = 0
and
∂L(x, λ)/∂λ = g(x) = 0
Optimization using Lagrange multipliers
• Example:
f(x) = 1 − u^2 − v^2
with the constraint:
g(x) = u + v − 1 = 0
Optimization using Lagrange multipliers
• Lagrange function:
L(x, λ) = f(x) + λ·g(x) = (1 − u^2 − v^2) + λ·(u + v − 1)
∂L(x, λ)/∂u = ∂f(x)/∂u + λ·∂g(x)/∂u = −2u + λ = 0
∂L(x, λ)/∂v = ∂f(x)/∂v + λ·∂g(x)/∂v = −2v + λ = 0
∂L(x, λ)/∂λ = g(x) = u + v − 1 = 0
• Solution: u = 1/2 and v = 1/2
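The stationary point can be verified symbolically; a minimal SymPy sketch (an illustration, not part of the slides):

import sympy as sp

u, v, lam = sp.symbols("u v lam")
f = 1 - u**2 - v**2
g = u + v - 1
L = f + lam * g

# Stationarity in u and v plus the constraint g = 0
sol = sp.solve([sp.diff(L, u), sp.diff(L, v), sp.diff(L, lam)], [u, v, lam])
print(sol)   # {u: 1/2, v: 1/2, lam: 1}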
Optimization using Lagrange multipliers
• Problem:
arg max_x f(x)
with the inequality constraint:
g(x) ≥ 0
Optimization using Lagrange multipliers
• Solution is the stationary point of the Lagrange function:
L(x, λ) = f(x) + λ·g(x)
such that:
∂L(x, λ)/∂x_n = ∂f(x)/∂x_n + λ·∂g(x)/∂x_n = 0
and
g(x) ≥ 0
λ ≥ 0
λ·g(x) = 0
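For instance, revisiting the earlier example with an inequality constraint: maximize f(x) = 1 − u^2 − v^2 subject to g(x) = u + v − 1 ≥ 0. The unconstrained maximum (u, v) = (0, 0) violates the constraint, so the constraint must be active (λ > 0 and g(x) = 0), and the conditions above reduce to the equality-constrained case, again giving u = v = 1/2 with λ = 1.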
Optimization using Lagrange multipliers
• To be optimized:
arg min_{w,b} (1/2)·||w||^2
with the constraint:
t_n·(w·x_n + b) ≥ 1
• Lagrange function for the maximum margin classifier:
L(w, b, a) = (1/2)·||w||^2 − Σ_{n=1..N} a_n·(t_n·(w·x_n + b) − 1)
with the KKT conditions:
t_n·(w·x_n + b) − 1 ≥ 0
a_n ≥ 0
a_n·(t_n·(w·x_n + b) − 1) = 0
Optimization using Lagrange multipliers
• Lagrange function for the maximum margin classifier:
L(w, b, a) = (1/2)·||w||^2 − Σ_{n=1..N} a_n·(t_n·(w·x_n + b) − 1)
• Solution for w:
∂L(w, b, a)/∂w = 0
w = Σ_{n=1..N} a_n·t_n·x_n
∂L(w, b, a)/∂b = Σ_{n=1..N} a_n·t_n = 0
Optimization using Lagrange multipliers
• Lagrange function for the maximum margin classifier:
L(w, b, a) = (1/2)·||w||^2 − Σ_{n=1..N} a_n·(t_n·(w·x_n + b) − 1)
• Solution for a: dual representation to be optimized
L*(a) = Σ_{n=1..N} a_n − (1/2)·Σ_{n=1..N} Σ_{m=1..N} a_n·a_m·t_n·t_m·(x_n·x_m)
with the constraints:
a_n ≥ 0
Σ_{n=1..N} a_n·t_n = 0
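As a sketch (real implementations use dedicated solvers such as SMO), the dual can also be handed to a general-purpose constrained optimizer; the toy data below are assumed for illustration:

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed data, for illustration only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N = len(X)
Q = (t[:, None] * t[None, :]) * (X @ X.T)     # Q[n, m] = t_n·t_m·(x_n·x_m)

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()          # minimize the negative of L*(a)

res = minimize(neg_dual, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, None)] * N,                              # a_n ≥ 0
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # Σ a_n·t_n = 0
a = res.x
print(np.round(a, 4))   # non-zero entries correspond to the support vectors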
Optimization using Lagrange multipliers
• Lagrange function for the maximum margin classifier:
L(w, b, a) = (1/2)·||w||^2 − Σ_{n=1..N} a_n·(t_n·(w·x_n + b) − 1)
• Solution for a: dual representation to be optimized
L*(a) = Σ_{n=1..N} a_n − (1/2)·Σ_{n=1..N} Σ_{m=1..N} a_n·a_m·t_n·t_m·(x_n·x_m)
Why optimization via dual representation?
• Sparsity: a_n = 0 if x_n is not a support vector.
Optimization using Lagrange multipliers
• Lagrange function for the maximum margin classifier:
L(w, b, a) = (1/2)·||w||^2 − Σ_{n=1..N} a_n·(t_n·(w·x_n + b) − 1)
a_n·(t_n·(w·x_n + b) − 1) = 0
• Solution for b:
b = (1/|S|)·Σ_{n∈S} ( t_n − Σ_{m∈S} a_m·t_m·(x_m·x_n) )
where S is the set of support vectors (a_n ≠ 0)
Optimization using Lagrange multipliers
• Classification:
y(x) = w·x + b = Σ_{n=1..N} a_n·t_n·(x_n·x) + b
y(x) > 0 → +1
y(x) < 0 → −1
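A small helper (hypothetical, not from the slides) that recovers w and b from the dual multipliers and classifies new points; it can be applied to the a, X, t produced by the dual sketch above:

import numpy as np

def svm_from_dual(a, X, t, tol=1e-6):
    S = np.where(a > tol)[0]              # support vector indices (a_n > 0)
    w = (a[S] * t[S]) @ X[S]              # w = Σ a_n·t_n·x_n
    b = np.mean(t[S] - X[S] @ w)          # average of t_n − w·x_n over S
    predict = lambda x: 1 if np.dot(w, x) + b > 0 else -1
    return w, b, predict

With the dual sketch above, w, b, predict = svm_from_dual(a, X, t) reproduces the classification rule y(x) = w·x + b.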
Non-linearly Separable Data
Kernel trick for non-linearly separable data
• Map the data points into a higher-dimensional feature space.
• Example 1:
  • Original space: (x)
  • New space: (x, x^2)
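A minimal sketch of Example 1 with assumed 1-D data: the two classes interleave on the line, so no single threshold on x separates them, but a threshold on x^2 does in the new space:

import numpy as np

x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])   # assumed 1-D samples
t = np.array([-1, -1, 1, 1, 1, -1, -1])                # +1 in the middle, -1 outside

phi = np.column_stack([x, x**2])              # map x -> (x, x^2)
w, b = np.array([0.0, -1.0]), 1.0             # hyperplane x^2 = 1 in the new space
print(np.sign(phi @ w + b))                   # matches t: the classes separate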
Kernel trick for non-linearly separable data
• Example 2:
  • Original space: (u, v)
  • New space: ((u^2 + v^2)^(1/2), arctan(v/u))
Kernel trick for non-linearly separable data
Example 3: XOR function
In1 In2 t
0 0 0
0 1 1
1 0 1
1 1 0
Kernel trick for non-linearly separable data
Example 3: XOR function
In1 In2 In3 Output
0 0 1 1
0 1 0 0
1 0 0 0
1 1 0 1
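A sketch of one common construction for XOR (adding the product x1·x2 as a third input; this is an assumption for illustration, not necessarily the slide's mapping):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
t = np.array([-1, 1, 1, -1])                                  # XOR targets as ±1

phi = np.column_stack([X, X[:, 0] * X[:, 1]])     # add the product feature x1*x2
w, b = np.array([1.0, 1.0, -2.0]), -0.5           # a separating hyperplane in 3-D
print(np.sign(phi @ w + b))                       # [-1.  1.  1. -1.] matches t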
Kernel trick for non-linearly separable data
• Classification in the new space:
y(x) = w·φ(x) + b = Σ_{n=1..N} a_n·t_n·(φ(x_n)·φ(x)) + b
• The computational cost of φ(x_n)·φ(x) is high due to the high dimension of φ(·).
• Kernel trick: replace the inner product with a kernel function
K(x_n, x_m) = φ(x_n)·φ(x_m)
Kernel trick for non-linearly separable data
• A typical kernel function:
K(u, v) = (1 + u·v)^2
The corresponding feature map, for u = (u_1, u_2, ..., u_d):
φ(u) = (1, √2·u_1, √2·u_2, ..., √2·u_d,
        √2·u_1·u_2, √2·u_1·u_3, ..., √2·u_{d−1}·u_d,
        u_1^2, u_2^2, ..., u_d^2)
φ(u)·φ(v) = 1 + 2·Σ_{i=1..d} u_i·v_i + 2·Σ_{i=1..d−1} Σ_{j=i+1..d} u_i·v_i·u_j·v_j + Σ_{i=1..d} u_i^2·v_i^2
          = (1 + u·v)^2 = K(u, v)
• Is the mapped data φ(x) guaranteed to be linearly separable?
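The identity above can be checked numerically; a small sketch for d = 2 (the ordering of the coordinates of φ is one possible choice):

import numpy as np

def K(u, v):
    return (1.0 + np.dot(u, v)) ** 2          # K(u, v) = (1 + u·v)^2

def phi(u):                                   # explicit feature map for d = 2
    u1, u2 = u
    return np.array([1.0,
                     np.sqrt(2) * u1, np.sqrt(2) * u2,
                     np.sqrt(2) * u1 * u2,
                     u1**2, u2**2])

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(np.dot(phi(u), phi(v)), K(u, v))        # both print the same value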
Soft-margin
Soft margin SVM
• Soft-margin SVM: allow some of the training samples to be misclassified.
• Slack variables: ξ_n, one per training sample.
Soft margin SVM
• New constraints:
t_n·(w·x_n + b) ≥ 1 − ξ_n
ξ_n ≥ 0
• To be minimized:
(1/2)·||w||^2 + C·Σ_{n=1..N} ξ_n
where C > 0 controls the trade-off between the margin and the slack-variable penalty.
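A short sketch with scikit-learn's SVC (linear kernel) on assumed toy data containing one overlapping point; varying C shows the trade-off:

import numpy as np
from sklearn.svm import SVC

# Assumed toy data: the last point is labelled -1 but sits inside the +1 cluster
X = np.array([[2.0, 2.0], [3.0, 3.0], [1.5, 2.5], [-2.0, -2.0], [-3.0, -1.0], [2.4, 2.1]])
t = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(C, clf.n_support_, clf.score(X, t))
# Small C tolerates slack (wider margin); large C penalizes slack heavily.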
Summary
• SVM is a sparse kernel method.
• The soft-margin SVM deals with data that remain non-linearly separable even after the kernel mapping.