ML Lecture06
$$K(x, z) = \left( \sum_{i=1}^{n} x_i z_i \right)^d$$
• If we expanded the sum above in the obvious way, we would get $n^d$ terms (i.e., an explicit feature expansion)
• Terms are monomials (products of xi) with total power equal to d.
• Curse of dimensionality: it is very expensive both to optimize and to
predict in primal form
• However, evaluating the dot product of any two feature vectors can be done using $K$ in $O(n)$, as the check below illustrates!
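As a sanity check (not part of the original notes), the following short numpy snippet verifies that the degree-$d$ polynomial kernel equals the dot product of the explicit $n^d$-dimensional feature vectors indexed by ordered tuples $(i_1, \dots, i_d)$; the data and the choice $n = 5$, $d = 3$ are arbitrary.

```python
import itertools
import numpy as np

def poly_kernel(x, z, d):
    # O(n) work: one dot product, then a power.
    return np.dot(x, z) ** d

def explicit_features(x, d):
    # O(n^d) work: one monomial per ordered index tuple (i_1, ..., i_d).
    return np.array([np.prod(x[list(idx)])
                     for idx in itertools.product(range(len(x)), repeat=d)])

rng = np.random.default_rng(0)
x, z, d = rng.normal(size=5), rng.normal(size=5), 3
assert np.isclose(poly_kernel(x, z, d),
                  explicit_features(x, d) @ explicit_features(z, d))
```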
$$h_w(x) = \frac{1}{1 + e^{w^T \phi(x) + w_0}}$$
• Again, we can define the weights in terms of the support vectors: $w = \sum_{i=1}^{m} \alpha_i \phi(x_i)$
• The prediction can now be computed as:
$$h(x) = \frac{1}{1 + e^{\sum_{i=1}^{m} \alpha_i K(x_i, x) + w_0}}$$
• The $\alpha_i$ are the new parameters (one per training instance) and can be fit by gradient descent, as sketched below
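A minimal sketch of that gradient-descent fit, assuming the usual convention $h(x) = 1/(1 + e^{-(\sum_i \alpha_i K(x_i, x) + w_0)})$, an RBF kernel, and an arbitrary fixed step size; none of these choices come from the lecture itself.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_kernel_logreg(X, y, sigma=1.0, lr=0.1, iters=500):
    # y in {0, 1}; one alpha per training instance, plus a bias w0.
    K = rbf_kernel(X, X, sigma)
    alpha, w0 = np.zeros(len(X)), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(K @ alpha + w0)))  # predicted probabilities
        grad = p - y                                 # dLoss/df for cross-entropy
        alpha -= lr * (K @ grad) / len(X)            # chain rule through f = K alpha + w0
        w0 -= lr * grad.mean()
    return alpha, w0
```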
$$\arg\min_{f \in \mathcal{H}} \; \ell\big((x_1, y_1, f(x_1)), \dots, (x_m, y_m, f(x_m))\big) + g(\|f\|_{\mathcal{H}})$$
• We introduce slack variables $\xi_i^+, \xi_i^-$ to account for errors outside the tolerance area
• We need two kinds of variables to account for both positive and negative
errors
• Like before, we can write the Lagrangian and solve the dual form of the
problem
• Kernels can be used as before to obtain non-linear regression functions (a brief sketch follows)
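As a hedged illustration (not from the lecture), scikit-learn's SVR implements exactly this $\epsilon$-insensitive formulation; the synthetic data and hyperparameter values below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.4, size=(40, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=40)

# epsilon sets the width of the tolerance tube; C penalizes points outside it.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(model.support_))  # points on or outside the tube
```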
[Figure: regression fits illustrating the effect of $\epsilon$; one panel shows the fit with $\epsilon = 0.01$. (Zisserman course notes)]
• As $\epsilon$ increases, the function is allowed to move further away from the data points: the fit becomes looser and fewer data points are support vectors
• As $\epsilon$ increases further, the number of support vectors decreases and the fit gets worse (see the sweep below)
• Fitting on a validation set is a search over both $C$ and $\sigma$
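A small follow-up to the bullets above (again with arbitrary synthetic data): sweeping $\epsilon$ while everything else is fixed and counting support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.4, size=(40, 1)), axis=0)
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=40)

for eps in [0.01, 0.1, 0.3]:
    m = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, y)
    print(f"epsilon={eps}: {len(m.support_)} support vectors")  # fewer as eps grows
```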
• For a given separating hyperplane, the margin is two times the (Euclidean)
distance from the hyperplane to the nearest training example.
[Figure: positive (+) and negative (-) training examples separated by a hyperplane; the weight vector $w$ and a marked training point illustrate the distance from the point to the hyperplane.]
$$\gamma_i = \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|}$$
• This suggests:
maximize $M$
with respect to $w, w_0$
subject to $y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{w_0}{\|w\|} \right) \ge M$ for all $i$
• Problems:
– w appears nonlinearly in the constraints.
– This problem is underconstrained. If (w, w0, M ) is an optimal solution,
then so is (βw, βw0, M ) for any β > 0.
[Figure: two scatter plots of the training data in the $(x_1, x_2)$ plane.]
$$\min_{w} f(w) \quad \text{such that } g_i(w) \le 0, \; i = 1 \dots k$$
$$L(w, \alpha) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) \qquad (1)$$
• If $f$ and the $g_i$ are convex and the $g_i$ can all be satisfied simultaneously for some $w$, then we have equality: $d^* = p^* = L(w^*, \alpha^*)$
• Moreover, $w^*, \alpha^*$ solve the primal and dual if and only if they satisfy the following conditions (called Karush-Kuhn-Tucker):
$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*) = 0, \quad i = 1 \dots n \qquad (2)$$
$$\alpha_i^* \, g_i(w^*) = 0, \quad i = 1 \dots k \qquad (3)$$
$$g_i(w^*) \le 0, \quad i = 1 \dots k \qquad (4)$$
$$\alpha_i^* \ge 0, \quad i = 1 \dots k \qquad (5)$$
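One consequence of conditions (3)-(5), spelled out here (it is not stated explicitly above) because it explains why only the support vectors end up with nonzero $\alpha_i$ later on: complementary slackness means a multiplier can be nonzero only when its constraint is active.

$$\alpha_i^* > 0 \;\Rightarrow\; g_i(w^*) = 0, \qquad g_i(w^*) < 0 \;\Rightarrow\; \alpha_i^* = 0$$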
$$L(w, w_0, \alpha) = \frac{1}{2} \|w\|^2 + \sum_i \alpha_i \big(1 - y_i (w \cdot x_i + w_0)\big)$$
⇒ Just like for the perceptron with zero initial weights, the optimal solution
for w is a linear combination of the xi, and likewise for w0.
• The output is
$$h_{w, w_0}(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + w_0 \right)$$
$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j)$$
with constraints: $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$
$$\operatorname{sgn}\left( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + w_0 \right)$$
• The optimal weights, in the expanded feature space, are $w = \sum_{i=1}^{m} \alpha_i y_i \phi(x_i)$.
$$h_{w, w_0}(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i \, \phi(x_i) \cdot \phi(x) + w_0 \right)$$
⇒ Note that to solve the SVM optimization problem in dual form and to
make a prediction, we only ever need to compute dot-products of feature
vectors.
• If we work with the dual, we never actually have to compute the feature mapping $\phi$; we just have to compute the similarity $K$.
• That is, we can solve the dual for the $\alpha_i$:
$$\max \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$$
w.r.t. $\alpha_i$
s.t. $0 \le \alpha_i$, $\sum_{i=1}^{m} \alpha_i y_i = 0$
• The class of a new input $x$ is computed as:
$$h_{w, w_0}(x) = \operatorname{sign}\left( \Big( \sum_{i=1}^{m} \alpha_i y_i \phi(x_i) \Big) \cdot \phi(x) + w_0 \right) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + w_0 \right)$$
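A hedged numpy sketch of this kernelized decision rule, assuming the dual coefficients $\alpha_i$, labels $y_i \in \{-1, +1\}$, and bias $w_0$ have already been obtained from some QP solver; the RBF kernel and its bandwidth are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svm_predict(X_train, y_train, alpha, w0, X_new, sigma=1.0):
    # sign( sum_i alpha_i y_i K(x_i, x) + w0 ) for every new input x
    return np.sign((alpha * y_train) @ rbf(X_train, X_new, sigma) + w0)
```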
• Recall that in the linearly separable case, we compute the solution to the
following optimization problem:
min $\frac{1}{2} \|w\|^2$
w.r.t. $w, w_0$
s.t. $y_i (w \cdot x_i + w_0) \ge 1$
• If we want to allow misclassifications, we can relax the constraints to:
$$y_i (w \cdot x_i + w_0) \ge 1 - \xi_i$$
• Instead of:
min $\frac{1}{2} \|w\|^2$
w.r.t. $w, w_0$
s.t. $y_i (w \cdot x_i + w_0) \ge 1$
we want to solve:
min $\frac{1}{2} \|w\|^2 + C \sum_i \xi_i$
w.r.t. $w, w_0, \xi_i$
s.t. $y_i (w \cdot x_i + w_0) \ge 1 - \xi_i$, $\;\xi_i \ge 0$
• Note that soft errors include points that are misclassified, as well as
points within the margin
• There is a linear penalty for both categories
• The choice of the constant $C$ controls the trade-off between a wide margin and few soft errors, i.e., it controls overfitting (see the sweep below)
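An illustrative sweep over $C$ (synthetic data, arbitrary values, scikit-learn's SVC) showing the trade-off: small $C$ tolerates many soft errors, large $C$ pushes towards fitting the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))  # noisy linear labels
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=C).fit(X_tr, y_tr)
    print(f"C={C}: train acc {clf.score(X_tr, y_tr):.2f}, "
          f"val acc {clf.score(X_va, y_va):.2f}, "
          f"{clf.support_.size} support vectors")
```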
min $\frac{1}{2} \|w\|^2 + C \sum_i \xi_i$
w.r.t. $w, w_0, \xi_i$
s.t. $y_i (w \cdot x_i + w_0) \ge 1 - \xi_i$
$\xi_i \ge 0$
• Like before, we can write a Lagrangian for the problem and then use the
dual formulation to find the optimal parameters:
$$L(w, w_0, \alpha, \xi, \mu) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i + \sum_i \alpha_i \big(1 - \xi_i - y_i (w \cdot x_i + w_0)\big) - \sum_i \mu_i \xi_i$$
• All the previously described machinery can be used to solve this problem
• Note that in addition to the $\alpha_i$ we have coefficients $\mu_i$, which enforce that the slack variables are non-negative but do not appear in the final decision function
• Suppose you want to find the maximum of some function $F(\alpha_1, \dots, \alpha_n)$
• Coordinate ascent optimizes the function by repeatedly picking an αi
and optimizing it, while all other parameters are fixed
• There are different ways of looping through the parameters:
– Round-robin
– Repeatedly pick a parameter at random
– Choose next the variable expected to make the largest improvement
– ...
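A minimal round-robin coordinate-ascent sketch on a concave quadratic, in the spirit of the figure discussed next; the particular quadratic and the starting point $(2, -2)$ are chosen to mirror the text, everything else is illustrative.

```python
import numpy as np

def coordinate_ascent(coord_argmax, x0, sweeps=20):
    # Round-robin: maximize over one coordinate at a time, holding the rest fixed.
    x = np.array(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(len(x)):
            x[i] = coord_argmax(x, i)  # exact 1-D maximizer along coordinate i
    return x

# Example: F(a1, a2) = -(a1^2 + 2*a2^2 - 2*a1*a2), a concave quadratic.
# Setting dF/da_i = 0 with the other coordinate held fixed gives these updates.
def coord_argmax(x, i):
    a1, a2 = x
    return a2 if i == 0 else 0.5 * a1

print(coordinate_ascent(coord_argmax, x0=(2.0, -2.0)))  # converges to (0, 0)
```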
[Figure: contours of a quadratic function with the path taken by coordinate ascent, described below.]
The ellipses in the figure are the contours of a quadratic function that we want to optimize. Coordinate ascent was initialized at $(2, -2)$, and also plotted in the figure is the path that it took on its way to the global maximum. Notice that on each step, coordinate ascent takes a step that is parallel to one of the axes, since only one variable is being optimized at a time.
Our optimization problem (dual form)
$$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \big( \phi(x_i) \cdot \phi(x_j) \big)$$
with constraints: $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$
[Figure: the box constraint $[0, C] \times [0, C]$ for $(\alpha_1, \alpha_2)$, together with the line $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta$ and the resulting bounds $L$ and $H$ on $\alpha_2$.]
From the constraints above, we know that $\alpha_1$ and $\alpha_2$ must lie within the box $[0, C] \times [0, C]$ shown. Also plotted is the line $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta$, on which we know $\alpha_1$ and $\alpha_2$ must lie. Note also that, from these constraints, we know $L \le \alpha_2 \le H$; otherwise, $(\alpha_1, \alpha_2)$ cannot simultaneously satisfy both the box and the straight-line constraint. In this example, $L = 0$, but depending on what the line $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = \zeta$ looks like, this will not always be the case; more generally, there is some lower bound $L$ and some upper bound $H$ on the permissible values of $\alpha_2$ that ensure that $\alpha_1, \alpha_2$ lie within the box (a small helper computing these bounds is sketched below).
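A small helper (not from the lecture) that computes this clipping interval $[L, H]$ and projects the unconstrained update for $\alpha_2$ back onto it; the two cases depend on whether $y^{(1)} = y^{(2)}$.

```python
def smo_bounds(alpha1, alpha2, y1, y2, C):
    # Intersect the box [0, C] x [0, C] with the line alpha1*y1 + alpha2*y2 = zeta,
    # expressed as bounds L <= alpha2 <= H.
    if y1 != y2:
        return max(0.0, alpha2 - alpha1), min(C, C + alpha2 - alpha1)
    return max(0.0, alpha1 + alpha2 - C), min(C, alpha1 + alpha2)

def clip(alpha2_new, L, H):
    # Project the unconstrained alpha2 update back onto [L, H].
    return max(L, min(H, alpha2_new))
```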
SMO(III)
• SVMs are not very intuitive, but typically are more interpretable than
neural nets, if you look at the machine and the misclassifications
• E.g., ovarian cancer data (Haussler): 31 tissue samples of 3 classes; the misclassified examples turned out to be wrongly labelled
• But no biological plausibility!
• Hard to interpret if the percentage of instances that are recruited as
support vectors is high
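A hedged illustration of that last point with scikit-learn (synthetic, arbitrary data): after fitting, inspect what fraction of the training instances were recruited as support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.sign(X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=300))

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
frac = clf.support_.size / len(X)
print(f"{frac:.0%} of training instances are support vectors")
# A high fraction is a warning sign: the learned machine is hard to interpret
# and is likely memorizing rather than finding a sparse solution.
```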