Linear classifiers:
Parameter learning
Quality metric for logistic regression:
Maximum likelihood estimation
P(y=+1|x,ŵ) = 1 / (1 + e^(−ŵᵀ h(x)))
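A minimal NumPy sketch of this prediction (the feature map h(x) = [1, #awesome, #awful] and the coefficient values are illustrative assumptions, not from the slides):

```python
import numpy as np

def predict_probability(h_x, w_hat):
    """P(y=+1 | x, w) = 1 / (1 + exp(-w^T h(x))) for a single feature vector h(x)."""
    score = np.dot(w_hat, h_x)               # score = w^T h(x)
    return 1.0 / (1.0 + np.exp(-score))      # sigmoid squashes the score into (0, 1)

# Hypothetical coefficients for h(x) = [1, #awesome, #awful]
w_hat = np.array([0.0, 1.0, -1.5])
print(predict_probability(np.array([1.0, 2.0, 1.0]), w_hat))  # ~0.62
```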
[Block diagram: Training Data → x → Feature extraction → h(x) → ML model → ŷ; the Quality metric (comparing against y) drives the ML algorithm, which learns ŵ]
Learning problem
Training data:
N observations (xi,yi)
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
0 2 -1
3 3 -1
4 1 +1
1 1 +1
2 4 -1
0 3 -1
0 1 -1
2 1 +1

Optimize quality metric on training data → ŵ
Finding best coefficients
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
0 2 -1
3 3 -1
4 1 +1
1 1 +1
2 4 -1
0 3 -1
0 1 -1
2 1 +1
Finding best coefficients
Negative data points:
x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
3 3 -1
2 4 -1
0 3 -1
0 1 -1

Positive data points:
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
4 1 +1
1 1 +1
2 1 +1
Finding best coefficients
For the negative data points, a good ŵ should give P(y=+1|xi,ŵ) close to 0.0; for the positive data points, P(y=+1|xi,ŵ) close to 1.0.
Pick ŵ that makes the predicted probabilities match the observed labels as closely as possible.
Quality metric = Likelihood function
Negative data points: want P(y=+1|xi,ŵ) ≈ 0.0. Positive data points: want P(y=+1|xi,ŵ) ≈ 1.0.
No ŵ achieves perfect predictions (usually).
Likelihood ℓ(w): measures quality of fit for the model with coefficients w.
Data likelihood
Quality metric: probability of data
Example x[1]=2, x[2]=1, y=+1: if the model is good, it should predict y=+1, so pick w to maximize P(y=+1|x[1]=2, x[2]=1,w).
Example x[1]=0, x[2]=2, y=-1: if the model is good, it should predict y=-1, so pick w to maximize P(y=-1|x[1]=0, x[2]=2,w).
Maximizing likelihood
(probability of data)
Data point  x[1]  x[2]  y
x1,y1       2     1     +1
x2,y2       0     2     -1
x3,y3       3     3     -1
x4,y4       4     1     +1
x5,y5       1     1     +1
x6,y6       2     4     -1
x7,y7       0     3     -1
x8,y8       0     1     -1
x9,y9       2     1     +1

Choose w to maximize the probability of each observation; must combine into a single measure of quality.
Learn logistic regression model with
maximum likelihood estimation (MLE)
Data point  x[1]  x[2]  y    Choose w to maximize
x1,y1       2     1     +1   P(y=+1|x[1]=2, x[2]=1,w)
x2,y2       0     2     -1   P(y=-1|x[1]=0, x[2]=2,w)
x3,y3       3     3     -1   P(y=-1|x[1]=3, x[2]=3,w)
x4,y4       4     1     +1   P(y=+1|x[1]=4, x[2]=1,w)
…

ℓ(w) = P(y1|x1,w) · P(y2|x2,w) · P(y3|x3,w) · P(y4|x4,w) · … = ∏_{i=1}^N P(yi|xi,w)
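A minimal NumPy sketch of this likelihood, assuming H is an N×D matrix whose rows are the feature vectors h(xi) and y holds labels in {+1, -1} (the names are illustrative):

```python
import numpy as np

def data_likelihood(H, y, w):
    """l(w) = prod_i P(y_i | x_i, w), with labels y_i in {+1, -1}."""
    scores = H.dot(w)                                      # score(x_i) = w^T h(x_i)
    p_plus = 1.0 / (1.0 + np.exp(-scores))                 # P(y = +1 | x_i, w)
    per_point = np.where(y == +1, p_plus, 1.0 - p_plus)    # P(y_i | x_i, w)
    return np.prod(per_point)                              # product over all N points
```

In practice one maximizes the log of this product, since a product of many small probabilities underflows quickly.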
Finding best linear classifier
with gradient ascent
Find “best” classifier
Maximize likelihood over all possible w0, w1, w2:

ℓ(w) = ∏_{i=1}^N P(yi|xi,w)

ℓ(w0=0, w1=1,   w2=-1.5) = 10^-6
ℓ(w0=1, w1=1,   w2=-1.5) = 10^-5
ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4
…

Best model: highest likelihood ℓ(w) → ŵ = (w0=1, w1=0.5, w2=-1.5)

[Plot: candidate decision boundaries in the #awesome vs. #awful plane, both axes 0-4]
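To make the comparison concrete, here is a sketch that scores the three candidate coefficient vectors above on the 9-point training table, reusing the data_likelihood function from the earlier sketch (the resulting values will not match the illustrative 10^-6 … 10^-4 numbers on the slide):

```python
import numpy as np

# Training table: rows are h(x) = [1, #awesome, #awful], labels in {+1, -1}
H = np.array([[1, 2, 1], [1, 0, 2], [1, 3, 3], [1, 4, 1], [1, 1, 1],
              [1, 2, 4], [1, 0, 3], [1, 0, 1], [1, 2, 1]], dtype=float)
y = np.array([+1, -1, -1, +1, +1, -1, -1, -1, +1])

candidates = [np.array([0.0, 1.0, -1.5]),
              np.array([1.0, 1.0, -1.5]),
              np.array([1.0, 0.5, -1.5])]
w_best = max(candidates, key=lambda w: data_likelihood(H, y, w))  # highest l(w) wins
```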
Maximizing likelihood
Maximize function over all possible w0, w1, w2:

max over w0,w1,w2 of ∏_{i=1}^N P(yi|xi,w)

No closed-form solution → use gradient ascent. ℓ(w0,w1,w2) is a function of 3 variables.
Review of gradient ascent
Finding the max
via hill climbing
Algorithm:
while not converged:
    w(t+1) ← w(t) + η · dℓ/dw |w(t)
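A minimal one-dimensional sketch of this hill-climbing loop (the function names and toy objective are illustrative assumptions):

```python
def hill_climb(dl_dw, w0, eta=0.1, epsilon=1e-6, max_iter=10000):
    """Repeat w <- w + eta * dl/dw until the derivative is (nearly) zero."""
    w = w0
    for _ in range(max_iter):
        g = dl_dw(w)                  # derivative evaluated at the current w
        if abs(g) < epsilon:          # convergence criterion (see next slide)
            break
        w = w + eta * g               # step uphill
    return w

# Toy example: maximize l(w) = -(w - 3)^2, so dl/dw = -2(w - 3); maximum at w = 3
print(hill_climb(lambda w: -2.0 * (w - 3.0), w0=0.0))  # -> approximately 3.0
```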
Convergence criteria
For concave functions, the optimum occurs when the derivative dℓ/dw = 0.
In practice, stop when |dℓ/dw| < ε.

Algorithm:
while not converged:
    w(t+1) ← w(t) + η · dℓ/dw |w(t)
Moving to multiple dimensions:
Gradients
∇ℓ(w) = [∂ℓ(w)/∂w0, ∂ℓ(w)/∂w1, …, ∂ℓ(w)/∂wD]ᵀ   (vector of the D+1 partial derivatives)
Contour plots
Gradient ascent
Algorithm:
while not converged:
    w(t+1) ← w(t) + η · ∇ℓ(w(t))
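The same idea in multiple dimensions, as a small generic helper that takes any gradient function (an illustrative sketch, not the course's code):

```python
import numpy as np

def gradient_ascent(grad_fn, w_init, eta=0.1, epsilon=1e-6, max_iter=10000):
    """w <- w + eta * grad l(w), stopping once the gradient norm falls below epsilon."""
    w = np.array(w_init, dtype=float)
    for _ in range(max_iter):
        g = grad_fn(w)                      # gradient vector at the current w
        if np.linalg.norm(g) < epsilon:
            break
        w = w + eta * g
    return w
```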
Learning algorithm for
logistic regression
Derivative of (log-)likelihood
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) · (1[yi=+1] − P(y=+1|xi,w))

Sum over data points; hj(xi) is the feature value; the term in parentheses is the difference between the truth 1[yi=+1] and the prediction P(y=+1|xi,w).

Indicator function: 1[yi=+1] = 1 if yi=+1, and 0 otherwise.
Computing derivative
∂ℓ(w(t))/∂wj = Σ_{i=1}^N hj(xi) · (1[yi=+1] − P(y=+1|xi,w(t)))

Current coefficients: w0(t) = 0, w1(t) = 1, w2(t) = -2
x[1]  x[2]  y    P(y=+1|xi,w)   Contribution to derivative for w1
2     1     +1   0.5            2 · (1 − 0.5)  =  1.0
0     2     -1   0.02           0 · (0 − 0.02) =  0.0
3     3     -1   0.05           3 · (0 − 0.05) = -0.15
4     1     +1   0.88           4 · (1 − 0.88) =  0.48

Total derivative: 1.0 + 0.0 − 0.15 + 0.48 = 1.33
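The table's numbers can be reproduced with a few lines of NumPy; this sketch plugs the four data points and w(t) = (0, 1, -2) into the derivative formula for w1 (variable names are illustrative):

```python
import numpy as np

w = np.array([0.0, 1.0, -2.0])                                   # (w0, w1, w2) from the slide
H = np.array([[1, 2, 1], [1, 0, 2], [1, 3, 3], [1, 4, 1]], dtype=float)
y = np.array([+1, -1, -1, +1])

p_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))         # ~[0.50, 0.02, 0.05, 0.88]
indicator = (y == +1).astype(float)              # 1[y_i = +1]
contrib_w1 = H[:, 1] * (indicator - p_plus)      # h_1(x_i) * (1[y_i=+1] - P(y=+1|x_i,w))
print(contrib_w1, contrib_w1.sum())              # ~[1.0, 0.0, -0.15, 0.48], total ~1.33
```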
Derivative of (log-)likelihood: Interpretation
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) · (1[yi=+1] − P(y=+1|xi,w))

(sum over data points: feature value times the difference between truth and prediction)

If hj(xi) = 1, the contribution of data point i to ∂ℓ(w)/∂wj is:
- yi=+1 and P(y=+1|xi,w) ≈ 1: contribution ≈ 0 (already predicted correctly)
- yi=+1 and P(y=+1|xi,w) ≈ 0: contribution ≈ +1, pushing wj up
- yi=-1 and P(y=+1|xi,w) ≈ 1: contribution ≈ -1, pushing wj down
- yi=-1 and P(y=+1|xi,w) ≈ 0: contribution ≈ 0 (already predicted correctly)
Summary of gradient ascent
for logistic regression
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
    for j = 0, …, D:
        partial[j] = Σ_{i=1}^N hj(xi) · (1[yi=+1] − P(y=+1|xi,w(t)))
        wj(t+1) ← wj(t) + η · partial[j]
    t ← t + 1
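Putting the pieces together, here is a sketch of this summary algorithm in NumPy; the layout of H (rows are h(xi)) and y (labels in {+1, -1}) is an assumption, and a vectorized update replaces the explicit loop over j:

```python
import numpy as np

def logistic_regression_mle(H, y, eta=0.1, epsilon=1e-6, max_iter=10000):
    """Gradient ascent on the likelihood of a logistic regression model.

    H: N x (D+1) matrix whose rows are h(x_i); y: N labels in {+1, -1}.
    Update: w_j <- w_j + eta * sum_i h_j(x_i) * (1[y_i=+1] - P(y=+1 | x_i, w)).
    """
    w = np.zeros(H.shape[1])                       # init w(1) = 0
    indicator = (y == +1).astype(float)            # 1[y_i = +1]
    for _ in range(max_iter):
        p_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))   # P(y=+1 | x_i, w(t)) for every i
        partial = H.T.dot(indicator - p_plus)      # all partial[j], j = 0, ..., D, at once
        if np.linalg.norm(partial) <= epsilon:     # loop while ||grad l(w(t))|| > epsilon
            break
        w = w + eta * partial
    return w
```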
Choosing the step size η
Learning curve:
Plot quality (likelihood) over iterations
If step size is too small, can take a
long time to converge
Compare convergence with
different step sizes
Careful with step sizes that are too large
Very large step sizes can even
cause divergence or wild oscillations
Simple rule of thumb for picking step size η
• Unfortunately, picking the step size requires a lot of trial and error
• Try several values, exponentially spaced
  - Goal: plot learning curves to
    • find one η that is too small (smooth, but moving too slowly)
    • find one η that is too large (oscillation or divergence)
• Try values in between to find the "best" η (see the sketch below)
• Advanced tip: can also try a step size that decreases with iterations, e.g., ηt = η0/t
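A rough sketch of that trial-and-error procedure, reusing H and y from the earlier training-table example and comparing the final log-likelihood for a few exponentially spaced step sizes (the specific η values and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def log_likelihood(H, y, w):
    """log l(w) = sum_i log P(y_i | x_i, w); the log avoids numerical underflow."""
    return np.sum(-np.log(1.0 + np.exp(-y * H.dot(w))))

indicator = (y == +1).astype(float)
for eta in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:           # exponentially spaced candidates
    w = np.zeros(H.shape[1])
    curve = []                                       # learning curve for this step size
    for _ in range(200):
        p_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))
        w = w + eta * H.T.dot(indicator - p_plus)
        curve.append(log_likelihood(H, y, w))
    print(eta, curve[-1])   # too small: barely improves; too large: oscillates or diverges
```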
Summary of logistic
regression classifier
P(y=+1|x,ŵ) = 1 / (1 + e^(−ŵᵀ h(x)))
[Block diagram: Training Data → x → Feature extraction → h(x) → ML model → ŷ; the Quality metric (comparing against y) drives the ML algorithm, which learns ŵ]
What you can do now…
• Measure quality of a classifier using the likelihood function
• Interpret the likelihood function as the probability of the observed data
• Learn a logistic regression model with gradient ascent
• (Optional) Derive the gradient ascent update rule for logistic regression
Simplest link function: sign(z)
sign(z) = +1 if z ≥ 0, and -1 otherwise

But sign(z) only outputs -1 or +1, with no probabilities in between, even though we want an output in [0.0, 1.0] as the score z ranges over (-∞, +∞).
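A two-line, purely illustrative comparison of what the two link functions return for the same scores:

```python
import numpy as np

scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # z = w^T h(x) for a few inputs
print(np.where(scores >= 0.0, +1, -1))            # sign(z): only -1 or +1, no confidence
print(1.0 / (1.0 + np.exp(-scores)))              # sigmoid: probabilities in (0, 1)
```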
Finding best coefficients
Negative data points:
x[1] = #awesome x[2] = #awful y = sentiment
0 2 -1
3 3 -1
2 4 -1
0 3 -1
0 1 -1

Positive data points:
x[1] = #awesome x[2] = #awful y = sentiment
2 1 +1
4 1 +1
1 1 +1
2 1 +1
Score(xi) = ŵᵀh(xi) ranges over (-∞, +∞); the model maps it to P(y=+1|xi,ŵ) in [0.0, 1.0].
Quality metric: probability of data
P(y=+1|x,ŵ) = 1 / (1 + e^(−ŵᵀ h(x)))
Example x[1]=2, x[2]=1, y=+1: if the model is good, it should predict y=+1, so increase the probability that y=+1 and choose w to make P(y=+1|x[1]=2, x[2]=1,w) large.
Example x[1]=0, x[2]=2, y=-1: if the model is good, it should predict y=-1, so increase the probability that y=-1 and choose w to make P(y=-1|x[1]=0, x[2]=2,w) large.
Maximizing likelihood
(probability of data)
Data point  x[1]  x[2]  y
x1,y1       2     1     +1
x2,y2       0     2     -1
x3,y3       3     3     -1
x4,y4       4     1     +1
x5,y5       1     1     +1
x6,y6       2     4     -1
x7,y7       0     3     -1
x8,y8       0     1     -1
x9,y9       2     1     +1

Choose w to maximize the probability of each observation; must combine into a single measure of quality.
Learn logistic regression model with
maximum likelihood estimation (MLE)
• Choose coefficients w that maximize the likelihood:
  ℓ(w) = ∏_{i=1}^N P(yi|xi,w)
• No closed-form solution → use gradient ascent