Lecture 6: Linear Classifiers, Part 2


Linear classifiers: Parameter learning

Quality metric for logistic regression: Maximum likelihood estimation




P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

[Block diagram: Training Data provides x and y; Feature extraction maps x to h(x); the ML model uses coefficients ŵ to output P(y=+1|x,ŵ); the ML algorithm chooses ŵ by optimizing the Quality metric.]
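To make the model above concrete, here is a minimal Python sketch (mine, not the course's code) that evaluates P(y=+1|x,ŵ) with the sigmoid link, assuming the feature extraction h(x) = (1, #awesome, #awful) used in the running example; the coefficient values are only illustrative.

    import math

    def h(n_awesome, n_awful):
        # Feature extraction: a constant feature plus the two word counts
        # (assumed to match the running example in these slides).
        return [1.0, float(n_awesome), float(n_awful)]

    def prob_positive(w, feats):
        # P(y=+1 | x, w) = 1 / (1 + exp(-w^T h(x)))
        score = sum(wj * hj for wj, hj in zip(w, feats))
        return 1.0 / (1.0 + math.exp(-score))

    # Illustrative coefficients (w0, w1, w2) and a review containing
    # 2 "awesome"s and 1 "awful".
    w_hat = [1.0, 0.5, -1.5]
    print(prob_positive(w_hat, h(2, 1)))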
Learning problem

Training data: N observations (xi, yi)

x[1] = #awesome   x[2] = #awful   y = sentiment
    2                 1               +1
    0                 2               -1
    3                 3               -1
    4                 1               +1
    1                 1               +1
    2                 4               -1
    0                 3               -1
    0                 1               -1
    2                 1               +1

Optimize the quality metric on the training data → ŵ


Finding best coefficients

Split the training data by sentiment:

Negative data points:
x[1] = #awesome   x[2] = #awful   y = sentiment
    0                 2               -1
    3                 3               -1
    2                 4               -1
    0                 3               -1
    0                 1               -1

Positive data points:
x[1] = #awesome   x[2] = #awful   y = sentiment
    2                 1               +1
    4                 1               +1
    1                 1               +1
    2                 1               +1


A good model should assign P(y=+1|xi,w) ≈ 0.0 to the negative data points and P(y=+1|xi,w) ≈ 1.0 to the positive data points. Pick the ŵ that makes these predicted probabilities match the observed labels as closely as possible.
Quality metric = Likelihood function

No ŵ achieves perfect predictions (usually): we cannot drive P(y=+1|xi,w) all the way to 0.0 for every negative data point and to 1.0 for every positive data point. The likelihood ℓ(w) measures the quality of fit of the model with coefficients w.
Data likelihood



Quality metric: probability of data

Example (x[1]=2, x[2]=1, y=+1): if the model is good, it should predict y=+1, so pick w to maximize P(y=+1 | x[1]=2, x[2]=1, w).

Example (x[1]=0, x[2]=2, y=-1): if the model is good, it should predict y=-1, so pick w to maximize P(y=-1 | x[1]=0, x[2]=2, w).


Maximizing likelihood (probability of data)

Data point   x[1]   x[2]   y
x1,y1          2      1    +1
x2,y2          0      2    -1
x3,y3          3      3    -1
x4,y4          4      1    +1
x5,y5          1      1    +1
x6,y6          2      4    -1
x7,y7          0      3    -1
x8,y8          0      1    -1
x9,y9          2      1    +1

For each data point, choose w to maximize the probability of its observed label; these per-point probabilities must be combined into a single measure of quality.
Learn logistic regression model with maximum likelihood estimation (MLE)

Data point   x[1]   x[2]   y     Choose w to maximize
x1,y1          2      1    +1    P(y=+1 | x[1]=2, x[2]=1, w)
x2,y2          0      2    -1    P(y=-1 | x[1]=0, x[2]=2, w)
x3,y3          3      3    -1    P(y=-1 | x[1]=3, x[2]=3, w)
x4,y4          4      1    +1    P(y=+1 | x[1]=4, x[2]=1, w)
…

ℓ(w) = P(y1|x1,w) · P(y2|x2,w) · P(y3|x3,w) · P(y4|x4,w) · … = ∏_{i=1}^N P(yi | xi, w)
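As a small illustration of this product, the sketch below (my code, not the course's) evaluates ℓ(w) = ∏ P(yi|xi,w) on the 9-point training table above, again assuming h(x) = (1, #awesome, #awful); the coefficient vector passed in is only an example.

    import math

    # Training data from the table: (#awesome, #awful, sentiment).
    data = [(2, 1, +1), (0, 2, -1), (3, 3, -1), (4, 1, +1), (1, 1, +1),
            (2, 4, -1), (0, 3, -1), (0, 1, -1), (2, 1, +1)]

    def prob_of_label(w, x1, x2, y):
        # P(y | x, w): the sigmoid of the score for y=+1, and one minus it for y=-1.
        score = w[0] + w[1] * x1 + w[2] * x2
        p_pos = 1.0 / (1.0 + math.exp(-score))
        return p_pos if y == +1 else 1.0 - p_pos

    def likelihood(w):
        # l(w) = product over the data points of P(yi | xi, w)
        ell = 1.0
        for x1, x2, y in data:
            ell *= prob_of_label(w, x1, x2, y)
        return ell

    print(likelihood([1.0, 0.5, -1.5]))   # illustrative coefficients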
Finding best linear classifier with gradient ascent


Find “best” classifier

Maximize the likelihood over all possible w0, w1, w2:

ℓ(w) = ∏_{i=1}^N P(yi | xi, w)

For example:
ℓ(w0=0, w1=1,   w2=-1.5) = 10^-6
ℓ(w0=1, w1=1,   w2=-1.5) = 10^-5
ℓ(w0=1, w1=0.5, w2=-1.5) = 10^-4

Best model: the one with the highest likelihood ℓ(w), here ŵ = (w0=1, w1=0.5, w2=-1.5).

[Figure: the three candidate decision boundaries plotted in the #awesome vs. #awful plane.]
Maximizing likelihood

Maximize the function over all possible w0, w1, w2:

max over w0, w1, w2 of ∏_{i=1}^N P(yi | xi, w)

ℓ(w0, w1, w2) is a function of 3 variables. There is no closed-form solution → use gradient ascent.
Review of gradient ascent



Finding the max via hill climbing

Algorithm:
  while not converged:
    w(t+1) ← w(t) + η · dℓ/dw evaluated at w(t)


Convergence criteria

For concave functions, the optimum occurs where the derivative dℓ/dw = 0. In practice, stop when |dℓ/dw| < ε for a small tolerance ε.

Algorithm:
  while not converged:
    w(t+1) ← w(t) + η · dℓ/dw evaluated at w(t)
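To see the update rule and stopping criterion in action, here is a tiny 1-D sketch of hill climbing on a concave toy function (my example, not from the lecture): we keep moving uphill until |dℓ/dw| falls below a tolerance ε.

    def gradient_ascent_1d(dl_dw, w_init, step_size, tolerance, max_iter=10000):
        # Repeatedly move uphill, w <- w + eta * dl/dw, and stop once the
        # derivative is (nearly) zero: |dl/dw| < tolerance.
        w = w_init
        for _ in range(max_iter):
            grad = dl_dw(w)
            if abs(grad) < tolerance:
                break
            w = w + step_size * grad
        return w

    # Toy concave function l(w) = -(w - 3)^2, so dl/dw = -2 (w - 3);
    # the maximum is at w = 3.
    w_best = gradient_ascent_1d(lambda w: -2.0 * (w - 3.0), w_init=0.0,
                                step_size=0.1, tolerance=1e-6)
    print(w_best)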


Moving to multiple dimensions: Gradients

∇ℓ(w) = ( ∂ℓ(w)/∂w0 , ∂ℓ(w)/∂w1 , … , ∂ℓ(w)/∂wD )


Contour plots



Gradient ascent

Algorithm:
  while not converged:
    w(t+1) ← w(t) + η ∇ℓ(w(t))


Learning algorithm for logistic regression


Derivative of (log-)likelihood

∂ℓ(w)/∂wj = Σ_{i=1}^N  hj(xi) · ( 1[yi = +1] − P(y=+1 | xi, w) )

- The sum runs over the data points.
- hj(xi) is the feature value.
- ( 1[yi = +1] − P(y=+1 | xi, w) ) is the difference between the truth and the prediction.
- 1[yi = +1] is the indicator function: 1 if yi = +1, and 0 otherwise.


Computing derivative

∂ℓ(w(t))/∂wj = Σ_{i=1}^N  hj(xi) · ( 1[yi = +1] − P(y=+1 | xi, w(t)) )

Current coefficients: w0(t) = 0, w1(t) = 1, w2(t) = -2

x[1]   x[2]   y     P(y=+1|xi,w(t))   Contribution to derivative for w1
  2      1    +1        0.5            2 * (1 - 0.5)  = +1.0
  0      2    -1        0.02           0 * (0 - 0.02) =  0.0
  3      3    -1        0.05           3 * (0 - 0.05) = -0.15
  4      1    +1        0.88           4 * (1 - 0.88) = +0.48

Total derivative for w1: ≈ 1.33
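The table above can be reproduced with a few lines of Python (a sketch of mine, assuming h(x) = (1, #awesome, #awful)): each data point contributes hj(xi) · (1[yi=+1] − P(y=+1|xi,w)) to the derivative, with h1(xi) = #awesome for the w1 column.

    import math

    # The four data points shown on the slide and the current coefficients
    # w^(t) = (w0, w1, w2) = (0, 1, -2).
    data = [(2, 1, +1), (0, 2, -1), (3, 3, -1), (4, 1, +1)]
    w = [0.0, 1.0, -2.0]

    total = 0.0
    for x1, x2, y in data:
        p_pos = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))
        indicator = 1.0 if y == +1 else 0.0
        contribution = x1 * (indicator - p_pos)   # h1(xi) = x1 = #awesome
        total += contribution
        print(f"P(y=+1|xi,w) = {p_pos:.2f}   contribution to dl/dw1 = {contribution:+.2f}")
    print(f"total derivative for w1: {total:.2f}")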


Derivative of (log-)likelihood: Interpretation

∂ℓ(w)/∂wj = Σ_{i=1}^N  hj(xi) · ( 1[yi = +1] − P(y=+1 | xi, w) )

If hj(xi) = 1, data point i contributes ( 1[yi = +1] − P(y=+1 | xi, w) ) to the derivative:

- yi = +1 and P(y=+1|xi,w) ≈ 1: contribution ≈ 0
- yi = +1 and P(y=+1|xi,w) ≈ 0: contribution ≈ +1 (push wj up)
- yi = -1 and P(y=+1|xi,w) ≈ 1: contribution ≈ -1 (push wj down)
- yi = -1 and P(y=+1|xi,w) ≈ 0: contribution ≈ 0

So the derivative pushes a coefficient up when a positive example is predicted poorly, pushes it down when a negative example is predicted poorly, and leaves it roughly unchanged when the prediction already matches the truth.


Summary of gradient ascent for logistic regression

init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = Σ_{i=1}^N  hj(xi) · ( 1[yi = +1] − P(y=+1 | xi, w(t)) )
    wj(t+1) ← wj(t) + η · partial[j]
  t ← t + 1
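The pseudocode above translates almost line for line into Python. The sketch below is my own implementation on the toy sentiment data, not the course's reference code; the features h(x) = (1, #awesome, #awful), step size, tolerance, and iteration cap are all assumptions chosen for illustration.

    import math

    # Toy training data: (#awesome, #awful, sentiment).
    data = [(2, 1, +1), (0, 2, -1), (3, 3, -1), (4, 1, +1), (1, 1, +1),
            (2, 4, -1), (0, 3, -1), (0, 1, -1), (2, 1, +1)]

    def features(x1, x2):
        # h(x) = (1, #awesome, #awful)
        return [1.0, float(x1), float(x2)]

    def prob_positive(w, h):
        # Numerically safe sigmoid of the score w^T h(x).
        score = sum(wj * hj for wj, hj in zip(w, h))
        if score >= 0:
            return 1.0 / (1.0 + math.exp(-score))
        z = math.exp(score)
        return z / (1.0 + z)

    def gradient(w):
        # partial[j] = sum_i h_j(x_i) * (1[y_i = +1] - P(y=+1 | x_i, w))
        partial = [0.0] * len(w)
        for x1, x2, y in data:
            h = features(x1, x2)
            error = (1.0 if y == +1 else 0.0) - prob_positive(w, h)
            for j in range(len(w)):
                partial[j] += h[j] * error
        return partial

    def fit_logistic_regression(step_size=0.1, tolerance=1e-4, max_iter=1000):
        w = [0.0, 0.0, 0.0]                      # init w^(1) = 0
        for _ in range(max_iter):
            partial = gradient(w)
            # Stop when the gradient is small: ||grad l(w)|| <= epsilon.
            if math.sqrt(sum(p * p for p in partial)) <= tolerance:
                break
            w = [wj + step_size * pj for wj, pj in zip(w, partial)]
        return w

    print("learned coefficients:", fit_logistic_regression())

Because this toy data set happens to be linearly separable, the coefficients keep growing as the iteration cap is raised; the cap (or a stricter tolerance) is what stops the sketch.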


Choosing the step size η



Learning curve: plot quality (likelihood) over iterations


If step size is too small, can take a
long time to converge



Compare convergence with different step sizes


Careful with step sizes that are too large



Very large step sizes can even
cause divergence or wild oscillations



Simple rule of thumb for picking step size η
• Unfortunately, picking the step size requires a lot of trial and error
• Try several values, exponentially spaced
  - Goal: plot learning curves to
    • find one η that is too small (smooth but moving too slowly)
    • find one η that is too large (oscillation or divergence)
  - Try values in between to find the “best” η (a small experiment along these lines is sketched below)
• Advanced tip: can also try a step size that decreases with iterations, e.g., ηt = η0 / t
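As a rough illustration of this recipe, the sketch below (my code, using an assumed toy objective and an assumed grid of step sizes, not anything from the lecture) runs gradient ascent with exponentially spaced values of η and records a learning curve for each, so that a too-small η (barely moves) and a too-large η (oscillates or diverges) are easy to spot.

    # Toy concave objective l(w) = -(w1 - 1)^2 - 10 (w2 + 2)^2 and its gradient;
    # one direction is much steeper than the other, so large steps oscillate.
    def l(w):
        return -(w[0] - 1.0) ** 2 - 10.0 * (w[1] + 2.0) ** 2

    def grad_l(w):
        return [-2.0 * (w[0] - 1.0), -20.0 * (w[1] + 2.0)]

    def learning_curve(step_size, n_iter=50):
        # Run gradient ascent and record l(w) after every iteration.
        w = [0.0, 0.0]
        curve = []
        for _ in range(n_iter):
            g = grad_l(w)
            w = [wj + step_size * gj for wj, gj in zip(w, g)]
            curve.append(l(w))
        return curve

    # Exponentially spaced candidate step sizes: some too small, some too large.
    for eta in [1e-3, 1e-2, 1e-1, 1.0]:
        curve = learning_curve(eta)
        print(f"eta = {eta:g}: final l(w) = {curve[-1]:.3g}")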


Summary of logistic regression classifier



P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

[Block diagram, repeated from the start of the lecture: Training Data provides x and y; Feature extraction maps x to h(x); the ML model uses coefficients ŵ to output P(y=+1|x,ŵ); the ML algorithm chooses ŵ by optimizing the Quality metric.]
What you can do now…
• Measure the quality of a classifier using the likelihood function
• Interpret the likelihood function as the probability of the observed data
• Learn a logistic regression model with gradient ascent
• (Optional) Derive the gradient ascent update rule for logistic regression


Simplest link function: sign(z)

sign(z) = +1 if z ≥ 0, and -1 otherwise

[Figure: sign(z) maps scores z ∈ (-∞, +∞) onto the two output values -1 and +1.]

But sign(z) only outputs -1 or +1; there are no probabilities in between.
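A tiny sketch (mine, not from the slides) that contrasts the two link functions on a few example scores: sign(z) jumps straight from -1 to +1, while the logistic (sigmoid) link used in the rest of this lecture returns probabilities in between.

    import math

    def sign_link(z):
        # sign(z): +1 if z >= 0, -1 otherwise; no probabilities in between.
        return +1 if z >= 0 else -1

    def sigmoid_link(z):
        # Logistic link: a probability strictly between 0 and 1.
        return 1.0 / (1.0 + math.exp(-z))

    for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
        print(f"z = {z:+.1f}   sign(z) = {sign_link(z):+d}   sigmoid(z) = {sigmoid_link(z):.3f}")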
Finding best coefficients

The score, Score(xi) = ŵᵀh(xi), ranges over (-∞, +∞) and is mapped to a probability 0.0 ≤ P(y=+1|xi,ŵ) ≤ 1.0. For the negative data points we want P(y=+1|xi,ŵ) near 0.0, and for the positive data points near 1.0.
Quality metric: probability of data

P(y=+1|x,ŵ) = 1 / (1 + e^(-ŵᵀh(x)))

Example (x[1]=2, x[2]=1, y=+1): if the model is good, it should predict y=+1, so increase the probability that y=+1 for this input; choose w to make P(y=+1 | x[1]=2, x[2]=1, w) large.

Example (x[1]=0, x[2]=2, y=-1): if the model is good, it should predict y=-1, so increase the probability that y=-1 for this input; choose w to make P(y=-1 | x[1]=0, x[2]=2, w) large.


Maximizing likelihood (probability of data)

As before, for each of the N training points (xi, yi) we want to choose w to maximize P(yi | xi, w), and these per-point probabilities must be combined into a single measure of quality.
Learn logistic regression model with maximum likelihood estimation (MLE)

• Choose coefficients w that maximize the likelihood:

  ℓ(w) = ∏_{i=1}^N P(yi | xi, w)

• No closed-form solution → use gradient ascent
