Logistic Regression
Machine Learning
Where are we?
We have seen the following ideas
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors 𝐱 ∈ ℝⁿ
– Labels: 𝑦 ∈ {−1, +1}
• Training data
– S = {(𝐱_i, 𝑦_i)}, consisting of 𝑚 examples
Classification, but…
The output 𝑦 is discrete: Either −1 or +1
Instead of predicting a label, let us try to predict P(𝑦 = +1 ∣ 𝐱)
Expand the hypothesis space to functions whose output is in [0, 1]
• Original problem: ℝⁿ → {−1, +1}
• Modified problem: ℝⁿ → [0, 1]
• Effectively, this makes the problem a regression problem
Many hypothesis spaces are possible
The Sigmoid function
The hypothesis space for logistic regression: all functions of the form

    h(𝐱) = σ(𝐰ᵀ𝐱)

That is, a linear function composed with the sigmoid function σ (the logistic function), defined as

    σ(z) = 1 / (1 + exp(−z))

What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later.
The Sigmoid function
[Plot: the sigmoid function σ(z)]
The Sigmoid function
What is its derivative with respect to z?

    dσ/dz = σ(z) (1 − σ(z))
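As a concrete reference, here is a minimal NumPy sketch of the sigmoid and the derivative above (the function names and the clipping of z for numerical stability are additions for illustration, not part of the lecture):

```python
import numpy as np

def sigmoid(z):
    """The logistic function: maps any real z into (0, 1)."""
    z = np.clip(z, -500, 500)            # guard against overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0, where sigma(0) = 0.5
print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5 0.25
```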
Predicting probabilities
According to the logistic regression model, we have

    P(y = +1 | 𝐱; 𝐰) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))
    P(y = −1 | 𝐱; 𝐰) = 1 − P(y = +1 | 𝐱; 𝐰) = 1 / (1 + exp(𝐰ᵀ𝐱))

Or equivalently, for y ∈ {−1, +1},

    P(y | 𝐱; 𝐰) = σ(y 𝐰ᵀ𝐱) = 1 / (1 + exp(−y 𝐰ᵀ𝐱))

Note that we are directly modeling P(y | 𝐱) rather than P(𝐱 | y) and P(y)
Predicting a label with logistic regression
• Compute P(y = +1 | 𝐱; 𝐰)
• If this is greater than half, predict +1; else predict −1
– What does this correspond to in terms of 𝐰ᵀ𝐱?
– Prediction = sgn(𝐰ᵀ𝐱)
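A minimal sketch of these two prediction steps with a learned weight vector 𝐰 (the function names and the toy vectors are illustrative, not from the lecture):

```python
import numpy as np

def predict_proba(w, x):
    """P(y = +1 | x; w) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def predict_label(w, x):
    """Predict +1 if P(y = +1 | x; w) > 1/2, i.e. sgn(w^T x); else predict -1."""
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, 1.0])
print(predict_proba(w, x), predict_label(w, x))   # ~0.646, +1
```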
This lecture
• Logistic regression
• Training a logistic regression classifier
– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a posteriori estimation
• Back to loss minimization
Maximum likelihood estimation
Let’s address the problem of learning
• Training data
– S = {(𝐱_i, 𝑦_i)}, consisting of 𝑚 examples
• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and identically distributed (i.i.d.)
– How do we proceed?
Maximum likelihood estimation
    argmax_𝐰 P(S | 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(y_i | 𝐱_i, 𝐰)

The usual trick: Convert products to sums by taking the log.
Recall that this works only because log is an increasing function, so the maximizer does not change.

Equivalent to solving

    max_𝐰 ∑_{i=1}^{m} log P(y_i | 𝐱_i, 𝐰)

But (by definition) we know that

    P(y_i | 𝐰, 𝐱_i) = σ(y_i 𝐰ᵀ𝐱_i) = 1 / (1 + exp(−y_i 𝐰ᵀ𝐱_i))

Substituting, this is equivalent to solving

    max_𝐰 ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i))

The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.

Equivalent to: Training a linear classifier by minimizing the logistic loss.
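A small sketch of the quantity being maximized, written in code (X is an m × n matrix of feature vectors and y a vector of ±1 labels; the names are illustrative, and np.logaddexp is used only as a numerically stable way to compute log(1 + exp(·))):

```python
import numpy as np

def log_likelihood(w, X, y):
    """sum_i log P(y_i | x_i, w) = sum_i -log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)                      # y_i * w^T x_i for every example
    return -np.sum(np.logaddexp(0.0, -margins))

def logistic_loss(w, X, y):
    """The negated log-likelihood: the logistic loss summed over the training set."""
    return -log_likelihood(w, X, y)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([+1, -1, -1])
print(log_likelihood(np.zeros(2), X, y))       # 3 * log(1/2) = -2.079...
```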
Maximum a posteriori estimation
We could also add a prior on the weights
Suppose each weight in the weight vector is drawn
independently from the normal distribution with zero
mean and standard deviation 𝜎
    p(𝐰) = ∏_{j=1}^{n} p(w_j) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−w_j² / (2σ²))
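As a quick sanity check on this prior, its log differs from −𝐰ᵀ𝐰 / (2σ²) only by a constant that does not depend on 𝐰; a short sketch (assuming SciPy is available; the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
w = np.array([0.5, -1.0, 2.0])

log_prior = norm.logpdf(w, loc=0.0, scale=sigma).sum()   # log of the Gaussian prior
penalty = -(w @ w) / (2 * sigma**2)                      # the w-dependent part

# The gap is the constant -n * log(sigma * sqrt(2 * pi))
print(log_prior - penalty, -len(w) * np.log(sigma * np.sqrt(2 * np.pi)))
```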
MAP estimation for logistic regression
Let us work through this procedure again to see what changes from maximum likelihood estimation.
What is the goal of MAP estimation?
(In maximum likelihood estimation, we maximized the likelihood of the data.)
To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):

    P(𝐰 | S) ∝ P(S | 𝐰) P(𝐰)
MAP estimation for logistic regression
Learning by solving

    argmax_𝐰 P(𝐰 | S) = argmax_𝐰 P(S | 𝐰) P(𝐰)

Take the log to simplify:

    max_𝐰 log P(S | 𝐰) + log P(𝐰)

We have already expanded out the first term:

    ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i))

Expand the log prior (using the Gaussian prior above):

    max_𝐰 ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i)) + ∑_{j=1}^{n} (−w_j² / (2σ²)) + constants

Dropping the constants and collecting the weights into 𝐰ᵀ𝐰:

    max_𝐰 ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i)) − (1 / (2σ²)) 𝐰ᵀ𝐰

Maximizing the negative of a function is the same as minimizing that function.
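For the gradient-based training discussed next, it helps to write down the gradient of the (negated) objective. A sketch of the derivation, using dσ/dz = σ(z)(1 − σ(z)) and the chain rule (this step is not spelled out on the slides):

```latex
% Minimize F(w) = \sum_{i=1}^{m} \log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i))
%                 + \frac{1}{2\sigma^2} \mathbf{w}^T \mathbf{w}
\[
\begin{aligned}
\nabla_{\mathbf{w}} \log\!\big(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\big)
  &= \frac{-y_i \mathbf{x}_i \, e^{-y_i \mathbf{w}^T \mathbf{x}_i}}
          {1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}}
   = -\big(1 - \sigma(y_i \mathbf{w}^T \mathbf{x}_i)\big)\, y_i \mathbf{x}_i \\
\nabla_{\mathbf{w}} F(\mathbf{w})
  &= -\sum_{i=1}^{m} \big(1 - \sigma(y_i \mathbf{w}^T \mathbf{x}_i)\big)\, y_i \mathbf{x}_i
     \;+\; \frac{1}{\sigma^2}\,\mathbf{w}
\end{aligned}
\]
```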
Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to solving

    min_𝐰 ∑_{i=1}^{m} log(1 + exp(−y_i 𝐰ᵀ𝐱_i)) + (1 / (2σ²)) 𝐰ᵀ𝐰

Where have we seen this before?
Exercise: Write down the stochastic gradient descent (SGD) algorithm for this objective.
Other training algorithms exist. For example, L-BFGS is a quasi-Newton method. But gradient-based methods such as SGD and its variants are far more commonly used.
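One possible sketch of the SGD exercise above, for the objective on this slide. The constant learning rate, the random example order, and splitting the regularizer evenly across the m per-example updates are simplifying choices, not prescriptions from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def sgd_logistic_regression(X, y, sigma2=1.0, lr=0.1, epochs=100, seed=0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + (1 / (2*sigma2)) w^T w by SGD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):               # visit examples in random order
            margin = y[i] * (w @ X[i])
            # Gradient of the i-th loss term plus 1/m of the regularizer's gradient
            grad = -(1.0 - sigmoid(margin)) * y[i] * X[i] + w / (sigma2 * m)
            w -= lr * grad
    return w

# Toy, linearly separable data: the first feature carries the label
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.3], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = sgd_logistic_regression(X, y)
print(np.sign(X @ w))                              # should recover [ 1.  1. -1. -1.]
```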
Logistic regression is…
• A classifier that predicts the probability that the label is
+1 for a particular input
• The discriminative counterpart of the naïve Bayes classifier
• A discriminative classifier that can be trained via maximum likelihood or MAP estimation
• A discriminative classifier that minimizes the logistic loss over the training set
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
Learning as loss minimization
• The setup
– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss
But distribution D is unknown
• Instead, minimize empirical loss on the training set
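One standard way to write the two objectives contrasted above, under the slide's notation (the exact form on the original slide is not shown, so treat this as a reconstruction):

```latex
% Ideal: expected loss over the unknown distribution D, with oracle labels f(x)
% In practice: empirical (average) loss over the m training examples
\[
\min_{h \in H} \; \mathbb{E}_{x \sim D}\!\left[ L\big(h(x), f(x)\big) \right]
\qquad\text{vs.}\qquad
\min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big)
\]
```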
Empirical loss minimization
Learning = minimize empirical loss on the training set
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses
• Achieved using a regularizer, which penalizes complex hypotheses
Regularized loss minimization
• Learning: min_{h ∈ H} ∑_{i=1}^{m} L(h(𝐱_i), y_i) + λ · complexity(h)
• With linear classifiers: min_𝐰 ∑_{i=1}^{m} L(y_i, 𝐰ᵀ𝐱_i) + λ 𝐰ᵀ𝐰 (using ℓ₂ regularization)
• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing average loss over the training data
• What is the ideal loss function for classification?
The 0-1 loss
Penalize classification mistakes between the true label y and the prediction y′
• For linear classifiers, the prediction is y′ = sgn(𝐰ᵀ𝐱)
– Mistake if y 𝐰ᵀ𝐱 ≤ 0
Minimizing the 0-1 loss directly is intractable. We need surrogate losses.
The loss function zoo
Many loss functions exist
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
[Plot: the surrogate losses as functions of the margin y 𝐰ᵀ𝐱 — zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression losses; later builds show the same plot zoomed out, then zoomed out even more.]
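A short sketch that evaluates the losses from the plot as functions of the margin z = y 𝐰ᵀ𝐱. The definitions follow the standard textbook forms; the lecture's plot may scale or shift some of them differently:

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)     # 1 on a mistake, 0 otherwise
def perceptron(z):  return np.maximum(0.0, -z)        # penalizes only mistakes, linearly
def hinge(z):       return np.maximum(0.0, 1.0 - z)   # SVM: also penalizes small margins
def exponential(z): return np.exp(-z)                 # AdaBoost
def logistic(z):    return np.log(1.0 + np.exp(-z))   # logistic regression

z = np.linspace(-2.0, 2.0, 5)                         # a few margin values
for name, loss in [("zero-one", zero_one), ("perceptron", perceptron),
                   ("hinge", hinge), ("exponential", exponential),
                   ("logistic", logistic)]:
    print(f"{name:12s}", np.round(loss(z), 3))
```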
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
• Connection to Naïve Bayes
Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:

    log [ P(y = +1 | 𝐱, 𝐰) / P(y = −1 | 𝐱, 𝐰) ] = 𝐰ᵀ𝐱

Here, the P's represent the naïve Bayes posterior distribution, and 𝐰 can be used to calculate the priors and the likelihoods. That is, P(y = 1 | 𝐰, 𝐱) is computed using P(𝐱 | y = 1, 𝐰) and P(y = 1 | 𝐰).

But we also know that P(y = +1 | 𝐱, 𝐰) = 1 − P(y = −1 | 𝐱, 𝐰).

Substituting into the expression above, we get

    P(y = +1 | 𝐰, 𝐱) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))

Exercise: Show this formally.

That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
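For reference, a sketch of the substitution asked for in the exercise above:

```latex
\[
\log \frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})}
   = \mathbf{w}^T \mathbf{x}
\quad\text{and}\quad
P(y = -1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})
\]
\[
\Rightarrow\;
\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})}
   = e^{\mathbf{w}^T \mathbf{x}}
\;\Rightarrow\;
P(y = +1 \mid \mathbf{x}, \mathbf{w})
   = \frac{e^{\mathbf{w}^T \mathbf{x}}}{1 + e^{\mathbf{w}^T \mathbf{x}}}
   = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}
   = \sigma(\mathbf{w}^T \mathbf{x})
\]
```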