Logistic Regression
Machine Learning
Where are we?
We have seen the following ideas
– Linear models
– Learning as loss minimization
– Bayesian learning criteria (MAP and MLE estimation)
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
Logistic Regression: Setup
• The setting
– Binary classification
– Inputs: Feature vectors 𝐱 ∈ ℝⁿ
– Labels: 𝑦 ∈ {−1, +1}
• Training data
– S = {(𝐱_i, 𝑦_i)}, consisting of 𝑚 examples
Classification, but…
The output 𝑦 is discrete: Either −1 or +1
Instead of predicting a label, let us try to predict P(𝑦 = +1 ∣ 𝐱)
Expand the hypothesis space to functions whose output is in [0, 1]
• Original problem: ℝⁿ → {−1, +1}
• Modified problem: ℝⁿ → [0, 1]
• Effectively, this makes the problem a regression problem
Many hypothesis spaces are possible
The Sigmoid function
The hypothesis space for logistic regression: all functions of the form

    h(𝐱) = σ(𝐰ᵀ𝐱)

That is, a linear function composed with the sigmoid function σ (the logistic function), defined as

    σ(z) = 1 / (1 + exp(−z))

What is the domain and the range of the sigmoid function?
This is a reasonable choice. We will see why later.
The Sigmoid function
[Plot: the sigmoid function σ(z)]
The Sigmoid function
What is its derivative with respect to z?

    dσ/dz = σ(z) (1 − σ(z))
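As a concrete reference, here is a minimal NumPy sketch of the sigmoid and the derivative above (the function names and the clipping of z for numerical stability are additions for illustration, not part of the lecture):

```python
import numpy as np

def sigmoid(z):
    """The logistic function: maps any real z into (0, 1)."""
    z = np.clip(z, -500, 500)            # guard against overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """d sigma / dz = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative is largest at z = 0, where sigma(0) = 0.5
print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5 0.25
```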
Predicting probabilities
According to the logistic regression model, we have

    P(y = +1 | 𝐱; 𝐰) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))
    P(y = −1 | 𝐱; 𝐰) = 1 − P(y = +1 | 𝐱; 𝐰) = 1 / (1 + exp(𝐰ᵀ𝐱))

Or equivalently, for y ∈ {−1, +1},

    P(y | 𝐱; 𝐰) = σ(y 𝐰ᵀ𝐱) = 1 / (1 + exp(−y 𝐰ᵀ𝐱))

Note that we are directly modeling P(y | 𝐱) rather than P(𝐱 | y) and P(y)
Predicting a label with logistic regression
• Compute P(y = +1 | 𝐱; 𝐰)
• If this is greater than half, predict +1; else predict −1
– What does this correspond to in terms of 𝐰ᵀ𝐱?
– Prediction = sgn(𝐰ᵀ𝐱)
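A minimal sketch of these two prediction steps with a learned weight vector 𝐰 (the function names and the toy vectors are illustrative, not from the lecture):

```python
import numpy as np

def predict_proba(w, x):
    """P(y = +1 | x; w) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

def predict_label(w, x):
    """Predict +1 if P(y = +1 | x; w) > 1/2, i.e. sgn(w^T x); else predict -1."""
    return 1 if np.dot(w, x) > 0 else -1

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, 1.0])
print(predict_proba(w, x), predict_label(w, x))   # ~0.646, +1
```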
This lecture
• Logistic regression
• Training a logistic regression classifier
– First: Maximum likelihood estimation
– Then: Adding priors → Maximum a posteriori estimation
• Back to loss minimization
Maximum likelihood estimation
Let’s address the problem of learning
• Training data
– S = {(𝐱_i, 𝑦_i)}, consisting of 𝑚 examples
• What we want
– Find a weight vector 𝐰 such that P(S ∣ 𝐰) is maximized
– We know that our examples are drawn independently and identically distributed (i.i.d.)
– How do we proceed?
Maximum likelihood estimation
    argmax_𝐰 P(S | 𝐰) = argmax_𝐰 ∏_{i=1}^{m} P(y_i | 𝐱_i, 𝐰)

The usual trick: Convert products to sums by taking the log.
Recall that this works only because log is an increasing function, so the maximizer does not change.

Equivalent to solving

    max_𝐰 ∑_{i=1}^{m} log P(y_i | 𝐱_i, 𝐰)

But (by definition) we know that

    P(y_i | 𝐰, 𝐱_i) = σ(y_i 𝐰ᵀ𝐱_i) = 1 / (1 + exp(−y_i 𝐰ᵀ𝐱_i))

Substituting, this is equivalent to solving

    max_𝐰 ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i))

The goal: Maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.

Equivalent to: Training a linear classifier by minimizing the logistic loss.
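A small sketch of the quantity being maximized, written in code (X is an m × n matrix of feature vectors and y a vector of ±1 labels; the names are illustrative, and np.logaddexp is used only as a numerically stable way to compute log(1 + exp(·))):

```python
import numpy as np

def log_likelihood(w, X, y):
    """sum_i log P(y_i | x_i, w) = sum_i -log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)                      # y_i * w^T x_i for every example
    return -np.sum(np.logaddexp(0.0, -margins))

def logistic_loss(w, X, y):
    """The negated log-likelihood: the logistic loss summed over the training set."""
    return -log_likelihood(w, X, y)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]])
y = np.array([+1, -1, -1])
print(log_likelihood(np.zeros(2), X, y))       # 3 * log(1/2) = -2.079...
```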
Maximum a posteriori estimation
We could also add a prior on the weights
Suppose each weight in the weight vector is drawn
independently from the normal distribution with zero
mean and standard deviation 𝜎
    p(𝐰) = ∏_{j=1}^{n} p(w_j) = ∏_{j=1}^{n} (1 / (σ√(2π))) exp(−w_j² / (2σ²))
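As a quick sanity check on this prior, its log differs from −𝐰ᵀ𝐰 / (2σ²) only by a constant that does not depend on 𝐰; a short sketch (assuming SciPy is available; the numbers are arbitrary):

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
w = np.array([0.5, -1.0, 2.0])

log_prior = norm.logpdf(w, loc=0.0, scale=sigma).sum()   # log of the Gaussian prior
penalty = -(w @ w) / (2 * sigma**2)                      # the w-dependent part

# The gap is the constant -n * log(sigma * sqrt(2 * pi))
print(log_prior - penalty, -len(w) * np.log(sigma * np.sqrt(2 * np.pi)))
```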
MAP estimation for logistic regression
Let us work through this procedure again to see what changes from maximum likelihood estimation.
What is the goal of MAP estimation?
(In maximum likelihood estimation, we maximized the likelihood of the data.)
To maximize the posterior probability of the model given the data (i.e., to find the most probable model, given the data):

    P(𝐰 | S) ∝ P(S | 𝐰) P(𝐰)
MAP estimation for logistic regression
Learning by solving

    argmax_𝐰 P(𝐰 | S) = argmax_𝐰 P(S | 𝐰) P(𝐰)

Take the log to simplify:

    max_𝐰 log P(S | 𝐰) + log P(𝐰)

We have already expanded out the first term:

    ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i))

Expand the log prior (using the Gaussian prior above):

    max_𝐰 ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i)) + ∑_{j=1}^{n} (−w_j² / (2σ²)) + constants

Dropping the constants and collecting the weights into 𝐰ᵀ𝐰:

    max_𝐰 ∑_{i=1}^{m} −log(1 + exp(−y_i 𝐰ᵀ𝐱_i)) − (1 / (2σ²)) 𝐰ᵀ𝐰

Maximizing the negative of a function is the same as minimizing that function.
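For the gradient-based training discussed next, it helps to write down the gradient of the (negated) objective. A sketch of the derivation, using dσ/dz = σ(z)(1 − σ(z)) and the chain rule (this step is not spelled out on the slides):

```latex
% Minimize F(w) = \sum_{i=1}^{m} \log(1 + \exp(-y_i \mathbf{w}^T \mathbf{x}_i))
%                 + \frac{1}{2\sigma^2} \mathbf{w}^T \mathbf{w}
\[
\begin{aligned}
\nabla_{\mathbf{w}} \log\!\big(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\big)
  &= \frac{-y_i \mathbf{x}_i \, e^{-y_i \mathbf{w}^T \mathbf{x}_i}}
          {1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}}
   = -\big(1 - \sigma(y_i \mathbf{w}^T \mathbf{x}_i)\big)\, y_i \mathbf{x}_i \\
\nabla_{\mathbf{w}} F(\mathbf{w})
  &= -\sum_{i=1}^{m} \big(1 - \sigma(y_i \mathbf{w}^T \mathbf{x}_i)\big)\, y_i \mathbf{x}_i
     \;+\; \frac{1}{\sigma^2}\,\mathbf{w}
\end{aligned}
\]
```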
Learning a logistic regression classifier
Learning a logistic regression classifier is equivalent to solving

    min_𝐰 ∑_{i=1}^{m} log(1 + exp(−y_i 𝐰ᵀ𝐱_i)) + (1 / (2σ²)) 𝐰ᵀ𝐰

Where have we seen this before?
Exercise: Write down the stochastic gradient descent (SGD) algorithm for this objective.
Other training algorithms exist. For example, L-BFGS is a quasi-Newton method. But gradient-based methods such as SGD and its variants are far more commonly used.
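One possible sketch of the SGD exercise above, for the objective on this slide. The constant learning rate, the random example order, and splitting the regularizer evenly across the m per-example updates are simplifying choices, not prescriptions from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def sgd_logistic_regression(X, y, sigma2=1.0, lr=0.1, epochs=100, seed=0):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + (1 / (2*sigma2)) w^T w by SGD."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        for i in rng.permutation(m):               # visit examples in random order
            margin = y[i] * (w @ X[i])
            # Gradient of the i-th loss term plus 1/m of the regularizer's gradient
            grad = -(1.0 - sigmoid(margin)) * y[i] * X[i] + w / (sigma2 * m)
            w -= lr * grad
    return w

# Toy, linearly separable data: the first feature carries the label
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.3], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = sgd_logistic_regression(X, y)
print(np.sign(X @ w))                              # should recover [ 1.  1. -1. -1.]
```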
Logistic regression is…
• A classifier that predicts the probability that the label is
+1 for a particular input
• The discriminative counterpart of the naïve Bayes classifier
• A discriminative classifier that can be trained via maximum likelihood or MAP estimation
• A discriminative classifier that minimizes the logistic loss over the training set
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
Learning as loss minimization
• The setup
– Examples x drawn from a fixed, unknown distribution D
– Hidden oracle classifier f labels examples
– We wish to find a hypothesis h that mimics f
• The ideal situation
– Define a function L that penalizes bad hypotheses
– Learning: Pick a function ℎ ∈ 𝐻 to minimize expected loss
But distribution D is unknown
• Instead, minimize empirical loss on the training set
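One standard way to write the two objectives contrasted above, under the slide's notation (the exact form on the original slide is not shown, so treat this as a reconstruction):

```latex
% Ideal: expected loss over the unknown distribution D, with oracle labels f(x)
% In practice: empirical (average) loss over the m training examples
\[
\min_{h \in H} \; \mathbb{E}_{x \sim D}\!\left[ L\big(h(x), f(x)\big) \right]
\qquad\text{vs.}\qquad
\min_{h \in H} \; \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big)
\]
```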
Empirical loss minimization
Learning = minimize empirical loss on the training set
Is there a problem here? Overfitting!
We need something that biases the learner towards simpler hypotheses
• Achieved using a regularizer, which penalizes complex hypotheses
Regularized loss minimization
• Learning: min_{h ∈ H} ∑_{i=1}^{m} L(h(𝐱_i), y_i) + λ · complexity(h)
• With linear classifiers: min_𝐰 ∑_{i=1}^{m} L(y_i, 𝐰ᵀ𝐱_i) + λ 𝐰ᵀ𝐰 (using ℓ₂ regularization)
• What is a loss function?
– Loss functions should penalize mistakes
– We are minimizing average loss over the training data
• What is the ideal loss function for classification?
The 0-1 loss
Penalize classification mistakes between the true label y and the prediction y′
• For linear classifiers, the prediction is y′ = sgn(𝐰ᵀ𝐱)
– Mistake if y 𝐰ᵀ𝐱 ≤ 0
Minimizing the 0-1 loss directly is intractable. We need surrogate losses.
The loss function zoo
Many loss functions exist
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
[Plot: the surrogate losses as functions of the margin y 𝐰ᵀ𝐱 — zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression losses; later builds show the same plot zoomed out, then zoomed out even more.]
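A short sketch that evaluates the losses from the plot as functions of the margin z = y 𝐰ᵀ𝐱. The definitions follow the standard textbook forms; the lecture's plot may scale or shift some of them differently:

```python
import numpy as np

def zero_one(z):    return (z <= 0).astype(float)     # 1 on a mistake, 0 otherwise
def perceptron(z):  return np.maximum(0.0, -z)        # penalizes only mistakes, linearly
def hinge(z):       return np.maximum(0.0, 1.0 - z)   # SVM: also penalizes small margins
def exponential(z): return np.exp(-z)                 # AdaBoost
def logistic(z):    return np.log(1.0 + np.exp(-z))   # logistic regression

z = np.linspace(-2.0, 2.0, 5)                         # a few margin values
for name, loss in [("zero-one", zero_one), ("perceptron", perceptron),
                   ("hinge", hinge), ("exponential", exponential),
                   ("logistic", logistic)]:
    print(f"{name:12s}", np.round(loss(z), 3))
```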
This lecture
• Logistic regression
• Training a logistic regression classifier
• Back to loss minimization
• Connection to Naïve Bayes
Naïve Bayes and Logistic regression
Remember that the naïve Bayes decision is a linear function:

    log [ P(y = +1 | 𝐱, 𝐰) / P(y = −1 | 𝐱, 𝐰) ] = 𝐰ᵀ𝐱

Here, the P's represent the naïve Bayes posterior distribution, and 𝐰 can be used to calculate the priors and the likelihoods. That is, P(y = 1 | 𝐰, 𝐱) is computed using P(𝐱 | y = 1, 𝐰) and P(y = 1 | 𝐰).

But we also know that P(y = +1 | 𝐱, 𝐰) = 1 − P(y = −1 | 𝐱, 𝐰).

Substituting into the expression above, we get

    P(y = +1 | 𝐰, 𝐱) = σ(𝐰ᵀ𝐱) = 1 / (1 + exp(−𝐰ᵀ𝐱))

Exercise: Show this formally.

That is, both naïve Bayes and logistic regression try to compute the same posterior distribution over the outputs.
Naïve Bayes is a generative model. Logistic regression is the discriminative version.
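For reference, a sketch of the substitution asked for in the exercise above:

```latex
\[
\log \frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{P(y = -1 \mid \mathbf{x}, \mathbf{w})}
   = \mathbf{w}^T \mathbf{x}
\quad\text{and}\quad
P(y = -1 \mid \mathbf{x}, \mathbf{w}) = 1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})
\]
\[
\Rightarrow\;
\frac{P(y = +1 \mid \mathbf{x}, \mathbf{w})}{1 - P(y = +1 \mid \mathbf{x}, \mathbf{w})}
   = e^{\mathbf{w}^T \mathbf{x}}
\;\Rightarrow\;
P(y = +1 \mid \mathbf{x}, \mathbf{w})
   = \frac{e^{\mathbf{w}^T \mathbf{x}}}{1 + e^{\mathbf{w}^T \mathbf{x}}}
   = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}
   = \sigma(\mathbf{w}^T \mathbf{x})
\]
```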