Statistics
Logistic Regression
Shaheena Bashir
FALL, 2019
Outline
- Background
- Introduction
  - Logit Transformation
  - Assumptions
- Estimation
- Example
  - Analysis
  - How Good is the Fitted Model?
- Single Categorical Predictor
- Types of Logistic Regression Models
Background
Motivating Example
Scatter Plot
[Figure: Relationship between Age & CHD; coronary heart disease (0/1) plotted against age (20-70 years). The raw 0/1 points form two horizontal bands.]
Not informative!!
Regression Model: Objective
- Describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables by some regression model (equation).
- Predict some future outcome based on the regression model.
How to model the relationship of CHD with age?
Background
- What distinguishes a logistic regression model from the linear regression model is that the outcome variable is binary (or dichotomous), e.g.:
  - Whether a tumor is malignant (Yes=1) or not (No=0)
  - Whether a newborn baby has low birth weight (Yes=1) or not (No=0)
  - Whether a student gets admission at LUMS (Yes=1) or not (No=0)
For a categorical response variable, the assumption that the errors follow a normal distribution fails.
Tabular Form of CHD Data

Age Group     n    CHD Present   Proportion with CHD
20-29        10      1           0.10
30-34        15      2           0.13
35-39        12      3           0.25
40-44        15      5           0.33
45-49        13      6           0.46
50-54         8      5           0.63
55-59        17     13           0.76
60-69        10      8           0.80
Total       100     43
Proportion of Individuals with CHD
[Figure: Relationship between Age & CHD; proportion with CHD in each age group plotted against age (20-70 years), rising steadily from about 0.10 to 0.80.]
Introduction
Logistic Regression Model
- The response variable in logistic regression is categorical. The linear regression model, i.e., Y = Xβ + ε, does not work well for a few reasons:
  - The response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not exactly of interest.
  - Our interest is in modeling the probability that each individual in the population responds with 0 or 1.
  - The error terms in this case do not follow a normal distribution.
Thus, we might consider modeling P, the probability, as the response variable.
Sigmoid Function
Modeling the probability as the response raises some problems:
- Although the probability generally increases with age, we know that P, like all probabilities, can only fall within the boundaries of 0 and 1.
- It is better to assume that the relationship between age and P is sigmoidal (S-shaped), rather than a straight line.
- It is possible, however, to find a linear relationship between age and a function of P. Although a number of functions work, one of the most useful is the logit function.
Logit Transformation
Logit Function
The logit function ln(p/(1-p)) (also called the log-odds) is simply the log of the ratio of P(Y = 1) to P(Y = 0):

    ln(p/(1-p)) = Xβ

The odds:

    p/(1-p) = exp(Xβ)

Solving,

    p = Pr(Y = 1|X = x) = exp(y) / [1 + exp(y)] = 1 / [1 + exp(-y)]

gives the standard logistic function, where y = Xβ.
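The pair of transformations above can be sketched in a few lines (a Python sketch; the deck's own computations use R):

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def logistic(y):
    """Standard logistic function: p = 1 / (1 + exp(-y)), the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-y))

# Round trip: applying the logistic function to the log-odds recovers p
p = 0.8
y = logit(p)                   # ln(0.8 / 0.2) = ln 4 ≈ 1.386
print(round(logistic(y), 3))   # → 0.8
```

The round trip is the whole point: modeling is done on the unbounded logit scale, and the logistic function maps fitted values back to probabilities.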
Logit Function
g(x) = ln(p/(1-p)) has many of the desirable properties of a linear regression model:
- It may be continuous.
- It is linear in the parameters.
- It has the potential for a range between −∞ and +∞, depending on the range of x.
Summary: Logit Transformation

Quantity               Formula         min    max
Probability            p               0      1
Odds                   p/(1-p)         0      ∞
Logit or 'Log-Odds'    ln(p/(1-p))     −∞     ∞

Logit stretches the probability scale.
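A quick numeric illustration of the stretching (a Python sketch; the values follow directly from the formulas in the table): symmetric probabilities map to symmetric log-odds, and probabilities near 0 or 1 are pushed far out along the real line.

```python
import math

# Probability -> odds -> logit, for a few values of p
for p in (0.1, 0.5, 0.9, 0.99):
    odds = p / (1 - p)
    print(f"p = {p:<5} odds = {odds:8.3f}  logit = {math.log(odds):+.3f}")
```

Note the symmetry: p = 0.1 and p = 0.9 give logits of equal magnitude and opposite sign.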
Assumptions

Linear Regression           Logistic Regression
ε ∼ N(0, σ²)                Y ∼ Bin(p)
Y = Xβ + ε                  ln(p/(1-p)) = Xβ
Y|X ∼ N(Xβ, σ²)             Y|X ∼ Bin(p)
Estimation
Estimation of Parameters of Regression Model: β
- The method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data.
- For logistic regression, the likelihood equations are non-linear in the parameters β and require special methods for their solution.
- These methods are iterative in nature and have been programmed into available logistic regression software.
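The iterations in question are typically Newton-Raphson (equivalently, iteratively reweighted least squares). A minimal pure-Python sketch for a single-predictor model illustrates the idea; the function name is hypothetical, and real software (R's glm, for instance) handles convergence and numerical stability far more carefully:

```python
import math

def fit_logistic(x, y, iters=25):
    """Newton-Raphson for logit(p) = b0 + b1*x (a teaching sketch,
    not production code). Each step solves a 2x2 linear system."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = pi * (1.0 - pi)      # Bernoulli variance, the IRLS weight
            g0 += yi - pi            # gradient of the log-likelihood
            g1 += (yi - pi) * xi
            h00 += w                 # Fisher information (negative Hessian)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta_new = beta + H^{-1} * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

On data with a single binary predictor the fit reproduces the closed-form log-odds, which makes a handy sanity check.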
Example
Example: CHD Data
- Is age a risk factor for CHD? How does the probability of CHD change with age?
- Outcome variable: CHD (Yes, No)
- Predictor: Age (in years)
Logistic regression models the probability of some event occurring as a linear function of a set of predictors.
Analysis
CHD Analysis

    ln(p̂/(1-p̂)) = −5.31 + 0.11·Age

- The coefficient is interpreted as the marginal increase in the log odds of CHD when age increases by 1 year.

              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)     -5.31       1.13       -4.68      0.00
age              0.11       0.02        4.61      0.00

OR = exp(0.11) = 1.116
The odds of getting CHD are · · · · · · when age increases by 1 year.
Fitted Values

    p = exp(β0 + β1·X) / [1 + exp(β0 + β1·X)]
      = exp(−5.31 + 0.11·Age) / [1 + exp(−5.31 + 0.11·Age)]
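The fitted values can be computed directly from this formula (a Python sketch using the fitted coefficients −5.31 and 0.11):

```python
import math

def p_chd(age, b0=-5.31, b1=0.11):
    """Fitted probability of CHD at a given age, from the model above."""
    eta = b0 + b1 * age
    return math.exp(eta) / (1.0 + math.exp(eta))

# The probability crosses 0.5 where b0 + b1*age = 0, i.e. near age 48
print(round(p_chd(30), 2), round(p_chd(48.3), 2), round(p_chd(65), 2))
```

Evaluating at a grid of ages traces out the S-shaped curve shown on the next slide.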
R Software

mod1 <- glm(chd ~ age, family = 'binomial', data = chdage)
summary(mod1)
predict(mod1, type = 'response')
anova(mod1, test = 'Chisq')
plot(mod1)
Predicted Probabilities
[Figure: predicted probabilities of CHD from the fitted model plotted against age (20-70 years); the fitted curve rises in an S-shape.]
How Good is the Fitted Model?
Analysis of Deviance

Model: binomial, link: logit
Terms added sequentially (first to last)

       Df   Deviance   Resid. Df   Resid. Dev   Pr(>Chi)
NULL                      99         136.66
Age     1    29.31        98         107.35     6.168e-08 ***

- Deviance is a measure of goodness of fit of a generalized linear model. Or rather, it's a measure of badness of fit.
- If our new model explains the data better than the null model, there should be a significant reduction in the deviance, which can be tested against the chi-square distribution to give a p-value.
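The p-value in the table can be reproduced from the deviance drop. For 1 degree of freedom, the chi-square upper tail reduces to a complementary error function, so no statistics library is needed (a Python sketch):

```python
import math

null_dev, resid_dev = 136.66, 107.35
drop = null_dev - resid_dev              # 29.31, the deviance explained by Age
# For df = 1: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(drop / 2.0))
print(f"{p_value:.3e}")                  # ≈ 6.2e-08, matching the ANOVA table
```

The tiny p-value is the formal version of the statement above: adding Age reduces the deviance far more than chance alone would.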
Hosmer-Lemeshow Goodness of Fit
How well our model fits depends on the difference between the model and the observed data.

library(ResourceSelection)
hoslem.test(as.numeric(chdage$chd) - 1, fitted(mod1))

R Output:
Hosmer and Lemeshow goodness of fit (GOF) test
data: as.numeric(chdage$chd) - 1, fitted(mod1)
X-squared = 2.2243, df = 8, p-value = 0.9734

Our model appears to fit well because we have no significant difference between the model and the observed data (i.e., the p-value > 0.05).
Single Categorical Predictor
Simple Logistic Regression Model with a Categorical Predictor
- How some function of the probability of a categorical response is linearly related to a predictor
- Interpretation of the resulting intercept β0 & the slope β1 where the predictor variable is also binary
Case-Control Study: A Recap Example

Past exposure   CHD Cases   Controls (without disease)
Smokers            112            176
Non-smokers         88            224
Totals             200            400

Odds of CHD for Smokers = · · ·
Odds of CHD for Non-Smokers = · · ·
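For reference, the odds can be computed directly from the table counts (a Python sketch; within each exposure group, the odds are cases divided by controls):

```python
# Counts from the case-control table above
smoker_cases, smoker_controls = 112, 176
nonsmoker_cases, nonsmoker_controls = 88, 224

odds_smokers = smoker_cases / smoker_controls            # cases : controls among smokers
odds_nonsmokers = nonsmoker_cases / nonsmoker_controls   # cases : controls among non-smokers
odds_ratio = odds_smokers / odds_nonsmokers

print(round(odds_smokers, 3), round(odds_nonsmokers, 3), round(odds_ratio, 2))
# → 0.636 0.393 1.62
```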
Case-Control Study: A Recap Example Cont'd
Let yi be the binary response variable:
- yi = 1 if CHD = yes
- yi = 0 if CHD = no

Past exposure    yi    ni
Smokers         112   288
Non-smokers      88   312

Then yi ∼ Bin(ni, pi).
Let xi be the binary predictor of past smoking:
- xi = 1 if past smoker
- xi = 0 if non-smoker in the past
Case-Control Study: A Recap Example Cont'd
The probability of CHD, pi, can be modeled as:

    logit(pi) = β0 + β1·xi

- If xi = 1, then logit(pi|xi = 1) = β0 + β1·(1)
- If xi = 0, then logit(pi|xi = 0) = β0

    β1 = logit(pi|xi = 1) − logit(pi|xi = 0) = log [odds(xi = 1) / odds(xi = 0)]

∴ OR = · · · · · ·
Example: Logistic Regression

              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)     -0.93       0.13       -7.43      0.00
pastsmoke1       0.48       0.17        2.76      0.01

- For past smokers, xi = 1, so ln(odds of CHD) = β0 + β1; ∴ Odds for smokers = · · ·
- For past non-smokers, xi = 0, so ln(odds of CHD) = β0; ∴ Odds for non-smokers = · · ·
OR = · · ·
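Exponentiating the linear predictor at xi = 0 and xi = 1 gives the two odds, and exp(β1) gives the odds ratio (a Python sketch using the rounded estimates from the output above):

```python
import math

b0, b1 = -0.93, 0.48              # rounded estimates from the fitted model

odds_nonsmokers = math.exp(b0)        # xi = 0
odds_smokers = math.exp(b0 + b1)      # xi = 1
odds_ratio = math.exp(b1)             # = odds_smokers / odds_nonsmokers

print(round(odds_nonsmokers, 2), round(odds_smokers, 2), round(odds_ratio, 2))
# → 0.39 0.64 1.62 (cf. 88/224, 112/176, and their ratio from the 2x2 table)
```

Up to coefficient rounding, these match the odds computed directly from the case-control counts.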
Types of Logistic Regression Models
- Binary Logistic Regression Model: The categorical response is dichotomous (has only two possible outcomes), e.g., an email is Spam or Not.
- Multinomial Logistic Regression Model: Three or more categories without ordering (polytomous response), e.g., predicting food choices (Veg, Non-Veg, Vegan).
- Ordinal Logistic Regression Model: Three or more categories with ordering, e.g., movie rating from 1 to 5, teaching evaluation by students, etc.