05‐11‐2020
Logistic regression
1 Applied Methods in Biostatistics - Week 2 2019
Generalization of the linear regression model
In many practical situations the linear regression model is inadequate.
For example, when:
the outcome has two possible responses (binary data)
or the outcome represents count data (non-negative integers)
it makes no sense to model the outcome as normally distributed.
Generalized linear models (GLMs) extend linear regression to non-normally distributed outcome variables.
Binary outcome variable
In many studies the outcome of interest is the presence or absence of some condition.
Examples:
smoking status
responding to a treatment
presence or absence of cancer
survival status of a subject after a surgery: dead or alive
having myocardial infarction or CHD: yes/no
Success ('yes'/1) and failure ('no'/0) are often used as generic terms for the two possible responses.
The interest is in quantifying the risk or odds of success, i.e. of the occurrence of some event of interest.
Example: Prostatic cancer
A study of 53 prostate cancer patients. Before surgery, two continuous exposure variables (age,
serum acid phosphatase) and three categorical (binary) exposure variables (X-ray, tumour size,
tumour grade) were measured. The patients then had surgery (laparotomy) to determine whether
there was nodal involvement, i.e. lymph node metastases (NI = 1), or not (NI = 0), in order to
adapt the treatment regimen for the patient.
Pat   NI   Age   Acid   Xray    Size      Grade
                        (pos)   (large)   (serious)
1     0    66    0.48   0       0         0
2     0    68    0.56   0       0         0
…
52    1    64    0.89   1       1         0
53    1    68    1.26   1       1         1
Brown, B.W. (1980)
Risk outcome: odds
Studies
• Case-control / Cross sectional
• Cohort: cumulative incidence rate
Simple (exploratory) inference
• Confidence intervals & hypothesis tests
• comparing risks between exposed/unexposed groups
• Test for association (two or more groups)
• Chi-square tests / Fisher's exact test
Logistic regression model
The model is based on:
• Relationship
• logit (p) = log (p/(1-p))
= log-odds (p) = β0 + β1 x1 + β2x2 + … + βkxk
• I.e.: p is not linear in the βs, but logit(p) is
• Data from binomial distribution
Inference similar to linear model
• Allows many categorical & numerical indep. variables
Estimation & inference: computer
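The logit link and its inverse can be sketched in a few lines of Python. This is an illustration only; the function names logit and inv_logit are chosen here and do not come from the slides or from Stata.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability that corresponds to a log-odds value eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Equal steps on the logit scale give unequal steps on the probability scale,
# which is why p is not linear in the coefficients while logit(p) is.
for eta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"logit = {eta:+.1f}  ->  p = {inv_logit(eta):.3f}")
```

The inverse logit always returns a value strictly between 0 and 1, so a linear predictor of any size maps to a valid probability.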
Purposes of logistic regression
Effect estimation
• exp (β1) = OR1 = Effect of variable
• Stata: logistic calculates effect estimates exp (β1) directly!
Prediction:
• Best model for predicting risk p of disease for new cases
• Stata: logit calculates parameter estimates of β0, β1, β2, …
• Rule of thumb: at least 10 cases and 10 controls for each indep. var. in model
Estimation:
Interpretation of the coefficients
Interpretation of the coefficients is similar to linear regression. However, since it is the logit that is linear in the
predictors, the coefficients have the analogous interpretation on the logit (log odds) scale.
Logit (πNI(Xray, Size, Age)) = β0 + β1Xray + β2 Size + β3 Age
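Binary exposure (comparing Xray examination (1 = positive finding, 0 = negative finding) with Size and Age
fixed)
\[
OR_{xray} = \frac{\text{odds}(NI = 1 \mid Xray = 1, Size, Age)}{\text{odds}(NI = 1 \mid Xray = 0, Size, Age)}
= \frac{e^{\beta_0 + \beta_1 \cdot 1 + \beta_2 Size + \beta_3 Age}}{e^{\beta_0 + \beta_2 Size + \beta_3 Age}} = e^{\beta_1}
\]
Continuous exposure variable (comparing Age + 1 with Age, with Xray and Size fixed)
\[
OR_{age} = \frac{\text{odds}(NI = 1 \mid Xray, Size, Age + 1)}{\text{odds}(NI = 1 \mid Xray, Size, Age)}
= \frac{e^{\beta_0 + \beta_1 Xray + \beta_2 Size + \beta_3 (Age + 1)}}{e^{\beta_0 + \beta_1 Xray + \beta_2 Size + \beta_3 Age}} = e^{\beta_3}
\]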
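A short numerical check of this interpretation, using the coefficient estimates from the logit NI Xray Size Age output shown later in these slides. The function names are chosen here for illustration. The point is that the odds ratio does not depend on where the other covariates are held fixed.

```python
import math

# Coefficient estimates copied from the `logit NI Xray Size Age` output.
B0, B_XRAY, B_SIZE, B_AGE = 1.518419, 2.175658, 1.596897, -0.0604558

def odds(xray, size, age):
    """Odds of NI = 1 for a given covariate pattern."""
    return math.exp(B0 + B_XRAY * xray + B_SIZE * size + B_AGE * age)

def or_xray(size, age):
    """Odds ratio for Xray = 1 vs. Xray = 0 with Size and Age held fixed."""
    return odds(1, size, age) / odds(0, size, age)

def or_age(xray, size, age):
    """Odds ratio for a one-year increase in Age with Xray and Size held fixed."""
    return odds(xray, size, age + 1) / odds(xray, size, age)

# Same ratio whatever values the remaining covariates are fixed at.
print(or_xray(0, 50), or_xray(1, 70), math.exp(B_XRAY))
print(or_age(0, 0, 50), or_age(1, 1, 65), math.exp(B_AGE))
```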
Estimation Example (1):
Model: Logit (πNI(Xray, Size, Age)) = Log odds (NI=1|Xray, Size,Age)
= β0 + β1Xray + β2 Size + β3 Age
. logit NI Xray Size Age
Iteration 0: log likelihood = -35.126076
Iteration 1: log likelihood = -26.176433
Iteration 2: log likelihood = -26.042916
Iteration 3: log likelihood = -26.04263
Iteration 4: log likelihood = -26.04263
Logistic regression Number of obs = 53
LR chi2(3) = 18.17
Prob > chi2 = 0.0004
Log likelihood = -26.04263 Pseudo R2 = 0.2586
-------------------------------------------------------------------------
NI | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------+-----------------------------------------------------------------
Xray | 2.175658 .7644116 2.85 0.004 .6774385 3.673877
Size | 1.596897 .7079243 2.26 0.024 .2093913 2.984403
Age | -.0604558 .054447 -1.11 0.267 -.16717 .0462584
_cons | 1.518419 3.22939 0.47 0.638 -4.811069 7.847908
------------------------------------------------------------------------
Estimation Example (2):
Model: Logit (πNI(Xray, Size, Age)) = Log odds (NI=1|Xray, Size,Age)
= β0 + β1Xray + β2 Size + β3 Age
. logistic NI Xray Size Age
Logistic regression Number of obs = 53
LR chi2(3) = 18.17
Prob > chi2 = 0.0004
Log likelihood = -26.04263 Pseudo R2 = 0.2586
-------------------------------------------------------------------------
NI | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
------+------------------------------------------------------------------
Xray | 8.807976 6.732919 2.85 0.004 1.968828 39.40437
Size | 4.937689 3.49551 2.26 0.024 1.232927 19.7747
Age | .9413353 .0512529 -1.11 0.267 .8460557 1.047345
-------------------------------------------------------------------------
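The logistic output can be reproduced from the logit output by exponentiation. The sketch below is in Python rather than Stata and assumes (the slides do not state this) that the standard error of the odds ratio is obtained by the delta method, OR × se(β), and that the confidence limits are the exponentiated limits of the coefficient interval.

```python
import math

# (coefficient, standard error) copied from the `logit NI Xray Size Age` output.
FIT = {
    "Xray": (2.175658, 0.7644116),
    "Size": (1.596897, 0.7079243),
    "Age": (-0.0604558, 0.054447),
}
Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution

def odds_ratio_summary(coef, se):
    """Return (OR, delta-method SE of OR, lower 95% limit, upper 95% limit)."""
    odds_ratio = math.exp(coef)
    return (
        odds_ratio,
        odds_ratio * se,
        math.exp(coef - Z_975 * se),
        math.exp(coef + Z_975 * se),
    )

summary = {name: odds_ratio_summary(c, s) for name, (c, s) in FIT.items()}
for name, (or_, se_or, lo, hi) in summary.items():
    print(f"{name:5s} OR = {or_:.4f}  SE = {se_or:.4f}  95% CI = [{lo:.4f}, {hi:.4f}]")
```

Note that the z statistic and the p-value are identical in both outputs, because the test is carried out on the coefficient scale.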
Inferences - Testing overall regression
Hypotheses: $H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
(e.g., Xray, Size and Age have no predictive value for nodal involvement)
The likelihood ratio (LR) statistic compares two models:
1. minimal model = logistic regression model under $H_0$ (intercept only)
2. full model = logistic regression model taking into account (all) the exposure variables of interest
For each model the maximized likelihood L is calculated:
1. $L_m := L(\hat\beta_0)$ for the minimal model
2. $L_f := L(\hat\beta_0, \hat\beta_1, \hat\beta_2, \dots, \hat\beta_k)$ for the full model
Likelihood ratio statistic
\[
LR = 2\{\log(L_f) - \log(L_m)\} = 2 \log\left(\frac{L_f}{L_m}\right) \sim \chi^2 \text{ distributed with } k \text{ degrees of freedom under } H_0
\]
Estimation Example (overall test):
Model: Logit (πNI(Xray, Size, Age)) = Log odds (NI=1|Xray, Size,Age)
= β0 + β1Xray + β2 Size + β3 Age
. logistic NI Xray Size Age
Logistic regression Number of obs = 53
LR chi2(3) = 18.17
Prob > chi2 = 0.0004
Log likelihood = -26.04263 Pseudo R2 = 0.2586
-------------------------------------------------------------------------
NI | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
------+------------------------------------------------------------------
Xray | 8.807976 6.732919 2.85 0.004 1.968828 39.40437
Size | 4.937689 3.49551 2.26 0.024 1.232927 19.7747
Age | .9413353 .0512529 -1.11 0.267 .8460557 1.047345
-------------------------------------------------------------------------
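The header values LR chi2(3), Prob > chi2 and Pseudo R2 can be checked by hand. The sketch below is in Python and assumes that the iteration 0 log likelihood in the logit output (-35.126076) is the log likelihood of the minimal (intercept-only) model, and that the pseudo R2 is 1 − log(Lf)/log(Lm). The helper chi2_sf_3df is a name chosen here; it uses the closed form of the chi-square tail probability, which holds for 3 degrees of freedom only.

```python
import math

LL_MIN = -35.126076   # iteration 0 log likelihood, assumed to be the minimal model
LL_FULL = -26.04263   # final log likelihood of the full model

# Likelihood ratio statistic LR = 2{log(Lf) - log(Lm)}
lr = 2.0 * (LL_FULL - LL_MIN)

def chi2_sf_3df(x):
    """Upper tail probability P(X > x) for a chi-square variable with 3 df."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

p_value = chi2_sf_3df(lr)
pseudo_r2 = 1.0 - LL_FULL / LL_MIN

print(f"LR chi2(3)  = {lr:.2f}")
print(f"Prob > chi2 = {p_value:.4f}")
print(f"Pseudo R2   = {pseudo_r2:.4f}")
```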
Inferences - Wald-test
Which factors had a significant effect on the dependent variable adjusted for all the other
independent variables?
Hypotheses: $H_0: \beta_i = 0$ vs. $H_1: \beta_i \neq 0$
Test statistic (N(0,1)-distributed):
\[
Z = \frac{\hat\beta_i}{se(\hat\beta_i)} \sim N(0,1)
\]
with $Z^2 \sim \chi^2$-distributed, degrees of freedom = 1
Estimation Example (Wald test):
Model: Logit (πNI(Xray, Size, Age)) = Log odds (NI=1|Xray, Size,Age)
= β0 + β1Xray + β2 Size + β3 Age
. logistic NI Xray Size Age
Logistic regression Number of obs = 53
LR chi2(3) = 18.17
Prob > chi2 = 0.0004
Log likelihood = -26.04263 Pseudo R2 = 0.2586
-------------------------------------------------------------------------
NI | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
------+------------------------------------------------------------------
Xray | 8.807976 6.732919 2.85 0.004 1.968828 39.40437
Size | 4.937689 3.49551 2.26 0.024 1.232927 19.7747
Age | .9413353 .0512529 -1.11 0.267 .8460557 1.047345
-------------------------------------------------------------------------
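The z and P>|z| columns follow directly from the coefficient and its standard error on the log odds scale. A minimal sketch in Python, with the estimates copied from the logit NI Xray Size Age output; the function name wald_test is chosen here for illustration.

```python
import math

# (coefficient, standard error) copied from the `logit NI Xray Size Age` output.
FIT = {
    "Xray": (2.175658, 0.7644116),
    "Size": (1.596897, 0.7079243),
    "Age": (-0.0604558, 0.054447),
}

def wald_test(coef, se):
    """Return the Wald z statistic and its two-sided N(0,1) p-value."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # P(|Z| > |z|) for Z ~ N(0,1)
    return z, p

results = {name: wald_test(c, s) for name, (c, s) in FIT.items()}
for name, (z, p) in results.items():
    print(f"{name:5s} z = {z:6.2f}   P>|z| = {p:.3f}")
```

Referring Z² to a chi-square distribution with 1 degree of freedom gives the same p-value.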
05‐11‐2020
Maximum likelihood estimation
The idea: determine the parameter values that maximize the probability
(likelihood) of the observed sample data.
From a statistical point of view, the method of maximum likelihood is widely applicable
and yields estimators with good statistical properties (under regularity conditions they are
consistent and asymptotically normal).
It also provides an efficient way of quantifying uncertainty through confidence bounds.
Although the idea of maximum likelihood estimation is simple, the
implementation is mathematically intensive. With today's computing power,
however, this complexity is not a big obstacle.
Maximizing the likelihood function L(ϑ) is equivalent to maximizing the
log-likelihood function l(ϑ).
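A minimal sketch of the idea for the simplest case, the minimal model with an intercept only, fitted by Newton-Raphson iterations on the log-likelihood. It is written in Python and is not a description of Stata's internal algorithm. The count of 20 patients with NI = 1 out of 53 is an assumption, not stated in the slides; it is used because it reproduces the iteration 0 log likelihood of -35.126076 in the logit output.

```python
import math

def log_likelihood(beta0, successes, n):
    """Binomial log-likelihood of an intercept-only logistic model."""
    p = 1.0 / (1.0 + math.exp(-beta0))
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

def fit_intercept(successes, n, tol=1e-10, max_iter=50):
    """Maximize the log-likelihood over beta0 by Newton-Raphson."""
    beta0 = 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-beta0))
        score = successes - n * p        # first derivative of the log-likelihood
        information = n * p * (1.0 - p)  # minus the second derivative
        step = score / information
        beta0 += step
        if abs(step) < tol:
            break
    return beta0

# ASSUMPTION: 20 of the 53 patients have NI = 1 (not given in the slides).
N_SUCCESS, N_OBS = 20, 53
beta0_hat = fit_intercept(N_SUCCESS, N_OBS)
print(f"beta0_hat = {beta0_hat:.6f}")
print(f"log likelihood = {log_likelihood(beta0_hat, N_SUCCESS, N_OBS):.6f}")
```

For this one-parameter model the maximum has a closed form, the log odds of the observed proportion, which makes the iteration easy to check. With several covariates the same scheme is applied to a vector of coefficients.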
Estimation Example (ML estimation):
Model: Logit (πNI(Xray, Size, Age)) = Log odds (NI=1|Xray, Size,Age)
= β0 + β1Xray + β2 Size + β3 Age
. logistic NI Xray Size Age
Logistic regression Number of obs = 53
LR chi2(3) = 18.17
Prob > chi2 = 0.0004
Log likelihood = -26.04263 Pseudo R2 = 0.2586
-------------------------------------------------------------------------
NI | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
------+------------------------------------------------------------------
Xray | 8.807976 6.732919 2.85 0.004 1.968828 39.40437
Size | 4.937689 3.49551 2.26 0.024 1.232927 19.7747
Age | .9413353 .0512529 -1.11 0.267 .8460557 1.047345
-------------------------------------------------------------------------
Prediction
The logistic regression approach is suitable for predicting the success probability, i.e. the outcome risk,
for new cases given their exposure values.
Example: Prostatic cancer
\[
\text{Logit}(\pi_{NI}(Xray, Size, Age)) = \beta_0 + \beta_1 Xray + \beta_2 Size + \beta_3 Age
= 1.52 + 2.18\,Xray + 1.60\,Size - 0.06\,Age
\]
Xray   Size   Age   logit(π_NI)   π_NI = P(NI = 1)
0      0      68    -2.56         0.072
1      0      68    -0.38         0.406
0      1      51     0.06         0.515
1      1      57     1.88         0.868
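The predicted risks can be recomputed from the rounded coefficients of the fitted equation by applying the inverse logit, p = 1 / (1 + exp(−logit)). A minimal sketch in Python; the function name predict is chosen here for illustration.

```python
import math

# Rounded coefficients of the fitted prediction equation in the slides.
B0, B_XRAY, B_SIZE, B_AGE = 1.52, 2.18, 1.60, -0.06

def predict(xray, size, age):
    """Return (logit, predicted probability of NI = 1) for one patient."""
    eta = B0 + B_XRAY * xray + B_SIZE * size + B_AGE * age
    return eta, 1.0 / (1.0 + math.exp(-eta))

# Covariate patterns (Xray, Size, Age) from the prediction table.
patterns = [(0, 0, 68), (1, 0, 68), (0, 1, 51), (1, 1, 57)]
predictions = [predict(*pattern) for pattern in patterns]

for (xray, size, age), (eta, p) in zip(patterns, predictions):
    print(f"Xray={xray} Size={size} Age={age}  logit={eta:6.2f}  P(NI=1)={p:.3f}")
```

A positive logit always corresponds to a risk above 0.5 and a negative logit to a risk below 0.5, which is a quick plausibility check for any row.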