Generalized linear regression models
Lecture 3
September 14, 2017
Generalized linear models (GLM)
• Generalized linear models (GLM) extend ordinary
regression to non-normal response distributions.
• The response distribution comes from the exponential
family of distributions
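• A GLM has 3 components: a random component (the
distribution of the response), a systematic component
(the linear predictor built from the explanatory
variables), and a link function that connects the two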
Why do we use GLMs?
• Linear regression assumes that the response is
normally distributed
• GLMs let us model a linear relationship between the
predictor variables and a function of the mean of the
response variable when it is not reasonable to assume
the data are normally distributed
Form
$$ g(E(Y_i)) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} $$
• The function g(·) is called the “link function”: it links
E(Yi) to the set of explanatory variables and their
coefficients
• For example, if we let g(x) = x (the identity link) and Y is
normally distributed, then we are back to the linear
model
• For a GLM, you generally have the flexibility to choose
whichever link you want
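The role of the link function can be sketched in a few lines of Python. This is only an illustration: the names `LINKS` and `mean_from_predictor` are hypothetical, and the three links shown (identity, logit, log) are the ones used by the models in this lecture.

```python
import math

# Each entry holds (g, g_inverse): g maps the mean E(Y) to the scale of the
# linear predictor, and g_inverse maps the linear predictor back to the mean.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "log": (math.log, math.exp),
}

def mean_from_predictor(link_name, beta0, beta1, x):
    """Return E(Y) = g^{-1}(beta0 + beta1 * x) for the chosen link."""
    _, g_inverse = LINKS[link_name]
    return g_inverse(beta0 + beta1 * x)

# Illustrative (made-up) coefficients: the same linear predictor gives a
# different mean depending on the link.
for name in LINKS:
    print(name, mean_from_predictor(name, 0.5, 0.2, 3.0))
```

With the identity link the mean is the linear predictor itself, which is the ordinary linear model; the logit link keeps the mean between 0 and 1, and the log link keeps it positive.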
Random Component
• Conditionally normally distributed response with constant
standard deviation: standard regression models
• Binary outcomes (success or failure): the random
component has a binomial distribution and the model is
called logistic regression
• Count data (number of events in a fixed area and/or length
of time): the random component has a Poisson distribution
and the model is called Poisson regression
Logistic Regression
• Logistic Regression - Dichotomous Response
variable and numeric and/or categorical explanatory
variable(s)
– Model the probability of a particular outcome as a function of
the predictor variable(s)
– Probabilities are bounded between 0 and 1
• Distribution of Responses: Binomial
Logistic regression
• Consider a binary response variable
• Variable with two outcomes: One outcome
represented by a 1 and the other represented
by a 0
• Examples:
Does the person have a disease? Yes or No
Outcome of a baseball game? Win or loss
Logistic Regression with 1 Predictor
• Response - Presence/Absence of characteristic
• Predictor - Numeric variable observed for each case
• Model - p(x) = probability of presence at predictor level x
$$ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} $$
• $\beta_1 = 0$: P(Presence) is the same at each level of x
• $\beta_1 > 0$: P(Presence) increases as x increases
• $\beta_1 < 0$: P(Presence) decreases as x increases
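How the sign of the slope shapes the curve can be checked numerically. The sketch below evaluates the one-predictor logistic model on a small grid; the function name `presence_probability` and the coefficient values are made up for illustration.

```python
import math

def presence_probability(x, beta0, beta1):
    """P(presence) at predictor level x under the one-predictor logistic model."""
    eta = beta0 + beta1 * x
    return math.exp(eta) / (1 + math.exp(eta))

xs = [-2, -1, 0, 1, 2]

# Illustrative slopes: zero, positive, negative (intercept fixed at 0).
for beta1 in (0.0, 1.0, -1.0):
    probs = [round(presence_probability(x, 0.0, beta1), 3) for x in xs]
    print(f"beta1 = {beta1:+.1f}: {probs}")
```

The printed rows are constant for a zero slope, rising for a positive slope, and falling for a negative slope, and every value stays between 0 and 1.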
Logistic Regression with 1 Predictor
• $\beta_0$, $\beta_1$ are unknown parameters and must be
estimated using statistical software such as R or
Stata
• Main interest is in estimating and testing hypotheses
regarding $\beta_1$
• Large-sample test (Wald test):
– $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$
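In practice the estimates and the Wald test come from statistical software. To show what such software does internally, here is a minimal pure-Python sketch, assuming a Newton-Raphson maximum likelihood fit (the lecture does not name the fitting algorithm) and a small made-up data set. The names `fit_logistic`, `information`, and `wald_test` are hypothetical.

```python
import math

def information(xs, b0, b1):
    """Entries (i00, i01, i11) of the 2x2 Fisher information at (b0, b1)."""
    ws = []
    for x in xs:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        ws.append(p * (1 - p))
    i00 = sum(ws)
    i01 = sum(w * x for w, x in zip(ws, xs))
    i11 = sum(w * x * x for w, x in zip(ws, xs))
    return i00, i01, i11

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson fit; returns (b0, b1, standard error of b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Score vector (gradient of the log-likelihood).
        u0 = sum(y - p for y, p in zip(ys, ps))
        u1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        i00, i01, i11 = information(xs, b0, b1)
        det = i00 * i11 - i01 * i01
        # Newton step: beta <- beta + I^{-1} U, with the 2x2 inverse written out.
        b0, b1 = (b0 + (i11 * u0 - i01 * u1) / det,
                  b1 + (i00 * u1 - i01 * u0) / det)
    # Var(b1) is the lower-right entry of the inverse information matrix.
    i00, i01, i11 = information(xs, b0, b1)
    se_b1 = math.sqrt(i00 / (i00 * i11 - i01 * i01))
    return b0, b1, se_b1

def wald_test(b1, se_b1):
    """Wald z statistic and two-sided large-sample p-value for H0: beta1 = 0."""
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Made-up data for illustration only: x is numeric, y is presence (1) / absence (0).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1, se_b1 = fit_logistic(xs, ys)
z, p_value = wald_test(b1, se_b1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SE(b1) = {se_b1:.3f}")
print(f"Wald z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

The Wald statistic is the estimated slope divided by its standard error, compared against a standard normal distribution. With only eight observations the large-sample approximation is rough, so the p-value here is illustrative rather than trustworthy.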
Odds Ratio
• Interpretation of regression coefficient ($\beta$):
– In linear regression, the slope coefficient is the change in the
mean response as x increases by 1 unit
– In logistic regression, we can show that:
$$ \frac{\text{odds}(x+1)}{\text{odds}(x)} = e^{\beta} $$
• Thus $e^{\beta}$ represents the multiplicative change in the odds of
the outcome when x increases by 1 unit
• If $\beta = 0$, the odds and probability are the same at all x levels ($e^{\beta} = 1$)
• If $\beta > 0$, the odds and probability increase as x increases ($e^{\beta} > 1$)
• If $\beta < 0$, the odds and probability decrease as x increases ($e^{\beta} < 1$)
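The "we can show that" step follows directly from the one-predictor model. As a short derivation, write the intercept as $\beta_0$ and the slope as $\beta$, and define the odds as the probability of the outcome divided by the probability of its complement:

$$
\begin{aligned}
p(x) &= \frac{e^{\beta_0 + \beta x}}{1 + e^{\beta_0 + \beta x}},
\qquad 1 - p(x) = \frac{1}{1 + e^{\beta_0 + \beta x}} \\
\text{odds}(x) &= \frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta x} \\
\frac{\text{odds}(x+1)}{\text{odds}(x)} &= \frac{e^{\beta_0 + \beta (x+1)}}{e^{\beta_0 + \beta x}} = e^{\beta}
\end{aligned}
$$

The ratio does not depend on x, which is why a single number summarizes the effect of a 1-unit increase at every predictor level.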
Multiple Logistic Regression
• Extension to more than one predictor variable (either numeric or
dummy variables).
• With k predictors, the model is written:
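$$ p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}} $$
• Adjusted odds ratio for raising $x_i$ by 1 unit, holding
all other predictors constant:
$$ OR_i = e^{\beta_i} $$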
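The adjusted odds ratio can be verified numerically: raise one predictor by 1 unit, hold the others fixed, and compare the odds. The sketch below uses made-up coefficients, and the function names are hypothetical.

```python
import math

def predict_probability(beta0, betas, xs):
    """p = e^eta / (1 + e^eta) with eta = beta0 + sum_j beta_j * x_j."""
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds of the outcome: p / (1 - p)."""
    return p / (1 - p)

def adjusted_odds_ratios(betas):
    """OR_i = e^{beta_i} for each predictor."""
    return [math.exp(b) for b in betas]

# Made-up coefficients for two predictors (illustration only).
beta0, betas = -1.0, [0.5, -0.25]

base = predict_probability(beta0, betas, [2.0, 1.0])
raised = predict_probability(beta0, betas, [3.0, 1.0])  # x1 raised by 1, x2 held fixed

print("ratio of odds:", odds(raised) / odds(base))
print("e^beta_1:     ", adjusted_odds_ratios(betas)[0])
```

The two printed numbers agree up to floating-point error, whatever values the other predictors are held at.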
Testing Regression Coefficients
• Testing the overall model:
$$ H_0: \beta_1 = \dots = \beta_k = 0 $$
$$ H_A: \text{not all } \beta_i = 0 $$
Poisson Regression
• Consider a count response variable
• Response variable is the number of occurrences
in a given time frame
• Outcomes equal to 0, 1, 2, ….
• Examples:
Number of penalties during a football game
Number of customers who shop at a store on a given
day
Number of car accidents at an intersection
Poisson Regression
• Generally used to model Count data
• Distribution: Poisson
• Link Function: the log link
• One particular feature: E(Y) = Var(Y) = μ
$$ g(\mu) = \ln(\mu) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k $$
$$ \mu(X_1, \dots, X_k) = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k} $$
• The X’s are variables that might affect the mean
value
Example: If the count variable is the number of visits to
a museum in a given year, then the X’s can be
variables such as income, admission price, parking
fees, etc.
• Tests are conducted as in logistic regression
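The log link and the mean-variance property can both be illustrated with a few lines of Python. This is a sketch under stated assumptions: the coefficients and predictor values are made up, and the function names are hypothetical.

```python
import math

def poisson_mean(beta0, betas, xs):
    """mu = exp(beta0 + sum_j beta_j * x_j), the inverse of the log link."""
    return math.exp(beta0 + sum(b * x for b, x in zip(betas, xs)))

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson random variable with mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

# Made-up coefficients for two predictors (illustration only).
beta0, betas = 0.5, [0.2, -0.1]
mu = poisson_mean(beta0, betas, [3.0, 2.0])

# Mean and variance computed from the probability mass function; the sum is
# truncated at 60 because the remaining tail is negligible for this mu.
mean = sum(y * poisson_pmf(y, mu) for y in range(60))
variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in range(60))
print(f"mu = {mu:.4f}, mean = {mean:.4f}, variance = {variance:.4f}")

# Raising X1 by 1 unit multiplies the mean by e^{beta_1}.
mu_raised = poisson_mean(beta0, betas, [4.0, 2.0])
print("ratio of means:", mu_raised / mu, " e^beta_1:", math.exp(betas[0]))
```

Because the model is linear on the log scale, each coefficient acts multiplicatively on the mean count, which parallels the odds ratio interpretation in logistic regression.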