Logistic Regression: Statnotes, from North Carolina State University, http://www2.chass.ncsu.edu/garson/PA765/logistic.htm
Logistic Regression
Overview
Binomial (or binary) logistic regression is a form of regression which is used when the dependent
is a dichotomy and the independents are of any type. Multinomial logistic regression exists to
handle the case of dependents with more classes than two. When multiple classes of the dependent
variable can be ranked, then ordinal logistic regression is preferred to multinomial logistic
regression. Continuous variables are not used as dependents in logistic regression. Unlike logit
regression, there can be only one dependent variable.
Logistic regression can be used to predict a dependent variable on the basis of continuous and/or categorical independents; to determine the percent of variance in the dependent variable explained by the independents; to rank the relative importance of independents; to assess interaction effects; and to understand the impact of covariate control variables.
Logistic regression applies maximum likelihood estimation after transforming the dependent into a
logit variable (the natural log of the odds of the dependent occurring or not). In this way, logistic
regression estimates the probability of a certain event occurring. Note that logistic regression
calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS
regression does.
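A minimal sketch of the logit transform and its inverse, in Python with a hypothetical probability value:

    import numpy as np

    def logit(p):
        # natural log of the odds that the event occurs
        return np.log(p / (1.0 - p))

    def inverse_logit(z):
        # convert a logit (log odds) back into a probability
        return 1.0 / (1.0 + np.exp(-z))

    p = 0.75                      # hypothetical probability of the event
    print(logit(p))               # ln(.75/.25) = ln(3), about 1.099
    print(inverse_logit(1.099))   # about .75, recovering the probability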
Logistic regression has many analogies to OLS regression: logit coefficients correspond to b coefficients in the logistic regression equation, the standardized logit coefficients correspond to beta weights, and a pseudo R² statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements. It does, however, require that observations be independent and that the independent variables be linearly related to the logit of the dependent. The success of the logistic regression can be assessed by looking at the classification table, showing correct and incorrect classifications of the dichotomous, ordinal, or polytomous dependent. Also, goodness-of-fit tests such as model chi-square are available as indicators of model appropriateness, as is the Wald statistic to test the significance of individual independent variables.
In SPSS, binomial logistic regression is under Analyze - Regression - Binary Logistic, and the multinomial version is under Analyze - Regression - Multinomial Logistic. Logit regression, discussed separately, is another related option in SPSS for using loglinear methods to analyze one or more dependents. Where both are applicable, logit regression gives numerically equivalent results to logistic regression, but with different output options. For the same class of problems, logistic regression has become more popular among social scientists.
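For readers working outside SPSS or SAS, a minimal Python sketch of the same kind of binary model, assuming the statsmodels library and simulated data (the variable names and values are hypothetical):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)                  # continuous independent
    x2 = rng.integers(0, 2, size=200)          # dichotomous independent
    true_logit = -0.5 + 1.2 * x1 + 0.8 * x2
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))  # 0/1 dependent

    X = sm.add_constant(np.column_stack([x1, x2]))
    result = sm.Logit(y, X).fit()              # maximum likelihood estimation
    print(result.summary())                    # coefficients, Wald z tests, log likelihood
    print(np.exp(result.params))               # odds ratios, analogous to SPSS Exp(B)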
Key Terms and Concepts
• Design variables are nominal or ordinal independents (factors) entered as dummy variables. In SPSS binomial logistic regression, categorical independent variables must be declared by clicking on the "Categorical" button in the Logistic Regression dialog box. SPSS multinomial logistic regression will convert categorical variables to dummies automatically by leaving out the last category, which becomes the reference category. Researchers may prefer to create dummy variables manually so as to control which category is omitted and thus becomes the reference category.
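A sketch of dummy coding in Python with pandas, using a hypothetical Region variable, which makes the omitted reference category explicit:

    import pandas as pd

    df = pd.DataFrame({"region": ["East", "West", "North", "East", "North"]})

    # drop_first=True omits one category, which becomes the reference category
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
    print(dummies)

    # To control which category is omitted, create the dummies yourself;
    # here "West" is treated as the reference category
    df["region_East"] = (df["region"] == "East").astype(int)
    df["region_North"] = (df["region"] == "North").astype(int)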
• Covariates are interval independents in most programs. However, in the SPSS dialog all independents are entered as covariates; one then clicks the Categorical button in the Logistic Regression dialog to declare any of those entered as categorical.
• Interaction terms. As in OLS regression, one can add interaction terms to the model (ex., age*income). For continuous covariates, one simply creates a new variable which is the product of two existing ones. For categorical variables, one also multiplies, but the products are of two category codes as shown in the Categorical Variables Codings table of SPSS output (ex., race(1)*religion(2)). The codes will be 1's and 0's, so most of the products in the new variables will be 0's unless you recode some products (ex., setting 0*0 to 1).
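A sketch of building interaction terms by multiplication (Python/pandas; the variables and codes are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "age": [25, 40, 60],
        "income": [30, 55, 80],   # continuous covariates
        "race_1": [1, 0, 1],      # 0/1 dummy code for one level of race
        "religion_2": [0, 1, 1],  # 0/1 dummy code for one level of religion
    })

    # continuous * continuous: simple product of the two variables
    df["age_x_income"] = df["age"] * df["income"]

    # categorical * categorical: product of two 0/1 dummy codes, so most
    # values are 0 unless both dummies equal 1
    df["race1_x_religion2"] = df["race_1"] * df["religion_2"]
    print(df)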
• Maximum likelihood estimation, MLE, is the method used to calculate the logit coefficients. This contrasts with the use of ordinary least squares (OLS) estimation of coefficients in regression. OLS seeks to minimize the sum of squared distances of the data points to the regression line. MLE seeks to maximize the log likelihood, LL, which reflects how likely it is (the odds) that the observed values of the dependent may be predicted from the observed values of the independents.
MLE is an iterative algorithm which starts with an initial arbitrary "guesstimate" of what the logit coefficients should be; the algorithm then determines the direction and size of the change in the logit coefficients which will increase LL. After this initial function is estimated, the residuals are tested and a re-estimate is made with an improved function, and the process is repeated (usually about a half-dozen times) until convergence is reached (that is, until LL does not change significantly). There are several alternative convergence criteria.
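A bare-bones sketch of this iteration in Python/NumPy (the standard Newton-Raphson update for the logistic log likelihood, not SPSS's exact implementation):

    import numpy as np

    def fit_logit(X, y, tol=1e-8, max_iter=25):
        # Newton-Raphson MLE for logistic regression; X includes a constant column
        b = np.zeros(X.shape[1])                 # initial guess for the logit coefficients
        old_ll = -np.inf
        for _ in range(max_iter):
            p = 1.0 / (1.0 + np.exp(-X @ b))     # predicted probabilities
            ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # log likelihood
            if abs(ll - old_ll) < tol:           # stop when LL no longer changes
                break
            old_ll = ll
            W = p * (1.0 - p)                    # weights from the current estimates
            gradient = X.T @ (y - p)
            hessian = X.T @ (X * W[:, None])
            b = b + np.linalg.solve(hessian, gradient)  # direction and size of the change
        return b, ll

With well-behaved data this typically converges in roughly a half-dozen iterations, matching the description above.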
• Ordinal and multinomial logistic regression are extensions of binary logistic regression that allow the simultaneous comparison of more than one contrast. That is, the log odds of three or more contrasts are estimated simultaneously (ex., the probability of A vs. B, A vs. C, B vs. C, etc.).
• SPSS and SAS. In SPSS, select Analyze, Regression, Binary (or Multinomial) Logistic; select the dependent and the covariates; Continue; OK. SAS's PROC CATMOD computes both simple and multinomial logistic regression, whereas PROC LOGIST is for simple (dichotomous) logistic regression. CATMOD uses a conventional model command: ex., model wsat*supsat*qman=_response_ / nogls ml;. Note that in the model command, nogls suppresses generalized least squares estimation and ml specifies maximum likelihood estimation.
• Significance Tests
- Log likelihood: A "likelihood" is a probability, specifically the probability that the observed values of the dependent may be predicted from the observed values of the independents. Like any probability, the likelihood varies from 0 to 1. The log likelihood (LL) is its log and varies from 0 to minus infinity (it is negative because the log of any number less than 1 is negative). LL is calculated through iteration, using maximum likelihood estimation (MLE). Log likelihood is the basis for tests of a logistic model.
- The likelihood ratio is a function of log likelihood. Because -2LL has approximately a chi-square distribution, -2LL can be used for assessing the significance of logistic regression, analogous to the use of the sum of squared errors in OLS regression. The -2LL statistic is the likelihood ratio. It is also called goodness of fit, deviance chi-square, scaled deviance, deviation chi-square, or L-squared. It reflects the significance of the unexplained variance in the dependent. In SPSS output, this statistic is found in the "-2 Log Likelihood" column of the "Iteration History" table or the "Likelihood Ratio Tests" table. The likelihood ratio is not used directly in significance testing, but it is the basis for the likelihood ratio test, which is the test of the difference between two likelihood ratios (two -2LL's), as discussed below. In general, as the model becomes better, -2LL will decrease in magnitude.
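A sketch of the -2LL computation (Python/NumPy; the observed values and model-predicted probabilities are hypothetical):

    import numpy as np

    y = np.array([1, 0, 0, 1, 1])             # observed 0/1 dependent
    p = np.array([0.8, 0.3, 0.1, 0.6, 0.9])   # model-predicted probabilities

    log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    deviance = -2.0 * log_likelihood          # -2LL: smaller means a better-fitting model
    print(deviance)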
- The likelihood ratio test, also called the log-likelihood test, is based on -2LL (deviance). The likelihood ratio test is a test of the significance of the difference between the likelihood ratio (-2LL) for the researcher's model and the likelihood ratio for a reduced model. This difference is called "model chi-square." The likelihood ratio test is generally preferred over its alternative, the Wald test, discussed below. There are four main forms of the likelihood ratio test:
1. Test of the overall model. When the reduced model is the baseline model with the constant only, the likelihood ratio test tests the significance of the researcher's model as a whole. A well-fitting model is significant at the .05 level or better, meaning the researcher's model is significantly different from the one with the constant only. The likelihood ratio test appears in the "Model Fitting Information" table in SPSS output.

Thus the likelihood ratio test of a model tests the difference between -2LL for the full model and -2LL (initial chi-square) for the null model. This is called the model chi-square test. The null model, also called the initial model, is logit(p) = the constant. That is, initial chi-square is -2LL for the model which accepts the null hypothesis that all the b coefficients are 0. This implies that none of the independents are linearly related to the log odds of the dependent. Model chi-square thus tests the null hypothesis that all population logistic regression coefficients except the constant are zero. It is an overall model test which does not assure that every independent is significant.
Warning: If the log-likelihood test statistic shows a small p value (<=.05) for a model with a large effect size, ignore contrary findings based on the Wald statistic discussed below, as the Wald statistic is biased toward Type II errors in such instances; instead assume good model fit overall.

Degrees of freedom in this test equal the number of terms in the model minus 1 (for the constant). This is the same as the difference in the number of terms between the two models, since the null model has only one term. Model chi-square measures the improvement in fit that the explanatory variables make compared to the null model. Model chi-square is a likelihood ratio test which reflects the difference between error not knowing the independents (initial chi-square) and error when the independents are included in the model (deviance). When probability (model chi-square) <= .05, we reject the null hypothesis that knowing the independents makes no difference in predicting the dependent in logistic regression. (A computational sketch of this chi-square difference test follows this list.)
2. Testing for interaction effects. A common use of the likelihood ratio test is to test the difference between a full model and a reduced model dropping an interaction effect. If model chi-square (which is -2LL for the reduced model minus -2LL for the full model) is significant, then the interaction effect is contributing significantly to the full model and should be retained.
3. Test of individual model parameters. The likelihood ratio test assesses the overall logistic model but does not tell us if particular independents are more important than others. This can be done, however, by comparing the difference in -2LL for the overall model with a nested model which drops one of the independents. We can use the likelihood ratio test to drop one variable from the model to create a nested reduced model. In this situation, the likelihood ratio test tests if the logistic regression coefficient for the dropped variable can be treated as 0, thereby justifying dropping the variable from the model. A nonsignificant likelihood ratio test indicates no difference between the full and the reduced models, hence justifying dropping the given variable so as to have a more parsimonious model that works just as well. Note that the likelihood ratio test of individual parameters is a better criterion than the alternative Wald statistic when considering which variables to drop from the logistic regression model. In SPSS output, the "Likelihood Ratio Tests" table contains the likelihood ratio tests of individual model parameters.
4. Tests for model refinement. In general, the likelihood ratio test can be used to test the difference between a given model and any nested model which is a subset of the given model. It cannot be used to compare two non-nested models. Chi-square is the difference in likelihood ratios (-2LL) for the two models, and degrees of freedom is the difference in degrees of freedom for the two models. If the computed chi-square is equal to or greater than the critical value of chi-square (in a chi-square table) for the given df, then the models are significantly different.
If the difference is significant, then the researcher concludes that the variables dropped in the nested model do matter significantly in predicting the dependent. If the difference is below the critical value, there is a finding of non-significance and the researcher concludes that dropping the variables makes no difference in prediction, and for reasons of parsimony the variables are dropped from the model. That is, chi-square difference can be used to help decide which variables to drop from or add to the model. This is discussed further in the next section.
- Chi-square (Hosmer-Lemeshow) test of goodness of fit. If the chi-square goodness of fit test is not significant, then the model has adequate fit. By the same token, if the test is significant, the model does not adequately fit the data. This test appears in SPSS multinomial logistic regression output in the "Goodness of Fit" table, with a Pearson chi-square and a deviance (likelihood ratio) chi-square version (the two are usually close). One must check "Goodness of fit" under the Statistics button. This test is preferred over classification tables when assessing model fit.
- Hosmer and Lemeshow's goodness of fit test, not to be confused with a similarly named, obsolete goodness of fit test discussed below, is another name for a chi-square goodness of fit test. It is available under the Options button in the SPSS binary logistic regression dialog. The test divides subjects into deciles based on predicted probabilities, then computes a chi-square from observed and expected frequencies. Then a probability (p) value is computed from the chi-square distribution with 8 degrees of freedom to test the fit of the logistic model. If the p value for the H-L goodness-of-fit statistic is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show nonsignificance on the H-L goodness-of-fit test, indicating model prediction is not significantly different from observed values. This does not mean that the model necessarily explains much of the variance in the dependent, only that however much or little it does explain is significant. As the sample size gets large, the H-L statistic can find smaller and smaller differences between observed and model-predicted values to be significant. On the other hand, the H-L statistic assumes sampling adequacy, with a rule of thumb being enough cases so that no group has an expected value < 1 and 95% of cells (typically, 10 decile groups times 2 outcome categories = 20 cells) have an expected frequency > 5. Collapsing groups may not solve a sampling adequacy problem, since when the number of groups is small, the H-L test will be biased toward nonsignificance (will overestimate model fit).
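A sketch of the Hosmer-Lemeshow computation (Python, assuming numpy and scipy; SPSS's own implementation handles ties and group boundaries in its own way):

    import numpy as np
    from scipy.stats import chi2

    def hosmer_lemeshow(y, p, groups=10):
        # y: observed 0/1 outcomes; p: model-predicted probabilities
        order = np.argsort(p)                       # sort cases by predicted probability
        y, p = y[order], p[order]
        hl = 0.0
        for idx in np.array_split(np.arange(len(y)), groups):  # decile groups of risk
            n = len(idx)
            observed = y[idx].sum()                 # observed events in the group
            expected = p[idx].sum()                 # expected events in the group
            pbar = expected / n
            hl += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
        df = groups - 2                             # 10 decile groups -> 8 df
        return hl, chi2.sf(hl, df)                  # p > .05 suggests adequate fit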
- More about likelihood ratio tests of chi-square difference between nested models. Block chi-square is a synonym for the likelihood ratio (chi-square difference) test, referring to the change in -2LL due to entering a block of variables. The main Logistic Regression dialog allows the researcher to enter independent variables in blocks. Blocks may contain one or more variables. There are three major uses of the likelihood ratio test with nested models:
- Stepwise logistic regression: The forward or backward stepwise logistic regression method uses the likelihood ratio test (chi-square difference) to determine automatically which variables to add or drop from the model. This brute-force method runs the risk of modeling noise in the data and is considered useful only for exploratory purposes. Selecting model variables on a theoretical basis and using the "Enter" method is preferred. However, problems of overfitting the model to noise in the current data may be mitigated by cross-validation: fitting the model to one subset of the data and validating the model using a hold-out validation subset. Note that step chi-square is the likelihood ratio test which tests the change in -2LL between steps. Earlier versions of SPSS referred to this as "improvement chi-square." Stepwise procedures are selected in the Method drop-down list in the Logistic Regression dialog, in turn giving the following choices:
- Forward selection vs. backward elimination: Forward selection is the usual option, starting with the constant-only model and adding variables one at a time in the order they are best by some criterion (see below) until some cutoff level is reached (ex., until the step at which all variables not in the model have a significance higher than .05). Backward elimination starts with all variables and deletes one at a time, in the order they are worst by some criterion.
- Variable entry criterion for forward selection. Rao's efficient score statistic (see below) is used as the forward selection criterion for adding variables to the model. It is similar but not identical to a likelihood ratio test of the coefficient for an individual explanatory variable. It appears in the "Score" column of the "Variables not in the Equation" table of SPSS output.
- Score statistic: Rao's efficient score, labeled simply "Score" in SPSS output, is a test for whether the logistic regression coefficient for a given explanatory variable is zero. It is mainly used as the criterion for variable inclusion in forward stepwise logistic regression (discussed above), because of its advantage of being a non-iterative and therefore computationally fast method of testing individual parameters compared to the likelihood ratio test. In essence, the score statistic is similar to the first iteration of the likelihood ratio method, where LR typically goes on to three or four more iterations to refine its estimate. In addition to testing the significance of each variable, the score procedure generates an "Overall statistics" significance test for the model as a whole. A finding of nonsignificance (ex., p>.05) on the score statistic leads to acceptance of the null hypothesis that the coefficient is zero and the variable may be dropped. SPSS continues by this method until no remaining predictor variables have a score statistic significance of .05 or better.
- Variable removal criteria for backward elimination. For eliminating variables in backward elimination, the researcher may choose from among the likelihood ratio test (the preferred method), the Wald statistic, or the conditional statistic (a computationally faster approximation to the likelihood ratio test, preferred when use of the likelihood ratio criterion proves computationally too time consuming). The likelihood ratio test computes -2LL for the current model, then reestimates -2LL with the target variable removed. The conditional statistic is similar except that the -2LL for the model with the target variable removed is a one-pass estimate rather than an iterative reestimation as in the likelihood ratio test. The conditional statistic is considered not as accurate as the likelihood ratio test but more so than the third possible criterion, the Wald test.
- Which step is the best model? Stepwise methods do not necessarily identify "best models" at all, as they work by fitting an automated model to the current dataset, raising the danger of overfitting to noise in the particular dataset at hand. However, there are three possible methods of selecting the "final model" that emerges from the stepwise procedure:
1. Last step. The final model is the last step model, where adding another variable would not improve the model significantly.
2. Lowest AIC. The "Step Summary" table will print the Akaike Information Criterion (AIC) for each step. AIC is commonly used to compare models, where the lower the AIC, the better. The step with the lowest AIC thus becomes the "final model."
3. Lowest BIC. The "Step Summary" table will print the Bayesian Information Criterion (BIC) for each step. BIC is also used to compare models, again where the lower the BIC, the better. The step with the lowest BIC thus becomes the "final model." Often BIC will point to a more parsimonious model than will AIC, as its formula factors in degrees of freedom, which is related to the number of variables.
- Sequential logistic regression is analysis of nested models where the
researcher is testing the control effects of a set of covariates. The logistic
regression model is run against the dependent for the full model with
independents and covariates, then is run again with the block of
independents dropped. If chi-square difference is not significant, then the
researcher concludes that the independent variables are controlled by the
covariates (that is, they have no effect once the effect of the covariates is
taken into account). Alternatively, the nested model may be just the independents, with the covariates dropped. In that case a finding of non-significance implies that the covariates have no control effect.
- Assessing dummy variables: Running a full model and then a model with all the variables in a dummy set dropped (ex., East, West, North for the variable Region) allows assessment of dummy variables, using chi-square difference. Note that even though SPSS computes log-likelihood ratio tests of individual parameters for each level of a dummy variable, the log-likelihood ratio tests of individual parameters (discussed below) should not be used; rather, use the likelihood ratio test (chi-square difference method) for the set of dummy variables pertaining to a given variable. Because all dummy variables associated with the categorical variable are entered as a block, this is sometimes called the "block chi-square" test, and its value is considered more reliable than the Wald test, which can be misleading for large effects in finite samples.
- Wald statistic (test): The Wald statistic is an alternative test which is commonly used to test the significance of individual logistic regression coefficients for each independent variable (that is, to test the null hypothesis in logistic regression that a particular logit (effect) coefficient is zero). For dichotomous independents, the Wald statistic is the squared ratio of the unstandardized logit coefficient to its standard error. The Wald statistic and its corresponding p probability level are part of SPSS output in the "Variables in the Equation" table. This corresponds to significance testing of b coefficients in OLS regression. The researcher may well want to drop independents from the model when their effect is not significant by the Wald statistic. The Wald test appears in the "Parameter Estimates" table in SPSS logistic regression output.

Warning: Menard (p. 39) warns that for large logit coefficients, the standard error is inflated, lowering the Wald statistic and leading to Type II errors (false negatives: thinking the effect is not significant when it is). That is, there is a flaw in the Wald statistic such that very large effects may lead to large standard errors and small Wald chi-square values. For models with large logit coefficients or when dummy variables are involved, it is better to test the difference using the likelihood ratio test of the difference of models with and without the parameter. Also note that the Wald statistic is sensitive to violations of the large-sample assumption of logistic regression. Put another way, the likelihood ratio test is considered more reliable for small samples (Agresti, 1996). For these reasons, the likelihood ratio test of individual model parameters is generally preferred.
- Logistic coefficients and correlation. Note that a logistic coefficient may be found to be significant when the corresponding correlation is found to be not significant, and vice versa. To make certain global statements about the significance of an independent variable, both the correlation and the parameter estimate (b) should be significant. Among the reasons why correlations and logistic coefficients may differ in significance are these: (1) logistic coefficients are partial coefficients, controlling for other variables in the model, whereas
correlation coefficients are uncontrolled; (2) logistic coefficients reflect linear and nonlinear relationships, whereas correlation reflects only linear relationships; and (3) a significant parameter estimate (b) means there is a relation of the independent variable to the dependent variable for selected control groups, but not necessarily overall.
- Confidence interval for the logistic regression coefficient. The confidence interval around the logistic regression coefficient is plus or minus 1.96*ASE, where ASE is the asymptotic standard error of logistic b. "Asymptotic" in ASE means the smallest possible value for the standard error when the data fit the model; it is also the highest possible precision. The real (enlarged) standard error is typically slightly larger than ASE. One typically uses the real SE if one hypothesizes that noise in the data is systematic, and ASE if one hypothesizes that noise in the data is random. As the latter is typical, ASE is used here.
- Goodness of Fit (obsolete), also known as Hosmer and Lemeshow's Goodness of Fit Index or C-hat, is an alternative to model chi-square for assessing the significance of a logistic regression model. Menard (p. 21) notes it may be better when the number of combinations of values of the independents is approximately equal to the number of cases under analysis. This measure was included in SPSS output as "Goodness of Fit" prior to Release 10. However, it was removed from the reformatted output for SPSS Release 10 because, as noted by David Nichols, senior statistician for SPSS, it "is done on individual cases and does not follow a known distribution under the null hypothesis that the data were generated by the fitted model, so it's not of any real use" (SPSSX-L listserv message, 3 Dec. 1999).
• Interpreting Parameter Estimates
- Logit coefficients, also called unstandardized logistic regression coefficients or effect coefficients or simply "parameter estimates" in SPSS output, correspond to b coefficients in OLS regression. Logit coefficients, which are on the right-hand side of the logistic equation, are not to be confused with logits, which is the term on the left-hand side. Both logit and regression coefficients can be used to construct prediction equations and generate predicted values, which in logistic regression are called logistic scores. The SPSS table which lists the b coefficients also lists the standard error of b, the Wald statistic and its significance (discussed below), and the odds ratio (labeled Exp(B)) as well as confidence limits on the odds ratio. Probabilities, odds, and odds ratios are all important basic terms in logistic regression. See the more extensive coverage in the separate section on log-linear analysis.
- Parameter estimates and logits. In SPSS and most statistical output for logistic regression, the "parameter estimate" is the b coefficient used to predict the log odds (logit) of the dependent variable. Let z be the logit for a dependent variable; then the logistic prediction equation is:
z = ln(odds(event)) = ln(prob(event)/prob(nonevent)) = ln(prob(event)/[1 - prob(event)])
  = b0 + b1*X1 + b2*X2 + ... + bk*Xk

where b0 is the constant and there are k independent (X) variables. Some of the X variables may in fact be interaction terms.
For a one-independent model, z equals the constant plus the b coefficient times the value of X1, when predicting odds(event) for persons with a particular value of X1. If X1 is a binary (0,1) variable, then z = b0 (that is, the constant) for the "0" group on X1 and equals the constant plus the b coefficient for the "1" group. To convert the log odds (which is z, the logit) back into odds, the natural logarithmic base e is raised to the zth power: odds(event) = exp(z) = the odds the binary dependent is 1 rather than 0. If X1 is a continuous variable, then z equals the constant plus the b coefficient times the value of X1. For models with additional independents, z is the constant plus the sum of the crossproducts of the b coefficients and the values of the X (independent) variables. Exp(z) is the odds of the dependent, that is, the estimate of odds(event).
To summarize, logits are the log odds of the event occurring (usually, that the dependent = 1 rather than 0). The "z" in the logistic formula above is the logit; odds(event) = Exp(z). Where OLS regression has an identity link function, logistic regression has a logit link function (that is, logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does). Parameter estimates (b coefficients) associated with explanatory variables are estimators of the change in the logit caused by a unit change in the independent. In SPSS output, the parameter estimates appear in the "B" column of the "Variables in the Equation" table. Logits do not appear but must be estimated using the logistic regression equation above, inserting appropriate values for the constant and X variable(s). The b coefficients vary between plus and minus infinity, with 0 indicating the given explanatory variable does not affect the logit (that is, makes no difference in the probability of the dependent value equaling the value of the event, usually 1); positive or negative b coefficients indicate the explanatory variable increases or decreases the logit of the dependent. Exp(b) is the odds ratio for the explanatory variable, discussed below. Note that when b=0, Exp(b)=1, so an odds ratio of 1 corresponds to an explanatory variable which does not affect the dependent variable.
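A small sketch of constructing logits and odds from the equation above (Python; the coefficient values are hypothetical):

    import numpy as np

    b0, b1 = -0.25, 1.10        # hypothetical constant and coefficient for a binary X1

    z_group0 = b0               # logit for the X1 = 0 group
    z_group1 = b0 + b1          # logit for the X1 = 1 group

    print(np.exp(z_group0))     # odds(event) for the 0 group
    print(np.exp(z_group1))     # odds(event) for the 1 group
    print(np.exp(b1))           # Exp(b): the ratio of the two odds, i.e. the odds ratio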
- Odds ratios. In contrast to exp(z), exp(b) is the odds ratio. The odds ratio is the natural log base, e, raised to the power b, where b = the parameter estimate. "Exp(B)" in SPSS output refers to odds ratios. Exp(b), which is the odds ratio for a given independent variable, represents the factor by which odds(event) changes for a one-unit change in the independent variable. Put another way, Exp(b) is the ratio of odds for two groups whose values of Xi are one unit apart. An Exp(b) greater than 1.0 means the independent variable increases the logit and therefore increases odds(event). If Exp(b) = 1.0, the independent variable has no effect. If Exp(b) is less than 1.0, then the independent variable decreases the logit and decreases odds(event). For instance, if b1 = 2.303, then the corresponding odds ratio (e raised to the 2.303 power) is approximately 10, and we may
say that when the independent variable increases one unit, the odds that the dependent = 1 increase by a factor of 10, when other variables are controlled. In SPSS, odds ratios appear as "Exp(B)" in the "Variables in the Equation" table.

odds ratio = exp(b)
b = ln(odds ratio)
A second simple example: Some 20 people take a performance test, where 0=fail and 1=success. For males, 3 fail and 7 succeed. For females, 7 fail and 3 succeed. Then p(success) for males = 7/10 = .70, and q(failure) for males = 3/10 = .30. Likewise p(success) for females = 3/10 = .30, and q(failure) for females = 7/10 = .70. Therefore the odds of success for males is the ratio of the probabilities = .7/.3 = 2.3333. The odds of success for females = .3/.7 = .4286, rounded off. Then the odds ratio for success (for performance = 1) for males:females is 2.3333/.4286 = 5.4444. Since the parameter estimate is the natural log of the odds ratio, therefore b(gender) = ln(5.4444) = 1.6946. Conversely, if the b for gender was 1.6946 we could convert it to an odds ratio using the function exp(1.6946) = 5.4444. And we would say that the odds of success (the odds that the dependent variable performance = 1) are 5.4444 times as large for males as for females. (If you try this on your calculator, results will not be exact due to rounding error: there are actually more than the four decimal places shown above, and also delta must be set to 0.)
- Comparing the change in odds for different values of X. The odds ratio, which is Exp(b), is the factor by which odds(event) changes for a 1 unit change in X. But what if years_education was the X variable and one wanted to know the change factor for X=12 years vs. X=16 years? Here, the X difference is 4 units. The change factor is not 4 times Exp(b). Rather, odds(event) changes by a factor of Exp(b) to the 4th power. That is, odds(event) changes by a factor of Exp(b) raised to the power of the number of units change in X.
- Comparing the change in odds when interaction terms are in the model. In general, Exp(b) is the odds ratio and represents the factor by which odds(event) is multiplied for a unit increase in the X variable. However, the effect of the X variable is not properly gauged in this manner if X is also involved in interaction effects which are also in the model. Before exponentiating, the b coefficient must be adjusted to include the interaction b terms. Let X1 be years_education and let X2 be a dichotomous variable called "school type" coded 0=private school, 1=public school, and let the interaction term X1*X2 also be in the model. Let the b coefficients be .864 for X1, .280 for X2, and .010 for X1*X2. The adjusted b, which we shall label b*, for years_education = .864 + .010*school_type. For private schools, b* = .864. For public schools, b* = .874. Exp(b*) is then the estimate of the odds ratio, which will be different for different values of the variable (here, school_type) with which the X variable (here, years_education) is interacting.
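A sketch of that adjustment using the hypothetical coefficients above (Python):

    import math

    b_years, b_interaction = 0.864, 0.010     # the X2 main effect (.280) does not enter b*

    for school_type in (0, 1):                          # 0 = private, 1 = public
        b_star = b_years + b_interaction * school_type  # adjusted b for years_education
        print(school_type, b_star, math.exp(b_star))    # Exp(b*): the adjusted odds ratio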
- Logit coefficients vs. logits. Although b coefficients are called logit or logistic coefficients, they are not logits. That is, b is the parameter estimate and z is the logit. Parameter estimates (b) have to do with changes in the logit, z, while the logit has to do with estimates of odds(event). To avoid confusion, many
researchers refer to b as the "parameter estimate."
- Logit coefficients, and why they are preferred over odds ratios in modeling. Note that for the case of decrease the odds ratio can vary only from 0 to .999, while for the case of increase it can vary from 1.001 to infinity. This asymmetry is a drawback to using the odds ratio as a measure of strength of relationship. Odds ratios are preferred for interpretation, but logit coefficients are preferred in the actual mathematics of logistic models. Warning: The odds ratio is a different way of presenting the same information as the unstandardized logit (effect) coefficient discussed in this section and, like it, is not recommended when comparing the relative strengths of the independents. The standardized logit (effect) coefficient is used for this purpose.
- Parameter estimates in multinomial logistic regression. In multinomial logistic analysis, where the dependent may have more than the usual 0-or-1 values, the comparison is always with the last value rather than with the value of 1. The parameter estimates table for multinomial logistic regression will contain factor or covariate parameters for each category of the categorical dependent variable except the last category (by default; however, SPSS multinomial lets the researcher set the reference category as the first or another custom category). If the predictor is a covariate, there will be a single set of parameters for each value of the categorical dependent except the reference category. If the predictor is a factor, there will be one parameter row for each of that predictor's categories except the reference category. If a parameter estimate (b coefficient) is significant and positive, then that parameter increases the odds of the given response (category) of the dependent (response) variable compared to the reference category response. If negative, that parameter decreases the odds of that response compared to the reference category response. Additional light can be thrown on the predictive power of the logits by requesting cell probabilities under the Statistics button, giving actual and predicted counts and percentages for each combination of categories of the dependent and predictor variables.
For example, let "candidate" be a categorical dependent variable with three levels: the first parameter estimate will be the log of the odds (probability candidate=1 : probability candidate=3), and the second parameter estimate will be the log odds of (p(candidate=2) : p(candidate=3)). Let the explanatory variable be gender, with 0=female and 1=male, such that the reference category is 1=male. Let the reference category of the dependent equal 3, the default. There will be two parameter estimates for gender: one for candidate 1 and one for candidate 2. Let the parameter estimate for gender=0 for candidate 1 be .500. Then the odds ratio is exp(.500) = 1.649. We can then say the odds of a female selecting candidate 1 compared to candidate 3 are 1.649 times (about 65% greater than) the odds a male would. Warning: This is a statement about odds - do not directly transform it into a statement about probabilities/likelihood/chances.
- Effect size. The odds ratio is a measure of effect size. The ratio of odds ratios of the independents is the ratio of relative importance of the independent variables in terms of effect on the dependent variable's odds. (Note that standardized
logit coefficients may also be used, as discussed below, but then one is discussing relative importance of the independent variables in terms of effect on the dependent variable's log odds, which is less intuitive.)
- Confidence interval on the odds ratio. SPSS labels the odds ratio "Exp(B)" and prints "Low" and "High" confidence levels for it. If the low-high range contains the value 1.0, then being in that variable value category makes no difference in the odds of the dependent, compared to being in the reference (usually highest) value for that variable. That is, when the 95% confidence interval around the odds ratio includes the value of 1.0, indicating that a change in the value of the independent variable is not associated with a change in the odds of the dependent variable assuming a given value, then that variable is not considered a useful predictor in the logistic model.
- Types of variables. Parameter estimates (b) and odds ratios (Exp(b)) may be output for dichotomies, categorical variables, or continuous variables. Their interpretation is similar in each case, though often researchers will not interpret odds ratios in terms of statements illustrated below, but instead will simply use odds ratios as effect size measures and comment on their relative sizes when comparing independent variable effects, or will comment on the change in the odds ratio for a particular explanatory variable between a model and some nested alternative model.
- Dichotomies. If b is positive, then as the dichotomous independent variable moves from 0 to 1, the log odds (logit) of the dependent also increases. If the odds ratio is 4.3 for hs_degree (1=having a high school degree, 0=not), for instance, where the dependent is employed (0=not employed, 1=employed), we say that the odds of a person with a high school degree being employed are 4.3 times the odds of a person without a high school degree. In multinomial logistic regression, the odds compare the given category of the dependent with the reference (last) category, rather than dependent=1 vs. dependent=0 as in binary logistic regression.
- Categorical variables: Categorical variables must be interpreted in terms of the left-out reference category, as in OLS regression. If b is positive, then when the dummy = 1 (that category of the categorical variable is present), the log odds (logit) of the dependent also increase. Thus the parameter estimate for a categorical dummy variable refers to the change in log odds when the dummy = 1, compared to the reference category equaling 1 (being present). If the odds ratio is 4.3 for religion=1 (Catholic), where the reference category is "Agnostics," and the dependent is 1=attend religious movies and 0=don't attend, then the odds a Catholic attends religious movies are 4.3 times the odds that an agnostic does. Warning: Note that dichotomous variables may be entered as categorical dummies rather than as simple variables (which would be the norm). If entered as categorical variables, then their odds ratios will be computed differently and must be interpreted relative to the reference category rather than as a simple increase/decrease in the odds ratio.
- Continuous covariates: When the parameter estimate, b, is transformed into an odds ratio, it may be expressed as a percent increase in odds. For instance, consider the example of number of publications of professors (see Allison, 1999: 188). Let the logit coefficient for "number of articles published" be +.0737, where the dependent variable is "being promoted". The odds ratio which corresponds to +.0737 is approximately 1.08 (e to the .0737 power). Therefore one may say, "each additional article published increases the odds of promotion by about 8%, controlling for other variables in the model." (Obviously, this is the same as saying that the new odds are 108% of the original odds, or noting that one multiplies the original odds by 1.08. By the same token, it is not the same as saying that the probability of promotion increases by 8%.) To take another example, let income be a continuous explanatory variable measured in ten thousands of dollars, with a parameter estimate of 1.5 in a model predicting home ownership=1, no home ownership=0. A 1 unit increase in income (one $10,000 unit) is then associated with a 1.5 increase in the log odds of home ownership. However, it is more intuitive to convert to an odds ratio: exp(1.5) = 4.48, allowing one to say that a unit ($10,000) change in income increases the odds of the event ownership=1 about 4.5 times.
- Probability interpretations. While logistic coefficients are usually interpreted as odds, not probabilities, it is possible to use probabilities. Exp(z) is odds(event). Therefore P(event) = Exp(z)/(1 + Exp(z)), and P(nonevent) = 1/(1 + Exp(z)). Recall z = the constant plus the sum of crossproducts of the b coefficients times the values of their respective X (independent) variables. For dichotomous independents assuming the values (0,1), the crossproduct term is zero when X = 0 and is b when X = 1. For continuous independents, different probabilities will be computed depending on the value of X. That is, P(event) varies depending on the covariates.
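A sketch of the conversion from logit to probability (Python; the coefficient values are hypothetical):

    import math

    def prob_event(z):
        # P(event) = Exp(z) / (1 + Exp(z))
        return math.exp(z) / (1.0 + math.exp(z))

    z = -1.0 + 0.8 * 1     # constant plus b*X for one case
    print(prob_event(z))   # about .45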
• Measures of Model Fit
- The Akaike Information Criterion, AIC, is a common information theory statistic used when comparing alternative models. It is output by SAS's PROC LOGISTIC. Lower values indicate better model fit.
- The Schwarz Information Criterion, SIC, is a modified version of AIC and is part of SAS's PROC LOGISTIC output. Compared to AIC, SIC penalizes overparameterization more (rewards model parsimony). Lower values indicate better model fit. It is common to use both AIC and SIC when assessing alternative logistic models.
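A sketch of how the two criteria are computed from a model's -2LL (Python; the -2LL, parameter count, and sample size are hypothetical):

    import math

    neg2ll = 112.3   # -2LL of the fitted model
    k = 4            # number of estimated parameters, including the constant
    n = 200          # number of cases

    aic = neg2ll + 2 * k             # Akaike Information Criterion
    sic = neg2ll + k * math.log(n)   # Schwarz (Bayesian) Information Criterion
    print(aic, sic)                  # lower values indicate better fit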
- Model chi-square. Model chi-square is based on log likelihoods and, as discussed above, model chi-square should be significant in a well-fitting model. Significance means that the fit of the researcher's full model differs significantly from that of the constant-only null model.
- Pearson Goodness of Fit, GOF, also called Pearson chi-square, is printed by SPSS logistic output below model chi-square. This alternative test should be non-significant
for a well-fitting model. Non-significance corresponds to failing to reject the null
hypothesis that the observed likelihood does not differ from 1.
• Measures of Effect Size
- R-squared. There is no widely-accepted direct analog to OLS regression's R². This is because an R² measure seeks to make a statement about the "percent of variance explained," but the variance of a dichotomous or categorical dependent variable depends on the frequency distribution of that variable. For a dichotomous dependent variable, for instance, variance is at a maximum for a 50-50 split, and the more lopsided the split, the lower the variance. This means that R-squared measures for logistic regressions with differing marginal distributions of their respective dependent variables cannot be compared directly, and comparison of logistic R-squared measures with R² from OLS regression is also problematic. Nonetheless, a number of logistic R-squared measures have been proposed, all of which should be reported as approximations to OLS R², not as actual percent of variance explained.

Note that the R²-like measures below are not goodness-of-fit tests but rather attempt to measure strength of association. For small samples, for instance, an R²-like measure might be high when goodness of fit was unacceptable by the likelihood ratio test.
1. Cox and Snell's R-Square is an attempt to imitate the interpretation of multiple R-Square based on the likelihood, but its maximum can be (and usually is) less than 1.0, making it difficult to interpret. It is part of SPSS output in the "Model Summary" table.
2. Nagelkerke's R-Square is a further modification of the Cox and Snell coefficient to assure that it can vary from 0 to 1. That is, Nagelkerke's R² divides Cox and Snell's R² by its maximum in order to achieve a measure that ranges from 0 to 1. Therefore Nagelkerke's R² will normally be higher than the Cox and Snell measure but will tend to run lower than the corresponding OLS R². Nagelkerke's R² is part of SPSS output in the "Model Summary" table and is the most-reported of the R-squared estimates. See Nagelkerke (1991) and the computational sketch after this list.
3. Pseudo-R-Square is Aldrich and Nelson's coefficient which serves as an analog to the squared contingency coefficient, with an interpretation like R-square. Its maximum is less than 1. It may be used in either dichotomous or multinomial logistic regression.
4. Hagle and Mitchell's Pseudo-R-Square is an adjustment to Aldrich and Nelson's Pseudo-R-Square and generally gives higher values which compensate for the tendency of the latter to underestimate model strength.
5. R-square is OLS R-square, which can be used in binary logistic regression (see Menard, p. 23) but not in multinomial logistic regression. To obtain R-square, save the predicted values from logistic regression and run a bivariate regression on the observed dependent values. Note that logistic regression can yield
deceptively high R² values when you have many variables relative to the number of cases, keeping in mind that the number of variables includes k-1 dummy variables for every categorical independent variable having k categories.
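A sketch of the Cox and Snell and Nagelkerke computations from the null and full model log likelihoods (Python; the values are hypothetical):

    import math

    ll_null = -68.75   # log likelihood of the constant-only model
    ll_full = -56.15   # log likelihood of the researcher's model
    n = 200            # number of cases

    cox_snell = 1 - math.exp((2.0 / n) * (ll_null - ll_full))
    max_cox_snell = 1 - math.exp((2.0 / n) * ll_null)   # upper bound of Cox and Snell's R-square
    nagelkerke = cox_snell / max_cox_snell              # rescaled so the maximum is 1
    print(cox_snell, nagelkerke)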
- Classification tables are the 2 x 2 tables in the logistic regression output for dichotomous dependents, or the 2 x n tables for ordinal and polytomous logistic regression, which tally correct and incorrect estimates. The columns are the two predicted values of the dependent, while the rows are the two observed (actual) values of the dependent. In a perfect model, all cases will be on the diagonal and the overall percent correct will be 100%. If the logistic model has homoscedasticity (not a logistic regression assumption), the percent correct will be approximately the same for both rows. Since this takes the form of a crosstabulation, measures of association (SPSS uses lambda-p and tau-p) may be used in addition to percent correct as a way of summarizing the strength of the table.
- Warning. Classification tables should not be used as goodness-of-fit measures because they ignore actual predicted probabilities and instead use dichotomized predictions based on a cutoff (ex., .5). For instance, in binary logistic regression, predicting a 0-or-1 dependent, the classification table does not reveal how close to 1.0 the correct predictions were nor how close to 0.0 the errors were. A model in which the predictions, correct or not, were mostly close to the .50 cutoff does not have as good a fit as a model where the predicted scores cluster near either 1.0 or 0.0. Also, because the hit rate can vary markedly by sample for the same logistic model, use of the classification table to compare across samples is not recommended.
- Split of the dependent variable. While no particular split of the dependent variable is assumed, the split makes a difference in the classification table. Suppose the dependent is split 99:1. Then one could guess the value of the dependent correctly 99% of the time just by always selecting the more common value. The classification table will likely show 0 predictions in the predicted column for the 1% value of the dependent. The closer to 50:50, the easier it is for a predictor variable to have an effect. Even at some intermediate but lopsided split, such as 85:15, it can be difficult for a predictor to improve on simple guessing (that is, on 85%). A strong predictor variable could improve on the 85% but a weak one might not. This does not mean the predictor variables are non-significant, just that they do not move the estimates enough to make a difference compared to pure guessing. When the classification table for a dichotomous dependent has a zero "Predicted" column, it is likely that the raw correlations of the predictor variables with the dependent variable are not high enough to make a difference.
- Terms associated with classification tables (a computational sketch follows this list):
1. Hit rate: Number of correct predictions divided by sample size. The hit rate for the model should be compared to the hit rate for the classification table for the constant-only model (Block 0 in SPSS output). The Block 0 rate will be the percentage in the most numerous category (that is, the null
model predicts the most numerous category for all cases).
2. Sensitivity: Percent of correct predictions in the reference category of the dependent (ex., 1 for binary logistic regression).
3. Specificity: Percent of correct predictions in the given category of the dependent (ex., 0 for binary logistic regression).
4. False positive rate: In binary logistic regression, the number of errors where the dependent is predicted to be 1 but is in fact 0, as a percent of total cases which are observed 0's. In multinomial logistic regression, the number of errors where the predicted value of the dependent is higher than the observed value, as a percent of all cases on or above the diagonal.
5. False negative rate: In binary logistic regression, the number of errors where the dependent is predicted to be 0 but is in fact 1, as a percent of total cases which are observed 1's. In multinomial logistic regression, the number of errors where the predicted value of the dependent is lower than the observed value, as a percent of all cases on or below the diagonal.
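A sketch of the classification-table quantities above (Python/NumPy; the observed values and predicted probabilities are hypothetical):

    import numpy as np

    y_observed = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    p_predicted = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3])
    y_predicted = (p_predicted >= 0.5).astype(int)   # dichotomize at the .5 cutoff

    hits = y_predicted == y_observed
    hit_rate = hits.mean()                           # overall percent correct
    sensitivity = hits[y_observed == 1].mean()       # correct among observed 1's
    specificity = hits[y_observed == 0].mean()       # correct among observed 0's
    false_positive_rate = ((y_predicted == 1) & (y_observed == 0)).sum() / (y_observed == 0).sum()
    false_negative_rate = ((y_predicted == 0) & (y_observed == 1)).sum() / (y_observed == 1).sum()
    print(hit_rate, sensitivity, specificity, false_positive_rate, false_negative_rate)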
- The histogram of predicted probabilities, also called the "classplot" or the "plot of observed groups and predicted probabilities," is part of SPSS output when one chooses "Classification plots" under the Options button in the Logistic Regression dialog. It is an alternative way of assessing correct and incorrect predictions under logistic regression. The X axis is the predicted probability, from 0.0 to 1.0, of the dependent being classified "1". The Y axis is frequency: the number of cases classified. Inside the plot are columns of observed 1's and 0's (or equivalent symbols). Thus a column with one "1" and five "0's" set at p = .25 would mean that six cases were predicted to be "1" with a probability of .25, and thus were classified as "0's." Of these, five actually were "0's" but one (an error) was a "1" on the dependent variable.
The researcher looks for two things: (1) A U-shaped rather than normal distribution is desirable. A U-shaped distribution indicates the predictions are well-differentiated. A normal distribution indicates many predictions are close to the cut point, which is not as good a model fit. (2) There should be few errors. The 1's to the left (observed 1's classified as 0's) are false negatives; the 0's to the right (observed 0's classified as 1's) are false positives. Examining this plot will also tell such things as how well the model classifies difficult cases (ones with predicted p near the cut point).
- Unstandardized logit coefficients are shown in the "B" column of the "Variables in the Equation" table in SPSS binomial logistic regression output. There will be a B coefficient for each independent and for the constant. The logistic regression model is log odds(dependent variable) = (B for var1)*Var1 + (B for var2)*Var2 + ... + (B for varn)*Varn + the constant.
- Standardized logit coefficients, also called standardized effect coefficients or beta weights, correspond to beta (standardized regression) coefficients and like them may be used to compare the relative strength of the independents. However, odds ratios are
preferred for this purpose, since when using standardized logit coefficients one is discussing the relative importance of the independent variables in terms of effect on the dependent variable's logged odds, which is less intuitive than relative importance in terms of the actual odds of the dependent variable, which is the referent when odds ratios are used. SPSS does not output standardized logit coefficients, but note that if one standardizes one's input data first, then the parameter estimates will be standardized logit coefficients. (SPSS does output odds ratios, which are found in the "Exp(B)" column of the "Variables in the Equation" table in binomial regression output.) Alternatively, one may multiply the unstandardized logit coefficients by the standard deviations of the corresponding variables, giving a result which is not the standardized logit coefficient but can be used to rank the relative importance of the independent variables (see the sketch below). Note: Menard (p. 48) warned that as of 1995, SAS's "standardized estimate" coefficients were really only partially standardized. Different authors have proposed different algorithms for "standardization," and these result in different values, though generally the same conclusions about the relative importance of the independent variables.
Odds ratios, discussed above, are also used as effect size measures.
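The b-times-standard-deviation ranking described above can be sketched in a few lines of Python. The coefficients and standard deviations below are purely illustrative, and the result is only a rough ordering of relative importance, not a true standardized logit coefficient:

# Illustrative unstandardized logit coefficients and predictor standard
# deviations (all values invented for the example).
coefficients = {"age": 0.04, "income": 0.02, "education": 0.30}
std_devs = {"age": 12.0, "income": 25.0, "education": 2.5}

# Multiply each b by the standard deviation of its predictor and rank by
# absolute value to compare relative importance.
ranking = sorted(((name, abs(b * std_devs[name])) for name, b in coefficients.items()),
                 key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(name, round(value, 2))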
* Partial contribution, R. Partial R is an alternative method of assessing the relative importance of the independent variables, similar to standardized partial regression coefficients (beta weights) in OLS regression. R is a function of the Wald statistic (discussed below) and the number of degrees of freedom for the variable. SPSS prints R in the "Variables in the Equation" section. Note, however, that there is a flaw in the Wald statistic such that very large effects may lead to large standard errors, small Wald chi-square values, and small or zero partial R's. For this reason it is better to use odds ratios for comparing the importance of independent variables.
* BIC, the Bayes Information Criterion, has been proposed by Raftery (1995) as a third way of assessing the independent variables in a logistic regression equation. BIC in the context of logistic regression (and different from its use in SEM) should be greater than 0 to support retaining the variable in the model. As a rule of thumb, a BIC of 0 - 2 is weak, 2 - 6 is moderate, 6 - 10 is strong, and over 10 is very strong.
1. Lambda-p is a PRE (proportional reduction in error) measure, which is the ratio of (errors without the model - errors with the model) to errors without the model. If lambda-p is .80, then using the logistic regression model will reduce our errors in classifying the dependent by 80% compared to classifying the dependent by always guessing a case is to be classed the same as the most frequent category of the dichotomous dependent. Lambda-p is an adjustment to classic lambda to assure that the coefficient will be positive when the model helps and negative when, as is possible, the model actually leads to worse predictions than simple guessing based on the most frequent class. Lambda-p varies from 1 to (1 - N), where N is the number of cases. Lambda-p = (f - e)/f, where f is the smallest row frequency (smallest row marginal in the classification table) and e is the number of errors (the 1,0 and 0,1 cells in the classification table). A computational sketch appears after this list.
2. Tau-p is an alternative measure of association. When the classification table has equal marginal distributions, tau-p varies from -1 to +1, but otherwise may
be less than 1. Negative values mean the logistic model does worse than expected by chance. Tau-p can be lower than lambda-p because it penalizes proportional reduction in error for non-random distribution of errors (that is, it wants an equal number of errors in each of the error quadrants in the table).
3. Phi-p is a third alternative discussed by Menard (pp. 29-30) but is not part of SPSS output. Phi-p varies from -1 to +1 for tables with equal marginal distributions.
4. Binomial d is a significance test for any of these measures of association, though in each case the number of "errors" is defined differently (see Menard, pp. 30-31).
5. Separation: Note that when the independents completely predict the dependent, the error quadrants in the classification table will contain 0's, which is called complete separation. When this is nearly the case, as when the error quadrants have only one case, this is called quasicomplete separation. When separation occurs, one will get very large logit coefficients with very high standard errors. While separation may indicate powerful and valid prediction, often it is a sign of a problem with the independents, such as definitional overlap between the indicators for the independent and dependent variables.
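A minimal Python sketch of lambda-p, applying the (f - e)/f formula above to a hypothetical 2x2 classification table (all counts invented):

# Hypothetical classification table: keys are (observed, predicted) cells.
table = {(0, 0): 50, (0, 1): 10, (1, 0): 15, (1, 1): 25}

e = table[(0, 1)] + table[(1, 0)]          # errors with the model
f = min(table[(0, 0)] + table[(0, 1)],     # smallest row marginal = errors made by
        table[(1, 0)] + table[(1, 1)])     # always guessing the modal category

lambda_p = (f - e) / f
print("lambda-p =", round(lambda_p, 3))    # here (40 - 25) / 40 = 0.375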
* The c statistic is a measure of the discriminative power of the logistic equation. It varies from .5 (the model's predictions are no better than chance) to 1.0 (the model always assigns higher probabilities to correct cases than to incorrect cases for any pair involving dependent=0 and dependent=1). Thus c is the percent of all possible pairs of cases in which the model assigns a higher probability to a correct case than to an incorrect case. The c statistic is not part of SPSS logistic output but may be calculated using the COMPUTE facility, as described in the SPSS manual's chapter on logistic regression. Alternatively, save the predicted probabilities and then get the area under the ROC curve. In SPSS, select Analyze, Regression, Binary (or Multinomial); select the dependent and covariates; click Save; check to save predicted values (pre_1); Continue; OK. Then select Graphs, ROC Curve; set pre_1 as the test variable; select standard error and confidence interval; OK. In the output, c is labeled as "Area." It will vary from .5 to 1.0.
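The c statistic can also be computed directly as the proportion of all 0/1 pairs in which the case observed as 1 receives the higher predicted probability (ties counted as half a pair), which equals the area under the ROC curve. A short Python sketch, with invented observed values and saved predicted probabilities:

observed = [0, 0, 0, 1, 1, 1, 0, 1]
predicted = [0.10, 0.40, 0.35, 0.80, 0.55, 0.30, 0.60, 0.90]

ones = [p for p, y in zip(predicted, observed) if y == 1]
zeros = [p for p, y in zip(predicted, observed) if y == 0]

concordant = sum(1 for p1 in ones for p0 in zeros if p1 > p0)
tied = sum(1 for p1 in ones for p0 in zeros if p1 == p0)

# c = (concordant pairs + half the ties) / all 0-1 pairs
c = (concordant + 0.5 * tied) / (len(ones) * len(zeros))
print("c statistic (area under the ROC curve) =", round(c, 3))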
* Contrast Analysis
o Repeated contrasts is an SPSS option (called profile contrasts in SAS) which computes the logit coefficient for each category of the independent (except the "reference" category, which is the last one by default). Contrasts are used when one has a categorical independent variable and wants to understand the effects of various levels of that variable. Specifically, a "contrast" is a set of coefficients that sum to 0 over the levels of the independent categorical variable. SPSS automatically creates K-1 internal dummy variables when a covariate is declared to be categorical with K values (by default, SPSS leaves out the last category, making it the reference category). The user can choose various ways of assigning values to these internal variables, including indicator contrasts, deviation contrasts, or simple contrasts. In SPSS, indicator contrasts are now the default (old versions used deviation
contrasts as default).
= Indicator contrasts produce estimates comparing each other group to the reference
group. David Nichols, senior statistician at SPSS, gives this example of indicator
coding output:
Parameter codings for indicator contrasts

                    Parameter Coding
         Value   Freq    (1)     (2)
GROUP      1      106   1.000   .000
           2      116    .000  1.000
           3      107    .000   .000
This example shows a three-level categorical independent (labeled GROUP), with category values of 1, 2, and 3. The predictor here is called simply GROUP. It takes on the values 1-3, with frequencies listed in the "Freq" column. The two "Coding" columns are the internal values (parameter codings) assigned by SPSS under indicator coding. There are two columns of codings because two dummy variables are created for the three-level variable GROUP. For the first variable, which is Coding (1), cases with a value of 1 for GROUP get a 1, while all other cases get a 0. For the second, cases with a 2 for GROUP get a 1, with all other cases getting a 0. (A sketch of this coding appears after this list of contrast types.)
* Simple contrasts compare each group to a reference category (like indicator contrasts). The contrasts estimated for simple contrasts are the same as for indicator contrasts, but the intercept for simple contrasts is an unweighted average of all levels rather than the value for the reference group. That is, with one categorical independent in the model, simple contrast coding means that the intercept is the log odds of a response for an unweighted average over the categories.
* Deviation contrasts compare each group other than the excluded group to the
unweighted average of all groups. The value for the omitted group is then equal to the
negative of the sum of the parameter estimates.
* Contrasts and ordinality: For nominal variables, the pattern of contrast coefficients
for a given independent should be random and nonsystematic, indicating the
nonlinear, nonmonotonic pattern characteristic of a true nominal variable. Contrasts
can thus be used as a method of empirically differentiating categorical independents
into nominal and ordinal classes.
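For readers working outside SPSS, the following Python sketch (using pandas; data values invented) reproduces the K - 1 indicator coding illustrated above for a three-level GROUP variable, with the last category as the reference:

import pandas as pd

# Three-level categorical predictor GROUP, as in the example above.
group = pd.Series([1, 2, 3, 1, 2, 3, 2], name="GROUP")

# Indicator (dummy) coding with the last category (3) as the reference:
# K - 1 = 2 internal dummy variables.
dummies = pd.get_dummies(group, prefix="GROUP").drop(columns="GROUP_3").astype(int)
print(dummies)
# GROUP_1 is 1 only when GROUP = 1, GROUP_2 is 1 only when GROUP = 2;
# cases with GROUP = 3 are 0 on both (the reference category).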
* Analysis of residuals. Residuals may be plotted to detect outliers visually. Residual analysis may lead to development of separate models for different types of cases. For logistic regression, it is usual to use the standardized difference between the observed and expected probabilities. SPSS calls this the "standardized residual (ZResid)," SAS calls it the "chi residual," and Menard (1995) and others (including SPSS in the table of "Observed and Predicted Frequencies" in multinomial logistic output) call it the "Pearson residual." In a model which fits well, in every cell formed by the independents no absolute standardized residual will be > 1.96. Cells which do not meet this criterion signal combinations of independent variables for which the model is not working well.
Note there are other less-used types of residuals in logistic regression: logit residuals, deviance residuals, Studentized residuals, and of course unstandardized (raw) residuals; see Menard, p. 72. The Save button in the SPSS logistic dialog will save the standardized residual as ZRE_1. One can also save predictions as PRE_1. The DfBeta statistics can be saved as DFB0_1 for the constant, DFB1_1 for the first independent, DFB2_1 for the second independent, etc.
o The dbeta statistic, DBeta, is available to indicate cases which are poorly fitted by the model. Called DfBeta in SPSS (whose algorithm approximates dbeta), it measures the change in the logit coefficients for a given variable when a case is dropped. There is a DfBeta statistic for each case for each explanatory variable and for the constant. An arbitrary cutoff criterion for cases to be considered outliers is those with dbeta > 1.0 on critical variables in the model.
o The leverage statistic, h, is available to identify cases which influence the logistic regression model more than others. The leverage statistic varies from 0 (no influence on the model) to 1 (completely determines the model). The leverage of any given case may be compared to the average leverage, which equals p/n, where p = (k+1) is the number of parameters, k = the number of independents, and n = the sample size. Note that influential cases may nonetheless have small leverage values when predicted probabilities are < .1 or > .9. Leverage is an option in SPSS, in which a plot of leverage by case id will quickly identify cases with unusual impact.
o Cook's distance, D, is a third measure of the influence of a case. Its value is a function of the case's leverage and of the magnitude of its standardized residual. It is a measure of how much deleting a given case affects residuals for all cases. An approximation to Cook's distance is an option in SPSS logistic regression.
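These diagnostics can be approximated outside SPSS as well. The Python sketch below (statsmodels, with invented data) computes standardized Pearson residuals and leverage values by hand and flags cases using the 2.58 residual cutoff mentioned under the assumptions below, plus a common heuristic (not taken from this text) of leverage greater than twice the average p/n:

import numpy as np
import statsmodels.api as sm

# Invented data: one continuous predictor and a binary outcome.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + 1.2 * x)))).astype(int)

X = sm.add_constant(x)                     # design matrix with intercept
fit = sm.Logit(y, X).fit(disp=False)
p = fit.predict(X)                         # fitted probabilities

# Standardized (Pearson) residuals: (observed - expected) / sqrt(p(1 - p)).
pearson_resid = (y - p) / np.sqrt(p * (1 - p))

# Leverage (hat) values from the weighted hat matrix of the logistic fit.
W = p * (1 - p)
XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
leverage = np.einsum("ij,jk,ik->i", X, XtWX_inv, X) * W

avg_leverage = X.shape[1] / len(y)         # p/n, with p = k + 1 parameters
flagged = np.where((np.abs(pearson_resid) > 2.58) | (leverage > 2 * avg_leverage))[0]
print("cases worth inspecting:", flagged)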
Assumptions
* Logistic regression is popular in part because it enables the researcher to overcome many of the restrictive assumptions of OLS regression:
1. Logistic regression does not assume a linear relationship between the dependents and the independents. It may handle nonlinear effects even when exponential and polynomial terms are not explicitly added as additional independents, because the logit link function on the left-hand side of the logistic regression equation is non-linear. However, it is also possible and permitted to add explicit interaction and power terms as variables on the right-hand side of the logistic equation, as in OLS regression.
2. The dependent variable need not be normally distributed (though its distribution is assumed to be within the range of the exponential family of distributions, such as normal, Poisson, binomial, gamma).
3. The dependent variable need not be homoscedastic for each level of the independents; that is, there is no homogeneity of variance assumption: variances need not be the same within categories.
4. Normally distributed error terms are not assumed.
5. Logistic regression does not require that the independents be interval.
6. Logistic regression does not require that the independents be unbounded.
However, other assumptions still apply:
1. Meaningful coding. Logistic coefficients will be difficult to interpret if not coded meaningfully. The convention for binomial logistic regression is to code the dependent class of greatest interest as 1 and the other class as 0, and to code its expected correlates also as +1 to assure positive correlation. For multinomial logistic regression, the class of greatest interest should be the last class. Logistic regression is predicting the log odds of being in the class of greatest interest.
2. Inclusion of all relevant variables in the regression model: If relevant variables are omitted, the common variance they share with included variables may be wrongly attributed to those variables, or the error term may be inflated.
3. Exclusion of all irrelevant variables: If causally irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to the irrelevant variables. The more the correlation of the irrelevant variable(s) with other independents, the greater the standard errors of the regression coefficients for these independents.
4. Error terms are assumed to be independent (independent sampling). Violations of this assumption can have serious effects. Violations will occur, for instance, in correlated samples and repeated measures designs, such as before-after or matched-pairs studies, cluster sampling, or time-series data. That is, subjects cannot provide multiple observations at different time points. Conditional logit models in Cox regression and logistic models for matched pairs in multinomial logistic regression are available to adapt logistic models to handle non-independent data.
5. Low error in the explanatory variables. Ideally assumes low measurement error and no
missing cases. See here for further discussion of measurement error in GLM models.
6. Linearity. Logistic regression does not require linear relationships between the independent factors or covariates and the dependent, as does OLS regression, but it does assume a linear relationship between the independents and the log odds (logit) of the dependent. When the assumption of linearity in the logit is violated, then logistic regression will underestimate the degree of relationship of the independents to the dependent and will lack power (generating Type II errors, thinking there is no relationship when there actually is). One strategy for mitigating lack of linearity in the logit of a continuous covariate is to divide it into categories and use it as a factor, thereby getting separate parameter estimates for various levels of the variable.
* Box-Tidwell Transformation (Test): Add to the logistic model interaction terms which are the crossproduct of each independent times its natural logarithm [(X)ln(X)]. If these terms are significant, then there is nonlinearity in the logit. This method is not sensitive to small nonlinearities. (A sketch of this check, together with the logit step test below, appears after this list.)
* Orthogonal polynomial contrasts: This option treats a categorical independent as a categorical variable with categories assumed to be equally spaced. The logit (effect) coefficients for each category of the categorical explanatory variable should not change over the contrasts. This method is not appropriate when the independent has a large number of values, inflating the standard errors of the contrasts. Select polynomial from the contrast list after clicking the Categorical button in the SPSS logistic regression dialog.
* Logit step tests. Another simple method of checking for linearity between an ordinal or interval independent variable and the logit of the dependent variable is to (1) create a new variable which divides the existing independent variable into categories of equal intervals, then (2) run a logistic regression with the same dependent but using the newly categorized version of the independent as a categorical variable with the default indicator coding. If there is linearity with the logit, the b coefficients for each class of the newly categorized explanatory variable should increase (or decrease) in roughly linear steps. (See the sketch after this list.)
* The SPSS Visual Bander can be used to create a categorical variable based on an existing interval or ordinal one. It is invoked from the SPSS menu by selecting Transform, Visual Bander. In the Visual Bander dialog, enter the variable to categorize and click Continue. In the next dialog, select the variable just entered and also give a name for the new variable to be created; click the Make Cutpoints button and select Equal Intervals (the default), then enter the starting point (ex., 0) and the number of categories (ex., 5) and tab to the Width box, which will be filled in with a default value. Click Apply to return to the Visual Bander dialog, then click OK. A new categorized variable of the name provided will be created at the end of the Data Editor spreadsheet.
* A logit graph can be constructed to visually display linearity in the logit, or lack thereof, for a banded (categorical) version of an underlying continuous variable. After a banded version of the variable is created and substituted into the logistic model, separate b coefficients will be output in the "Parameter Estimates" table. One can go to this table in SPSS output, double-click it, highlight the b coefficients for each band, right-click, and select Create Graph, Line.
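As promised above, here is a combined Python sketch of the Box-Tidwell check and the logit step test (statsmodels and pandas, with invented data; the continuous covariate must be positive for the ln(X) term):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Invented data: a positive continuous covariate and a binary dependent.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y = (rng.random(500) < 1 / (1 + np.exp(-(-2 + 0.4 * x)))).astype(int)

# Box-Tidwell check: add the crossproduct x * ln(x) to the model; a
# significant coefficient on that term signals nonlinearity in the logit.
X_bt = sm.add_constant(np.column_stack([x, x * np.log(x)]))
bt_fit = sm.Logit(y, X_bt).fit(disp=False)
print("p value for the x*ln(x) term:", bt_fit.pvalues[2])

# Logit step check: band x into equal intervals, refit with the bands as
# indicator (dummy) variables, and inspect whether the b coefficients rise
# (or fall) in roughly even steps.
bands = pd.cut(x, bins=5, labels=False)
dummies = pd.get_dummies(bands, prefix="band", drop_first=True).astype(float)
step_fit = sm.Logit(y, sm.add_constant(dummies)).fit(disp=False)
print(step_fit.params)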
7. Additivity. Like OLS regression, logistic regression does not account for interaction effects except when interaction terms (usually products of standardized independents) are created as additional variables in the analysis. This is done by using the categorical covariates option in SPSS's logistic procedure.
8. No multicollinearity: To the extent that one independent is a linear function of another independent, the problem of multicollinearity will occur in logistic regression, as it does in OLS regression. As the independents increase in correlation with each other, the standard errors of the logit (effect) coefficients will become inflated. Multicollinearity does not change the estimates of the coefficients, only their reliability. High standard errors flag possible multicollinearity. Multicollinearity and its handling is discussed more extensively in the StatNotes section on multiple regression.
9. No outliers. As in OLS regression, outliers can affect results significantly. The researcher should analyze standardized residuals for outliers and consider removing them or modeling
them separately. Standardized residuals > 2.58 are outliers at the .01 level, which is the customary level (standardized residuals > 1.96 are outliers at the less-used .05 level). Standardized residuals are requested under the "Save" button in the binomial logistic regression dialog box in SPSS. For multinomial logistic regression, checking "Cell Probabilities" under the "Statistics" button will generate actual, observed, and residual values.
10. Large samples. Also, unlike OLS regression, logistic regression uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to derive parameters. MLE relies on large-sample asymptotic normality, which means that the reliability of estimates declines when there are few cases for each observed combination of independent variables. That is, in small samples one may get high standard errors. In the extreme, if there are too few cases in relation to the number of variables, it may be impossible to converge on a solution. Very high parameter estimates (logistic coefficients) may signal inadequate sample size. As a rule of thumb, Peduzzi et al. (1996) recommend that the smaller of the classes of the dependent variable have at least 10 events per parameter in the model.
11. Sampling adequacy. Goodness of fit measures like model chi-square assume that for cells formed by the categorical independents, all cell frequencies are >= 1 and no more than 20% of cells are < 5. Researchers should run crosstabs to assure this requirement is met. Sometimes one can compensate for small samples by combining categories of categorical independents or by deleting independents altogether.
12. Expected dispersion. In logistic regression the expected variance of the dependent can be compared to the observed variance, and discrepancies may be considered under- or overdispersion. If there is moderate discrepancy, standard errors will be over-optimistic and one should use adjusted standard errors. Adjusted standard errors will make the confidence intervals wider. However, if there are large discrepancies, this indicates a need to respecify the model, or that the sample was not random, or other serious design problems. The expected variance is ybar*(1 - ybar), where ybar is the mean of the fitted (estimated) y. This can be compared with the actual variance in observed y to assess under- or overdispersion. Adjusted SE equals SE * SQRT(D/df), where D is the scaled deviance, which for logistic regression is -2LL, which is -2 Log Likelihood in SPSS logistic regression output.
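A rough Python sketch of this dispersion check (statsmodels, invented data), comparing the observed variance of the dependent with ybar*(1 - ybar) and rescaling the standard errors by SQRT(D/df). Treat it as illustrative only; with individual-level binary data the dispersion estimate is crude:

import numpy as np
import statsmodels.api as sm

# Invented data and fit.
rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = (rng.random(300) < 1 / (1 + np.exp(-(0.3 + 0.8 * x)))).astype(int)
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()

# Expected variance of the dependent is ybar * (1 - ybar).
ybar = fit.fittedvalues.mean()
print("expected variance:", ybar * (1 - ybar), " observed variance:", y.var())

# Adjusted standard errors: SE * sqrt(D / df), with D the deviance (-2LL for
# logistic regression) and df the residual degrees of freedom.
adjusted_se = fit.bse * np.sqrt(fit.deviance / fit.df_resid)
print(adjusted_se)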
SPSS Output for Logistic Regression
* Commented SPSS Output for Logistic Regression
Frequently Asked Questions
© Why not just use regression with dichotomous dependents?
© When is OLS regression preferred to logistic regression?
What is the SPSS syntax for logistic regression?
Can I create interaction terms in my logistic model, as with OLS regression?
Will SPSS's logistic regression procedure handle my categorical variables automatically?
Can I handle missing cases the same in logistic regression as in OLS regression?
Is it true for logistic regression, as it is for OLS regression, that the beta weight (standardized logit coefficient) for a given independent reflects its explanatory power controlling for other variables in the equation, and that the betas will change if variables are added or dropped from the equation?
‘© What is the coefficient in logistic regression which corresponds to R-Square in multiple
regression?
« Is there a logistic regression analogy to adjusted R-square in OLS regression?
© Is multicollinearity a problem for logistic regression the way it is for multiple linear
regression?
What is the logistic equivalent to the VIF test for multicollinearity in OLS regression? Can odds ratios be used?
How can one use estimated variance of residuals to test for model misspecification?
How are interaction effects handled in logistic regression?
Does stepwise logistic regression exist, as it does for OLS regression?
What if I use the multinomial logistic option when my dependent is binary?
How many independents can I have?
How do I express the logistic regression equation if one or more of my independents is
categorical?
‘+ How do I compare logit coefficients across groups formed by a categorical independent
variable?
‘* How do I compute the confidence interval for the unstandardized logit (effect) coefficients?
+ SAS's PROC CATMOD for multinomial logistic regression is not user friendly. Where can I
get some help?
* Why not just use regression with dichotomous dependents?
Use of a dichotomous dependent in OLS regression violates the assumptions of normality and homoscedasticity, as a normal distribution is impossible with only two values. Also, when the values can only be 0 or 1, residuals (error) will be low for the portions of the regression line near Y=0 and Y=1, but high in the middle -- hence the error term will violate the assumption of homoscedasticity (equal variances) when a dichotomy is used as a dependent. Even with large samples, standard errors and significance tests will be in error because of lack of homoscedasticity. Also, for a dependent which assumes values of 0 and 1, the regression model will allow estimates below 0 and above 1. Also, multiple linear regression does not handle non-linear relationships, whereas log-linear methods do. These objections to the use of regression with dichotomous dependents apply to polytomous dependents also.
* When is OLS regression preferred to logistic regression?
With a multinomial dependent, when all assumptions of OLS regression are met, OLS
regression usually will have more power than logistic regression. That is, there will be fewer Type II errors (thinking there is no relationship when there actually is). OLS assumptions cannot be met with a binary dependent. Also, the maximum number of independents should be substantially less for logistic as compared to OLS regression, as the categorization of logistic dependents means less information content. With a binary dependent, it is impossible to meet the normality assumptions of OLS regression, but if the split is not extreme (not 90:10 or worse), OLS regression will not return dramatically different substantive results. Still, logistic regression is clearly preferred for binary response (dependent) variables.
* What is the SPSS syntax for logistic regression?
With SPSS, logistic regression is found under Analyze - Regression - Binary Logistic or
Multinomial Logistic.
LOGISTIC REGRESSION /VARIABLES income WITH age SES gender opinion1
    opinion2 region
  /CATEGORICAL=gender, opinion1, opinion2, region
  /CONTRAST (region)=INDICATOR(4)
  /METHOD=FSTEP (LR)
  /CLASSPLOT
Above is the SPSS syntax in simplified form. The dependent variable is the variable immediately after the VARIABLES term. The independent variables are those immediately after the WITH term. The CATEGORICAL command specifies any categorical variables; note these must also be listed in the VARIABLES statement. The CONTRAST command tells SPSS which category of a categorical variable is to be dropped when it automatically constructs dummy variables (here it is the 4th value of "region"; this value is the fourth one and is not necessarily coded "4"). The METHOD subcommand sets the method of computation, here specified as FSTEP to indicate forward stepwise logistic regression. Alternatives are BSTEP (backward stepwise logistic regression) and ENTER (enter terms as listed, usually because their order is set by theories which the researcher is testing). ENTER is the default method. The (LR) term following FSTEP specifies that likelihood ratio criteria are to be used in the stepwise addition of variables to the model. The /CLASSPLOT option specifies that a histogram of predicted probabilities is to be output (see above).
* Can I create interaction terms in my logistic model, as with OLS regression?
Yes. As in OLS regression, interaction terms are constructed as crossproducts of the two interacting variables.
* Will SPSS's logistic regression procedure handle my categorical variables automatically?
No. You must declare your categorical variables categorical if they have more than two values. This is done by clicking on the "Categorical" button in the Logistic Regression dialog box. After this, SPSS will automatically create dummy variables based on the categorical variable.
* Can I handle missing cases the same in logistic regression as in OLS regression?
No. In the linear model assumed by OLS regression, one may choose to estimate missing values based on OLS regression of the variable with missing cases, based on non-missing data. However, the nonlinear model assumed by logistic regression requires a full set of data. Therefore SPSS provides only for LISTWISE deletion of cases with missing data, using the remaining full dataset to calculate logistic parameters.
* Is it true for logistic regression, as it is for OLS regression, that the beta weight (standardized logit coefficient) for a given independent reflects its explanatory power controlling for other variables in the equation, and that the betas will change if variables are added or dropped from the equation?
Yes, the same basic logic applies. This is why it is best in either form of regression to compare two or more models for their relative fit to the data rather than simply to show the data are not inconsistent with a single model. The model, of course, dictates which variables are entered, and one uses the ENTER method in SPSS, which is the default method.
* What is the coefficient in logistic regression which corresponds to R-Square in multiple regression?
There is no exactly analogous coefficient. See the discussion of R-squared measures, above. Cox and Snell's R-Square is an attempt to imitate the interpretation of multiple R-Square, and Nagelkerke's R-Square is a further modification of the Cox and Snell coefficient to assure that it can vary from 0 to 1.
* Is there a logistic regression analogy to adjusted R-square in OLS regression?
Yes. RLA-squared is adjusted RL-squared, and is similar to adjusted R-square in OLS regression. RLA-squared penalizes RL-squared for the number of independents, on the assumption that R-square will become artificially high simply because some independents' chance variations "explain" small parts of the variance of the dependent. RLA-squared = (GM - 2k)/D0, where GM is the model chi-square, D0 is -2LL for the null model, and k = the number of independents.
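Assuming the definitions used above (GM = model chi-square, D0 = -2LL of the null model), both coefficients can be computed from a pair of -2LL values. The numbers in this Python sketch are invented:

# Invented -2 log likelihood values for the null (intercept-only) model and
# for the fitted model with k independents.
D0 = 420.0
Dm = 352.0
k = 4

GM = D0 - Dm                  # model chi-square (likelihood ratio statistic)
R_L = GM / D0                 # likelihood-ratio pseudo R-square
R_LA = (GM - 2 * k) / D0      # adjusted version, penalized for k independents
print("R_L =", round(R_L, 3), " adjusted R_L =", round(R_LA, 3))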
+ Is multicollinearity a problem for logistic regression the way it is for multiple linear
regression?
Absolutely. The discussion in "Statnotes" under the "Regression" topic is relevant to logistic
regression.
* What is the logistic equivalent to the VIF test for multicollinearity in OLS regression? Can odds ratios be used?
Multicollinearity is a problem when high in either logistic or OLS regression because in either case standard errors of the b coefficients will be high and interpretations of the relative importance of the independent variables will be unreliable. In an OLS regression context, recall that VIF is the reciprocal of tolerance, which is 1 - R-squared. When there is high multicollinearity, R-squared will be high also, so tolerance will be low, and thus VIF will be high. When VIF is high, the b and beta weights are unreliable and subject to misinterpretation. For typical social science research, multicollinearity is considered not a problem if VIF <= 4, a level which corresponds to doubling the standard error of the b coefficient.
As there is no direct counterpart to R-squared in logistic regression, VIF cannot be computed -- though obviously one could apply the same logic to various pseudo-R-squared measures. Unfortunately, I am not aware of a VIF-type test for logistic regression, and I would think that the same obstacles would exist as for creating a true equivalent to OLS R-squared.
A high odds ratio would not be evidence of multicollinearity in itself.
To the extent that one independent is linearly or nonlinearly related to another independent,
multicollinearity could be a problem in logistic regression since, unlike OLS regression, logistic regression does not assume linearity of relationship among independents. Some authors use the VIF test in OLS regression to screen for multicollinearity in logistic regression if nonlinearity is ruled out. In an OLS regression context, nonlinearity exists when eta-square is significantly higher than R-square. In a logistic regression context, the Box-Tidwell transformation and orthogonal polynomial contrasts are ways of testing linearity among the independents.
* How can one use estimated variance of residuals to test for model misspecification?
The misspecification problem may be assessed by comparing the expected variance of residuals with the observed variance. Since logistic regression assumes binomial errors, the estimated variance(y) = m(1 - m), where m = the estimated mean residual. "Overdispersion" is when the observed variance of the residuals is greater than the expected variance. Overdispersion indicates misspecification of the model, non-random sampling, or an unexpected distribution of the variables. If misspecification is involved, one must respecify the model. If that is not the case, then the computed standard error will be over-optimistic (confidence intervals will be too narrow). One suggested remedy is to use adjusted SE = SE*SQRT(s), where s = D/df, where D = dispersion and df = degrees of freedom in the model.
* How are interaction effects handled in logistic regression?
The same as in OLS regression. One must add interaction terms to the model as crossproducts of the standardized independents and/or dummy independents. Some computer programs will allow the researcher to specify the pairs of interacting variables and will do all the computation automatically. In SPSS, use the categorical covariates option: highlight two variables, then click on the button that shows >a*b> to put them in the Covariates box. The significance of an interaction effect is the same as for any other variable, except in the case of a set of dummy variables representing a single ordinal variable.
When an ordinal variable has been entered as a set of dummy variables, the interaction of another variable with the ordinal variable will involve multiple interaction terms. In this case the significance of the interaction of the two variables is the significance of the change of R-square of the equation with the interaction terms and the equation without the set of terms associated with the ordinal variable. (See the StatNotes section on "Regression" for computing the significance of the difference of two R-squares.)
* Does stepwise logistic regression exist, as it does for OLS regression?
Yes, it exists, but it is not supported by all computer packages. It is supported by SPSS. Stepwise regression is used in the exploratory phase of research or for purposes of pure prediction, not theory testing. In the theory testing stage the researcher should base selection of the variables on theory, not on a computer algorithm. Menard (1995: 54) writes, "there appears to be general agreement that the use of computer-controlled stepwise procedures to select variables is inappropriate for theory testing because it capitalizes on random variations in the data and produces results that tend to be idiosyncratic and difficult to replicate in any sample other than the sample in which they were originally obtained." Those who use this procedure often focus on step chi-square output in SPSS, which represents the change in the likelihood ratio test (model chi-square test) at each step.
* What if I use the multinomial logistic option when my dependent is binary?
Binary dependents can be fitted in both the binary and multinomial logistic regression options of SPSS, with different options and output. This can be done, but the multinomial procedure will aggregate the data, yielding different goodness of fit tests. The SPSS 14 online help manual notes, "An important theoretical distinction is that the Logistic Regression procedure produces all predictions, residuals, influence statistics, and goodness-of-fit tests using data at the individual case level, regardless of how the data are entered and whether or not the number of covariate patterns is smaller than the total number of cases, while the Multinomial Logistic Regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors, producing predictions, residuals, and goodness-of-fit tests based on these subpopulations. If all predictors are categorical or any continuous predictors take on only a limited number of values -- so that there are several cases at each distinct covariate pattern -- the subpopulation approach can produce valid goodness-of-fit tests and informative residuals, while the individual case level approach cannot."
* What is nonparametric logistic regression and how is it more nonlinear?
In general, nonparametric regression as discussed in the section on OLS regression can be extended to the case of GLM regression models like logistic regression. See Fox (2000: 58-73).

GLM nonparametric regression allows the logit of the dependent variable to be a nonlinear function of the parameter estimates of the independent variables. While GLM techniques like logistic regression are nonlinear in that they employ a transform (for logistic regression, the natural log of the odds of a dependent variable) which is nonlinear, in traditional form the result of that transform (the logit of the dependent variable) is a linear function of the terms on the right-hand side of the equation. GLM non-parametric regression relaxes the linearity assumption to allow nonlinear relations over and beyond those of the link function (logit) transformation.
Generalized nonparametric regression is a GLM equivalent to OLS local regression (local polynomial nonparametric regression), which makes the dependent variable a single nonlinear function of the independent variables. The same problems noted for OLS local regression still exist, notably difficulty of interpretation as independent variables increase.

Generalized additive regression is the GLM equivalent to OLS additive regression, which allows the dependent variable to be the additive sum of nonlinear functions which are different for each of the independent variables. Fox (2000: 74-77) argues that generalized additive regression can reveal nonlinear relationships under certain circumstances where they are obscured using partial residual plots alone, notably when a strong nonlinear relationship among independents exists alongside a strong nonlinear relationship between an independent and a dependent.
* How can I use matched pairs data in conditional multinomial logistic regression?
Multinomial logistic regression can be used for analysis of matched case-control pairs. In the data setup, every id number has three rows: the case row, the control person row paired with that case, and the difference row obtained by subtraction. Analysis is done on the difference row by using Data, Select cases, and selecting for type = "diff" or similar coding, where "type" is a column with values "case," "control," and "diff" for the three rows of data for any id. Any categorical variables like religion = 1,2,3 must be replaced with sets of
dummies such that religion1 = 0,1, religion2 = 0,1, and religion3 is omitted as the reference category. That is, by default the highest category becomes the reference category. All predictors are entered as covariates. There are no factors because all categorical variables have been transformed into 0,1 dummy variables.
The dependent has to be a constant, such as '1'. That is, the researcher must have a column in the data setup, perhaps labeled "control" and with 1's for all rows. The real dependent is whatever the researcher is matching cases on, for instance, people with and without heart attacks but matched otherwise on a set of variables like age, weight, etc.

Under the Model button, the researcher requests no intercept.

Output will be the same as for other multinomial logistic regression models. Note the odds are usually set up as case:control, so that in the example of heart attacks, the cases might be heart-attack people and controls would be non-heart-attack matched pairs. The dependent reference category would become control = non-heart-attack. Let a covariate be the dichotomous variable 0 = white, 1 = non-white, and its odds ratio be 1.3. Its reference category by default would be 1 = non-white. The odds ratio statement would take the form: the odds of a white not getting a heart attack compared to getting one is 1.3 times that of a non-white. Put another way, being white increases the odds of not getting a heart attack by a factor of 1.3.
* How many independents can I have?
There is no precise answer to this question, but the more independents, the more likelihood of multicollinearity. In general, there should be significantly fewer independents than in OLS regression, as logistic dependents, being categorized, have lower information content. Also, if you have 20 independents, at the .05 level of significance you would expect one to be found significant just by chance. A rule of thumb is that there should be no more than 1 independent for each 10 cases in the sample. In applying this rule of thumb, keep in mind that if there are categorical independents, such as dichotomies, the number of cases should be considered to be the lesser of the groups (ex., in a dichotomy with 480 0's and 20 1's, effective size would be 20), and by the 1:10 rule of thumb, the number of independents should be the smaller group size divided by 10 (in the example, 20/10 = 2 independents maximum).
* How do I express the logistic regression equation if one or more of my independents is categorical?
When a covariate is categorical, SPSS will print out "parameter codings," which are the internal-to-SPSS values which SPSS assigns to the levels of each categorical variable. These parameter codings are the X values which are multiplied by the logit (effect) coefficients to obtain the predicted values.
* How do I compare logit coefficients across groups formed by a categorical independent variable?
There are two strategies. The first strategy is to separate the sample into subgroups, then perform otherwise identical logistic regressions for each. One then computes the p value for a Wald chi-square test of the significance of the differences between the corresponding coefficients. The formula for this test, for the case of two subgroup parameter estimates, is
Wald chi-square = (b1 - b2)^2 / ([se(b1)]^2 + [se(b2)]^2), where the b's are the logit coefficients
for groups 1 and 2 and the se terms are their corresponding standard errors. This chi-square value is read from a table of the chi-square distribution with 1 degree of freedom.
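A small Python sketch of this first strategy, computing the Wald chi-square and its p value for two invented subgroup estimates of the same coefficient:

from scipy.stats import chi2

# Invented logit coefficients and standard errors for the same predictor
# estimated separately in two subgroups.
b1, se1 = 0.85, 0.20
b2, se2 = 0.40, 0.25

# Wald chi-square = (b1 - b2)^2 / (se(b1)^2 + se(b2)^2), 1 degree of freedom.
wald = (b1 - b2) ** 2 / (se1 ** 2 + se2 ** 2)
print("Wald chi-square =", round(wald, 2), " p =", round(chi2.sf(wald, df=1), 3))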
The second strategy is to create an indicator (dummy) variable or set of variables which reflects membership/non-membership in the group, and also to have interaction terms between the indicator dummies and other independent variables, such that significant interactions are interpreted as indicating significant differences across groups for the corresponding independent variables. When an indicator variable has been entered as a set of dummy variables, its interaction with another variable will involve multiple interaction terms. In this case the significance of the interaction of the indicator variable and another independent variable is the significance of the change of R-square of the equation with the interaction terms and the equation without the set of terms associated with the ordinal variable. (See the StatNotes section on "Regression" for computing the significance of the difference of two R-squares.)
Allison (1999: 186) has shown that "Both methods may lead to invalid conclusions if residual variation differs across groups." Unequal residual variation across groups will occur, for instance, whenever an unobserved variable (whose effect is incorporated in the disturbance term) has different impacts on the dependent variable depending on the group. Allison suggests that, as a rule of thumb, if "one group has coefficients that are consistently higher or lower than those in another group, it is a good indication of a potential problem ..." (p. 199). Allison explicated a new test to adjust for unequal residual variation, presenting the code for computation of this test in SAS, LIMDEP, BMDP, and STATA. The test is not implemented directly by SPSS or SAS, at least as of 1999. Note Allison's test is conservative in that it will always yield a chi-square which is smaller than the conventional test, making it harder to prove the existence of cross-group differences.
* How do I compute the confidence interval for the unstandardized logit (effect) coefficients?
To obtain the upper confidence limit at the 95% level, where b is the unstandardized logit coefficient, se is the standard error, and e is the base of the natural logarithm, take e to the power of (b + 1.96*se). Take e to the power of (b - 1.96*se) to get the lower confidence limit.
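A Python sketch of this computation with invented values, giving the 95% confidence interval both for b itself and, after exponentiating, for the odds ratio (Exp(B)):

import math

# Invented unstandardized logit coefficient and standard error.
b, se = 0.60, 0.15

b_low, b_high = b - 1.96 * se, b + 1.96 * se          # CI for the coefficient
or_low, or_high = math.exp(b_low), math.exp(b_high)   # CI for the odds ratio
print("b CI:", (round(b_low, 3), round(b_high, 3)))
print("odds ratio CI:", (round(or_low, 3), round(or_high, 3)))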
* SAS's PROC CATMOD for multinomial logistic regression is not user friendly. Where can I get some help?
o SAS CATMOD Examples
o University of Idaho
o York University, on the CATPLOT module
Bibliography
* Agresti, Alan (1996). An introduction to categorical data analysis. NY: John Wiley. An excellent, accessible introduction.
* Allison, Paul D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods and Research, 28(2): 186-208.
* Cox, D. R. and E. J. Snell (1989). Analysis of binary data (2nd edition). London: Chapman & Hall.
* DeMaris, Alfred (1992). Logit modeling: Practical applications. Thousand Oaks, CA: Sage
Publications. Series: Quantitative Applications in the Social Sciences, No. 86.
* Estrella, A. (1998). A new measure of fit for equations with dichotomous dependent variables. Journal of Business and Economic Statistics 16(2): 198-205. Discusses proposed measures for an analogy to R-square.
* Fox, John (2000). Multiple and generalized nonparametric regression. Thousand Oaks, CA: Sage Publications. Quantitative Applications in the Social Sciences Series No. 131. Covers nonparametric regression models for GLM techniques like logistic regression. Nonparametric regression allows the logit of the dependent to be a nonlinear function of the logits of the independent variables.
* Hosmer, David and Stanley Lemeshow (1989). Applied logistic regression. NY: Wiley & Sons. A much-cited treatment utilized in SPSS routines.
* Jaccard, James (2001). Interaction effects in logistic regression. Thousand Oaks, CA: Sage Publications. Quantitative Applications in the Social Sciences Series, No. 135.
* Kleinbaum, D. G. (1994). Logistic regression: A self-learning text. New York: Springer-Verlag. What it says.
* McKelvey, Richard and William Zavoina (1975). A statistical model for the analysis of ordinal level dependent variables. Journal of Mathematical Sociology, 4: 103-120. Discusses polytomous and ordinal logits.
* Menard, Scott (2002). Applied logistic regression analysis, 2nd edition. Thousand Oaks, CA: Sage Publications. Series: Quantitative Applications in the Social Sciences, No. 106. First ed., 1995.
* Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, Vol. 78, No. 3: 691-692. Covers the two measures of R-square for logistic regression which are found in SPSS output.
* O'Connell, Ann A. (2005). Logistic regression models for ordinal response variables. Thousand Oaks, CA: Sage Publications. Quantitative Applications in the Social Sciences, Volume 146.
* Pampel, Fred C. (2000). Logistic regression: A primer. Sage Quantitative Applications in the Social Sciences Series #132. Thousand Oaks, CA: Sage Publications. Pp. 35-38 provide an example with commented SPSS output.
* Peduzzi, P., J. Concato, E. Kemper, T. R. Holford, and A. Feinstein (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 49: 1373-1379.
* Press, S. J. and S. Wilson (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, Vol. 73: 699-705. The authors make the case for the superiority of logistic regression for situations where the assumptions of multivariate normality are not met (ex., when dummy variables are used), though discriminant analysis is held to be better when they are. They conclude that logistic and discriminant analyses will usually yield the same conclusions, except in the case when there are independents which result in predictions very close to 0 and 1 in logistic analysis. This can be revealed by examining a 'plot of observed groups and predicted probabilities' in the SPSS logistic regression output.
* Raftery, A. E. (1995). Bayesian model selection in social research. In P. V. Marsden, ed., Sociological Methodology 1995: 111-163. London: Tavistock. Presents the BIC criterion for evaluating logits.
* Rice, J. C. (1994). "Logistic regression: An introduction". In B. Thompson, ed., Advances in social science methodology, Vol. 3: 191-245. Greenwich, CT: JAI Press. Popular introduction.
* Tabachnick, B. G., and L. S. Fidell (1996). Using multivariate statistics, 3rd ed. New York: Harper Collins. Has a clear chapter on logistic regression.
* Wright, R. E. (1995). "Logistic regression". In L. G. Grimm & P. R. Yarnold, eds., Reading and understanding multivariate statistics. Washington, DC: American Psychological Association. A widely used recent treatment.
Copyright 1998, 2008 by G. David Garson.
Last update 1/6/08.