Logistic Regression
Slide 2
Let us look at some scenarios
• In many real life scenarios, the dependent variable is "limited."
o E.g. Is a person loyal to a brand? Will the person purchase my brand?
o The outcome here is not continuous or distributed normally.
• How to predict the outcome using linear regression?
o Do you see any problem here?
o Try to fit a regression line on this data.
o Can you decipher a relationship among these variables?
Slide 3
Fitting a Regression line
• We could severely simplify the plot by drawing a line between the
means for the two dependent variable levels,
• But this is problematic in two ways:
o the line seems to oversimplify the relationship and
o it gives predictions that cannot be observable values of Y for extreme
values of X.
Y = β0 + β1X + ε, where Y ∈ {0, 1}
(Y has a range of 0 to 1, while β0 + β1X + ε has a range of −∞ to +∞)
Slide 4
The Logistic Regression Model
ln(p / (1 − p)) = a + bX          p = e^(a+bX) / (1 + e^(a+bX))
Slide 6
• If Linear Regression is used:
− Probabilities (the dependent variable) range from 0 to 1.
− However linear predictions can be outside of this range.
Slide 7
Understanding Odds Ratio
• Problem with probabilities is that they are non-linear
o Going from 0.10 to 0.20 doubles the probability, but going from 0.80
to 0.90 barely increases the probability.
• Odds are like probability.
o Usually written as “4 to 1 odds”, which is equivalent to 1 out of
five, or 0.20 probability, or a 20% chance, etc.
odds = p / (1 − p)
Odds have a range of 0 to ∞.
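As a quick numeric illustration (a minimal sketch; the probabilities are made-up values), converting probabilities to odds and log odds shows the two scales side by side:

```python
import numpy as np

# Hypothetical probabilities (illustrative values only)
p = np.array([0.10, 0.20, 0.50, 0.80, 0.90])

odds = p / (1 - p)          # odds range from 0 to +infinity
log_odds = np.log(odds)     # log odds range from -infinity to +infinity

for prob, o, lo in zip(p, odds, log_odds):
    print(f"p = {prob:.2f}  odds = {o:5.2f}  log-odds = {lo:6.2f}")
```

Note how going from p = 0.10 to 0.20 roughly doubles the odds, while going from 0.80 to 0.90 more than doubles them, even though both are +0.10 changes in probability.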
Slide 9
Graph of the Logistic Function
Ogive function
The estimated probability is: p = e^(a+bX) / (1 + e^(a+bX))
o If {a + b X} = 0, then p = 0.50
o As {a + b X} gets really big, p approaches 1
o As {a + b X} gets really small, p approaches 0
Slide 10
Why use Logistic Regression?
• Binary Logistic Regression is a type of CLASSIFICATION
ALGORITHM where the dependent variable is a dummy
variable:
• Coded 0 (e.g. not brand loyal) or 1 (e.g. brand loyal)
o Will the customer make a purchase at the store in the next 30
days (yes vs. no)? Does it change if (s)he is a member of store
loyalty program or if the total purchase last year was above Rs
10,000?
o What is the impact of compensation, employee engagement
schemes and satisfaction scores on employee retention?
• Relationship between the DV and predictors is non-linear.
Slide 11
Understanding Logistic Regression
• Binomial logistic regression is similar to linear regression
o Except that the dependent variable is binary (not continuous)
• Unlike linear regression, you are not attempting to determine
the predicted value of the dependent variable, but the
probability of being in a particular category of the
dependent variable given the independent variables.
• As with other types of regression, binomial logistic regression
also uses interactions between independent variables to
predict the dependent variable.
• Logistic regression is a classification algorithm; do not be misled by
the word “regression” in its name.
ln(p / (1 − p)) = a + bX
P(y|x) = e^(a+bx) / (1 + e^(a+bx))
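A minimal sketch of this idea in Python, assuming a made-up brand-loyalty dataset with one predictor (statsmodels' Logit is one common way to fit such a model; the variable names and data are illustrative only):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: spend = annual spend (Rs '000), y = 1 if brand loyal, 0 otherwise
rng = np.random.default_rng(0)
spend = rng.uniform(1, 20, size=200)
true_logit = -3 + 0.4 * spend                       # assumed "true" a + bX
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(spend)          # adds the intercept term a
model = sm.Logit(y, X).fit()        # maximum likelihood fit
print(model.params)                 # estimates of a and b (on the log-odds scale)

# The model returns the probability of being in the "loyal" category,
# not a predicted value of y itself
p_hat = model.predict(X)
print(p_hat[:5])
```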
Slide 14
Assumptions…
• Dependent variable should be nominal, with 2 categories.
o Could be ordinal or > 2 categories, however focus here is on binary DVs
• One or more independent variables -
o Measured on either a continuous or a nominal scale.
If an independent variable is ordinal, then it must be converted into nominal.
Otherwise the software might treat it as interval-scale data.
• Observations are independent of each other.
i.e. obs should not come from repeated measurements or matched data.
• A case (or an object) should fall only in one category of DV
A person cannot be both – brand loyal and not brand loyal.
• DV should have mutually exclusive and exhaustive categories.
• Obs within each category of the DV should be independent
o i.e. values within a category should not have a relationship
o Same should also hold for the nominal IVs.
Slide 15
…Assumptions
• Limited or no multicollinearity among independent
variables.
• Linearity: A linear relationship between the log odds (logit) and the
independent variables.
o No assumption that the predictors themselves are linear or linearly related to each other
• Error Term: The error term is assumed independent.
• Logistic regression typically requires a large sample size.
• There should be no outliers in the data,
o as logistic regression uses maximum likelihood estimation (MLE) (unlike linear
regression, which uses Ordinary Least Squares).
o Treat values beyond ±3.29 standardized scores as outliers.
Slide 16
Not Required for Logistic Regression
• The dependent variable in logistic regression is simple: it is not
measured on an interval or ratio scale.
• Logistic regression does not require a linear relationship between
the dependent and independent variables.
− Linear relationship with log of odds
• OLS (linear regression) assumes a normally distributed dependent
variable, but in logistic regression the distribution may be
normal, Poisson or binomial.
• Error terms (residuals) do not need to be normally distributed.
• Homoscedasticity is not required.
o OLS assumes equal error variance across the levels of the independent
variables; logistic regression does not make this assumption.
Slide 17
Sample Size
• Overall sample size
o Should be 400 to achieve best results with maximum
likelihood estimation (MLE).
− With smaller sample sizes, the method can be less efficient in
model estimation.
• The focus is more on the size of each outcome group, which should
have at least 10 times the number of estimated model coefficients:
o Particularly problematic are situations where the actual
frequency of one outcome group is small (i.e., below 20). Actual
size of the small group is a bigger issue than the low rate of
occurrence.
o Alternate approaches are required for addressing this situation,
what may be termed “rare event” situations.
• Sample size requirements should be met in both – the training
and the holdout samples.
Slide 18
Issues in Model Estimation
o Small sample sizes
• Difficult to accurately estimate coefficients and standard errors.
o Complete Separation
• Dependent variable is perfectly predicted by an independent variable.
• The problem is that the logit is not defined at probabilities of exactly
one and zero, thus no values are available for estimation.
o Quasi-Complete Separation (Zero Cells Effect)
• Most frequently encountered.
• One or more of the groups defined by the nonmetric independent
variable have counts of zero.
• Either use specialized methods or collapse categories.
Slide 19
So, in short, for Logistic Regression
Slide 20
Maximum Likelihood Estimate
• Logistic Regression uses MLE, instead of OLS as the statistical method
for estimating coefficients of a model.
• Maximizes the likelihood that an event will occur – the event being
a respondent is assigned to one group versus another.
o An iterative procedure.
o Starts with a guess as to the best weight for each predictor variable (i.e.,
each coefficient in the model). Then adjusts these coefficients repeatedly
until there is no additional improvement in the ability to predict the
value of the outcome variable (either 0 or 1) for each case.
• OLS, by contrast, is the process of finding the line that best fits the data.
• Logistic regression is more similar to cross-tabulation given that the
outcome is categorical and the test statistic utilized is the Chi Square.
• The likelihood function (L) measures the probability of observing
the particular set of dependent variable values (p1, p2, ..., pn) that
occur in the sample:
L = p1 × p2 × ⋯ × pn
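The iterative idea can be sketched directly: write out the (log-)likelihood and let a numerical optimiser adjust the coefficients until no further improvement is possible. This is only an illustration of the principle, using made-up data and scipy's general-purpose optimiser as a stand-in for the algorithms statistical packages actually use:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    """Negative log of L = product of p_i for observed 1s and (1 - p_i) for observed 0s."""
    z = X @ beta
    p = 1 / (1 + np.exp(-z))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: an intercept column plus one predictor
rng = np.random.default_rng(1)
x = rng.normal(size=300)
X = np.column_stack([np.ones_like(x), x])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

# Start from a guess of zero weights and iterate until no further improvement
result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y))
print("Estimated a, b:", result.x)
```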
Slide 21
DEVELOPING THE MODEL
Slide 22
Definition of Project Parameters
• Exclusions
– Accounts that have abnormal performance
– Adjudicated using some non-score-dependent criteria
– Designated accounts such as staff, VIPs, out of country,
preapproved, lost/stolen cards, deceased, underage, and
voluntary cancellations within the performance window
Accounts used for development should be those that are scored during
normal day-to-day credit-granting operations, and those that would
constitute the intended customer.
• Effect of Seasonality
• Definition of Good and Bad
Slide 23
Data Windows
• Observation Period – the period the independent variables come
from, i.e. the independent variables are created from this period
(window) only.
• Performance Period – the period the dependent variable comes
from; it is the period following the observation window.
• No fixed window for all the models. Depends on the type of
model and the industry.
Slide 24
Performance Window
• Factors that would determine it –
– Should be long enough to have enough events. Check the vintage
analysis.
– Depends on the product.
Take multiple lengths of the performance window and calculate the
event rate against these periods. Select the period at which the
event rate stabilizes, i.e. does not increase much further.
Slide 25
Performance Window
• Rolling Performance Window –
Multiple windows are used to build the model, but the duration of
each performance window is fixed.
– To account for seasonality
– To include impact of multiple campaigns
Slide 26
Good Model Design
A good model design should document the following:
• The unit of analysis (such as customer or product level)
• Population frame and sample size
• Operational definitions (what are ‘good’/ ‘bad’ customers?) and
modeling assumptions (did this model include/exclude fraudulent
customers?)
• Observational time horizon (such as customers’ payment history
over the last two years) and performance windows (such as the
timeframe for which the “bad” definition applies)
• Data sources and data collection methods
Slide 27
Data Augmentation Techniques
• Classification algorithms trained on imbalanced data often result
in poor predictive quality.
– Models bias heavily toward the majority class, overlooking minority
examples critical to many use cases.
• Data Augmentation Techniques are used in data analytics to
modify unequal data classes to create balanced data sets.
• Oversampling – method to rebalance classes before training.
– When one class of data is underrepresented in the sample
– When the amount of data collected is insufficient.
• Techniques for Oversampling
– Random oversampling
– Smoothed bootstrap oversampling
– SMOTE (Synthetic Minority Over-sampling Technique)
– ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced
Learning)
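A minimal sketch of rebalancing before training, assuming the imbalanced-learn package (imblearn) is available; the data and the class ratio below are made up:

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler

# Hypothetical imbalanced data: roughly 2% responders
rng = np.random.default_rng(2)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.02).astype(int)
print("Before:", Counter(y))

# Random oversampling simply duplicates minority cases;
# SMOTE creates synthetic minority cases by interpolating between neighbours
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("Random oversampling:", Counter(y_ros))
print("SMOTE:", Counter(y_sm))
```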
Slide 28
Under-Sampling
Snapshot | Non-Resp  | Resp  | Total
Sep 08   | 6,204,138 | 2,201 | 6,206,339
Dec 08   | 6,365,785 | 2,762 | 6,283,598
Jun 09   | 6,456,032 | 2,921 | 6,458,953
Slide 29
Selection of Characteristics
• Expected Predictive Power
• Reliability and Robustness
– E.g. income
• Ease in Collection
• Interpretability
– E.g. occupation
• Human Intervention
• Legal Issues Surrounding the Usage of Certain Types of
Information
• Creation of Ratios Based on Business Reasoning
• Future Availability
• Changes in Competitive Environment
Slide 30
Selecting the Independent Variables
Slide 31
Variable Selection Process
• Segregate variables into Numeric & Character (Nominal)
– Nominal variables not part of Selection process
– 8 Nominal variables segregated from analysis data
Slide 33
Weight of Evidence and Information Value
• The Weight of Evidence (WOE) indicates the predictive power of an
independent variable in relation to the dependent variable
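A minimal pandas sketch of the usual calculation, assuming the common credit-scoring convention WOE = ln(% non-events ÷ % events) per bin and IV = Σ (% non-events − % events) × WOE; the data, bin count and column names are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one continuous predictor and a binary event flag
rng = np.random.default_rng(3)
df = pd.DataFrame({"income": rng.normal(50, 15, 5_000)})
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-(-3 + 0.04 * df["income"]))))

df["bin"] = pd.qcut(df["income"], q=5)                     # 5 equal-frequency bins
grp = df.groupby("bin", observed=True)["event"].agg(events="sum", total="count")
grp["non_events"] = grp["total"] - grp["events"]

grp["pct_events"] = grp["events"] / grp["events"].sum()
grp["pct_non_events"] = grp["non_events"] / grp["non_events"].sum()
grp["woe"] = np.log(grp["pct_non_events"] / grp["pct_events"])
grp["iv"] = (grp["pct_non_events"] - grp["pct_events"]) * grp["woe"]

print(grp[["pct_non_events", "pct_events", "woe", "iv"]])
print("Information Value:", grp["iv"].sum())
```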
Slide 34
Weight of Evidence and Information Value
[Table: Range | Bins | Non-Events | Events | % Non-Events | % Events | WOE | IV]
Slide 36
Reject Inference
• Typically used in Credit Risk Models
• Sample selection process –
1. Credit Application
2. Application approved or rejected
3. Approved applications given the loan
4. These accounts are observed: do they remain good or turn bad
5. Scorecard developed on these accounts
→ Applications rejected in step 2 are not considered
Slide 37
Reject Inference is a process whereby the performance of
previously rejected applications is analyzed to estimate their
behavior (i.e., to assign performance class).
Just as there are some bads in the population that is approved,
there will be some goods that have been declined.
This process gives relevance to the scorecard development
process by recreating the population performance for a 100%
approval rate.
Slide 38
Techniques for Reject Inference
• Assign All Rejects to Bads
Not satisfactory, as some of the rejected applications would actually have been good.
Slide 39
Techniques for Reject Inference
• Approve All Applications
Only method to find out the actual (as opposed to inferred)
performance of rejected accounts. It involves approving all
applications for a specific period of time – say 3 months.
This is “buying data”. While perhaps the most scientific and simple approach,
the notion of approving applicants that are known to be very high-risk
can be daunting.
• Augmentation techniques
Slide 40
Converting Probabilities into Scores
• Final Scorecard Production – the probabilities have to be scaled
• Scaling does not affect the predictive strength of the scorecard.
It is an operational decision based on considerations such as:
– Implementability of the scorecard into application processing
software.
– Ease of understanding by staff (e.g., discrete numbers are easier
to work with).
– Continuity with existing scorecards or other scorecards in the
company. This avoids retraining on scorecard usage and
interpretation of scores.
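One widely used scaling convention (not stated on the slides) maps the log odds onto a points scale using a base score and "points to double the odds" (PDO). A minimal sketch, with illustrative assumptions for the base score, base odds and PDO:

```python
import numpy as np

def probability_to_score(p, base_score=600, base_odds=50, pdo=20):
    """Map a predicted probability of being 'bad' onto a scorecard scale.

    Assumptions (illustrative): a score of `base_score` corresponds to
    good:bad odds of `base_odds`:1, and every `pdo` points doubles those odds.
    """
    odds_good = (1 - p) / p                      # good:bad odds
    factor = pdo / np.log(2)
    offset = base_score - factor * np.log(base_odds)
    return offset + factor * np.log(odds_good)

print(probability_to_score(np.array([0.01, 0.02, 0.05, 0.10])))
```

Because the mapping is a monotonic transformation of the predicted odds, it changes only the presentation of the output, not the rank-ordering or predictive strength of the scorecard.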
Slide 41
Converting Probabilities into Scorecard
Scaled Scores
Slide 42
Cautions
• Overfitting
o Adding IVs to a logistic regression model will always increase the
amount of variance explained in the log odds (similar to R²)
o Adding more independent variables to the model can result in
overfitting
o This reduces the generalizability of the model beyond the data on
which the model is fit.
• Empty cells or small cells: Check for empty/ small cells by doing
a crosstab between categorical predictors and outcome variable.
o If a cell has very few cases (a small cell), the model may become
unstable.
Slide 43
ASSESSING THE STRENGTH OF MODEL
Slide 44
Confusion Matrix – Measuring Predictive Accuracy
• Predictive Accuracy – ability to classify observations into correct
outcome group.
o All predictive accuracy measures are based on the cut-off value
selected for classification.
o The final cut-off value selected should be based on comparison of
predictive accuracy measures across cut-off values. While 0.5 is
generally the default cut-off, other values may substantially improve
predictive accuracy.
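A minimal sketch of comparing accuracy measures across cut-off values (scikit-learn's confusion_matrix is one convenient way; the labels and predicted probabilities below are made up):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(4)
y_true = rng.binomial(1, 0.3, size=1_000)
# Hypothetical predicted probabilities, loosely informative about y_true
p_hat = np.clip(0.3 * y_true + rng.normal(0.3, 0.2, size=1_000), 0.01, 0.99)

for cutoff in (0.3, 0.5, 0.7):
    y_pred = (p_hat >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"cutoff={cutoff:.1f}  accuracy={accuracy:.3f}  "
          f"sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
```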
Slide 45
Overall and Outcome – Specific Measures
• Predictive Accuracy of Actual Outcomes
o Sensitivity (aka Recall): true positive rate – the percentage of positive
outcomes correctly predicted = TP / (TP + FN)
Slide 47
ROC Curve (Receiver Operating Characteristic)
• Graphical portrayal. Evaluates the trade-off between the true positive
rate (sensitivity) and the false positive rate (1 – specificity).
• Area under curve (AUC), also known as c-statistic, is a powerful
non-parametric measure.
o Higher the area under curve, better the prediction power of the model.
• Measures the ability of the model to correctly classify true
positives and true negatives. We want our model to predict the
true classes as true and false classes as false.
• We want the true positive rate to be 1. But we are also concerned
about the false positive rate. So, we are not only concerned about
predicting the Y classes as Y but we also want N classes to be
predicted as N.
• We compare it to the random line, at 45°, where the c-statistic is 0.5.
o So the c-statistic must be above 0.5.
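A minimal sketch of obtaining the c-statistic and the curve with scikit-learn, using made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(5)
y_true = rng.binomial(1, 0.3, size=1_000)
p_hat = np.clip(0.35 * y_true + rng.normal(0.3, 0.15, size=1_000), 0, 1)

auc = roc_auc_score(y_true, p_hat)           # c-statistic; 0.5 = the random 45° line
fpr, tpr, thresholds = roc_curve(y_true, p_hat)   # points of the ROC curve
print(f"AUC (c-statistic): {auc:.3f}")       # should be well above 0.5 here
```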
Slide 48
ROC Curve
So if two models are compared, the model with the higher c-statistic is better –
i.e. the greater the area under the curve (AUC), the better the model.
Slide 50
Gini Index
• Uses the Lorenz Curve.
• The Gini index is the ratio of the area between a scorecard’s Lorenz
curve and the 45-degree line to the entire triangular area under the
45-degree line.
o Equivalently, Gini = 2 × AUC − 1.
Slide 51
Other Measures
• Gains Chart. Cumulative positive predicted value vs.
distribution of predicted positives (depth)
Slide 52
Is the model good? KS Statistics
[Table: Decile | Good | Bad | Total | % Bad | Cum Good | Cum Bad | % Cum Good | % Cum Bad | KS]
Understanding KS Statistics and other Measures
Slide 54
Explanation of the KS-Table
Decile – The output of the logistic regression model is a probability. Sort it in descending order, then divide the entire sample into 10 equal parts. The deciles are therefore based on PREDICTED probabilities.
# of Non-Resp – Count of non-responders (i.e. dependent variable = 0) in each decile.
# of Resp – Count of responders (i.e. dependent variable = 1) in each decile.
Total Obs in the Decile – Total number of observations in the decile (sum of the prior two columns).
Response Rate – Response rate for each decile: total number of responders in the decile divided by total observations in the decile.
Cum Response Rate – Cumulative response rate for each decile: total number of responders up to and including that decile divided by the total number of observations up to and including that decile.
% Gain – {(Response rate for the decile) − (Overall response rate)} divided by {Overall response rate}.
Lift – Response rate for the decile divided by the overall response rate.
Cumulative Lift – Cumulative response rate for the decile divided by the overall response rate.
Cumulative # of Non-Resp – Cumulative number of non-responders up to and including that decile.
Cumulative # of Resp – Cumulative number of responders up to and including that decile.
% Cumulative Non-Resp – {Cumulative number of non-responders for the decile} divided by {Total number of non-responders}.
% Cumulative Resp – {Cumulative number of responders for the decile} divided by {Total number of responders}.
KS Statistic – {% Cumulative Responders} minus {% Cumulative Non-Responders}.
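The table above can be produced directly from predicted probabilities; a minimal pandas sketch (made-up data, deciles formed on the predicted probability) is:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
df = pd.DataFrame({"y": rng.binomial(1, 0.1, size=10_000)})
df["p_hat"] = np.clip(0.15 * df["y"] + rng.normal(0.1, 0.05, 10_000), 0.001, 0.999)

# Deciles of PREDICTED probability; decile 1 holds the highest probabilities
df["decile"] = pd.qcut(df["p_hat"].rank(method="first", ascending=False),
                       10, labels=list(range(1, 11)))

ks_table = df.groupby("decile", observed=True)["y"].agg(resp="sum", total="count")
ks_table["non_resp"] = ks_table["total"] - ks_table["resp"]
ks_table["pct_cum_resp"] = ks_table["resp"].cumsum() / ks_table["resp"].sum()
ks_table["pct_cum_non_resp"] = ks_table["non_resp"].cumsum() / ks_table["non_resp"].sum()
ks_table["ks"] = ks_table["pct_cum_resp"] - ks_table["pct_cum_non_resp"]

print(ks_table)
print("KS statistic:", ks_table["ks"].max())
```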
Slide 55
Other Measures
• Cost Ratio. Ratio of the cost of misclassifying a bad credit risk as
a good risk (false negative) to the cost of misclassifying a good
risk as a bad (false positive).
– When used to calculate the cutoff, the cost ratio tends to maximize the
sum of the two proportions of correct classification.
– This is done by plotting the cost ratio against sensitivity and
specificity.
– The point where the two curves meet tends to be the point where
both sensitivity and specificity are maximized.
Slide 56
Comparing Different Models – relative measures
AIC (Akaike Information Criterion)
• Measure of relative goodness of fit
• Lower value of AIC suggests "better" model.
AIC = −2 log(L) + 2K, where L is the model likelihood and K is the number of estimated parameters
• But it is a relative measure of model fit. It is used for
model selection, i.e. it lets you compare different
models estimated on the same dataset.
• Suppose a model has an AIC of, say, 2000. This number is
meaningless on its own and says nothing about how well
the model fits. However, if another model with one more
predictor brings the AIC down to 1500, that shows that
model 2 is a better fit to the data than model 1.
Slide 57
Hosmer-Lemeshow Goodness Of Fit Test
Small p-values (significance) are indicative of poor fit. However, a large p-value does not mean the model fits well, since lack of evidence against H0 is not equivalent to evidence in favour of H1. In particular, for small sample sizes, a high p-value from the test may simply be a consequence of the test having low power to detect mis-specification, rather than being indicative of good fit.
Slide 58
Model Estimation Fit and Between Model comparisons
• Maximum Likelihood Estimation:
o Maximizes the likelihood that an event will occur – the event
being a respondent is assigned to one group versus another.
o The basic measure of how well the maximum likelihood estimation
procedure fits is the likelihood value.
• Comparisons of the likelihood values follow three steps:
o Estimate a Null Model – which acts as the “baseline” for making
comparisons of improvement in model fit.
o Estimate Proposed Model – the model containing the independent
variables to be included in the logistic regression.
o Assess –2LL (-2 Log Likelihood) Difference
• Lower -2LL implies better fit of the model
Please note: there is no such thing as a "typical" or correct likelihood for a
model. It is a relative measure used to compare how well two models
fit the data.
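A minimal statsmodels sketch of the three steps (null model, proposed model, −2LL difference), using made-up data; the chi-square test on the difference is the usual likelihood-ratio comparison:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 0.9 * x))))

null_model = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)    # intercept only ("baseline")
full_model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)      # with the predictor

neg2ll_null = -2 * null_model.llf
neg2ll_full = -2 * full_model.llf
chi_sq = neg2ll_null - neg2ll_full                 # improvement in fit
p_value = stats.chi2.sf(chi_sq, df=1)              # 1 added coefficient
print(f"-2LL null = {neg2ll_null:.1f}, -2LL full = {neg2ll_full:.1f}, "
      f"chi-square = {chi_sq:.1f}, p = {p_value:.4f}")
```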
Slide 59
Many other measures of model fit
• Pseudo R2 Measures:
o Interpreted in a manner similar to the R2 in multiple regression.
o Use −2LL to measure: a low −2LL and a high R² indicate better fit.
o Different pseudo R2 measures vary widely in terms of magnitude and
no one version has been deemed most preferred.
o For all of the pseudo R2 measures, however, the values tend to be
much lower than for multiple regression models.
o Commonly used measures
• Cox & Snell R2
• Nagelkerke R2
• McFadden’s R2 = 1 − [−2LL(full model) / −2LL(null model)] (from software output)
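A minimal sketch of the three common pseudo-R² measures computed from the log-likelihoods of a null and a fitted model, using the standard formulas (the log-likelihood values and sample size below are hypothetical):

```python
import numpy as np

# Hypothetical log-likelihoods and sample size (illustrative values only)
ll_null, ll_model, n = -250.0, -205.0, 400

mcfadden = 1 - (ll_model / ll_null)                       # = 1 - [-2LL(model) / -2LL(null)]
cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))    # 1 - (L_null / L_model)^(2/n)
nagelkerke = cox_snell / (1 - np.exp((2 / n) * ll_null))  # Cox & Snell rescaled to max 1
print(f"McFadden: {mcfadden:.3f}  Cox & Snell: {cox_snell:.3f}  Nagelkerke: {nagelkerke:.3f}")
```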
Slide 60
Casewise Diagnostics
• Two Types of Casewise Diagnostics Similar to Multiple
Regression
o Residuals
✓ Both residuals (Pearson and deviance) reflect standardized
differences between predicted probabilities and outcome value (0
and 1). Values above ± 2 merit further attention.
o Influential Observations
✓ Influence measures reflect impact on model fit and estimated
coefficients if an observation is deleted from the analysis.
✓ Comparable to those measures found in multiple regression.
Slide 61
Caution
• Reporting the R2
o Numerous pseudo-R2 values have been developed
• Should be interpreted with extreme caution as they have many
computational issues which cause them to be artificially high or low.
o Goodness of fit tests
• The Hosmer-Lemeshow test assesses goodness of fit using a Chi-square test. An
insignificant result is better. However, the test has issues.
Slide 62
UNDERSTANDING THE OUTPUT
Slide 63
Example
• Explaining Brand Loyalty
Slide 64
Example
• A researcher is interested in how variables such as GRE (Graduate
Record Exam scores), GPA (grade point average) and prestige of the
undergraduate institution affect admission into graduate school.
• The response variable, admit/don’t admit, is a binary variable.
Slide 65
Understanding the Output (SPSS)
• The first model in the output is a null model, that is, a model with
no predictors.
• The constant in the table labelled Variables in the Equation gives the
unconditional log odds of admission (i.e., admit=1).
• The table labelled Variables not in the Equation gives the results of a
score test. The column labelled Score gives the estimated change in
model fit if the term is added to the model, the other two columns
give the degrees of freedom, and p-value (labelled Sig.) for the
estimated change. Based on the table above, all three of the
predictors, gre, gpa, and rank, are expected to improve the fit of the
model.
Slide 66
Gives the overall test for the model that
includes the predictors. Chi-square value
of 41.459 with a p-value of less than
0.0005 ➔ implies that the model as a
whole fits significantly better than an
empty (or null) model (i.e., a model with
no predictors).
Cox & Snell and Nagelkerke R2 give the improvement from the null model to the
fitted model, as compared to –
- R2 as explained variance
- R2 as square of correlation
-2 Log likelihood measures how poorly the model predicts the decisions: the smaller the statistic, the better the model.
The stats show that the model is poor.
Slide 67
Coefficients (aka log odds) – here, the coefficient for the constant.
S.E. – standard error of the coefficient.
Wald chi-square – tests the H0 that the constant equals 0. Here this
hypothesis is rejected because the p-value ("Sig.") is smaller than the
critical p-value of .05 (or .01).
df – degrees of freedom for the Wald test.
Slide 69
Example: Interpretation of a coefficient
p / (1 − p) = odds
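For instance, with a hypothetical coefficient (not one taken from the slides): if the fitted coefficient for gpa were b = 0.804, then exp(b) ≈ 2.23, so a one-unit increase in GPA multiplies the odds of admission, p/(1 − p), by about 2.23, holding the other predictors constant.

```python
import numpy as np

b_gpa = 0.804            # hypothetical fitted coefficient (log-odds scale)
print(np.exp(b_gpa))     # ≈ 2.23: odds of admission multiply by ~2.23 per unit of GPA
```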
Slide 70
Accuracy of the Classification
• The results of our logistic regression can be used to classify subjects.
• Before we can use this information to classify subjects, we need to
have a decision rule.
• Our decision rule will be:
o If the probability of the event is >= some threshold, we shall predict that the
event will take place.
o By default, SPSS sets this threshold to 0.5.
o However, in many cases we may want to set it higher or lower than 0.5
Slide 71
Slide 72