Chapter 10 – Logistic Regression
Logistic Regression
⚫Extends idea of linear regression to
situation where outcome variable is
categorical
⚫Widely used, particularly where a
structured model is useful to explain
(=profiling) or to predict
⚫ Finding the factors that differentiate
between male and female top executives
⚫We focus on binary classification,
i.e. Y = 0 or Y = 1
The Logit
Goal: Find a function of the predictor
variables that relates them to a 0/1
outcome
⚫Instead of Y as outcome variable (like in
linear regression), we use a function of Y
called the logit
⚫Logit can be modeled as a linear function of
the predictors
⚫The logit can be mapped back to a
probability, which, in turn, can be mapped
to a class
Step 1: Logistic Response Function
p = probability of belonging to class 1
Need to relate p to predictors with a function
that guarantees 0 ≤ p ≤ 1
Standard linear function (as shown below)
does not:
p = β0 + β1x1 + β2x2 + … + βqxq   (q = number of predictors)
The fix: use the logistic response function
p = 1 / (1 + e^−(β0 + β1x1 + β2x2 + … + βqxq))   (eq. 10.2 in textbook)
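A minimal numeric sketch (coefficients and x values chosen only for illustration) contrasting the two functions:

import numpy as np

b0, b1 = -2.0, 0.05           # illustrative coefficients, single predictor
x = np.array([0, 40, 100])    # illustrative predictor values

print(b0 + b1 * x)                          # linear: -2.0 and 3.0 fall outside [0, 1]
print(1 / (1 + np.exp(-(b0 + b1 * x))))     # logistic (eq. 10.2): values stay strictly between 0 and 1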
Step 2: The Odds
The odds of an event are defined as:
Odds = p / (1 − p)   (eq. 10.3; p = probability of the event)
Or, given the odds of an event, the probability
of the event can be computed by:
p = Odds / (1 + Odds)   (eq. 10.4)
We can also relate the odds to the predictors:
Odds = e^(β0 + β1x1 + β2x2 + … + βqxq)   (eq. 10.5)
To get this result, substitute 10.2
into 10.4
Step 3: Take log on both sides
This gives us the logit:
log(Odds) = β0 + β1x1 + β2x2 + … + βqxq = logit   (eq. 10.6)
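A worked numeric example of the three quantities (p = 0.8 chosen only for illustration):
Odds = p / (1 − p) = 0.8 / 0.2 = 4
logit = log(Odds) = log(4) ≈ 1.386
and back again: p = Odds / (1 + Odds) = 4 / 5 = 0.8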
Logit, cont.
So, the logit is a linear function of predictors
x1, x2, …
⚫ Takes values from -infinity to +infinity
Review the relationship between logit, odds,
and probability (see Chapter 10)
Figure: Odds (a) and logit (b) as functions of p
Example
Personal Loan Offer
(UniversalBank.csv)
Outcome variable: accept bank loan (0/1)
Predictors: Demographic info, and info about
their bank relationship
Single Predictor Model
Modeling loan acceptance on income (x)
Assume fitted coefficients (more later): b0 = -6.3525, b1 = 0.0392
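As a quick check, a minimal sketch (illustrative only; it assumes Income is recorded in $1000s, as in the UniversalBank data) plugs these fitted values into the logistic response function for one customer:

import numpy as np

b0, b1 = -6.3525, 0.0392      # fitted intercept and Income coefficient from above
income = 100                  # hypothetical customer with $100K annual income

logit = b0 + b1 * income                  # the logit (linear in the predictor)
p = 1 / (1 + np.exp(-logit))              # logistic response function (eq. 10.2)
print(round(p, 3))                        # about 0.08: this customer is unlikely to accept at a 0.5 cutoff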
Seeing the Relationship
Last step - classify
Model produces an estimated probability of
being a “1”
⚫Convert to a classification by establishing
cutoff level
⚫If estimated prob. > cutoff, classify as “1”
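A minimal sketch of this classification step (the probabilities below are made up for illustration):

import numpy as np

p = np.array([0.03, 0.62, 0.48, 0.91])    # estimated probabilities of being a "1"
cutoff = 0.5                              # popular initial choice (next slide)
print((p > cutoff).astype(int))           # [0 1 0 1]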
Ways to Determine Cutoff
⚫0.50 is popular initial choice
⚫Additional considerations (see Chapter 5)
⚫ Maximize classification accuracy
⚫ Maximize sensitivity (subject to min. level of
specificity)
⚫ Minimize false positives (subject to max. false
negative rate)
⚫ Minimize expected cost of misclassification
(need to specify costs)
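For instance, a minimal sketch of the first option (choose the cutoff that maximizes classification accuracy); the labels and probabilities here are made up for illustration:

import numpy as np
from sklearn.metrics import accuracy_score

actual = np.array([0, 0, 1, 0, 1, 1, 0, 1])                          # hypothetical validation labels
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90, 0.20, 0.60])   # hypothetical p(1) values

cutoffs = np.arange(0.05, 0.95, 0.05)
acc = [accuracy_score(actual, (proba > c).astype(int)) for c in cutoffs]
print('best cutoff by accuracy:', round(cutoffs[int(np.argmax(acc))], 2))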
Example, cont.
⚫Estimates of β’s are derived through an
iterative process called maximum likelihood
estimation
⚫Let’s include all 12 predictors in the model
now
Data Prep
import pandas as pd

bank_df = pd.read_csv('UniversalBank.csv')
bank_df.drop(columns=['ID', 'ZIP Code'], inplace=True)
bank_df.columns = [c.replace(' ', '_') for c in bank_df.columns]

# Treat education as categorical, convert to dummy variables
bank_df['Education'] = bank_df['Education'].astype('category')
new_categories = {1: 'Undergrad', 2: 'Graduate', 3: 'Advanced/Professional'}
bank_df['Education'] = bank_df['Education'].cat.rename_categories(new_categories)
bank_df = pd.get_dummies(bank_df, prefix_sep='_', drop_first=True)

y = bank_df['Personal_Loan']
X = bank_df.drop(columns=['Personal_Loan'])
Fitting Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from dmba import AIC_score

# partition data
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4, random_state=1)

# fit a logistic regression (set penalty=l2 and C=1e42 to avoid regularization)
logit_reg = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
logit_reg.fit(train_X, train_y)

print('intercept ', logit_reg.intercept_[0])
print(pd.DataFrame({'coeff': logit_reg.coef_[0]}, index=X.columns).transpose())
print('AIC', AIC_score(valid_y, logit_reg.predict(valid_X), df=len(train_X.columns) + 1))
Results
intercept  -12.61895521314035

coefficients for logit:
  Age                               -0.032549
  Experience                         0.034160
  Income                             0.058824
  Family                             0.614095
  CCAvg                              0.240534
  Mortgage                           0.001012
  Securities_Account                -1.026191
  CD_Account                         3.647933
  Online                            -0.677862
  CreditCard                        -0.955980
  Education_Graduate                 4.192204
  Education_Advanced/Professional    4.341697

AIC  -709.1524769205962
Converting from logit to probabilities
logit_reg_pred = logit_reg.predict(valid_X)
logit_reg_proba = logit_reg.predict_proba(valid_X)
logit_result = pd.DataFrame({'actual': valid_y,
                             'p(0)': [p[0] for p in logit_reg_proba],
                             'p(1)': [p[1] for p in logit_reg_proba],
                             'predicted': logit_reg_pred})

# display four different cases
interestingCases = [2764, 932, 2721, 702]
print(logit_result.loc[interestingCases])
actual p(0) p(1) predicted
2764 0 0.976 0.024 0
932 0 0.335 0.665 1
2721 1 0.032 0.968 1
702 1 0.986 0.014 0
Interpreting Odds, Probability
For predictive classification, we typically use
probability with a cutoff value
For explanatory purposes, odds have a useful
interpretation:
⚫If we increase x1 by one unit, holding x2, x3, …,
xq constant, then
⚫e^b1 is the factor by which the odds of
belonging to class 1 are multiplied
⚫ Recall (eq. 10.5): Odds = e^(β0 + β1x1 + β2x2 + … + βqxq)
⚫ Consider a single predictor, Income, holding the
remaining predictors constant:
Odds(Personal Loan = Yes | Income) = e^(β0 + β1·Income)
⚫ So, e^β1 is the multiplicative factor by which the
odds (of belonging to class 1) increase
when the value of x1 is increased by 1
unit, holding all other predictors constant.
If β1 < 0, an increase in x1 is associated with
a decrease in the odds of belonging to
class 1, whereas a positive value of β1 is
associated with an increase in the odds.
Loan Example:
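For the loan example, a minimal sketch (assuming the logit_reg model and X from the earlier code) converts each estimated coefficient to the multiplicative change in the odds per one-unit increase:

import numpy as np
import pandas as pd

# e^b for each predictor: multiplicative change in the odds per one-unit increase
odds_multipliers = pd.Series(np.exp(logit_reg.coef_[0]), index=X.columns)
print(odds_multipliers)
# e.g. exp(0.058824) is about 1.06 for Income:
# each extra $1K of income multiplies the odds of accepting the loan by roughly 1.06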
Evaluating Classification
Performance
Performance measures: Confusion matrix
and % of misclassifications
More useful in this example: gains and lift charts
(the terms are sometimes used interchangeably)
Python’s Gains
Chart
import matplotlib.pyplot as plt
from dmba import gainsChart, liftChart

df = logit_result.sort_values(by=['p(1)'], ascending=False)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
gainsChart(df.actual, ax=axes[0])
liftChart(df['p(1)'], title=False, ax=axes[1])
plt.show()
Gains chart: cumulative number of 1’s yielded by the
model, moving through records sorted by predicted
probability of being a 1, compared with the number of
1’s yielded by selecting records randomly.
Python’s Lift Chart
(Produced by the same code as the gains chart on the
previous slide; the lift chart is the right-hand panel.)
The top decile (i.e. the 10% of records most likely
to be 1’s according to the model) contains 7.8 times
as many 1’s as a random selection of the same size.
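To see where that number comes from, a minimal sketch (assuming the sorted data frame df from the chart code) computes the top-decile lift directly:

# proportion of 1's in the top 10% of records (ranked by p(1)),
# divided by the overall proportion of 1's
n_top = int(len(df) * 0.10)
top_decile_lift = df['actual'].iloc[:n_top].mean() / df['actual'].mean()
print(round(top_decile_lift, 1))   # about 7.8 for this example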
Multicollinearity
Problem: As in linear regression, if one
predictor is a linear combination of other
predictor(s), model estimation will fail
⚫Note that in such a case, we have at least
one redundant predictor
Solution: Remove extreme redundancies
(by dropping predictors via variable
selection, or by data reduction methods such
as PCA)
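One simple way to screen for such redundancies (a sketch, not the chapter's method; it assumes the X data frame from the data prep step) is to look for highly correlated predictor pairs:

import numpy as np

# pairwise absolute correlations; keep only the upper triangle to avoid duplicate pairs
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head())   # top candidates for dropping or data reduction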
Variable Selection
This is the same issue as in linear regression
⚫ The number of correlated predictors can grow when we
create derived variables such as interaction terms
(e.g. Income x Family), to capture more complex
relationships
⚫ Problem: Overly complex models have the danger of
overfitting
⚫ Solution: Reduce variables via automated selection of
variable subsets (as with linear regression)
⚫ See Chapter 6
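In scikit-learn, one convenient alternative to subset search (a sketch assuming train_X and train_y from the earlier partition) is L1-regularized logistic regression, which drives some coefficients exactly to zero:

from sklearn.linear_model import LogisticRegression

# smaller C = stronger penalty = fewer predictors retained; C=0.1 is illustrative
l1_reg = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
l1_reg.fit(train_X, train_y)
kept = [col for col, b in zip(train_X.columns, l1_reg.coef_[0]) if b != 0]
print('predictors with nonzero coefficients:', kept)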
Summary
⚫Logistic regression is similar to linear
regression, except that it is used with a
categorical response
⚫It can be used for explanatory tasks
(=profiling) or predictive tasks
(=classification)
⚫The predictors are related to the response Y
via a nonlinear function called the logit
⚫As in linear regression, reducing predictors
can be done via variable selection
⚫Logistic regression can be generalized to
more than two classes
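For the last point, a minimal sketch of multi-class logistic regression in scikit-learn; the iris data here is only an illustration, not part of the chapter's example:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)      # three classes
multi_reg = LogisticRegression(max_iter=1000)    # handles >2 classes via a multinomial model
multi_reg.fit(X_iris, y_iris)
print(multi_reg.predict_proba(X_iris[:3]))       # one probability per class; rows sum to 1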