Machine Learning
Logistic Regression
By E. Cheteni
www.belgiumcampus.ac.za
Lesson Objectives
Rationale for Logistic Regression
• Identify the types of variables used for dependent and
independent variables in the application of logistic
regression
• Describe the method used to transform binary measures
into the likelihood and probability measures used in logistic
regression.
• Interpret the results of a logistic regression analysis and
assess predictive accuracy.
Overview of Regression Analysis
E.g. What influences a person's salary?
Regression Analysis is used to achieve two goals
Independent Variables [Predictors]
Dependent Variable (Criterion)
Forms of regression Analysis
Categorical variables with more than one value
Handling Categorical Variables
• Categorical variables are typically encoded using techniques like
one-hot encoding or dummy coding to represent them as binary variables.
• One-hot encoding creates a binary variable for each category (e.g.
gender_male, gender_female), while dummy coding uses one fewer
variable than the number of categories.
• Multicollinearity among the binary variables of a multi-level categorical
variable (for example, a full one-hot encoding used together with an
intercept) can inflate standard errors and lead to unreliable coefficient estimates.
Handling Categorical Variables
• Example of one-hot encoding using a categorical variable ‘Color’ with three levels:
‘Red’, ‘Green’ and ‘Blue’.
One-Hot Encoding:
• One-hot encoding creates a binary variable for each category, where 1 indicates
the presence of the category and 0 indicates absence.
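The one-hot scheme described above can be sketched in a few lines of plain Python. The helper name `one_hot_encode`, the `Color_` column prefix and the sample observations are illustrative assumptions, not part of the original slides.

```python
def one_hot_encode(values, categories, prefix="Color"):
    """Return one dict per value, with one 0/1 column per category."""
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


categories = ["Red", "Green", "Blue"]
colors = ["Red", "Green", "Blue", "Green"]  # made-up observations

encoded = one_hot_encode(colors, categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains exactly one 1, so the three columns always sum to 1.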
Handling Categorical Variables
• Examples of dummy coding using a categorical variable ‘Color’ with three levels:
‘Red’, ‘Green’ and ‘Blue’.
Dummy Coding:
• Dummy coding uses one fewer variable than the number of categories to represent
the categories.
• Typically, one category is used as the reference category; it is represented by
all dummy variables being 0.
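A minimal sketch of dummy coding for the same 'Color' variable, assuming 'Red' is chosen as the reference category. The helper name `dummy_encode`, the column prefix and the sample observations are hypothetical.

```python
def dummy_encode(values, categories, reference, prefix="Color"):
    """Return one dict per value, with a 0/1 column for every
    category except the reference category."""
    kept = [cat for cat in categories if cat != reference]
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in kept}
        for value in values
    ]


categories = ["Red", "Green", "Blue"]
colors = ["Red", "Green", "Blue", "Green"]  # made-up observations

dummies = dummy_encode(colors, categories, reference="Red")
for color, row in zip(colors, dummies):
    print(color, row)
```

A 'Red' observation appears as all zeros, so only two columns are needed for three categories.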
What is logistic regression?
• A classification algorithm based on regression analysis, used when
the dependent variable (target) is categorical, most commonly binary.
When do we use LR?
• Continuous—such as temperature in degrees Celsius or weight in grams.
• Continuous data is classed as interval data, where the intervals between values are
equally split, or as ratio data, where there is also a true zero.
• Discrete, ordinal—data which can be placed into some kind of order on a scale.
• E.g., on a happiness scale from 1 to 5, a score of 1 indicates a lower degree of happiness than a score of 5,
• but there is no way of determining the numerical distance between the points on the scale.
• Discrete, nominal—data which fits into named groups which do not represent any
order or scale.
• E.g., eye color may fit into the categories “blue”, “brown”, or “green”, but there is no hierarchy to these
categories.
Types of logistic regression
Binomial logistic regression
• In binomial logistic regression, the dependent variable has only two possible values,
such as 0 or 1, Pass or Fail, Yes or No, High or Low.
Ordinal logistic regression
• In ordinal logistic regression, the dependent variable has 3 or more possible ordered
values, such as "Low", "Medium", or "High".
Multinomial logistic regression
• In multinomial logistic regression, the dependent variable has 3 or more possible unordered
values, such as "cat", "dog", or "sheep".
Examples of Binary classification problems
• Spam Detection : Predicting if an email is Spam or not
• Credit Card Fraud : Predicting if a given credit card
transaction is fraudulent or not
• Health : Predicting if a given mass of tissue is benign or
malignant
• Marketing : Predicting if a given user will buy an insurance
product or not
• Banking : Predicting if a customer will default on a loan.
Logistic regression assumptions
Logistic regression makes certain assumptions about the data, including:
1. Binary output/target: logistic regression is used for classification
problems. We need to make sure the target is binary and encode it as
0 or 1.
2. Linear relationship: the model is built on a linear equation, so the
log-odds of the outcome are assumed to be a linear function of the inputs.
3. Independent inputs: highly correlated input variables
(multicollinearity) can prevent the model from converging.
Logistic Regression
• Consider the Default data set, where the response default falls into
one of two categories, Yes or No.
• Rather than modeling this response Y directly, logistic regression
models the probability that Y belongs to a particular category.
• For the Default data, logistic regression models the probability of
default.
• For example, the probability of default given balance can be written
as $\Pr(\text{default} = \text{Yes} \mid \text{balance})$.
Logistic Regression
• The values of $\Pr(\text{default} = \text{Yes} \mid \text{balance})$, which we abbreviate p(balance), will
range between 0 and 1.
• Then for any given value of balance, a prediction can be made for
default.
• For example, one might predict default = Yes for any individual for
whom p(balance) > 0.5.
• Alternatively, if a company wishes to be conservative in predicting
individuals who are at risk for default, then they may choose to use a
lower threshold, such as p(balance) > 0.1.
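The effect of the threshold can be illustrated with a short sketch. The function name `classify` and the probabilities below are made up for illustration; they are not taken from the Default data.

```python
def classify(probability, threshold=0.5):
    """Predict default = "Yes" when the probability exceeds the threshold."""
    return "Yes" if probability > threshold else "No"


# Hypothetical values of p(balance) for four individuals.
probabilities = [0.03, 0.12, 0.45, 0.80]

default_rule = [classify(p) for p in probabilities]
conservative_rule = [classify(p, threshold=0.1) for p in probabilities]

print(default_rule)
print(conservative_rule)
```

Lowering the threshold flags more individuals as at risk, at the cost of more false alarms.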
Logistic Regression Model
• A linear regression model for this probability would be:
$$p(X) = \beta_0 + \beta_1 X$$
• If we use this approach to predict default = Yes using balance, then we obtain:
• for balances close to zero we predict a negative probability of default;
• if we were to predict for very large balances, we would get values bigger than 1.
• These predictions are not sensible, since the true probability of default,
regardless of credit card balance, must fall between 0 and 1.
Logistic Regression Model
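• To avoid this problem, we must model p(X) using a function that gives
outputs between 0 and 1 for all values of X.
• In logistic regression, we use the logistic function:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
• After a bit of manipulation of the above, we find that
$$\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$$
• The quantity p(X)/[1 − p(X)] is called the odds, and can take on any value
between 0 and ∞.
• Values of the odds close to 0 and ∞ indicate very low and very high
probabilities of default, respectively.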
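A small sketch of the logistic function and the odds, assuming hypothetical coefficients (`beta0 = -4.0`, `beta1 = 0.002`) chosen only for illustration. It checks numerically that p(X) stays between 0 and 1 and that the odds equal e^(𝛽0 + 𝛽1X).

```python
import math


def logistic(x, beta0, beta1):
    """Logistic function: p(X) = e^(b0 + b1*X) / (1 + e^(b0 + b1*X))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))  # algebraically the same expression


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


beta0, beta1 = -4.0, 0.002  # hypothetical coefficients, for illustration only

results = []
for x in [0.0, 1000.0, 3000.0]:
    p = logistic(x, beta0, beta1)
    results.append((x, p, odds(p), math.exp(beta0 + beta1 * x)))
    print(f"X={x:7.1f}  p(X)={p:.4f}  odds={odds(p):.4f}")
```

The last two entries of each tuple in `results` agree, which is the relationship obtained by rearranging the logistic function.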
Logit or Log(odds)
• The logistic regression model has a logit (log-odds) that is linear in X:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
• Coefficients in logistic regression represent the change in the log-odds
of the outcome associated with a one-unit change in the predictor
variable.
• In a linear regression model, 𝛽1 gives the average change in Y associated
with a one-unit increase in X.
• By contrast, in a logistic regression model, increasing X by one
unit changes the log odds by 𝛽1, or equivalently multiplies the odds by e^𝛽1.
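The interpretation of 𝛽1 can be checked numerically. The sketch below uses the slope 0.0055 quoted later in these slides for balance; the intercept is a hypothetical value, since a difference in log-odds does not depend on it.

```python
import math

beta0 = -10.0   # hypothetical intercept, for illustration only
beta1 = 0.0055  # slope for balance quoted later in these slides


def log_odds(x):
    """Logit: log(p(X) / (1 - p(X))) = beta0 + beta1 * X."""
    return beta0 + beta1 * x


def odds(x):
    """Odds: p(X) / (1 - p(X)) = e^(beta0 + beta1 * X)."""
    return math.exp(log_odds(x))


x = 1000.0
change_in_log_odds = log_odds(x + 1) - log_odds(x)  # equals beta1
odds_ratio = odds(x + 1) / odds(x)                  # equals e^beta1

print(change_in_log_odds, odds_ratio, math.exp(beta1))
```

A one-unit increase in X adds 𝛽1 to the log-odds whatever the starting value of X, while the change in the probability itself depends on where X starts.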
Estimating the Regression Coefficients
• The coefficients 𝛽0 and 𝛽1 are unknown, and must be estimated based
on the available training data.
• The method of maximum likelihood is used to estimate the regression coefficients.
• The basic intuition behind using maximum likelihood:
• we seek estimates for 𝛽0 and 𝛽1 such that the predicted probability 𝑝(𝑋𝑖) of
default for each individual corresponds as closely as possible to the individual's
observed default status.
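The idea can be sketched on a tiny made-up data set. The code below maximizes the log of the likelihood by plain gradient ascent. This is a swapped-in technique chosen for readability, since statistical software typically uses Newton-type methods, and the data, learning rate and step count are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0, beta1, xs, ys):
    """Log of the likelihood: sum of log p(x) for y = 1 and log(1 - p(x)) for y = 0."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(beta0 + beta1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient ascent on the average log-likelihood."""
    beta0, beta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [y - sigmoid(beta0 + beta1 * x) for x, y in zip(xs, ys)]
        grad0 = sum(residuals)
        grad1 = sum(r * x for r, x in zip(residuals, xs))
        beta0 += learning_rate * grad0 / n
        beta1 += learning_rate * grad1 / n
    return beta0, beta1


# Made-up data: larger x makes y = 1 more likely, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit(xs, ys)
print("estimates:", b0, b1)
print("log-likelihood at start:", log_likelihood(0.0, 0.0, xs, ys))
print("log-likelihood at estimates:", log_likelihood(b0, b1, xs, ys))
```

The fitted coefficients give a higher log-likelihood than the starting values, meaning the predicted probabilities sit closer to the observed 0/1 outcomes.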
Estimating the Regression Coefficients
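• This intuition is formalized by the likelihood function, where $y_i$ is the observed
default status of individual i (1 = Yes, 0 = No):
$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(X_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - p(X_{i'})\bigr)$$
• The estimates 𝛽0 and 𝛽1 are chosen to maximize this likelihood function.
• For the Default data, the estimate is 𝛽1 = 0.0055; this indicates that a one-unit increase in
balance is associated with an increase in the log odds of default by 0.0055 units.
• We can measure the accuracy of the coefficient estimates by computing their standard
errors.
• For instance, the z-statistic associated with 𝛽1 is equal to 𝛽1 /𝑆𝐸(𝛽1 ).
• So a large (absolute) value of the z-statistic indicates evidence against the null hypothesis
H0 : 𝛽1 = 0.
• The null hypothesis implies that p(X) = e^𝛽0 / (1 + e^𝛽0), i.e. the probability of default
does not depend on balance.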
Example 1 - Making Predictions
• Once the coefficients have been estimated, we can compute the
probability of default for any given credit card balance.
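• For example, for an individual with a balance of R1,000:
$$\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 \times 1000}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 \times 1000}}$$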
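The calculation can be sketched as follows. The slope 0.0055 is the value quoted in these slides; the intercept of −10.6513 is an assumption, taken from the coefficient table for this data set in An Introduction to Statistical Learning, because the slide's own table did not survive extraction.

```python
import math

beta0_hat = -10.6513  # ASSUMED intercept (not stated in these slides)
beta1_hat = 0.0055    # slope for balance, as quoted in these slides


def predict_default_probability(balance):
    """Estimated probability of default for a given credit card balance."""
    z = beta0_hat + beta1_hat * balance
    return math.exp(z) / (1.0 + math.exp(z))


p_1000 = predict_default_probability(1000)
print(f"Estimated probability of default at a balance of 1,000: {p_1000:.5f}")
```

Under these assumed coefficients, the estimated probability of default at a balance of 1,000 is below 1%.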
Example 2
• For example, the Default data set contains the qualitative variable
student.
• To fit a model that uses student status as a predictor variable, we
simply create a dummy variable that takes on a value of 1 for students
and 0 for non-students.
Exercise
• Question: For the Default data, consider the estimated coefficients of the logistic regression
model that predicts the probability of default using balance, income, and student status. Student
status is encoded as a dummy variable student[Yes], with a value of 1 for a student and a
value of 0 for a non-student. In fitting this model, income was measured in thousands,
so an income of R40,000 enters the model as 40.
1. Given a student with a credit card balance of R1,500 and an income of R40,000,
estimate the probability of default.
2. Given a non-student with the same balance and income, estimate the probability of default.
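One way to approach the exercise is sketched below. The coefficient table for this slide did not survive extraction, so the values used are assumptions taken from the corresponding table in An Introduction to Statistical Learning, with the balance slope at the three-significant-figure value (0.00574) used in that book's worked example. Substitute the coefficients from the lecture if they differ.

```python
import math

# ASSUMED coefficients (not stated in these slides).
COEFFICIENTS = {
    "intercept": -10.8690,
    "balance": 0.00574,
    "income": 0.0030,      # income measured in thousands
    "student_yes": -0.6468,
}


def default_probability(balance, income_thousands, is_student):
    """Estimated probability of default from the multiple logistic model."""
    z = (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["balance"] * balance
        + COEFFICIENTS["income"] * income_thousands
        + COEFFICIENTS["student_yes"] * (1 if is_student else 0)
    )
    return 1.0 / (1.0 + math.exp(-z))


# Balance of 1,500 and income of 40,000 (entered as 40 thousand).
p_student = default_probability(1500, 40, is_student=True)
p_non_student = default_probability(1500, 40, is_student=False)

print(f"student:     {p_student:.3f}")
print(f"non-student: {p_non_student:.3f}")
```

With these assumed coefficients, the student has a lower estimated probability of default than the non-student at the same balance and income, because the student[Yes] coefficient is negative.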
THANK YOU