Regression with Categorical
Variables
• Regression analysis requires numerical data.
• Categorical data can be included as independent
variables but must be coded numerically using
dummy variables.
• For variables with 2 categories, code as 0 and 1.
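A minimal sketch of the 0/1 coding described above. The sample values and the helper name `to_dummy` are illustrative assumptions, not part of the course data.

```python
def to_dummy(values, one_label):
    """Code a two-category variable as 1 for `one_label` and 0 otherwise."""
    return [1 if v == one_label else 0 for v in values]

# Hypothetical yes/no responses, for illustration only.
mba = ["yes", "no", "no", "yes"]
mba_coded = to_dummy(mba, one_label="yes")
print(mba_coded)
```

Which category is coded as 1 is arbitrary; it only changes the sign and interpretation of the dummy variable's coefficient.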
Example 8.15: A Model with
Categorical Variables (1 of 2)
• Employee Salaries provides data for 35 employees.
• Predict Salary using Age and MBA (coded as yes = 1, no = 0):
Y = β₀ + β₁X₁ + β₂X₂ + ε
where
Y = salary
X₁ = age
X₂ = MBA indicator (0 or 1)
Example 8.15: A Model with
Categorical Variables (2 of 2)
• Salary = 893.59 + 1,044.15 × Age + 14,767.23 × MBA
– If MBA = 0, Salary = 893.59 + 1,044.15 × Age
– If MBA = 1, Salary = 15,660.82 + 1,044.15 × Age
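The two cases above can be checked by evaluating the fitted equation directly. This is a sketch that uses the coefficients reported in Example 8.15; the function name `predict_salary` and the age of 40 are illustrative assumptions.

```python
# Coefficients reported in Example 8.15.
B0, B_AGE, B_MBA = 893.59, 1044.15, 14767.23

def predict_salary(age, mba):
    """Evaluate the fitted model; mba must be 0 or 1."""
    return B0 + B_AGE * age + B_MBA * mba

# Setting age to 0 isolates the intercept of each line (algebra only;
# an age of 0 is outside any realistic data range).
intercept_no_mba = predict_salary(0, 0)
intercept_mba = predict_salary(0, 1)

# At any fixed age (40 is a hypothetical value), the predicted gap
# between the two groups equals the MBA coefficient.
gap = predict_salary(40, 1) - predict_salary(40, 0)
print(round(intercept_mba, 2), round(gap, 2))
```

Because the model has no interaction term, the dummy variable shifts the intercept only; both groups share the same slope in Age.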
Interactions
• An interaction occurs when the effect of one
variable on the dependent variable depends on
the value of another variable.
• We can test for an interaction by defining a new
variable as the product of the two variables,
X₃ = X₁ × X₂, and testing whether this
variable is significant. This leads to the
alternative model:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + ε
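A minimal sketch of how the product variable is constructed before running the regression. The (age, MBA) rows are hypothetical values for illustration, not the Employee Salaries data.

```python
# Hypothetical (age, MBA) observations, for illustration only.
rows = [(25, 0), (30, 1), (45, 1), (52, 0)]

# Append the interaction column X3 = X1 * X2 to each observation.
design = [(age, mba, age * mba) for age, mba in rows]

for age, mba, interaction in design:
    print(age, mba, interaction)
```

With a 0/1 dummy, the product equals the age when MBA = 1 and is 0 otherwise, so the interaction coefficient acts as an adjustment to the Age slope for the MBA group.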
Example 8.16: Incorporating Interaction
Terms in a Regression Model (1 of 3)
• Define an interaction between Age
and MBA and re-run the regression.
• The MBA indicator is not significant; we would typically drop it
and re-run the regression analysis.
Example 8.16: Incorporating Interaction
Terms in a Regression Model (2 of 3)
Dropping the MBA indicator and re-running the regression results in the model:
Salary = 3,323.11 + 984.25 × Age + 425.58 × MBA × Age
Example 8.16: Incorporating Interaction
Terms in a Regression Model (3 of 3)
• However, statisticians recommend that if
interactions are significant, first-order terms
should be kept in the model regardless of their p-
values.
• Thus, using the first regression model (with the MBA term retained), we have:
Salary = 3,902.51 + 971.31 × Age − 2,971.08 × MBA + 501.85 × MBA × Age
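To see what this model implies for each group, substitute MBA = 0 and MBA = 1 and collect terms. The sketch below does that arithmetic with the reported coefficients; the function name `salary_line` is an illustrative assumption.

```python
# Coefficients reported for the model that keeps the first-order MBA term.
B0, B_AGE, B_MBA, B_INT = 3902.51, 971.31, -2971.08, 501.85

def salary_line(mba):
    """Return (intercept, slope in Age) of the fitted line for mba = 0 or 1."""
    intercept = B0 + B_MBA * mba
    slope = B_AGE + B_INT * mba
    return intercept, slope

no_mba = salary_line(0)
with_mba = salary_line(1)
print([round(v, 2) for v in no_mba])
print([round(v, 2) for v in with_mba])
```

With the interaction term, the two groups differ in both intercept and slope: for MBA = 1 the intercept is 3,902.51 − 2,971.08 = 931.43 and the Age slope is 971.31 + 501.85 = 1,473.16.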
Categorical Variables with More Than Two
Levels
• When a categorical variable has k > 2 levels, we
need to add k − 1 dummy (0/1) variables to the
model; the remaining level serves as the baseline.
Example 8.17: A Regression Model with Multiple
Levels of Categorical Variables (1 of 4)
• The Excel file Surface Finish provides measurements of the
surface finish of 35 parts produced on a lathe, along with the
revolutions per minute (RPM) of the spindle and which of four
types of cutting tool was used.
Example 8.17: A Regression Model with Multiple
Levels of Categorical Variables (2 of 4)
• Because we have k = 4 levels of tool type, we will
define a regression model of the form
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + ε
where
Y = surface finish
X₁ = RPM
X₂ = 1 if tool type is B and 0 if not
X₃ = 1 if tool type is C and 0 if not
X₄ = 1 if tool type is D and 0 if not
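A minimal sketch of the k − 1 coding defined above. It assumes the four tool types are labeled A through D, with type A as the implied baseline (all three dummies equal to 0); the helper name `tool_dummies` is hypothetical.

```python
# Assumed level labels; type A is the baseline and gets no dummy column.
LEVELS = ["A", "B", "C", "D"]

def tool_dummies(tool):
    """Return [X2, X3, X4]: indicators for tool types B, C, and D."""
    if tool not in LEVELS:
        raise ValueError(f"unknown tool type: {tool!r}")
    return [1 if tool == level else 0 for level in LEVELS[1:]]

for tool in LEVELS:
    print(tool, tool_dummies(tool))
```

Using only three dummies for four levels avoids perfect collinearity with the intercept: a fourth dummy would make the dummy columns sum to 1 in every row.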
Example 8.17: A Regression Model with Multiple
Levels of Categorical Variables (3 of 4)
• Add 3 columns to the data, one for each of the tool type dummy variables.
Example 8.17: A Regression Model with Multiple
Levels of Categorical Variables (4 of 4)
• Regression results:
Surface finish = 24.49 + 0.098 × RPM − 13.31 × Type B − 20.49 × Type C − 26.04 × Type D
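Each dummy coefficient is the predicted difference in surface finish relative to the baseline tool type at the same RPM. The sketch below evaluates the fitted equation to show this. It assumes the baseline is labeled type A, and the RPM value of 200 is a hypothetical input rather than a value taken from the data.

```python
# Coefficients reported in Example 8.17; type A is the assumed baseline.
INTERCEPT, B_RPM = 24.49, 0.098
TOOL_OFFSET = {"A": 0.0, "B": -13.31, "C": -20.49, "D": -26.04}

def predict_finish(rpm, tool):
    """Evaluate the fitted surface finish model for one part."""
    return INTERCEPT + B_RPM * rpm + TOOL_OFFSET[tool]

rpm = 200  # hypothetical spindle speed
baseline = predict_finish(rpm, "A")
for tool in "BCD":
    diff = predict_finish(rpm, tool) - baseline
    print(tool, round(diff, 2))
```

The printed differences match the dummy coefficients regardless of the RPM chosen, because the model has no interaction between RPM and tool type.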
Regression Models with Nonlinear
Terms
• Curvilinear models may be appropriate when scatter
charts or residual plots show nonlinear relationships.
• A second-order polynomial model might be used:
Y = β₀ + β₁X + β₂X² + ε
• Here β₁ represents the linear effect of X on Y and β₂
represents the curvilinear effect.
• This model is linear in the β parameters so we can use
linear regression methods.
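A sketch of why ordinary least squares still applies: X² is treated as just another column of the design matrix. The data here are hypothetical and noise-free, generated from known coefficients so the fit can be checked; they are not the beverage sales data. NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical coefficients (beta0, beta1, beta2) used to generate data.
true_beta = np.array([5.0, -2.0, 0.5])
x = np.linspace(0.0, 10.0, 11)
y = true_beta[0] + true_beta[1] * x + true_beta[2] * x**2

# Design matrix with columns 1, X, X^2. The model is nonlinear in X
# but linear in the beta parameters, so least squares applies directly.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta_hat, 4))
```

With real, noisy data the estimates would not match the generating coefficients exactly, but the fitting procedure is the same.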
Example 8.18: Modeling Beverage Sales
Using Curvilinear Regression (1 of 2)
• The U-shape of the residual plot (a second-order
polynomial trendline was fit to the residual data) suggests
that a linear relationship is not appropriate.
Example 8.18: Modeling Beverage Sales
Using Curvilinear Regression (2 of 2)
• Add a variable for temperature squared.
• The model is: Sales = 142,850 − 3,643.17 × Temperature + 23.3 × Temperature²