DUMMY VARIABLE REGRESSION MODELS
Faculty of Economics and Political Science
Economic Department
Third Year- 2023-2024
In regression analysis, the dependent variable is frequently influenced
not only by quantitative variables (e.g. income, price, height), but also
by variables that are qualitative such as nationality, geographical
region, gender.
For example, female workers may be found to earn less than their male
counterparts, holding all other factors constant.
If the dependent variable, Y, represents earnings or wages, then a
qualitative variable that represents gender (male or female) should be
included among the explanatory variables (regressors).
The question is: how can such variables be quantified so that they can
be added to the regression model?
Construct artificial variables (dummy variables) that take on the
values 1 or 0, where 1 indicates the presence of an attribute (e.g.
value 1 if female) and 0 indicates the absence of the attribute (value 0
if male).
Dummy variables thus classify the data into mutually exclusive categories,
which can then be incorporated easily into the regression model.
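As a minimal sketch of this coding step, the Python snippet below turns a qualitative variable into a 0/1 dummy. The gender labels are hypothetical values made up for illustration, and the variable names are not from the lecture.

```python
# Hypothetical sample: these gender labels are made up for illustration.
genders = ["female", "male", "female", "male", "male"]

# Dummy variable: 1 if the attribute "female" is present, 0 otherwise.
female = [1 if g == "female" else 0 for g in genders]

print(female)  # [1, 0, 1, 0, 0]
```

Each observation falls into exactly one category, so the dummy is 1 or 0 for every row and never both.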
Example
Suppose we have data on the average salary of teachers in three geographical
areas (North, South and West) and we want to find out whether the average
salary differs among the three areas.
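Using regression analysis, we estimate
\[ Y_i = \beta_1 + \beta_2 D_{2i} + \beta_3 D_{3i} + u_i \]
where \(Y_i\) is the average salary of teachers,
\(D_{2i}\) is 1 if North; 0 otherwise
\(D_{3i}\) is 1 if South; 0 otherwise
Therefore, assuming \(E(u_i) = 0\),
\(E(Y_i \mid D_{2i} = 1, D_{3i} = 0) = \beta_1 + \beta_2\) → Mean salary of teachers in the North
\(E(Y_i \mid D_{2i} = 0, D_{3i} = 1) = \beta_1 + \beta_3\) → Mean salary of teachers in the South
\(E(Y_i \mid D_{2i} = 0, D_{3i} = 0) = \beta_1\) → Mean salary of teachers in the West
So, the mean salary of teachers in the West is given by the
intercept (\(\beta_1\)), and the differential intercept coefficients
(\(\beta_2\) and \(\beta_3\)) tell by how much the mean salaries in the
North and South differ from that in the West.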
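The sketch below illustrates this interpretation numerically: it regresses salary on an intercept, a North dummy and a South dummy, with West as the omitted category. The salary figures are hypothetical, and `ols` is an illustrative helper written for this example (it solves the normal equations directly), not a library routine.

```python
from fractions import Fraction

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y], built with exact rational arithmetic.
    A = [[sum(Fraction(row[i]) * Fraction(row[j]) for row in X) for j in range(k)]
         + [sum(Fraction(row[i]) * Fraction(yi) for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Look for a nonzero pivot; none exists if X'X is singular.
        pivot = next((r for r in range(col, k) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X'X is singular (perfect collinearity)")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [float(A[i][k]) for i in range(k)]

# Hypothetical average salaries (in thousands), three observations per area.
regions = ["North"] * 3 + ["South"] * 3 + ["West"] * 3
salaries = [50, 54, 52, 44, 46, 48, 58, 60, 62]

# Columns: intercept, North dummy, South dummy (West is the base category).
X = [[1, int(r == "North"), int(r == "South")] for r in regions]
coefficients = ols(X, salaries)

print(coefficients)  # [60.0, -8.0, -14.0]
```

With these made-up data the group means are 52 (North), 46 (South) and 60 (West), so the intercept reproduces the West mean and the two dummy coefficients are the North and South differences from it.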
Note that the qualitative variable (geographical area) has three
categories (North, South, West), but we introduced only two dummy
variables in the regression model. WHY??
This is to avoid the dummy variable trap, a situation of perfect
collinearity: the three dummy variables' columns in the X matrix sum
to the intercept column. As a result, the X'X matrix is singular (its
determinant is zero), it cannot be inverted, and we cannot estimate
the model parameters.
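A small numerical check can make the trap concrete. The sketch below uses a hypothetical list of six region labels and two illustrative helper functions, `determinant` and `cross_product`, that are written for this example only. It compares a design matrix holding an intercept plus all three dummies with one holding an intercept plus two dummies.

```python
from fractions import Fraction

def determinant(M):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    A = [[Fraction(v) for v in row] for row in M]
    size, det = len(A), Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if A[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: the matrix is singular
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det  # a row swap flips the sign
        det *= A[col][col]
        for r in range(col + 1, size):
            factor = A[r][col] / A[col][col]
            A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return det

def cross_product(X):
    """Return X'X as a list of lists."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]

# Hypothetical region labels, two observations per area.
regions = ["North", "South", "West", "North", "South", "West"]
north = [int(r == "North") for r in regions]
south = [int(r == "South") for r in regions]
west = [int(r == "West") for r in regions]

# Intercept plus ALL three dummies (the trap) versus intercept plus two dummies.
X_trap = [[1, dn, ds, dw] for dn, ds, dw in zip(north, south, west)]
X_ok = [[1, dn, ds] for dn, ds in zip(north, south)]

# Row by row, the three dummies add up to the intercept column of ones.
sums_to_intercept = all(dn + ds + dw == 1 for dn, ds, dw in zip(north, south, west))

print(sums_to_intercept)                   # True
print(determinant(cross_product(X_trap)))  # 0
print(determinant(cross_product(X_ok)))    # 8
```

Dropping one dummy removes the exact linear dependence, so X'X becomes invertible again.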
If a qualitative variable has m categories, we introduce
only (m − 1) dummy variables in the model. In other words,
for each qualitative regressor the number of dummy
variables must be one less than the number of categories of
the variable; otherwise we fall into the dummy variable trap.
The category for which no dummy variable is assigned (West in the
previous example) is known as the comparison category (also base
category, control category, benchmark category, reference category)
because all comparisons are made in relation to this category. The choice
of the comparison category is up to the researcher (so, in the previous
example, it does not matter whether it is West, North or South). The mean
value of the comparison category is given by the intercept.
To avoid the dummy variable trap, we may also introduce as many dummy
variables as the number of categories, but we do NOT introduce the intercept.
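Example
\[ Y_i = \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 D_{3i} + u_i \]
where \(D_{1i}\) is 1 if West, \(D_{2i}\) is 1 if North and \(D_{3i}\) is 1 if South (0 otherwise).
In this case
\(\beta_1\) is the mean salary of teachers in the West.
\(\beta_2\) is the mean salary of teachers in the North.
\(\beta_3\) is the mean salary of teachers in the South.
So, we directly obtain the mean values.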
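To see why the no-intercept coefficients are the group means, note that with one dummy per category and no intercept, X'X is diagonal (it holds the group counts) and X'y holds the group totals. The sketch below checks this on hypothetical salary data. All names and numbers are illustrative assumptions.

```python
# Hypothetical average salaries (in thousands), three observations per area.
regions = ["West"] * 3 + ["North"] * 3 + ["South"] * 3
salaries = [58, 60, 62, 50, 54, 52, 44, 46, 48]

# One dummy per category and NO intercept column.
categories = ["West", "North", "South"]
X = [[int(r == c) for c in categories] for r in regions]
k = len(categories)

# X'X: group counts on the diagonal, zeros elsewhere.
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
# X'y: the salary total of each group.
Xty = [sum(row[i] * y for row, y in zip(X, salaries)) for i in range(k)]

# Because X'X is diagonal, each normal equation is solved on its own:
# coefficient = group total / group count = group mean.
betas = [Xty[i] / XtX[i][i] for i in range(k)]

print(XtX)    # [[3, 0, 0], [0, 3, 0], [0, 0, 3]]
print(betas)  # [60.0, 52.0, 46.0]
```

The off-diagonal zeros arise because no observation belongs to two categories at once.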
Most researchers find that introducing (m − 1) dummy variables
(with an intercept) in the case of m categories is better than
omitting the intercept term, because it allows them to address more
easily whether the categorization makes a difference or not.
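How can we deal with two qualitative variables?
In the case of two qualitative regressors, each with two categories (so a single
dummy variable for each), such as:
\(D_{2i}\) is 1 if married; 0 otherwise
\(D_{3i}\) is 1 if South resident; 0 otherwise
Interpret the coefficients in the following regression model, where Y is the wage
\[ \hat{Y}_i = 8.81 + 1.10 D_{2i} - 1.67 D_{3i} \]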
When there is more than one qualitative variable, pay close attention to the
comparison category.
Here it is (unmarried, non-South resident).
If the dependent variable is the wage, we can say that the mean
wage of unmarried persons who don't live in the South is $8.81.
Compared with this, the mean wage of those who are married is higher
by about $1.10 (so their mean wage is $9.91), holding residence constant.
Similarly, for those who live in the South, the mean wage is lower by
about $1.67 (so their mean wage is $7.14), holding marital status constant.
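The arithmetic for all four combinations of marital status and residence can be sketched as follows, using the rounded estimates quoted above (8.81, 1.10 and 1.67). The function name is illustrative, and the calculation assumes the purely additive model described here, with no interaction between the two dummies.

```python
# Rounded estimates quoted in the text (dollars).
INTERCEPT = 8.81        # unmarried, non-South resident (comparison category)
MARRIED_EFFECT = 1.10   # differential intercept for married
SOUTH_EFFECT = -1.67    # differential intercept for South resident

def predicted_mean_wage(married, south):
    """Mean wage implied by the additive dummy-variable model."""
    return INTERCEPT + MARRIED_EFFECT * married + SOUTH_EFFECT * south

for married in (0, 1):
    for south in (0, 1):
        wage = predicted_mean_wage(married, south)
        print(f"married={married}, south={south}: mean wage = ${wage:.2f}")
# married=0, south=0: mean wage = $8.81
# married=0, south=1: mean wage = $7.14
# married=1, south=0: mean wage = $9.91
# married=1, south=1: mean wage = $8.24
```

Only the first row is read directly off the intercept. Every other group mean is the intercept plus the relevant differential intercepts.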
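We can test the statistical significance of the coefficients in the same way as
for quantitative variables.
For example, with \(D_{2i}\) = 1 if married and \(D_{3i}\) = 1 if South resident,
\[ \hat{Y}_i = 8.81 + 1.10 D_{2i} - 1.67 D_{3i} \]
se      = (0.4015)  (0.4642)  (0.4854)
t       =  21.95     2.3688   -3.446
p-value =  0.0000    0.0182    0.0006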
Here, all differential intercepts are statistically significant at the 5%
level (every p-value is below 0.05).
This means, for example, that the mean wage in the South is statistically
significantly lower by $1.67.
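As a rough check, each t ratio is the estimated coefficient divided by its standard error. The sketch below recomputes the ratios from the rounded coefficients quoted in the text, so they differ slightly from the reported t values. The 5% significance level and the dictionary keys are assumptions made for this illustration.

```python
# Rounded coefficients from the text, with the reported standard errors and p-values.
estimates = {"intercept": 8.81, "married": 1.10, "south": -1.67}
std_errors = {"intercept": 0.4015, "married": 0.4642, "south": 0.4854}
p_values = {"intercept": 0.0000, "married": 0.0182, "south": 0.0006}

ALPHA = 0.05  # assumed significance level

t_ratios = {}
significant = {}
for name in estimates:
    # t ratio under the null hypothesis that the coefficient is zero.
    t_ratios[name] = estimates[name] / std_errors[name]
    significant[name] = p_values[name] < ALPHA
    print(f"{name}: t = {t_ratios[name]:.2f}, significant = {significant[name]}")
```

The recomputed ratios agree with the reported ones up to the rounding of the coefficients.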
Final note
If the model has several qualitative variables with
several categories, introduction of dummy variables
can consume many degrees of freedom.
So, one should always weigh the number of dummy
variables to be introduced against the total number
of available observations.
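A quick count shows how fast dummy variables use up degrees of freedom. The sketch below assumes a model with an intercept and (m − 1) dummies per qualitative variable. The function name and the sample sizes are hypothetical.

```python
def residual_df(n_obs, categories_per_variable, n_quantitative=0):
    """Residual degrees of freedom for a model with an intercept.

    Each qualitative variable with m categories contributes (m - 1) dummies.
    """
    n_dummies = sum(m - 1 for m in categories_per_variable)
    n_params = 1 + n_dummies + n_quantitative  # intercept + dummies + quantitative regressors
    return n_obs - n_params

# Hypothetical case: 50 observations, three qualitative variables with
# 3, 4 and 5 categories, plus 2 quantitative regressors.
print(residual_df(50, [3, 4, 5], n_quantitative=2))  # 50 - (1 + 9 + 2) = 38
```

With a small sample, a few multi-category variables can leave very few degrees of freedom for estimation.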