Econometrics I
Federal University of ABC
Q1-2024
Problem set #1
Lecturer: Daniel Roland
1) Suppose that you are asked to conduct a study to determine whether smaller class sizes
lead to improved student performance of fourth graders.
(i) If you could conduct any experiment you want, what would you do? Be specific.
If I could conduct any experiment, I would randomly allocate a large number of
students between large and small classrooms. I would also make sure that other
characteristics such as the quality of teachers, school size, teaching method and
any other relevant factors were randomised as well. That way, given the random
allocation, any difference in average performance between students of
different class sizes could be attributed to class size.
(ii) More realistically, suppose you can collect observational data on several
thousand fourth graders in a given state. You can obtain the size of their fourth-
grade class and a standardized test score taken at the end of fourth grade. Why
might you expect a negative correlation between class size and test score?
In larger classrooms, teachers are unable to pay close attention to all
students. This creates more opportunities for students to chat with classmates and
less opportunity for the teacher to engage with each student individually (e.g. ask
direct questions or check their progress).
(iii) Would a negative correlation necessarily show that smaller class sizes cause
better performance? Explain.
It would be indicative, but not conclusive. Other factors such as parental care
could influence this. Parents who are more worried about their children’s
education might enroll them in smaller classrooms so that the child receives
more attention and is coached in the right direction. Such parents would also
take other initiatives to improve their child’s performance such as hiring tutors
or sitting down with them for homework. In this case, the correlation between
smaller class size and performance is due to parental care, not because of the
smaller classroom size itself.
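The contrast between parts (i) and (iii) can be sketched with a small simulation. Everything in it is an assumption made for illustration: the class sizes of 15 and 30, the score equation, and the way parental care drives enrolment are invented, and class size is deliberately given no causal effect on scores.

```python
import random

def ols_slope(x, y):
    """Slope of the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def simulate(randomised, n=20000, seed=0):
    """Return the slope of test score on class size in simulated data.

    Assumption: class size has NO causal effect on scores;
    only (unobserved) parental care raises them.
    """
    rng = random.Random(seed)
    sizes, scores = [], []
    for _ in range(n):
        care = rng.gauss(0, 1)  # unobserved parental care
        if randomised:
            size = rng.choice([15, 30])  # coin flip, unrelated to care
        else:
            # more caring parents are more likely to choose the small class
            size = 15 if care + rng.gauss(0, 1) > 0 else 30
        score = 60 + 5 * care + rng.gauss(0, 5)  # class size does not enter
        sizes.append(size)
        scores.append(score)
    return ols_slope(sizes, scores)

slope_observational = simulate(randomised=False)
slope_experiment = simulate(randomised=True)
print(f"observational slope: {slope_observational:.3f}")
print(f"experimental slope:  {slope_experiment:.3f}")
```

Under these assumptions the observational slope comes out clearly negative even though class size does nothing, while the slope under random assignment is close to zero.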
2) A justification for job training programs is that they improve worker productivity.
Suppose that you are asked to evaluate whether more job training makes workers more
productive. However, rather than having data on individual workers, you have access to
data on manufacturing firms in a given state. In particular, for each firm, you have
information on hours of job training per worker (training) and number of nondefective
items produced per worker hour (output).
(i) Carefully state the ceteris paribus thought experiment underlying the policy
question.
Here is one way to pose the question: If two firms, say A and B, are identical in
all respects except that firm A supplies job training one hour per worker more
than firm B, by how much would firm A’s output differ from firm B’s?
(ii) Does it seem likely that a firm’s decision to train its workers will be independent
of worker characteristics? What are some of those measurable and
unmeasurable worker conditions?
Firms are likely to choose job training depending on the characteristics of
workers. Some observed characteristics are years of schooling, years in the
workforce, and experience in a particular job. Firms might even discriminate
based on age, gender, or race. Perhaps firms choose to offer training to more or
less able workers, where “ability” might be difficult to quantify but where a
manager has some idea about the relative abilities of different employees.
Moreover, different kinds of workers might be attracted to firms that offer more
job training on average, and this might not be evident to employers.
(iii) Name a factor other than worker characteristics that can affect worker
productivity.
The amount of capital and technology available to workers would also affect
output. So, two firms with exactly the same kinds of employees would generally
have different outputs if they use different amounts of capital or technology.
The quality of managers would also have an effect.
(iv) If you find a positive correlation between output and training, would you have
convincingly established that job training makes workers more productive?
Explain.
No, unless the amount of training is randomly assigned. The many factors listed
in parts (ii) and (iii) can contribute to finding a positive correlation between
output and training even if job training does not improve worker productivity.
3) Let kids denote the number of children ever born to a woman, and let educ denote years
of education for the woman. A simple model relating fertility to years of education is
$kids = \beta_0 + \beta_1\, educ + u$,
where u is the unobserved error.
(i) What kinds of factors are contained in u? Are these likely to be correlated with
level of education?
A few examples of factors that are contained in u are income, housing situation,
marital status and work conditions. All these variables can be correlated with
levels of education.
(ii) Will a simple regression analysis uncover the ceteris paribus effect of education
on fertility? Explain.
No. We have already established that important variables are missing from the
model and that they are likely to be correlated with educ, so the zero conditional
mean assumption E(u|educ) = 0 fails. A simple regression then attributes part of
their effect to education. We would have to control for those factors to obtain
ceteris paribus conditions, but since we only have a measure of education we
cannot isolate the effect of education on fertility.
4) According to a news article published in the Brazilian newspaper “O Dia” on 17th March
2023, “more than half the students in public schools in the metropolitan region have
insufficient meals”. According to the research done by Observatório da Alimentação
Escolar, school meals are the main meal for most (56%) of the students in the metropolitan
region of Rio de Janeiro. Furthermore, the research indicates that 41% of the students say
that the quantity of food provided at schools is not sufficient.
The state of Rio de Janeiro has a program that offers a complement to the school meal and
the state secretary of education is interested in estimating the effect of this program on
students’ performance. With this in mind, he had access to the state schools dataset. All the
students from the 5th year enrolled in state schools whose families earn less than R$100 per
capita a month are eligible to participate in the program. Each student then receives
a monthly allowance to help with food expenditure. This allowance is deposited in the bank
account of the child’s parent or legal guardian.
It is expected that the program achieves its goal of having a positive effect on school
performance, ceteris paribus. That is, all else being equal, if a student whose family is
struggling financially is selected into the program, that student’s performance will improve.
Given this, the following linear regression model was proposed:
$performance = \beta_0 + \beta_1\, meal + u$
where performance measures the proportion (in percentage points) of students that had
passing grades in mathematics, meal indicates the percentage of students from each school
that qualified for participation in the meal assistance program and u indicates the error
term (unobserved factor). Please answer:
(i) Interpret the parameters $\beta_0$ and $\beta_1$ in light of the problem presented.
$\beta_0$: we expect that in a school in which 0% of the students in the fifth year
qualified for the program, the percentage of students that pass mathematics
is $\beta_0$%.
$\beta_1$: a variation of one percentage point in the share of students that qualified for
the program is associated with an average variation of $\beta_1$ percentage points
in the share of students that had passing grades in mathematics.
After reviewing the design of the program, an economist noticed that students’ household
poverty could be related to their performance. Furthermore, this poverty rate must have a
strong relationship with acceptance into the program. Finally, the economist also highlights
the fact that schools that present a higher percentage of students enrolled in the program
are also more likely to present a greater dispersion of the proportion of students that
passed mathematics. In light of this, please answer:
(ii) Would you expect the ordinary least squares estimator for $\beta_1$ to be biased or
unbiased? Justify your answer based on the arguments presented in the text.
According to the text provided, “students’ household poverty could be related
to their performance. Furthermore, this poverty rate must have a strong
relationship with acceptance into the program”. Therefore, there is an omitted
variable in the model that is correlated with both the response variable and the
regressor. The error term is thus likely to be correlated with the regressor, so
the zero conditional mean assumption is not met and the OLS estimator is biased.
The greater dispersion mentioned in the text points to heteroskedasticity, which
affects the variance of the estimator but does not by itself cause bias.
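The direction of the bias can be sketched with the usual omitted-variable algebra. The block below assumes, for illustration only, that household poverty enters the true model linearly; the names poverty, $\beta_2$, $\tilde{\delta}_1$ and $v$ are introduced here and are not part of the original problem.

```latex
% Assumed (hypothetical) true model, in which poverty is not observed:
performance = \beta_0 + \beta_1\, meal + \beta_2\, poverty + v,
\qquad E(v \mid meal, poverty) = 0

% \tilde{\beta}_1: slope from the simple regression of performance on meal.
% \tilde{\delta}_1: slope from the simple regression of poverty on meal.
% Conditional on the sample values of meal and poverty:
E(\tilde{\beta}_1) = \beta_1 + \beta_2\, \tilde{\delta}_1
```

If poverty lowers performance ($\beta_2 < 0$) and schools with more eligible students are poorer ($\tilde{\delta}_1 > 0$), the bias term $\beta_2 \tilde{\delta}_1$ is negative and the simple regression would understate the effect of the program. Both signs are assumptions in line with the economist's remarks, not results taken from the data.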
5) Please state whether the following sentences are true (T) or false (F).
( F ) If the homoskedasticity assumption is violated, the OLS estimators will be biased.
( T ) If all the Gauss-Markov assumptions are met, then the OLS estimators will be the best
linear unbiased estimators.
( F ) In the presence of heteroskedasticity, the error term has constant variance for any
value of the explanatory variable.
( F ) One of the main features of cross-sectional data is that it allows the identification of
trends and seasonal effects over time.
( T ) The regression model $y = \beta_0 + \beta_1 \sqrt{x} + u$ can be properly estimated through OLS.
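The last statement is true because the model is linear in the parameters, even though it is not linear in x. A minimal sketch, assuming made-up parameter values of 2 and 3 and simulated data: defining z as the square root of x turns the problem into an ordinary simple regression of y on z.

```python
import math
import random

def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(1)
x = [rng.uniform(0, 100) for _ in range(5000)]
# Hypothetical true parameters chosen for the illustration: beta0 = 2, beta1 = 3
y = [2 + 3 * math.sqrt(xi) + rng.gauss(0, 1) for xi in x]

z = [math.sqrt(xi) for xi in x]  # transformed regressor
b0_hat, b1_hat = ols(z, y)       # OLS of y on sqrt(x)
print(f"b0_hat = {b0_hat:.3f}, b1_hat = {b1_hat:.3f}")
```

The estimates should land close to the assumed values, since the transformed model satisfies the usual simple regression setup.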
6) A researcher was interested in the patterns of meat consumption in the population. He
decided to use a simple linear regression model in which the personal consumption of meat
(in kg) is explained by their monthly income measured in R$1,000. Thus, the estimated
model is:
$\widehat{consumption} = -0.52 + 0.853 \times income$
where consumption is the monthly consumption of meat measured in kilograms and income
is the monthly income measured in R$1,000. Based on this model, please answer:
(i) What is the interpretation of both estimated parameters? Do they make sense?
For $\hat{\beta}_0$: a person with no income is predicted to consume −0.52 kg of meat
per month. This does not make sense, as you cannot consume negative amounts of
meat. Perhaps the dataset has few observations at low levels of income, so the
intercept extrapolates beyond the data, or the data points are unevenly
distributed. For $\hat{\beta}_1$, the coefficient on income: each additional thousand
reais of monthly income is associated with an increase of 0.853 kg in monthly meat
consumption per person, on average. This interpretation does make sense: meat is
relatively expensive and, as a normal good, its consumption is expected to
increase with income, up to a certain point.
(ii) What is the monthly consumption of meat (in kg) of a person whose monthly
income is R$2,000?
As income is measured in units of R$1,000, the predicted monthly consumption of meat
for a person with monthly income of R$2,000 will be:
$\widehat{consumption} = -0.52 + 0.853 \times 2 = 1.186$ kg
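The same arithmetic can be checked in a few lines. This is only a sketch: the function name predicted_consumption and the break-even calculation are additions for illustration, while the two coefficients come from the estimated model above.

```python
B0_HAT = -0.52   # estimated intercept, in kg
B1_HAT = 0.853   # estimated slope, in kg per R$1,000 of monthly income

def predicted_consumption(income_reais):
    """Predicted monthly meat consumption (kg) for a monthly income in R$."""
    income_thousands = income_reais / 1000  # the model measures income in R$1,000
    return B0_HAT + B1_HAT * income_thousands

print(round(predicted_consumption(2000), 3))  # 1.186
print(round(predicted_consumption(0), 3))     # -0.52

# Income (in R$) below which the fitted line predicts negative consumption
break_even = -B0_HAT / B1_HAT * 1000
print(f"prediction turns positive above roughly R${break_even:.0f} per month")
```

The break-even income shows the range over which the fitted line gives the nonsensical negative predictions discussed in part (i).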
7) Consider a researcher that is trying to estimate the weight of a person based solely on
their height:
$y = \beta_0 + \beta_1 x + u$
in which y is the person’s weight (in kg) and x is their height (in cm). After collecting
information on height and weight from 100 individuals, the researcher created the
following table:
$\sum y_i$    $\sum x_i / n$    $\sum (x_i - \bar{x})^2$    $\sum (x_i - \bar{x})(y_i - \bar{y})$
1252          0.5               25                          7
Based on the information above, calculate OLS estimates for $\beta_0$ and $\beta_1$ and write down the
estimated equation showing the relationship between weight and height. Interpret the
estimates.
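The estimator for $\beta_0$ is given by $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and the estimator for $\beta_1$ is given by
$\hat{\beta}_1 = \frac{\widehat{Cov}(x, y)}{\widehat{Var}(x)}$ or, equivalently, $\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$.
With the information at hand, we can calculate $\hat{\beta}_1$:
$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{7}{25} = 0.28.$
We know that there are 100 observations, so we find $\bar{y} = \frac{\sum y_i}{n} = \frac{1252}{100} = 12.52$. We also
know that $\bar{x} = 0.5$. Putting everything together we have:
$\hat{\beta}_0 = 12.52 - 0.28 \times 0.5 = 12.38$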
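The calculation can be reproduced directly from the four summary figures in the table. The variable names below are chosen for this sketch only. It also prints the estimated equation that the question asks for, weight_hat = 12.38 + 0.28 * height. Read literally, one additional centimetre of height is associated with 0.28 kg more predicted weight, and the intercept is the predicted weight at zero height, which has no practical meaning.

```python
n = 100        # number of individuals
sum_y = 1252   # sum of weights (kg)
x_bar = 0.5    # sample mean of height, as given in the table
sxx = 25       # sum of squared deviations of x from its mean
sxy = 7        # sum of cross-products of deviations of x and y

b1_hat = sxy / sxx               # slope estimate
y_bar = sum_y / n                # sample mean of weight
b0_hat = y_bar - b1_hat * x_bar  # intercept estimate

print(f"weight_hat = {b0_hat:.2f} + {b1_hat:.2f} * height")
```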
8) All the sentences below are false. Rewrite them in a way that makes them correct.
* There is more than one way to answer this correctly; the answers below are just examples.
(i) The Ordinary Least Squares Method finds $\hat{\beta}_0$ and $\hat{\beta}_1$ in a way that turns the sum of the
squared residuals into zero.
The Ordinary Least Squares Method finds $\hat{\beta}_0$ and $\hat{\beta}_1$ in a way that minimises the sum of the
squared residuals.
(ii) The Gauss-Markov assumptions SLR.1 through SLR.4 ensure that the OLS estimators are
BLUE.
The Gauss-Markov assumptions SLR.1 through SLR.5 ensure that the OLS estimators are
BLUE.
(iii) If the zero conditional mean assumption is not met, it means that $Var(u|x) = \sigma^2$.
If the zero conditional mean assumption is not met, it means that $E(u|x) \neq 0$.
(iv) In a cross-sectional setting, data is collected for a few units of observation with high
frequency in time.
In a time series setting, data is collected for a few units of observation with high frequency
in time.
(v) The concept of ceteris paribus means that we explore simultaneous changes in all
variables except one.
The concept of ceteris paribus means that we hold constant all explanatory variables except
one.
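The correction to item (i) can be checked numerically. The sketch below uses a small made-up dataset. It shows that the OLS residuals sum to zero while the sum of squared residuals stays strictly positive, and that moving either coefficient away from its OLS value raises the sum of squared residuals.

```python
def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def ssr(b0, b1, x, y):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Made-up data for the illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0_hat, b1_hat = ols(x, y)
best = ssr(b0_hat, b1_hat, x, y)
residual_sum = sum(yi - b0_hat - b1_hat * xi for xi, yi in zip(x, y))

# SSR at nearby coefficient values: every one exceeds the OLS value
perturbed = [
    ssr(b0_hat + d0, b1_hat + d1, x, y)
    for d0 in (-0.1, 0, 0.1)
    for d1 in (-0.1, 0, 0.1)
    if (d0, d1) != (0, 0)
]

print(f"sum of residuals: {residual_sum:.10f}")
print(f"SSR at the OLS estimates: {best:.4f}")
print(f"smallest SSR among perturbed lines: {min(perturbed):.4f}")
```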