STUDENT’S SIGNATURE
SECOND PARTIAL EXAM OF STATISTICS 30001
(COD. 30001/6045/5047/4038/371/377)
January 10, 2019
Last Name First name
ID (Matr.) Degree program Course code
VERSION B
Only work appearing inside the spaces provided below will be graded.
An outline of the procedure used to solve each problem and of the calculations performed is required. All
sheets (including all scrap paper, WHICH WILL HOWEVER NOT BE GRADED) must be turned in at the
end of the exam.
PROBLEM 1 (4 points)
The Production Department of a small company wants to assess whether the productivity of a recruitment
candidate is in line with the required standards. The candidate is asked to perform the same operation repeatedly
and on each occasion, the number of minutes he takes to finalize it is recorded. This results in the following
sample of 6 finalization times (in minutes):
3.4 3.1 3.0 2.4 2.6 2.3
a) Find a 95% confidence interval for µ, the average time required for the candidate to perform the required
task. Specify any assumptions needed to find this interval.
b) The candidate will be hired only if there is sufficient empirical evidence that the time required to perform the
task is, on average, less than 3 minutes. Clearly specify the hypotheses that the company should test in order
to decide whether to hire the candidate. Considering a significance level α = 0.01, what would the company
decide? (Report all the calculations for the required test of hypotheses).
a) Given that the distribution of the population is unknown and the sample is not large enough to apply the
CLT (n < 25), it is necessary to assume that the population from which the sample was drawn follows a
normal distribution.
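$$\bar{x} = \frac{3.4 + 3.1 + 3.0 + 2.4 + 2.6 + 2.3}{6} = \frac{16.8}{6} = 2.8$$
$$s = \sqrt{\frac{6}{5}\left[\frac{3.4^2 + 3.1^2 + 3.0^2 + 2.4^2 + 2.6^2 + 2.3^2}{6} - 2.8^2\right]} = \sqrt{\frac{6}{5}\,[7.996667 - 7.84]} = \sqrt{\frac{6}{5}\cdot 0.156667} = \sqrt{0.188} = 0.43359$$
$$1 - \alpha = 0.95, \qquad t_{n-1;\,\alpha/2} = t_{5;\,0.025} = 2.571$$
$$\left(\bar{x} - t_{n-1;\,\alpha/2}\cdot\frac{s}{\sqrt{n}};\ \ \bar{x} + t_{n-1;\,\alpha/2}\cdot\frac{s}{\sqrt{n}}\right)$$
$$\left(2.8 - 2.571\cdot\frac{0.43359}{\sqrt{6}};\ \ 2.8 + 2.571\cdot\frac{0.43359}{\sqrt{6}}\right)$$
$$(2.344902;\ 3.255098)$$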
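As a quick check of the arithmetic in part a), the interval can be recomputed with a few lines of Python. This is only a sketch: the variable names are illustrative, and the critical value 2.571 is copied from the t table rather than computed.

```python
import math
import statistics

# Finalization times (in minutes) from the problem statement.
times = [3.4, 3.1, 3.0, 2.4, 2.6, 2.3]

n = len(times)
x_bar = statistics.mean(times)   # sample mean
s = statistics.stdev(times)      # sample standard deviation (n - 1 in the denominator)

# Critical value t_{5; 0.025}, taken from the t table used in the solution.
t_crit = 2.571

half_width = t_crit * s / math.sqrt(n)
ci_low = x_bar - half_width
ci_high = x_bar + half_width

print(f"mean = {x_bar:.4f}, s = {s:.5f}")
print(f"95% CI = ({ci_low:.6f}; {ci_high:.6f})")
```

Small differences in the last decimal places with respect to the hand calculation can arise because the solution rounds s to 0.43359 before using it.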
b) The company should test the hypotheses:
$$H_0: \mu \ge 3$$
$$H_1: \mu < 3$$
In this way, the statement for which the company would like to gather empirical evidence, namely that the average time the candidate takes to complete the requested task is in fact lower than 3 minutes, is taken as the alternative hypothesis. The candidate will therefore be hired only if the evidence shows that the requirement is fulfilled.
The decision rule is to reject $H_0$ if
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} < -t_{n-1;\,\alpha},$$
where the value of the test statistic is:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{2.8 - 3}{0.43359/\sqrt{6}} = -1.12987$$
With a significance level α = 0.01, we have $t_{n-1;\,\alpha} = t_{5;\,0.01} = 3.365$.
Since $t = -1.12987 > -t_{n-1;\,\alpha} = -3.365$, the null hypothesis is not rejected at the significance level α = 0.01; the data do not provide sufficient empirical evidence to conclude that the average time required by the candidate to complete the task is below 3 minutes. The company should therefore not hire the candidate based on the test result.
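The left-tailed test in part b) can be reproduced in the same way. The sketch below starts from the summary statistics of part a) and hard-codes the table value $t_{5;\,0.01} = 3.365$; the names are illustrative only.

```python
import math

# Summary statistics from part a) and the value of the mean under H0.
x_bar, s, n = 2.8, 0.43359, 6
mu_0 = 3.0

# Critical value t_{5; 0.01}, taken from the t table.
t_crit = 3.365

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Left-tailed test: reject H0 only if the statistic falls below -t_crit.
reject_h0 = t_stat < -t_crit

print(f"t = {t_stat:.5f}, reject H0: {reject_h0}")
```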
PROBLEM 2 (6 points)
A video illustrating the negative consequences of inappropriate nutrition has been created as part of an awareness
campaign aiming to encourage an increased consumption of fruits and vegetables among young people. In order
to decide whether to launch the campaign, the video is shown to a sample of young people, registering the daily
consumption (in grams) of fruits and vegetables of each individual on two fixed dates, one before and one after
the viewing. The observed results are reported in the following table:
Individual Consumption before film viewing Consumption after film viewing
1 600 740
2 750 800
3 900 880
4 450 930
5 650 880
6 1050 1280
7 400 780
8 470 580
9 500 430
a) Clearly state the hypotheses that must be tested in order to establish if the video has a positive effect in the
context of the awareness campaign and any assumptions required to perform such test. Determine the decision
that must be made at a significance level α = 0.05.
b) A larger survey has determined that out of a sample of 400 young people, only 104 individuals are aware of
the negative consequences of a diet low in fruits and vegetables. Calculate the p-value of the test for the
hypotheses H0: p ≥ 0.3 vs. H1: p < 0.3, where p denotes the proportion of young people in the population who
are aware of such negative effects. What is the conclusion of the test, based on this p-value and with a
significance level α = 0.02?
a) The hypotheses to test are:
$$H_0: \mu_{AFTER} - \mu_{BEFORE} = 0 \quad \text{vs.} \quad H_1: \mu_{AFTER} - \mu_{BEFORE} > 0.$$
In this manner, the efficacy of the video will be established only if there is sufficient empirical
evidence. The necessary assumptions are: dependent samples and normality of both variables
(consumption) BEFORE and AFTER.
We first calculate the differences in consumption d = AFTER – BEFORE:
Individual BEFORE AFTER d d²
1 600 740 140 19600
2 750 800 50 2500
3 900 880 -20 400
4 450 930 480 230400
5 650 880 230 52900
6 1050 1280 230 52900
7 400 780 380 144400
8 470 580 110 12100
9 500 430 -70 4900
$$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n} = \frac{1530}{9} = 170$$
$$s_d = \sqrt{\frac{\sum_{i=1}^{n} d_i^2 - n\bar{d}^2}{n-1}} = \sqrt{\frac{520100 - 9(170)^2}{8}} = \sqrt{\frac{260000}{8}} = \sqrt{32500} = 180.277563$$
$$\frac{s_d}{\sqrt{n}} = \frac{180.277563}{\sqrt{9}} = 60.092521$$
Under the null hypothesis, the standardized mean difference follows a Student's t distribution with n-1
degrees of freedom, therefore $H_0$ must be rejected if
$$t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}} > t_{n-1;\,\alpha},$$
where $d_0 = 0$ is the mean difference under $H_0$. The observed value of the test statistic is
$$t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}} = \frac{170}{60.092521} = 2.828971$$
$$\alpha = 0.05, \qquad t_{n-1;\,\alpha} = t_{8;\,0.05} = 1.860$$
Since 2.828971 > 1.860, $H_0$ is rejected. The data provide sufficient empirical evidence to state that the
video has had a positive effect towards awareness.
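The paired-sample calculation can be verified directly from the table in the problem statement. The following sketch uses only the Python standard library; the list names are illustrative and the critical value $t_{8;\,0.05} = 1.860$ is copied from the t table.

```python
import math
import statistics

# Daily consumption (grams) before and after viewing, individuals 1 to 9.
before = [600, 750, 900, 450, 650, 1050, 400, 470, 500]
after = [740, 800, 880, 930, 880, 1280, 780, 580, 430]

# Paired differences d = AFTER - BEFORE.
d = [a - b for a, b in zip(after, before)]
n = len(d)

d_bar = statistics.mean(d)    # mean difference
s_d = statistics.stdev(d)     # standard deviation of the differences
t_stat = d_bar / (s_d / math.sqrt(n))

# Critical value t_{8; 0.05}, taken from the t table (right-tailed test).
t_crit = 1.860
reject_h0 = t_stat > t_crit

print(f"sum(d) = {sum(d)}, d_bar = {d_bar}, s_d = {s_d:.6f}")
print(f"t = {t_stat:.6f}, reject H0: {reject_h0}")
```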
b) The value of the test statistic in this case is:
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} = \frac{0.26 - 0.3}{\sqrt{\dfrac{0.3(1-0.3)}{400}}} = \frac{-0.04}{\sqrt{0.000525}} = -1.74574,$$
where $\hat{p} = \frac{104}{400} = 0.26$ is the observed sample proportion. The corresponding p-value is, therefore,
$$p\text{-value} = P(Z < z) = P(Z < -1.74574) = 1 - P(Z < 1.74574) \approx 1 - P(Z < 1.75) = 1 - 0.9599 = 0.0401$$
The null hypothesis is rejected if p-value < α. So, given that α = 0.02 < 0.0401 = p-value, $H_0$ in
this case is not rejected. The data do not provide sufficient empirical evidence that the proportion of
young people who are aware of the negative consequences of a diet low in fruits and vegetables is
below 0.3.
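For comparison, the p-value can also be obtained without rounding z to two decimals. The sketch below uses `statistics.NormalDist` from the Python standard library for the standard normal CDF; the variable names are illustrative. The result differs slightly from 0.0401 because the table lookup in the solution rounds z to -1.75.

```python
import math
from statistics import NormalDist

n = 400          # sample size
aware = 104      # individuals aware of the negative consequences
p_0 = 0.3        # proportion under H0
alpha = 0.02     # significance level

p_hat = aware / n
z = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)

# Left-tailed test: the p-value is P(Z < z) for a standard normal Z.
p_value = NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(f"p_hat = {p_hat:.2f}, z = {z:.5f}, p-value = {p_value:.4f}")
print(f"reject H0: {reject_h0}")
```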
PROBLEM 3 (4 points)
A survey was performed on a sample of readers with the aim of determining if there is an association between the
preferred literary genre and the book format (e-book or printed). The results are summarized in the following
table:
Book format | English novel | International novel | Essay | Other
e-book | 22 | 34 | 22 | 46
printed | 10 | 12 | 10 | 44
a) State the hypotheses that must be tested.
b) What can you say about the p-value for the test? Clearly justify by providing all required calculations.
c) Based on the obtained results, what is the conclusion of the test? Use α = 0.1.
a) $H_0$: there is NO association between the variables GENRE and FORMAT (the variables ARE independent)
$H_1$: there IS some association between the variables GENRE and FORMAT (the variables are NOT independent)
b) The test statistic is
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad \text{where } E_{ij} = \frac{R_i C_j}{n}.$$
Under the null hypothesis, this statistic follows a Chi-squared distribution with (r-1)(c-1) degrees of freedom. The corresponding expected
frequencies for this sample are as follows:
FORMAT \ GENRE | English novel | International novel | Essay | Other | $R_i$
e-book | 32×124/200 = 19.84 | 46×124/200 = 28.52 | 32×124/200 = 19.84 | 90×124/200 = 55.8 | 22+34+22+46 = 124
printed | 32×76/200 = 12.16 | 46×76/200 = 17.48 | 32×76/200 = 12.16 | 90×76/200 = 34.2 | 10+12+10+44 = 76
$C_j$ | 22+10 = 32 | 34+12 = 46 | 22+10 = 32 | 46+44 = 90 | 200
Therefore, the observed value of the test statistic is:
$$\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(22-19.84)^2}{19.84} + \frac{(34-28.52)^2}{28.52} + \frac{(22-19.84)^2}{19.84} + \frac{(46-55.8)^2}{55.8} + \frac{(10-12.16)^2}{12.16} + \frac{(12-17.48)^2}{17.48} + \frac{(10-12.16)^2}{12.16} + \frac{(44-34.2)^2}{34.2} = 8.537971$$
And the p-value is $p\text{-value} = \Pr(\chi^2_3 > 8.537971)$. From the table of the Chi-squared distribution, we
can see that $\Pr(\chi^2_3 > 7.81) = 0.05$ and $\Pr(\chi^2_3 > 9.35) = 0.025$, from which we can conclude that
0.025 < p-value < 0.05.
c) Given that p-value < 0.05, it follows that p-value < α = 0.1. Therefore, the null hypothesis is rejected. At this
significance level, the data provide sufficient evidence of association between the variables.
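The expected frequencies and the Chi-squared statistic can be recomputed from the observed table with plain Python. This is a sketch with illustrative names; the two critical values are the table entries quoted in part b), so the code only brackets the p-value instead of computing it exactly.

```python
# Observed frequencies: rows are formats, columns are genres
# (English novel, International novel, Essay, Other).
observed = [
    [22, 34, 22, 46],   # e-book
    [10, 12, 10, 44],   # printed
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: E_ij = R_i * C_j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Table values: Pr(chi2_3 > 7.81) = 0.05 and Pr(chi2_3 > 9.35) = 0.025.
crit_05, crit_025 = 7.81, 9.35
p_between_025_and_05 = crit_05 < chi2 < crit_025

print(f"chi2 = {chi2:.6f}, df = {df}")
print(f"0.025 < p-value < 0.05: {p_between_025_and_05}")
```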
PROBLEM 4 (7 points)
In order to study the relationship between the time Y (in minutes) spent by a customer in a shopping center and
the distance X (in km) that the customer has traveled to reach the center, a sample of 12 customers is selected. A
sample correlation coefficient equal to +0.8215 and the following estimated linear regression model are obtained:
$$\hat{y} = -36.7943 + 4.5681\,x$$
a) Based on this sample, is it possible to say that there is a positive relationship between the variables X and Y
in the entire population? Perform the test with a significance level α = 0.05, clearly stating the hypotheses.
b) Provide a forecast for the average time spent in the shopping centre by customers who travel 18 km to reach
it.
Knowing also that $\sum_{i=1}^{12}(x_i - \bar{x})^2 = 163.24$,
c) Estimate the variance of the error component for the linear model under consideration.
d) Find a 95% confidence interval for the slope coefficient of the linear model under consideration.
a) In order to perform the necessary test, we must assume that the random sample has been extracted from
a bivariate normal population. We must test the hypotheses:
$$H_0: \rho = 0$$
$$H_1: \rho > 0$$
The decision rule is to reject $H_0$ if
$$\frac{r\sqrt{n-2}}{\sqrt{1-r^2}} > t_{n-2;\,\alpha},$$
with significance level α = 0.05.
From the sample, we have r = 0.8215 and n = 12, so the observed value of the test statistic is:
$$\frac{0.8215\sqrt{10}}{\sqrt{1-(0.8215)^2}} = \frac{2.597811}{0.570209} = 4.555897$$
Given that $4.555897 > t_{10;\,0.05} = 1.812$, we reject $H_0$. Therefore, with a 5% significance level, it is
possible to say that there is a positive relationship between X and Y.
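The value of the test statistic for the correlation coefficient is easy to check numerically. The sketch below uses illustrative names and takes the critical value $t_{10;\,0.05} = 1.812$ from the t table.

```python
import math

r = 0.8215   # sample correlation coefficient
n = 12       # sample size

# Critical value t_{10; 0.05}, taken from the t table (right-tailed test).
t_crit = 1.812

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
reject_h0 = t_stat > t_crit

print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
```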
b) Setting x = 18, we get
$$\hat{y} = -36.7943 + 4.5681 \cdot 18 = 45.4315$$
Therefore, customers who must travel 18 km to reach the shopping centre spend, on average, about 45
minutes there.
c) The estimated variance of the error component for the linear model is given by:
$$s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2}$$
Knowing that $SST = SSR + SSE$, and given that $\sum_{i=1}^{12}(x_i - \bar{x})^2 = 163.24$, we may calculate the SSR
as follows:
$$SSR = \sum_{i=1}^{12}(\hat{y}_i - \bar{y})^2 = b_1^2 \sum_{i=1}^{12}(x_i - \bar{x})^2 = (4.5681)^2 \cdot 163.24 = 3406.417$$
On the other hand, $R^2 = r^2 = \frac{SSR}{SST}$, from which we can find
$$SST = \frac{SSR}{r^2} = \frac{3406.417}{(0.8215)^2} = 5047.574$$
Finally, $SSE = SST - SSR = 5047.574 - 3406.417 = 1641.157$, and
$$s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n-2} = \frac{SSE}{n-2} = \frac{1641.157}{10} = 164.1157$$
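The chain SSR, SST, SSE and the resulting error variance can be checked with a short script. This is a sketch that simply follows the steps of part c); the variable names (`sxx`, `s_e2`, and so on) are illustrative.

```python
b1 = 4.5681     # estimated slope
r = 0.8215      # sample correlation coefficient
n = 12          # sample size
sxx = 163.24    # sum of squared deviations of x from its mean

ssr = b1 ** 2 * sxx     # regression sum of squares
sst = ssr / r ** 2      # total sum of squares, from R^2 = r^2 = SSR / SST
sse = sst - ssr         # error (residual) sum of squares
s_e2 = sse / (n - 2)    # estimated variance of the error component

print(f"SSR = {ssr:.3f}, SST = {sst:.3f}, SSE = {sse:.3f}")
print(f"s_e^2 = {s_e2:.4f}")
```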
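d) The estimated standard error of the slope is
$$s_{b_1} = \sqrt{\frac{s_e^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} = \sqrt{\frac{164.1157}{163.24}} = 1.002679,$$
so the desired confidence interval is:
$$CI(\beta_1)_{1-\alpha=0.95} = b_1 \pm t_{n-2;\,\alpha/2}\, s_{b_1} = b_1 \pm t_{10;\,0.025}\, s_{b_1} = 4.5681 \pm 2.228 \cdot 1.002679 = [2.334131;\ 6.802069]$$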
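The interval for the slope can be recomputed as follows. This is a sketch under the same inputs as part d): the error variance from part c), the given sum of squares of x, and the table value $t_{10;\,0.025} = 2.228$; the names are illustrative.

```python
import math

b1 = 4.5681        # estimated slope
s_e2 = 164.1157    # estimated error variance from part c)
sxx = 163.24       # sum of squared deviations of x from its mean

# Critical value t_{10; 0.025}, taken from the t table (95% two-sided interval).
t_crit = 2.228

s_b1 = math.sqrt(s_e2 / sxx)    # standard error of the slope
ci_low = b1 - t_crit * s_b1
ci_high = b1 + t_crit * s_b1

print(f"s_b1 = {s_b1:.6f}")
print(f"95% CI for beta_1 = [{ci_low:.6f}; {ci_high:.6f}]")
```

Since the interval does not contain 0, it is consistent with the conclusion of part a) that X and Y are positively related.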
PROBLEM 5 (3 points)
State the definition of confidence interval for a generic parameter θ.
What is the interpretation of the confidence level?
When is it possible to use the formula $\bar{x} \mp z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ to construct a confidence interval for the mean of a population?
Please refer to the textbook and the lecture notes
PROBLEM 6 (3 points)
In the context of a simple linear regression model, what are the possible types of prediction that can be made?
Provide the relevant formulas for point estimation and interval estimation.
Please refer to the textbook and the lecture notes