UCD Business Analytics - Practical Sheet Solution
Miguel Nicolau
Chapter 3: Linear Regression
Exercise 1
Table 1: Employees and sales in a small sample of companies.
employees    sales (thousands of Euros)
1            15
4            25
5            100
7            120
1. For the data shown in Table 1, calculate a linear regression model of the form y = a + bx, using
employees as the predictor and sales as the response. Apply the Least Squares method, using the
formulas below.
\[ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \]
\[ a = \bar{y} - b\bar{x} \]
Solution
This exercise instructs us to use employees as the predictor variable (x), and sales as the response
variable (y). So in order to apply the Least Squares equations to obtain the a and b coefficients, we
need to calculate some required values:
• n (number of samples): 4;
• $\sum xy$ (sum of each x value multiplied by the corresponding y value): 1 × 15 + 4 × 25 + 5 × 100 + 7 × 120 = 1455;
• $\sum x$ (sum of all x values): 1 + 4 + 5 + 7 = 17;
• $\sum y$ (sum of all y values): 15 + 25 + 100 + 120 = 260;
• $\sum x^2$ (sum of each x value squared): $1^2 + 4^2 + 5^2 + 7^2 = 91$;
• $\left(\sum x\right)^2$ (sum of all x values, squared): $17^2 = 289$;
• x̄ (average of all x values): 17/4 = 4.25;
• ȳ (average of all y values): 260/4 = 65.
So the slope will be:
\[ b = \frac{4 \times 1455 - 17 \times 260}{4 \times 91 - 289} = \frac{5820 - 4420}{364 - 289} = \frac{1400}{75} = 18.667 \]
And the intercept will be
a = 65 − 18.667 × 4.25 = 65 − 79.335 = −14.335
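The hand calculation above can be cross-checked with a few lines of code. This is a minimal sketch in plain Python, assuming no external libraries; the variable names (`sum_xy`, `sum_x2`, and so on) are illustrative and do not come from the sheet.

```python
# Training data from Table 1: employees (x) and sales in thousands of Euros (y).
x = [1, 4, 5, 7]
y = [15, 25, 100, 120]

# Intermediate sums required by the Least Squares formulas.
n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope and intercept, following the formulas for b and a.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(f"b = {b:.3f}, a = {a:.3f}")  # b = 18.667, a = -14.333
```

The code keeps b at full precision, so it reports an intercept of about −14.333. The value −14.335 in the worked solution arises because b is rounded to 18.667 before it is multiplied by x̄.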
2. Calculate the predictions of the model for each of the data points of the training set (i.e. x = 1, 4, 5, 7).
Solution
f (1) = −14.335 + 18.667 ∗ 1 = 4.332
f (4) = −14.335 + 18.667 ∗ 4 = 60.333
f (5) = −14.335 + 18.667 ∗ 5 = 79
f (7) = −14.335 + 18.667 ∗ 7 = 116.334
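The same predictions can be reproduced programmatically. The sketch below assumes the rounded coefficients from part 1 (a = −14.335, b = 18.667) so that the results match the hand calculation; the function name `f` mirrors the notation used above, and `preds` is an illustrative name.

```python
# Rounded coefficients taken from the worked solution to part 1.
a, b = -14.335, 18.667

def f(x):
    """Prediction of the fitted line y = a + b*x for a given x."""
    return a + b * x

# Predictions for the x values of the training set (Table 1).
train_x = [1, 4, 5, 7]
preds = [f(x) for x in train_x]

for x, p in zip(train_x, preds):
    print(f"f({x}) = {p:.3f}")
```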
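3. Calculate the train RMSE and R², using the formulas below.
\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2}{n}} \]
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \]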
Solution
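\[
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{(15 - 4.332)^2 + (25 - 60.333)^2 + (100 - 79)^2 + (120 - 116.334)^2}{4}} \\
&= \sqrt{\frac{10.668^2 + (-35.333)^2 + 21^2 + 3.666^2}{4}} \\
&= \sqrt{\frac{113.806 + 1248.421 + 441 + 13.440}{4}} \\
&= \sqrt{\frac{1816.667}{4}} \\
&= \sqrt{454.167} \\
&= 21.311
\end{aligned}
\]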
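\[
\begin{aligned}
R^2 &= 1 - \frac{1816.667}{(15 - 65)^2 + (25 - 65)^2 + (100 - 65)^2 + (120 - 65)^2} \\
&= 1 - \frac{1816.667}{(-50)^2 + (-40)^2 + 35^2 + 55^2} \\
&= 1 - \frac{1816.667}{2500 + 1600 + 1225 + 3025} \\
&= 1 - \frac{1816.667}{8350} \\
&= 1 - 0.218 \\
&= 0.782
\end{aligned}
\]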
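The train RMSE and R² can be checked with a short script. This is a sketch under the assumption that the rounded coefficients a = −14.335 and b = 18.667 are used, as in the hand calculation; `sse` and `sst` are illustrative names for the numerator and denominator of the R² fraction.

```python
import math

# Training data from Table 1 and the rounded coefficients from part 1.
x = [1, 4, 5, 7]
y = [15, 25, 100, 120]
a, b = -14.335, 18.667

# Sum of squared residuals: sum of (y_i - (a + b*x_i))^2.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Total sum of squares around the mean of y: sum of (y_i - y_bar)^2.
y_mean = sum(y) / len(y)
sst = sum((yi - y_mean) ** 2 for yi in y)

rmse = math.sqrt(sse / len(y))
r2 = 1 - sse / sst

print(f"train RMSE = {rmse:.3f}, train R2 = {r2:.3f}")
```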
4. Table 2 shows available test data. Use it to calculate test RMSE and R² values. Which would you
typically expect to be a larger value: train RMSE or test RMSE? What about train vs. test R²?
Table 2: Test data for employees and sales
employees    sales (thousands of Euros)
3            26
10           135
Solution
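\[
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{(26 - f(3))^2 + (135 - f(10))^2}{2}} \\
&= \sqrt{\frac{(26 - 41.666)^2 + (135 - 172.335)^2}{2}} \\
&= \sqrt{\frac{(-15.666)^2 + (-37.335)^2}{2}} \\
&= \sqrt{\frac{245.424 + 1393.902}{2}} \\
&= \sqrt{\frac{1639.326}{2}} \\
&= \sqrt{819.663} \\
&= 28.63
\end{aligned}
\]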
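\[
\begin{aligned}
R^2 &= 1 - \frac{1639.326}{(26 - 80.5)^2 + (135 - 80.5)^2} \\
&= 1 - \frac{1639.326}{(-54.5)^2 + 54.5^2} \\
&= 1 - \frac{1639.326}{2970.25 + 2970.25} \\
&= 1 - \frac{1639.326}{5940.5} \\
&= 1 - 0.276 \\
&= 0.724
\end{aligned}
\]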
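Typically you would expect train RMSE to be smaller than test RMSE: RMSE is an error measure, and the
coefficients were chosen to minimise the squared error on the train data, whereas the test data played no
part in the fit.
Likewise, you would expect the train R² to be higher than the test R², because the model was fitted to
the train data, so it explains more of the variance of y there than on unseen data. The values calculated
here agree: train RMSE 21.311 is smaller than test RMSE 28.63, and train R² 0.782 is higher than test R² 0.724.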
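The test metrics and the train/test comparison can be verified together. The sketch below assumes the rounded coefficients from part 1 and, as in the hand calculation, uses the mean of the test y values (80.5) when computing the test R². The helper name `rmse_and_r2` is hypothetical.

```python
import math

# Rounded coefficients from the worked solution to part 1.
a, b = -14.335, 18.667

def rmse_and_r2(x, y):
    """Return (RMSE, R^2) of the line y = a + b*x on the given data."""
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    y_mean = sum(y) / len(y)
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return math.sqrt(sse / len(y)), 1 - sse / sst

# Table 1 (train) and Table 2 (test).
train_rmse, train_r2 = rmse_and_r2([1, 4, 5, 7], [15, 25, 100, 120])
test_rmse, test_r2 = rmse_and_r2([3, 10], [26, 135])

print(f"train: RMSE = {train_rmse:.3f}, R2 = {train_r2:.3f}")
print(f"test:  RMSE = {test_rmse:.3f}, R2 = {test_r2:.3f}")
```

On this data the train error is lower and the train R² is higher, which matches the typical expectation, although with only two test points the comparison is illustrative rather than conclusive.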
5. For each of the points in Table 2, is that an in-sample or out-of-sample point?
Solution
In-sample data is the data that was used to train the model. Table 2 does not contain any data from
the training set (i.e. from Table 1), therefore both observations are out-of-sample data.
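The in-sample check amounts to a membership test. The following sketch assumes an observation counts as in-sample only if the same (employees, sales) pair appears in Table 1; the names `train` and `test` are illustrative.

```python
# Observations as (employees, sales) pairs.
train = {(1, 15), (4, 25), (5, 100), (7, 120)}  # Table 1
test = [(3, 26), (10, 135)]                     # Table 2

# An observation is in-sample if it was part of the training set.
labels = ["in-sample" if obs in train else "out-of-sample" for obs in test]

for obs, label in zip(test, labels):
    print(obs, label)
```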
6. For each of your predictions for Table 2, is it an interpolation or extrapolation?
Solution
Interpolation means making predictions for x values within the range of x values used to train the
model; extrapolation means making predictions for x values outside that range. The range of x values
used to train the model was [1, 7]; this means that a prediction for x = 3 is an interpolation, whereas a
prediction for x = 10 is an extrapolation.
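The range check described above can be written as a small function. This is a sketch that assumes the training range is closed (its endpoints count as interpolation); the function name `prediction_kind` is hypothetical.

```python
def prediction_kind(x_new, train_x):
    """Classify a prediction by whether x_new lies inside the training range."""
    lo, hi = min(train_x), max(train_x)
    return "interpolation" if lo <= x_new <= hi else "extrapolation"

train_x = [1, 4, 5, 7]  # x values from Table 1, giving the range [1, 7]
kinds = {x: prediction_kind(x, train_x) for x in [3, 10]}  # x values from Table 2

print(kinds)  # {3: 'interpolation', 10: 'extrapolation'}
```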