STOR 455/002 - Homework I
January 17, 2025
This homework is due at 11:59 pm (Eastern Time) on Jan 30, 2025, and is to be submitted on Gradescope.
Exercise 1 (13 points)
Consider a bivariate dataset consisting of $n$ observations on 2 numerical variables $x^{(1)}$ and $x^{(2)}$. Let $x_j^{(i)}$ denote the value of the variable $x^{(i)}$ for the $j$-th observation, where $i = 1, 2$ and $j = 1, \ldots, n$.
(a) (2 + 3 + 3 points)
Fix two real numbers $a$ and $b$. Consider the new variable $z := a x^{(1)} + b$. In other words, the value of the variable $z$ for the $j$-th observation is given by $z_j = a x_j^{(1)} + b$. Show that
$$\bar{z}_n = a \bar{x}_n^{(1)} + b, \qquad \operatorname{var}(z) = a^2 \operatorname{var}\big(x^{(1)}\big), \qquad \operatorname{cov}\big(z, x^{(2)}\big) = a \operatorname{cov}\big(x^{(1)}, x^{(2)}\big).$$
Use the definition of sample mean, variance and covariance.
$$\bar{z}_n = \frac{1}{n} \sum_{j=1}^{n} z_j = \frac{1}{n} \sum_{j=1}^{n} \big(a x_j^{(1)} + b\big)$$
$$\bar{z}_n = a \cdot \frac{1}{n} \sum_{j=1}^{n} x_j^{(1)} + \frac{1}{n} \sum_{j=1}^{n} b$$
Simplify using $\bar{x}_n^{(1)} = \frac{1}{n} \sum_{j=1}^{n} x_j^{(1)}$:
$$\bar{z}_n = a \bar{x}_n^{(1)} + b$$
$$\operatorname{var}(z) = \frac{1}{n-1} \sum_{j=1}^{n} \big(z_j - \bar{z}_n\big)^2$$
Substitute $z_j = a x_j^{(1)} + b$ and $\bar{z}_n = a \bar{x}_n^{(1)} + b$:
$$\operatorname{var}(z) = \frac{1}{n-1} \sum_{j=1}^{n} \Big(a x_j^{(1)} + b - \big(a \bar{x}_n^{(1)} + b\big)\Big)^2$$
$$\operatorname{var}(z) = \frac{1}{n-1} \sum_{j=1}^{n} a^2 \big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2$$
$$\operatorname{var}(z) = a^2 \cdot \frac{1}{n-1} \sum_{j=1}^{n} \big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 = a^2 \operatorname{var}\big(x^{(1)}\big)$$
$$\operatorname{cov}\big(z, x^{(2)}\big) = \frac{1}{n-1} \sum_{j=1}^{n} \big(z_j - \bar{z}_n\big) \big(x_j^{(2)} - \bar{x}_n^{(2)}\big)$$
Substitute $z_j = a x_j^{(1)} + b$ and $\bar{z}_n = a \bar{x}_n^{(1)} + b$:
$$\operatorname{cov}\big(z, x^{(2)}\big) = \frac{1}{n-1} \sum_{j=1}^{n} a \big(x_j^{(1)} - \bar{x}_n^{(1)}\big) \big(x_j^{(2)} - \bar{x}_n^{(2)}\big)$$
$$\operatorname{cov}\big(z, x^{(2)}\big) = a \cdot \frac{1}{n-1} \sum_{j=1}^{n} \big(x_j^{(1)} - \bar{x}_n^{(1)}\big) \big(x_j^{(2)} - \bar{x}_n^{(2)}\big) = a \operatorname{cov}\big(x^{(1)}, x^{(2)}\big)$$
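As a quick numerical sanity check (not required by the problem), the three identities of Part (a) can be verified on arbitrary synthetic data; the sketch below uses Python/numpy with made-up values of $a$, $b$, and the data:

```python
import numpy as np

# Arbitrary synthetic data standing in for x^(1) and x^(2).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
a, b = 2.5, -1.0
z = a * x1 + b  # the transformed variable z = a*x^(1) + b

# Sample statistics with the same n-1 convention as the derivation.
assert np.isclose(z.mean(), a * x1.mean() + b)                    # mean identity
assert np.isclose(np.var(z, ddof=1), a**2 * np.var(x1, ddof=1))   # variance identity
assert np.isclose(np.cov(z, x2, ddof=1)[0, 1],
                  a * np.cov(x1, x2, ddof=1)[0, 1])               # covariance identity
```

Note that the additive constant $b$ drops out of both the variance and the covariance, exactly as the algebra shows.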
(b) (5 points)
Show that
$$\operatorname{var}\big(x^{(1)} + x^{(2)}\big) = \operatorname{var}\big(x^{(1)}\big) + \operatorname{var}\big(x^{(2)}\big) + 2 \operatorname{cov}\big(x^{(1)}, x^{(2)}\big).$$
Write down $\operatorname{var}\big(x^{(1)} + x^{(2)}\big)$ as
$$\operatorname{var}\big(x^{(1)} + x^{(2)}\big) = \frac{1}{n-1} \sum_{j=1}^{n} \big(x_j^{(1)} + x_j^{(2)} - \bar{x}_n^{(1)} - \bar{x}_n^{(2)}\big)^2.$$
Expand each summand in the following way:
$$\big(x_j^{(1)} + x_j^{(2)} - \bar{x}_n^{(1)} - \bar{x}_n^{(2)}\big)^2 = \big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 + \big(x_j^{(2)} - \bar{x}_n^{(2)}\big)^2 + 2 \big(x_j^{(1)} - \bar{x}_n^{(1)}\big) \big(x_j^{(2)} - \bar{x}_n^{(2)}\big).$$
Summing over $j$ and dividing by $n-1$:
$$\operatorname{var}\big(x^{(1)} + x^{(2)}\big) = \frac{1}{n-1} \left[ \sum_{j=1}^{n} \big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 + \sum_{j=1}^{n} \big(x_j^{(2)} - \bar{x}_n^{(2)}\big)^2 + 2 \sum_{j=1}^{n} \big(x_j^{(1)} - \bar{x}_n^{(1)}\big) \big(x_j^{(2)} - \bar{x}_n^{(2)}\big) \right]$$
$$\operatorname{var}\big(x^{(1)} + x^{(2)}\big) = \operatorname{var}\big(x^{(1)}\big) + \operatorname{var}\big(x^{(2)}\big) + 2 \operatorname{cov}\big(x^{(1)}, x^{(2)}\big)$$
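The variance-of-a-sum identity of Part (b) can likewise be checked numerically; the snippet below is an illustrative sketch on arbitrary synthetic data, not part of the graded solution:

```python
import numpy as np

# Arbitrary synthetic data for x^(1) and x^(2).
rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = rng.normal(size=40)

# var(x1 + x2) should equal var(x1) + var(x2) + 2*cov(x1, x2),
# all computed with the n-1 (sample) convention.
lhs = np.var(x1 + x2, ddof=1)
rhs = np.var(x1, ddof=1) + np.var(x2, ddof=1) + 2 * np.cov(x1, x2, ddof=1)[0, 1]
assert np.isclose(lhs, rhs)
```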
Exercise 2 (7 points)
Consider a dataset of $n$ observations on a numerical variable $Y$. The value of $Y$ for the $i$-th observation is $Y_i$, for $i = 1, \ldots, n$. Establish the following sum-of-squares decomposition: for any real number $c$,
$$\sum_{i=1}^{n} (Y_i - c)^2 = \sum_{i=1}^{n} \big(Y_i - \bar{Y}_n\big)^2 + n \big(\bar{Y}_n - c\big)^2,$$
where $\bar{Y}_n$ is the sample mean of the variable $Y$. Observe that the above identity shows the LHS is uniquely minimized at $c = \bar{Y}_n$.
Write $(Y_i - c)$ on the left-hand side as
$$(Y_i - c) = \big(Y_i - \bar{Y}_n\big) + \big(\bar{Y}_n - c\big),$$
and use the expansion $(a+b)^2 = a^2 + b^2 + 2ab$ to expand $(Y_i - c)^2$.
$$(Y_i - c) = \big(Y_i - \bar{Y}_n\big) + \big(\bar{Y}_n - c\big), \qquad (Y_i - c)^2 = \big[\big(Y_i - \bar{Y}_n\big) + \big(\bar{Y}_n - c\big)\big]^2.$$
$$(Y_i - c)^2 = \big(Y_i - \bar{Y}_n\big)^2 + \big(\bar{Y}_n - c\big)^2 + 2 \big(Y_i - \bar{Y}_n\big) \big(\bar{Y}_n - c\big).$$
Summing over $i$:
$$\sum_{i=1}^{n} (Y_i - c)^2 = \sum_{i=1}^{n} \big(Y_i - \bar{Y}_n\big)^2 + \sum_{i=1}^{n} \big(\bar{Y}_n - c\big)^2 + 2 \sum_{i=1}^{n} \big(Y_i - \bar{Y}_n\big) \big(\bar{Y}_n - c\big).$$
Since $\sum_{i=1}^{n} \big(\bar{Y}_n - c\big)^2 = n \big(\bar{Y}_n - c\big)^2$ and $\sum_{i=1}^{n} \big(Y_i - \bar{Y}_n\big) = 0$,
$$\sum_{i=1}^{n} (Y_i - c)^2 = \sum_{i=1}^{n} \big(Y_i - \bar{Y}_n\big)^2 + n \big(\bar{Y}_n - c\big)^2.$$
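As an informal check of the decomposition and of the minimization claim, the identity can be evaluated for several values of $c$ on arbitrary synthetic data; the Python/numpy sketch below is illustrative only:

```python
import numpy as np

# Arbitrary synthetic sample for Y.
rng = np.random.default_rng(2)
Y = rng.normal(loc=3.0, size=30)
Ybar = Y.mean()

def lhs(c):
    # Left-hand side of the decomposition: sum of (Y_i - c)^2.
    return np.sum((Y - c) ** 2)

# The identity holds for every c, including c = Ybar.
for c in [-1.0, 0.0, Ybar, 5.0]:
    rhs = np.sum((Y - Ybar) ** 2) + len(Y) * (Ybar - c) ** 2
    assert np.isclose(lhs(c), rhs)

# Any c != Ybar adds the strictly positive term n*(Ybar - c)^2,
# so the LHS is smallest exactly at c = Ybar.
assert lhs(Ybar) < lhs(Ybar + 0.1)
```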
Exercise 3 (13 points)
Consider the bivariate dataset of $n$ observations
$$\{(Y_1, x_1), \ldots, (Y_n, x_n)\},$$
where $Y$ and $x$ are respectively the response and explanatory variables. We want to fit a simple linear regression model of $Y$ on $x$, given by
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n.$$
In this exercise we shall prove (without using calculus) the following algebraic expressions for the least-squares estimates:
$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n, \qquad \hat{\beta}_1 = \frac{r_{xY} s_Y}{s_x}.$$
In other words, the above quantities solve the following minimization problem:
$$\min_{(\beta_0, \beta_1) \in \mathbb{R}^2} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2.$$
(a) (2 points)
Use Exercise 2 to show that for any $(\beta_0, \beta_1)$,
$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n} \big(Y_i - \bar{Y}_n - \beta_1 x_i + \beta_1 \bar{x}_n\big)^2 + n \big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2.$$
(b) (2 points)
Deduce from Part (a) that any solution $\big(\hat{\beta}_0, \hat{\beta}_1\big)$ to the minimization problem above satisfies the relation
$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n.$$
(c) (2 points)
Deduce from Part (b) that the solution $\hat{\beta}_1$ solves the following optimization problem:
$$\min_{\beta_1 \in \mathbb{R}} \sum_{i=1}^{n} \big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2.$$
(d) (4 points)
Show that
$$\frac{1}{n-1} \sum_{i=1}^{n} \big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2 = \beta_1^2 \operatorname{Var}(x) - 2 \beta_1 \operatorname{Cov}(x, Y) + \operatorname{Var}(Y).$$
(e) (3 points)
Use Part (d) to deduce that the problem in Part (c) is solved at
$$\hat{\beta}_1 = \frac{r_{xY} s_Y}{s_x}.$$
Use the fact that for any positive real number $a$ and real numbers $b, c$, the quadratic expression $a t^2 + b t + c$ is minimized at $t = -b/(2a)$.
a.
Apply Exercise 2 to the transformed variable $Z_i := Y_i - \beta_1 x_i$ with the constant $c = \beta_0$. By linearity of the sample mean (Exercise 1), $\bar{Z}_n = \bar{Y}_n - \beta_1 \bar{x}_n$. The sum-of-squares decomposition then gives
$$\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n} \big(Z_i - \bar{Z}_n\big)^2 + n \big(\bar{Z}_n - \beta_0\big)^2 = \sum_{i=1}^{n} \big(Y_i - \bar{Y}_n - \beta_1 (x_i - \bar{x}_n)\big)^2 + n \big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2.$$
This matches the required decomposition. Equivalently, write
$$Y_i - \beta_0 - \beta_1 x_i = \big(Y_i - \bar{Y}_n - \beta_1 (x_i - \bar{x}_n)\big) + \big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)$$
and expand the square; the cross-term vanishes because $\sum_{i=1}^{n} (Y_i - \bar{Y}_n) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x}_n) = 0$.
b. The first term in Part (a) does not involve $\beta_0$, so the total sum of squares is minimized when the second term $n \big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2$ is zero. Thus:
$$\bar{Y}_n - \hat{\beta}_1 \bar{x}_n - \hat{\beta}_0 = 0, \qquad \text{i.e.} \qquad \hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n.$$
c. Substitute $\beta_0 = \bar{Y}_n - \beta_1 \bar{x}_n$ into the original minimization problem:
$$\sum_{i=1}^{n} \big(Y_i - (\bar{Y}_n - \beta_1 \bar{x}_n) - \beta_1 x_i\big)^2 = \sum_{i=1}^{n} \big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2.$$
Hence $\hat{\beta}_1$ minimizes this reduced sum of squares over $\beta_1$.
d.
$$\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2 = (Y_i - \bar{Y}_n)^2 - 2 \beta_1 (Y_i - \bar{Y}_n)(x_i - \bar{x}_n) + \beta_1^2 (x_i - \bar{x}_n)^2.$$
Summing over $i$ and dividing by $n-1$:
$$\frac{1}{n-1} \sum_{i=1}^{n} \big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2 = \beta_1^2 \operatorname{Var}(x) - 2 \beta_1 \operatorname{Cov}(x, Y) + \operatorname{Var}(Y).$$
e. $\beta_1^2 \operatorname{Var}(x) - 2 \beta_1 \operatorname{Cov}(x, Y) + \operatorname{Var}(Y)$ is a quadratic in $\beta_1$ with positive leading coefficient $\operatorname{Var}(x)$. Minimizing this quadratic at $t = -b/(2a)$ gives:
$$\hat{\beta}_1 = \frac{\operatorname{Cov}(x, Y)}{\operatorname{Var}(x)} = \frac{r_{xY} s_Y}{s_x},$$
where $r_{xY} = \frac{\operatorname{Cov}(x, Y)}{s_x s_Y}$, $s_x = \sqrt{\operatorname{Var}(x)}$, and $s_Y = \sqrt{\operatorname{Var}(Y)}$.
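The closed-form estimates derived above can be compared against a numerical least-squares fit; the sketch below uses Python/numpy on synthetic data (the data-generating values are arbitrary, loosely inspired by Exercise 4's fitted coefficients):

```python
import numpy as np

# Synthetic bivariate sample (x, Y) with a roughly linear relationship.
rng = np.random.default_rng(3)
x = rng.uniform(40, 60, size=100)
Y = -9.0 + 0.24 * x + rng.normal(scale=0.8, size=100)

# Closed-form estimates from Exercise 3:
#   beta1 = r_xY * s_Y / s_x,  beta0 = Ybar - beta1 * xbar.
sx, sY = np.std(x, ddof=1), np.std(Y, ddof=1)
r = np.corrcoef(x, Y)[0, 1]
beta1 = r * sY / sx
beta0 = Y.mean() - beta1 * x.mean()

# numpy's polynomial least-squares fit returns (slope, intercept) for deg=1.
fit_slope, fit_intercept = np.polyfit(x, Y, deg=1)
assert np.isclose(beta1, fit_slope)
assert np.isclose(beta0, fit_intercept)
```

Agreement between the two confirms that the algebraic formulas coincide with the numerical minimizer of the residual sum of squares.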
Exercise 4 (12 points)
The file nepal.csv contains data from a public health study on Nepalese children. The dataset has 877 observations on 3 variables: sex, height, and weight.
(a) (5 + 5 points)
Fit separate simple linear regression models with weight as the response variable and height as the predictor variable on the sub-datasets of male and female children. Report the estimated coefficients and a scatter plot (along with the fitted line) for each sub-population.
nepal <- read.csv("/Users/macbook/Desktop/STOR455/nepal.csv")
males <- subset(nepal, sex == 1)
females <- subset(nepal, sex == 2)
male_model <- lm(weight~height, data = males)
female_model <- lm(weight~height, data = females)
male_model$coefficients
## (Intercept) height
## -9.0869252 0.2393433
female_model$coefficients
## (Intercept) height
## -8.3712108 0.2281936
plot(males$height, males$weight, main = "Weight vs. Height (Males)", xlab =
"Height (cm)", ylab = "Weight (kg)", pch = 19)
abline(male_model, col = "red", lwd = 2)
plot(females$height, females$weight, main = "Weight vs. Height (Females)",
xlab = "Height (cm)", ylab = "Weight (kg)", pch = 19)
abline(female_model, col = "red", lwd = 2)
(b) (2 points)
Comment on the goodness-of-fit of the simple linear regression models for the two sub-populations. For which sub-population does the model fit the data better?
summary(male_model)
##
## Call:
## lm(formula = weight ~ height, data = males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7192 -0.5064 -0.0510 0.4496 3.2427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.086925 0.288998 -31.44 <2e-16 ***
## height 0.239343 0.003341 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8373 on 453 degrees of freedom
## Multiple R-squared: 0.9189, Adjusted R-squared: 0.9187
## F-statistic: 5131 on 1 and 453 DF, p-value: < 2.2e-16
summary(female_model)
##
## Call:
## lm(formula = weight ~ height, data = females)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.82127 -0.57982 -0.02652 0.50813 3.15115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.371211 0.303580 -27.57 <2e-16 ***
## height 0.228194 0.003551 64.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8916 on 420 degrees of freedom
## Multiple R-squared: 0.9077, Adjusted R-squared: 0.9075
## F-statistic: 4129 on 1 and 420 DF, p-value: < 2.2e-16
Examining the regression outputs for both sub-populations, the male model has a lower residual standard error (0.8373 vs. 0.8916), indicating that its predictions are slightly more precise than the female model's, and a higher R-squared value (0.9189 vs. 0.9077). These results suggest that the simple linear regression model provides a better fit for the male sub-population: it explains a slightly greater proportion of the variance in weight and produces more precise predictions.
Exercise 5 (3 + 2 points)
Consider a bivariate dataset consisting of $n$ observations on 2 numerical variables $x^{(1)}$ and $x^{(2)}$. We first fit a simple linear regression model with $x^{(1)}$ as the response variable and $x^{(2)}$ as the explanatory variable. Let $b_{12}$ denote the (least-squares) estimated slope coefficient. Next we fit a simple linear regression model with $x^{(2)}$ as the response variable and $x^{(1)}$ as the explanatory variable. Let $b_{21}$ denote the (least-squares) estimated slope of this new fitted line. Show that
$$b_{12} b_{21} = \big(r_{x^{(1)} x^{(2)}}\big)^2,$$
where $r_{x^{(1)} x^{(2)}}$ is the sample correlation coefficient between the variables $x^{(1)}$ and $x^{(2)}$. Argue that both estimated slopes have the same sign and that at most one of them can have absolute value greater than 1.
The least-squares slope estimates are:
$$b_{12} = \frac{\operatorname{Cov}\big(x^{(1)}, x^{(2)}\big)}{\operatorname{Var}\big(x^{(2)}\big)}, \qquad b_{21} = \frac{\operatorname{Cov}\big(x^{(1)}, x^{(2)}\big)}{\operatorname{Var}\big(x^{(1)}\big)}.$$
$$b_{12} \cdot b_{21} = \frac{\big[\operatorname{Cov}\big(x^{(1)}, x^{(2)}\big)\big]^2}{\operatorname{Var}\big(x^{(1)}\big) \operatorname{Var}\big(x^{(2)}\big)}.$$
The correlation coefficient is
$$r_{x^{(1)} x^{(2)}} = \frac{\operatorname{Cov}\big(x^{(1)}, x^{(2)}\big)}{s_{x^{(1)}} s_{x^{(2)}}},$$
and since $s_{x^{(1)}} = \sqrt{\operatorname{Var}\big(x^{(1)}\big)}$ and $s_{x^{(2)}} = \sqrt{\operatorname{Var}\big(x^{(2)}\big)}$,
$$\big(r_{x^{(1)} x^{(2)}}\big)^2 = \frac{\big[\operatorname{Cov}\big(x^{(1)}, x^{(2)}\big)\big]^2}{\operatorname{Var}\big(x^{(1)}\big) \operatorname{Var}\big(x^{(2)}\big)}.$$
Thus $b_{12} \cdot b_{21} = \big(r_{x^{(1)} x^{(2)}}\big)^2$.
Both slopes $b_{12}$ and $b_{21}$ share the sign of $\operatorname{Cov}\big(x^{(1)}, x^{(2)}\big)$, which matches the sign of $r_{x^{(1)} x^{(2)}}$. Moreover, since $|b_{12}| \cdot |b_{21}| = \big(r_{x^{(1)} x^{(2)}}\big)^2 \le 1$, if $|b_{12}| > 1$ then $|b_{21}| < 1$; hence at most one of $|b_{12}|$ and $|b_{21}|$ can have absolute value greater than 1.
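These conclusions can be checked numerically on arbitrary correlated data; the Python/numpy sketch below is illustrative only (the data-generating choices are assumptions, not from the exercise):

```python
import numpy as np

# Two correlated synthetic variables standing in for x^(1) and x^(2).
rng = np.random.default_rng(4)
x1 = rng.normal(size=60)
x2 = 0.5 * x1 + rng.normal(size=60)

# Slopes of the two reversed regressions, via the closed forms above.
cov = np.cov(x1, x2, ddof=1)[0, 1]
b12 = cov / np.var(x2, ddof=1)   # slope from regressing x^(1) on x^(2)
b21 = cov / np.var(x1, ddof=1)   # slope from regressing x^(2) on x^(1)
r = np.corrcoef(x1, x2)[0, 1]

assert np.isclose(b12 * b21, r ** 2)                 # product equals r^2
assert np.sign(b12) == np.sign(b21) == np.sign(cov)  # same sign as the covariance
assert min(abs(b12), abs(b21)) <= 1                  # at most one slope exceeds 1
```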