Topic 3: Linear Regression
In data science, many applications involve making predictions about the outcome y based
on a number of predictors x.
y ≈ f (x)
It is called `supervised learning' because we have data on both the outcome y and the
predictor x.
Therefore, the data can `teach' us, for a given value of the predictor x, the most likely corresponding outcome y.
Linear regression is an analytical technique used to model the relationship between several
input variables and a continuous outcome variable.
A key assumption is that the relationship between the input variables and the outcome variable is linear.
For example, in simple linear regression with only one predictor, we assume a model of the
form
y ≈ f (x) = β0 + β1 x.
Suppose we are interested in building a linear regression model that estimates an HDB unit's resale price as a function of floor area in square meters.
Suppose we have three observations. Each observation has an outcome y and an input
variable x.
yi ≈ β0 + β1 xi
Since there is only one input variable, this is an example of a simple linear model.
 i    xi    yi
 1    -1    -1
 2     3    3.5
 3     5     3
yi ≈ f (xi ) = β0 + β1 xi
 i    xi    yi     β0 + β1 xi     residual: ei = yi − (β0 + β1 xi)
 1    -1    -1     β0 + (−1)β1    −1 − (β0 + (−1)β1) = −1 − β0 + β1
 2     3    3.5    β0 + (3)β1     3.5 − (β0 + (3)β1) = 3.5 − β0 − 3β1
 3     5     3     β0 + (5)β1     3 − (β0 + (5)β1) = 3 − β0 − 5β1
We do not want the residuals to cancel each other out, so we square each of them, leading to the squared residuals.
 i    residual: ei = yi − (β0 + β1 xi)         squared residual: ei²
 1    −1 − (β0 + (−1)β1) = −1 − β0 + β1        [−1 − β0 + β1]²
 2    3.5 − (β0 + (3)β1) = 3.5 − β0 − 3β1      [3.5 − β0 − 3β1]²
 3    3 − (β0 + (5)β1) = 3 − β0 − 5β1          [3 − β0 − 5β1]²
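For illustration, the residuals and squared residuals can be evaluated in R for any trial coefficients; the trial values β0 = 0 and β1 = 1 below are arbitrary:
> x = c(-1, 3, 5); y = c(-1, 3.5, 3)
> b0 = 0; b1 = 1          # arbitrary trial coefficients
> e = y - (b0 + b1 * x)   # residuals: 0.0 0.5 -2.0
> e^2                     # squared residuals: 0.00 0.25 4.00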
To express the total magnitude of the deviations, we sum the squared residuals over all data points: RSS = e1² + e2² + e3². This quantity is called the Residual Sum of Squares, abbreviated RSS; some texts denote it SSres, the sum of squared residuals.
We now need to find the values of β0 and β1 such that RSS is minimized, where

h(β0, β1) = RSS = [−1 − β0 + β1]² + [3.5 − β0 − 3β1]² + [3 − β0 − 5β1]².

The whole process (computing each ei and ei², forming RSS, and minimizing it to obtain the values of β0 and β1) is known as the method of ordinary least squares (OLS).
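As a numerical sanity check, h can also be minimized directly in R with the general-purpose optimizer optim() (the starting values below are arbitrary):
> x = c(-1, 3, 5); y = c(-1, 3.5, 3)
> h = function(b) sum((y - (b[1] + b[2] * x))^2)  # RSS as a function of (beta0, beta1)
> optim(c(0, 0), h)$par                           # close to (0.125, 0.7321)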
To find the minimum of h(β0, β1), we first take its partial derivative w.r.t. β0 while holding β1 constant, and then its partial derivative w.r.t. β1 while holding β0 constant:

∂h(β0, β1)/∂β0 = −11 + 6β0 + 14β1,
∂h(β0, β1)/∂β1 = −53 + 14β0 + 70β1.
Setting both partial derivatives to zero and solving, we obtain the least squares estimates
β0 ≈ 0.1250
β1 ≈ 0.7321
We usually add a hat on top of a parameter to denote its estimated value, so the least squares estimates are
β̂0 = 0.1250
β̂1 = 0.7321.
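These estimates can be verified in R by rewriting the two derivative equations as 6β0 + 14β1 = 11 and 14β0 + 70β1 = 53 and solving the 2×2 linear system:
> A = matrix(c(6, 14, 14, 70), nrow = 2)  # coefficient matrix, filled column-wise
> b = c(11, 53)                           # right-hand sides
> solve(A, b)                             # 0.125 0.7321...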
In the previous slides, we had a specific example, a data set with 3 points, and carried out the OLS method by hand.
We now generalize OLS to a data set which has 2 variables X and Y with n observations
(x1 , y1 ), ..., (xn , yn ).
yi ≈ β0 + β1 xi , i = 1, ..., n.
RSS = Σᵢ₌₁ⁿ ei² = Σᵢ₌₁ⁿ [yi − (β0 + β1 xi)]².
Take the derivative of RSS w.r.t. β0 and β1, one at a time:

∂RSS/∂β0 = −2 Σᵢ₌₁ⁿ (yi − β0 − β1 xi),
∂RSS/∂β1 = −2 Σᵢ₌₁ⁿ (yi − β0 − β1 xi) xi.
The least squares estimates of β0 and β1, β̂0 and β̂1, are the solution when we set the derivatives to zero:
β̂0 + β̂1 (1/n) Σᵢ₌₁ⁿ xi − (1/n) Σᵢ₌₁ⁿ yi = 0    (1)

β̂0 (1/n) Σᵢ₌₁ⁿ xi + β̂1 (1/n) Σᵢ₌₁ⁿ xi² − (1/n) Σᵢ₌₁ⁿ yi xi = 0    (2)

Denote ȳ = (1/n) Σᵢ₌₁ⁿ yi and x̄ = (1/n) Σᵢ₌₁ⁿ xi.
From (1), we have β̂0 = ȳ − β̂1 x̄; substituting this β̂0 into (2), we obtain

β̂1 = [Σᵢ₌₁ⁿ yi xi − (Σᵢ₌₁ⁿ yi)(Σᵢ₌₁ⁿ xi)/n] / [Σᵢ₌₁ⁿ xi² − (Σᵢ₌₁ⁿ xi)²/n].
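As a check, these closed-form expressions can be evaluated directly on the three-point example (here n = 3):
> x = c(-1, 3, 5); y = c(-1, 3.5, 3)
> b1 = (sum(x * y) - sum(x) * sum(y) / 3) / (sum(x^2) - sum(x)^2 / 3)  # about 0.7321
> b0 = mean(y) - b1 * mean(x)                                          # 0.125

The built-in lm() function returns the same estimates: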
> x = c(-1, 3, 5)
> y = c(-1, 3.5, 3)
> lm(y~x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.1250 0.7321
We can now write the fitted model as

ŷ = 0.125 + 0.7321x,

matching the coefficients returned by lm() above.
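Storing the fitted model in an object also allows predictions at new x values via predict(); for example, at x = 2 the fitted value is 0.125 + 0.7321 × 2 ≈ 1.589:
> M = lm(y ~ x)                            # store the fitted model
> predict(M, newdata = data.frame(x = 2))  # about 1.589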
▶ Coefficient of determination, R².
▶ When comparing the goodness-of-fit of two models with the same response, we can use the Residual Standard Error (RSE) as a criterion.
▶ The overall F-test of model significance. Its null hypothesis (H0) states that the model is NOT significant; its alternative (H1) states that the model is significant. Equivalently: if the test has a small p-value (such as < 0.05), then the data provide strong evidence against H0; otherwise, we cannot reject H0.
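In R, the overall F statistic and its degrees of freedom can be extracted from the fitted model M defined earlier; the corresponding p-value appears in the printed output of summary(M):
> summary(M)$fstatistic  # F value, numerator df, denominator df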
R² = (TSS − RSS)/TSS = 1 − RSS/TSS,

where TSS = Σᵢ₌₁ⁿ (yi − ȳ)² is the total sum of squares.
TSS measures the total variance in the response in the given data, and can be thought of as the amount of variability inherent in the response before the regression is performed.
RSS measures the amount of variability that is left unexplained after performing the
regression.
R² measures the proportion of variability in the response Y that is explained by the fitted model. A larger R² indicates a better model fit.
Computing R² directly in R from the fitted model M:
> summary(M)$r.squared
[1] 0.822407
RSE = √(RSS/(n − 2)), where RSS = Σᵢ₌₁ⁿ (yi − ŷi)².

For the same response, one may fit many different linear models; the one with the larger RSE indicates the poorer model fit.
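For the three-point model M, RSE can be computed from the definition or read off the model summary (R reports it as the residual standard error):
> sqrt(sum(resid(M)^2) / (3 - 2))  # n = 3, so n - 2 = 1; about 1.47
> summary(M)$sigma                 # the same value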
Suppose we have n observations. Each observation has an outcome y and multiple input
variables x1 , ..., xp .
y ≈ β0 + β1 x1 + β2 x2 + ... + βp xp
or equivalently
yi ≈ β0 + β1 x1i + β2 x2i + ... + βp xpi , i = 1, ..., n.
RSS = Σᵢ₌₁ⁿ [yi − (β0 + β1 x1i + β2 x2i + ... + βp xpi)]².
The least squares estimates of β0, β1, β2, ..., βp are returned by the lm() function in R.
> lm(y~x1+x2)
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
0.9362 1.7649 -4.9560
R² can be inflated simply by adding more regressors to the model (even insignificant terms). The adjusted R² corrects for this:

R²adj = 1 − [RSS/(n − p − 1)] / [TSS/(n − 1)].
When comparing two models of the same response, the model with the larger R²adj is preferred.
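In R, the adjusted R² is reported in the model summary next to R²; for a fitted model such as M it can be extracted with:
> summary(M)$adj.r.squared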
HDB flats are sold on 99-year leases. Hence, the older the lease commencement date (the date the first owner collected the keys from HDB), the lower the resale price usually is, given that other conditions are similar.
We may therefore consider the number of years from the lease commencement date until this year as a quantitative regressor, called years_left.
Can you fit a linear model for the resale price with two regressors, floor area and years left (call this model M2)? A sketch of the fitting step is given below.
Write down the equation of model M2 and report its goodness-of-fit.
Compared with the simple model (with only floor area as the regressor), which model would you prefer, and why?
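A minimal sketch of the fitting step, assuming the data sit in a data frame hdb with columns resale_price, floor_area and years_left (all three names here are placeholders for the actual data):
> M2 = lm(resale_price ~ floor_area + years_left, data = hdb)
> summary(M2)  # coefficients for the model equation, R-squared, adjusted R-squared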