
Linear Regression



1 Introduction

2 Forming Equation for Simple Linear Regression


Manual Practice
SLR in General
OLS in R

3 Assessing the Goodness-of-fit of the Model

4 Multiple Linear Regression





Supervised Learning

In data science, many applications involve making predictions about the outcome y based
on a number of predictors x.

Often we assume models of the form

y ≈ f (x)

where f (x) is a function that maps the predictor(s) to the outcome.

One example is the linear regression model.



Supervised Learning

It is called `supervised learning' because we have data on both the outcome y and the
predictor x.

Therefore, the data can `teach' us, given a certain predictor value for x, what is the most
likely corresponding outcome y.



Linear regression

Linear regression is an analytical technique used to model the relationship between several
input variables and a continuous outcome variable.

A key assumption is that the relationship between the input variables and the outcome variable is linear.

For example, in simple linear regression with only one predictor, we assume a model of the
form
y ≈ f (x) = β0 + β1 x.



Examples

Forecasting models can be built to predict taxi demand, emergency room visits, and ambulance dispatches.



Examples

Medical: A linear regression model can be used to analyze the effect of a proposed radiation treatment on reducing tumor sizes.

Multiple input variables might include the duration of a single radiation treatment, the frequency of radiation treatment, and patient attributes such as age or weight.



Examples

Pharmaceutical industry: A linear regression model can be used to analyze the clinical efficacy of drugs.

Input variables may include age, gender and other patient characteristics such as blood pressure and blood sugar level.



Examples

Finance: Linear regression is used to model the relationships between stock market prices and other variables such as economic performance, interest rates and geopolitical risks.



Examples

Real estate: Linear regression analysis can be used to model a flat's price as a function of its floor area.

Such a model helps set or evaluate the list price of a flat on the market.

The model could be further improved by including other input variables such as the number of bathrooms, number of bedrooms, district rankings, etc.



Data on HDB Resale Flats

Data on resale HDB prices based on registration date is publicly available from https://data.gov.sg/dataset/resale-flat-prices

We have extracted a subset of all the resale records from March 2012 to December 2014 based on registration date.

Available as the data set hdbresale_reg.csv on the course website.


HDB Resale Flats

> resale = read.csv("C:/Data/hdbresale_reg.csv")


> head(resale[ ,2:7]) # 1st column indicates ID of flats
month town flat_type block street_name storey_range
1 2012-03 CENTRAL AREA 3 ROOM 640 ROWELL RD 01 TO 05
2 2012-03 CENTRAL AREA 3 ROOM 640 ROWELL RD 06 TO 10
3 2012-03 CENTRAL AREA 3 ROOM 668 CHANDER RD 01 TO 05
4 2012-03 CENTRAL AREA 3 ROOM 5 TG PAGAR PLAZA 11 TO 15
5 2012-03 CENTRAL AREA 3 ROOM 271 QUEEN ST 11 TO 15
6 2012-03 CENTRAL AREA 4 ROOM 671A KLANG LANE 01 TO 05



HDB Resale Flats

> head(resale[ ,8:11])


floor_area_sqm flat_model lease_commence_date resale_price
1 74 Model A 1984 380000
2 74 Model A 1984 388000
3 73 Model A 1984 400000
4 59 Improved 1977 460000
5 68 Improved 1979 488000
6 75 Model A 2003 495000

Suppose we are interested in building a linear regression model that estimates an HDB unit's resale price as a function of floor area in square meters.

How do we form such a function?





Simple Linear Regression (SLR)

Suppose we have three observations. Each observation has an outcome y and an input
variable x.

We are interested in the linear relationship

yi ≈ β0 + β1 xi

Since there is only one input variable, this is an example of a simple linear model.





Ordinary Least Squares for SLR

i xi yi
1 -1 -1
2 3 3.5
3 5 3

Plot of the three data points.
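To reproduce the plot in R (a minimal sketch; the plotting options are assumptions, not taken from the slides):

> x = c(-1, 3, 5)
> y = c(-1, 3.5, 3)
> plot(x, y, pch = 19, xlab = "x", ylab = "y")   # scatter plot of the three points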



Ordinary Least Squares for SLR

We are interested in the linear relationship

yi ≈ f (xi ) = β0 + β1 xi

There are many different straight lines that can be used to model x and y, as shown in the plot.



Ordinary Least Squares for SLR

Intuitively, we want the line to be as close to the data points as possible.

This closeness can be measured in terms of the vertical distance between each point and the line.

The line that is closest to the data points is chosen as the best-fitting line. Its intercept and slope are picked as the values for β0 and β1 .



Ordinary Least Squares for SLR

i xi yi β0 + β1 xi residual: ei = yi − (β0 + β1 xi )
1 -1 -1 β0 + (−1)β1 −1 − (β0 + (−1)β1 ) = −1 − β0 + β1
2 3 3.5 β0 + (3)β1 3.5 − (β0 + (3)β1 ) = 3.5 − β0 − 3β1
3 5 3 β0 + (5)β1 3 − (β0 + (5)β1 ) = 3 − β0 − 5β1



Ordinary Least Squares for SLR

The residual for each point may be positive or negative.

We do not want the residuals to cancel each other out, so we square each of them, leading to the squared residuals.

The squared residuals are in the last column.

i   residual: ei = yi − (β0 + β1 xi )        squared residual: ei²
1   −1 − (β0 + (−1)β1 ) = −1 − β0 + β1       [−1 − β0 + β1 ]²
2   3.5 − (β0 + (3)β1 ) = 3.5 − β0 − 3β1     [3.5 − β0 − 3β1 ]²
3   3 − (β0 + (5)β1 ) = 3 − β0 − 5β1         [3 − β0 − 5β1 ]²



Ordinary Least Squares for SLR

To express the total magnitude of the deviations, we sum the squared residuals over all the data points. This quantity is called the Residual Sum of Squares, abbreviated RSS; some denote it SSres , for sum of squared residuals.

For the 3 data points, we have

RSS = e1² + e2² + e3²
    = [−1 − β0 + β1 ]² + [3.5 − β0 − 3β1 ]² + [3 − β0 − 5β1 ]².
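As a hedged sketch (not part of the original slides), RSS can be coded as an R function of the candidate pair (β0 , β1 ) and evaluated at any values:

> x = c(-1, 3, 5)
> y = c(-1, 3.5, 3)
> rss = function(b0, b1) sum((y - (b0 + b1*x))^2)   # sum of squared residuals
> rss(0, 1)   # RSS for the candidate line y = x; returns 4.25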



Ordinary Least Squares for SLR

We now need to find the values of β0 and β1 such that RSS is minimized, where

RSS = [−1 − β0 + β1 ]² + [3.5 − β0 − 3β1 ]² + [3 − β0 − 5β1 ]².

The whole process (computing each residual ei , squaring and summing to get RSS, then minimizing RSS to obtain the values of β0 and β1 ) is known as the method of ordinary least squares (OLS).



Ordinary Least Squares for SLR

Consider RSS as a function of β0 and β1 . Let's call it h(β0 , β1 ).

To find the minimum of h(β0 , β1 ), we first take its derivative w.r.t. β0 while holding β1 constant, and then take its derivative w.r.t. β1 while holding β0 constant:

∂h(β0 , β1 )/∂β0 = −11 + 6β0 + 14β1 .

∂h(β0 , β1 )/∂β1 = −53 + 14β0 + 70β1 .



Ordinary Least Squares for SLR
Lastly, setting both derivatives to zero, we have a system of two equations

−11 + 6β0 + 14β1 = 0


−53 + 14β0 + 70β1 = 0

Solving this system gives the least squares estimates

β0 ≈ 0.1250
β1 ≈ 0.7321

We usually add a hat on top of a parameter to denote its estimated value, so the least squares estimates are

β̂0 = 0.1250
β̂1 = 0.7321.
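As a quick numerical check (a sketch, not in the original slides), the two equations above form a 2 × 2 linear system that R can solve directly:

> A = matrix(c(6, 14, 14, 70), nrow = 2, byrow = TRUE)   # coefficients of (beta0, beta1)
> b = c(11, 53)                                          # constants moved to the right-hand side
> solve(A, b)   # returns 0.125 and 0.7321..., the least squares estimates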





Ordinary Least Squares for SLR in General

In the previous slides, we worked through a specific example, a data set with 3 points, and practiced the OLS method by hand.

We now generalize OLS to a data set which has 2 variables X and Y with n observations
(x1 , y1 ), ..., (xn , yn ).

The simple model (straight line) has the form

yi ≈ β0 + β1 xi , i = 1, ..., n.

Then, the residuals are


ei = yi − (β0 + β1 xi ), i = 1, ..., n



Ordinary Least Squares for SLR in General

The residual sum of squares is then

RSS = Σᵢ₌₁ⁿ ei² = Σᵢ₌₁ⁿ [ yi − (β0 + β1 xi ) ]².

Taking the derivative of RSS w.r.t. β0 and β1 , one at a time:

∂RSS/∂β0 = −2 Σᵢ₌₁ⁿ (yi − β0 − β1 xi )

∂RSS/∂β1 = −2 Σᵢ₌₁ⁿ (yi − β0 − β1 xi ) xi



Ordinary Least Squares for SLR in General

The least squares estimates of β0 and β1 , denoted β̂0 and β̂1 , are the solution obtained by setting the derivatives to zero:

β̂0 + β̂1 (1/n) Σᵢ₌₁ⁿ xi − (1/n) Σᵢ₌₁ⁿ yi = 0        (1)

β̂0 (1/n) Σᵢ₌₁ⁿ xi + β̂1 (1/n) Σᵢ₌₁ⁿ xi² − (1/n) Σᵢ₌₁ⁿ yi xi = 0        (2)

(1) and (2) are called the least-squares normal equations.



Ordinary Least Squares for SLR in General

Denote ȳ = (1/n) Σᵢ₌₁ⁿ yi and x̄ = (1/n) Σᵢ₌₁ⁿ xi .

From (1), we have β̂0 = ȳ − β̂1 x̄. Substituting this β̂0 into (2), we have

β̂1 = [ Σᵢ₌₁ⁿ yi xi − (Σᵢ₌₁ⁿ yi )(Σᵢ₌₁ⁿ xi )/n ] / [ Σᵢ₌₁ⁿ xi² − (Σᵢ₌₁ⁿ xi )²/n ]
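As a hedged sketch, the closed-form formulas for β̂1 and β̂0 can be applied directly in R to the three-point example and checked against lm():

> x = c(-1, 3, 5)
> y = c(-1, 3.5, 3)
> n = length(x)
> b1 = (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
> b0 = mean(y) - b1*mean(x)
> c(b0, b1)   # 0.125 and 0.7321..., matching the manual solution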





Ordinary Least Squares in R
We can check that the least squares estimates we computed manually are equivalent to
those returned by the lm() function in R:

> x = c( -1, 3, 5)
> y = c( -1, 3.5 , 3)
> lm(y~x)
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
0.1250 0.7321
We can now write the fitted model as
ŷ = 0.125 + 0.7321x



Ordinary Least Squares in R
With the fitted model, we can now obtain the fitted (predicted) outcome value ŷ given any value of the predictor x.

For example, if x = 2, then the fitted value for the outcome is

ŷ = 0.125 + 0.7321 × 2 = 1.589.

In R, we use function predict().

> M = lm(y~x) # M = name of the fitted model


> new = data.frame(x = 2) # create dataframe of new point
> predict(M, newdata = new)
1
1.589286



HDB Resale Flats Data Set
We can now answer the earlier question of building a linear regression model that estimates an HDB unit's resale price as a function of floor area in square meters.

The R code is

> price = resale$resale_price


> area = resale$floor_area_sqm
> lm(price~area)$coef # coefficients of the model
(Intercept) area
115145.730 3117.212

The fitted model is then

ŷ = 115145.73 + 3117.21 × area

where y is the resale price of a flat, in SGD.
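As a hedged usage example (the 70-sqm input and the model name M1 are illustrative, not from the slides), the fitted model can be used for prediction just as before:

> M1 = lm(price ~ area)
> predict(M1, newdata = data.frame(area = 70))
# about 115145.73 + 3117.21*70 ≈ 333350 SGD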





Goodness-of-fit of a Model

The goodness-of-fit of a model can be assessed by various measures. In this course, we consider only two basic measures:

▶ The significance of the model, via a test (the F-test).

▶ The coefficient of determination, R².

▶ When comparing the goodness-of-fit of two models with the same response, we can also use the Residual Standard Error (RSE) as a criterion.



F-test of a Linear Model

To test whether the whole model is significant or not, we use the F-test.

Its null hypothesis (H0 ) states that the model is NOT significant. Its alternative (H1 ) states that the model is significant. Equivalently:

H0 : all the coefficients, except the intercept, are zero

versus

H1 : at least one of the coefficients (except the intercept) is NON-zero.

If the test has a small p-value (such as < 0.05), then the data provide strong evidence against H0 . Otherwise, we cannot reject H0 .

F-test of a Linear Model

(The original slides showed R output here, illustrating the F-test for a fitted linear model.)
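As a hedged sketch of how the F-test is read in R (assuming M is a fitted lm object, such as the three-point model fitted earlier): the F-statistic and its p-value appear in the last line of summary(M), and can also be extracted directly:

> f = summary(M)$fstatistic   # F value, numerator df, denominator df
> pf(f[1], f[2], f[3], lower.tail = FALSE)   # p-value of the F-test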


Coefficient of Determination R²

The quantity R² is defined as

R² = (TSS − RSS)/TSS = 1 − RSS/TSS

where TSS = Σᵢ₌₁ⁿ (yi − ȳ)² is the total sum of squares.

TSS measures the total variance in the response in the given data, and can be thought of as the amount of variability inherent in the response before the regression is performed.

RSS measures the amount of variability that is left unexplained after performing the regression.



Coefficient of Determination R²

R² measures the proportion of variability in the response Y that is explained by the fitted model. A larger R² indicates a better model fit.

Deriving R² directly:

> RSS = sum((y - M$fitted)^2)    # residual sum of squares of the fitted model M
> TSS = var(y)*(length(y) - 1)   # total sum of squares
> R2 = 1 - RSS/TSS; R2
[1] 0.822407

Or get it from the model summary output, labelled Multiple R-squared, or as below.

> summary(M)$r.squared
[1] 0.822407



Residual Standard Error (RSE)

RSE in simple linear regression is defined as

RSE = √( RSS/(n − 2) ),  where RSS = Σᵢ₌₁ⁿ (yi − ŷi )².

For the same response, one may fit many different linear models; the one with the larger RSE indicates the poorer model fit.
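As a hedged sketch (assuming M is the simple linear model fitted earlier), RSE can be computed by the formula or read from the summary output:

> RSS = sum(residuals(M)^2)
> sqrt(RSS/(length(y) - 2))   # RSE by the formula above
> summary(M)$sigma            # reported as Residual standard error in summary(M)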



Residual Standard Error (RSE)
RSE can be obtained from the R summary output, on the line labelled Residual standard error.





Settings

Suppose we have n observations. Each observation has an outcome y and multiple input
variables x1 , ..., xp .

We are interested in the linear relationship

y ≈ β0 + β1 x1 + β2 x2 + ... + βp xp

or equivalently
yi ≈ β0 + β1 x1i + β2 x2i + ... + βp xpi , i = 1, ..., n.



Settings

The OLS method minimizes the RSS, given by

RSS = Σᵢ₌₁ⁿ [ yi − (β0 + β1 x1i + β2 x2i + ... + βp xpi ) ]².

We rely on R to derive the minimizers, β̂0 , ..., β̂p .



MLR in R

The least squares estimates of β0 , β1 , β2 , ..., βp are returned by the lm() function in R.

Consider a simulated data set with x1 , x2 and response y, where y is created as (1 + 2x1 − 5x2 ) with some noise added.

> set.seed(250)
> x1 = rnorm(100); x2 = rnorm(100)
> y = 1 + 2*x1 - 5*x2 + rnorm(100)

We now fit a linear model, y ∼ x1 + x2 .



MLR in R

The fitted model can be obtained from the R code below.

> lm(y~x1+x2)
Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept) x1 x2
0.9362 1.7649 -4.9560



Adjusted R² in MLR

A multiple linear model has an R² which is defined exactly as in simple linear regression, and its meaning remains the same.

R² can be inflated simply by adding more regressors to the model (even insignificant terms).

We therefore use the adjusted R², denoted R²adj , which penalizes the model for adding regressors of too little help to the model:

R²adj = 1 − [ RSS/(n − p − 1) ] / [ TSS/(n − 1) ].

When comparing two models of the same response, the model with the larger R²adj is preferred.
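As a hedged sketch using the simulated data fitted earlier (the object name fit is illustrative), both quantities are available from the model summary:

> fit = lm(y ~ x1 + x2)
> summary(fit)$r.squared       # plain R-squared
> summary(fit)$adj.r.squared   # adjusted R-squared, penalized for extra regressors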



MLR for HDB Resale Flats

HDB flats are sold on 99-year leases. Hence, the older the lease commence date (the date the first owner took the keys from HDB), the lower the resale price usually is, given that other conditions are similar.

Hence, we may consider the number of years from the lease commence date until this year as a quantitative regressor, called years_left.

> years_left = 2022 - resale$lease_commence_date

Try to fit a linear model for the resale price with two regressors, floor area and years left (call it model M2); a sketch follows below.
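A minimal sketch of model M2 (no fitted numbers are quoted here, since they are not shown in the source):

> years_left = 2022 - resale$lease_commence_date
> M2 = lm(price ~ area + years_left)
> summary(M2)   # coefficients, R-squared, adjusted R-squared, RSE and F-test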



MLR for HDB Resale Flats

Write down the equation of model M2, and report its goodness-of-fit.

Compared to the simple model (with only floor area as the regressor), which model would you prefer, and why?
