Lecture 2

Introduction to Linear Regression

September 7, 2017

Regression model
• Relation between variables in which changes in some variables may "explain," or possibly "cause," changes in other variables.
• Explanatory variables are termed the independent variables, and the variables to be explained are termed the dependent variables.
• A regression model estimates the nature of the relationship between the independent and dependent variables:
  - the change in the dependent variable that results from changes in the independent variables;
  - the strength of the relationship;
  - the statistical significance of the relationship.

Empirical problem: Class size and educational output

• Policy question: What is the effect of reducing class size by one student per class? By 8 students per class?
• What is the right output (performance) measure?
  - parent satisfaction
  - student personal development
  - future adult welfare
  - future adult earnings
  - performance on standardized tests
What do data say about class sizes and test scores?

The California Test Score Data Set

Variables:
• 5th-grade test scores (Stanford-9 achievement test, combined math and reading), district average
• Student–teacher ratio (STR) = number of students in the district divided by number of full-time-equivalent teachers
An initial look at the California test score data:

[Figure omitted]

Do districts with smaller classes (lower STR) have higher test scores?

[Figure omitted]
The class size/test score policy question:
• What is the effect on test scores of reducing STR by one student per class?
• Object of policy interest: $\dfrac{\Delta \text{Test score}}{\Delta STR}$
• This is the slope of the line relating test score and STR.
This suggests that we want to draw a line through the
Test Score v. STR scatterplot – but how?

Notation and Terminology

The population regression line:
$$\text{Test Score} = \beta_0 + \beta_1 STR$$

$\beta_1$ = slope of the population regression line
$= \dfrac{\Delta \text{Test score}}{\Delta STR}$ = change in test score for a unit change in STR

• Why are $\beta_0$ and $\beta_1$ "population" parameters?
• We would like to know the population value of $\beta_1$.
• We don't know $\beta_1$, so we must estimate it using data.
How can we estimate $\beta_0$ and $\beta_1$ from data?

Recall that $\bar{Y}$ was the least squares estimator of $\mu_Y$: $\bar{Y}$ solves
$$\min_m \sum_{i=1}^n (Y_i - m)^2$$

By analogy, we will focus on the least squares ("ordinary least squares" or "OLS") estimator of the unknown parameters $\beta_0$ and $\beta_1$, which solves
$$\min_{b_0, b_1} \sum_{i=1}^n [Y_i - (b_0 + b_1 X_i)]^2$$
The OLS estimator solves:
$$\min_{b_0, b_1} \sum_{i=1}^n [Y_i - (b_0 + b_1 X_i)]^2$$

• The OLS estimator minimizes the average squared difference between the actual values of $Y_i$ and the predictions (predicted values) based on the estimated line.
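To make the OLS objective concrete, here is a minimal sketch in Python (assuming numpy and scipy are available; the data are simulated purely for illustration) that minimizes the sum of squared residuals numerically and checks the result against the closed-form OLS formulas:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(14, 26, 100)                  # e.g., student-teacher ratios
y = 700 - 2.0 * x + rng.normal(0, 10, 100)    # test scores with noise

def sse(b):
    """Sum of squared residuals for candidate intercept b[0] and slope b[1]."""
    return np.sum((y - (b[0] + b[1] * x)) ** 2)

# Numerically minimize the OLS objective
res = minimize(sse, x0=[0.0, 0.0])
b0_num, b1_num = res.x

# Closed-form OLS for comparison
b1_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ols = y.mean() - b1_ols * x.mean()

print(b0_num, b1_num)   # numerical minimizer
print(b0_ols, b1_ols)   # closed form; the two should agree closely
```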
Why use OLS, rather than some other estimator?
• OLS is a generalization of the sample average: if the "line" is just an intercept (no X), then the OLS estimator is just the sample average of $Y_1, \dots, Y_n$, that is, $\bar{Y}$.
• Like $\bar{Y}$, the OLS estimator has some desirable properties: under certain assumptions, it is unbiased (that is, $E(\hat{\beta}_1) = \beta_1$).
Application to the California Test Score – Class Size data

Estimated slope: $\hat{\beta}_1 = -2.28$
Estimated intercept: $\hat{\beta}_0 = 698.9$
Estimated regression line: $\widehat{\text{TestScore}} = 698.9 - 2.28 \, STR$
Interpretation of the estimated slope and intercept

$\widehat{\text{TestScore}} = 698.9 - 2.28 \, STR$

• Districts with one more student per teacher on average have test scores that are 2.28 points lower.
• That is, $\dfrac{\Delta \text{Test score}}{\Delta STR} = -2.28$.
Predicted values & residuals:

One of the districts in the data set is Antelope, CA, for which STR = 19.33 and Test Score = 657.8.

Predicted value: $\hat{Y}_{\text{Antelope}} = 698.9 - 2.28 \times 19.33 = 654.8$
Residual: $\hat{u}_{\text{Antelope}} = 657.8 - 654.8 = 3.0$
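A short sketch of the same arithmetic in Python, using the estimated coefficients above:

```python
# Estimated coefficients from the California regression above
b0_hat, b1_hat = 698.9, -2.28

str_antelope, score_antelope = 19.33, 657.8

predicted = b0_hat + b1_hat * str_antelope   # fitted value on the OLS line
residual = score_antelope - predicted        # actual minus predicted

print(round(predicted, 1), round(residual, 1))   # 654.8 3.0
```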
The OLS regression line is an estimate, computed using our sample of data; a different sample would have given a different value of $\hat{\beta}_1$.

How can we:
• quantify the sampling uncertainty associated with $\hat{\beta}_1$?
• use $\hat{\beta}_1$ to test hypotheses such as $\beta_1 = 0$?
• construct a confidence interval for $\beta_1$?

Like estimation of the mean, we proceed in four steps:
1. The probability framework for linear regression
2. Estimation
3. Hypothesis testing
4. Confidence intervals
1. Probability Framework for Linear Regression

The Population Linear Regression Model:
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \quad i = 1, \dots, n$$

• X is the independent variable (regressor)
• Y is the dependent variable
• $\beta_0$ = intercept
• $\beta_1$ = slope
• $u_i$ = "error term"
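As a minimal sketch of what the model asserts (numpy assumed; the parameter values are invented for illustration), one can simulate data that satisfy it: draw X, draw the error u, and build Y from the population line plus the error:

```python
import numpy as np

rng = np.random.default_rng(42)

beta0, beta1 = 700.0, -2.0    # hypothetical population parameters
n = 420

x = rng.uniform(14, 26, n)            # regressor, e.g., STR
u = rng.normal(0.0, 15.0, n)          # error term: omitted factors
y = beta0 + beta1 * x + u             # the population linear regression model

# Each (x[i], y[i]) pair is one observation; u[i] collects everything
# that influences y[i] other than x[i].
```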
• The error term consists of omitted factors or, possibly, measurement error in Y. In general, these omitted factors are variables other than X that influence Y.
Example: The population regression line and the error term

[Figure omitted]
What are some of the omitted factors in this example?

Data and sampling

The population objects ("parameters") $\beta_0$ and $\beta_1$ are unknown; to draw inferences about these unknown parameters we must collect relevant data.

Simple random sampling: choose n entities at random from the population of interest, and observe (record) X and Y for each entity.
Simple random sampling implies that {(Xi, Yi)}, i = 1,…,
n, are independently and identically distributed (i.i.d.).
(Note: (Xi, Yi) are distributed independently of (Xj, Yj) for
different observations i and j.)

The Least Squares Assumptions

1. The conditional distribution of u given X has mean zero, that is, E(u|X = x) = 0.
2. $(X_i, Y_i)$, i = 1,…, n, are i.i.d.
3. X and u have finite fourth moments, that is, $E(X^4) < \infty$ and $E(u^4) < \infty$.
Least squares assumption #1: E(u|X = x) = 0.
For any given value of X, the mean of u is zero.
Example: Assumption #1 and the class size example

$$\text{Test Score}_i = \beta_0 + \beta_1 STR_i + u_i, \quad u_i = \text{other factors}$$

"Other factors:"
• parental involvement
• outside learning opportunities (extra math class, …)
• home environment conducive to reading
• family income (a useful proxy for many such factors)

So E(u|X = x) = 0 means E(Family Income|STR) = constant (which implies that family income and STR are uncorrelated).
Least squares assumption #2: $(X_i, Y_i)$, i = 1,…, n are i.i.d.

This arises automatically if the entity (individual, district) is sampled by simple random sampling: the entity is selected, and then, for that entity, X and Y are observed (recorded).

The main place we will encounter non-i.i.d. sampling is when data are recorded over time ("time series data") – this will introduce some extra complications.
Least squares assumption #3: $E(X^4) < \infty$ and $E(u^4) < \infty$

Because $Y_i = \beta_0 + \beta_1 X_i + u_i$, assumption #3 can equivalently be stated as $E(X^4) < \infty$ and $E(Y^4) < \infty$.

Assumption #3 is generally plausible: a finite domain of the data implies finite fourth moments.
1. The probability framework for linear regression
2. Estimation: the sampling distribution of $\hat{\beta}_1$
3. Hypothesis testing
4. Confidence intervals

Like $\bar{Y}$, $\hat{\beta}_1$ has a sampling distribution.
• What is $E(\hat{\beta}_1)$?
• What is $\text{var}(\hat{\beta}_1)$? (a measure of sampling uncertainty)
• What is its sampling distribution in small samples?
• What is its sampling distribution in large samples?
The sampling distribution of $\hat{\beta}_1$

$$Y_i = \beta_0 + \beta_1 X_i + u_i \quad \text{and} \quad \bar{Y} = \beta_0 + \beta_1 \bar{X} + \bar{u},$$
so
$$Y_i - \bar{Y} = \beta_1 (X_i - \bar{X}) + (u_i - \bar{u}).$$
Thus,
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}$$
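A direct sketch of this formula in Python (numpy assumed; simulated data stand in for a real sample):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(14, 26, 420)
y = 700 - 2.0 * x + rng.normal(0, 15, 420)   # simulated, for illustration

# OLS slope: cross-deviations over squared deviations in X
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# OLS intercept: puts the line through the point of means
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)
```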
E( ˆ1 ) = 1

That is, ˆ1 is an unbiased estimator of 1.

4-29
(1) $E(\hat{\beta}_1) = \beta_1$, and $\hat{\beta}_1 \xrightarrow{p} \beta_1$.
(2) When n is large, the sampling distribution of $\hat{\beta}_1$ is well approximated by a normal distribution.
Large-n approximation to the distribution of $\hat{\beta}_1$:

Recall the summary of the sampling distribution of $\bar{Y}$: for $(Y_1, \dots, Y_n)$ i.i.d. with $0 < \sigma_Y^2 < \infty$,
• The exact (finite-sample) sampling distribution of $\bar{Y}$ has mean $\mu_Y$ ("$\bar{Y}$ is an unbiased estimator of $\mu_Y$") and variance $\sigma_Y^2 / n$.
• $\bar{Y} \xrightarrow{p} \mu_Y$ (law of large numbers)
• $\dfrac{\bar{Y} - E(\bar{Y})}{\sqrt{\text{var}(\bar{Y})}}$ is approximately distributed N(0, 1) (central limit theorem)
Parallel conclusions hold for the OLS estimator $\hat{\beta}_1$:

Under the three Least Squares Assumptions,
• The exact sampling distribution of $\hat{\beta}_1$ has mean $\beta_1$ ("$\hat{\beta}_1$ is an unbiased estimator of $\beta_1$"), and $\text{var}(\hat{\beta}_1)$ is inversely proportional to n.
• $\hat{\beta}_1 \xrightarrow{p} \beta_1$ (law of large numbers)
• $\dfrac{\hat{\beta}_1 - E(\hat{\beta}_1)}{\sqrt{\text{var}(\hat{\beta}_1)}}$ is approximately distributed N(0, 1)
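A small Monte Carlo sketch (numpy assumed; all parameter values invented) illustrating both conclusions: across repeated samples, the average of the estimates is close to $\beta_1$, and roughly 95% of standardized estimates fall within ±1.96, as for a N(0, 1) variable:

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, n, reps = 700.0, -2.0, 100, 5000

estimates = np.empty(reps)
for r in range(reps):
    x = rng.uniform(14, 26, n)
    y = beta0 + beta1 * x + rng.normal(0, 15, n)
    estimates[r] = (np.sum((x - x.mean()) * (y - y.mean()))
                    / np.sum((x - x.mean()) ** 2))

print(estimates.mean())                # close to beta1 = -2.0 (unbiasedness)
z = (estimates - estimates.mean()) / estimates.std()
print(np.mean(np.abs(z) < 1.96))       # close to 0.95, as for N(0, 1)
```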
1. The probability framework for linear regression
2. Estimation
3. Hypothesis testing
4. Confidence intervals

Suppose a skeptic suggests that reducing the number of students in a class has no effect on learning or, specifically, on test scores. The skeptic thus asserts the hypothesis
$$H_0: \beta_1 = 0$$
We wish to test this hypothesis using data – that is, reach a tentative conclusion about whether it is correct or incorrect.

Null hypothesis and two-sided alternative:
$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0$$
or, more generally,
$$H_0: \beta_1 = \beta_{1,0} \quad \text{vs.} \quad H_1: \beta_1 \neq \beta_{1,0}$$
where $\beta_{1,0}$ is the hypothesized value under the null.

Null hypothesis and one-sided alternative:
$$H_0: \beta_1 = \beta_{1,0} \quad \text{vs.} \quad H_1: \beta_1 < \beta_{1,0}$$
An effect could "go either way," so it is standard to focus on two-sided alternatives.

Recall hypothesis testing for the population mean using $\bar{Y}$:
$$t = \frac{\bar{Y} - \mu_{Y,0}}{s_Y / \sqrt{n}}$$
then reject the null hypothesis (at the 5% significance level) if $|t| > 1.96$.
Applied to a hypothesis about $\beta_1$:
$$t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$$
so
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$
where $\beta_{1,0}$ is the value of $\beta_1$ hypothesized under the null (for example, if the null value is zero, then $\beta_{1,0} = 0$).

What is $SE(\hat{\beta}_1)$?
$SE(\hat{\beta}_1)$ is the square root of an estimator of the variance of the sampling distribution of $\hat{\beta}_1$.
The calculation of the t-statistic:
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\hat{\sigma}^2_{\hat{\beta}_1}}}$$

• Reject at the 5% significance level if $|t| > 1.96$.
• The p-value is $p = \Pr[|t| > |t^{act}|]$, the probability in the tails of the standard normal distribution outside $|t^{act}|$.
• Both of the previous statements are based on the large-n approximation; typically n = 50 is large enough for the approximation to be excellent.
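A minimal sketch of this computation in Python (scipy assumed), using the California estimates reported in the example below as inputs:

```python
from scipy.stats import norm

beta1_hat = -2.28    # estimated slope (California data, example below)
se_beta1 = 0.52      # its standard error
beta1_null = 0.0     # hypothesized value under H0

t = (beta1_hat - beta1_null) / se_beta1    # t-statistic
p_value = 2 * norm.sf(abs(t))              # two-sided p-value from N(0, 1)

print(round(t, 2), p_value)   # t = -4.38; p-value far below 0.01
```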
Example: Test Scores and STR, California data

Estimated regression line: $\widehat{\text{TestScore}} = 698.9 - 2.28 \, STR$

Regression software reports the standard errors:
$SE(\hat{\beta}_0) = 10.4$, $SE(\hat{\beta}_1) = 0.52$

t-statistic testing $\beta_{1,0} = 0$:
$$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} = \frac{-2.28 - 0}{0.52} = -4.38$$

• The 1% two-sided critical value is 2.58, so we reject the null at the 1% significance level.
• Alternatively, we can compute the p-value…
1. The probability framework for linear regression
2. Estimation
3. Hypothesis testing
4. Confidence intervals

In general, if the sampling distribution of an estimator is normal for large n, then a 95% confidence interval can be constructed as estimator ± 1.96 × standard error.

So a 95% confidence interval for $\beta_1$ is
$$\{\hat{\beta}_1 \pm 1.96 \, SE(\hat{\beta}_1)\}$$
Example: Test Scores and STR, California data

Estimated regression line: $\widehat{\text{TestScore}} = 698.9 - 2.28 \, STR$
$SE(\hat{\beta}_0) = 10.4$, $SE(\hat{\beta}_1) = 0.52$

95% confidence interval for $\beta_1$:
$$\{\hat{\beta}_1 \pm 1.96 \, SE(\hat{\beta}_1)\} = \{-2.28 \pm 1.96 \times 0.52\} = (-3.30, -1.26)$$

Equivalent statements:
• The 95% confidence interval does not include zero;
• The hypothesis $\beta_1 = 0$ is rejected at the 5% level.
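The interval itself is one line of arithmetic; a sketch in Python:

```python
beta1_hat, se_beta1 = -2.28, 0.52

ci_low = beta1_hat - 1.96 * se_beta1    # lower endpoint
ci_high = beta1_hat + 1.96 * se_beta1   # upper endpoint

print(round(ci_low, 2), round(ci_high, 2))   # approximately (-3.30, -1.26)
```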
A convention for reporting estimated regressions: put standard errors in parentheses below the estimates.

$$\widehat{\text{TestScore}} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28}\,STR$$

This expression means that:
• The estimated regression line is $\widehat{\text{TestScore}} = 698.9 - 2.28 \, STR$.
• The standard error of $\hat{\beta}_0$ is 10.4.
• The standard error of $\hat{\beta}_1$ is 0.52.
Regression when X is Binary

Sometimes a regressor is binary:
• X = 1 if female, = 0 if male
• X = 1 if treated (experimental drug), = 0 if not
• X = 1 if small class size, = 0 if not

So far, $\beta_1$ has been called a "slope," but that doesn't make much sense if X is binary.
How do we interpret regression with a binary regressor?

$Y_i = \beta_0 + \beta_1 X_i + u_i$, where X is binary ($X_i$ = 0 or 1):

• When $X_i = 0$: $Y_i = \beta_0 + u_i$
• When $X_i = 1$: $Y_i = \beta_0 + \beta_1 + u_i$

thus:
• When $X_i = 0$, the mean of $Y_i$ is $\beta_0$
• When $X_i = 1$, the mean of $Y_i$ is $\beta_0 + \beta_1$

that is:
• $E(Y_i \mid X_i = 0) = \beta_0$
• $E(Y_i \mid X_i = 1) = \beta_0 + \beta_1$

so:
$$\beta_1 = E(Y_i \mid X_i = 1) - E(Y_i \mid X_i = 0) = \text{population difference in group means}$$

Example: TestScore and STR, California data

Let
$$D_i = \begin{cases} 1 & \text{if } STR_i < 20 \\ 0 & \text{if } STR_i \geq 20 \end{cases}$$

The OLS estimate of the regression line relating TestScore to D (with standard errors in parentheses) is:

$$\widehat{\text{TestScore}} = \underset{(1.3)}{650.0} + \underset{(1.8)}{7.4}\,D$$

Difference in means between groups = 7.4; SE = 1.8, t ≈ 4.0
Compare the regression results with the group means, computed directly:

Class Size          Average score (Ȳ)   Std. dev. (s_Y)   N
Small (STR < 20)    657.4               19.4              238
Large (STR ≥ 20)    650.0               17.9              182

Estimation: $\bar{Y}_{small} - \bar{Y}_{large} = 657.4 - 650.0 = 7.4$

Test $\mu_{small} - \mu_{large} = 0$:
$$t = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{SE(\bar{Y}_{small} - \bar{Y}_{large})} = \frac{7.4}{1.83} \approx 4.0$$

95% confidence interval = $\{7.4 \pm 1.96 \times 1.83\} = (3.8, 11.0)$

This is the same as from the regression:
$$\widehat{\text{TestScore}} = \underset{(1.3)}{650.0} + \underset{(1.8)}{7.4}\,D$$
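A sketch (numpy assumed; simulated data standing in for the California set) verifying that the OLS slope on a 0/1 dummy reproduces the difference in group means, and the intercept reproduces the D = 0 group mean:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 420
str_ratio = rng.uniform(14, 26, n)
score = 700 - 2.0 * str_ratio + rng.normal(0, 18, n)   # simulated scores

d = (str_ratio < 20).astype(float)    # D = 1 for small classes, 0 otherwise

# OLS of score on d
slope = np.sum((d - d.mean()) * (score - score.mean())) / np.sum((d - d.mean()) ** 2)
intercept = score.mean() - slope * d.mean()

diff_in_means = score[d == 1].mean() - score[d == 0].mean()

print(slope, diff_in_means)              # identical up to floating point
print(intercept, score[d == 0].mean())   # intercept = mean of the D = 0 group
```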
Summary: regression when $X_i$ is binary (0/1)

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

• $\beta_0$ = mean of Y given that X = 0
• $\beta_0 + \beta_1$ = mean of Y given that X = 1
• $\beta_1$ = difference in group means, X = 1 minus X = 0
• $SE(\hat{\beta}_1)$ has the usual interpretation
• t-statistics and confidence intervals are constructed as usual
• This is another way to do difference-in-means analysis
Other Regression Statistics

A natural question is how well the regression line "fits," or explains, the data. There are two regression statistics that provide complementary measures of the quality of fit:
• The regression R² measures the fraction of the variance of Y that is explained by X; it is unitless and ranges between zero (no fit) and one (perfect fit).
• The standard error of the regression (SER) measures the fit – the typical size of a regression residual – in the units of Y.
The R²

Write $Y_i$ as the sum of the OLS prediction and the OLS residual:
$$Y_i = \hat{Y}_i + \hat{u}_i$$

The R² is the fraction of the sample variance of $Y_i$ "explained" by the regression, that is, by $\hat{Y}_i$:
$$R^2 = \frac{ESS}{TSS}, \quad \text{where } ESS = \sum_{i=1}^n (\hat{Y}_i - \bar{\hat{Y}})^2 \text{ and } TSS = \sum_{i=1}^n (Y_i - \bar{Y})^2.$$
The R²:
• R² = 0 means ESS = 0, so X explains none of the variation of Y.
• R² = 1 means ESS = TSS, so $Y_i = \hat{Y}_i$ for every observation and X explains all of the variation of Y.
• 0 ≤ R² ≤ 1.
• For regression with a single regressor (the case here), R² is the square of the correlation coefficient between X and Y.
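A sketch (numpy assumed; simulated data) computing R² from its definition and checking the single-regressor identity R² = corr(X, Y)²:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(14, 26, 420)
y = 700 - 2.0 * x + rng.normal(0, 18, 420)   # simulated, for illustration

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_hat = intercept + slope * x                # OLS predictions

ess = np.sum((y_hat - y_hat.mean()) ** 2)    # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)            # total sum of squares

r_squared = ess / tss
print(r_squared, np.corrcoef(x, y)[0, 1] ** 2)   # equal with one regressor
```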
The Standard Error of the Regression (SER)

The standard error of the regression is (almost) the sample standard deviation of the OLS residuals:
$$SER = \sqrt{\frac{1}{n-2} \sum_{i=1}^n (\hat{u}_i - \bar{\hat{u}})^2} = \sqrt{\frac{1}{n-2} \sum_{i=1}^n \hat{u}_i^2}$$
(the second equality holds because the OLS residuals have mean zero, $\bar{\hat{u}} = 0$).
$$SER = \sqrt{\frac{1}{n-2} \sum_{i=1}^n \hat{u}_i^2}$$

The SER:
• has the units of u, which are the units of Y;
• measures the spread of the distribution of u;
• measures the average "size" of the OLS residual (the average "mistake" made by the OLS regression line).

The root mean squared error (RMSE) is closely related to the SER:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n \hat{u}_i^2}$$
This measures the same thing as the SER; the only difference is the divisor (n instead of n − 2).
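A sketch (numpy assumed; the same kind of simulated fit as above) computing both measures:

```python
import numpy as np

# Simulated data and OLS fit, as in the R-squared sketch above
rng = np.random.default_rng(5)
x = rng.uniform(14, 26, 420)
y = 700 - 2.0 * x + rng.normal(0, 18, 420)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
u_hat = y - (intercept + slope * x)           # OLS residuals

n = len(y)
ser = np.sqrt(np.sum(u_hat ** 2) / (n - 2))   # degrees-of-freedom correction
rmse = np.sqrt(np.sum(u_hat ** 2) / n)        # same quantity, divisor n

print(ser, rmse)   # nearly identical when n is large
```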
$$\widehat{\text{TestScore}} = \underset{(10.4)}{698.9} - \underset{(0.52)}{2.28}\,STR, \quad R^2 = .05, \; SER = 18.6$$

The slope coefficient is statistically significant and large in a policy sense, even though STR explains only a small fraction of the variation in test scores.

Heteroskedasticity and Homoskedasticity

• What do these two terms mean?
• Consequences of homoskedasticity
• Implications for computing standard errors

What do these two terms mean?
If var(u|X = x) is constant – that is, if the variance of the conditional distribution of u given X does not depend on X – then u is said to be homoskedastic. Otherwise, u is said to be heteroskedastic.
Homoskedasticity in a picture:
• E(u|X = x) = 0 (u satisfies Least Squares Assumption #1)
• The variance of u does not change with (depend on) x

Heteroskedasticity in a picture:
• E(u|X = x) = 0 (u satisfies Least Squares Assumption #1)
• The variance of u depends on x – so u is heteroskedastic.
A real-world example of heteroskedasticity: average hourly earnings vs. years of education (data source: 1999 Current Population Survey).

[Figure omitted: scatterplot of average hourly earnings (vertical axis, 0–60) against years of education (horizontal axis, 5–20), with the fitted OLS regression line.]
Is heteroskedasticity present in the class size data?

Hard to say… the scatter looks nearly homoskedastic, but the spread might be tighter for large values of STR.
So far we have (without saying so) allowed u to be heteroskedastic.

Recall the three least squares assumptions:
1. The conditional distribution of u given X has mean zero, that is, E(u|X = x) = 0.
2. $(X_i, Y_i)$, i = 1,…, n, are i.i.d.
3. X and u have finite fourth moments.

Heteroskedasticity and homoskedasticity concern var(u|X = x). Because we have not explicitly assumed homoskedastic errors, we have implicitly allowed for heteroskedasticity.
What if the errors are in fact homoskedastic?
• OLS is the estimator with the lowest variance among all estimators that are linear functions of $(Y_1, \dots, Y_n)$ (the Gauss–Markov theorem).
Homoskedasticity-only standard errors are the default setting in regression software – sometimes the only setting (e.g., Excel). To get the general "heteroskedasticity-robust" standard errors you must override the default.

If you don't override the default and there is in fact heteroskedasticity, you will get the wrong standard errors (and wrong t-statistics and confidence intervals).
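As one concrete illustration, here is a minimal sketch using Python's statsmodels (assuming it is installed; the heteroskedastic data are simulated): the cov_type="HC1" option requests heteroskedasticity-robust standard errors in place of the homoskedasticity-only default.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
x = rng.uniform(5, 20, 500)
y = 5 + 1.5 * x + rng.normal(0, 0.4 * x)       # error spread grows with x

X = sm.add_constant(x)                         # add the intercept column

default_fit = sm.OLS(y, X).fit()               # homoskedasticity-only SEs
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")  # heteroskedasticity-robust SEs

print(default_fit.bse)   # standard errors under the default
print(robust_fit.bse)    # robust standard errors; can differ noticeably
```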
The critical points:
• If the errors are homoskedastic and you use the heteroskedasticity-robust formula for standard errors, you are OK.
• If the errors are heteroskedastic and you use the homoskedasticity-only formula for standard errors, the standard errors are wrong.
• The two formulas coincide (when n is large) in the special case of homoskedasticity.
• The bottom line: you should always use the heteroskedasticity-robust formulas – these are conventionally called heteroskedasticity-robust standard errors.
