

Unit 2

Class 4
Today’s Agenda
Last class:
• Covariance
• Correlation
• Simple linear regression
• Motivating example 

Today:
• Interpolation and extrapolation
• Influential observations
• Confounding

Example: Trees Dataset
This data set provides measurements of the diameter, height and
volume of timber in 31 felled black cherry trees. Note that the
diameter (in inches) is erroneously labelled Girth in the data. It is
measured at 4ft 6in above the ground.

trees[1:5,]
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
A useful command…
Model <- lm(Height ~ Girth, data = trees)
summary(Model)
# NOTE: I have removed a lot of stuff you don’t need to see yet!
Coefficients:
            Estimate
(Intercept)  62.0313
Girth         1.0544
Multiple R-squared: 0.2697
Our line of best fit is…
ĥ = 62.0313 + 1.0544 × g

We can use this line to predict the height of a tree for a particular girth.

Example
If the girth is 10 inches, the model estimates the height will be:
ĥ = 62.0313 + 1.0544 × 10 ≈ 72.6 feet

Note! This is NOT the height of a particular cherry tree. Instead, it is an average height for a cherry tree with girth 10 inches.
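As a sketch, this prediction can be reproduced in R. The lm() call below is an assumption on my part, inferred from the coefficients shown on the summary(Model) slide (it uses R's built-in trees data):

```r
# Assumed fit, matching the coefficients shown by summary(Model)
Model <- lm(Height ~ Girth, data = trees)

# Estimated average height for a tree with girth 10 inches
predict(Model, newdata = data.frame(Girth = 10))
# by hand: 62.0313 + 1.0544 * 10 = 72.5753, i.e. about 72.6 feet
```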
Clicker Question
The line of best fit is ĥ = 62.0313 + 1.0544 × g. By approximately how much does our average tree height change for every inch increase in girth?

A) 63 feet
B) 1 foot
C) 64 feet
D) Cannot determine from the information given
E) None of the above
Interpolation
Interpolation is the use of the regression line for predicting the value of a response using an explanatory variate within the range of x.

(i.e., here a girth value between about 8 and 20 inches)
Clicker Question
The line of best fit is ĥ = 62.0313 + 1.0544 × g. This model says the average height of a tree when the girth is 14 inches is approximately…
A) 14 ft
B) 63 ft
C) 1 ft
D) 77 ft
E) None of the above
Extrapolation
Extrapolation is the use of the regression line for predicting the value of a response using an explanatory variate beyond the range of x.

(i.e., here a girth value outside of about 8 to 20 inches)
Warnings!

Interpolation tends to be a fairly safe thing to do, but extrapolation can have some scary consequences.
Clicker Question
The line of best fit is ĥ = 62.0313 + 1.0544 × g. This model says the average height of a tree when the girth is 0 inches is approximately…
A) 14 ft
B) 63 ft
C) 1 ft
D) 62 ft
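A small sketch of why the answer here should worry us (again assuming the lm() fit on R's built-in trees data): girth 0 is far outside the observed range, so the prediction is just the intercept.

```r
Model <- lm(Height ~ Girth, data = trees)  # assumed fit from earlier

range(trees$Girth)   # observed girths run from roughly 8 to 21 inches

# Extrapolating to girth 0 returns the intercept: a 62-foot tree
# with no trunk, which is physically meaningless
predict(Model, newdata = data.frame(Girth = 0))
```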
Today’s Agenda
Today:
• Interpolation and extrapolation
• Influential observations
• Confounding

Influential Observations
In one dimension, it is an outlier.

In more than one dimension, it is an outlier in the x direction or in the y direction (or in both).
• An example will help us see this!
Influential Observations affect…
• Slope
• Covariance
• Correlation

Influential Observations
Example

Dataset
xᵢ   yᵢ       xᵢ   yᵢ
69   207      64   193
70   212      73   219
68   206      76   230
69   206      66   195
73   219      70   209
Plot the data
The least squares regression line is
ŷ = −9.432 + 3.138x

r = 0.990
Let’s add an outlier
x₁₁ = 100; y₁₁ = 299
Extreme in both directions!

The least squares regression line is
ŷ = 1.444 + 2.981x

r = 0.999

(the red line is the original)
Let’s change the outlier
x₁₁ = 100; y₁₁ = 205
Extreme in the X direction only!

The least squares regression line is
ŷ = 191.915 + 0.238x

r = 0.216

(the red line is the original)
Let’s change the outlier
x₁₁ = 64; y₁₁ = 299
Extreme in the Y direction only!

The least squares regression line is
ŷ = 276.77 − 0.85x

r = −0.111

(the red line is the original)
Notes:
• An influential observation has a large influence on the statistical calculations being done.
• Identification: if removing it from the data causes our line of best fit to change markedly (see previous examples), then it’s an influential observation.
• Investigators should try to determine if it’s due to an error or some other factor surrounding the unit/process from which this observation was collected.
• Under certain scenarios, the investigator may choose to remove such points from the analysis.
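The three outlier scenarios above can be reproduced with a short R sketch. The vectors x and y are the ten points from the dataset slide; the fitted values may differ slightly from the rounded numbers quoted.

```r
# The ten original points from the dataset slide
x <- c(69, 70, 68, 69, 73, 64, 73, 76, 66, 70)
y <- c(207, 212, 206, 206, 219, 193, 219, 230, 195, 209)

coef(lm(y ~ x))                 # about -9.432 + 3.138x
cor(x, y)                       # about 0.990

# Add a point extreme in x only: slope and correlation collapse
coef(lm(c(y, 205) ~ c(x, 100))) # about 191.915 + 0.238x
cor(c(x, 100), c(y, 205))       # about 0.216

# Add a point extreme in y only: the slope even changes sign
coef(lm(c(y, 299) ~ c(x, 64)))  # about 276.77 - 0.85x
cor(c(x, 64), c(y, 299))        # about -0.111
```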
Today’s Agenda
Today:
• Interpolation and extrapolation
• Influential observations
• Confounding

Confounding

Example:
We are interested in how previous work experience affects income.
Wages are in thousands of dollars and Experience is in years.

Experience = c(5, 7, 9, 11, 13, 15, 17, 19)
Wages = c(45, 48, 50, 45, 65, 62, 65, 70)
We build a regression line…
The line is
Ŵages = 33.6786 + 1.881 × Exp
The correlation is 0.9.

Look at that high correlation!
Clicker Question
The line is Ŵages = 33.6786 + 1.881 × Exp and the correlation is 0.9. What is the R squared value?

A) 0.9
B) 0.81
C) -69.774
D) 1.881
E) None of the above
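A sketch of this fit in R, using the Experience and Wages vectors from the earlier slide:

```r
Experience <- c(5, 7, 9, 11, 13, 15, 17, 19)
Wages <- c(45, 48, 50, 45, 65, 62, 65, 70)

coef(lm(Wages ~ Experience))  # intercept about 33.679, slope about 1.881
cor(Experience, Wages)        # about 0.90
cor(Experience, Wages)^2      # R-squared, about 0.81 (i.e. 0.9^2)
```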
Clicker Question
Therefore, we can say an increase in
previous experience CAUSES Wages to
increase…
A) TRUE
B) FALSE

Example continued…
In this example, there is another piece of data that I should give you that may
or may not have been recorded by the researcher…

Experience = c(5, 7, 9, 11, 13, 15, 17, 19)
Wages = c(45, 48, 50, 45, 65, 62, 65, 70)
Education = c(0, 0, 0, 0, 1, 1, 1, 1)  # 1 = university, 0 = no university

And if we look at the correlation between Wages and Previous Experience among the non-university-educated, we obtain a correlation of 0.11. Among the university-educated, the correlation is 0.7.
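These within-group correlations can be checked with a short sketch, splitting on the Education indicator:

```r
Experience <- c(5, 7, 9, 11, 13, 15, 17, 19)
Wages <- c(45, 48, 50, 45, 65, 62, 65, 70)
Education <- c(0, 0, 0, 0, 1, 1, 1, 1)  # 1 = university, 0 = no university

# Correlation within each education group
cor(Experience[Education == 0], Wages[Education == 0])  # about 0.11
cor(Experience[Education == 1], Wages[Education == 1])  # about 0.70
```

The strong overall correlation of 0.9 largely disappears once education is held fixed, which is the point of the slide that follows.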
Warnings!
Education is called a lurking variable. Every study has them!

A lurking variable is a variable that is not one of the explanatory or response variables in a study but that may influence the interpretation of relationships among those variables!
• It can falsely identify a strong relationship between variables or hide a true relationship.

This is why CORRELATION DOES NOT IMPLY CAUSATION!
Correlation does NOT imply Causation!

Activity: Suppose you want to study the effect of
diet and exercise on a person’s blood pressure.

What lurking variable(s) might we need to consider?

Confounding
Two variables (either explanatory variables or lurking variables) are
confounded when their effects on a response cannot be distinguished
from each other.

Example
Weight and Age are confounding variables for Height in children.

Today’s Agenda
Today:
• Interpolation and extrapolation
• Influential observations
• Confounding

Homework Questions
• Chapter 4
• 4.15-4.23 (odds), 4.29b, 4.33 (use R)
