W3 (Units 530-531) – Regression
Some definitions
Linear equation for the population (using Greek letters): Y = β0 + β1 * X_i
In the example below: Exam_grade = β0 + β1 * Study_time_i
Least squares equation (using sample data, so you can fill in the numbers): ŷ = b0 + b1 * X1
b0: Intercept, the predicted value of Y when X1 = 0 (where the line crosses the Y-axis)
b1: Slope of the variable X1
X1: Independent variable (X-axis) -> Study_time
Y: Dependent variable (Y-axis) -> Exam_grade
How to read R output:
ŷ = b0 + b1 * X1, so -> ŷ (grade) = 31.6 + 1.83 * X1 (Study_time)
Typical questions:
- What is the intercept? The intercept is 31.6, which is the expected exam_grade of someone who
did not study at all (x = study_time = 0).
- What is the slope? The slope is 1.83 and represents the effect of study_time on exam_grade.
In this case, for every additional hour you study, your grade increases on average by 1.83 points.
- What is the expected grade of a student who studied 10 hours? -> Substitute X in the equation:
ŷ = 31.6 + 1.83 * X1 -> ŷ = 31.6 + 1.83 * 10 = 49.9
- Suppose the residual for a student who studied 7 hours is -2 points. What is the real/observed
grade of this student? -> Substitute X in the regression equation, then add the residual
(observed = predicted + residual):
y = 31.6 + 1.83 * 7 − 2 = 42.41
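The two calculations above (expected grade and observed grade) can be checked in R. This is only a minimal sketch: the coefficients are copied from the example equation, and the helper name predict_grade is made up for illustration.

```r
# Coefficients read from the exam-grade example in these notes
b0 <- 31.6   # intercept: expected grade with 0 hours of study
b1 <- 1.83   # slope: extra points per extra hour of study

# Hypothetical helper (not part of any package): predicted grade for a study time
predict_grade <- function(study_time) b0 + b1 * study_time

# Expected grade after 10 hours of study
predicted_10h <- predict_grade(10)

# Observed grade = predicted grade + residual (here the residual is -2)
observed_7h <- predict_grade(7) + (-2)

print(predicted_10h)  # 49.9
print(observed_7h)    # 42.41
```

Remember the sign convention: residual = observed − predicted, so a negative residual means the student scored below the regression line.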
To do: Extra example: Let’s say the regression equation is: ŷ (Price) = 25 + 11.50 * X (room size in m^2).
Imagine your friend found a very good deal for a room in Enschede. This room is 14 m^2 and the real
cost of the rent is 250 euros. Do you think this is really a good deal? If so, why? By how much?
ŷ = 25 + 11.50 * 14 = 186 euros is the expected rent and 250 euros is the observed rent. The residual
is 250 − 186 = +64, so this is not a good deal: your friend pays 64 euros more than the model
predicts for a room of this size.
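The room example is a residual calculation (observed minus expected). A small R sketch, assuming X in the price equation is the room size in m^2; the variable names are chosen here for illustration only:

```r
# Coefficients of the price equation from the extra example
b0 <- 25      # intercept (euros)
b1 <- 11.50   # slope (euros per m^2, assuming X is the room size)

size_m2       <- 14    # size of the room your friend found
observed_rent <- 250   # what your friend actually pays

# Expected rent according to the regression line
expected_rent <- b0 + b1 * size_m2

# Residual = observed - expected; positive means more expensive than predicted
residual <- observed_rent - expected_rent

print(expected_rent)  # 186
print(residual)       # 64
```

A good deal would show up as a negative residual (observed rent below the line).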
Testing whether the relationship is significant (Data from Unit 531): We are interested in the
relationship between weight and height. Is it true that the taller you are, the more you weigh?
Imagine you expect a positive relationship.
Linear equation:
Y = β0 + β1 * X_i → Weight = β0 + β1 * Height_i
Hypothesis:
- H0: β1 = 0 (there is no effect of height on weight)
- H1: β1 ≠ 0, or β1 > 0 (if you expect a positive relationship)
R output:
Call:
lm(formula = weight ~ height, data = .)
Residuals:
Min 1Q Median 3Q Max
-36.700 -9.362 -1.771 7.162 63.421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.54888 3.37147 -20.33 <2e-16 ***
height 0.84342 0.01938 43.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.35 on 5120 degrees of freedom
Multiple R-squared: 0.27, Adjusted R-squared: 0.2699
F-statistic: 1894 on 1 and 5120 DF, p-value: < 2.2e-16
Test statistic: t = b1 / SE(b1) = 0.84342 / 0.01938 = 43.52
(use the unrounded values from the output; 0.843 / 0.019 would give about 44.4)
Conclusion about the hypothesis: Is the effect of height on weight significant?
- Compare the p-value with alpha (0.05): if p-value < alpha -> reject H0; if p-value > alpha ->
do not reject H0.
- In our example the p-value is < 2e-16, far below 0.05, so we reject H0 and we can “accept” H1:
there is a significant positive effect of height on weight.
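The test statistic and the decision can be reproduced from the numbers in the coefficient table. A minimal sketch, assuming you type the estimate, standard error, and degrees of freedom in by hand (the variable names are illustrative); pt() is the t-distribution function in base R:

```r
# Values copied from the "height" row and the residual df of the R output
b1    <- 0.84342   # estimated slope
se_b1 <- 0.01938   # standard error of the slope
df    <- 5120      # residual degrees of freedom
alpha <- 0.05

# Test statistic: estimate divided by its standard error
t_stat <- b1 / se_b1

# Two-sided p-value (H1: beta1 != 0)
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# One-sided p-value (H1: beta1 > 0), only the upper tail
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

decision <- if (p_two_sided < alpha) "Reject H0" else "Do not reject H0"

print(round(t_stat, 2))  # 43.52
print(decision)
```

With a t value this large, the p-value is so small that R reports it only as "< 2e-16" in the summary output.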
Typical questions: (to do)
- Write the linear equation: ŷ (Weight) = b0 + b1 * X (Height) -> ŷ = -68.54 + 0.84 * X
- Suppose someone who is 180 cm tall has a residual of +10. What is his real/observed weight?
o First: substitute X in the formula: ŷ = -68.54 + 0.84 * 180 = 82.66.
o Second: add the residual (observed = predicted + residual): y = ŷ + 10 = 82.66 + 10 = 92.66.
- What is the slope and what does it mean in this context? For every additional cm of height,
weight increases on average by 0.84 (in the units of weight).
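The 180 cm question can be checked in R. The sketch below uses the rounded coefficients from these notes and, for comparison, the full-precision estimates from the R output; the small difference between the two answers comes from rounding only. Variable names are illustrative.

```r
height_cm <- 180
residual  <- 10

# Rounded coefficients, as used in the notes
pred_rounded <- -68.54 + 0.84 * height_cm
obs_rounded  <- pred_rounded + residual

# Full-precision coefficients copied from the coefficient table
pred_full <- -68.54888 + 0.84342 * height_cm
obs_full  <- pred_full + residual

print(pred_rounded)  # 82.66
print(obs_rounded)   # 92.66
print(pred_full)     # about 83.27
print(obs_full)      # about 93.27
```

On an exam, use the level of rounding the question asks for and state the coefficients you used.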
New questions:
95% Confidence interval for the slope
By hand:
- C.I = Estimate ± Margin of error
- C.I = b1 ± t* × SE(b1)
o We know b1 and SE(b1)
o We can assume t* = 1.96 since our sample size is large enough (n = 5122)
o More precisely, you can find it with R:
o critical_t <- qt(p=.05/2, df=5120, lower.tail=FALSE)   # = 1.9604
- C.I = 0.84342 ± 1.9604 × 0.01938 = [0.805; 0.881]
(use the unrounded b1 and SE(b1); with 0.84 and 0.019 you would get about [0.803; 0.877])
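The by-hand confidence interval can be reproduced step by step in R. A minimal sketch, assuming the slope estimate, its standard error, and the degrees of freedom are typed in from the R output (variable names other than critical_t are illustrative):

```r
# Values copied from the "height" row and the residual df of the R output
b1    <- 0.84342
se_b1 <- 0.01938
df    <- 5120

# Critical t value for a 95% interval (2.5% in each tail)
critical_t <- qt(p = 0.05 / 2, df = df, lower.tail = FALSE)

# Margin of error and the interval itself
margin <- critical_t * se_b1
ci     <- c(lower = b1 - margin, upper = b1 + margin)

print(round(critical_t, 4))  # 1.9604
print(round(ci, 3))          # lower 0.805, upper 0.881
```

The result agrees with confint() up to the rounding of the typed-in values.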
Using R:
confint(L_model1, level = 0.95)
2.5 % 97.5 %
(Intercept) -75.1583964 -61.9393557
height 0.8054237 0.8814085
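These notes do not show how L_model1 was created; presumably it is the lm(weight ~ height) fit from the output above. A sketch under that assumption, using simulated stand-in data (NOT the Unit 531 data, so the numbers will differ from the output shown here); sim_data is a made-up name:

```r
set.seed(531)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data (hypothetical, NOT the Unit 531 data set)
n <- 500
sim_data <- data.frame(height = rnorm(n, mean = 170, sd = 10))
sim_data$weight <- -68.5 + 0.84 * sim_data$height + rnorm(n, mean = 0, sd = 13)

# Fit the simple linear regression: weight explained by height
L_model1 <- lm(weight ~ height, data = sim_data)

# Coefficient table (estimates, standard errors, t values, p-values)
print(summary(L_model1)$coefficients)

# 95% confidence intervals for the intercept and the slope
ci <- confint(L_model1, level = 0.95)
print(ci)
```

With the real course data you would replace sim_data by your own data frame and keep the last two steps unchanged.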
NB: You can also test the null hypothesis (two-sided, alpha = 0.05) using your confidence interval:
- If the 95% C.I does NOT include zero -> Reject H0
- If the 95% C.I DOES include zero -> Do not reject H0
- In our example, since 0 is not within the interval [0.805; 0.881], we can reject H0 and say that
the effect of height on weight is significant.
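The decision rule based on the confidence interval can be written as a short check in R. A minimal sketch, assuming the interval bounds are copied from the confint() output for height (variable names are illustrative):

```r
# 95% confidence interval for the slope, copied from the confint() output
ci_slope <- c(lower = 0.8054237, upper = 0.8814085)

# Does the interval contain zero?
contains_zero <- ci_slope[["lower"]] <= 0 && 0 <= ci_slope[["upper"]]

# Zero inside the interval -> do not reject H0; zero outside -> reject H0
decision <- if (contains_zero) "Do not reject H0" else "Reject H0"

print(contains_zero)  # FALSE
print(decision)       # "Reject H0"
```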