Correlation and Regression
Correlation
Computed for bivariate data, where each data point
has two variables.
In this plot, we have the literacy fraction and
population fraction under 6 years of age for
each village in Shahpur block. Each cross
denotes one village
Important:
Correlation is not the same as causation.
Having fewer children is correlated with better literacy, but that alone does not show that one causes the other.
Is there a measure to quantify the relatedness of the two variables?
Another example
In this plot, we have the population fraction
under 6 years of age versus number of
Households for the villages of Shahpur block.
Each cross denotes one village
Here, there is no clear relationship between the two variables
Several formulae for correlation r
Can find the sample correlation or population correlation
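One common form of the correlation formula can be sketched in plain Python as below. The function name pearson_r and the toy data are assumptions for illustration and are not the Shahpur village data. In this form the n - 1 (sample) or n (population) divisors cancel between numerator and denominator, so the value of r for a given data set does not depend on that choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 3))
```

For this toy data the sums are sxy = 8, sxx = 10 and syy = 10, so r = 8 / 10 = 0.8.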
Correlation coefficient values
(Plot: under 6 fraction vs. literacy frac)
• r for the two examples is -0.76 (under 6 fraction vs. literacy
frac) and -0.16 (under 6 fraction vs. no. of HHs)
• r = 1 and r = -1 indicate maximum correlation, which happens
when all points fall on a straight line with positive or negative
slope, respectively (the slope need not be 1 or -1)
• The correlation between (p-06/TOT-P) and (P-ST/TOT-P) is
0.57, indicating that the under 6 fraction of children is
more tightly correlated with literacy (|r| = 0.76) than with the
tribal population fraction (|r| = 0.57).
(Plot: under 6 fraction vs. no. of HHs)
The correlation coefficient measures only linear association
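A small sketch of what "only linear association" means, under assumed toy data: points lying exactly on a line give r = 1 or r = -1 whatever the steepness of the line, while points lying exactly on a symmetric parabola give r = 0 even though y is completely determined by x. The helper pearson_r and all data values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
steep_up = [10.0 * x + 3.0 for x in xs]       # exact line, slope 10
shallow_down = [-0.1 * x + 3.0 for x in xs]   # exact line, slope -0.1
parabola = [x ** 2 for x in xs]               # exact but nonlinear relation

r_up = pearson_r(xs, steep_up)
r_down = pearson_r(xs, shallow_down)
r_parabola = pearson_r(xs, parabola)
print(r_up, r_down, r_parabola)
```

A value of r near 0 therefore rules out a linear trend only, not a relationship in general, which is why it helps to look at the scatter plot as well.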
Least Squares Line
A linear model to determine the nature and extent
of correlation
May be used to estimate data points
y = β0 + β1x + ε
x is the independent variable and y is the dependent
variable, the βs are the regression coefficients and ε is
the error.
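Estimated quantity:
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
y is expected to depend linearly on x
The estimates are not the same as
the true values
Minimize the sum of squared residuals w.r.t. $\hat{\beta}_0$ and $\hat{\beta}_1$:
$\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0$
$\frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0$
These conditions give
$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
$\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
With residuals $e_i = y_i - \hat{y}_i$, the first condition reads
$\sum_{i=1}^{n} e_i = 0$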
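Setting the two derivatives to zero and solving gives the standard closed form: slope = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and intercept = ȳ - slope · x̄. The sketch below applies it to assumed toy data and checks that the residuals sum to zero. The function name least_squares_line and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1): intercept and slope minimizing sum((y - b0 - b1*x)**2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Toy data (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
b0, b1 = least_squares_line(xs, ys)

# Residuals e_i = y_i - y_hat_i
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sum_e = sum(residuals)                                  # first condition
sum_xe = sum(x * e for x, e in zip(xs, residuals))      # second condition
print(b0, b1, sum_e, sum_xe)
```

For this toy data the fit is approximately ŷ = 0.6 + 0.8x, and both sums are zero up to floating-point rounding.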
Difference between Estimates/True Values and Residuals/Errors
β̂1 and β̂0 are estimates of β1 and β0, and will vary
depending on the sample set
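A simulation sketch of this sampling variability, assuming a made-up true line y = 2 + 0.5x with Gaussian noise: two samples drawn from the same model give two different pairs of estimates, neither equal to the true coefficients. All names and parameter values are hypothetical.

```python
import random

def least_squares_line(xs, ys):
    """Return (b0, b1): least-squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

TRUE_B0, TRUE_B1 = 2.0, 0.5   # assumed "true" coefficients, for illustration only

def draw_sample(rng, n=30):
    """Draw n points from y = TRUE_B0 + TRUE_B1 * x + noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)        # fixed seed so the run is repeatable
fit_a = least_squares_line(*draw_sample(rng))
fit_b = least_squares_line(*draw_sample(rng))
print("sample A:", fit_a)
print("sample B:", fit_b)
```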
Residuals: the vertical distances from the observed values yi to the least-squares line ŷ = β̂0 + β̂1x.
Errors: the vertical distances from the yi to the true line y = β0 + β1x.
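The distinction can be made concrete only in a simulation, since the true line is unknown in practice. The sketch below assumes a made-up true line, generates data around it, and computes both quantities. Because the least-squares line minimizes the sum of squared vertical distances over all lines, the sum of squared residuals is never larger than the sum of squared errors. All names and values are hypothetical.

```python
import random

def least_squares_line(xs, ys):
    """Return (b0, b1): least-squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

TRUE_B0, TRUE_B1 = 2.0, 0.5   # assumed true line, known only because we simulate

rng = random.Random(1)
xs = [float(i) for i in range(1, 11)]
ys = [TRUE_B0 + TRUE_B1 * x + rng.gauss(0.0, 1.0) for x in xs]

b0, b1 = least_squares_line(xs, ys)

# Errors: distances to the true line; residuals: distances to the fitted line
errors = [y - (TRUE_B0 + TRUE_B1 * x) for x, y in zip(xs, ys)]
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

sse_errors = sum(e ** 2 for e in errors)
sse_residuals = sum(e ** 2 for e in residuals)
print(sum(residuals), sum(errors))
print(sse_residuals, sse_errors)
```

The residuals sum to zero by construction, while the errors generally do not.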
When using a linear fit
Don't Extrapolate Outside the Range of the Data
Only use the linear fit when the relationship between x and y is approximately linear
If we interchange the dependent and independent variables, that
changes the regression coefficients – i.e. the best fit line changes
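A sketch of this effect on assumed toy data: regressing y on x and regressing x on y give two different lines when drawn on the same axes. The product of the two slopes equals r², so the lines coincide only when r = ±1. The helper centered_sums and the data are hypothetical.

```python
def centered_sums(xs, ys):
    """Return (x_bar, y_bar, sxx, syy, sxy) for two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy

# Toy data (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]
x_bar, y_bar, sxx, syy, sxy = centered_sums(xs, ys)

# Regression of y on x: y = intercept_y_on_x + slope_y_on_x * x
slope_y_on_x = sxy / sxx
intercept_y_on_x = y_bar - slope_y_on_x * x_bar

# Regression of x on y: x = intercept_x_on_y + slope_x_on_y * y
slope_x_on_y = sxy / syy
intercept_x_on_y = x_bar - slope_x_on_y * y_bar

# Slope of the x-on-y line when drawn with x horizontal and y vertical
slope_x_on_y_replotted = 1.0 / slope_x_on_y

r_squared = sxy ** 2 / (sxx * syy)
print(slope_y_on_x, slope_x_on_y_replotted, r_squared)
```

Here the y-on-x slope is about 1.0, while the x-on-y line has slope about 1.48 on the same axes.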
• Curing times in days (x) and compressive strengths in MPa (y) were recorded for several
concrete specimens. The means and standard deviations of the x and y values were x̄ = 5, sx =
2, ȳ = 1350, sy = 100. The correlation between curing time and compressive strength was
computed to be r = 0.7. Find the equation of the least-squares line to predict compressive
strength from curing time.
β̂1 = r · sy / sx
β̂0 = ȳ − β̂1 · x̄
(The line passes through the means)
x̄ = 5, sx = 2, ȳ = 1350, sy = 100, r = 0.7.
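The arithmetic for this example can be checked with a short sketch. It uses the standard identities slope = r · sy / sx and intercept = ȳ - slope · x̄ (the line passes through the point of means). The variable names are assumptions and the numbers are the ones given in the problem.

```python
# Summary statistics given in the problem
x_bar, s_x = 5.0, 2.0        # curing time in days
y_bar, s_y = 1350.0, 100.0   # compressive strength in MPa
r = 0.7

slope = r * s_y / s_x                # 0.7 * 100 / 2
intercept = y_bar - slope * x_bar    # line passes through (x_bar, y_bar)

print(f"strength = {intercept:.0f} + {slope:.0f} * curing_time")
```

This gives a slope of 35 MPa per day and an intercept of 1175 MPa, i.e. ŷ = 1175 + 35x.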
Example 2
End of slides