
Correlation & Regression Analysis

Correlation quantifies the linear relationship between two variables. The correlation coefficient ranges from -1 to 1, where -1 is total negative linear correlation, 0 is no linear correlation, and 1 is total positive linear correlation. Together with the means and standard deviations of the two variables, the correlation coefficient can be used to find the least squares line, which best fits the data points by minimizing the sum of squared vertical distances between the observed y-values and the line. While correlation indicates association, it does not necessarily imply causation. Residuals are the vertical distances from the observed y-values to the predicted y-values on the least squares line, while errors are the distances to the true line.

Uploaded by Akash Srivastava
© All Rights Reserved

Correlation and Regression

Correlation

Computed for bivariate data, where we have two variables for each point.

In this plot, we have the literacy fraction and the population fraction under 6 years of age for each village in Shahpur block. Each cross denotes one village.

Important: Correlation is not the same as causation.

Having fewer children is correlated with better literacy, but the correlation alone does not show that one causes the other.

Is there a measure to quantify the relatedness of the two variables?


Another example

In this plot, we have the population fraction under 6 years of age versus the number of households for the villages of Shahpur block. Each cross denotes one village.

Here, no clear relationship between the two variables is visible.


Several formulae exist for the correlation r

One can compute either the sample correlation or the population correlation
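As a minimal sketch of one common form of the sample correlation, r = S_xy / sqrt(S_xx · S_yy), where S_xy, S_xx and S_yy are sums of products of deviations from the means. The function name `sample_correlation` and the data values are made up for illustration; they are not the Shahpur village data.

```python
import math


def sample_correlation(xs, ys):
    """Sample Pearson correlation r between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustrative values, one pair per village.
under6 = [0.10, 0.12, 0.15, 0.18, 0.20]
literacy = [0.80, 0.74, 0.70, 0.61, 0.55]
r = sample_correlation(under6, literacy)
print(round(r, 3))
```

The 1/(n-1) factors in the sample covariance and sample standard deviations cancel, which is why they do not appear in the function.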


Correlation coefficient values

• r for the two examples is -0.76 (under 6 fraction vs. literacy fraction) and -0.16 (under 6 fraction vs. no. of households, HHs)
• r = -1 and r = 1 indicate maximum correlation, which happens when all points fall exactly on a straight line with negative or positive slope, respectively (the slope itself need not be -1 or 1)
• The correlation between (p-06/TOT-P) and (P-ST/TOT-P) is 0.57, indicating that the under 6 fraction of children is more tightly correlated with literacy (|r| = 0.76) than with being tribal (|r| = 0.57).

[Plots: under 6 fraction vs. literacy frac; under 6 fraction vs. no. of HHs]

The correlation coefficient measures only linear association
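A small sketch, on made-up data, of what "only linear association" means in practice: points lying exactly on a line give r = 1 or r = -1 whatever the steepness of the line, while a perfect but non-linear dependence (here y = x² on symmetric x values) gives r = 0. The helper `pearson_r` is an illustrative name, not part of the slides.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [-3, -2, -1, 0, 1, 2, 3]
steep_line = [10 * x + 4 for x in xs]      # exact line, slope 10
shallow_line = [-0.01 * x for x in xs]     # exact line, slope -0.01
parabola = [x * x for x in xs]             # y fully determined by x, but not linearly

r_steep = pearson_r(xs, steep_line)
r_shallow = pearson_r(xs, shallow_line)
r_parabola = pearson_r(xs, parabola)
print(r_steep, r_shallow, r_parabola)
```

A value of r near zero therefore rules out a linear trend, not a relationship of any kind, so it is worth looking at the scatter plot as well.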


Least Squares Line

A linear model to determine the nature and extent of correlation. It may be used to estimate data points.

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

x is the independent variable and y is the dependent variable, the βs are the regression coefficients (β0 the intercept, β1 the slope) and ε is the error.
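Estimated quantity

y is expected to depend linearly on x, so the fitted value for each point is

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

The least-squares estimates minimize the sum of squared residuals w.r.t. \(\hat{\beta}_0\) and \(\hat{\beta}_1\):

\[ \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0, \qquad \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = 0 \]

Carrying out the differentiation gives

\[ \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0, \qquad \sum_{i=1}^{n} x_i \, (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

so the residuals \(e_i = y_i - \hat{y}_i\) satisfy

\[ \sum_{i=1}^{n} e_i = 0 \]

The estimates are not the same as the true values.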
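Setting the two partial derivatives of the sum of squared residuals to zero and solving for the coefficients gives the standard closed form: slope = S_xy / S_xx and intercept = ȳ − slope · x̄. The sketch below applies it to made-up data and checks numerically that the residuals sum to zero; `fit_least_squares` is an illustrative name.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_least_squares(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Both first-order conditions hold up to floating-point rounding.
sum_e = sum(residuals)
sum_xe = sum(x * e for x, e in zip(xs, residuals))
print(b0, b1, sum_e, sum_xe)
```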
Difference between Estimates/Values, and Residuals/Errors

β̂1 and β̂0 are estimates of β1 and β0, and will vary depending on the sample set.

Residuals are the vertical distances from the observed values yi to the least-squares line y = β̂0 + β̂1x.

Errors are the distances from the yi to the true line y = β0 + β1x.
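The distinction can be made concrete with a simulation, which is the only setting where the true line is actually known. The sketch below assumes a hypothetical true line (intercept 2.0, slope 0.5) with Gaussian noise; all names and values are illustrative. Two samples drawn from the same true line give different estimates, and within one sample the residuals differ from the errors.

```python
import random

# Hypothetical "true" coefficients; in real data these are unknown.
TRUE_B0, TRUE_B1 = 2.0, 0.5


def fit(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def simulate(seed, n=30):
    """Draw one sample from the true line and fit a line to it."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    errors = [rng.gauss(0, 1) for _ in range(n)]          # distances to the true line
    ys = [TRUE_B0 + TRUE_B1 * x + e for x, e in zip(xs, errors)]
    b0, b1 = fit(xs, ys)
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # distances to the fitted line
    return b0, b1, errors, residuals


b0_a, b1_a, errors_a, residuals_a = simulate(seed=1)
b0_b, b1_b, errors_b, residuals_b = simulate(seed=2)

print("sample A estimates:", b0_a, b1_a)
print("sample B estimates:", b0_b, b1_b)
print("sum of residuals (A):", sum(residuals_a))
print("sum of errors (A):   ", sum(errors_a))
```

The residuals always sum to zero by construction of the fit, whereas the errors generally do not.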
When using linear fit

• Don't extrapolate outside the range of the data
• Only use the linear fit when the relationship in the data is linear
• If we interchange the dependent and independent variables, the regression coefficients change, i.e. the best-fit line changes
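The last point can be checked numerically. The sketch below, on made-up data, fits y on x and then x on y. Drawn in the same x-y plane, the second line has slope 1 / (slope of x on y), which differs from the first unless the points are perfectly collinear; the product of the two regression slopes equals r². The helper names are illustrative.

```python
def slope_and_intercept(us, vs):
    """Least-squares line predicting vs from us: returns (intercept, slope)."""
    n = len(us)
    u_bar = sum(us) / n
    v_bar = sum(vs) / n
    s_uv = sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
    s_uu = sum((u - u_bar) ** 2 for u in us)
    slope = s_uv / s_uu
    return v_bar - slope * u_bar, slope


def r_squared(xs, ys):
    """Square of the sample correlation coefficient."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical scattered data.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]

_, b_y_on_x = slope_and_intercept(xs, ys)   # predicts y from x
_, b_x_on_y = slope_and_intercept(ys, xs)   # predicts x from y

slope_in_xy_plane = 1 / b_x_on_y            # the x-on-y line drawn as y against x
r2 = r_squared(xs, ys)
print(b_y_on_x, slope_in_xy_plane, r2)
```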
• Curing times in days (x) and compressive strengths in MPa (y) were recorded for several concrete specimens. The means and standard deviations of the x and y values were x̄ = 5, sx = 2, ȳ = 1350, sy = 100. The correlation between curing time and compressive strength was computed to be r = 0.7. Find the equation of the least-squares line to predict compressive strength from curing time.

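Given x̄ = 5, sx = 2, ȳ = 1350, sy = 100 and r = 0.7, the slope is

\[ \hat{\beta}_1 = r \, \frac{s_y}{s_x} = 0.7 \times \frac{100}{2} = 35 \]

and, since the line passes through the means (x̄, ȳ),

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 1350 - 35 \times 5 = 1175 \]

The least-squares line is therefore y = 1175 + 35x.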
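The same calculation can be sketched in code, using the standard relations slope = r · sy / sx and intercept = ȳ − slope · x̄ (the line passes through the means). The function name `line_from_summary` is illustrative; the inputs are the summary statistics stated in the example.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Least-squares line of y on x from summary statistics: (intercept, slope)."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)
    return intercept, slope


# Curing time (days) vs. compressive strength (MPa), values from the example.
b0, b1 = line_from_summary(x_bar=5, s_x=2, y_bar=1350, s_y=100, r=0.7)
print(f"y = {b0:g} + {b1:g} x")  # y = 1175 + 35 x

# At the mean curing time the prediction equals the mean strength.
predicted_at_mean = b0 + b1 * 5
```

The line should only be used for curing times within the range covered by the recorded specimens.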


Example 2
End of slides
