01 Correlation (Revised)
Bivariate distribution:
Many situations arise in which we are interested in studying the relationship
between two variables, such as:
1. The amount of rainfall and yield of a certain crop
2. The amount of FGR and fish weight
3. The height and weight of a group of children
4. Income and expenditure of several families
5. The heart girth and body weight of an animal, etc.
What is correlation?
When there is a relationship between the quantitative measures of two sets
of phenomena, the appropriate statistical tool for discovering and measuring the
relationship and expressing it in a precise way is known as correlation.
To show up any relationship more clearly, we can move the origin of the diagram to
the point (x̄, ȳ). The coordinates of a typical point (x_i, y_i) will now be written
as (x_i − x̄, y_i − ȳ). The ith point on the diagram has been labeled to show this. If we
look at the scatter diagram we can see the signs (+ or −) taken by all the (x_i − x̄) and
(y_i − ȳ) in the four new quadrants. By multiplying the signs, we can find the sign
taken by their product (x_i − x̄)(y_i − ȳ).
Scatterplot
The most useful graph for displaying the relationship between two quantitative
variables is a scatterplot.
A scatterplot shows the relationship between two quantitative variables measured for
the same individuals. The values of one variable appear on the horizontal axis, and the
values of the other variable appear on the vertical axis. Each individual in the data
appears as a point on the graph.
Because r uses the standardized values of the observations, r does not change
when we change units of measurement (inches vs. centimeters, pounds vs.
kilograms, miles vs. kilometers, etc.). So, r is “scale invariant”.
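This scale invariance is easy to check numerically. The sketch below uses made-up height and weight values (not from the text) and shows that converting units leaves r unchanged:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: SP(x, y) / sqrt(SS(x) * SS(y))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ssx = sum((xi - xbar) ** 2 for xi in x)
    ssy = sum((yi - ybar) ** 2 for yi in y)
    return sp / sqrt(ssx * ssy)

heights_in = [58, 62, 64, 67, 70, 72]        # inches (illustrative values)
weights_lb = [115, 120, 135, 140, 155, 170]  # pounds (illustrative values)

r_original = pearson_r(heights_in, weights_lb)

# Convert inches -> centimeters and pounds -> kilograms; r is unchanged.
r_converted = pearson_r([h * 2.54 for h in heights_in],
                        [w * 0.4536 for w in weights_lb])
```

Any positive linear rescaling of either variable cancels out of both the numerator and the denominator of r.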
7. Correlation requires that both variables be quantitative (numerical).
You can't calculate a correlation between “income” and “city of residence”
because “city of residence” is a qualitative (non-numerical) variable.
8. The correlation can be misleading in the presence of outliers or nonlinear
association.
Correlation coefficient r does not describe curved relationships. r is affected
by outliers. When possible, check the scatter plot.
9. Correlation measures association. But association does not necessarily show
causation.
Both variables may be influenced simultaneously by some third variable.
10. The ratio of the values of rxy does not show the relative closeness of correlations.
If rxy = 0.6, it does not mean that this correlation is twice as strong as one
whose value is rxy = 0.3.
11. The value rxy = 0.5 shows as much closeness in the positive direction as
rxy = −0.5 shows in the negative direction.
12. Correlation requires random variables measured at the interval or ratio level of measurement.
13. If X and Y are independent, then rxy = 0; however, the converse is not true.
Let x and y be the variables and (x1, y1), (x2, y2), …, (xn, yn) denote n pairs of
observations with means (x̄, ȳ) and standard deviations s_x and s_y respectively.
Write the standardized variates as

    u_i = (x_i − x̄)/s_x   and   v_i = (y_i − ȳ)/s_y.

Then

    Σu_i² = Σ(x_i − x̄)²/s_x² = n s_x²/s_x² = n,

and similarly Σv_i² = n.

Again,

    Σu_i v_i = Σ(x_i − x̄)(y_i − ȳ)/(s_x s_y) = n Cov(x, y)/(s_x s_y) = nr,

where r denotes the correlation coefficient between x and y.
Now, (u_i ∓ v_i)² can never be negative, because it is a perfect square. Hence the sum
of all such squares for i = 1, 2, 3, …, n cannot be negative; i.e.

    Σ(u_i ∓ v_i)² ≥ 0
    Σu_i² + Σv_i² ∓ 2Σu_i v_i ≥ 0
    n + n ∓ 2nr ≥ 0
    2n(1 ∓ r) ≥ 0
    1 ∓ r ≥ 0   (since n > 0).

Taking the minus sign gives 1 − r ≥ 0, i.e. r ≤ 1; taking the plus sign gives
1 + r ≥ 0, i.e. r ≥ −1. Hence

    −1 ≤ r ≤ 1.

This proves that the correlation coefficient lies between –1 and +1.
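The identities used in this derivation (Σu_i² = n and Σu_i v_i = nr, with s_x and s_y computed with divisor n) can be verified numerically; the data values below are arbitrary illustrations:

```python
from math import sqrt

x = [2.0, 4.0, 5.0, 7.0, 9.0]   # arbitrary illustrative data
y = [1.0, 3.0, 2.0, 6.0, 8.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ssx = sum((xi - xbar) ** 2 for xi in x)
ssy = sum((yi - ybar) ** 2 for yi in y)
r = sp / sqrt(ssx * ssy)              # correlation coefficient

# Population standard deviations (divisor n), as in the derivation.
sx, sy = sqrt(ssx / n), sqrt(ssy / n)
u = [(xi - xbar) / sx for xi in x]    # standardized x
v = [(yi - ybar) / sy for yi in y]    # standardized y

sum_u2 = sum(ui ** 2 for ui in u)                 # equals n
sum_uv = sum(ui * vi for ui, vi in zip(u, v))     # equals n * r
```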
Proof:
We know that the correlation coefficient between x and y is given by

    rxy = SP(x, y)/√(SS(x)·SS(y)) = Σ(x_i − x̄)(y_i − ȳ) / √(Σ(x_i − x̄)² · Σ(y_i − ȳ)²).

Suppose the transformations u_i = (x_i − a)/c and v_i = (y_i − b)/d define a change of
origin and scale (where c, d > 0).

Now, ū = (x̄ − a)/c, so

    u_i − ū = (x_i − a)/c − (x̄ − a)/c = (x_i − x̄)/c,

i.e. x_i − x̄ = c(u_i − ū). Similarly, y_i − ȳ = d(v_i − v̄).

Hence

    rxy = Σ c(u_i − ū) · d(v_i − v̄) / √(Σ c²(u_i − ū)² · Σ d²(v_i − v̄)²)
        = cd Σ(u_i − ū)(v_i − v̄) / (cd √(Σ(u_i − ū)² · Σ(v_i − v̄)²))
        = ruv.

This proves that the correlation coefficient is not affected by change of origin and
scale.
Degrees of Correlation
Through the coefficient of correlation, we can measure the degree or extent of the
correlation between two variables. On the basis of the coefficient of correlation we
can also determine whether the correlation is positive or negative and also its degree
or extent.
1. Perfect correlation: If two variables change in the same direction and in the
same proportion, the correlation between the two is perfect positive.
High degree, moderate degree, and low degree are the three categories of this kind of
correlation. The following table reveals the effect (or degree) of the coefficient of
correlation.
Interpretation of r:
The correlation coefficient always lies between –1 and +1. To interpret the
correlation coefficient, we must consider both its sign (positive or negative) which
indicates the direction of relationship and its absolute value which indicates the
strength of linear relationship. A perfect positive correlation has a coefficient of 1.0; a
perfect negative correlation has a coefficient of -1.0. When there is no association
between two variables, the correlation coefficient has a value of 0. A value of zero
indicates that the variables are not linearly related, or perhaps have a more complex,
nonlinear relationship.
Review each correlation coefficient presented below and determine its direction and
strength.
1. -0.38
The negative sign tells us that this is a negative correlation. A high score on
the X variable would predict a low score on the Y variable. The absolute value of
the correlation, 0.38, would be considered moderate in size in agricultural
research. It is not a terribly strong relationship, but there is definitely a linear
relationship between the two variables.
2. 0.23
This is a small, positive correlation. A high score on the X variable would
predict a high score on the Y variable but not with a great deal of accuracy. There
would be a fair amount of scatter on the bivariate plot but a definite linear
relationship could be seen.
4. 0.84
This is a strong positive correlation. A high score on the X variable would
predict a high score on the Y variable. The absolute value of the correlation is
0.84, which is close to 1.0. There would not be a lot of scatter on a bivariate
plot. This would be considered a high degree of correlation.
5. –1.0
This is a perfect negative correlation. The data would follow a perfectly
straight line on a scatter plot beginning in the upper left corner of the plot and
progressing downward to the lower right corner of the plot. A high score on the X
variable would predict a low score on the Y variable. There would be no scatter at
all on the plot.
6. 0.11
This is a small, positive correlation. The positive sign indicates that a high
score on the X variable would predict a high score on the Y variable. But, an
absolute value of 0.11 suggests a very small linear relationship. There would
be a large amount of scatter on the bivariate plot.
7. –0.06
This correlation coefficient is close to zero. Even though the sign of the
correlation coefficient is negative, the fact that its absolute value is so close to
zero would lead to an interpretation of no relationship. There would be a large
amount of scatter on the bivariate plot that would appear to be random.
8. 0.62
This is a positive correlation. The positive sign indicates that a high score on the X
variable would predict a high score on the Y variable. The absolute value of 0.62
suggests a fairly predictable relationship between X and Y; there would be only a
modest amount of scatter on the bivariate plot.
9. -0.75
This is a high degree of negative correlation. The negative sign indicates that a
high score on the X variable would predict a low score on the Y variable. The
absolute value of 0.75 indicates a strong relationship. There would be only a
modest amount of scatter on the bivariate plot.
[Figure: scatterplot of heights, with height in inches on one axis and the same
heights in feet on the other; all points fall exactly on a straight line.]

Obviously the relationship will be perfect. All the points are on the line; there is no
spread to the scatter plot. Thus your Z_x = (x_i − x̄)/σ_x will be equal to
Z_y = (y_i − ȳ)/σ_y. That's because you stand in the same relative location in the
height distribution no matter whether it is measured in feet or inches. Such a
relationship gives you an r = 1.0 in the simple derivation below.
Just because one variable relates to another variable does not mean that
changes in one causes changes in the other. Other variables may be acting on one or
both of the related variables and affect them in the same direction. Cause-and-effect
may be present, but correlation does not prove cause. For example, the length of a
person's pants and the length of their legs are positively correlated - people with
longer legs have longer pants; but increasing one’s pant length will not lengthen one’s
legs!
Property of Linearity
The conclusion of no significant linear correlation does not mean that X and Y
are not related in any way. The data depicted in Figure 2 result in r = 0, indicating no
linear correlation between the two variables. However, close examination shows a
definite pattern in the data reflecting a very strong “nonlinear” relationship. Pearson's
correlation applies only to linear data.
[Figure 2: scatterplot of the data below; the points trace the parabola y = x², a
clear nonlinear pattern.]

    x: −3 −2 −1 0 1 2 3
    y:  9  4  1 0 1 4 9
Here Σx = 0, Σy = 28, Σxy = 0, n = 7, so

    SP(x, y) = Σxy − (Σx)(Σy)/n = 0 − (0)(28)/7 = 0.

Therefore rxy = SP(x, y)/√(SS(x)·SS(y)) = 0, i.e. the correlation coefficient between
x and y is zero. But it may be noticed that x and y are bound by the relation y = x².
So, x and y are not independent. Thus the correlation may be zero even when the
variables are not independent.
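The computation can be replicated directly; a minimal sketch for the data in this example:

```python
from math import sqrt

x = [-3, -2, -1, 0, 1, 2, 3]
y = [9, 4, 1, 0, 1, 4, 9]   # y = x**2 exactly
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n   # 0 and 4

sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # SP(x, y) = 0
ssx = sum((xi - xbar) ** 2 for xi in x)
ssy = sum((yi - ybar) ** 2 for yi in y)

r = sp / sqrt(ssx * ssy)   # r = 0 despite the exact relation y = x**2
```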
Common variance = r² × 100%.
For example, if two variables are correlated r = 0.71, they have 50% common variance
(0.71² × 100 ≈ 50%), indicating that 50% of the variability in the Y-variable can be
explained by variance in the X-variable. The remaining 50% of the variance in Y
remains unexplained. This unexplained variance indicates the error when predicting Y
from X. For example, strength and speed are related at about r = 0.80 (r² = 64%
common variance), indicating that 64% of both strength and speed comes from common
factors and the remaining 36% remains unexplained by the correlation.
[Figure: scatterplot with the y variable (predicted) on the vertical axis and the
x variable (predictor) on the horizontal axis.]
    S.E.(r) = (1 − r²)/√n

    P.E. = 0.6745 × S.E.(r) = 0.6745 (1 − r²)/√n

Symbolically, ρ = r ± P.E.
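As a quick sketch of these formulas in code (the function names are my own), using an illustrative r = 0.6 from n = 64 pairs:

```python
from math import sqrt

def standard_error(r, n):
    """S.E.(r) = (1 - r**2) / sqrt(n)."""
    return (1 - r ** 2) / sqrt(n)

def probable_error(r, n):
    """P.E. = 0.6745 * S.E.(r)."""
    return 0.6745 * standard_error(r, n)

# Example: r = 0.6 computed from n = 64 pairs of observations.
pe = probable_error(0.6, 64)   # 0.6745 * 0.64 / 8
```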
State in each case whether you would expect to obtain a positive, negative or no
correlation between:
Age and blood pressure.
Air temperature and metabolic rate.
Amount of rainfall and yield of a certain crop.
Dose of nitrogen and yield of a certain crop.
Drug dose and blood pressure.
Food intake and weight.
Idle time of machine and volume of production.
Income and expenditure of several families.
Increase in rainfall up to a point and production of rice.
Investment and profit
Number of goals conceded by a team and their position in the league.
Number of hours studied and grade obtained.
Number of tillers and yield of wheat.
Numbers of errors and typing speed.
Panicle length and yield of rice.
Price and demand of commodities.
Production and price per unit.
Sale of cold-drinks and day temperature.
Sale of woolen garments and day temperature.
Shoe size and intelligence.
Supply and Price of commodities.
Temperature and percentage breakage of unhusked rice in milling.
The age of husbands and wives.
The height and weight of a group of children.
Weight and blood pressure.
Years of education and income.
Solution:

    P.E. = 0.6745 (1 − r²)/√n
         = 0.6745 × (1 − (0.6)²)/√64
         = 0.6745 × 0.64/8
         ≈ 0.054
    r = Cov(x, y)/√(Var(x)·Var(y))
      = −1.65/√(2.85 × 100)
      ≈ −0.098
Problem:
A group of n = 15 strawberry plants was grown in plots in a greenhouse, and
measurements were taken on crop yield (y) and the corresponding level of nitrogen
present in the leaf at the time of picking:
x 2.50 2.55 2.54 2.65 2.68 2.55 2.62 2.57 2.63 2.59 2.69 2.61 2.67 2.57 2.53
y 247 245 266 277 284 251 275 272 241 265 281 292 285 274 282
Find the association between level of nitrogen and crop yield. Test the association
between level of nitrogen and crop yield.
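A sketch of the computation in plain Python, using the standard t statistic t = r√(n − 2)/√(1 − r²) for testing H0: ρ = 0:

```python
from math import sqrt

x = [2.50, 2.55, 2.54, 2.65, 2.68, 2.55, 2.62, 2.57,
     2.63, 2.59, 2.69, 2.61, 2.67, 2.57, 2.53]   # nitrogen level
y = [247, 245, 266, 277, 284, 251, 275, 272,
     241, 265, 281, 292, 285, 274, 282]          # crop yield
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sp = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ssx = sum((xi - xbar) ** 2 for xi in x)
ssy = sum((yi - ybar) ** 2 for yi in y)

r = sp / sqrt(ssx * ssy)   # a positive, moderate correlation

# Test H0: rho = 0 by comparing t with the t distribution on n - 2 df.
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
```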
Which of the following is the correct formula for the probable error of r?
1. P.E. = 0.6475 (1 − r²)/√n
2. P.E. = 0.6475 (1 − r²)/n
3. P.E. = 0.6475 (1 − r)²/√n
4. P.E. = 0.6745 (1 − r²)/√n
(e). All the points lie on the line y = 2x with regression coefficient b and correlation
coefficient r. Find b and r.
1) b = 1 and r = 2
2) b = 2 and r = 2
3) b = 2 and r = 1
4) b = 1/2 and r = 2.
(f). The correlation coefficient r satisfies 0 ≤ r² ≤ 1. Which of the following
statements is true?
1) 0 ≤ r ≤ 1. 2) r ≥ 1 3) r ≤ −1 4) −1 ≤ r ≤ 1
(g). Find the correlation coefficient for 6 pairs of observations if the LSR line is
y = 0.5 + 0.05x and if 81% of the variation in y is explained by regression on x.
1) 0.9 2) 0.81 3) −0.05 4) None of these.
(h). For the bivariate data (x1, y1), (x2, y2), …, (xn, yn), the least squares regression
line is fitted. The line is ŷ = 2.51 + 4.1x. You know that the first data point is
(x1, y1) = (0.1, 2.0), so the residual at this point is:
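A sketch of the residual computation, assuming the fitted line is ŷ = 2.51 + 4.1x (the slope's sign is not fully legible in the source):

```python
# Residual = observed y - fitted y at the same x.
x1, y1 = 0.1, 2.0

y_hat = 2.51 + 4.1 * x1   # fitted value: 2.92
residual = y1 - y_hat     # 2.0 - 2.92 = -0.92
```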
(i). The correlation coefficient for a set of bivariate data (xi,yi) is r = 0.87, where the xi
are measured in inches and the yi are measured in lbs. A second analyst records the xi
values in cm (1 inch ≈ 2.5 cm). What is the second analyst's value of the correlation
coefficient (to 2dp)?
1) 0.35
2) 0.87
3) 2.18
4) Unable to determine without knowing the yi units.
[Figure: scatter diagram with the origin moved to the point (x̄, ȳ), with ȳ = 11
marked on the vertical axis; the four new quadrants are labeled Quadrant 2 | Quadrant 1
above ȳ and Quadrant 3 | Quadrant 4 below ȳ, with the x-axis running from 0 to 7.]
Causation
If there is a significant linear correlation between two variables, then one of five
situations can be true.
There are some common errors that are made when looking at correlation.
Avoid concluding causation. We just got through talking about causation. Just
because there is a linear relationship doesn't mean that one thing caused the other.
It could be any of the five situations above.
Avoid data based on rates or averages. Variation is suppressed when using a rate
or an average. The variance of the sample means was the variance of the
population divided by the sample size. So, if you work with averages, the
variances are smaller and you might be able to find linear relationships that are
significant when they would not be if the original data was used.
Watch out for linearity. All that we're testing here is the strength of a linear
relationship. There are other kinds of relationships. In algebra, we talk about
linear, quadratic, cubic, exponential, logarithmic, Gaussian (bell shaped),
logistic, and power models. A scatter plot is a good way to look for patterns.
Correlation is:
(a) the covariance of standardized scores
(b) the mean of the population standard deviations
(c) a way of testing cause and effect
(d) for comparing mean differences
(e) none of the above
What would you expect the correlation between daily calorie consumption and body
weight to be?
(a) moderate to large positive
(b) small positive
(c) zero or near zero
(d) small negative
(e) moderate to large negative
The measure of how well the regression line fits the data is the:
1. Coefficient of determination
2. Slope of the regression line
3. Mean square error
4. Standard error of the regression coefficient
As the relationship deteriorates from a perfect correlation, what happens to the points
on a scatter diagram?
1. They become more scattered
2. The slope changes
3. The y-intercept changes
4. Both B and C, above
5. None of the above
Observed errors, which represent information from the data which is not explained by
the model, are called?
1. Marginal values
2. Residuals
3. Mean square errors
4. Standard errors
5. None of the above
In an experiment an analyst has observed that SP(x,y) equals -212.35, SS(x) equals
237.16 and SS(y) = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assuming that a linear regression model is appropriate, the
least squares estimate for β₀ is ____________.
1. -0.859
2. -0.895
3. 188.099
4. 206.710
5. 218.719
In an experiment an analyst has observed that SP(xy) equals -212.35, SS(x) equals
237.16 and SS(y) = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assuming that a linear regression model is appropriate, the
least squares estimate for β₁ is ____________.
1. -0.859
2. -0.895
3. 188.099
4. 206.710
5. 218.719
In an experiment an analyst has observed that SPxy equals -212.35, SSx equals
237.16 and SSy = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assume that a linear regression model is appropriate. These
results imply that, if X equals 200, the expected value for Y would be ____________.
1. 37,618.95
2. 26,824.83
3. 12.5
4. 11.2
5. 9.02
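These regression questions share the same summary statistics; a minimal sketch of the least squares computations from the given SP and SS values:

```python
# Summary statistics given in the questions above.
sp_xy = -212.35   # SP(x, y)
ss_x = 237.16     # SS(x)
ss_y = 858.49     # SS(y)
xbar, ybar = 193.1, 15.2

b1 = sp_xy / ss_x         # slope estimate, about -0.895
b0 = ybar - b1 * xbar     # intercept estimate, about 188.099

y_at_200 = b0 + b1 * 200  # predicted Y at X = 200, about 9.02

# Proportion of variation in Y attributable to variation in X.
r_squared = sp_xy ** 2 / (ss_x * ss_y)
```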
In an experiment an analyst has observed that SPxy equals -212.35, SSx equals
237.16 and SSy = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assuming that a linear regression model is appropriate,
approximately ____________ of variation in Y could be attributed to variation in X.
1. 13%