
Correlation Analysis

Noor Md. Rahmatullah


Professor

Bivariate distribution:
There are many situations in which we are interested in studying the
relationship between two variables, such as:
1. The amount of rainfall and the yield of a certain crop
2. The amount of FGR and fish weight
3. The height and weight of a group of children
4. The income and expenditure of several families
5. The heart girth and body weight of animals, etc.

A distribution in which we consider pairs of observations simultaneously is
known as a bivariate distribution.

What is correlation?
When there is a relationship between quantitative measurements of two sets
of phenomena, the appropriate statistical tool for discovering and measuring that
relationship, and for expressing it in a precise way, is known as correlation.

Correlation may be linear or non-linear. If the amount of change in one
variable tends to bear a constant ratio to the amount of change in the other variable,
the correlation is said to be linear, because the scatter diagram would show a linear
path. Here we shall be concerned with linear (simple) correlation only.

Karl Pearson’s coefficient of correlation:

Pearson's product-moment correlation coefficient (PMCC) is a statistic
used to estimate the intensity, or degree, of the linear relationship between two
variables. It is a numerical estimate of both the strength and the direction of the
linear relationship. It is calculated when the scale of measurement is interval or
ratio. This statistic is typically referred to simply as the correlation coefficient,
and it is given by:
\[
r_{xy} = \frac{\operatorname{Cov}(x,y)}{\sqrt{V(x)\,V(y)}}
= \frac{\frac{1}{n}\sum (x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\frac{1}{n}\sum (x_i-\bar{x})^2 \cdot \frac{1}{n}\sum (y_i-\bar{y})^2}}
= \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}
       {\sqrt{\sum (x_i-\bar{x})^2\,\sum (y_i-\bar{y})^2}}
= \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}
       {\sqrt{\left(\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right)\left(\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right)}}
= \frac{SP(x,y)}{\sqrt{SS(x)\,SS(y)}}
\]
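As a quick numerical sketch of the formula above, the computational form SP(x, y)/√(SS(x)·SS(y)) can be coded directly. The data here are made up purely for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the computational form r = SP(x, y) / sqrt(SS(x) * SS(y))."""
    n = len(x)
    sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # SP(x, y)
    ss_x = sum(a * a for a in x) - sum(x) ** 2 / n                # SS(x)
    ss_y = sum(b * b for b in y) - sum(y) ** 2 / n                # SS(y)
    return sp / math.sqrt(ss_x * ss_y)

# Made-up illustrative data with a roughly increasing pattern
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(pearson_r(x, y), 3))   # → 0.853
```

A perfectly linear data set, such as y = 2x, gives r = 1 exactly, as expected.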
• The correlation coefficient computed from sample data measures the strength
and direction of a linear relationship between two variables.
• The population correlation coefficient, denoted ρ, is the correlation
computed by using all the possible pairs of data values (x, y) taken from the
population.
• The sample correlation coefficient, denoted r, is the correlation computed
from data obtained from a sample.

Assumptions underlying Karl Pearson’s correlation coefficient:

Pearson's correlation coefficient r is based on the following assumptions:
1. The variables under study are measured on an interval or ratio scale.
2. The two variables follow a bivariate normal distribution.
3. The relationship between the variables is linear.
4. The sample is of adequate size to assume normality.

Geometrical interpretation of product moment correlation coefficient:


We can see from a scatter diagram whether two variables are correlated, but
the diagram alone gives no measure of how strong the relationship is. The diagram
below shows a typical scatter diagram for a bivariate distribution, with the n pairs
of observations of the two variables x and y plotted.

[Scatter diagram of y against x with the origin moved to the centroid (x̄, ȳ). A
typical point (x_i, y_i) is labeled with its new coordinates (x_i − x̄, y_i − ȳ);
the axes through (x̄, ȳ) divide the plot into a 1st, 2nd, 3rd and 4th quadrant.]

To show any relationship more clearly, we can move the origin of the diagram to
the point (x̄, ȳ). The coordinates of a typical point (x_i, y_i) will now be written
as (x_i − x̄, y_i − ȳ); the ith point on the diagram has been labeled to show this.
Looking at the scatter diagram, we can see the signs (+ or −) taken by all the
(x_i − x̄) and (y_i − ȳ) in the four new quadrants. By multiplying the signs, we can
find the sign taken by their product (x_i − x̄)(y_i − ȳ).

2 D:\Class Notes\Correlation (Rahmat)\01 Correlation (Revised).doc


Quadrant    (x_i − x̄)    (y_i − ȳ)    (x_i − x̄)(y_i − ȳ)
First           +             +                  +
Second          −             +                  −
Third           −             −                  +
Fourth          +             −                  −

Now think about the three types of correlation:

• If there is a positive correlation, most points lie in the first and third
quadrants, so Σ(x_i − x̄)(y_i − ȳ) will be positive.
• If there is a negative correlation, most points lie in the second and fourth
quadrants, so Σ(x_i − x̄)(y_i − ȳ) will be negative.
• If there is no linear relationship, the points lie in all four quadrants and
Σ(x_i − x̄)(y_i − ȳ) is zero or close to zero, since the positive and negative
values tend to cancel out.
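This sign argument is easy to check numerically. The sketch below, with made-up data, shows that the sum of cross-deviation products is positive for a rising pattern and negative for a falling one:

```python
def sum_of_products(x, y):
    """Sum of cross-deviation products about the centroid (x̄, ȳ)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

x = [1, 2, 3, 4, 5]
print(sum_of_products(x, [2, 4, 6, 8, 10]) > 0)   # rising pattern → True (positive sum)
print(sum_of_products(x, [10, 8, 6, 4, 2]) < 0)   # falling pattern → True (negative sum)
```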

Example: Judging correlation from scatter plots

Scatterplot

The most useful graph for displaying the relationship between two quantitative
variables is a scatterplot.
A scatterplot shows the relationship between two quantitative variables measured for
the same individuals. The values of one variable appear on the horizontal axis, and the
values of the other variable appear on the vertical axis. Each individual in the data
appears as a point on the graph.



Properties of the correlation coefficient:
1. The correlation coefficient is independent of the choice of origin and scale of
measurement of the variables.
2. The correlation coefficient measures only the linear relationship.
3. The correlation coefficient is a symmetric measure with respect to x and y;
symbolically, rxy = ryx. Interchanging x and y does not change the value of r.
4. The correlation coefficient lies between +1 and −1; symbolically, −1 ≤ r ≤ 1.
5. The correlation coefficient is positive or negative according to the sign of
Cov(x, y).
6. The correlation coefficient is a pure number without units, i.e. dimensionless.

r is not affected by:


-- interchanging the two variables (it makes no difference which variable is
called x and which is called y)
-- adding the same number to all the values of one variable
-- multiplying all the values of one variable by the same positive number

Because r uses the standardized values of the observations, r does not change
when we change units of measurement (inches vs. centimeters, pounds vs.
kilograms, miles vs. meters, etc.). In this sense, r is "scale invariant".
7. Correlation requires that both variables be quantitative (numerical).
You can't calculate a correlation between "income" and "city of residence"
because "city of residence" is a qualitative (non-numerical) variable.
8. The correlation can be misleading in the presence of outliers or nonlinear
association.
The correlation coefficient r does not describe curved relationships, and r is
affected by outliers. When possible, check the scatter plot.
9. Correlation measures association, but association does not necessarily imply
causation.
Both variables may be influenced simultaneously by some third variable.
10. The ratio of two values of rxy does not show the relative closeness of the
correlations.
If rxy = 0.6, it does not mean that this correlation is twice as strong as one
whose value is rxy = 0.3.
11. The value rxy = 0.5 shows as much closeness in the positive direction as
rxy = −0.5 shows in the negative direction.
12. Correlation applies to random variables measured at the interval or ratio
level of measurement.
13. If x and y are independent, then rxy = 0; however, the converse is not true.



 Prove that the correlation coefficient r lies between –1 and +1;
symbolically, −1 ≤ r ≤ 1.

Let x and y be the variables and let (x1, y1), (x2, y2), …, (xn, yn) denote n pairs of
observations with means (x̄, ȳ) and standard deviations sx and sy respectively.
Define the standardized variates as follows:
\[
u_i = \frac{x_i - \bar{x}}{s_x}
\;\Rightarrow\; u_i^2 = \frac{(x_i - \bar{x})^2}{s_x^2}
\;\Rightarrow\; \sum u_i^2 = \frac{\sum (x_i - \bar{x})^2}{s_x^2} = n ,
\]
and similarly, with
\[
v_i = \frac{y_i - \bar{y}}{s_y}, \qquad \sum v_i^2 = n .
\]

Again,
\[
\sum u_i v_i = \sum \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
= \frac{n\operatorname{Cov}(x, y)}{s_x s_y}
= nr ,
\]
where r denotes the correlation coefficient between x and y.

Now, (u_i ± v_i)² can never be negative, because it is a perfect square. Hence the sum
of all such squares for i = 1, 2, 3, …, n cannot be negative; i.e.

\[
\sum (u_i \pm v_i)^2 \ge 0
\;\Rightarrow\; \sum u_i^2 + \sum v_i^2 \pm 2\sum u_i v_i \ge 0
\;\Rightarrow\; n + n \pm 2nr \ge 0
\;\Rightarrow\; 2n(1 \pm r) \ge 0
\;\Rightarrow\; 1 \pm r \ge 0 \quad (\text{since } n > 0).
\]

Taking the plus sign gives 1 + r ≥ 0, i.e. r ≥ −1; taking the minus sign gives
1 − r ≥ 0, i.e. r ≤ 1. Hence −1 ≤ r ≤ 1.

This proves that the correlation coefficient lies between –1 and +1.
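The two identities used in the proof, Σu² = n and Σuv = nr, can be verified numerically. The data below are made up for illustration, and the standard deviation divides by n, as the proof assumes:

```python
import math

def standardize(data):
    """Standardize using the population standard deviation (divide by n)."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((d - m) ** 2 for d in data) / n)
    return [(d - m) / s for d in data]

x = [2.0, 4.0, 4.5, 7.0, 9.0]   # made-up sample
y = [1.0, 3.0, 2.5, 6.0, 7.5]
u, v = standardize(x), standardize(y)
n = len(x)

print(round(sum(ui * ui for ui in u), 6))       # Σu² = n → 5.0
r = sum(ui * vi for ui, vi in zip(u, v)) / n    # from Σuv = nr
print(-1.0 <= r <= 1.0)                         # → True, as the proof guarantees
```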



 Correlation coefficient is independent of change of origin and scale.

Proof:
We know that the correlation coefficient between x and y is given by

\[
r_{xy} = \frac{SP(x, y)}{\sqrt{SS(x)\,SS(y)}}
= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} .
\]

Suppose the transformations \(u_i = \frac{x_i - a}{c}\) and \(v_i = \frac{y_i - b}{d}\)
define a change of origin and scale (where c, d > 0).

Now,
\[
u_i = \frac{x_i - a}{c} \;\Rightarrow\; \bar{u} = \frac{\bar{x} - a}{c}
\;\Rightarrow\; u_i - \bar{u} = \frac{x_i - a}{c} - \frac{\bar{x} - a}{c}
= \frac{x_i - \bar{x}}{c}
\;\Rightarrow\; x_i - \bar{x} = c(u_i - \bar{u}) .
\]
Similarly, \(y_i - \bar{y} = d(v_i - \bar{v})\). Therefore

\[
r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
= \frac{cd \sum (u_i - \bar{u})(v_i - \bar{v})}
       {\sqrt{c^2 \sum (u_i - \bar{u})^2 \cdot d^2 \sum (v_i - \bar{v})^2}}
= \frac{cd \sum (u_i - \bar{u})(v_i - \bar{v})}
       {cd \sqrt{\sum (u_i - \bar{u})^2 \sum (v_i - \bar{v})^2}}
= r_{uv} .
\]

Hence rxy = ruv. This proves that the correlation coefficient is not affected by a
change of origin and scale.
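A numerical check of this invariance, with made-up data and an arbitrary change of origin and scale:

```python
import math

def pearson_r(x, y):
    """Pearson's r from mean-deviation sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return sp / math.sqrt(ssx * ssy)

x = [12.0, 15.0, 11.0, 19.0, 16.0]   # made-up data
y = [30.0, 42.0, 28.0, 55.0, 46.0]

# Change of origin and scale: u = (x - 10)/2, v = (y - 25)/5, with c, d > 0
u = [(a - 10) / 2 for a in x]
v = [(b - 25) / 5 for b in y]

print(abs(pearson_r(u, v) - pearson_r(x, y)) < 1e-12)   # → True: r_uv = r_xy
```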

Why is correlation important?

Correlation is a widely used statistic in many fields of research. At some point
in your career as a researcher you may be asked to describe the association (strength
and direction) between variables; correlation is the statistical technique that
allows us to do this. Many pairs of variables show some kind of relationship: for
instance, there is a relationship between nitrogen rate and yield, between price and
supply, and between income and expenditure. With the help of correlation analysis we
can measure the degree of relationship in a single figure.

Degrees of Correlation

Through the coefficient of correlation, we can measure the degree or extent of the
correlation between two variables. On the basis of the coefficient of correlation we
can also determine whether the correlation is positive or negative and also its degree
or extent.

1. Perfect correlation: If two variables change in the same direction and in the
same proportion, the correlation between the two is perfect positive.
According to Karl Pearson, the coefficient of correlation in this case is +1. On
the other hand, if the variables change in opposite directions and in the same
proportion, the correlation is perfect negative; its coefficient of correlation is
−1. In practice we rarely come across these types of correlation.
2. Absence of correlation: If two variables exhibit no relation between them, or
a change in one variable does not lead to a change in the other variable, then
we can say that there is no correlation between the two variables. In such a
case the coefficient of correlation is 0.
3. Limited degree of correlation: If two variables are neither perfectly
correlated nor completely uncorrelated, we term the correlation limited. It may
be positive, negative or zero, but it lies within the limits ±1.

High degree, moderate degree and low degree are the three categories of this kind of
correlation. The following table shows the conventional degrees of the coefficient of
correlation.

Degrees                   Positive            Negative

Absence of correlation    0                   0
Perfect correlation       +1                  −1
High degree               +0.75 to +1         −0.75 to −1
Moderate degree           +0.25 to +0.75      −0.25 to −0.75
Low degree                0 to +0.25          0 to −0.25

Interpretation of r:
The correlation coefficient always lies between –1 and +1. To interpret the
correlation coefficient, we must consider both its sign (positive or negative), which
indicates the direction of the relationship, and its absolute value, which indicates
the strength of the linear relationship. A perfect positive correlation has a
coefficient of 1.0; a perfect negative correlation has a coefficient of −1.0. When
there is no association between two variables, the correlation coefficient has a
value of 0. A value of zero indicates that the variables are not linearly related,
though they may have a more complex, non-linear relationship.
Review each correlation coefficient presented below and determine its direction and
strength.
1. -0.38
The negative sign tells us that this is a negative correlation: a high score on
the X variable would predict a low score on the Y variable. The absolute value of
0.38 would be considered moderate in size in agricultural research. It is not a
terribly strong relationship, but there is definitely a linear relationship
between the two variables.

2. 0.23
This is a small, positive correlation. A high score on the X variable would
predict a high score on the Y variable but not with a great deal of accuracy. There
would be a fair amount of scatter on the bivariate plot but a definite linear
relationship could be seen.



3. –0.50
This is a negative correlation. The negative sign indicates that a high score on
the X variable would predict a low score on the Y variable. The absolute value of
0.50 suggests a fairly predictable relationship between X and Y. A correlation of –
0.50 would be considered a moderate correlation in agricultural research even
though it is only half-way between no relationship and a perfect relationship.
There would be a modest amount of scatter on the bivariate plot.

4. 0.84
This is a strong positive correlation. A high score on the X variable would
predict a high score on the Y variable. The absolute value of the correlation is
0.84, which is close to 1.0. There would not be a lot of scatter on a bivariate
plot. This would be considered a high degree of correlation.

5. –1.0
This is a perfect negative correlation. The data would follow a perfectly
straight line on a scatter plot beginning in the upper left corner of the plot and
progressing downward to the lower right corner of the plot. A high score on the X
variable would predict a low score on the Y variable. There would be no scatter at
all on the plot.

6. 0.11
This is a small, positive correlation. The positive sign indicates that a high
score on the X variable would predict a high score on the Y variable. But, an
absolute value of 0.11 suggests a very small linear relationship. There would
be a large amount of scatter on the bivariate plot.

7. –0.06
This correlation coefficient is close to zero. Even though the sign of the
correlation coefficient is negative, the fact that its absolute value is so close to
zero would lead to an interpretation of no relationship. There would be a large
amount of scatter on the bivariate plot, which would appear to be random.

8. 0.62
This is a positive correlation. The positive sign indicates that a high score on the X
variable would predict a high score on the Y variable. The absolute value of 0.62
suggests a fairly predictable relationship between X and Y; there would be only a
modest amount of scatter on the bivariate plot.

9. -0.75
This is a high degree of negative correlation. The negative sign indicates that a
high score on the X variable would predict a low score on the Y variable. The
absolute value of 0.75 indicates a strong relationship. There would be only a
modest amount of scatter on the bivariate plot.



Scatter diagram:
A graph that shows the relationship between two variables is called a scatter
diagram or scatter plot. A bivariate plot graphs the relationship between two variables
that have been measured on a single sample of subjects. Such a plot permits you to
see at a glance the degree and pattern of relationship between the two variables.

Nuance: Why is the maximum r = 1?

Consider the correlation between your height in inches and your height in
feet. This is a pointless correlation to compute (see the graph below), because the
two measurements carry exactly the same information.
[Scatter plot: Height in Feet (4.8 to 6.0) against Height in Inches (58 to 72);
every point lies exactly on a straight line.]

Obviously the relationship will be perfect: all the points lie on the line, and there
is no spread in the scatter plot. Thus your \(Z_x = \frac{x_i - \bar{x}}{\sigma_x}\)
will be equal to your \(Z_y = \frac{y_i - \bar{y}}{\sigma_y}\). That is because you
stand in the same relative location in the height distribution no matter whether it
is measured in feet or in inches. Such a relationship gives r = 1.0 in the simple
derivation below.



\[
r_{xy} = \frac{\operatorname{Cov}(x, y)}{\sigma_x \sigma_y}
= \frac{1}{n} \cdot \frac{n\operatorname{Cov}(x, y)}{\sigma_x \sigma_y}
= \frac{1}{n} \cdot \frac{SP(x, y)}{\sigma_x \sigma_y}
= \frac{1}{n} \cdot \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}
= \frac{1}{n} \sum \frac{(x_i - \bar{x})}{\sigma_x} \cdot \frac{(y_i - \bar{y})}{\sigma_y}
\]
and since every point here satisfies
\(\frac{y_i - \bar{y}}{\sigma_y} = \frac{x_i - \bar{x}}{\sigma_x}\),
\[
r_{xy} = \frac{1}{n} \sum \frac{(x_i - \bar{x})^2}{\sigma_x^2}
= \frac{\sum (x_i - \bar{x})^2}{n \sigma_x^2}
= \frac{\sigma_x^2}{\sigma_x^2} = 1 .
\]

Correlation Does Not Imply Causation

Just because one variable relates to another variable does not mean that
changes in one cause changes in the other. Other variables may be acting on one or
both of the related variables and affect them in the same direction. Cause and effect
may be present, but correlation does not prove cause. For example, the length of a
person's pants and the length of their legs are positively correlated: people with
longer legs have longer pants; but increasing one's pant length will not lengthen
one's legs!

Property of Linearity
The conclusion of no significant linear correlation does not mean that X and Y
are not related in any way. The data depicted in Fig. 2 result in r = 0, indicating no
linear correlation between the two variables. However, close examination shows a
definite pattern in the data, reflecting a very strong nonlinear relationship.
Pearson's correlation applies only to linear relationships.



[Parabolic scatter plot: x from −15 to 15, y from 0 to 120.]

Fig. 2: Nonlinear relationship between x and y, rxy = 0, where y = x².

If two variables are independent, their correlation coefficient is zero. Is the
converse true? Explain by means of an example.

If two variables are independent, their correlation coefficient is zero, but the
converse is not true. A zero correlation coefficient does not necessarily signify that
the variables are independent; it only implies that there is no linear relationship
between the variables. The possible existence of a non-linear relationship cannot be
ruled out altogether. For example, let us consider the following data:

x: -3 -2 -1 0 1 2 3
y: 9 4 1 0 1 4 9

Here Σx = 0, Σy = 28, Σxy = 0 and n = 7, so

\[
SP(x, y) = \sum xy - \frac{\sum x \sum y}{n} = 0 .
\]

Therefore \(r_{xy} = \frac{SP(x, y)}{\sqrt{SS(x) \cdot SS(y)}} = 0\), i.e. the
correlation coefficient between x and y is zero. But it may be noticed that x and y
are bound by the relation y = x², so x and y are not independent. Thus the
correlation may be zero even when the variables are not independent.
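The arithmetic of this example can be checked in a few lines:

```python
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]        # y = x²: a perfect (nonlinear) dependence
n = len(x)

# SP(x, y) = Σxy − (Σx)(Σy)/n
sp = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
print(sp)   # → 0.0, hence r = 0 although y is completely determined by x
```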

Coefficient of Determination (r2)

The relationship between two variables can be represented by the overlap of
two circles, one for each variable (Fig. 3). If the circles do not overlap, no
relationship exists; if they overlap completely, the correlation is r = 1.0. If the
circles overlap partially, as in Fig. 3, the area of overlap represents the amount of
variance in the dependent variable (Y) that can be explained by the independent
variable (X). The area of overlap, called the percent common variance, is calculated
as

r² × 100.

For example, if two variables are correlated at r = 0.71, they have 50% common
variance (0.71² × 100 ≈ 50%), indicating that 50% of the variability in Y can be
explained by variance in X. The remaining 50% of the variance in Y remains
unexplained; this unexplained variance reflects the error when predicting Y from X.
As another example, strength and speed are related at about r = 0.80 (r² = 64%
common variance), indicating that 64% of the variation in strength and speed comes
from common factors while the remaining 36% remains unexplained by the correlation.
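The percent-common-variance arithmetic from the paragraph above, as a one-line check:

```python
# r = 0.71 → about 50% common variance; r = 0.80 → 64%
print(round(0.71 ** 2 * 100, 1))   # → 50.4
print(round(0.80 ** 2 * 100))      # → 64
```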

Fig. 3: Example of the coefficient of determination (percent common variance,
r² × 100). [Two overlapping circles: the x variable (predictor) and the y variable
(predicted); the area of overlap represents r² × 100, the percent common variance.]

Probable Error of the correlation coefficient:

If r is the correlation coefficient in a sample of n pairs of observations, then its
standard error is given by

\[
SE(r) = \frac{1 - r^2}{\sqrt{n}} .
\]

The probable error of the correlation coefficient is given by

\[
P.E. = 0.6745 \times SE(r) = 0.6745 \times \frac{1 - r^2}{\sqrt{n}} .
\]

The probable error is a measure for testing the reliability of an observed
correlation coefficient. The reason for taking the factor 0.6745 is that in a normal
distribution, the range μ ± 0.6745σ covers 50% of the total area.



i. If the value of r is less than the P.E., then there is no evidence of
correlation, i.e. r is not significant.
ii. If r is more than 6 times the P.E., r is practically certain, i.e. significant.
iii. By adding and subtracting the P.E. to and from r, we get the upper and lower
limits within which the population correlation coefficient can be expected to lie.

Symbolically, ρ = r ± P.E.,

where ρ is the population correlation coefficient.
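A small sketch of the probable-error rule, using made-up values r = 0.8 and n = 100:

```python
import math

def probable_error(r, n):
    """P.E.(r) = 0.6745 * (1 - r**2) / sqrt(n)"""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

pe = probable_error(0.8, 100)
print(round(pe, 3))        # → 0.024
print(0.8 > 6 * pe)        # → True: r exceeds 6 × P.E., so r is practically certain
```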

State in each case whether you would expect to obtain a positive, negative or no
correlation between:
 Age and blood pressure.
 Air temperature and metabolic rate.
 Amount of rainfall and yield of a certain crop.
 Dose of nitrogen and yield of a certain crop.
 Drug dose and blood pressure.
 Food intake and weight.
 Idle time of machine and volume of production.
 Income and expenditure of several families.
 Increase in rainfall up to a point and production of rice.
 Investment and profit
 Number of goals conceded by a team and their position in the league.
 Number of hours studied and grade obtained.
 Number of tillers and yield of wheat.
 Number of errors and typing speed.
 Panicle length and yield of rice.
 Price and demand of commodities.
 Production and price per unit.
 Sale of cold-drinks and day temperature.
 Sale of woolen garments and day temperature.
 Shoe size and intelligence.
 Supply and Price of commodities.
 Temperature and percentage breakage of unhusked rice in milling.
 The age of husbands and wives.
 The height and weight of a group of children.
 Weight and blood pressure.
 Years of education and income.



Example: If r = 0.6 and n = 64, find the probable error of the coefficient of
correlation, and find the limits within which the population correlation coefficient
may be expected to lie.

Solution:

\[
P.E. = 0.6745 \times \frac{1 - r^2}{\sqrt{n}}
= 0.6745 \times \frac{1 - (0.6)^2}{\sqrt{64}}
= 0.6745 \times \frac{0.64}{8}
\approx 0.054
\]

Expected population correlation coefficient: ρ = r ± P.E. = 0.6 ± 0.054, i.e.
between 0.546 and 0.654.

Example: Find the coefficient of correlation r, given that

Cov(x, y) = −16.5, Var(x) = 2.85 and Var(y) = 100

Solution: Putting the given values in the formula,

\[
r = \frac{\operatorname{Cov}(x, y)}{\sqrt{\operatorname{Var}(x) \times \operatorname{Var}(y)}}
= \frac{-16.5}{\sqrt{2.85 \times 100}}
= -0.977 \approx -0.98
\]
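The same substitution, checked in code:

```python
import math

cov_xy = -16.5
var_x, var_y = 2.85, 100.0

r = cov_xy / math.sqrt(var_x * var_y)
print(round(r, 2))   # → -0.98
```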

Problem:

A group of n = 15 strawberry plants was grown in plots in a greenhouse, and
measurements were taken of the crop yield (y) and the corresponding level of nitrogen
(x) present in the leaf at the time of picking:

x 2.50 2.55 2.54 2.65 2.68 2.55 2.62 2.57 2.63 2.59 2.69 2.61 2.67 2.57 2.53
y 247 245 266 277 284 251 275 272 241 265 281 292 285 274 282

Find and test the association between the level of nitrogen and the crop yield.
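A sketch of how the computation might be set up (not the official solution; the test of H0: ρ = 0 shown here uses the usual t statistic with n − 2 degrees of freedom, and the critical value would still need to be looked up in a t table):

```python
import math

x = [2.50, 2.55, 2.54, 2.65, 2.68, 2.55, 2.62, 2.57,
     2.63, 2.59, 2.69, 2.61, 2.67, 2.57, 2.53]
y = [247, 245, 266, 277, 284, 251, 275, 272,
     241, 265, 281, 292, 285, 274, 282]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sp = sum((a - mx) * (b - my) for a, b in zip(x, y))   # SP(x, y)
ssx = sum((a - mx) ** 2 for a in x)                   # SS(x)
ssy = sum((b - my) ** 2 for b in y)                   # SS(y)

r = sp / math.sqrt(ssx * ssy)                         # sample correlation
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)      # t statistic for H0: rho = 0
print(round(r, 3), round(t, 3))
```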



Objective Type Questions:

Comment on the following:

a) rxy = 0 ⇒ x and y are independent.
b) If rxy > 0, then r(x, −y) < 0, r(−x, y) < 0 and r(−x, −y) > 0.
c) The Pearson correlation coefficient is independent of origin but not of scale.
d) If the correlation coefficient between the variables x and y is zero, then the
correlation coefficient between the variables x² and y² is zero.
e) The numerical value of the product moment correlation coefficient r between
two variables x and y cannot exceed unity.
f) r measures every type of relationship between two variables.
g) If r > 0, then as x increases, y also increases.
h) If rxy = 0.9, then for large values of x, what sort of values do we expect for y?
i) If rxy = 0, what is the value of Cov(x, y), and how are x and y related?

 Indicate the correct answer using a tick (√) mark:

a). The coefficient of correlation will have a positive sign when

1. x is increasing, y is decreasing 2. both x and y are increasing

3. x is decreasing, y is increasing 4. there is no change in x and y.

b). The coefficient of correlation

1. can not be positive 2. cannot be negative

3. is always positive 4. can be both positive as well as negative.

c). The correlation coefficient

1. can take any value between –1 and +1 2. is always less than –1

3. is always more than +1 4. can not be zero.

d). The probable error of r is

1. P.E. = 0.6475 × (1 − r²)/n        2. P.E. = 0.6475 × (1 + r²)/√n

3. P.E. = 0.6475 × (1 − r²)/√n       4. P.E. = 0.6745 × (1 − r²)/√n



(e). The following table gives the relationship between pairs of data values (xi,yi) for
i = 1, 2, …, 5.
x 1 2 3 4 5
y 2 4 6 8 10

All the points lie on the line y = bx with regression coefficient b and correlation
coefficient r. Find b and r.

1) b = 1 and r = 2
2) b = 2 and r = 2
3) b = 2 and r = 1
4) b = 1/2 and r = 2.
(f). The correlation coefficient r satisfies 0 ≤ r² ≤ 1. Which of the following
statements is true?

1) 0 ≤ r ≤ 1 2) r ≥ 1 3) r ≤ −1 4) −1 ≤ r ≤ 1

(g). Find the correlation coefficient for 6 pairs of observations if the LSR line is
ŷ = 0.5 − 0.05x and 81% of the variation in y is explained by the regression on x.
1) 0.9 2) 0.81 3) −0.05 4) None of these.

(h). For bivariate data (x1, y1), (x2, y2), …, (xn, yn), the least squares regression
line is fitted. The line is ŷ = 2.51 − 4.1x. You know that the first data point is
(x1, y1) = (0.1, 2.0), so the residual at this point is:

1) 2.1 2) −0.1 3) 0.1 4) 2.0

(i). The correlation coefficient for a set of bivariate data (xi, yi) is r = 0.87,
where the xi are measured in inches and the yi are measured in lbs. A second analyst
records the xi values in cm (1 inch ≈ 2.5 cm). What is the second analyst's value of
the correlation coefficient (to 2 d.p.)?

1) 0.35
2) 0.87
3) 2.18
4) Unable to determine without knowing the yi units.



 Justification for the r formula

[Figure: scatter plot divided into four quadrants about the centroid (x̄, ȳ) = (3, 11)
of the sample points. For the sample point (7, 23) in Quadrant 1, the deviations are
x − x̄ = 7 − 3 = 4 and y − ȳ = 23 − 11 = 12.]


Methods of studying correlation
The following are the important methods of ascertaining whether two variables are
correlated or not:
1. Scatter Diagram Method
2. Karl Pearson's Coefficient of Correlation
3. Spearman's Rank Correlation Coefficient

 A scatterplot (or scatter diagram) is a graphical representation in which the
paired (x, y) sample data are plotted with a horizontal x axis and a vertical y
axis. Each individual (x, y) pair is plotted as a single point.

Causation

If there is a significant linear correlation between two variables, then one of five
situations can be true.

1. There is a direct cause and effect relationship


2. There is a reverse cause and effect relationship
3. The relationship may be caused by a third variable
4. The relationship may be caused by complex interactions of several variables
5. The relationship may be coincidental

Common Errors in Correlation

There are some common errors that are made when looking at correlation.

 Avoid concluding causation. We just got through talking about causation. Just
because there is a linear relationship doesn't mean that one thing caused the other.
It could be any of the five situations above.
 Avoid data based on rates or averages. Variation is suppressed when using a rate
or an average. The variance of the sample means is the variance of the
population divided by the sample size. So, if you work with averages, the
variances are smaller and you may find linear relationships that are
significant when they would not be if the original data were used.
 Watch out for linearity. All that we are testing here is the strength of a linear
relationship. There are other kinds of relationships. In algebra, we talk about
linear, quadratic, cubic, exponential, logarithmic, Gaussian (bell-shaped),
logistic, and power models. A scatter plot is a good way to look for patterns.



Multiple-Choice Quiz

Correlation is a __________ type of statistical analysis.


(a) univariate
(b) bivariate
(c) multivariate
(d) none of these

Correlation is:
(a) the covariance of standardized scores
(b) the mean of the population standard deviations
(c) a way of testing cause and effect
(d) for comparing mean differences
(e) none of the above

What would you expect the correlation between daily calorie consumption and body
weight to be?
(a) moderate to large positive
(b) small positive
(c) zero or near zero
(d) small negative
(e) moderate to large negative

The square of the correlation coefficient, r², is called the


(a) coefficient of determination
(b) variance
(c) covariance
(d) cross-product
(e) none of the above

The measure of how well the regression line fits the data is the:
1. Coefficient of determination
2. Slope of the regression line
3. Mean square error
4. Standard error of the regression coefficient

The assumptions of the simple linear regression model include:


1. The errors are normally distributed
2. The error terms have a constant variance
3. The errors have a mean of zero
4. A and B
5. A, B, and C

As the relationship deteriorates from a perfect correlation, what happens to the points
on a scatter diagram?
1. They become more scattered
2. The slope changes
3. The y-intercept changes
4. Both B and C, above
5. None of the above



If two variables have a correlation coefficient of .30, what percentage of the
variance in one variable is accounted for by the other variable?
1. 30%
2. 70%
3. 10%
4. 9%

Observed errors, which represent information from the data which is not explained by
the model, are called?
1. Marginal values
2. Residuals
3. Mean square errors
4. Standard errors
5. None of the above

In an experiment an analyst has observed that SP(x,y) equals -212.35, SS(x) equals
237.16 and SS(y) = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assuming that a linear regression model is appropriate, the
least squares estimate for β0 is ____________.
1. -0.859
2. -0.895
3. 188.099
4. 206.710
5. 218.719

In an experiment an analyst has observed that SP(xy) equals -212.35, SS(x) equals
237.16 and SS(y) = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assuming that a linear regression model is appropriate, the
least squares estimate for β1 is ____________.
1. -0.859
2. -0.895
3. 188.099
4. 206.710
5. 218.719

In an experiment an analyst has observed that SP(x,y) equals -212.35, SS(x) equals
237.16 and SS(y) = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assume that a linear regression model is appropriate. These
results imply that, if X equals 200, the expected value for Y would be ____________.
1. 37,618.95
2. 26,824.83
3. 12.5
4. 11.2
5. 9.02

In an experiment an analyst has observed that SP(x,y) equals -212.35, SS(x) equals
237.16 and SS(y) = 858.49. The sample average for x was 193.1 and the sample
average for y was 15.2. Assuming that a linear regression model is appropriate,
approximately ____________ of the variation in Y could be attributed to variation in X.
1. 13%



2. 22%
3. 44%
4. 47%
5. None of the above
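The four fill-in answers above follow from the standard least squares identities b1 = SP(x, y)/SS(x), b0 = ȳ − b1·x̄, ŷ = b0 + b1·x and r² = SP(x, y)²/(SS(x)·SS(y)); a quick check:

```python
# Summary statistics given in the questions above
sp_xy, ss_x, ss_y = -212.35, 237.16, 858.49
x_bar, y_bar = 193.1, 15.2

b1 = sp_xy / ss_x                          # slope: least squares estimate of beta_1
b0 = y_bar - b1 * x_bar                    # intercept: least squares estimate of beta_0
y_hat_200 = b0 + b1 * 200                  # fitted value at X = 200
r_squared = sp_xy ** 2 / (ss_x * ss_y)     # proportion of variation explained

print(round(b1, 3), round(b0, 3), round(y_hat_200, 2), round(r_squared, 2))
# → -0.895 188.099 9.02 0.22
```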

In a regression experiment involving 102 observations an analyst has estimated β0 as
81.41 and β1 as 1.925. In this analysis the sample average for x was 62.5, SS(x) was
325.64 and the standard error of the regression was 28.5. If x equals 71, a 95%
confidence interval for Y would be _____________.
1. 158.52; 254.55
2. 140.72; 272.35
3. 149.18; 263.89
4. 204.72; 208.35
5. 205.63; 207.44

In a study of 42 observations, the sample covariance between two variables, X1 and
X2, is -188.37. If SS(x1) equals 202.25 and SS(x2) equals 305.12, then at α = 0.05,
the test statistic for H0: ρ = 0 would equal _____________; we would therefore infer
_____________ between X1 and X2.
1. -3.06; a significant negative relationship
2. 2.14; a significant positive relationship
3. -1.99; no significant relationship
4. 5.24; a significant negative relationship
5. 3.06; a significant positive relationship

Which of the following values is minimized through least squares estimation of β0
and β1?
1. Σ(y − ŷ)²
2. Σ(yi − ȳ)²
3. Σ(yi − ŷi)²

