Correlation Analysis
Two variables are said to be correlated when they are so related that a change in the
value of one variable is accompanied by a change in the value of the other variable.
For example: (i) an increase in the amount of rainfall is, to some extent, accompanied by
an increase in the volume of production; (ii) a decrease in the price of a commodity is
accompanied by an increase in the quantity demanded.
The measure of correlation is called the correlation coefficient. The correlation
coefficient is a quantitative measure of the direction and strength of the linear
relationship between two numerically measured variables.
Types of correlation:
Correlation may be of the following three types:
i) Positive and negative correlation.
ii) Linear and non-linear correlation.
iii) Simple, multiple and partial correlation.
(i) Positive and negative correlation:
If two variables vary in the same direction, i.e. if an increase (or decrease) in the value
of one variable results in an increase (or decrease) in the value of the other variable,
then the two variables are said to have positive correlation.
For example:
(a) x: 10  20  25  50        (b) x: 100  50  30  10
    y:  5   8  10  20            y:   8   5   3   2
On the other hand, two variables are said to have negative correlation if they move
in opposite directions, i.e. if one variable increases (or decreases), the other
decreases (or increases).
For example:
(a) x: 10  20  25  50        (b) x: 100  50  30  10
    y: 50  20  10   8            y:   8  15  23  28
(ii) Linear and non-linear correlation:
The correlation between two variables is said to be linear when a unit change in one
variable results in a constant change in the other variable over the entire range of the
values.
For example:
x: 1 2 3 4
y: 7 9 11 13
If, corresponding to a unit change in one variable, there is no constant change in the
other variable, then the correlation is said to be non-linear.
For example:
x: 1 2 3 4
y: 7 10 11 20
(iii) Simple, multiple and partial correlation:
The correlation between two variables is known as simple correlation. When three or
more variables are considered, the correlation may be multiple or partial. In multiple
correlation, three or more variables are studied simultaneously. In partial correlation,
there will be three or more variables, but we consider only two variables as influencing
each other, the other variables being kept constant.
The following are examples of simple, multiple and partial correlation respectively.
(a) The amount of fertilizer used and the yield of wheat per hectare.
(b) The amount of rainfall, quantity of fertilizer used and the yield of wheat per hectare.
(c) The amount of rainfall and the yield of wheat per hectare, keeping the quantity of
fertilizer used constant.
Methods of studying correlation:
The following methods can be used to study the correlation between two variables:
(a) Scatter diagram
(b) Karl Pearson’s correlation coefficient
(c) Spearman's rank correlation
Scatter diagram:
It is a graphical method of studying correlation. The simplest method of ascertaining the
correlation between two variables is the scatter diagram. Let X and Y be two variables,
each consisting of the same number of values. If we plot the x values along the x-axis
and the y values along the y-axis, we get a number of dots on the graph paper. The
diagram consisting of all the dots is called a scatter diagram.
When the dots show an upward trend, rising from the lower left-hand corner to the upper
right-hand corner, the correlation is said to be positive; and when, in addition, all the
dots lie on a straight line, the correlation is said to be perfect positive.
When the dots show a downward trend, falling from the upper left-hand corner to the
lower right-hand corner, the correlation is said to be negative; and when, in addition,
all the dots lie on a straight line, the correlation is said to be perfect negative.
If the dots are widely scattered and show no trend (rising or falling), the variables are
said to be uncorrelated.
[Figure: three scatter diagrams illustrating positive correlation, negative correlation and no correlation.]
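As a rough illustration, the sketch below (a minimal example, assuming Python with the matplotlib library available) plots the positive-correlation data from example (a) above; the dots drift from the lower left-hand corner to the upper right-hand corner, as described.

```python
# Minimal scatter diagram sketch; data taken from example (a) above.
import matplotlib.pyplot as plt

x = [10, 20, 25, 50]
y = [5, 8, 10, 20]

plt.scatter(x, y)                     # one dot per (x, y) pair
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter diagram (upward trend: positive correlation)")
plt.show()
```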
Karl Pearson’s Correlation coefficient:
One of the widely used mathematical methods of measuring the correlation between two
variables is Karl Pearson's correlation coefficient, also known as the product moment
correlation coefficient. Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of
values of two variables $x$ and $y$. The correlation coefficient between the variables
$x$ and $y$ is denoted by $r$ and is defined as:
$$ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{SP(x, y)}{\sqrt{SS(x)\, SS(y)}} = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left\{ \sum x^2 - \frac{(\sum x)^2}{n} \right\} \left\{ \sum y^2 - \frac{(\sum y)^2}{n} \right\}}} $$
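The computational (right-most) form of this formula translates directly into code. The following is a minimal Python sketch; the function name pearson_r is our own choice, not part of the text.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's product moment correlation coefficient,
    using r = SP(x, y) / sqrt(SS(x) * SS(y))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sp = sum(a * b for a, b in zip(x, y)) - sx * sy / n   # SP(x, y)
    ssx = sum(a * a for a in x) - sx * sx / n             # SS(x)
    ssy = sum(b * b for b in y) - sy * sy / n             # SS(y)
    return sp / math.sqrt(ssx * ssy)
```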
Properties of correlation coefficient:
1. The correlation coefficient is a symmetrical measure, i.e. $r_{xy} = r_{yx}$.
2. The correlation coefficient is independent of changes of origin and scale, i.e.
$r_{xy} = r_{uv}$ where $u = \frac{x - a}{h}$, $v = \frac{y - b}{k}$, and $a, b$ are
assumed means and $h, k$ common factors, called the origin and scale of measurement.
3. The correlation coefficient lies between $-1$ and $+1$.
4. The correlation coefficient is the geometric mean of the two regression coefficients.
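A quick numerical check of the first three properties, reusing the pearson_r sketch above; the data and the assumed means and common factors (a = 10, h = 2, b = 5, k = 3) are arbitrary illustrative choices.

```python
x = [10, 20, 25, 50]
y = [5, 8, 10, 20]

u = [(xi - 10) / 2 for xi in x]   # u = (x - a)/h with a = 10, h = 2
v = [(yi - 5) / 3 for yi in y]    # v = (y - b)/k with b = 5, k = 3

print(pearson_r(x, y))            # property 1: equals pearson_r(y, x)
print(pearson_r(y, x))
print(pearson_r(u, v))            # property 2: equals pearson_r(x, y)
                                  # property 3: every value lies in [-1, +1]
```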
Proof of property 2: Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of values of two
variables $x$ and $y$. If $\bar{x}$ and $\bar{y}$ are the means of the values of the $x$ and $y$ variables
respectively, then the correlation coefficient between $x$ and $y$ is
$$ r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$
Let us take the following transformations:
$$ u_i = \frac{x_i - a}{h} \quad \text{and} \quad v_i = \frac{y_i - b}{k} $$
so that $x_i = a + h u_i$ and $y_i = b + k v_i$, and hence $\bar{x} = a + h \bar{u}$ and $\bar{y} = b + k \bar{v}$.
Now,
$$ r_{xy} = \frac{\sum (a + h u_i - a - h\bar{u})(b + k v_i - b - k\bar{v})}{\sqrt{\sum (a + h u_i - a - h\bar{u})^2 \sum (b + k v_i - b - k\bar{v})^2}} = \frac{hk \sum (u_i - \bar{u})(v_i - \bar{v})}{hk \sqrt{\sum (u_i - \bar{u})^2 \sum (v_i - \bar{v})^2}} = r_{uv}. $$
This implies that the correlation coefficient $r$ does not depend on the origin and scale of
measurement.
Proof of property 3: Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be $n$ pairs of values of two variables
$x$ and $y$. If $\bar{x}$ and $\bar{y}$ are the means of the values of the $x$ and $y$ variables respectively, then the
correlation coefficient between $x$ and $y$ is
$$ r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $$
Let
$$ p_i = \frac{x_i - \bar{x}}{\sqrt{\sum (x_i - \bar{x})^2}} \quad \Rightarrow \quad \sum p_i^2 = \frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} = 1 $$
Similarly,
$$ q_i = \frac{y_i - \bar{y}}{\sqrt{\sum (y_i - \bar{y})^2}} \quad \Rightarrow \quad \sum q_i^2 = 1. $$
Thus,
$$ \sum p_i q_i = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = r $$
Now for any values of $p_i$ and $q_i$ we can write
$$ \sum (p_i - q_i)^2 \ge 0 \;\Rightarrow\; \sum p_i^2 - 2\sum p_i q_i + \sum q_i^2 \ge 0 \;\Rightarrow\; 1 - 2r + 1 \ge 0 \;\Rightarrow\; 2 - 2r \ge 0 \;\Rightarrow\; 1 - r \ge 0 $$
Thus, $r \le 1$ ..................(1)
Again,
$$ \sum (p_i + q_i)^2 \ge 0 \;\Rightarrow\; 1 + 2r + 1 \ge 0 \;\Rightarrow\; 1 + r \ge 0 \;\Rightarrow\; r \ge -1 \quad \ldots\ldots\ldots(2) $$
Hence, combining (1) and (2), we get
$$ -1 \le r \le 1. \text{ (Proved)} $$
Example: Calculate the coefficient of correlation from the following data of demand
and price of a commodity.
Price: 4 6 9 12 14 20
Demand: 14 12 10 14 13 16
Solution: Let price and demand be denoted by x and y respectively.
We know,
$$ r = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sqrt{\left\{ \sum x^2 - \frac{(\sum x)^2}{n} \right\} \left\{ \sum y^2 - \frac{(\sum y)^2}{n} \right\}}} $$
Table for necessary calculations:
x     y     x²     y²      xy
4     14    16     196     56
6     12    36     144     72
9     10    81     100     90
12    14    144    196     168
14    13    196    169     182
20    16    400    256     320
Total: Σx = 65, Σy = 79, Σx² = 873, Σy² = 1061, Σxy = 888
Now
$$ r = \frac{888 - \frac{65 \times 79}{6}}{\sqrt{\left\{873 - \frac{(65)^2}{6}\right\} \left\{1061 - \frac{(79)^2}{6}\right\}}} = \frac{888 - 855.83}{\sqrt{(873 - 704.17)(1061 - 1040.17)}} = \frac{32.17}{59.30} \approx 0.54 $$
Comment: The result shows that the relationship between demand and price is moderately
positive, i.e. if the demand for the commodity goes up, the price tends to go up, and if
the demand goes down, the price tends to go down.
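The worked example can be checked against the pearson_r sketch given earlier:

```python
price = [4, 6, 9, 12, 14, 20]
demand = [14, 12, 10, 14, 13, 16]
print(round(pearson_r(price, demand), 2))   # -> 0.54
```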
Example: The following table gives the age and blood pressure of 10 patients.
Age: 56 42 36 47 49 42 60 72 63 55
Pressure: 147 125 118 128 145 140 155 160 149 150
Compute the coefficient of correlation between the age and blood pressure.
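This exercise can be worked the same way; as a sketch, reusing pearson_r from earlier:

```python
age = [56, 42, 36, 47, 49, 42, 60, 72, 63, 55]
pressure = [147, 125, 118, 128, 145, 140, 155, 160, 149, 150]
print(pearson_r(age, pressure))   # comes out close to 0.89 (strong positive)
```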
Rank Correlation: The correlation between the ranks of two variables is known as rank
correlation. Rank correlation method is applied when the rank order data are available or
when each variable can be ranked in some order. The measure based on this method is
known as rank correlation coefficient.
The rank correlation method is recommended when
1. The values of the variables are available in rank order form.
2. The data are qualitative in nature and can be ranked in some order.
3. The data were originally quantitative in nature but, because of the smallness of the
sample size or for convenience in fitting the requirements of the analytical
techniques, were converted into ranks.
Computing rank correlation:
The Spearman rank correlation coefficient $r_s$ is just the ordinary sample correlation
coefficient $r$ applied to rank order data. The method calls for computing the sum of
the squared differences between each pair of ranks, after each of the two variables to be
correlated has been arranged in order of ranks. If no ties in ranks exist, we can apply
the following formula for computing $r_s$:
$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$
where $d_i$ is the difference between the ranks of the $i$th pair and $n$ is the number of pairs
included.
A convenient and simple formula for computing $r_s$ is as follows:
$$ r_s = \frac{\sum x_i y_i - C}{n(n^2 - 1)/12} = \frac{12\left(\sum x_i y_i - C\right)}{n(n^2 - 1)}, \quad \text{where } C = \frac{n(n+1)^2}{4}. $$
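A minimal Python sketch of the rank-difference formula (the function name spearman_rs is our own choice):

```python
def spearman_rs(rank_x, rank_y):
    """Spearman's rank correlation for untied ranks:
    rs = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))
```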
Example: Suppose we wish to determine whether the marks given by two independent
examiners to 10 students in an examination are correlated. Let x and y respectively be
the ranks of the marks given by the first examiner and the second examiner. The table
below shows the marks, the corresponding ranks, and the squared differences between
the paired ranks.
          First examiner        Second examiner
Student   Marks   Rank x_i      Marks   Rank y_i      d_i = x_i - y_i   d_i²
1         65      10            30      9             +1                1
2         70      9             25      10            -1                1
3         76      7             35      8             -1                1
4         75      8             40      6             +2                4
5         80      5             38      7             -2                4
6         78      6             42      5             +1                1
7         83      4             48      3             +1                1
8         84      3             50      2             +1                1
9         85      2             55      1             +1                1
10        90      1             45      4             -3                9
Total                                                                   Σd_i² = 24
Now
$$ r_s = 1 - \frac{6 \times 24}{10(10^2 - 1)} = 1 - \frac{144}{990} \approx 0.85 $$
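Using the spearman_rs sketch above on the ranks from the table reproduces the result:

```python
rank_x = [10, 9, 7, 8, 5, 6, 4, 3, 2, 1]       # first examiner's ranks
rank_y = [9, 10, 8, 6, 7, 5, 3, 2, 1, 4]       # second examiner's ranks
print(round(spearman_rs(rank_x, rank_y), 2))   # -> 0.85
```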
Computing $r_s$ for repeated ranks:
The usual formula for rank correlation assumes that no two observations have identical
or equal ranks. If, however, there are ties (i.e. when two observations of the same
variable are identical), some adjustment in the usual formula is needed to compute
$r_s$.
The modified formula is:
$$ r_s = \frac{\sum x_i y_i - C}{\sqrt{\left(\sum x_i^2 - C\right)\left(\sum y_i^2 - C\right)}}, \quad \text{where } C = \frac{n(n+1)^2}{4}. $$
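One common way to handle ties (assumed here; it is consistent with the modified formula above) is to assign average ranks, i.e. midranks, to tied observations and then correlate the ranks. The sketch below uses scipy's rankdata, assuming scipy is available, together with the pearson_r function from earlier; the data are illustrative.

```python
from scipy.stats import rankdata

x = [65, 70, 70, 75, 80]   # a tie at 70
y = [30, 25, 35, 35, 40]   # a tie at 35

rx = rankdata(x, method="average")    # midranks for tied values
ry = rankdata(y, method="average")
print(pearson_r(list(rx), list(ry)))  # rank correlation adjusted for ties
```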
Regression Analysis
Regression analysis is a technique for studying the dependence of one variable
(called the dependent variable) on one or more other variables (called independent
variables), with a view to estimating or predicting the average value of the dependent
variable in terms of the known or fixed values of the independent variables. The
dependent and independent variables are also called the explained and explanatory
variables respectively.
The regression technique is primarily used to:
(i) Estimate the relationship that exists, on the average, between the dependent
variable and the explanatory variables.
(ii) Determine the effect of each of the explanatory variables on the dependent
variable, controlling the effects of all other explanatory variables.
(iii) Predict the value of the dependent variable for a given value of the explanatory
variable.
Regression Model:
A model is simply a mathematical equation that describes the relationship between a
dependent variable and a set of independent variables.
A mathematical model in its simplest form, involving two variables, may be of the type
$$ Y = \alpha + \beta X + \varepsilon. $$
This is the so-called linear first-order model, which says that for a given X, a
corresponding observation Y consists of the value $\alpha + \beta X$ plus an amount
$\varepsilon$, the increment by which any individual Y may fall off the regression line
$\alpha + \beta X$.
The parameter $\alpha$ is the average value of Y for X = 0 and is called the Y intercept. The
parameter $\beta$ is the slope of the population regression line, also known as the population
regression coefficient. It represents the amount of increase in Y for each unit increase
in X.
Although we cannot find the above parameters exactly without examining all possible
occurrences of Y and X, we can use the information provided by the actual sample
observations to obtain the estimates a and b of $\alpha$ and $\beta$ respectively. Thus we
can write
$$ \hat{Y} = a + bX $$
where $\hat{Y}$ denotes the predicted value of Y for a given X once a and b are determined.
The above equation could then be used as a predictive equation: substituting a
value of X would provide a prediction of the true mean value of Y for that X.
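To make the model concrete, the sketch below simulates observations from $Y = \alpha + \beta X + \varepsilon$; the values of alpha, beta and the error spread are arbitrary illustrative choices.

```python
import random

alpha, beta = 80.0, 4.0                 # illustrative parameter values
xs = [1, 3, 4, 6, 8, 10]
# each Y is alpha + beta*X plus a random increment epsilon ~ N(0, 3^2)
ys = [alpha + beta * x + random.gauss(0, 3) for x in xs]
print(list(zip(xs, ys)))
```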
The least squares method:
One of the important objectives of regression analysis is to find estimates for $\alpha$
and $\beta$ in the regression line $\mu_{Y|X} = \alpha + \beta X$ from observed data; we shall designate
these estimates by a and b respectively. The parameter $\beta$ is called the regression
coefficient of Y on X and measures the average increase in Y for a unit increase in X. $\beta$
may be zero, positive or negative depending on the nature of the relationship between X
and Y. $\alpha$ is the intercept of the unknown regression equation on the Y axis. The
estimates a and b are called the least squares estimates of $\alpha$ and $\beta$ respectively.
The least squares method is thus a technique for minimizing the sum of squares of the
differences between the observed values and the estimated values of the dependent
variable.
The estimating line $\hat{Y}_i = a + bX_i$ is completely defined once the statistics a (the Y intercept)
and b (the slope of the line) are known. Here $Y_i$ is the ith observation of the
variable Y associated with $X_i$, the ith observation on X. The least squares line is then
the line that minimizes
$$ \sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - a - bX_i)^2 $$
where
$$ e_i = \text{deviation of } Y_i \text{ from } \hat{Y}_i = Y_i - \hat{Y}_i = Y_i - a - bX_i. $$
The difference $(Y_i - \hat{Y}_i)$ between the observed and the estimated value of Y at $X = X_i$ is
called the residual corresponding to $Y_i$. The term $\sum e_i^2$ is known as the sum
of squares of residuals.
The problem now is to compute the values of a and b that make the sum of squares $\sum e_i^2$ as
small as possible, i.e. the values of a and b are to be so chosen that $\sum e_i^2$ is a minimum.
One method of doing this is to set the partial derivatives of $\sum e_i^2$ with respect to both a
and b equal to zero and solve the resulting equations. Differentiating first with
respect to a and equating to zero:
$$ \frac{\partial}{\partial a} \sum e_i^2 = -2 \sum (Y_i - a - bX_i) = 0 $$
so that
$$ \sum Y_i = na + b \sum X_i \quad \ldots\ldots (1) $$
Again, differentiating the same function with respect to b and equating to zero:
$$ \frac{\partial}{\partial b} \sum e_i^2 = -2 \sum X_i (Y_i - a - bX_i) = 0 $$
so that
$$ \sum X_i Y_i = a \sum X_i + b \sum X_i^2 \quad \ldots\ldots (2) $$
Equations (1) and (2) are known as the normal equations or least squares equations,
and the resulting estimates a and b are known as the least squares estimates of $\alpha$ and $\beta$
respectively.
To solve these equations for a and b we multiply equation (1) by $\sum X_i$ and (2) by n,
which yields
$$ \sum X_i \sum Y_i = na \sum X_i + b \left(\sum X_i\right)^2 \quad \ldots\ldots (3) $$
$$ n \sum X_i Y_i = na \sum X_i + nb \sum X_i^2 \quad \ldots\ldots (4) $$
Subtracting (3) from (4) we get
$$ b \left\{ n \sum X_i^2 - \left(\sum X_i\right)^2 \right\} = n \sum X_i Y_i - \sum X_i \sum Y_i $$
$$ b = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2} = \frac{\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}} $$
After the value of b has been obtained, we can compute the value of a by substituting
the value of b into either equation. Thus from (1) we get
$$ a = \frac{\sum Y_i}{n} - b \frac{\sum X_i}{n} = \bar{Y} - b\bar{X} $$
Thus the fitted regression line is $\hat{Y}_i = a + bX_i$.
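A minimal Python sketch of these least squares formulas (the function name least_squares is our own choice):

```python
def least_squares(x, y):
    """Least squares estimates a, b for the line y-hat = a + b*x:
    b = (Sxy - Sx*Sy/n) / (Sxx - Sx^2/n),  a = ybar - b*xbar."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (sxy - sx * sy / n) / (sxx - sx * sx / n)
    a = sy / n - b * sx / n
    return a, b
```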
Example: A departmental store has the following statistics of sales (y) for the last one
year for 10 salespersons, who have varying years of experience (x).
Salesperson   Years of experience   Annual sales (in '000 Tk.)
1             1                     80
2             3                     97
3             4                     92
4             4                     102
5             6                     103
6             8                     111
7             10                    119
8             10                    123
9             11                    117
10            13                    136
(i) Find the regression line of y on x
(ii) Predict the annual sales volume of persons who have 12 and 15 years of sales experience.
Solution: Let the sales be Y and the experience be X, and let the fitted regression line
be $\hat{Y}_i = a + bX_i$, where
$$ b = \frac{\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}} \quad \text{and} \quad a = \bar{Y} - b\bar{X} = \frac{\sum Y_i}{n} - b \frac{\sum X_i}{n} $$
The table for necessary calculations:
Salesperson   X_i   Y_i    X_i²   X_iY_i
1             1     80     1      80
2             3     97     9      291
3             4     92     16     368
4             4     102    16     408
5             6     103    36     618
6             8     111    64     888
7             10    119    100    1190
8             10    123    100    1230
9             11    117    121    1287
10            13    136    169    1768
Totals        70    1080   632    8128
Now
$$ b = \frac{8128 - \frac{70 \times 1080}{10}}{632 - \frac{(70)^2}{10}} = \frac{8128 - 7560}{632 - 490} = \frac{568}{142} = 4.0 $$
$$ a = \frac{1080}{10} - 4.0 \times \frac{70}{10} = 108 - 28 = 80.0 $$
The fitted regression line of Y on X is $\hat{Y}_i = 80 + 4X_i$ .................. (A)
Putting X = 12 and X = 15 in (A) we get the expected sales of the salespersons with 12 and 15 years
of experience respectively. Thus,
$$ \hat{Y}_{12} = 80 + 4 \times 12 = 128 \text{ (in '000 Tk.)} \quad \text{and} \quad \hat{Y}_{15} = 80 + 4 \times 15 = 140 \text{ (in '000 Tk.)} $$
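The example can be checked with the least_squares sketch given earlier:

```python
experience = [1, 3, 4, 4, 6, 8, 10, 10, 11, 13]
sales = [80, 97, 92, 102, 103, 111, 119, 123, 117, 136]

a, b = least_squares(experience, sales)
print(a, b)                        # -> 80.0, 4.0
print(a + b * 12, a + b * 15)      # -> 128.0, 140.0
```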
Coefficient of determination
One convenient way to evaluate the strength of a regression equation is to compute the
coefficient of determination, which shows the proportion of the total variation in the
dependent variable Y explained by the explanatory variable X. The coefficient of
determination is computed by taking the square of the correlation coefficient and is
denoted by r².
If r = 0.803 then r² = 0.6448, i.e. about 64% of the variation in Y is explained by X.
The larger the value of r², the better the fitted regression model is at explaining the
variability in the observed values of Y.
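For the sales example above, the coefficient of determination can be computed by squaring the correlation coefficient (reusing the pearson_r sketch and the data from the previous code block):

```python
r = pearson_r(experience, sales)
print(r * r)   # roughly 0.93: about 93% of the variation in sales
               # is explained by years of experience
```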
Assignment Problem 12: The following table shows the ages and heights of some
students of a school.
Students’ ID Ages Heights
1 12 90
2 15 92
3 11 70
4 10 65
5 20 81
6 21 82
7 16 64
8 18 59
(a) Draw a scatter diagram for the data and comment on the relationship between the
ages and heights of the students.
(b) Calculate product moment correlation coefficient (r) and interpret the result.
(c) Fit a regression line and estimate the height of the student with age 25 years.
(d) Calculate the coefficient of determination and interpret the result.
Assignment Problem 13: The following data relate to the percentage of unemployment
(x) and percentage of change in wages (y) over several years.
x 16 22 23 17 18 24 27 20
y 50 32 27 21 41 28 23 35
(a) Without mathematical calculation, can you predict the relationship between x
and y?
(b) Calculate product moment correlation coefficient (r) and interpret the result.
(c) Fit a regression line and estimate the percentage change in wages when unemployment is 50%.
(d) Calculate the coefficient of determination and interpret the result.
Good Luck!