
AAE-223: Statistics for Economists 2

Lecture Notes 3

Assa Mulagha-Maganga
Dept of Agricultural and Applied Economics, LUANAR
Department of Mathematical Sciences (Statistics), Chancellor College

Summer 2022

3 Theory of Correlations

3.1 Theory of Correlations


The various statistical techniques demonstrated in the previous chapters have dealt with analyzing data on only one variable. In practice, however, work in economics, agribusiness, health and other fields of study is frequently concerned with analyzing two or more variables; it is therefore crucially important to make statistical inferences about the degree and direction of association between variables. Correlation analysis helps us to quantify the relationship, assess the validity and reliability of the co-variation or association between two or more random variables, and make decisions about the nature of the paired variables; it may even lead us to identify possible cases of causality (Edriss, 2012).
In this lecture, we consider the degree of relationship between variables: we seek to determine how well a linear or other equation describes or explains the relationship between them. If all values of the variables satisfy an equation exactly (lie on the line of best fit), we say that the variables are perfectly correlated or that there is perfect correlation between them. Thus, the circumferences C and radii r of all circles are perfectly correlated since C = 2πr. If two dice are tossed simultaneously 100 times, there is no relationship between the corresponding points on each die (unless the dice are loaded); that is, they are uncorrelated. Variables such as the height and weight of individuals would show some correlation.
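The three situations just described can be illustrated with a short R sketch (the data below are simulated for illustration and are not from the text):

# Perfect correlation: circumference is an exact linear function of radius
radius <- 1:10
circumference <- 2 * pi * radius
cor(radius, circumference)               # exactly 1

# No correlation: two fair dice tossed 100 times
set.seed(1)                              # for reproducibility
die1 <- sample(1:6, 100, replace = TRUE)
die2 <- sample(1:6, 100, replace = TRUE)
cor(die1, die2)                          # close to 0

# Some correlation: simulated heights (cm) and weights (kg)
height <- rnorm(100, mean = 170, sd = 8)
weight <- 0.5 * height - 20 + rnorm(100, sd = 6)
cor(height, weight)                      # between 0 and 1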

3.2 Definition of correlation


The correlation coefficient is a measure of association between two variables, and it ranges between −1 and 1. If the two variables are in a perfect linear relationship, the correlation coefficient will be either 1 or −1. The sign depends on whether the variables are positively or negatively related. The correlation coefficient is 0 if there is no linear relationship between the variables. Two different types of correlation coefficients are in common use: the Pearson product moment correlation coefficient and the Spearman rank correlation coefficient, which is based on the rank relationship between the variables.
One visual way to determine whether there is correlation between variables is to use a scatter plot. Scatter plots are similar to line graphs in that they use horizontal and vertical axes to plot data points. However, they have a very specific purpose: they show how two variables vary together, and this relationship between the two variables is called their correlation.
Scatter plots usually consist of a large body of data. The closer the plotted data points come to forming a straight line, the higher the correlation between the two variables, or the stronger the relationship. If the data points form a line running from the origin out to high x- and y-values, the variables are said to have a positive correlation. If the line runs from a high value on the y-axis down to a high value on the x-axis, the variables have a negative correlation.
It must be emphasized that in every case the computed value of r measures the degree of relationship relative to the type of equation that is actually assumed. Thus, if a linear equation is assumed and a correlation coefficient near zero is obtained, it means that there is almost no linear correlation between the variables. It does not mean that there is no correlation at all, since there may actually be a strong nonlinear correlation between them. In other words, the correlation coefficient measures the goodness of fit between (1) the equation actually assumed and (2) the data. Unless otherwise specified, the term correlation coefficient is used to mean the linear correlation coefficient. It should also be pointed out that a high correlation coefficient (i.e., near 1 or −1) does not necessarily indicate a direct dependence between the variables. The correlation coefficient is scale free, so its interpretation does not depend on the units of measurement of the two variables, say x and y. In this lecture, the following methods of finding the correlation coefficient between two variables x and y are discussed:

1. Spearman’s Rank Correlation method

2. Karl Pearson’s Coefficient of Correlation method

3.3 Correlation and Causation


If there is a strong relationship (say, r = 0.91) between two variables, we are tempted to assume that an increase or decrease in one variable causes a change in the other variable. For example, it can be shown that the consumption of Malawian peanuts and the consumption of quinine have a strong correlation. However, this does not indicate that an increase in the consumption of peanuts caused the consumption of quinine to increase. Likewise, the incomes of professors and the number of inmates at Zomba Mental Hospital have increased proportionately. Further, as the population of donkeys has decreased, there has been an increase in the number of doctoral degrees granted. Relationships such as these are called nonsense or spurious correlations. What we can conclude when we find two variables with a strong correlation is that there is a relationship or association between the two variables, not that a change in one causes a change in the other.

3.4 Methods of Correlation Analysis


3.4.1 The graphical approach

The scatter diagram method is a quick, at-a-glance way of determining whether there is an apparent relationship between two variables. A scatter diagram (or graph) is obtained by plotting observed (or known) pairs of values of the variables x and y on graph paper, taking the independent variable on the x-axis and the dependent variable on the y-axis. It is common to try to draw a straight line through the data points so that an equal number of points lie on either side of the line; this straight line summarizes the relationship between x and y described by the data points. The pattern of the data points in the diagram indicates whether, and how, the variables are related, as the following figures illustrate.
[Figure 1: Negative linear relationship (scatter plot of mpg against wt)]
[Figure 2: Positive linear relationship (scatter plot of mpg against drat)]
[Figure 3: No linear relationship (scatter plot of mpg against qsec)]
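The axis labels in these figures (mpg, wt, drat, qsec) suggest they were drawn from R's built-in mtcars dataset; assuming that is the case, a sketch that reproduces the three scatter plots is:

# Scatter plots of fuel efficiency (mpg) against three other mtcars variables
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg",
     main = "Negative linear relationship")    # mpg falls as vehicle weight rises
plot(mtcars$drat, mtcars$mpg, xlab = "drat", ylab = "mpg",
     main = "Positive linear relationship")    # mpg rises with rear axle ratio
plot(mtcars$qsec, mtcars$mpg, xlab = "qsec", ylab = "mpg",
     main = "No clear linear relationship")    # no obvious straight-line pattern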

3.4.2 The numerical approach

a. Spearman rank correlation coefficient

Spearman's rank correlation coefficient is used to identify and test the strength of a relationship between two sets of data. It is often used as a statistical method to help support or reject a hypothesis, e.g. that the depth of a river does not progressively increase with distance from the river bank. The formula used to calculate Spearman's rank correlation coefficient is shown below.




r = 1 - \frac{6 \sum d^2}{n^3 - n}

where d is the difference between the two ranks of each observation and n is the number of pairs of observations.

How can the calculation be carried out in Excel?

Once the data has been collected, Excel can be used to calculate and graph Spearman’s
Rank correlation to discover if a relationship exists between the two sets of data, and how
strong this relationship is. Please note this example uses a dataset of 10 samples, but your
dataset should include a minimum of 15 to be valid.

Step 1: Create a table in Excel and enter your data sets.

Sample   Width (cm)   Width (Rank)   Depth (cm)   Depth (Rank)
1           0                           0
2          50                          10
3         150                          28
4         200                          42
5         250                          59
6         300                          51
7         350                          73
8         400                          85
9         450                         104
10        500                          96

Step 2: Rank each set of data (width rank and depth rank).

Rank 1 is given to the largest number in column 2 (Width). Continue ranking until all widths have been ranked, then do exactly the same for depth.
Sample   Width (cm)   Width (Rank)   Depth (cm)   Depth (Rank)
1           0              10            0             10
2          50               9           10              9
3         150               8           28              8
4         200               7           42              7
5         250               6           59              5
6         300               5           51              6
7         350               4           73              4
8         400               3           85              3
9         450               2          104              1
10        500               1           96              2
If two or more samples have the same value, the mean (average) rank should be used. For example, if there were 3 samples all with the same depth, ranked 6th, 7th and 8th in order, you would add the rank values together (6 + 7 + 8 = 21) and then divide this by the number of samples with the same depth, in this case 3 (21/3 = 7), so they would all receive a rank of 7. The next depth in order would then be given a rank of 9.
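As an aside, R's rank() function applies exactly this averaging rule for ties by default; a small sketch with made-up depth values:

depths <- c(12, 30, 30, 30, 45)   # three tied values
rank(-depths)                     # rank 1 = largest, ties share the average rank
## [1] 5 3 3 3 1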

Step 3: The next stage is to find d (the difference in rank between the width and depth). First,
add a new column to your table, and then calculate d by subtracting the depth rank column
(column 5) from the width rank column (column 3). For example, for sample 6 width rank
is 5 and the depth rank is 6 so d = 5 − 6 = −1. To calculate d in Excel, select the cell you
wish to enter the information into and type =. Now click on the width rank cell you want to
use and type -. Finally, click on the depth rank cell and press enter. The value of d should
appear in the first box you selected.
Sample   Width (cm)   Width (Rank)   Depth (cm)   Depth (Rank)    d
1           0              10            0             10         0
2          50               9           10              9         0
3         150               8           28              8         0
4         200               7           42              7         0
5         250               6           59              5         1
6         300               5           51              6        -1
7         350               4           73              4         0
8         400               3           85              3         0
9         450               2          104              1         1
10        500               1           96              2        -1

Step 4: The next step is to calculate d². Add another column to your table and label it d². To calculate d², type =POWER(number, power) into the first cell. In this case the number is the value of d and the power is 2, as we are squaring the value; e.g. for sample 6 the value of d is -1, so you would enter =POWER(-1,2) into the cell, press enter, and the value you should get is 1.
Sample   Width (cm)   Width (Rank)   Depth (cm)   Depth (Rank)    d    d²
1           0              10            0             10         0     0
2          50               9           10              9         0     0
3         150               8           28              8         0     0
4         200               7           42              7         0     0
5         250               6           59              5         1     1
6         300               5           51              6        -1     1
7         350               4           73              4         0     0
8         400               3           85              3         0     0
9         450               2          104              1         1     1
10        500               1           96              2        -1     1


Repeat the same process until all of your samples have a value of d². Once all the d² values have been calculated, add them together to obtain Σd². The quickest way to do this in Excel is to click on the cell underneath your last entry in the d² column, click on the AutoSum symbol Σ (which you can find on the toolbar at the top of the page), and press enter. (Depending on which version of Excel you are using, you may have to select the column you wish to add together before you press enter.)

Step 5: Now we have the d² values, but to complete the equation we still need to calculate n³ − n. Here n is the number of samples, so in this case n = 10. As in step 4, type =POWER(number,power) into the cell you wish to use, which will give you a value for n³. Remember that this time 'number' is the number of samples and 'power' is 3, as you are cubing rather than squaring the value. Once n³ has been calculated, subtract the value of n from it.
Sample   Width (cm)   Width (Rank)   Depth (cm)   Depth (Rank)    d    d²
1           0              10            0             10         0     0
2          50               9           10              9         0     0
3         150               8           28              8         0     0
4         200               7           42              7         0     0
5         250               6           59              5         1     1
6         300               5           51              6        -1     1
7         350               4           73              4         0     0
8         400               3           85              3         0     0
9         450               2          104              1         1     1
10        500               1           96              2        -1     1

Σd²        4
n         10
n³      1000
n³ − n   990

Step 6: All that is left to do now is to insert the values into the equation to calculate r:

r = 1 - \frac{6 \sum d^2}{n^3 - n}

The formula you would enter into an Excel cell in this case is =1-((6*4)/990) (N.B. * is the symbol for multiplication).
End result: the result should always lie between −1 and +1. In this case the value is 0.9758 to 4 decimal places, or 0.98 to 2 decimal places, indicating a very strong positive relationship between width and depth.
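The same calculation can be checked in R. A sketch using the width and depth values from the table above (because there are no tied values, cor() with method = "spearman" agrees with the manual formula):

width <- c(0, 50, 150, 200, 250, 300, 350, 400, 450, 500)
depth <- c(0, 10, 28, 42, 59, 51, 73, 85, 104, 96)

d <- rank(-width) - rank(-depth)            # differences in ranks (rank 1 = largest)
n <- length(width)
1 - (6 * sum(d^2)) / (n^3 - n)              # manual formula: 0.9757576
cor(width, depth, method = "spearman")      # same result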


Implementation of Spearman rank correlation in R

R provides two functions for calculating the correlation coefficient: cor() and cor.test(). cor() computes the correlation coefficient only, whereas cor.test() carries out a test of association between paired samples and returns both the correlation coefficient and the significance level (p-value) of the correlation.

Syntax: cor(x, y, method = "spearman")

Example: take two numeric variables x and y in a data frame called df.

df <- data.frame(x = c(15, 18, 21, 15, 21),
                 y = c(25, 25, 27, 27, 27))

# Calculate the Spearman rank correlation coefficient
result <- cor(df$x, df$y, method = "spearman")
cat("Spearman correlation coefficient is:", result)
## Spearman correlation coefficient is: 0.4564355
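Since cor.test() is mentioned above, a brief sketch of its use on the same data (with tied values R will warn that an exact p-value cannot be computed for the Spearman test):

cor.test(df$x, df$y, method = "spearman")
# Returns the estimated rho, the test statistic S and the p-value for
# the null hypothesis that there is no association between x and y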

b. Karl Pearson product moment correlation coefficient

In statistics, the Pearson correlation coefficient is a tool for determining whether or not there is a linear relationship between two variables. It quantifies both the strength and the direction of the relationship. A correlation exists when two variables are measured and a change in one is accompanied by a change in the other, whether in the same or the opposite direction. There are other correlation measures, such as Kendall's rank correlation, but those measure different types of association and are not direct substitutes for the Pearson correlation coefficient.
Written out as an equation, the Pearson correlation measurement can look fairly complicated. By definition, the Pearson product-moment correlation is the covariance of the two variables divided by the product of their standard deviations. The equation looks like this:

r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}

The formula can be written in the equivalent form

r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[\, n \sum x^2 - (\sum x)^2 \,][\, n \sum y^2 - (\sum y)^2 \,]}}


Assumptions of Using Pearson's Correlation Coefficient

i. Pearson's correlation coefficient is appropriate to calculate when both variables x and y are measured on an interval or a ratio scale.

ii. Both variables x and y are normally distributed, and there is a linear relationship between them (a rough way to check this in R is sketched after this list).

iii. The correlation coefficient is strongly affected by truncation of the range of values in one or both of the variables. This occurs when the distributions of the variables deviate greatly from the normal shape.

iv. There is a cause-and-effect relationship between the two variables that influences the distributions of both; otherwise the correlation coefficient might be extremely low or even zero.
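A minimal sketch of how one might informally check the normality and linearity assumptions in R before computing Pearson's coefficient, using the same hypothetical data frame as in the R examples of this lecture:

df <- data.frame(x = c(15, 18, 21, 15, 21),
                 y = c(25, 25, 27, 27, 27))

shapiro.test(df$x)   # Shapiro-Wilk test of normality for x
shapiro.test(df$y)   # and for y; p > 0.05 gives no strong evidence against normality
plot(df$x, df$y)     # visual check that the relationship looks roughly linear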

Advantages and Disadvantages of Pearson's Correlation Coefficient

The correlation coefficient is a number between −1 and 1 that summarizes both the magnitude and the direction (positive or negative) of the association between two variables. The chief limitations of Pearson's method are:

i. The correlation coefficient always assumes a linear relationship between the two variables, whether or not this is true.

ii. Great care must be exercised in interpreting the value of this coefficient, as it is very often misinterpreted.

iii. The value of the coefficient is unduly affected by extreme values of the two variables.

iv. Compared with other methods, the computation required to calculate r using Pearson's method is lengthy.

Example

Find the coefficient of linear correlation between the variables X (number of sales calls) and Y (number of laptops sold) presented in the table below.
X:  1  3  4  6  8  9  11  14
Y:  1  2  4  4  5  7   8   9

Solution

The work involved in the computation can be organized as in the table below.


r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}
  = \frac{cov(x, y)}{\sqrt{var(x)\,var(y)}}
  = \frac{84}{\sqrt{(132)(56)}} = 0.977

This shows that there is a very high linear correlation between the variables.

X    Y    x − x̄   y − ȳ   (x − x̄)²   (x − x̄)(y − ȳ)   (y − ȳ)²
1    1     −6      −4        36            24             16
3    2     −4      −3        16            12              9
4    4     −3      −1         9             3              1
6    4     −1      −1         1             1              1
8    5      1       0         1             0              0
9    7      2       2         4             4              4
11   8      4       3        16            12              9
14   9      7       4        49            28             16

Σx = 56 (x̄ = 56/8 = 7), Σy = 40 (ȳ = 40/8 = 5), Σ(x − x̄)² = 132, Σ(x − x̄)(y − ȳ) = 84, Σ(y − ȳ)² = 56
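As a quick check of the worked example, the same coefficient can be computed directly in R:

X <- c(1, 3, 4, 6, 8, 9, 11, 14)
Y <- c(1, 2, 4, 4, 5, 7, 8, 9)
cor(X, Y, method = "pearson")   # approximately 0.977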

Implementation of Pearson correlation in R

R provides the same two functions, cor() and cor.test(), for the Pearson correlation coefficient. cor() computes the correlation coefficient only, whereas cor.test() carries out the test of association between paired samples and returns both the correlation coefficient and the significance level (p-value) of the correlation.

Syntax: cor(x, y, method = "pearson")

Example: take two numeric variables x and y in a data frame called df.

df <- data.frame(x = c(15, 18, 21, 15, 21),
                 y = c(25, 25, 27, 27, 27))

# Calculate the Pearson correlation coefficient
result <- cor(df$x, df$y, method = "pearson")
cat("Pearson correlation coefficient is:", result)
## Pearson correlation coefficient is: 0.4564355
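For completeness, a sketch of cor.test() with the Pearson method on the same data; unlike the Spearman version it also reports a t statistic and a 95% confidence interval for the population correlation:

cor.test(df$x, df$y, method = "pearson")
# Returns the estimate r, a t statistic on n - 2 degrees of freedom,
# the p-value, and a 95% confidence interval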


3.5 Probable Error and Standard Error of the Coefficient of Correlation

The probable error (PE) of the coefficient of correlation indicates the extent to which its value depends on random sampling. If r is the calculated value of the correlation coefficient in a sample of n pairs of observations, then the standard error SE_r of the correlation coefficient r is given by

SE_r = \frac{1 - r^2}{\sqrt{n}}

The probable error of the coefficient of correlation is calculated by the expression:

PE_r = 0.6745 \, SE_r = 0.6745 \, \frac{1 - r^2}{\sqrt{n}}

Thus with the help of PE_r we can determine the range within which the population coefficient of correlation is expected to fall, using the formula ρ = r ± PE_r, where ρ (rho) denotes the population coefficient of correlation.

1. If r < PE_r, then the value of r is not significant; that is, there is no evidence of a relationship between the two variables of interest.

2. If r > 6 PE_r, then the value of r is significant; that is, there is evidence of a relationship between the two variables.

Example 3.1

If the covariance of 10 pairs of items is 7, the variance of x is 36, and Σ(y − ȳ)² = 90, find the correlation coefficient r.

Solution 3.1

We know that the correlation coefficient r is given by:

r = \frac{Cov(x, y)}{\sigma_x \sigma_y}

Given Cov(x, y) = 7, n = 10, σ_x² = 36 and Σ(y − ȳ)² = 90. Since Var(x) = σ_x² = 36, the standard deviation is σ_x = 6. In addition,

\sigma_y = \sqrt{\frac{\sum (y - \bar{y})^2}{n}} = \sqrt{\frac{90}{10}} = 3
n 10


The correlation coefficient r is then:

r = \frac{Cov(x, y)}{\sigma_x \sigma_y} = \frac{7}{6 \times 3} = 0.39

Example 3.2

Calculate Karl Pearson's coefficient of correlation from the following data and interpret your result:
σx = 10, σy = 12, x̄ = 25, and ȳ = 35.
The sum of the products of deviations from the actual arithmetic means of the two series is 24, and the number of observations is 20.

Solution 3.2

Given σx = 10, σy = 12, x̄ = 25, ȳ = 35, Σ(x − x̄)(y − ȳ) = 24 and n = 20. Then

Cov(x, y) = \frac{\sum (x - \bar{x})(y - \bar{y})}{n} = \frac{24}{20} = 1.2

We know that

r = \frac{Cov(x, y)}{\sigma_x \sigma_y} = \frac{1.2}{10 \times 12} = 0.01

Since the magnitude of r is very small, the correlation between x and y is negligible.

Example 3.3

If r = 0.97 and n = 8, find the probable error of the coefficient of correlation and determine the limits for the population correlation ρ.

Solution 3.3

Given r = 0.97 and n = 8. Then

PE_r = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} = 0.6745 \times \frac{1 - 0.97^2}{\sqrt{8}} = \frac{0.6745 \times 0.0591}{2.828} = 0.014

Limits of the population correlation: ρ = r ± PE_r = 0.97 ± 0.014, that is, from 0.956 to 0.984.
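A short R sketch of the probable error formulas from this section, used to verify Example 3.3 (the helper function pe_r is introduced here purely for illustration):

pe_r <- function(r, n) 0.6745 * (1 - r^2) / sqrt(n)   # PE = 0.6745 (1 - r^2) / sqrt(n)

r <- 0.97
n <- 8
pe <- pe_r(r, n)
pe                    # approximately 0.014
c(r - pe, r + pe)     # limits for the population correlation: about 0.956 to 0.984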
