CORRELATION ANALYSIS
Objectives:
The overall objective of this lesson is to give
you an understanding of bivariate linear
correlation, thereby enabling you to
understand the importance as well as the
limitations of correlation analysis.
Bivariate Data:
Definition: Data in which two variables are
measured on each unit, so that the
relationship between them can be studied,
are called bivariate quantitative data.
For example, the following pairs of variables
give bivariate data (a bivariate distribution):
1. interest rate of bonds and prime
interest rate;
2. advertising expenditure and sales;
3. income and consumption;
4. crop yield and fertilizer used;
5. height and weight.
Scatter Plots and Correlation
• A scatter plot (or scatter diagram) is used
to show the relationship between two
variables
• Correlation analysis is used to measure
the strength of the association (linear
relationship) between two variables
–Only concerned with strength of the
relationship
–No causal effect is implied
Scatter Plot Examples
[Figure: scatter plots of y against x showing linear and curvilinear relationships, strong and weak relationships, and no relationship]
Scatter Plot Characteristics
• Drawn on rectangular coordinates
• Shows two quantitative variables
• One variable is called independent (X) and the
second is called dependent (Y)
• Points are not joined
• No frequency table is needed
Scatter diagram:
It is the simplest diagrammatic
representation of bivariate data. For
the bivariate distribution (xi, yi), i = 1, 2, …, n,
if the values of the variables X and Y are
plotted along the X-axis and Y-axis
respectively in the xy-plane, the diagram
of dots so obtained is known as a scatter
diagram.
Introduction
In the previous two topics, we
concentrated entirely on distributions
and measures of one variable;
but in reality, we normally collect data on
several items at once. We are interested
in links, or relationships, between the
different variables (or, sometimes,
between variables and attributes).
Definition:
Correlation is the study of statistical
relationship between two or more
variables.
In other words, correlation is the degree or
intensity of association or inter-relationship
between two (or more) variables.
The correlation is a measure of how close
the relationship between x and y is to a
straight line.
Karl Pearson’s Correlation Coefficient:
A measure of intensity or degree of linear
relationship between two variables is called
coefficient of correlation. Correlation is
measured by the coefficient of correlation
which is denoted by ρ.
It is also called Pearson's correlation or the
product moment correlation coefficient.
It measures the nature and strength of the linear
relationship between two quantitative variables.
Mathematical definition of the coefficient of correlation:
If x and y are two random variables of a
bivariate population, the correlation
coefficient between them is denoted by ρxy
or ρ; the correlation coefficient between
x and y in a sample is denoted
by rxy or r.
The sign of r denotes the nature (direction) of
the association,
while the magnitude of r denotes the
strength of the association.
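Theoretical formulae:
\[
r_{xy} = \frac{\operatorname{Cov}(x, y)}{\sqrt{V(x)\,V(y)}}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]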
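Mathematical formulae:
\[
r_{xy} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
              {\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)
                     \left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}}
       = \frac{sp(x, y)}{\sqrt{ss(x)\, ss(y)}}
\]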
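The two forms of the formula can be checked against each other numerically. The sketch below is only an illustration: the function names and the small fertilizer/crop-yield data set are made up for this purpose and do not come from the lesson.

```python
import math

def pearson_r_deviation(x, y):
    """r from deviations about the means (the theoretical form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)

def pearson_r_computational(x, y):
    """r from raw sums (the computational form with sp and ss)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sp = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_x = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    ss_y = sum(yi ** 2 for yi in y) - sum_y ** 2 / n
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical data: fertilizer used (x) and crop yield (y)
fertilizer = [1, 2, 3, 4, 5]
crop_yield = [2, 4, 5, 4, 5]

r_dev = pearson_r_deviation(fertilizer, crop_yield)
r_comp = pearson_r_computational(fertilizer, crop_yield)
print(round(r_dev, 4), round(r_comp, 4))
```

For these data sp(x, y) = 6, ss(x) = 10 and ss(y) = 6, so both functions give r = 6 / sqrt(60), a partial positive correlation.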
Types of correlation:
By direction, there are three main types of
correlation: positive, negative and zero.
Taking extent as well as direction into account,
there are five types. Each type is
described mathematically and
graphically below:
Positive Correlation
i). Perfect Positive Correlation
ii) Partial Positive Correlation
Negative Correlation
i). Perfect Negative Correlation
ii) Partial Negative Correlation
Zero Correlation
Perfect Positive correlation:
Fig 1: Perfect positive (r = +1) [scatter plot, Y axis against X axis]
If the two variables always deviate in the same direction
and in a constant proportion, i.e., if every increase in one
variable results in a corresponding proportional increase in
the other variable, the correlation is said to be perfect
positive correlation.
In this case, all points (X, Y) lie exactly on an upward
sloping straight line and the two variables are fully correlated
with each other. The correlation coefficient is r = +1,
i.e., both variables rise or fall in the same
proportion.
Example:
Perfect correlations are not found in nature,
but some relationships approach that extent,
such as height and weight, age and
height, or age and weight of students up to a
certain age.
X varies directly and proportionately with Y
(X ∝ Y). If all the data points lie exactly on
an upward sloping line, then r will be +1
(see Fig 1).
Perfect Negative correlation:
Fig 2: Perfect negative (r = -1) [scatter plot, Y axis against X axis]
If the two variables always deviate in
opposite directions and in a constant proportion,
i.e., if every increase in one variable
results in a corresponding proportional decrease
in the other
variable, the correlation is said to be perfect
negative correlation.
Example:
Perfect negative correlations are not found
in nature, but some relationships approach that
extent, such as mean weekly
temperature and number of colds in
winter, or
pressure and volume of a gas at a particular
temperature. X varies inversely as Y (X ∝ 1/Y).
Partial positive & negative correlation
Fig 4: Partial positive (0 < r < 1) [scatter plot, Y axis against X axis]
Fig 5: Partial negative (-1 < r < 0) [scatter plot, Y axis against X axis]
Partial positive correlation:
If the two variables tend to deviate in the same
direction,
i.e., if an increase (or decrease) in one
variable is generally accompanied by an increase
(or decrease) in the other variable, the correlation
is said to be partial or moderately positive.
In this case, the coefficient r lies strictly
between 0 and +1,
i.e., 0 < r < 1.
Example:
1. interest rate of bonds and prime interest
rate
2. advertising expenditure and sales;
3. income and consumption;
4. crop-yield and fertilizer used;
5. height and weight, and so on.
Partial negative correlation:
If the two variables tend to deviate in
opposite directions,
i.e., if an increase (or decrease) in one variable
is generally accompanied by a decrease (or
increase) in the other variable, the correlation
is said to be partial or moderately negative (inverse).
In this case, -1 < r < 0.
Example:
Income and infant mortality rate of cow;
Rainfall and grass
In such moderately negative correlation, the
scatter diagram is of the same type as for
partial positive correlation, but the imaginary
mean line through the points slopes downward,
from high values of one variable to low values
of the other (Fig 5).
Uncorrelated or Zero Correlation:
Fig 3: Zero correlation (r = 0) [scatter plot, Y axis against X axis]
If there is no linear relationship between the two
variables, so that a change in the value of one
variable is not accompanied by any systematic
change in the other, the variables are said to
have no or zero correlation.
Example:
1. A man's height
and the amount he earns.
2. Height and pulse rate of a man.
Assumptions:
• The variables concerned are linearly related,
i.e., when they are plotted on graph paper, the
points lie approximately along a straight line.
• A cause and effect relationship is assumed
between the forces affecting the two variables; the
coefficient by itself does not establish causation.
• A large number of independent causes
operate on both variables so as to
produce a normal distribution.
• Both the variables are random.
• Either variable may be treated as the independent
variable, so a regression of one variable on the
other exists.
Properties of Correlation Coefficient:
• The correlation coefficient is independent of
change of origin and scale.
• The value of the correlation coefficient lies
between -1 and +1, i.e., -1 ≤ r ≤ +1.
• The correlation coefficient is the geometric
mean of the two regression coefficients, taking
their common sign.
• The correlation coefficient is symmetric
in the two variables, i.e., rxy = ryx.
• The value of the correlation coefficient is
strongly influenced by extreme values (outliers),
if they are present in the data.
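These properties can be verified numerically. The following is a minimal sketch on made-up data; the helper names `pearson_r` and `regression_slopes`, the data values, and the chosen shift and scale constants are all assumptions for illustration.

```python
import math

def _sums(x, y):
    """Return sp(x, y), ss(x), ss(y) about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp, ss_x, ss_y

def pearson_r(x, y):
    sp, ss_x, ss_y = _sums(x, y)
    return sp / math.sqrt(ss_x * ss_y)

def regression_slopes(x, y):
    """Return (b_yx, b_xy): slope of y on x and slope of x on y."""
    sp, ss_x, ss_y = _sums(x, y)
    return sp / ss_x, sp / ss_y

x = [1, 2, 3, 4, 5]          # hypothetical data
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

# Change of origin and scale (positive scale factors): r is unchanged.
u = [10 * xi + 3 for xi in x]
v = [0.5 * yi - 7 for yi in y]
r_shifted = pearson_r(u, v)

# Symmetry: r_xy equals r_yx.
r_swapped = pearson_r(y, x)

# Geometric mean of the regression coefficients, with their common sign.
b_yx, b_xy = regression_slopes(x, y)
r_from_slopes = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# One extreme point can change r drastically (here it even flips the sign).
r_outlier = pearson_r(x + [20], y + [0])

print(r, r_shifted, r_swapped, r_from_slopes, r_outlier)
```

Note that the invariance shown here uses positive scale factors; multiplying one variable by a negative constant reverses the sign of r while leaving its magnitude unchanged.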
Necessity of Studying Correlation:
The Pearson correlation coefficient is used for
assessing the linear (straight line) association
between an X and a Y variable, and requires
interval or ratio measurement.
The symbol for the sample correlation coefficient
is r, which is the sample estimate of ρ that can
be obtained from a sample of pairs (X, Y) of
values for X and Y.
The correlation varies from negative one to
positive one (–1 ≤ r ≤ +1).
Correlation of + 1 or –1 refers to a
perfect positive or negative X, Y
relationship, respectively. Data falling
exactly on a straight line indicates that |r|
= 1.
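A short numerical check of this statement, and of the fact that r measures only linear association, is sketched below. The data are invented for illustration: two exact straight lines and one exact curve (y = x squared) on x values placed symmetrically about zero.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)

x = [-2, -1, 0, 1, 2]
y_up = [3 + 2 * xi for xi in x]      # points exactly on a rising line
y_down = [3 - 2 * xi for xi in x]    # points exactly on a falling line
y_curve = [xi ** 2 for xi in x]      # exact curvilinear relationship

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_curve = pearson_r(x, y_curve)
print(round(r_up, 6), round(r_down, 6), round(r_curve, 6))
```

The curved data are perfectly related, yet r is zero, because the coefficient only captures how close the points lie to a straight line. This is one of the limitations of correlation analysis.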
Interpret r
The value of r ranges between -1 and +1.
The magnitude of r denotes the strength of the
association, as illustrated
by the following scale.
r                 Interpretation
-1                perfect indirect correlation
-1 to -0.75       strong indirect
-0.75 to -0.25    intermediate indirect
-0.25 to 0        weak indirect
0                 no relation
0 to 0.25         weak direct
0.25 to 0.75      intermediate direct
0.75 to 1         strong direct
+1                perfect direct correlation
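The scale above can be written as a small lookup function. This is only a sketch: the function name `interpret_r` is hypothetical, and since the scale does not say which band the cut-off values 0.25 and 0.75 themselves belong to, the code assumes they fall in the stronger band.

```python
def interpret_r(r):
    """Describe r using the cut-offs 0.25 and 0.75 from the scale.

    Assumption: |r| = 0.25 counts as intermediate and |r| = 0.75 as strong.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no relation"
    direction = "direct" if r > 0 else "indirect"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.25:
        strength = "intermediate"
    else:
        strength = "weak"
    return f"{strength} {direction}"

for value in (-1, -0.8, -0.14, 0, 0.5, 0.9, 1):
    print(value, "->", interpret_r(value))
```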
Interpret r (continued)
If r is very close to +1, there is a
strong positive correlation:
y increases as x increases, and the points
lie close to a straight line.
If r is close to -1, there is a strong negative
correlation: y decreases as x increases.
When r is close to zero (either positive or
negative) there is very little linear relationship
between the two variables.
Spearman’s rank correlation
Sometimes we come across statistical
series in which the variables under
consideration are not capable of
quantitative measurement but can be
arranged in serial order.
This happens when we are dealing
with qualitative characteristics
(attributes) such as honesty, beauty,
character, morality, etc.
Let the random variables X and Y denote
the ranks of the individuals in the
characteristics A and B respectively.
If we assume that there is no tie, i.e., that no
two individuals get the same rank in a
characteristic, then X and Y
assume numerical values ranging from 1 to
n, the number of individuals.
Spearman Rank Correlation
Coefficient (rs)
It is a non-parametric measure of correlation.
This procedure makes use of the two sets of
ranks that may be assigned to the sample
values of X and Y.
The Spearman rank correlation coefficient can
be computed in the following cases:
• Both variables are quantitative.
• Both variables are qualitative ordinal.
• One variable is quantitative and the other is
qualitative ordinal.
Example
In a study of the relationship between level
of education and income, the following data were
obtained. Find the relationship between them
and comment.
Sample   Level of education (X)   Income (Y)
A        Preparatory              25
B        Primary                  10
C        University               8
D        Secondary                10
E        Secondary                15
F        Illiterate               50
G        University               60
Answer:
Sample   X             Y    Rank X   Rank Y   di     di²
A        Preparatory   25   5        3        2      4
B        Primary       10   6        5.5      0.5    0.25
C        University    8    1.5      7        -5.5   30.25
D        Secondary     10   3.5      5.5      -2     4
E        Secondary     15   3.5      4        -0.5   0.25
F        Illiterate    50   7        2        5      25
G        University    60   1.5      1        0.5    0.25
∑ di² = 64
Apply the following formula:
\[
r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
\]
The value of rs denotes the magnitude and
nature of association giving the same
interpretation as simple r.
\[
r_s = 1 - \frac{6 \times 64}{7(48)} = -0.14 \approx -0.1
\]
Comment:
There is an indirect weak correlation between level
of education and income.
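The worked example above can be reproduced in a few lines. This is a sketch under stated assumptions: the numeric scores given to the education levels are hypothetical and only encode their order, rank 1 is given to the highest level and the highest income as in the answer table, and tied values share the average of their ranks.

```python
def average_ranks(values):
    """Rank values with 1 = largest; tied values get the mean of their ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1       # first position of v in the ordering
        count = ordered.count(v)           # number of tied observations
        ranks.append(first + (count - 1) / 2)
    return ranks

def spearman_rs(rank_x, rank_y):
    """rs = 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

# Hypothetical scores that only encode the order of the education levels
level_score = {"illiterate": 0, "primary": 1, "preparatory": 2,
               "secondary": 3, "university": 4}

education = ["preparatory", "primary", "university", "secondary",
             "secondary", "illiterate", "university"]   # samples A to G
income = [25, 10, 8, 10, 15, 50, 60]

rank_x = average_ranks([level_score[e] for e in education])
rank_y = average_ranks(income)
rs, sum_d2 = spearman_rs(rank_x, rank_y)

print(rank_x)          # ranks of education level
print(rank_y)          # ranks of income
print(sum_d2, round(rs, 2))
```

The ranks and the total of 64 agree with the answer table, and rs comes out at about -0.14. When ties are present, as here, this short formula is only an approximation to the Pearson correlation computed on the ranks.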
Uses of correlation
1. It is used in physical and social sciences.
2. It is useful for economists to study the
relationship between variables like price and
quantity. Businessmen estimate costs, sales,
prices etc. using correlation.
3. It is helpful in measuring the degree of
relationship between the variables like income
and expenditure, price and supply, supply and
demand etc.
4. Sampling error can be calculated.
5. It is the basis for the concept of regression.