Unit 3 Correlation and Regression
Introduction
• Correlation and regression are statistical methods that are commonly used to
compare two or more variables, for example income and expenditure, or price
and demand.
• Correlation measures the association between two or more variables and
quantifies the strength of their relationship. It evaluates only the existing data.
• Regression describes the average relationship between two or more variables. This
relationship is used to estimate the most likely value of one variable for
specified values of the other variables.
Correlation
• Correlation exists between two variables.
• It is used to represent the linear relationship between two
variables.
• Two variables are said to be correlated if a change in
one variable is accompanied by a change in the other variable.
• Data connecting two such variables is called bivariate
data.
• Thus, correlation is a statistical analysis which measures
and analyses the degree to which two variables fluctuate
with reference to each other.
• Examples :
• Relationship between heights and weights
• Relationship between price and demand of commodity.
• Relationship between rainfall and yields of crops
TYPES OF CORRELATIONS
Correlation is classified into four types:
1. Positive and negative correlations
2. Linear and nonlinear correlations
3. Partial and total correlations
4. Simple and multiple correlations
POSITIVE CORRELATION
• If both the variables vary in the same direction, the correlation is said to be positive.
• In other words, if the value of one variable increases, the value of the other variable also
increases, or, if the value of one variable decreases, the value of the other variable decreases.
• e.g., the correlation between heights (cm) and weights (kg) of a group of persons is a positive
correlation.
Height (cm)   150  157  163  170  178
Weight (kg)    58   62   68   73   80
NEGATIVE CORRELATION
• If both the variables vary in opposite directions, the correlation is said to be negative.
• In other words, if the value of one variable increases, the value of the other variable
decreases, or, if the value of one variable decreases, the value of the other variable increases.
• e.g., the correlation between the price ($ per unit) and demand (units) of a commodity is a
negative correlation.
Price ($ per unit)   10    8    6    5    4
Demand (units)      100  200  300  400  500
LINEAR CORRELATION
If the ratio of change between two variables is constant, the correlation is
said to be linear. If such variables are plotted on graph paper, a straight line is
obtained, e.g.,
Milk (litre)   5  10  15  20  25
Paneer (kg)    2   4   6   8  10
NONLINEAR CORRELATION
If the ratio of change between two variables is not constant, the correlation
is said to be nonlinear. The graph of a nonlinear or curvilinear relationship will
be a curve, e.g.,
Expenses   3   6   9  12  15
Sales      8  12  15  15  16
Simple Correlation
When only two variables are studied, the relationship is described as simple
correlation, e.g., the quantity of money and price level, demand and price, etc.
Multiple Correlation
When more than two variables are studied, the relationship is described as multiple
correlation, e.g., relationship of price, demand, and supply of a commodity.
Partial Correlation
When more than two variables are studied excluding some other variables, the
relationship is termed as partial correlation.
Total Correlation
When more than two variables are studied without excluding any variables, the
relationship is termed as total correlation.
METHOD OF STUDYING CORRELATION
There are two different methods of studying correlation,
(1) Graphical methods
(2) Mathematical methods.
Graphical methods are
(a) Scatter diagram
(b) Simple graph
Mathematical methods are
(a) Karl Pearson's coefficient of correlation
(b) Spearman's rank coefficient of correlation
SCATTER DIAGRAM
• A scatter diagram is a graphical representation of the relation between two variables.
In the scatter plot of two variables x and y, each point on the plot is an x-y pair.
• The relationship between the two variables shows up in the pattern of the plotted points.
The main cases are described below.
Perfect Positive Correlation
If all the plotted points lie on a straight line rising from
the lower left-hand corner to the upper right-hand
corner, the correlation is said to be perfectly positive.
Perfect Negative Correlation
If all the plotted points lie on a straight line falling from
the upper-left hand corner to the lower right-hand
corner, the correlation is said to be perfectly negative.
High Degree of Positive Correlation
If all the plotted points lie in a narrow strip, rising from the lower
left-hand corner to the upper right-hand corner, it indicates a high
degree of positive correlation.
High Degree of Negative Correlation
If all the plotted points lie in a narrow strip, falling from the upper
left-hand corner to the lower right-hand corner, it indicates the
existence of a high degree of negative correlation.
No Correlation
If all the plotted points lie on a straight line parallel to the x-axis or
y-axis or in a haphazard manner, it indicates the absence of any
relationship between the variables.
SIMPLE GRAPH
• A simple graph is a diagrammatic representation of bivariate data to find the correlation
between two variables.
• The values of the two variables are plotted on a graph paper. Two curves are obtained, one for
the variable x and the other for the variable y.
• If both the curves move in the same direction, the correlation is said to be positive. If both the
curves move in the opposite direction, the correlation is said to be negative.
• This method is used in the case of a time series. It does not reveal the extent to which the
variables are related.
• Thus, a scatter diagram is a simple, nonmathematical method of finding the
correlation between the variables.
• It gives an indication of the degree of linear
correlation between the variables.
• It is easy to understand.
Mathematical method:
• Karl Pearson's coefficient of correlation
The coefficient of correlation is a measure
of the correlation between two random variables x
and y. It is denoted by r.
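A commonly used computational form of Karl Pearson's coefficient is given here for reference. It is written with the raw sums over the n pairs (x, y), which are the same sums that the worked examples tabulate in their X², Y² and XY columns:

```latex
r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}
         {\sqrt{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}\;
          \sqrt{\sum y^2 - \dfrac{\left(\sum y\right)^2}{n}}}
```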
Note:
• The value of r lies in the range -1 ≤ r ≤ 1.
• The correlation coefficient can be either positive or negative.
• The sign of the correlation coefficient indicates the direction of the
linear relationship.
• The magnitude of the correlation coefficient indicates the
strength of the correlation.
The value of r ranges between -1 and +1.
The value of r denotes the strength of the association, as illustrated by the following scale.
[Scale of r: -1 (perfect negative correlation), -0.75, -0.25, 0 (no relation), 0.25, 0.75,
+1 (perfect positive correlation). Moving away from 0 on either side, the bands are weak,
intermediate and strong.]
If r = 0, there is no linear association or correlation
between the two variables.
If 0 < |r| < 0.25: weak correlation.
If 0.25 ≤ |r| < 0.75: intermediate correlation.
If 0.75 ≤ |r| < 1: strong correlation.
If |r| = 1: perfect correlation.
• For example, r = 0.9 suggests a strong positive
linear relationship between two variables, and
r = -0.2 suggests a weak negative correlation between
two variables.
• If two random variables are independent, then r = 0. The converse is not true in general.
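The bands above can be turned into a small lookup. The following is a minimal Python sketch, assuming the bands are applied to the magnitude |r| so that negative values are labelled as well; the function name `describe_correlation` is illustrative and not part of the notes.

```python
def describe_correlation(r):
    """Label r with the strength bands from the notes, applied to |r|."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no correlation"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.75:
        strength = "strong"
    elif size >= 0.25:
        strength = "intermediate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"


print(describe_correlation(0.9))   # strong positive correlation
print(describe_correlation(-0.2))  # weak negative correlation
```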
Example 1:
The following data represents the number of hours 12 different students watched
television during the weekend and the scores of each student who took a test the
following Monday.
a.) Display the scatter plot.
b.) Calculate the correlation coefficient r.
Hours, x 0 1 2 3 3 5 5 5 6 7 7 10
Test score, y 96 85 82 74 95 68 76 84 58 65 75 50
xy
x²
y²
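Part (b) can be checked with a short Python sketch that builds the sums Σx, Σy, Σxy, Σx² and Σy² and applies the raw-sum form of Pearson's formula. The function name `pearson_r` is an illustrative choice, and the code is a check on the hand calculation rather than a replacement for it.

```python
from math import sqrt

hours = [0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10]
scores = [96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50]


def pearson_r(x, y):
    """Pearson's r from the raw sums used in the worksheet."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


r = pearson_r(hours, scores)
print(f"r = {r:.3f}")  # r = -0.831
```

The negative value close to -1 indicates that, in this sample, more hours of television go with lower test scores.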
Ex-2 Calculate Karl Pearson's coefficient of correlation between x and y using
the following data.
X 2 4 5 6 8 11
Y 18 12 10 8 7 5
X Y X² Y² XY
2 18
4 12
5 10
6 8
8 7
11 5
36 60
Ex-3 Calculate the coefficient of correlation from the
following data
X 12 9 8 10 11 13 7
Y 14 8 6 9 11 12 3
X Y X² Y² XY
12 14
9 8
8 6
10 9
11 11
13 12
7 3
70 63
Ex-4 Calculate the coefficient of correlation from the
following data
X 9 8 7 6 5 4 3 2 1
Y 15 16 14 13 11 12 10 8 9
X Y X² Y² XY
9 15
8 16
7 14
6 13
5 11
4 12
3 10
2 8
1 9
45 108
Ex-5 Calculate the correlation coefficient for the
following data
X 5 9 13 17 21
Y 12 20 25 33 35
X Y X² Y² XY
Rank correlation
• Let a group of n individuals be arranged in order of
merit with respect to some characteristic. The same
group would, in general, give a different order (rank) for a different
characteristic. Considering the orders corresponding
to two characteristics x and y, the correlation between
these n pairs of ranks is called the rank correlation in
the characteristics x and y for that group of individuals.
• In statistics, a rank correlation is any of several statistics that
measure an ordinal association, that is, the relationship
between rankings of different ordinal variables or different
rankings of the same variable, where a "ranking" is the
assignment of the ordering labels "first", "second", "third", etc.
to different observations of a particular variable.
• A rank correlation coefficient measures the degree of
similarity between two rankings, and can be used to assess
the significance of the relation between them.
• When we are given the actual data and not the ranks, it is
necessary to assign ranks.
• Ranks can be assigned by taking either the highest value as
1 or the lowest value as 1, provided the same convention is used for both variables.
Spearman’s Rank Correlation Coefficient
• Spearman's correlation coefficient measures the
strength and direction of association between two
ranked variables.
• Spearman's rank coefficient of two different
characteristics x and y is denoted by r_s.
• d = x - y, where x and y are the two ranks of the same individual.
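For reference, the usual textbook form of Spearman's coefficient when no ranks are tied is shown here, with n the number of ranked pairs and d the difference of ranks defined above:

```latex
r_s = 1 - \frac{6 \sum d^2}{n\left(n^2 - 1\right)}
```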
Steps in calculating Spearman's rank
coefficient
• Convert the observed values to ranks.
• Find the difference between the ranks, square the differences, and
sum the squared differences.
• Write the formula, solve it, and draw a conclusion based on
the value of r found.
• The rank correlation coefficient lies in the interval [-1, 1].
• r = +1 indicates a perfect positive association of ranks.
• r = -1 indicates a perfect negative association of ranks.
• r = 0 indicates no association between the ranks.
• The closer r is to zero, the weaker the association between the
ranks.
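The steps above can be sketched in Python as follows, assuming the standard no-ties formula r_s = 1 - 6Σd² / (n(n² - 1)). The function name `spearman_from_ranks` and the two short rankings used in the demonstration are made up for illustration.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank coefficient for two rankings without ties."""
    n = len(rank_x)
    # Difference of ranks for each individual, squared and summed.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Illustrative rankings (not taken from the exercises).
same_order = spearman_from_ranks([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
reverse_order = spearman_from_ranks([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
print(same_order, reverse_order)  # 1.0 -1.0
```

Identical rankings give +1 and exactly reversed rankings give -1, which matches the two extreme cases listed above.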
Ex 6: The following table provides data about the percentage of students
who have free university meals and their CGPA scores. Calculate
Spearman's rank correlation between the two and interpret the result.
State University   % of students having free meals   % of students scoring above 8.5 CGPA
Pune               14.4                              54
Chennai             7.2                              64
Delhi              27.5                              44
Kanpur             33.8                              32
Ahmedabad          38                                37
Indore             15.9                              68
Guwahati            4.9                              62
Let x =rank of students having free meals
y= rank of students scoring above 8.5 CGPA
State university X Y d= x-y d²
Pune
Chennai
Delhi
Kanpur
Ahmedabad
Indore
Guwahati
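The worksheet can be checked with a short Python sketch. It assumes that rank 1 is given to the highest value in each column and that there are no ties (true for this data); the helper names `ranks_highest_first` and `spearman_from_ranks` are illustrative.

```python
meals = [14.4, 7.2, 27.5, 33.8, 38, 15.9, 4.9]   # % having free meals
cgpa = [54, 64, 44, 32, 37, 68, 62]              # % scoring above 8.5 CGPA


def ranks_highest_first(values):
    """Rank 1 goes to the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_from_ranks(rank_x, rank_y):
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


x = ranks_highest_first(meals)
y = ranks_highest_first(cgpa)
sum_d2 = sum((a - b) ** 2 for a, b in zip(x, y))
r_s = spearman_from_ranks(x, y)

print("x ranks:", x)          # [5, 6, 3, 2, 1, 4, 7]
print("y ranks:", y)          # [4, 2, 5, 7, 6, 1, 3]
print("sum of d^2:", sum_d2)  # 96
print(f"r_s = {r_s:.3f}")     # r_s = -0.714
```

The negative value means the two rankings tend to run in opposite directions: universities with a higher share of students on free meals tend to have a lower share scoring above 8.5 CGPA in this sample.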
Ex-7 The ICC rankings for Test and ODI matches for nine teams
are shown below.
Check whether there is a correlation between the ranks.
Team           Test ranking   ODI ranking
India          1              1
Australia      2              3
South Africa   3              2
Sri Lanka      4              7
Pakistan       5              6
England        6              4
New Zealand    7              5
Bangladesh     8              8
West Indies    9              9
Team           Test ranking   ODI ranking   d   d²
India
Australia
South Africa
Sri Lanka
Pakistan
England
New Zealand
Bangladesh
West Indies
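Since the data are already ranks, the d and d² columns can be checked directly. This is a minimal Python sketch assuming the standard no-ties Spearman formula; the variable names are illustrative.

```python
test_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9]
odi_rank = [1, 3, 2, 7, 6, 4, 5, 8, 9]

# d = Test rank - ODI rank for each team, in the order of the table.
d = [t - o for t, o in zip(test_rank, odi_rank)]
d_squared = [value ** 2 for value in d]

n = len(test_rank)
r_s = 1 - 6 * sum(d_squared) / (n * (n * n - 1))

print("d   :", d)              # [0, -1, 1, -3, -1, 2, 2, 0, 0]
print("d^2 :", d_squared)      # [0, 1, 1, 9, 1, 4, 4, 0, 0]
print(f"r_s = {r_s:.3f}")      # r_s = 0.833
```

A value of about 0.83 points to a strong positive association between the two rankings.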
Ex-8 Ten participants in a contest are ranked by two
judges as follows. Calculate the rank correlation
coefficient.
X 1 3 7 5 4 6 2 10 9 8
Y 3 1 4 5 6 9 7 8 10 2
x y d=x-y d²
Ex-9 Ten students got the following percentages of
marks in mathematics and physics. Find the rank
correlation coefficient.
Mathematics (x) 8 36 98 25 75 82 92 62 5 35
Physics(y) 84 51 91 60 68 62 86 58 35 49
Ex-10 Ten competitors in a musical test were
ranked by three judges a, b and c in the following
order. Using the rank correlation method, find
which pair of judges has the nearest approach to
common liking in music.
Rank by a   1  6  5  10  3   2  4   9  7  8
Rank by b   3  5  8   4  7  10  2   1  6  9
Rank by c   6  4  9   8  1   2  3  10  5  7
Rank by a (x)   Rank by b (y)   Rank by c (z)   d1 = x - y   d2 = y - z   d3 = z - x   d1²   d2²   d3²
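The three pairwise coefficients can be computed in one pass. The sketch below assumes the standard no-ties Spearman formula and treats the pair with the largest coefficient as the one with the nearest approach to common liking; the dictionary layout and function name are illustrative.

```python
from itertools import combinations

ranks = {
    "a": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "b": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "c": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}


def spearman_from_ranks(rank_x, rank_y):
    n = len(rank_x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# One coefficient per pair of judges: (a, b), (a, c), (b, c).
results = {
    (j1, j2): spearman_from_ranks(ranks[j1], ranks[j2])
    for j1, j2 in combinations(ranks, 2)
}
for pair, value in results.items():
    print(pair, round(value, 3))

closest_pair = max(results, key=results.get)
print("closest pair:", closest_pair)  # closest pair: ('a', 'c')
```

The sums of squared differences are 200 for (a, b), 214 for (b, c) and 60 for (a, c), so judges a and c agree most closely.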
Example 11: The competitors in a beauty contest are ranked
by three judges in the following order.
Use the rank correlation coefficient to discuss which pair of judges
has the nearest approach to common tastes in beauty.
1st Judge 1 5 4 8 9 6 10 7 3 2
2nd Judge 4 8 7 6 5 9 10 3 2 1
3rd Judge 6 7 8 1 5 10 9 2 3 4
REGRESSION
• Regression is defined as a method of estimating the value of one variable when
that of the other is known and the variables are correlated.
• Regression analysis is used to predict or estimate one variable in terms of the
other variable.
• It is a highly valuable tool for prediction purposes in economics and business.
• It is useful in statistical estimation of demand curves, supply curves, production
function, cost function, consumption function, etc.
TYPES OF REGRESSION
Regression is classified into two types
1. Simple and multiple regressions
2. Linear and nonlinear regressions
Simple and Multiple Regressions
Depending upon the number of variables studied, regression may be
simple or multiple.
• Simple Regression
The regression analysis for studying only two variables at a time is known as
simple regression.
• Multiple Regression
The regression analysis for studying more than two variables at a time is
known as multiple regression.
Linear and Nonlinear Regressions
Depending upon the regression curve, regression may be linear or nonlinear.
• Linear Regression
If the regression curve is a straight line, the regression is said to be linear.
• Nonlinear Regression
If the regression curve is not a straight line, i.e., not a first-degree equation in
the variables x and y, the regression is said to be nonlinear or curvilinear.
LINES OF REGRESSION
Line of Regression of y on x
• It is the line which gives the best estimate for the values of y for any
given values of x.
• The regression equation of y on x is given by
$$y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$$
• It is also written as y = a + bx.
Line of Regression of x on y
• It is the line which gives the best estimate for the values of x for any
given values of y.
• The regression equation of x on y is given by
$$x - \bar{x} = r \frac{\sigma_x}{\sigma_y} (y - \bar{y})$$
• It is also written as x = a + by.
Note: $\bar{x}$ and $\bar{y}$ are the means of the x series and y series respectively, $\sigma_x$ and $\sigma_y$ are the
standard deviations of the x series and y series respectively, and r is the correlation
coefficient between x and y.
REGRESSION COEFFICIENTS
• The slope b of the line of regression of y on x is also called the coefficient of
regression of y on x.
• It represents the increment in the value of y corresponding to a unit change in
the value of x.
Regression coefficient of y on x:
$$b_{yx} = r \frac{\sigma_y}{\sigma_x}$$
Similarly, regression coefficient of x on y:
$$b_{xy} = r \frac{\sigma_x}{\sigma_y}$$
EXPRESSION FOR REGRESSION COEFFICIENT
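$$r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}\;\sqrt{\sum y^2 - \dfrac{(\sum y)^2}{n}}}$$
$$\sigma_x = \sqrt{\frac{1}{n}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)}, \qquad \sigma_y = \sqrt{\frac{1}{n}\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}$$
$$b_{yx} = r \frac{\sigma_y}{\sigma_x} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
$$b_{xy} = r \frac{\sigma_x}{\sigma_y} = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum y^2 - \dfrac{(\sum y)^2}{n}}$$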
PROPERTIES OF REGRESSION COEFFICIENTS
1. The coefficient of correlation is the geometric mean of the coefficients of regression, i.e.,
$r = \pm\sqrt{b_{yx} b_{xy}}$, where r takes the sign of the regression coefficients.
2. The arithmetic mean of the regression coefficients is greater than or equal to the coefficient of
correlation in magnitude, i.e., $\frac{1}{2}\left|b_{yx} + b_{xy}\right| \ge |r|$.
3. Both regression coefficients will have the same sign, i.e., either both are positive or both are
negative.
4. The sign of correlation is the same as that of the regression coefficients, i.e., $r > 0$ if $b_{xy} > 0$
and $b_{yx} > 0$; and $r < 0$ if $b_{xy} < 0$ and $b_{yx} < 0$.
PROPERTIES OF LINES OF REGRESSION
1. The two regression lines x on y and y on x always intersect at their means $(\bar{x}, \bar{y})$.
2. Since $r^2 = b_{yx} b_{xy}$, i.e., $r = \pm\sqrt{b_{yx} b_{xy}}$, therefore $r$, $b_{yx}$, $b_{xy}$ all have the same sign.
3. If $r = 0$, the regression coefficients are zero.
4. The regression lines become identical if $r = \pm 1$. If $r = 0$, it follows from the regression equations that
$x = \bar{x}$ and $y = \bar{y}$; these lines are perpendicular to each other.
EXAMPLES
Example 13: The regression lines of a sample are $x + 6y = 6$ and $3x + 2y = 10$. Find
(1) the sample means $\bar{x}$ and $\bar{y}$, and
(2) the coefficient of correlation between x and y.
(3) Also estimate y when x = 12.
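One way to check the answers is sketched below in Python. It relies on two properties of regression lines: both lines pass through the means, and the product of the two regression coefficients equals r², which cannot exceed 1. The second property is used to decide which line is y on x. All function names are illustrative.

```python
from math import copysign, sqrt

# Each line is stored as (A, B, C), meaning A*x + B*y = C.
line1 = (1.0, 6.0, 6.0)    # x + 6y = 6
line2 = (3.0, 2.0, 10.0)   # 3x + 2y = 10


def intersection(l1, l2):
    """Solve the two lines by Cramer's rule; the solution is (x_bar, y_bar)."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


def identify(l1, l2):
    """Return (b_yx, b_xy, y_on_x_line) for the assignment with b_yx * b_xy <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx = -y_on_x[0] / y_on_x[1]   # slope when the line is solved for y
        b_xy = -x_on_y[1] / x_on_y[0]   # slope when the line is solved for x
        if 0 <= b_yx * b_xy <= 1:
            return b_yx, b_xy, y_on_x
    raise ValueError("no valid assignment of the two lines")


x_bar, y_bar = intersection(line1, line2)
b_yx, b_xy, y_on_x = identify(line1, line2)
r = copysign(sqrt(b_yx * b_xy), b_yx)   # r has the sign of the coefficients

a, b, c = y_on_x
y_at_12 = (c - a * 12) / b              # estimate y from the y-on-x line

print(x_bar, y_bar)         # 3.0 0.5
print(round(r, 4))          # -0.3333
print(y_at_12)              # -1.0
```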
Example 14: The following data regarding the height (y) and weight (x) of 100 college students are given:
Σx = 15000, Σx² = 2272500, Σy = 6800, Σy² = 463025, Σxy = 1022250
Find the coefficient of correlation between height and weight and also the equation of regression of height
on weight.
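The given sums can be plugged straight into the raw-sum formulas. The Python sketch below reads the request as the regression of height (y) on weight (x), which is an assumption about the intended line; variable names such as `s_xy` are illustrative shorthand for the corrected sums.

```python
from math import sqrt

n = 100
sum_x, sum_x2 = 15000, 2272500      # weight
sum_y, sum_y2 = 6800, 463025        # height
sum_xy = 1022250

# Corrected sums: sum(xy) - sum(x)sum(y)/n, and similarly for x^2 and y^2.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n
s_yy = sum_y2 - sum_y ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)

# Regression of y on x: y - y_bar = b_yx (x - x_bar), i.e. y = a + b_yx * x.
b_yx = s_xy / s_xx
x_bar, y_bar = sum_x / n, sum_y / n
a = y_bar - b_yx * x_bar

print(f"r = {r:.2f}")                     # r = 0.60
print(f"y = {a:.2f} + {b_yx:.2f} x")      # y = 53.00 + 0.10 x
```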
Example 15: Find the regression coefficients $b_{yx}$ and $b_{xy}$ and hence find the correlation
coefficient between x and y for the following data:
x   4  2  3  4  2
y   2  3  2  4  4
x   y   x²   y²   xy
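The worksheet columns and the two regression coefficients can be checked with the Python sketch below. It uses the raw-sum expressions for the regression coefficients and then obtains r as the geometric mean of the two coefficients, carrying their common sign; the variable names are illustrative.

```python
from math import copysign, sqrt

x = [4, 2, 3, 4, 2]
y = [2, 3, 2, 4, 4]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(p * q for p, q in zip(x, y))

s_xy = sum_xy - sum_x * sum_y / n
b_yx = s_xy / (sum_x2 - sum_x ** 2 / n)   # regression coefficient of y on x
b_xy = s_xy / (sum_y2 - sum_y ** 2 / n)   # regression coefficient of x on y

# r is the geometric mean of the coefficients and shares their sign.
r = copysign(sqrt(b_yx * b_xy), b_yx)

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)   # 15 15 49 49 44
print(b_yx, b_xy, r)                          # -0.25 -0.25 -0.25
```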
Example-16: The number of bacterial cells (y) per unit volume in a culture at different hours (x) is
given below:
x    0   1   2   3    4    5    6    7    8    9
y   43  46  82  98  123  167  199  213  245  272
Fit lines of regression of y on x and x on y. Also, estimate the number of bacterial cells after 15 hours.
Solution:
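A numerical check of the fit is sketched below in Python, using the raw-sum expressions for the two regression coefficients. Note that estimating at 15 hours extrapolates the straight line beyond the observed range of 0 to 9 hours, so the result is only as good as the assumption that growth stays linear. Variable names are illustrative.

```python
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [43, 46, 82, 98, 123, 167, 199, 213, 245, 272]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
s_xx = sum(p * p for p in x) - sum(x) ** 2 / n
s_yy = sum(q * q for q in y) - sum(y) ** 2 / n

b_yx = s_xy / s_xx      # slope of the line of y on x
b_xy = s_xy / s_yy      # slope of the line of x on y

# y on x: y = a_yx + b_yx * x ;  x on y: x = a_xy + b_xy * y
a_yx = y_bar - b_yx * x_bar
a_xy = x_bar - b_xy * y_bar

print(f"y on x: y = {a_yx:.3f} + {b_yx:.3f} x")
print(f"x on y: x = {a_xy:.3f} + {b_xy:.4f} y")

# Estimate y at x = 15 from the line of y on x.
cells_at_15 = a_yx + b_yx * 15
print(f"estimated cells after 15 hours: {cells_at_15:.2f}")   # 432.36
```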
Example 17: The regression lines of a sample are $4x - 5y + 30 = 0$ and $20x - 9y + 107 = 0$.
Find
(1) both regression coefficients, and
(2) r and $\sigma_y$ when $\sigma_x = 3$.
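A possible check in Python is sketched below. It tries both ways of assigning the lines, keeps the assignment whose coefficients satisfy b_yx · b_xy = r² ≤ 1, and then uses b_yx = r σ_y / σ_x to recover σ_y. The function name `identify` is illustrative.

```python
from math import copysign, sqrt

# Each line is stored as (A, B, C), meaning A*x + B*y = C.
line1 = (4.0, -5.0, -30.0)     # 4x - 5y + 30 = 0
line2 = (20.0, -9.0, -107.0)   # 20x - 9y + 107 = 0
sigma_x = 3.0


def identify(l1, l2):
    """Return (b_yx, b_xy) for the assignment with 0 <= b_yx * b_xy <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        b_yx = -y_on_x[0] / y_on_x[1]   # slope when the line is solved for y
        b_xy = -x_on_y[1] / x_on_y[0]   # slope when the line is solved for x
        if 0 <= b_yx * b_xy <= 1:
            return b_yx, b_xy
    raise ValueError("no valid assignment of the two lines")


b_yx, b_xy = identify(line1, line2)
r = copysign(sqrt(b_yx * b_xy), b_yx)

# From b_yx = r * sigma_y / sigma_x.
sigma_y = b_yx * sigma_x / r

print(round(b_yx, 4), round(b_xy, 4))   # 0.8 0.45
print(round(r, 4))                      # 0.6
print(round(sigma_y, 4))                # 4.0
```

The other assignment would give coefficients 20/9 and 5/4, whose product exceeds 1, so it is rejected.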