BMA301 DISTANCE 29TH JULY 2024
Correlation:
A measure of the degree of the relationship between two or more
variables.
We can measure this in three ways:
Graphical technique: SCATTER DIAGRAMS
The paired measurements of the variables are plotted as points on the
same x-y plane and the slope of the resulting pattern is observed
5 levels of Correlation:
Perfect Positive Correlation
This is where the data points are all on an imaginary line sloping
upwards from left to right, i.e. it has a positive slope
Perfect Negative Correlation
This is where the data points are all on an imaginary line sloping
downwards from left to right, i.e. the line has a negative slope
Positive Correlation
This is where the data points, though not on a straight line, are
arranged in a generally upwards sloping design from left to right. That
is, a line drawn through the data points will have a generally upward
slope.
Negative Correlation
This is where the data points, though not on a straight line, are
arranged in a generally downwards sloping design from left to right.
That is, a line drawn through the data points will have a generally
downward (negative) slope.
No Correlation (ZERO Correlation)
Where the points are scattered all over the plane with no pattern
Karl Pearson’s Correlation Coefficient (Parametric Method)
Spearman’s Rank Correlation Coefficient (Non-parametric Method)
Obtain the Correlation Coefficients (Karl Pearson’s and Spearman’s
Rank) for this data:
         Cost (X)   Sales (Y)   a = x - 22   b = y - 45   Sq(a)   Sq(b)   ab
         ('000)     ('000)
         20         30          -2           -15          4       225     30
         40         60          18           15           324     225     270
         20         40          -2           -5           4       25      10
         30         60          8            15           64      225     120
         10         30          -12          -15          144     225     180
         10         40          -12          -5           144     25      60
         20         40          -2           -5           4       25      10
         20         50          -2           5            4       25      -10
         20         30          -2           -15          4       225     30
         30         70          8            25           64      625     200
TOTAL    220        450         0            0            760     1850    900
Here a and b are the deviations from the means, e.g. for the first row
a = 20 - 22 = -2 and b = 30 - 45 = -15.
Mean x = 220/10 = 22 and Mean y = 450/10 = 45
Correlation Coefficient:
$$ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}} = \frac{900}{\sqrt{(760)(1850)}} = 0.759 $$
Indicating a STRONG POSITIVE CORRELATION BETWEEN COSTS AND
SALES
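The calculation above can be checked with a few lines of Python. This is a minimal sketch that follows the deviation-from-the-mean formula used in the table; the function name `pearson_r` and the list names `cost` and `sales` are chosen here for illustration and do not come from the notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations (the table's ab, Sq(a), Sq(b))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Cost and Sales columns (in '000) from the table above
cost = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

r = pearson_r(cost, sales)
print(round(r, 3))  # 0.759
```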
For Spearman’s Rank:
         Cost (X)   Sales (Y)   Rank X   Rank Y   Diff   Sq(Diff)
         ('000)     ('000)
         20         30          5        2        3      9
         40         60          10       8.5      1.5    2.25
         20         40          5        5        0      0
         30         60          8.5      8.5      0      0
         10         30          1.5      2        -0.5   0.25
         10         40          1.5      5        -3.5   12.25
         20         40          5        5        0      0
         20         50          5        7        -2     4
         20         30          5        2        3      9
         30         70          8.5      10       -1.5   2.25
TOTAL                                             0      39
Tied values share the average of the ranks they occupy:
Rank X: 10 gets (1+2)/2 = 1.5; 20 gets (3+4+5+6+7)/5 = 5; 30 gets (8+9)/2 = 8.5
Rank Y: 30 gets (1+2+3)/3 = 2; 40 gets (4+5+6)/3 = 5; 60 gets (8+9)/2 = 8.5
$$ R = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6(39)}{10(100-1)} = 0.764 $$
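The ranking and the rank-difference formula can be sketched in Python as follows. The helper names `average_ranks` and `spearman_rank` are hypothetical, introduced only for this illustration. The sketch applies the same shortcut formula as the notes, even though the data contain ties; a tie-corrected calculation would give a slightly different figure.

```python
def average_ranks(values):
    """Rank from smallest (rank 1) upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rank(xs, ys):
    """Spearman's coefficient via R = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

cost = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

R = spearman_rank(cost, sales)
print(round(R, 3))  # 0.764
```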
Regression Analysis
Whereas the Correlation Coefficient measures the strength and direction
of the relationship between two (or more) variables, it does not give an
equation that expresses the linear relationship between the variables.
Such a linear equation can be used to estimate the value of one
variable (the dependent variable) on the basis of selected value(s) of
the other variable(s) (the independent variable(s)). This technique is
called Regression Analysis.
Regression means stepping back or returning to a previous state.
This is a predictive measure of the nature of the relationship between
variables. Here we have a dependent variable and an independent
variable.
Some definitions of terms used in Regression Analysis:
Regression equation
This is an equation that expresses the linear relationship between
variables usually denoted by:
Y = a + bX
Least Square Principle
This is a technique of determining the regression equation by
minimizing the sum of squares of the vertical distances between the
actual Y values and the predicted Y values.
Slope of a line
This is the ratio of the change in the dependent variable to the
change in the independent variable, given by:
$$ b = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} $$
Y-intercept
This is the point at which the regression line cuts the y-axis. It can also
be defined as the value of the dependent variable when the
corresponding value of the independent variable is zero.
Linear Regression
This is the use of the regression line to provide estimates of the values
of the dependent variable from values of the independent variable.
The regression line is the line that describes the tendency of variables
to regress towards their averages. This line is obtained by plotting the
values of the independent variable against the corresponding values of
the dependent variable estimated using the linear regression equation.
There are always two regression lines: Regression of Y on X and
Regression of X on Y. The Regression line of Y on X will give the
estimates of the variable Y for given values of X, while the Regression
line of X on Y will give the estimates of the variable X for given values of
Y. When there is perfect positive or perfect negative correlation, the
two lines will coincide, giving a single line. The farther apart the two
lines are, the lower the degree of correlation. If there is no
correlation, then the two lines will be perpendicular to each other.
The main uses of regression analysis are:
To provide estimates of the dependent variables based on some
independent variables
To obtain the measure of the error involved in using the
regression line as a basis for estimation. The Standard Error of
Estimate is calculated for this purpose. The less the error the
better the estimate.
To measure the degree of association (correlation) that exists
between the two variables. Here we use the Coefficient of
Determination.
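The last two uses can be illustrated with the advertising cost and sales data from the correlation example above. The notes do not state the formulas, so this sketch assumes the common definitions: Standard Error of Estimate = sqrt(sum of squared residuals / (n - 2)) and Coefficient of Determination = 1 - (sum of squared residuals) / (total sum of squares about the mean). Some texts divide by n rather than n - 2; all variable names are illustrative.

```python
from math import sqrt

cost = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(cost)
mean_x = sum(cost) / n
mean_y = sum(sales) / n

# Least-squares slope and intercept for the regression of Sales on Cost
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cost, sales))
sxx = sum((x - mean_x) ** 2 for x in cost)
b = sxy / sxx
a = mean_y - b * mean_x

# Residual and total sums of squares
predicted = [a + b * x for x in cost]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))
sst = sum((y - mean_y) ** 2 for y in sales)

std_error = sqrt(sse / (n - 2))  # assumed n - 2 divisor
r_squared = 1 - sse / sst        # equals r squared for a single independent variable

print(round(std_error, 2), round(r_squared, 3))
```

Under these assumptions the Coefficient of Determination is the square of Pearson's r, so about 58% of the variation in Sales is accounted for by the regression on Cost.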
Distinction between Regression and Correlation
Coefficient of Correlation measures the degree of relationship
between the variables X and Y, while the Regression Line
measures the nature of relationship between the variables.
Regression analysis investigates the cause and effect, while such
cannot be seen from correlation analysis.
Correlation analysis is only applicable in linear relationships while
Regression is applicable to both linear and curvilinear
relationships
It is possible to have positive or negative correlation based on
calculations while in actual fact there is no correlation, but for
regression it is not possible to have “false” regression.
Regression Coefficient
This is the rate of change in the dependent variable with respect to
changes in the independent variable, Δy/Δx. For example, in the
regression equation
Y = 1.5699 + 0.4769X
the Regression Coefficient is 0.4769: for every unit change in X, there
will be a 0.4769 change in Y.
This is obtained from the formula, for the Regression of Y on X:
$$ b_{yx} = r\frac{s_y}{s_x} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} $$
And for the Regression of X on Y:
$$ b_{xy} = r\frac{s_x}{s_y} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (y-\bar{y})^2} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(y)} $$
We have y=a+bx as the regression equation: x, y are variables, b is the
Regression Coefficient (Slope) and ‘a’ is the constant.
X is the INDEPENDENT variable and y is the DEPENDENT variable
We can use the INDEPENDENT variable to predict the value of the
DEPENDENT once we are able to establish the values of ‘a’ and ‘b’ from
SAMPLE DATA.
The values of ‘a’ and ‘b’ are obtained as follows:
$$ a = \bar{y} - b\bar{x} \quad \text{and} \quad b = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} $$
For example, take the following data, for which Mean x = 16/5 = 3.2 and
Mean y = 29/5 = 5.8:
         X     Y     x - 3.2   y - 5.8   Sq(x - 3.2)   Sq(y - 5.8)   (x - 3.2)(y - 5.8)
         3     5     -0.2      -0.8      0.04          0.64          0.16
         7     8     3.8       2.2       14.44         4.84          8.36
         1     6     -2.2      0.2       4.84          0.04          -0.44
         2     5     -1.2      -0.8      1.44          0.64          0.96
         3     5     -0.2      -0.8      0.04          0.64          0.16
TOTAL    16    29    0         0         20.8          6.8           9.2
$$ b = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{9.2}{20.8} = 0.4423 $$
$$ a = 5.8 - 0.4423(3.2) = 5.8 - 1.4154 = 4.3846 $$
So the Regression Equation is Y = 4.3846 + 0.4423X
If x = 10, then Y = 4.3846 + 0.4423(10) = 8.8076
Once we have the values of ‘a’ and ‘b’, we put them in the equation
y=a+bx and use it to find values of y given x.
If b is negative, then the graph of the equation slopes downwards.
If b is positive, then the graph of the equation slopes upwards.
Properties of the Regression Coefficient:
1. The correlation coefficient can be calculated from the Regression
Coefficients:
$$ r = \pm\sqrt{b_{yx} \cdot b_{xy}} $$
2. If one Regression Coefficient is greater than one, the other must
be less than one.
3. Both Regression Coefficients will have the same sign (either both
negative or positive)
4. The Coefficient of Correlation will have the same sign as that of
the Regression Coefficients
5. Regression Coefficients are independent of change of origin but
not of scale. This means that subtracting an assumed value
from each of the observations does not change the regression
coefficient, but dividing the observations by a constant does.
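These properties can be checked numerically on the advertising cost and sales data used earlier in these notes. The sketch below is illustrative only: the function name `regression_coefficients`, the assumed origin values (20 and 40) and the scale factor (10) are choices made here, not values from the notes.

```python
from math import sqrt, isclose

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / syy

cost = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

b_yx, b_xy = regression_coefficients(cost, sales)

# Properties 1 and 4: r is the square root of the product, with the coefficients' sign
r = sqrt(b_yx * b_xy)  # positive root, since both coefficients are positive here

# Property 5, change of origin: subtract assumed values 20 and 40
shifted = regression_coefficients([x - 20 for x in cost], [y - 40 for y in sales])

# Property 5, change of scale: divide X by 10
scaled = regression_coefficients([x / 10 for x in cost], sales)

print(round(b_yx, 3), round(b_xy, 3), round(r, 3))  # 1.184 0.486 0.759
print(isclose(shifted[0], b_yx), isclose(scaled[0], b_yx))  # True False
```

Here b_yx is greater than one and b_xy is less than one, which is consistent with property 2, and dividing X by 10 multiplies b_yx by 10.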
Method of Least Squares
Consider the following data on some costs of advertising and sales for a
company.
Cost Sales
('000) ('000)
20 30
40 60
20 40
30 60
10 30
10 40
20 40
20 50
20 30
30 70
We may want to find the Regression of the Sales on Costs. If we plot
the Sales/Costs on a Cartesian plane, it would look something like this:
[Scatter diagram: "Costs and Sales in '000", with COSTS ('000) on the horizontal axis and SALES ('000) on the vertical axis]
This kind of diagram is called a scatter diagram and shows the data
spread on the plane. How closely the dots cluster around a straight
line indicates how strong the correlation is.
A regression line would be a line drawn in such a way that the sum of
squares of the vertical distances from the points to the line is least. This
is the principle of getting the regression line via the Least Square
method.
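The principle above can be made concrete with a short derivation. This is a sketch of the standard argument; the symbol S for the sum of squared vertical distances is notation introduced here and is not used elsewhere in these notes.

```latex
S(a, b) = \sum (y - a - bx)^2

\frac{\partial S}{\partial a} = -2 \sum (y - a - bx) = 0
  \;\Longrightarrow\; \sum y = na + b \sum x
  \;\Longrightarrow\; a = \bar{y} - b\bar{x}

\frac{\partial S}{\partial b} = -2 \sum x\,(y - a - bx) = 0
  \;\Longrightarrow\; \sum xy = a \sum x + b \sum x^2

% substituting a = \bar{y} - b\bar{x} into the second equation:
b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}
  = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}
```

Setting both partial derivatives to zero gives the values of a and b for which the sum of squares is least.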
We note that the Regression Equation is a linear equation of the
form:
$$ Y = a + bX, \quad \text{where } a = \bar{Y} - b\bar{X} \text{ and } b = r\frac{s_y}{s_x} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} $$
To calculate these we use the following table:
         Cost     Sales    x - 22   y - 45   Sq(x - 22)   Sq(y - 45)   (x - 22)(y - 45)
         ('000)   ('000)
         20       30       -2       -15      4            225          30
         40       60       18       15       324          225          270
         20       40       -2       -5       4            25           10
         30       60       8        15       64           225          120
         10       30       -12      -15      144          225          180
         10       40       -12      -5       144          25           60
         20       40       -2       -5       4            25           10
         20       50       -2       5        4            25           -10
         20       30       -2       -15      4            225          30
         30       70       8        25       64           625          200
TOTAL    220      450      0        0        760          1850         900
The coefficient of X is called the Regression Coefficient and it is also the
slope of the Regression line. It measures the rate of change in the
variables (the change in one variable corresponding to the change in
the other variable)
In the above case, the Regression Coefficient is obtained by:
$$ b = r\frac{s_y}{s_x} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{900}{760} = 1.184 $$
The value of the y-intercept ‘a’ is obtained by:
$$ a = \bar{Y} - b\bar{X} \implies a = 45 - 1.1842(22) = 18.948 $$
(using b = 900/760 = 1.1842 to four decimal places)
Thus the linear regression equation is:
$$ Y = 18.948 + 1.184X $$
We can use this equation to estimate any value of Y for some given
value of X.
If x = 25, then y = 18.948 + (1.184 × 25) = 48.548
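The whole Least Square calculation for this data can be reproduced with a short Python sketch. The function name `fit_line` is hypothetical, chosen here for illustration. Because the code carries full precision instead of rounding b first, its results can differ from the hand calculation in the third decimal place.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # regression coefficient (slope)
    a = mean_y - b * mean_x  # y-intercept
    return a, b

cost = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
sales = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = fit_line(cost, sales)
estimate = a + b * 25  # estimated Sales when Cost = 25

print(round(b, 3), round(a, 2), round(estimate, 2))  # 1.184 18.95 48.55
```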
Multiple Regression
Multiple Regression occurs when we have two or more independent
variables being used to estimate the values of a dependent variable.
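As an illustration, the Least Square idea extends to two independent variables by solving the normal equations for Y = a + b1*X1 + b2*X2. The sketch below is one possible approach, not a method given in these notes: the function names are hypothetical, and the data are made-up numbers constructed so that Y = 5 + 2*X1 + 3*X2 exactly, which makes the expected answer easy to verify.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * coeffs = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column into pivot position
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_multiple(x1, x2, y):
    """Least-squares a, b1, b2 for y = a + b1*x1 + b2*x2 via the normal equations."""
    columns = [[1.0] * len(y), x1, x2]
    normal = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve_linear_system(normal, rhs)

# Hypothetical illustrative data (not from the notes)
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [5 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b1, b2 = fit_multiple(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Each coefficient b1 and b2 is read in the same way as the single Regression Coefficient: the change in Y for a unit change in that independent variable, with the other held fixed.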
Curve Fitting
Curve fitting in Regression analysis is the process where we use the
regression equation to find the line that “best fits” the data from a
family of possible lines. This is done by finding the ideal values of “a”
and “b”, which are obtained via the Least Square method.
When the Least Square deviations are taken about the mean, we get a
line of best fit.
Practice Questions
1. Explain the concept of regression and explain its significance in
statistical analysis.
2. Distinguish between Regression and Correlation.
3. The following table gives the marks students obtained in two
examination papers: Psychology and Statistics.
Compute the Coefficient of Correlation and find the TWO lines of
regression. Comment on the results.
Psychology 80 45 55 56 58 62 65 68 70 75 85
Statistics 82 56 50 48 60 62 64 65 70 74 90