CBMEC 107 -Business Statistics Module 12.
Regression and Correlation Analysis
Module 12. REGRESSION AND CORRELATION
ANALYSIS
Overview: In this module we shall establish the association or relationship between
variables through the study of regression and correlation analysis.
Learning Outcomes: At the end of this module, students should be able to:
Construct a regression line
Formulate a prediction equation using least squares estimation
Perform at least simple linear regression and correlation analysis
Indicative Contents:
Regression Line
Scatter Diagram
Least Squares Estimation
Correlation
Coefficient of Correlation
Interpretation of the Coefficient of Correlation
The Pearson’s Product-Moment Coefficient of Correlation
Spearman’s Rank-Order Correlation
Regression Analysis
Regression analysis is a statistical technique used for determining the probable form of the
relationship between variables. The ultimate objective when using this method of analysis is usually to
predict or estimate the value of one variable corresponding to a given value of another variable.
Simple regression analysis is a form of linear regression that uses only one independent variable X to
predict the dependent variable Y. Objective: to find the possible relationship between two variables X
and Y, where X and Y are paired variables.
Regression Line
We assume that the value of X is known in advance and that the value assumed by Y depends in part on
the particular value of X under consideration. Y is called the dependent or response variable, while
the variable X, whose value is used to help predict the behavior of Y, is called the independent or
predictor variable, or the regressor.
In the simple linear regression model
$$y = \alpha + \beta x$$
where α denotes the intercept and β the slope of the regression line.
Scatter Diagram
Scatter diagrams (also called scatter plots, scatter graphs, and correlation charts) are similar to line
graphs. A line graph uses a line on an X-Y axis to plot a continuous function, while a scatter plot uses
dots to represent individual pieces of data.
In statistics, these plots are useful to see if two variables are related to each other. For example, a
scatter chart can suggest a linear relationship (i.e. a straight line).
Scatter diagrams are useful for determining the relationship
between two variables. This relationship can be between
two causes, or a cause and an effect, etc. It can be
positive, negative, or absent altogether. We draw the graph
with two variables: the first variable is independent, and
the second variable depends on the first (Figure 1).
Figure 1. Scatter Diagram
Types of Scatter Diagram
a) Scatter Diagram with No Correlation
o This diagram is also known as “Scatter Diagram with Zero Degree of Correlation”.
o Here, the data point spread is so random that you cannot draw a line through them.
o Therefore, you can say that these variables have no correlation.
b) Scatter Diagram with Moderate Correlation
o This diagram is also known as “Scatter Diagram with a Low Degree of Correlation”.
o Here, the data points are a little closer and you can see that some kind of relationship exists between
these variables.
c) Scatter Diagram with Strong Correlation
o This diagram is also known as “Scatter Diagram with a High Degree of Correlation”.
o In this diagram, data points are close to each other and you can draw a line by following their pattern.
o In this case, you say that these variables are closely related.
a) No Correlation b) Moderate Correlation c) Strong Correlation
Figure 2. Types of Scatter Diagram
Example 1
Below is a summary of the amount of solvent compound used and the drying time.
Amount (ml)        17  11   10  18  20    5  22  14  17  25  22   8  11  18  21
Time (in minutes)  56  50  120  70  80  120  30  45  55  60  64  56  76  48  92
Solution:
Draw the scatter diagram (Figure 3)
Figure 3. Scatter diagram of hypothetical data on the amount of solvent compound
and the drying time.
Least Squares Estimation
The parameters α and β are estimated by the method of least squares. From the many straight lines
that can be drawn through a scatter diagram, we choose the one that “best fits” the data. The estimated
regression line takes the form
$$\hat{y} = a + bx$$
If we let $e_i$ denote the vertical distance from a point $(x_i, y_i)$ to the estimated regression line,
then each data point satisfies the equation
$$y_i = a + b x_i + e_i$$
The term $e_i$ is called the residual. Figure 4 illustrates this idea.
[Figure: five data points (x1, y1) through (x5, y5) scattered about the estimated regression line, with
the residuals e1 through e5 drawn as vertical distances from each point to the line]
Figure 4. The least-squares procedure minimizes the sum of the squares of the residuals $e_i$.
The residual for a data point that lies above the estimated regression line is positive; for a
point that lies below the regression line, the residual is negative. If the residuals from the
least-squares line are summed, the negative and the positive values cancel one another and the sum is
always zero.
The estimates for α and β can be solved easily by
$$a = \bar{y} - b\bar{x}$$
$$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$$
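The two formulas above, and the claim that the residuals sum to zero, can be checked with a short sketch. It uses the Example 1 data; the function name `least_squares` and the variable names are illustrative choices, not part of the module.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line y = a + b*x using the sum formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = y-bar minus b times x-bar
    return a, b


# Example 1 data: amount of solvent compound (ml) and drying time (minutes)
amount = [17, 11, 10, 18, 20, 5, 22, 14, 17, 25, 22, 8, 11, 18, 21]
minutes = [56, 50, 120, 70, 80, 120, 30, 45, 55, 60, 64, 56, 76, 48, 92]

a, b = least_squares(amount, minutes)

# Residual e_i = observed y_i minus the fitted value a + b*x_i
residuals = [y - (a + b * x) for x, y in zip(amount, minutes)]

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

The printed sum is zero up to floating-point rounding. This holds for the least-squares line specifically; an arbitrary line drawn through the scatter diagram would not generally have this property.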
The graphs of the regression line are shown below, with relative positive, negative, and zero
slopes.
[Figure: three panels showing a regression line with positive slope (b > 0), negative slope (b < 0),
and zero slope (b = 0)]
Figure 5. Relative positive, negative, and zero slopes for a regression line.
Example 2
The relationship between energy consumption and household income was studied, yielding the
following data on household income X (in units of P1,000/month) and energy consumption Y (in
kilowatt-hours/month). See the data in Table 1 below.
Table 1
Household    Household Income (x)    Energy Consumption (y)
 1            70                      200
 2            85                      175
 3            45                      100
 4            56                      120
 5            60                       80
 6           100                      350
 7            93                      255
 8            81                      400
 9            48                       70
10           115                      450
11            90                      320
12            57                      125
Summary statistics for these data are:
n = 12
$\sum x = 900$    $\sum x^2 = 72{,}974$    $\bar{x} = 75$    $\sigma_x = 21.36$
$\sum y = 2{,}645$    $\sum y^2 = 774{,}375$    $\bar{y} = 220.42$    $\sigma_y = 126.28$
$\sum xy = 227{,}045$
To estimate the simple linear regression line, we estimate the slope b and the intercept a. These
estimates are
$$b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} = \frac{(12)(227{,}045) - (900)(2{,}645)}{(12)(72{,}974) - (900)^2} = 5.24$$
$$a = \bar{y} - b\bar{x} = 220.42 - 5.24(75) = -172.58$$
Hence the estimated regression equation is
y = - 172.58 + 5.24x
The graph of this equation is shown in Figure 6.
[Figure: the estimated regression line plotted with household income on the horizontal axis and energy
consumption on the vertical axis]
Figure 6. A graph of the estimated line of regression of Y, the energy consumption, on X,
the household income.
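As a check on the arithmetic in Example 2, the sketch below recomputes the sums from Table 1 and applies the slope and intercept formulas to them. The helper name `fit_from_sums` is an illustrative assumption, not something defined in the module.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Least-squares intercept a and slope b computed from summary sums."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


# Table 1: household income (P1,000/month) and energy consumption (kWh/month)
income = [70, 85, 45, 56, 60, 100, 93, 81, 48, 115, 90, 57]
energy = [200, 175, 100, 120, 80, 350, 255, 400, 70, 450, 320, 125]

n = len(income)
sum_x = sum(income)
sum_y = sum(energy)
sum_xy = sum(x * y for x, y in zip(income, energy))
sum_x2 = sum(x * x for x in income)

a, b = fit_from_sums(n, sum_x, sum_y, sum_xy, sum_x2)
print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}, sum x^2 = {sum_x2}")
print(f"b = {b:.4f}, a = {a:.4f}")
```

The sums reproduce the summary statistics given above, and the slope rounds to 5.24. With the unrounded slope the intercept comes out near -172.39 rather than -172.58; the small difference arises because the worked solution rounds b to 5.24 before computing a.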
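To predict the energy consumption when the household income is P200,000 per month, we
substitute the value 200 for x in the equation
y = -172.58 + 5.24x
to obtain y = -172.58 + 5.24(200) = 875.42 kilowatt-hours per month. Because x = 200 lies outside the
range of incomes observed in Table 1 (45 to 115), this prediction is an extrapolation and should be
treated with caution.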
Correlation
It is desirable to observe and measure the association which occurs between two statistical series. For
example, it is desirable to know whether there is a relationship between changes in the cost of living
and changes in wages; between the grades on an examination and the intelligence quotient of a group of
students; between the academic material retained in memory and the interval of time since it was
learned; and between many other similarly associated data.
The relationship between two sets of data may be established and measured by means of the correlation
method.
Coefficient of Correlation
The coefficient of correlation is used as the comparative measure of association. The coefficient of
correlation has the limits -1.00 to +1.00. A value of +1 or -1 indicates a perfect positive or negative
linear relationship, respectively. A value of 0 indicates no linear relationship. When this happens, we
say that X and Y are uncorrelated.
The coefficient of correlation will be positive or negative: positive when the compared variables are
directly proportional, i.e., as one variable increases the other variable also increases, or as one variable
decreases the other variable also decreases; negative when the compared variables are inversely
proportional, that is to say, an increase in the value of X results in a corresponding decrease in the
value of Y. Under these circumstances the line of regression slopes downward.
Figure 7. Perfect Positive Correlation
Figure 8. Perfect Negative Correlation
Figure 9. Uncorrelated: points indicate a relationship between x and y, but the relationship is not linear
Figure 10. Uncorrelated: points are randomly scattered
Interpretation of the Coefficient of Correlation
Table 2
Coefficient of Correlation    Interpretation
0 to ±0.20                    Negligible relationship
±0.21 to ±0.40                Slight relationship
±0.41 to ±0.70                Moderate relationship
±0.71 to ±0.90                Marked or high relationship
±0.91 to ±1.00                Very high to perfect relationship
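The lookup in Table 2 can be sketched as a small function. The name `interpret_correlation` is hypothetical, and the handling of boundary values is an assumption: each band is taken to end at its upper limit, so a magnitude of exactly 0.20 is read as negligible and anything above 0.40 as at least moderate.

```python
def interpret_correlation(r):
    """Map a correlation coefficient to the verbal labels of Table 2.

    Assumption: each band includes its upper limit; only the size of r
    matters for the label, the sign gives the direction separately.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)
    if size <= 0.20:
        return "Negligible relationship"
    if size <= 0.40:
        return "Slight relationship"
    if size <= 0.70:
        return "Moderate relationship"
    if size <= 0.90:
        return "Marked or high relationship"
    return "Very high to perfect relationship"


for r in (0.05, 0.24, -0.55, -0.85, 0.97):
    direction = "positive" if r > 0 else "negative"
    print(f"r = {r:+.2f}: {interpret_correlation(r)} ({direction})")
```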
The Pearson’s Product-Moment Coefficient of Correlation
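The Pearson’s Product-Moment coefficient of correlation measures the linear relationship of two
variables, defined by
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
where
$$\sigma_{xy} = \frac{\sum xy}{N} - \left(\frac{\sum x}{N}\right)\left(\frac{\sum y}{N}\right)$$
The standard deviation of x,
$$\sigma_x = \sqrt{\frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2}$$
The standard deviation of y,
$$\sigma_y = \sqrt{\frac{\sum y^2}{N} - \left(\frac{\sum y}{N}\right)^2}$$
Consolidating the formulas,
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{N \sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[N \sum x^2 - \left(\sum x\right)^2\right]\left[N \sum y^2 - \left(\sum y\right)^2\right]}}$$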
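The consolidation step can be sketched as follows. Here $\sigma_{xy}$ is used as a label for the covariance term in the numerator of r (a notational assumption made for this sketch). Putting each quantity over a common denominator gives

$$\sigma_{xy} = \frac{N \sum xy - \sum x \sum y}{N^2}, \qquad \sigma_x = \frac{\sqrt{N \sum x^2 - \left(\sum x\right)^2}}{N}, \qquad \sigma_y = \frac{\sqrt{N \sum y^2 - \left(\sum y\right)^2}}{N}$$

The product $\sigma_x \sigma_y$ therefore carries a factor $1/N^2$, which cancels the $1/N^2$ in $\sigma_{xy}$ and leaves the consolidated expression written in terms of sums only.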
Example 3
Table 3
x       y       xy      x²       y²
36      21      756     1296     441
42      18      756     1764     324
37      15      555     1369     225
31      11      341      961     121
25      15      375      625     225
28       9      252      784      81
33      10      330     1089     100
28      20      560      784     400
42      16      672     1764     256
39      11      429     1521     121
38      21      798     1444     441
40      14      560     1600     196
Total:
419     181     6384    15001    2931
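Computation of the coefficient of correlation:
$$\sigma_{xy} = \frac{\sum xy}{N} - \left(\frac{\sum x}{N}\right)\left(\frac{\sum y}{N}\right) = \frac{6384}{12} - \left(\frac{419}{12}\right)\left(\frac{181}{12}\right) = 532 - (34.92)(15.08) = 5.41$$
The standard deviation of x,
$$\sigma_x = \sqrt{\frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2} = \sqrt{\frac{15001}{12} - \left(\frac{419}{12}\right)^2} = \sqrt{1250.08 - 1219.41} = \sqrt{30.67} = 5.54$$
The standard deviation of y,
$$\sigma_y = \sqrt{\frac{\sum y^2}{N} - \left(\frac{\sum y}{N}\right)^2} = \sqrt{\frac{2931}{12} - \left(\frac{181}{12}\right)^2} = \sqrt{244.25 - 227.51} = \sqrt{16.74} = 4.09$$
Then, the coefficient of correlation is
$$r = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = \frac{5.41}{(5.54)(4.09)} = \frac{5.41}{22.66} = 0.24$$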
There is a slight positive relationship between the two variables.
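The consolidated formula can be applied directly to the Table 3 data as a cross-check. This is a minimal sketch; the function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation using the consolidated sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Table 3 data
x = [36, 42, 37, 31, 25, 28, 33, 28, 42, 39, 38, 40]
y = [21, 18, 15, 11, 15, 9, 10, 20, 16, 11, 21, 14]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Carrying full precision gives r of roughly 0.235, slightly below the 0.24 obtained above with intermediate values rounded to two decimal places. Either way the value falls in the slight-relationship band.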
Spearman’s Rank-Order Correlation
The strength of the relationship between two ranked variables can be measured by the Spearman’s
Rank-Order coefficient of correlation, defined by
$$r = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the two ranks of the i-th item and n is the number of paired ranks.
Example 4
The tables below show the scores and the judge’s ranks of 8 contestants.
Table 4
Ranks and Scores of 8 Contestants
Contestant       1    2    3    4    5    6    7    8
Judge’s Rank     3    1    6    2    4    7    8    5
Score           26   40   52   25   20   60   37   48
Table 5
Ranks of the Data in Table 4 (scores ranked from highest = 1 to lowest = 8)
Contestant           1    2    3    4    5    6    7    8
Judge’s Rank (xi)    3    1    6    2    4    7    8    5
Score’s Rank (yi)    6    4    2    7    8    1    5    3
Table 6
Differences and Squares of Differences for the Contestant Ranks
Contestant    xi    yi    di = xi - yi    di²
1             3     6     -3               9
2             1     4     -3               9
3             6     2      4              16
4             2     7     -5              25
5             4     8     -4              16
6             7     1      6              36
7             8     5      3               9
8             5     3      2               4
Total                                     ∑di² = 124
Substituting values into the formula for r, we have
$$r = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(124)}{8(64 - 1)} = 1 - 1.476 = -0.476$$
There is a moderate inverse relationship between the judge’s ranks and the contestants’ scores.
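The rank-order formula can be sketched in a few lines. The data below are hypothetical and chosen only to illustrate the mechanics; the helper names `rank_descending` and `spearman_r` are assumptions, and the ranking helper assumes there are no tied values.

```python
def rank_descending(values):
    """Rank values so that the largest gets rank 1 (assumes no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_r(x_ranks, y_ranks):
    """Spearman rank-order coefficient: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x_ranks)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical example: five contestants, one judge's ranks and raw scores
judge_ranks = [1, 2, 3, 4, 5]
scores = [88, 93, 70, 75, 60]

score_ranks = rank_descending(scores)  # highest score receives rank 1
r = spearman_r(judge_ranks, score_ranks)
print(f"score ranks = {score_ranks}, r = {r:.3f}")

# Limiting cases: identical rankings and fully reversed rankings
print(spearman_r([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
print(spearman_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))
```

Identical rankings give +1 and fully reversed rankings give -1, which matches the limits of the coefficient of correlation. Each rank from 1 to n should appear exactly once in each list when there are no ties, which is a quick check worth making before applying the formula.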
SUGGESTED ENRICHMENT ACTIVITY
Watch video clips on relevant topics on YouTube or elsewhere on the internet.
CBMEC 107: BUSINESS STATISTICS
EXERCISE NO. 12
REGRESSION AND CORRELATION ANALYSIS
Name: ________________________________ Date: __________________
Course and Year: _________________________ Rating: _________________
Directions: Answer the following problems on a separate answer sheet. Use black pen only.
PROBLEMS:
1. A student wants to determine how grades in College Algebra and in Statistics are related. From a
random sample of eight students, she obtained the following scores.
Student Algebra, X Statistics, Y
1 85 84
2 67 68
3 52 60
4 60 65
5 91 82
6 91 87
7 60 62
8 75 77
Calculate the coefficient of correlation using the Pearson Product Moment coefficient of
correlation.
2. The values in X below are hours spent studying, and the values in Y are grades on a test.
X = {3.2, 3.0, 1.0, 2.5, 1.9, 1.6, 3.1, 3.5, 4.2, 3.0}
Y = { 90, 88, 57, 86, 79, 71, 84, 97, 90, 91}
The Pearson Product Moment Correlation Coefficient of the above data is
A. 0.706
B. 0.737
C. 0.803
D. 0.889
CBMEC 107: BUSINESS STATISTICS
EXERCISE NO. 12
REGRESSION AND CORRELATION ANALYSIS
Name: ________________________________ Date submitted __________________
SOLUTIONS TO PROBLEMS
REVIEW QUESTIONS
REGRESSION AND CORRELATION ANALYSIS
Multiple Choice Questions. Encircle the letter of the best answer
1. $r^2$ is known as
A. Coefficient of determination
B. Multiple correlation coefficient
C. Partial correlation coefficient
D. Semi-partial correlation coefficient
2. The sum of the residuals for all data points that lie in the regression is always
A. negative
B. positive
C. zero
D. none of the above
3. Correlation coefficients lie between
A. -2 and +2
B. -1 and +1
C. 0 and 1
D. -1 and 0
4. In the regression equation y = a + bx, x is called
A. regressor
B. predictor
C. independent variable
D. Any of these
5. The estimation of the linear regression line is dependent on
A. origin and intercept
B. slope and intercept
C. horizontal and vertical axes
D. point and slope
6. When the compared variables are inversely proportional, the line of regression
A. is horizontal
B. slopes downward
C. slopes upward
D. is vertical
7. The least squares method minimizes which of the following?
A. sum of residuals
B. sum of squared residuals
C. sum of squares error
D. total sum of squares
8. Spearman rank-order correlation is used for what type of data?
A. Nominal
B. Ordinal
C. Interval
D. Ratio
9. The lines of regression intersect at the point
A. (X, Y)
B. $(\bar{x}, \bar{y})$
C. (0, 0)
D. (1, 1)
10. Regression coefficient is independent of
A. Origin
B. Scale
C. Both origin and scale
D. Neither origin nor scale
11. If the two lines of regression are perpendicular to each other, the correlation coefficient r is:
A. 0
B. −1
C. 1
D. Nothing can be said
12. Which of the following statements about outliers is not true?
A. Outliers are values very different from the rest of the data.
B. Influential cases will always show up as outliers.
C. Outliers have an effect on the mean.
D. Outliers have an effect on regression parameters.
13. What is $b_0$ in regression analysis?
A. The value of the outcome when all the predictors are 0.
B. The relationship between a predictor and the outcome variable.
C. The value of the predictor variable when the outcome is zero.
D. The gradient of the regression line.
14. What is the degree of freedom used to calculate the test statistic t for a correlation test?
A. n (n − 1)
B. n − 1
C. n − 2
D. n − 3
15. If $\rho = 0$, the lines of regression are:
A. Coincident
B. Parallel
C. Perpendicular to each other
D. None of the above
16. The estimate of β in the regression equation Y = α + βX + e by the method of least squares is:
A. Biased
B. Unbiased
C. Consistent
D. Efficient
17. Which of the following is not one of the assumptions required for the t test for determining
whether the correlation is significant?
A. The data are interval or ratio level.
B. The variances are equal, or $\sigma_1^2 = \sigma_2^2$.
C. The two variables are distributed as a bivariate normal distribution.
D. All three are required assumptions.
18. Which of the following is not true regarding the error term ε?
A. Individual values of the error term $\varepsilon_i$ are statistically dependent on each other.
B. For a given value of x, there can exist many values of $\varepsilon_i$.
C. The distribution of possible $\varepsilon_i$ values for any x value is normal.
D. The distribution of possible $\varepsilon_i$ values has equal variance for all values of x.
19. An investigator reports that the arithmetic mean of two regression coefficients of a regression
line is 0.7 and the correlation coefficient is 0.75. The investigation results are:
A. Valid
B. Invalid
C. Inconclusive
D. None of these
20. Homogeneity of three or more population correlation coefficients can be tested by
A. t-test
B. Z-test
C. $\chi^2$-test
D. F-test
21. In multiple linear regression analysis, the square root of Mean Squared Error (MSE) is called the:
A. Multiple correlation coefficient
B. Standard error of estimate
C. Coefficient of determination
D. None of these
PROBLEMS
1. Calculate the test statistic t for a correlation hypothesis test when the sample correlation coefficient
is r = 0.889 and the sample size is n = 10.
A. 5.337
B. 5.491
C. 5.519
D. 5.664
2. Compute the slope of the regression equation based on these sample data.
X = {3.2, 3.0, 1.0, 2.5, 1.9, 1.6, 3.1, 3.5, 4.2, 3.0}
Y = { 90, 88, 57, 86, 79, 71, 84, 97, 90, 91}
A. 9.638
B. 10.144
C. 10.835
D. 11.169
3. Compute the y intercept of the regression equation based on these sample data.
X = {3.2, 3.0, 1.0, 2.5, 1.9, 1.6, 3.1, 3.5, 4.2, 3.0}
Y = { 90, 88, 57, 86, 79, 71, 84, 97, 90, 91}
A. 52.338
B. 54.045
C. 55.159
D. 56.779