Module Overview
Purpose of this Module
The purpose of this module is to discuss the relationships between
variables using linear regression and correlation.
Linear Regression and Correlation
Linear regression, in its simplest form, is a technique for examining the association
between one independent (or predictor) variable and one continuous
dependent (or outcome) variable. In correlation analysis, we estimate
a sample correlation coefficient, more specifically the Pearson product-moment
correlation coefficient. The sample correlation coefficient, denoted r, ranges between
-1 and +1 and quantifies the direction and strength of the linear association between
the two variables. The correlation between two variables can be positive (i.e., higher
levels of one variable are associated with higher levels of the other) or negative (i.e.,
higher levels of one variable are associated with lower levels of the other).
Module guide
This module discusses how to determine whether two variables are related
to each other by using the least-squares regression line and the linear
correlation coefficient.
Module Outcome/s
After this learning module, students will be able to:
1. Use linear regression to predict the value of a variable given certain
conditions.
2. Apply correlation to determine the relationship between two
variables.
3. Articulate the importance of mathematics in one’s life.
4. Express appreciation for mathematics as a human endeavor.
Module Requirements
By the end of this module, students will submit the
activities provided below.
Course Pre-Assessment
Matching Type: From the box, choose the type of linear correlation shown in each
scatter diagram.
1. _________________ 4. _________________
2. _________________ 5. _________________
3. _________________ 6. _________________
Choices:
Perfect positive correlation
Strong positive correlation
Positive correlation
Negative correlation
Strong negative correlation
Little or no linear correlation
Key Terms
Linear Regression - a linear approach to modeling the relationship between a scalar
response and one or more explanatory variables. The case of one
explanatory variable is called simple linear regression; for more
than one, the process is called multiple linear regression.
Correlation - also called dependence; any statistical relationship, whether causal or not,
between two random variables or bivariate data. In the broadest sense,
correlation is any statistical association, though it commonly refers to
the degree to which a pair of variables are linearly related.
Linear Regression
In many applications, scientists
try to determine whether two variables are related. If
they are related, the scientists then try to find an
equation that can be used to model the relationship. For
instance, the zoology professor R. McNeill Alexander wanted to determine whether
the stride length of a dinosaur, as shown by its fossilized footprints, could be used
to estimate the speed of the dinosaur. Stride length for an animal is defined as the
distance x from a particular point on a footprint to that same point on the next
footprint of the same foot. (See the figure at the right.) Because no dinosaurs were
available, Alexander and fellow scientist A. S. Jayes carried out experiments with
many types of animals, including adult men, dogs, camels, ostriches, and elephants.
The results of these experiments tended to support the idea that the speed y of an
animal is related to the animal’s stride length x. To better understand this
relationship, examine the data in Table 1, which are similar to, but less extensive
than, the data collected by Alexander and Jayes.
TABLE 1: Speed for Selected Stride Lengths
A graph of the ordered pairs in Table 1 is shown in Figure 1. In this graph,
which is called a scatter diagram or scatter plot, the x-axis represents the stride
lengths in meters and the y-axis represents the average speeds in meters per
second. The scatter diagram seems to indicate that for each of the three species, a
larger stride length generally produces a faster speed. Also note that for each
species, a straight line can be drawn such that all of the points for that species lie
on or very close to the line. Thus the relationship between speed and stride length
appears to be a linear relationship.
FIGURE 1: Scatter diagram for Table 1
After a relationship between paired data, which are referred to as bivariate
data, has been discovered, a scientist tries to model the relationship with an
equation. One method of determining a linear relationship for bivariate data is called
linear regression. To see how linear regression is carried out, let us concentrate on
the bivariate data for the dogs, which are shown by the green points in Figures 1 and
2. There are many lines that can be drawn such that the data points lie close to the
line; however, scientists are generally interested in the line called the line of best fit
or the least-squares regression line.
FIGURE 2: Vertical deviations
The least-squares regression line is also called the least-squares line. The
approximate equation of the least-squares line for the bivariate data for the dogs is
ŷ = 3.2x - 1.1. Figure 2 shows the graph of these data and the graph of ŷ = 3.2x - 1.1. In
Figure 2, the vertical deviations from the ordered pairs to the graph of ŷ = 3.2x - 1.1
are 0, -0.06, 0.5, -0.52, -0.16, -0.6, 0.34, and 0.2.
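The name "least squares" refers to the sum of the squares of these vertical deviations: the least-squares line is the line that makes this sum as small as possible. The short Python sketch below computes that sum for the deviations listed above. It is only an illustration; the variable names are not part of the module, and because the line ŷ = 3.2x - 1.1 is itself rounded, the result is approximate.

```python
# Vertical deviations of the dog data from the line y-hat = 3.2x - 1.1,
# copied from the paragraph above.
deviations = [0, -0.06, 0.5, -0.52, -0.16, -0.6, 0.34, 0.2]

# Square each deviation so that positive and negative deviations
# cannot cancel each other out.
squared = [d ** 2 for d in deviations]

# The least-squares line is the line that minimizes this total.
sum_of_squares = sum(squared)
print(round(sum_of_squares, 4))  # 1.0652
```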
It is traditional to use the symbol ŷ (pronounced y-hat) in place of y in the
equation of a least-squares line. This also helps us differentiate the line’s y-values
from the y-values of the given ordered pairs.
The next formula can be used to determine the equation of the least-squares
line for a given set of ordered pairs.
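In standard notation, and using the same symbols that the next paragraph describes, the least-squares line for n ordered pairs can be written as follows. This is the usual textbook form of the formula, given here as a reference.

```latex
\hat{y} = ax + b, \qquad
a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}}, \qquad
b = \bar{y} - a\bar{x}
```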
In the formula for the least-squares regression line, ∑x represents
the sum of all the x values, ∑y represents the sum of all the y values, and ∑xy
represents the sum of the n products x₁y₁, x₂y₂, ..., xₙyₙ. The notation x̅ represents
the mean of the x values, and y̅ represents the mean of the y values. The following
example illustrates a procedure that can be used to efficiently calculate the sums
needed to find the equation of the least-squares line for a given set of data.
Example 1: Find the Equation of a Least-Squares Line
Find the equation of the least-squares line for the adult men ordered pairs in
Table 1.
Solution:
The ordered pairs are (2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0,
7.7), (4.2, 8.3), (4.5, 8.7). The number of ordered pairs is n = 8. Organize the data
in four columns, as shown in Table 2. Then find the sum of each column.
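Table 2
   x       y       x²       xy
  2.5     3.4     6.25     8.50
  3.0     4.9     9.00    14.70
  3.3     5.5    10.89    18.15
  3.5     6.6    12.25    23.10
  3.8     7.0    14.44    26.60
  4.0     7.7    16.00    30.80
  4.2     8.3    17.64    34.86
  4.5     8.7    20.25    39.15
∑x = 28.8   ∑y = 52.1   ∑x² = 106.72   ∑xy = 195.86
Find the slope a.
a = (n∑xy - (∑x)(∑y)) / (n∑x² - (∑x)²)
  = ((8)(195.86) - (28.8)(52.1)) / ((8)(106.72) - (28.8)²)
  ≈ 2.7303
Find x̅ and y̅.
x̅ = ∑x / n = 28.8 / 8 = 3.6
y̅ = ∑y / n = 52.1 / 8 = 6.5125
Find the y-intercept b.
b = y̅ - a x̅ ≈ 6.5125 - (2.7303)(3.6) ≈ -3.32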
If a and b are each rounded to the nearest tenth, to
reflect the accuracy of the original data, then we have
as our equation of the least-squares line:
ŷ = ax + b
ŷ ≈ 2.7x - 3.3
See Figure 3.
FIGURE 3: Least-squares line for speed versus stride
length in adult men
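The result of Example 1 can be cross-checked with a few lines of Python. The sketch below is one possible way to do this, not part of the original module: the function name least_squares_line is chosen only for illustration, and the code applies the standard formulas a = (n∑xy - (∑x)(∑y)) / (n∑x² - (∑x)²) and b = y̅ - a x̅ to the adult men data.

```python
# Adult men data from Example 1: (stride length in m, speed in m/s).
pairs = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
         (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]


def least_squares_line(pairs):
    """Return (a, b) for the least-squares line y-hat = a*x + b."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_xy = sum(x * y for x, y in pairs)

    # Slope: a = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: b = y-bar - a * x-bar
    b = sum_y / n - a * (sum_x / n)
    return a, b


a, b = least_squares_line(pairs)
print(round(a, 1), round(b, 1))  # 2.7 -3.3
```

Rounded to the nearest tenth, the output agrees with the equation ŷ ≈ 2.7x - 3.3 found in the example.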
Example 2: Use a Least-Squares Line to Make a
Prediction
Use the equation of the least-squares line from Example
1 to predict the average speed of an adult man for each
of the following stride lengths. Round your results to
the nearest tenth of a meter per second.
a. 2.8 m b. 4.8 m
Solution:
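a. In Example 1, we found the equation of the least-squares line to be ŷ = 2.7x - 3.3.
Substituting 2.8 for x gives
ŷ = 2.7(2.8) - 3.3 = 4.26
Rounding 4.26 to the nearest tenth gives a predicted average speed of 4.3 m/s.
b. Substituting 4.8 for x gives
ŷ = 2.7(4.8) - 3.3 = 9.66
Rounding 9.66 to the nearest tenth gives a predicted average speed of 9.7 m/s.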
The procedure in Example 2a made use of an equation to
determine a point between given data points. This
procedure is referred to as interpolation. In Example 2b,
an equation was used to determine a point to the right of
the given data points. The process of using an equation to
determine a point to the right or left of given data points
is referred to as extrapolation. See Figure 4.
FIGURE 4: Interpolation and extrapolation
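The distinction between interpolation and extrapolation can be made concrete with a small Python sketch. It uses the rounded line ŷ = 2.7x - 3.3 from Example 1 and the adult men stride lengths; the function name predict_speed and the constant names are illustrative assumptions, not part of the module.

```python
# Rounded least-squares line from Example 1: y-hat = 2.7x - 3.3.
A, B = 2.7, -3.3

# Stride lengths (m) of the adult men data used to find the line.
STRIDES = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]


def predict_speed(stride):
    """Return (predicted speed in m/s, kind of prediction)."""
    speed = round(A * stride + B, 1)
    # A stride inside the range of the data is an interpolation;
    # a stride to the left or right of the data is an extrapolation.
    if min(STRIDES) <= stride <= max(STRIDES):
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return speed, kind


print(predict_speed(2.8))  # (4.3, 'interpolation')
print(predict_speed(4.8))  # (9.7, 'extrapolation')
```

Extrapolated values should generally be treated with more caution than interpolated ones, because nothing in the data confirms that the linear pattern continues outside the observed stride lengths.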
Linear Correlation Coefficient
To determine the strength of a linear relationship between two variables,
statisticians use a statistic called the linear correlation coefficient, which is
denoted by the variable r and is defined as follows.
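For n ordered pairs, the standard computational form of the linear correlation coefficient, written with the same sums used for the least-squares line plus ∑y², is given below as a reference.

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```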
If the linear correlation coefficient r is positive, the
relationship between the variables has a positive
correlation. In this case, if one variable increases,
the other variable also tends to increase. If r is
negative, the linear relationship between the
variables has a negative correlation. In this case, if
one variable increases, the other variable tends to
decrease. Figure 5 shows some scatter diagrams
along with the type of linear correlation that exists between the x and y variables.
The closer |r| is to 1, the stronger the linear relationship between the variables.
FIGURE 5: Linear correlation
Example 3: Find a Linear Correlation Coefficient
Find the linear correlation coefficient for stride length versus speed of an adult
man. Use the data in Table 1. Round your result to the nearest hundredth.
Solution:
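The ordered pairs are (2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6), (3.8, 7.0), (4.0, 7.7),
(4.2, 8.3), (4.5, 8.7). The number of ordered pairs is n = 8. In Table 2 we found:
∑x = 28.8   ∑y = 52.1   ∑x² = 106.72   ∑xy = 195.86
The only additional value that is needed is
∑y² = (3.4)² + (4.9)² + (5.5)² + ... + (8.7)² = 362.25
Substituting the above values into the equation for the linear correlation coefficient
gives us
r = (n∑xy - (∑x)(∑y)) / (√(n∑x² - (∑x)²) · √(n∑y² - (∑y)²))
  = ((8)(195.86) - (28.8)(52.1)) / (√((8)(106.72) - (28.8)²) · √((8)(362.25) - (52.1)²))
  ≈ 0.9937
To the nearest hundredth, the linear correlation coefficient is 0.99.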
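The value of r in Example 3 can be checked with a short Python sketch. This is an illustration only: the function name correlation_coefficient is an assumption, and the code uses the standard computational formula r = (n∑xy - (∑x)(∑y)) / (√(n∑x² - (∑x)²) · √(n∑y² - (∑y)²)).

```python
import math

# Adult men data from Table 1: (stride length in m, speed in m/s).
pairs = [(2.5, 3.4), (3.0, 4.9), (3.3, 5.5), (3.5, 6.6),
         (3.8, 7.0), (4.0, 7.7), (4.2, 8.3), (4.5, 8.7)]


def correlation_coefficient(pairs):
    """Return the linear correlation coefficient r for (x, y) pairs."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation_coefficient(pairs)
print(round(r, 2))  # 0.99
```

Because r is close to 1, the sketch supports the conclusion that stride length and speed have a strong positive linear relationship for this data set.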
The linear correlation coefficient indicates the strength of a linear relationship
between two variables; however, it does not indicate the presence of a cause-and-
effect relationship. For instance, the data in Table 3 show the hours per week that a
student spent playing pool and the student’s weekly algebra test scores for those
same weeks.
TABLE 3: Algebra Test Scores vs. Hours Spent Playing Pool
The linear correlation coefficient for the ordered pairs in the table is r ≈ 0.98. Thus
there is a strong positive linear relationship between the student’s algebra test
scores and the time the student spent playing pool. This does not mean that the
higher algebra test scores were caused by the increased time spent playing pool. The
fact that the student’s test scores increased with the increase in the time spent
playing pool could be due to many other factors or it could just be a coincidence. In
your work with applications that involve the linear correlation coefficient r, it is
important to remember the following properties of r.
References/Suggested Readings
Sirug, Winston S. Mathematics in the Modern World: CHED Curriculum Compliant.
Hengania, Catherine O., et al. Mathematics in the Modern World.
Jamison, R. E. (2000). Learning the Language of Mathematics. Language and Learning
Across the Disciplines, 45-54.
Post Assessment:
Solve the problems.
1. Which of the scatter diagrams below suggests the …
a. strongest positive linear correlation between the x and y variables?
b. strongest negative linear correlation between the x and y variables?
2. Which of the scatter diagrams below suggests …
a. a near perfect positive linear correlation between the x and y variables?
b. little or no linear correlation between the x and y variables?
3. Given the bivariate data:
a. Draw a scatter diagram for the data.
b. Find n, ∑x, ∑y, ∑x², (∑x)², ∑xy.
c. Find a, the slope of the least-squares line, and b, the y-intercept of the least-
squares line.
d. Draw the least-squares line on the scatter diagram from part a.
e. Is the point (x̅, y̅) on the least-squares line?
f. Use the equation of the least-squares line to predict the value of y when x = 3.4.
g. Find, to the nearest hundredth, the linear correlation coefficient.
Name:__________________________________Year/Section_________Score________
Write your answers here!