Simple Linear Regression and
Correlation Analysis
Simple Regression
Definition
A regression model is a mathematical equation that
describes the relationship between two or more variables.
A simple regression model includes only two variables:
one independent (X) and one dependent (Y).
X: The independent variable is the one that explains the dependent variable.
Y: The dependent variable is the one being explained.
A simple regression model that gives a straight-line relationship
between two variables is called a linear regression model.
Relationship between Food Expenditure and Income:
(a) Linear Relationship; (b) Nonlinear Relationship
Example of plotting a Linear Equation
ŷ = b0 + b1x
• b0 is the y intercept of the line.
• b1 is the slope coefficient of the line.
• ŷ = b0 + b1x is the estimated simple linear regression equation.
The estimated simple linear regression equation is as
follows:
ŷ = b0 + b1x
• b0 is the y intercept of the line.
• b1 is the slope coefficient of the line.
• ŷ is the estimated value of y for a given x value.
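The slides read b0 and b1 from software output. As a minimal sketch, assuming the usual least-squares formulas b1 = SSxy / SSxx and b0 = ȳ - b1·x̄, the coefficients could be computed as below. The function name and the check data are illustrative only; the check data are synthetic points on the line y = 2 + 3x, not the values from Table 1.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross-products and squares about the means
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    if ss_xx == 0:
        raise ValueError("x values must not all be equal")
    b1 = ss_xy / ss_xx          # slope
    b0 = y_bar - b1 * x_bar     # y intercept
    return b0, b1


# Synthetic check data (assumption, NOT Table 1): points exactly on y = 2 + 3x
x_demo = [1, 2, 3, 4]
y_demo = [5, 8, 11, 14]
b0, b1 = fit_simple_linear_regression(x_demo, y_demo)
print(f"y-hat = {b0:.4f} + {b1:.4f}x")
```

Because the check points lie exactly on a line, the fitted intercept and slope recover 2 and 3.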
Example 1:
Table 1 shows the Incomes (in hundreds of dollars) and
Food Expenditures of seven households:
Table 1
(a) Find the regression equation for the data given in
Table 1. Use income as an independent
variable (X) and food expenditure (Y) as a
dependent variable.
The regression equation is
ŷ = 1.5073 + 0.2525x
Coefficient of Determination, r²
&
Correlation Coefficient, r
Coefficient of Determination, r²
The coefficient of determination, r², is the percentage of
the total variation in the dependent variable (y) that is
explained by the independent variable (x).
Correlation Coefficient, r
Correlation measures the direction and the strength
of the linear association between two variables.
Features of Correlation
Coefficient, r
• Range between -1 and 1
• The closer to -1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker the linear relationship
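As a rough illustration of these features, the sketch below computes r with the standard Pearson formula r = SSxy / sqrt(SSxx · SSyy) and then squares it to get r². The function name and the two small data sets are assumptions made for the illustration (synthetic, perfectly linear points), not data from the slides.

```python
import math


def correlation_coefficient(x, y):
    """Return the Pearson correlation coefficient r for paired data."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_yy = sum((yi - y_bar) ** 2 for yi in y)
    if ss_xx == 0 or ss_yy == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Synthetic data (assumption): one perfectly increasing, one perfectly decreasing
r_up = correlation_coefficient([1, 2, 3, 4], [5, 8, 11, 14])
r_down = correlation_coefficient([1, 2, 3, 4], [14, 11, 8, 5])
print(f"r = {r_up:.2f}, r^2 = {r_up ** 2:.2f}")
print(f"r = {r_down:.2f}, r^2 = {r_down ** 2:.2f}")
```

The two cases sit at the ends of the range: a perfect positive line gives r at 1 and a perfect negative line gives r at -1, while r² is the same in both.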
Example 1 (continued):
(b) Determine and interpret the coefficient of
correlation (r) and the coefficient of determination (r²).
r = 0.9481, so r² = 0.8988
Interpretation of r and r²:
The value of r = 0.95 indicates that Income (x) and Food
Expenditure (y) are positively correlated. The relationship is
strong. Households with higher income tend to spend more on food.
The value of r² = 0.8988 states that 89.88% of the total variation in
Food Expenditure (y) is explained by Income (x), while 10.12% is
explained by other factors.
(c) At the 5% significance level, do the data
provide sufficient evidence that there is a
correlation between the Income and the
Food Expenditure?
Step 1:
H0: ρ = 0 (no correlation between x and y)
HA: ρ ≠ 0 (correlation exists between x and y)
Step 2:
We will use the t-distribution to perform this test.
Step 3:
Area in two tails = 0.05
df = n – 2 = 7 – 2 = 5
From the statistical table (Table B.2), the critical values are
-2.5706 and 2.5706.
Rejection regions: reject H0 if t < -2.5706 or t > 2.5706.
Step 4:
From the summary output, the test statistic is 6.6641.
Since the test statistic > critical value, i.e. 6.6641 > 2.5706,
H0 is rejected. We conclude that there is a correlation between
Income (x) and Food Expenditure (y).
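The slides take the test statistic from the summary output. As a hedged sketch, assuming the standard statistic for testing a correlation, t = r·sqrt(n - 2) / sqrt(1 - r²), the value and the two-tailed decision can be checked as follows. The function names are hypothetical, and the critical value is passed in from Table B.2 rather than computed.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with df = n - 2."""
    if n <= 2 or not -1 < r < 1:
        raise ValueError("need n > 2 and -1 < r < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def reject_two_tailed(t_stat, t_crit):
    """Reject H0 when the statistic falls in either rejection region."""
    return abs(t_stat) > t_crit


# Example 1 values: r = 0.9481, n = 7, critical value 2.5706 from Table B.2
t_stat = correlation_t_statistic(0.9481, 7)
print(f"t = {t_stat:.2f}")
print("Reject H0:", reject_two_tailed(t_stat, 2.5706))
```

With r rounded to four decimals the computed value comes out near 6.67, slightly different from the 6.6641 in the summary output; the gap is presumably due to rounding of r, and the decision is the same either way.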
(d) At the 5% significance level, do the
data provide sufficient evidence that
there is a positive correlation
between the Income and the Food
Expenditure?
Step 1:
H0: ρ ≤ 0 (no positive correlation between x and y)
HA: ρ > 0 (positive correlation exists between x and y)
Step 2:
We will use the t-distribution to perform this test.
Step 3:
Area in one tail = 0.05
df = n – 2 = 7 – 2 = 5
From the statistical table (Table B.2), the critical value is
2.0150.
Rejection region: reject H0 if t > 2.0150.
Step 4:
From the summary output, the test statistic is 6.6641.
Since the test statistic > critical value, i.e. 6.6641 > 2.0150,
H0 is rejected. We conclude that there is a positive correlation
between Income (x) and Food Expenditure (y).
(e) Predict the food expenditure for a household
with income 90 (in hundreds of dollars).
From part (a), the regression equation is
ŷ = 1.5073 + 0.2525x
ŷ = 1.5073 + 0.2525(90)
ŷ = 24.2323
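The substitution above can be reproduced with a short sketch. The function name is hypothetical; the default coefficients are the ones from part (a).

```python
def predict_food_expenditure(income, b0=1.5073, b1=0.2525):
    """Estimated food expenditure y-hat = b0 + b1 * income.

    income is in hundreds of dollars, as in Table 1.
    """
    return b0 + b1 * income


y_hat = predict_food_expenditure(90)
print(f"y-hat = {y_hat:.4f}")
```

Keeping b0 and b1 as parameters lets the same function be reused with the coefficients of another fitted line.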
Example 2:
Table 2 lists the driving experiences (in years) of eight drivers and their monthly auto insurance premiums (in dollars).
Let the driving experience be an independent variable (X), and the
insurance premium be a dependent variable (Y).
The regression equation is
ŷ = 76.6604 - 1.5476x
Note: r is -0.77, so r² = 0.5929
Interpretation of r and r² :
The value of r = -0.77 indicates that the driving experience (x)
and the monthly auto insurance premium (y) are negatively
correlated. The relationship is moderately strong.
The value of r² = 0.59 states that 59% of the total variation in
insurance premiums (y) is explained by years of driving
experience (x), while 41% is explained by other factors.
Test whether there is a correlation between the driving
experiences (x) and monthly auto insurance premiums
(y) at the 5% level of significance.
Step 1:
H0: ρ = 0 (no correlation between x and y)
HA: ρ ≠ 0 (correlation exists between x and y)
Step 2:
We will use the t-distribution to perform this test.
Step 3:
Area in two tails = 0.05
df = n – 2 = 8 – 2 = 6
From the statistical table (Table B.2), the critical values are
-2.447 and 2.447.
Rejection regions: reject H0 if t < -2.447 or t > 2.447.
Step 4:
From the summary output, the test statistic is -2.9367.
Since the test statistic < critical value, i.e. -2.9367 < -2.447,
H0 is rejected. We conclude that a correlation exists between
driving experience (x) and auto insurance premium (y).