BADM 221 Statistics for Business
Week 10
ANOVA (Analysis of Variance)
ANOVA
Test of several means – a k-sample hypothesis test
Many statistical applications in psychology, social science,
business administration, and the natural sciences involve several
groups.
Example:
• An experiment to study the effects of five different brands of
gasoline on car engine efficiency.
• A consumer looking for a new car might compare the average
gas mileage of seven car models.
• A professor wishes to study the effect of four different teaching
techniques on mathematics proficiency.
The characteristic that differentiates the treatments from one
another is called the factor of the study. The different
treatments are called the levels of the factor. Here, we only
consider one factor.
Example:
• An experiment to study the effects of five different brands of
gasoline on car engine efficiency.
  Factor: Gasoline brand.  Treatments: the 5 different brands.
• A consumer looking for a new car might compare the average gas
mileage of seven car models.
  Factor: Car model.  Treatments: the 7 car models.
• A professor wishes to study the effect of four different teaching
techniques on mathematics proficiency.
  Factor: Teaching technique.  Treatments: the 4 different techniques.
For hypothesis tests comparing averages among more than two
groups, statisticians have developed a method called
“Analysis of Variance” (abbreviated ANOVA).
One-way ANOVA
(Single-factor ANOVA)
The purpose of an ANOVA test is to determine whether
there is any significant difference among several group
means. The test uses variances to help determine if the
means are equal or not.
Two kinds of variance (sources of variation):
• Variance between treatments:
  Variation due to the different levels of the factor,
  termed the Sum of Squares of the treatment/factor:
  SS(Treatment) or SS(Factor)
• Variance within treatments:
  Variation due to error,
  termed the Sum of Squares of Error:
  SS(Error)
Null and Alternative Hypothesis
H0: All the population means are the same.
Ha: At least one of the means is different.
Suppose we want to compare k groups.
H0: The population means of all k groups are the same.
Ha: At least one group has a different mean.
H0: μ1 = μ2 = … = μk
Ha: At least one μi is different from the others.
Data are typically put into a table for easy referencing by
computer software. The table is called the ANOVA table.

Number of treatments: k    Total number of data values: n

Source of Variation       | Sum of Squares (SS)         | Degrees of Freedom (df) | Mean Square (MS)                | F
Between Treatments        | SS(Factor) or SS(Treatment) | k – 1                   | MS(Factor) = SS(Factor)/(k – 1) | F = MS(Factor)/MS(Error)
Error (Within Treatments) | SS(Error)                   | n – k                   | MS(Error) = SS(Error)/(n – k)   |
Total                     | SS(Total)                   | n – 1                   |                                 |
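As a sketch of how these table entries are computed (a minimal illustration with made-up data, not taken from the slides), the sums of squares and the F statistic can be built directly from raw group data:

```python
import numpy as np

def one_way_anova(groups):
    """Compute SS(Factor), SS(Error), and F for a one-way ANOVA."""
    data = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n, k = data.size, len(groups)
    grand_mean = data.mean()
    # Between-treatments variation: group means around the grand mean
    ss_factor = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-treatments variation: observations around their own group mean
    ss_error = sum(((np.asarray(g, dtype=float) - np.mean(g)) ** 2).sum()
                   for g in groups)
    ms_factor = ss_factor / (k - 1)   # MS(Factor) = SS(Factor)/(k - 1)
    ms_error = ss_error / (n - k)     # MS(Error)  = SS(Error)/(n - k)
    return ss_factor, ss_error, ms_factor / ms_error

# Hypothetical data: three treatments with three observations each
ss_f, ss_e, f_stat = one_way_anova([[5, 6, 7], [8, 9, 10], [4, 5, 6]])
```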
Example:
Three different diet plans are to be tested for mean weight loss. The
entries in the table are the weight losses for the different plans.
Plan 1 | Plan 2 | Plan 3
5      | 3.5    | 8
4.5    | 7      | 4
4      | 4.5    | 3.5
3      |        |
The resulting ANOVA table is shown below:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         |                    |             |
Error (Within Treatments) | 20.8542        |                    |             |
Total                     |                |                    |             |
Number of treatments: k = 3    Total number of data values: n = 10

Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | k – 1 = 2          |             |
Error (Within Treatments) | 20.8542        | n – k = 7          |             |
Total                     |                | n – 1 = 9          |             |
Example (continued):
Three different diet plans are to be tested for mean weight loss. The
entries in the table are the weight losses for the different plans.
Plan 1 | Plan 2 | Plan 3
5      | 3.5    | 8
4.5    | 7      | 4
4      | 4.5    | 3.5
3      |        |

Test the hypothesis that the mean weight losses of the 3 diet plans are
the same, at the 5% level of significance.
Hypothesis Testing:
H0: The population mean weight losses of the three diet
plans are ALL the same.
Ha: At least one of the diet plans has a different mean
weight loss.

ANOVA table:

Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2                  | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7                  | 2.9792      |
Total                     | 23.1           | 9                  |             |
Hypothesis Testing:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2                  | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7                  | 2.9792      |
Total                     | 23.1           | 9                  |             |

The test statistic is compared against an F-distribution, whose shape
depends on the two degrees of freedom.
[Figure: F-distribution density curves for F(3,5), F(10,90), F(50,50), and F(90,10)]
Hypothesis Testing:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 2.2458         | 2 (df1)            | 1.1229      | 0.3769
Error (Within Treatments) | 20.8542        | 7 (df2)            | 2.9792      |
Total                     | 23.1           | 9                  |             |

Critical value: F(df1, df2) = F(2, 7) = 4.7375
Test statistic: Fc = 0.3769
Hypothesis Testing:
Reject H0 if (Test Statistic > Critical value).
Do not reject H0 if (Test Statistic ≤ Critical value).

Critical value: F(2, 7) = 4.7375
Test statistic: Fc = 0.3769

Since Fc < F(2, 7), do not reject H0.
There is insufficient evidence that at least one of the
diet plans has a different mean weight loss.
1. H0: The population mean weight losses of the three diet
   plans are ALL the same.
   Ha: At least one of the diet plans has a different mean
   weight loss.
2. Test statistic: Fc = 0.3769
3. Critical value: at the 5% level of significance, F(2, 7) = 4.7375
4. Fc < F(2, 7) ⇒ Do not reject H0.
5. Conclusion: Do not reject H0 at a 5% level of significance.
   There is insufficient evidence that at least one of the diet
   plans has a different mean weight loss.
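The same five-step test can be reproduced in software, assuming SciPy is available; `f_oneway` returns the F statistic and p-value directly:

```python
from scipy import stats

plan1 = [5, 4.5, 4, 3]   # weight losses under Plan 1
plan2 = [3.5, 7, 4.5]    # Plan 2
plan3 = [8, 4, 3.5]      # Plan 3

# One-way ANOVA: F statistic and p-value
f_stat, p_value = stats.f_oneway(plan1, plan2, plan3)
# f_stat ≈ 0.3769; since p_value > 0.05, do not reject H0
```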
Example:
As part of an experiment to see how different types of soil cover
would affect slicing tomato production, Douglas College students
grew tomato plants under different soil cover conditions. Groups of
three plants each had one of the 5 treatments (i.e. a total of 15 plants).
All plants grew under the same conditions and were the same variety.
Students recorded the weight (in grams) of tomatoes produced by
each of the plants and the results are summarized in an ANOVA table:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 36,648,561     |                    |             |
Error (Within Treatments) |                |                    |             |
Total                     | 57,095,287     |                    |             |
At the 0.05 level of significance, conduct a hypothesis test to
determine if all treatment means are the same.
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square  | F
Between Treatments        | 36,648,561     | 4                  | 9,162,140.25 | 4.481
Error (Within Treatments) | 20,446,726     | 10                 | 2,044,672.6  |
Total                     | 57,095,287     | 14                 |              |

1. H0: The population means of all 5 treatments are the same.
   Ha: At least one treatment has a different mean.
2. Test statistic: Fc = 4.481
3. Critical value: at the 5% level of significance, F(4, 10) = 3.478
4. Fc > F(4, 10) ⇒ Reject H0.
5. Conclusion: Reject H0 at a 5% level of significance.
   There is sufficient evidence that at least one treatment
   has a different mean.
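The critical value F(4, 10) = 3.478 can be looked up in software rather than a table; a sketch assuming SciPy:

```python
from scipy import stats

alpha = 0.05
df1, df2 = 4, 10                              # k - 1 = 4, n - k = 10
f_crit = stats.f.ppf(1 - alpha, df1, df2)     # critical value, about 3.478
f_stat = 9_162_140.25 / 2_044_672.6           # MS(Factor)/MS(Error), about 4.481
# f_stat > f_crit, so reject H0
```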
Example:
In a completely randomized experimental design, 7 experimental
units were used for each of the 4 levels of the factor:
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        |                |                    |             |
Error (Within Treatments) | 24,000         |                    |             |
Total                     | 38,301         |                    |             |

Complete the ANOVA table and test the hypothesis that the
population treatment means are all the same, at α = 0.05.
Source of Variation       | Sum of Squares | Degrees of Freedom | Mean Square | F
Between Treatments        | 14,301         | 3                  | 4,767       | 4.767
Error (Within Treatments) | 24,000         | 24                 | 1,000       |
Total                     | 38,301         | 27                 |             |

1. H0: The population means of all 4 treatments are the same.
   Ha: At least one treatment has a different mean.
2. Test statistic: Fc = 4.767
3. Critical value: at α = 0.05, F(3, 24) ≈ F(3, 20) = 3.0983 (nearest table value)
4. Fc > F(3, 24) ⇒ Reject H0.
5. Conclusion: Reject H0 at a 5% level of significance.
   There is sufficient evidence that at least one treatment has a
   different mean.
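The missing table entries follow from the identities SS(Total) = SS(Treatment) + SS(Error), MS = SS/df, and F = MS(Treatment)/MS(Error); a sketch in Python (SciPy assumed for the p-value):

```python
from scipy import stats

k = 4                                  # levels of the factor
n = 4 * 7                              # 7 experimental units per level
ss_total, ss_error = 38301, 24000

ss_treat = ss_total - ss_error         # SS(Treatment) = 14,301
ms_treat = ss_treat / (k - 1)          # 4,767
ms_error = ss_error / (n - k)          # 1,000
f_stat = ms_treat / ms_error           # 4.767
p_value = stats.f.sf(f_stat, k - 1, n - k)
# p_value < 0.05, so reject H0
```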
BADM 221 Statistics for Business
Unit 11
Linear Regression
Linear Regression
Regression is a statistical technique that uses the idea
that one variable may be related to one or more variables
through an equation.
Here we consider only two variables with a straight-line
relationship, which is called simple linear regression.
Simple linear regression uses the relationship between the
two variables to obtain information about one variable by
knowing the values of the other.
The equation showing this type of relationship is called the
linear regression equation.
Linear equation: y = mx + b, where m is the slope and b is the y-intercept.

Example: y = 2x – 1 has slope m = 2 and y-intercept b = –1.
We want to use X to predict (or estimate) the value of Y that
might be obtained without actually measuring it, provided
the relationship between the two can be expressed by a line.
“ X ” is usually called the independent variable and “ Y ” is
called the dependent variable.
[Scatterplot: Statistics Score (Y) against Mathematics Score (X)]
Example: The exam scores of a class of 9 students in
Mathematics ( X ) and in Statistics ( Y ) are shown
below:
Math Score (X) 80 58 92 60 75 63 93 76 78
Stat Score (Y) 78 64 96 62 78 65 90 61 82
[Scatterplot of the nine (Math Score, Stat Score) pairs]
We want to determine the equation of the regression line
that best-fits the data.
[Figure: four candidate straight lines drawn through the scatterplot of
Statistics Score against Mathematics Score]
Equation of the regression line:
           | df | SS
Regression | 1  | 1004.483
Residual   | 7  | 301.517
Total      | 8  | 1306

           | Coefficients | Standard Error | t Stat | p-value
Intercept  | 9.450        | 13.74          | 0.687  | 0.513
Math Score | 0.872        | 0.1807         | 4.829  | 0.001
Equation of the regression line (from the regression output above):

Y = 9.450 + 0.872 X
Stat Score = 9.450 + 0.872 × (Math Score)
We can then make predictions using the regression equation:

Stat Score = 9.450 + 0.872 × (Math Score)

For example:

Score in Math | Estimated score in Stat
61            | 9.450 + 0.872 × 61 = 62.64
73            | 9.450 + 0.872 × 73 = 73.11
91            | 9.450 + 0.872 × 91 = 88.80
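The predictions above can be scripted; a minimal sketch using the fitted coefficients from the slides (the function name is just for illustration):

```python
def predict_stat_score(math_score):
    """Predicted Statistics score from the fitted line
    Stat Score = 9.450 + 0.872 * (Math Score)."""
    return 9.450 + 0.872 * math_score

for x in (61, 73, 91):
    print(x, round(predict_stat_score(x), 2))
```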
Is the regression relationship significant?
Null and Alternative Hypothesis
H0: There is no linear relationship between X and Y.
    (The regression relationship is NOT significant.)
Ha: There is a linear relationship between X and Y.
    (The regression relationship is significant.)
Is the regression relationship significant?
Use the p-value approach
Reject H0 if (p-value ≤ level of significance)
  ⇒ The regression relationship is significant.
Do not reject H0 if (p-value > level of significance)
  ⇒ The regression relationship is NOT significant.
Is the regression relationship significant?
           | Coefficients | Standard Error | t Stat | p-value
Intercept  | 9.450        | 13.74          | 0.687  | 0.513
Math Score | 0.872        | 0.1807         | 4.829  | 0.001

Which p-value? Use the p-value of the slope coefficient (Math Score),
not the intercept.
Is the regression relationship significant?
As an illustration, take level of significance = 5%.
The p-value for Math Score is 0.001 < 0.05, the level of significance.
⇒ Reject H0. The regression relationship is significant.
How good is the regression equation?
Coefficient of Determination, R²

R² = SS(Regression) / SS(Total)    (a decimal; often expressed as a percentage)
Interpreted as the percentage of the observed variation in Y
that can be explained by the variation in X.
           | df | SS
Regression | 1  | 1004.483
Residual   | 7  | 301.517
Total      | 8  | 1306

R² = 1004.483 / 1306 = 0.7691 = 76.91%
76.91% of the variability of the Statistics score can be explained
by the linear relationship with the Mathematics score.
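The R² arithmetic can be checked directly from the SS column (a minimal sketch):

```python
ss_regression = 1004.483
ss_total = 1306

# Coefficient of determination: share of the variation in Y
# explained by the linear relationship with X
r_squared = ss_regression / ss_total
print(f"{r_squared:.4f}")   # 0.7691
```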
Example:
A teacher wishes to investigate if there is any relationship
between a student’s exam score in Mathematics (X) and the
exam score in Accounting (Y). A sample of 11 students is
randomly selected and the results are summarized in the
ANOVA table below:
           | df | SS
Regression | 1  | 1305.68
Residual   | 9  | 81.96
Total      | 10 | 1387.64

          | Coefficients | Standard Error | t Stat | p-value
Intercept | 24.13        | 4.657          | 5.182  | 0.005
MathScore | 0.759        | 0.063          | 11.974 | 0.001
What is the estimated regression equation that relates the exam score in
accounting (Y) to the score in mathematics (X)?
What is the estimated exam score in accounting if a student got a score of
80 in mathematics?
What is the estimated regression equation that relates the exam score in
accounting (Y) to the score in mathematics (X)?

Y = 24.13 + 0.759 X
Acc. Score = 24.13 + 0.759 × (MathScore)

What is the estimated exam score in accounting if a student got a score of
80 in mathematics?

24.13 + 0.759 × 80 = 84.85
Is the regression relationship significant? Use the p-value approach and 2%
level of significance.
Is the regression relationship significant? Use the p-value approach and 2%
level of significance.

The p-value for MathScore is 0.001 < 0.02, the level of significance.
⇒ Reject H0. The regression relationship is significant.
Compute the coefficient of determination between the exam score in
accounting and the exam score in mathematics. Interpret the result in the
context of the problem.
Compute the coefficient of determination between the exam score in
accounting and the exam score in mathematics. Interpret the result in the
context of the problem.

R² = 1305.68 / 1387.64 = 0.9409 = 94.09%
94.09% of the variability of the exam score in
accounting can be explained by the linear
relationship with the exam score in mathematics.
Summary of the example:

Estimated regression equation: Acc. Score = 24.13 + 0.759 × (MathScore)
Coefficient of determination: R² = 1305.68 / 1387.64 = 0.9409 = 94.09%
Significance of the regression relationship:
  p-value ≤ the level of significance
  ⇒ The regression relationship is significant.
  p-value > the level of significance
  ⇒ The regression relationship is NOT significant.
Example:
The accountant at Walmart wants to determine the
relationship between customer purchases at the store, Y ($),
and the customer monthly salary, X ($). A sample of 15
customers is randomly selected and the results are
summarized in the ANOVA table below:
           | df | SS
Regression | 1  | 186952
Residual   | 13 | 99236
Total      | 14 | 286188

          | Coefficients | Standard Error | t Stat | p-value
Intercept | 78.58        | 7.540          | 1.202  | 0.035
Salary    | 0.066        | 0.013          | 4.948  | 0.003
What is the estimated regression equation that relates the amount of
customer’s purchase (Y) to the customer’s monthly salary (X)?
What is the estimated regression equation that relates the amount of
customer’s purchase (Y) to the customer’s monthly salary (X)?

Y = 78.58 + 0.066 X
Amt. Purchase = 78.58 + 0.066 × (Salary)
Is the regression relationship significant? Use the p-value approach and 1%
level of significance.
Is the regression relationship significant? Use the p-value approach and 1%
level of significance.

The p-value for Salary is 0.003 < 0.01, the level of significance.
⇒ Reject H0. The regression relationship is significant.
Compute the coefficient of determination between the amount purchased and
the customer’s monthly salary. Interpret the result in the context of the
problem.
Compute the coefficient of determination between the amount purchased and
the customer’s monthly salary. Interpret the result in the context of the
problem.

R² = 186952 / 286188 = 0.6532 = 65.32%
65.32% of the variability of the amount
purchased can be explained by the linear
relationship with the customer’s monthly salary.
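Putting this example together in Python (coefficients and sums of squares copied from the regression output above; the helper name is illustrative):

```python
intercept, slope = 78.58, 0.066           # from the Coefficients column
p_value_salary = 0.003                    # p-value for the Salary slope
ss_regression, ss_total = 186952, 286188  # from the SS column

def predict_purchase(salary):
    """Estimated purchase amount ($) for a given monthly salary ($)."""
    return intercept + slope * salary

r_squared = ss_regression / ss_total          # about 0.6532 (65.32%)
significant_at_1pct = p_value_salary < 0.01   # True: reject H0 at the 1% level
```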