MDPN460 Lecture04
Engineering Lab
Lecture 4
ANalysis Of VAriance (ANOVA)
● ANOVA (ANalysis Of VAriance) is a statistical method for determining the existence of differences among several population means.
[Figure: three normal populations with means μ1, μ2, μ3]
ANOVA
● We have r independent random samples, each one corresponding to a population subject to a different treatment.
● We have:
  – n = k1 + k2 + k3 + ... + kr total observations.
● We assume independent random sampling from each of the r populations.
● We assume that the r populations under study:
  – are normally distributed,
  – with means μi that may or may not be equal,
  – but with equal variances, σi².
The Hypothesis Test of ANOVA
The hypothesis test of analysis of variance:

H0: μ1 = μ2 = μ3 = μ4 = ... = μr
H1: Not all μi (i = 1, ..., r) are equal

The test statistic of analysis of variance:

F(r−1, n−r) = (estimate of variance based on means from r samples) / (estimate of variance based on all sample observations)

That is, the test statistic in an analysis of variance is based on the ratio of two estimators of a population variance, and is therefore based on the F distribution, with (r − 1) degrees of freedom in the numerator, denoted n1, and (n − r) degrees of freedom in the denominator, denoted n2.
F-Distribution
Let s1² and s2² represent the sample variances of two different populations. If both populations are normal and the population variances σ1² and σ2² are equal, then the sampling distribution of

F = s1² / s2²

is called an F-distribution.
There are several properties of this distribution.
[Figure: F-distribution density curves for (n1, n2) = (1, 8), (8, 26), (16, 7), and (3, 11), plotted over F values from 1 to 4]
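As an illustration (not part of the original slides), the following minimal Python sketch draws two samples from normal populations with equal variances and forms the ratio of their sample variances. Repeated many times, that ratio follows an F-distribution with n1 = k1 − 1 and n2 = k2 − 1 degrees of freedom. The sample sizes and seed are arbitrary choices for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)          # arbitrary seed for reproducibility
k1, k2 = 9, 27                          # arbitrary sample sizes -> df n1 = 8, n2 = 26

# Draw many pairs of samples from normal populations with equal variance
# and form the ratio of the two sample variances each time.
ratios = [
    np.var(rng.normal(0, 1, k1), ddof=1) / np.var(rng.normal(0, 1, k2), ddof=1)
    for _ in range(20000)
]

# Compare the empirical 95th percentile with the theoretical F critical value.
print(np.percentile(ratios, 95))            # simulated
print(stats.f.ppf(0.95, k1 - 1, k2 - 1))    # theoretical F(0.05; 8, 26)
```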
Critical Values for the F-Distribution
Example: find the critical value for n1 = 5 and n2 = 28 at α = 0.05.
F-Distribution Critical Values (α = 0.05)

Denominator df (n2)   Numerator degrees of freedom (n1)
                          1        2        3        4        5        6
        1              161.4    199.5    215.7    224.6    230.2    234.0
        2              18.51    19.00    19.16    19.25    19.30    19.33
        3              10.13     9.55     9.28     9.12     9.01     8.94
        4               7.71     6.94     6.59     6.39     6.26     6.16
        5               6.61     5.79     5.41     5.19     5.05     4.95
        6               5.99     5.14     4.76     4.53     4.39     4.28
        7               5.59     4.74     4.35     4.12     3.97     3.87
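The tabulated critical values can be reproduced in software. A short sketch (assuming scipy is available) that also answers the n1 = 5, n2 = 28 example from the earlier slide (about 2.56):

```python
from scipy import stats

# Upper-tail critical values F(0.05; n1, n2), matching the first rows of the table above.
for n2 in range(1, 8):                    # denominator degrees of freedom
    row = [round(stats.f.ppf(0.95, n1, n2), 2) for n1 in range(1, 7)]
    print(n2, row)

# Example: n1 = 5, n2 = 28 at alpha = 0.05 (about 2.56).
print(stats.f.ppf(0.95, 5, 28))
```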
When the null hypothesis is true:

H0: μ1 = μ2 = μ3

We would expect the sample means to be nearly equal, as in this illustration.

[Figure: three samples whose sample means x̄ are nearly equal]

And we would expect the variation among the sample means (between sample) to be small, relative to the variation found around the individual sample means (within sample).

If the null hypothesis is true, the numerator in the test statistic is expected to be small, relative to the denominator.
When the Null Hypothesis Is False
In any of these situations, we would not expect the sample means to all be nearly equal. We would expect the variation among the sample means (between sample) to be large, relative to the variation around the individual sample means (within sample).

If the null hypothesis is false, the numerator in the test statistic is expected to be large, relative to the denominator.
Note:
1. The type of applicator is a treatment.
2. The data values from repeated samplings are called replicates.
Sample Results:

(Treatment) Applicator (Level)
              Brush (i = 1)    Roller (i = 2)    Pad (i = 3)
                  39.1              31.6             32.7
                  39.4              33.4             33.2
                  31.1              30.2             28.7
                  33.7              41.8             29.2
                  30.5              33.9             25.8
                  34.6                               31.4
                                                     26.7
                                                     29.5
Sum           C1 = 208.4        C2 = 170.9       C3 = 237.2
Mean          x̄1 = 34.73        x̄2 = 34.18       x̄3 = 29.65

Grand mean: x̄ = 32.45
Note:
1. The drying time is measured by the mean value: x̄i is the mean drying time for treatment i, i = 1, 2, 3.
2. There is a certain amount of variation among the means.
3. Some variation can be expected, even if all three population
means are equal.
4. Consider the question: “Is the variation among the sample means due to chance, or is it due to the effect of the applicator on drying time?”
Solution:
1. The Set-up:
a. Population parameter of concern: The mean at each
treatment of the test factor. Here, the mean drying time for
each applicator.
b. The null and the alternative hypothesis:
H0: μ1 = μ2 = μ3
The mean drying time is the same for each applicator.
Ha: μi ≠ μj for some i ≠ j
Not all drying time means are equal.
2. The Test Criteria:
a. Assumptions: The data was randomly collected and all
observations are independent. The effects due to chance and
untested factors are assumed to be normally distributed.
b. Test statistic: F test statistic (see below).
c. Level of significance: α = 0.05
3. The Sample Evidence:
a. Sample information: Data listed in the given table.
b. Calculate the value of the test statistic:
The F statistic is a ratio of two variances.
Separate the variance in the entire data set into two parts.
Partition the Total Sum of Squares:
Consider the numerator of the fraction used to define the sample variance:

s² = Σ(x − x̄)² / (n − 1)

The numerator of this fraction is called the sum of squares, or total sum of squares, SS(total).
Notation:
Ci = total for column i
ki = number of observations for treatment i
n = Σki = total number of observations
SS(total) = Σ(x − x̄)²
          = Σ(x − Σx/n)²
          = Σ(x² − 2x(Σx)/n + (Σx)²/n²)
          = Σ(x²) − 2(Σx)(Σx)/n + n(Σx)²/n²
          = Σ(x²) − (Σx)²/n

SS(total) = Σ(x²) − (Σx)²/n = total variation in data
SS(factor) = (C1²/k1 + C2²/k2 + C3²/k3 + ...) − (Σx)²/n
           = variation between treatments

SS(error) = Σ(x²) − (C1²/k1 + C2²/k2 + C3²/k3 + ...)
          = SS(total) − SS(factor)
          = variation within treatments (within the columns)
Calculations:

SS(total) = Σ(x²) − (Σx)²/n = 20316.69 − (616.5)²/19
          = 20316.69 − 20003.80 = 312.89

SS(factor) = (C1²/k1 + C2²/k2 + C3²/k3) − (Σx)²/n
           = (208.4²/6 + 170.9²/5 + 237.2²/8) − (616.5)²/19
           = 20112.77 − 20003.80 = 108.97

SS(error) = SS(total) − SS(factor) = 312.89 − 108.97 = 203.92
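The hand calculations above can be checked with a short sketch using the drying-time data from the sample-results table (assuming the column-to-applicator assignment shown there):

```python
import numpy as np

# Drying times by applicator, taken from the sample-results table.
brush  = np.array([39.1, 39.4, 31.1, 33.7, 30.5, 34.6])
roller = np.array([31.6, 33.4, 30.2, 41.8, 33.9])
pad    = np.array([32.7, 33.2, 28.7, 29.2, 25.8, 31.4, 26.7, 29.5])

x = np.concatenate([brush, roller, pad])
n = len(x)

ss_total  = (x ** 2).sum() - x.sum() ** 2 / n
ss_factor = sum(g.sum() ** 2 / len(g) for g in (brush, roller, pad)) - x.sum() ** 2 / n
ss_error  = ss_total - ss_factor

print(round(ss_total, 2), round(ss_factor, 2), round(ss_error, 2))  # ~312.89, ~108.97, ~203.92
```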
Source    df    SS        MS
Factor          108.97
Error           203.92
Total           312.89
Degrees of freedom, df, associated with each of the three sources
of variation:
1. df(factor): one less than the number of treatments (columns), c,
for which the factor is tested.
df(factor) = c - 1
2. df(total): one less than the total number of observations, n.
df(total) = n - 1
n = k1 + k2 + k3 + ...
3. df(error): sum of the degrees of freedom for all levels
tested. Each column has ki - 1 degrees of freedom.
df(error) = (k1 - 1) + (k2 - 1) + (k3 - 1) + ...
=n-c
Calculations:
df(factor) = df(applicator) = c - 1 = 3 - 1 = 2
df(total) = n - 1 = 19 - 1 = 18
df(error) = n - c = 19 - 3 = 16
Note:
The sums of squares and the degrees of freedom must check: SS(factor) + SS(error) = SS(total) and df(factor) + df(error) = df(total).
MS(factor) = SS(factor) / df(factor)
MS(error) = SS(error) / df(error)
Calculations:

MS(factor) = SS(factor) / df(factor) = 108.97 / 2 = 54.49
MS(error) = SS(error) / df(error) = 203.92 / 16 = 12.75
The Complete ANOVA Table:

Source    df    SS        MS
Factor     2    108.97    54.49
Error     16    203.92    12.75
Total     18    312.89
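As a cross-check (a sketch, not part of the slides), scipy's one-way ANOVA reproduces the table: F* = MS(factor)/MS(error) ≈ 54.49/12.75 ≈ 4.27, which can be compared with the α = 0.05 critical value for (2, 16) degrees of freedom (about 3.63).

```python
from scipy import stats

brush  = [39.1, 39.4, 31.1, 33.7, 30.5, 34.6]
roller = [31.6, 33.4, 30.2, 41.8, 33.9]
pad    = [32.7, 33.2, 28.7, 29.2, 25.8, 31.4, 26.7, 29.5]

# One-way ANOVA on the three applicator samples.
result = stats.f_oneway(brush, roller, pad)
print(result.statistic, result.pvalue)   # F* = MS(factor)/MS(error), and its p-value

# The 0.05 critical value with df = (2, 16), about 3.63.
print(stats.f.ppf(0.95, 2, 16))
```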
[Figure: box-and-whisker plots of drying time for Levels 1, 2, and 3, plotted on a Time axis from 20 to 40]
Solution:
1. The box-and-whisker plots show the relationship among
the three samples.
2. The plots suggest the three sample means are different
from each other.
3. This suggests the population means are different.
4. There is relatively little within-sample variation, but a
relatively large amount of between-sample variation.
Example: Do the box-and-whisker plots below show
sufficient evidence to indicate a difference in the three
population means?
[Figure: box-and-whisker plots for Levels 1 through 4]
Factor Levels

Replication     Sample from Level 1   Sample from Level 2   Sample from Level 3   ...   Sample from Level c
k = 1           x1,1                  x2,1                  x3,1                        xc,1
k = 2           x1,2                  x2,2                  x3,2                        xc,2
k = 3           x1,3                  x2,3                  x3,3                        xc,3
Column totals   C1                    C2                    C3                          Cc

T = grand total = sum of all x's = Σx = ΣCi
Mathematical Model for Single-Factor ANOVA:
xc,k = μ + Fc + εk(c)

1. μ: mean value for all the data without respect to the test factor.
2. Fc: effect of factor (level) c on the response variable.
3. εk(c): experimental error that occurs among the k replicates in each of the c columns.
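A brief sketch of what the model says (all numeric values below are illustrative assumptions, not from the slides): each observation is the overall mean μ plus a level effect Fc plus a random error term.

```python
import numpy as np

rng = np.random.default_rng(0)     # illustrative seed

mu = 32.0                          # overall mean (illustrative)
factor_effects = [2.5, 1.5, -3.0]  # F_c for levels c = 1, 2, 3 (illustrative)
k = 6                              # replicates per level
sigma = 3.0                        # common error standard deviation

# x[c][k] = mu + F_c + epsilon_k(c)
data = [mu + fc + rng.normal(0, sigma, k) for fc in factor_effects]
for c, sample in enumerate(data, start=1):
    print(f"level {c}: mean = {sample.mean():.2f}")
```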
Example: A study was conducted to determine the effectiveness of
various drugs on post-operative pain. The purpose of the
experiment was to decide if there is any difference in length of pain
relief due to drug. Eighty patients with similar operations were
selected at random and split into four groups. Each patient was
given one of four drugs and checked regularly. The length of pain
relief (in hours) was recorded for each patient. At the 0.05 level of
significance, is there any evidence to reject the claim that the four
drugs are equally effective?
Note:
1. The data is omitted here.
2. The ANOVA table is given in a later slide.
Solution:
1. The Set-up:
a. Population parameter of interest: The mean time of pain
relief for each factor (drug).
b. The null and alternative hypothesis:
H0: μ1 = μ2 = μ3 = μ4
Ha: the means are not all equal.
2. The Hypothesis Test Criteria:
a. Assumptions: The patients were randomly assigned to a drug, and their times are independent of each other. The
effects due to chance and untested factors are assumed
to be normally distributed.
b. Test statistic: F* with df(numerator) = df(factor) = 3 and
df(denominator) = df(error) = 80 - 4 = 76
c. Level of significance: α = 0.05
3. The Sample Evidence:
a. Sample information: The ANOVA table:
Source    df    SS        MS
Factor     3     70.84    23.61
Error     76    226.05     2.97
Total     79    296.89
F* = MS(factor) / MS(error) = 23.61 / 2.97 = 7.95
4. The Probability Distribution (Classical Approach):
a. Critical value: F(3, 76, 0.05) ≈ 2.72
b. F* is in the rejection region.
5. The Probability Distribution (p-Value Approach):
a. The p-value:
P = P(F* > 7.95, with dfn = 3, dfd = 76) < 0.01
By computer: P ≈ 0.0001 (see the sketch at the end of this solution).
b. The p-value is smaller than the level of significance, a.
6. The Results:
a. Decision: Reject H0.
b. Conclusion: There is evidence to suggest that not all
drugs have the same effect on length of pain relief.
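For steps 4 and 5 above, the critical value and p-value quoted in the slides can be reproduced with scipy. This sketch uses only the degrees of freedom and F* already given; the raw data were omitted.

```python
from scipy import stats

f_star, dfn, dfd = 7.95, 3, 76

# Classical approach: 0.05 critical value for F with (3, 76) df (about 2.72).
print(stats.f.ppf(0.95, dfn, dfd))

# p-value approach: P(F > 7.95) with (3, 76) df (about 0.0001).
print(stats.f.sf(f_star, dfn, dfd))
```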