SBST 3203 Module
www.oum.edu.my
INTRODUCTION
SBST3203 Elementary Data Analysis is one of the courses offered at Open
University Malaysia (OUM). This course is worth 3 credit hours and should be
covered over 8 to 12 weeks.
COURSE AUDIENCE
This course is offered to all learners taking the Bachelor of Mathematics and
Management (Honours) programme.
STUDY SCHEDULE
It is a standard OUM practice that learners accumulate 40 study hours for every
credit hour. As such, for a three-credit hour course, you are expected to spend
120 study hours. Table 1 gives an estimation of how the 120 study hours could be
accumulated.
Study Activities                                                                Study Hours
Briefly go through the course content and participate in initial discussions   4
Study the module                                                                64
Attend tutorial sessions                                                        6–8
Online participation                                                            15
Revision                                                                        16
Assignment(s)                                                                   15
TOTAL STUDY HOURS ACCUMULATED                                                   120
COURSE SYNOPSIS
This course is divided into 10 topics. The synopsis for each topic is listed as follows:
Topic 1 introduces you to the concept of mean comparison, null and alternative
hypotheses as well as types of test for two population means. In addition, it will
cover rejection region (RR), significance level and p-value for two population
means as well as mean comparison for unspecified Δ0.
Topic 4 teaches you how to use ANOVA in experiments as well as within-group and
between-group variation. One-way ANOVA and the model for a single-factor test
will also be demonstrated.
Topic 5 covers chi-squared tests. In this topic, you will learn about goodness-of-fit
and contingency table tests as well as the expected frequencies for goodness-of-fit
and contingency table tests. Later, the degrees of freedom for all tests performed
and the procedures to obtain statistical decision for goodness-of-fit and contingency
table tests will be presented.
Topic 6 explains correlation concepts and the relationship between two variables,
followed by two-way scatter plot, Pearson correlation coefficient and Spearman
rank correlation coefficient.
Topic 7 teaches you about simple linear regression analysis. The contents of this
topic include a discussion on regression concepts, the simple linear regression model,
the least squares estimation method, inferential concepts and methods to evaluate data
suitability in fitting a regression model.
Topic 8 concentrates on multiple regression. The discussion in this topic covers the
multiple regression concept, multiple regression model and assumptions made,
methods to evaluate the suitability of data on regression model and how to use the
regression equation for the prediction and estimation of parameter values.
Learning Outcomes: This section refers to what you should achieve after you have
completely covered a topic. As you go through each topic, you should frequently
refer to these learning outcomes. By doing this, you can continuously gauge your
understanding of the topic.
Summary: You will find this component at the end of each topic. This component
helps you to recap the whole topic. By going through the summary, you should be
able to gauge your knowledge retention level. Should you find points in the
summary that you do not fully understand, it would be a good idea for you to revisit
the details in the module.
Key Terms: This component can be found at the end of each topic. You should go
through this component to remind yourself of important terms or jargon used
throughout the module. Should you find terms here that you are not able to explain,
you should look for the terms in the module.
References: The References section is where a list of relevant and useful textbooks,
journals, articles, electronic contents or sources can be found. The list can appear
in a few locations such as in the Course Guide (at the References section), at the
end of every topic or at the back of the module. You are encouraged to read or refer
to the suggested sources to obtain the additional information needed and to enhance
your overall understanding of the course.
PRIOR KNOWLEDGE
Learners are required to pass SBST1203 Introductory Statistics.
ASSESSMENT METHOD
Please refer to myINSPIRE.
REFERENCES
Freund, J. E. (2003). Mathematical statistics (7th ed.). Upper Saddle River, NJ:
Prentice-Hall, Inc.
Mann, P. S. (2005). Introductory statistics using technology (5th ed.). Upper Saddle
River, NJ: John Wiley & Sons, Inc.
INTRODUCTION
In data analysis, the comparison of two population means is very common and
provides a way to test the hypothesis that the two groups differ from each other. In
this topic, learners will be introduced to the concept of mean comparison, null and
alternative hypotheses, type of test and decision rule.
(a) The age of OUM first-year students in the 2010/2011 session as population 1;
and
(b) The age of UKM first-year students in the 2010/2011 session as population 2.
In this situation, the population mean is the appropriate parameter to be used in the
comparison statements of “the mean of population 1 is greater than the mean for
population 2.”
Example 1.1:
Population Statement:
The measurement level of blood pressure (in systolic unit) before treatment
compared to the measurement level of blood pressure (in systolic unit) after
treatment.
Population Selection:
The selection of population is arbitrary. Nevertheless, in this example:
(a) The level of blood pressure before treatment is chosen as population 1; and
(b) The level of blood pressure after treatment is chosen as population 2.
Random Sample:
If a random sample of size n1 is taken from population 1, we obtain:
Comparable Parameters:
(a) Population variance 1 ($\sigma_1^2$) and population variance 2 ($\sigma_2^2$) are the same; and
ACTIVITY 1.1
Population Statement:
Mathematics test scores for a group of students taught using a shortcut
method compared with those of another group of students who are not
taught using the shortcut method.
Table 1.3 presents various hypothesis formulations for various relationships of two
population means. The determination of test type is based on the formulation of
alternative hypothesis, H1 as shown in the same table (see Table 1.3).
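$\mu_2 < \mu_1$      $H_0: \mu_2 \ge \mu_1$; $H_1: \mu_2 < \mu_1$      1-tailed (left-tailed)
$\mu_2 = \mu_1$      $H_0: \mu_2 = \mu_1$; $H_1: \mu_2 \ne \mu_1$      2-tailed
$\mu_2 \ge \mu_1$    $H_0: \mu_2 \ge \mu_1$; $H_1: \mu_2 < \mu_1$      1-tailed (left-tailed)
$\mu_2 \le \mu_1$    $H_0: \mu_2 \le \mu_1$; $H_1: \mu_2 > \mu_1$      1-tailed (right-tailed)
$\mu_2 \ne \mu_1$    $H_0: \mu_2 = \mu_1$; $H_1: \mu_2 \ne \mu_1$      2-tailed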
2. $H_1: \mu_2 < \mu_1$ ⇔ $H_1: \mu_2 - \mu_1 < 0$
   $H_0: \mu_2 = \mu_1$ or $H_0: \mu_2 \ge \mu_1$ ⇔ $H_0: \mu_2 - \mu_1 = 0$ or $H_0: \mu_2 - \mu_1 \ge 0$
3. $H_0: \mu_2 = \mu_1$ ⇔ $H_0: \mu_2 - \mu_1 = 0$
   $H_1: \mu_2 \ne \mu_1$ ⇔ $H_1: \mu_2 - \mu_1 \ne 0$
Thus, hypothesis testing is carried out on $H_0$: rejecting $H_0$ at a specified
significance level also means accepting $H_1$. As such, the formulations of $H_1$
and $H_0$ indicated in Table 1.3, and the adjustments given in Table 1.4, must be
made with caution so that the decision made based on the samples is consistent
with (in effect, supports) $H_1$.
The following are three categories of statements for testing population mean
parameter (see Table 1.5).
Table 1.5: Three Categories of Statements for Testing Population Mean Parameter
U2: Statement involving inequality in value; the sign is "≤" or "≥".
    $H_0$ is as stated in the statement and $H_1$ is the opposite.
    The statement "the mean lifetime for an electric bulb of Population 2 is at least
    equal to the mean lifetime for an electric bulb of Population 1."
    $H_0: \mu_2 \ge \mu_1$; $H_1: \mu_2 < \mu_1$ ⇔ $H_0: \mu_2 - \mu_1 \ge 0$; $H_1: \mu_2 - \mu_1 < 0$
U3: Statement involving strict inequality in value; the sign is "<" or ">".
    $H_1$ is as stated in the statement and $H_0$ is the opposite.
    The statement "the mean lifetime for an electric bulb of Population 2 exceeds the
    mean lifetime for an electric bulb of Population 1" (write this statement as $H_1$,
    and its complement as $H_0$).
    $H_1: \mu_2 > \mu_1$; $H_0: \mu_2 \le \mu_1$ ⇔ $H_1: \mu_2 - \mu_1 > 0$; $H_0: \mu_2 - \mu_1 \le 0$
ACTIVITY 1.2
“The statement categories are applicable for the population proportion p”.
What do you understand from this statement? Discuss this statement in
the myINSPIRE forum.
EXERCISE 1.1
State two population means observations that are comparable for the case
Δ 0 = 0. Select one comparison in Table 1.4.
(b) A sample statistic used in the testing, which is referred to as the test statistic.
The test statistic is chosen based on the population parameter discussed in the
problem. This statistic is actually the point estimator for the population parameter
being discussed. For example, in testing the difference between population means as
stated in the null hypothesis:
$H_0: \mu_2 - \mu_1 = 0$
the test statistic used is the difference between the sample means obtained
from the respective populations, that is:
Statistic $B = \bar{X}_2 - \bar{X}_1$
Hypothesis testing on a single parameter value usually ends with a decision to either
accept or reject the null hypothesis H 0 . To make the decision, one can follow a
method or hypothesis rejection rules as specified by the researcher. This rule is
named the decision rule.
$B = \bar{X}_2 - \bar{X}_1$
Both subsets are usually separated by a number k, whose value may or may not be
given, but which can be calculated from the distribution of the test statistic when
the value of α is known.
SELF-CHECK 1.1
The null hypothesis for testing population means difference as in Table 1.4 for
various comparison cases are either:
(a) H 0 : μ2 − μ1 ≤ 0; or
(b) H 0 : μ2 − μ1 = 0; or
(c) H 0 : μ2 − μ1 ≥ 0.
(a) $\bar{X}_1$ is the mean of a random sample of size $n_1$ (≥ 30) from population 1, which
follows a normal or approximately normal distribution, $N(\mu_1, \sigma_1^2/n_1)$; and
(b) $\bar{X}_2$ is the mean of a random sample of size $n_2$ (≥ 30) from population 2, which
follows a normal or approximately normal distribution, $N(\mu_2, \sigma_2^2/n_2)$.
Thus, the statistic $B = \bar{X}_2 - \bar{X}_1$ is normally or approximately normally
distributed, $N(\mu_B, \sigma_B^2)$, where $\mu_B = \mu_2 - \mu_1$, and $\sigma_B^2$ is
expressed in terms of $\sigma_1^2$ and $\sigma_2^2$
depending on the following situations (see Table 1.6).
Situation   Description
1           (a) Population variances $\sigma_1^2$ and $\sigma_2^2$ are known and $\sigma_1^2 \ne \sigma_2^2$.
            (b) Both populations are normally distributed.
            (c) No requirements on sample size.
            (d) Random sample 1 and random sample 2 are independent.
            Variance: $\sigma_B^2 = \dfrac{\sigma_2^2}{n_2} + \dfrac{\sigma_1^2}{n_1}$
            $Z = \dfrac{\bar{X}_2 - \bar{X}_1}{\sqrt{\dfrac{\sigma_2^2}{n_2} + \dfrac{\sigma_1^2}{n_1}}}$   (Formula 1.1)
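Variance: $\sigma_B^2 = \dfrac{\sigma_2^2}{n_2} + \dfrac{\sigma_1^2}{n_1}$
$Z = \dfrac{\bar{X}_2 - \bar{X}_1}{\sqrt{\dfrac{\sigma_2^2}{n_2} + \dfrac{\sigma_1^2}{n_1}}}$   (Formula 1.2a)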
$\hat{\sigma}_B^2 = \dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}$
$Z = \dfrac{\bar{X}_2 - \bar{X}_1}{\sqrt{\dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}}}$   (Formula 1.2b)
(c) Both sample sizes are small ( n1 < 30, n2 < 30).
$v = n_1 + n_2 - 2$ degrees of freedom
$S_B^2 = \dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$
$T = \dfrac{\bar{X}_2 - \bar{X}_1}{S_B \sqrt{\dfrac{1}{n_2} + \dfrac{1}{n_1}}}$   (Formula 1.3a)
(c) Both sample sizes are small ( n1 < 30, n2 < 30).
$\hat{\sigma}_B^2 = \dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}$
$T = \dfrac{\bar{X}_2 - \bar{X}_1}{\sqrt{\dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}}}$   (Formula 1.3b)
$D_1, D_2, \ldots, D_n$
where:
$D_i = X_i - Y_i, \quad i = 1, 2, \ldots, n$
is the difference between the matched observations of the $i$th individual (the same
subject).
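$T = \dfrac{\bar{D}}{S_D / \sqrt{n}}$   (Formula 1.5a)
Mean difference: $\bar{D} = \dfrac{\sum D_i}{n}$
Standard deviation: $S_D = \sqrt{\dfrac{\sum D_i^2 - \dfrac{\left(\sum D_i\right)^2}{n}}{n - 1}}$   (Formula 1.5b)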
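The paired-sample statistic in Formulas 1.5a and 1.5b can be sketched in a few lines of Python. This is only an illustration: the function name `paired_t` and the blood pressure readings are hypothetical and do not come from the module.

```python
import math

def paired_t(x, y):
    """Return (mean difference, S_D, T) for matched samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be matched samples with n >= 2")
    d = [xi - yi for xi, yi in zip(x, y)]      # D_i = X_i - Y_i
    n = len(d)
    d_bar = sum(d) / n                         # mean difference
    # Shortcut form of the standard deviation (Formula 1.5b)
    s_d = math.sqrt((sum(di * di for di in d) - sum(d) ** 2 / n) / (n - 1))
    t = d_bar / (s_d / math.sqrt(n))           # Formula 1.5a
    return d_bar, s_d, t

# Hypothetical systolic readings before and after treatment (not from the module)
before = [120, 130, 125, 140, 135]
after = [115, 128, 120, 135, 130]
d_bar, s_d, t = paired_t(before, after)
print(f"D-bar = {d_bar:.2f}, S_D = {s_d:.4f}, T = {t:.4f}")
```

The resulting T value would then be compared with a t-table critical value at n − 1 degrees of freedom.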
Take note that in the case of a non-normal population, you may use a non-
parametric technique. Please refer to other advanced data analysis books. For a
detailed explanation on the central limit theorem, you may refer to the following
webpage:
http://www.pinkmonkey.com/studyguides/subjects/stats/chap8/s0808601.asp
SELF-CHECK 1.2
Since formulating $H_0$ and $H_1$ is important but can be difficult, we focus on
problems where the hypotheses are not given. For testing problems where $H_0$ and
$H_1$ are given, you can apply the testing procedures straight away. Let
us look at an example.
Example 1.2:
A researcher is assigned to conduct a study on the beginning salary for executives
in a private company which consists of two groups:
(a) Determine whether there is a difference in the starting salary for groups A
and B; and
(b) Determine whether the starting salary for group A is less than the starting
salary for group B.
Two independent random samples have been chosen from their respective
population groups:
Answer:
For both objectives, there is no statement on the amount/quantity for population
mean difference. For the given objectives, the population parameter under study
is the population mean for the respective groups.
$H_1: \mu_B \ne \mu_A$; and
$H_0: \mu_B = \mu_A$ (the complement)
$H_0: \mu_B - \mu_A = 0$; $H_1: \mu_B - \mu_A \ne 0$.
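$Z = \dfrac{\bar{X}_2 - \bar{X}_1}{\sqrt{\dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}}}$   (Formula 1.6)
$z_B = \dfrac{\bar{X}_B - \bar{X}_A}{\sqrt{\dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}}} = \dfrac{1{,}200 - 1{,}167}{\sqrt{\dfrac{150^2}{64} + \dfrac{100^2}{60}}} = 1.4496 \approx 1.45$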
(b) To Determine if the Starting Salary for Group A is Less Than the
Starting Salary for Group B
“Starting salary for A is less than the starting salary for group B”
⇔ Statement μB > μ A , thus, we write:
(b) H 0 : μB ≤ μ A (complementary)
Hypothesis adjustments:
(b) H 0 : μB ≤ μ A ⇔ H 0 : μB − μ A ≤ 0
Step 5: Even though the populations are not assumed to be normal and the
population variances are unknown, both samples are large. Thus, by
the central limit theorem, the sample means can be assumed to
be approximately normal. The test statistic $B = \bar{X}_2 - \bar{X}_1$ is
converted to a standard Z score where:
$Z = \dfrac{\bar{X}_2 - \bar{X}_1}{\sqrt{\dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}}}$
$z_B = \dfrac{\bar{X}_B - \bar{X}_A}{\sqrt{\dfrac{S_2^2}{n_2} + \dfrac{S_1^2}{n_1}}} = \dfrac{1{,}200 - 1{,}167}{\sqrt{\dfrac{150^2}{64} + \dfrac{100^2}{60}}} = 1.4496 \approx 1.45$
Test decision: Since $z_B = 1.45 < 1.645$, we do
not reject $H_0$.
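The right-tailed calculation in Example 1.2(b) can be checked with a short Python sketch. The function name `two_sample_z` is hypothetical. The standard deviations (150 and 100) and sample sizes (64 and 60) are assumptions chosen to reproduce the reported z = 1.4496, and α = 0.05 is assumed from the critical value 1.645.

```python
import math
from statistics import NormalDist

def two_sample_z(mean2, mean1, s2, s1, n2, n1):
    """Z statistic for B = X2-bar - X1-bar with two large samples."""
    return (mean2 - mean1) / math.sqrt(s2 ** 2 / n2 + s1 ** 2 / n1)

# Group B is population 2 and group A is population 1.
# s2, s1, n2 and n1 are assumptions inferred from the reported z value.
z_b = two_sample_z(mean2=1200, mean1=1167, s2=150, s1=100, n2=64, n1=60)

alpha = 0.05                                  # assumed significance level
z_crit = NormalDist().inv_cdf(1 - alpha)      # right-tailed critical value
p_value = 1 - NormalDist().cdf(z_b)           # right-tailed p-value
reject_h0 = z_b > z_crit

print(f"z = {z_b:.4f}, critical value = {z_crit:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Comparing the p-value with α gives the same decision as comparing z with the critical value.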
EXERCISE 1.2
• After identifying the form of comparison for the population parameter, the
subsequent step is to write or formulate the appropriate null hypothesis, H 0
and the alternative hypothesis, H1 .
• The test statistic used is the difference between the sample means obtained
from the respective populations.
• The comparison between the two population means includes directional and
non-directional changes.
Central limit theorem – The theorem from which it is inferred that for a large
sample size ($n \ge 30$), the shape of the sampling
distribution of $\bar{x}$ is approximately normal. Also, by the
same theorem, the shape of the sampling distribution of
$\hat{p}$ is approximately normal for a sample for which
$np > 5$ and $nq > 5$.
1. State the appropriate null and alternative hypotheses to test the following
claims:
(b) The mean of population A is 50 less than the mean of population B.
(c) The average length of the work week in mining is longer than that in
manufacturing.
2. Determine the p-value for the following hypothesis tests of the difference
between two means given that the population variances are unknown.
3. Determine the critical values that would be used for the following hypothesis
tests about the difference between two means given that the population
variances are unknown.
4. Determine the test statistics that would be used in the following hypothesis
testing.
5. Sketch the rejection region and mark the critical value on the sketch for the
following hypothesis testing on difference mean where population variances
are assumed to be equal.
Machine   Sample   Mean   Standard Deviation
A         10       5.38   1.59
B         12       5.92   0.83
(a) State the null and alternative hypotheses to conduct a hypothesis testing
regarding the students’ complaints.
(b) State the assumptions that should be made to conduct the test.
(c) At α = 0.05 level, what can you conclude regarding the complaints?
Does the data show that the mean score for those with computer experience is
significantly less than the mean score for those without computer experience?
Use α = 0.05.
INTRODUCTION
In Topic 1, we have discussed the comparison of parameter mean for two
populations. Now in this topic, we will discuss the comparison between two
population proportions. Happy reading!
Example 2.1:
Population 1 comprises employees from company A who had a heart attack.
They consist of those who smoke and those who do not smoke. Smoking is an
attribute in the population. The population proportion parameter of subjects who
smoke can be denoted as π1 (sample 1).
The population proportion parameters, π1 and π 2 , are usually unknown and must
be estimated from the sample proportion.
$P_1 = \dfrac{X_1}{n_1}$
$P_2 = \dfrac{X_2}{n_2}$
Both samples are assumed to be independent from each other. The comparison
of population proportion is based on the following terms:
SELF-CHECK 2.1
ACTIVITY 2.1
Let us look at Table 2.2 which displays the formulation of hypothesis for various
comparisons between two population proportions.
Form of Comparison           Mathematical Expressions   $H_0$ and $H_1$ Values
1. Greater than, >           $\pi_2 > \pi_1$            $H_1: \pi_2 > \pi_1$; $H_0: \pi_2 = \pi_1$
2. Less than, <              $\pi_2 < \pi_1$            $H_1: \pi_2 < \pi_1$; $H_0: \pi_2 = \pi_1$
3. Equals to, =              $\pi_2 = \pi_1$            $H_1: \pi_2 \ne \pi_1$; $H_0: \pi_2 = \pi_1$
4. Greater than or equal, ≥  $\pi_2 \ge \pi_1$          $H_1: \pi_2 < \pi_1$; $H_0: \pi_2 \ge \pi_1$
5. Less than or equal, ≤     $\pi_2 \le \pi_1$          $H_1: \pi_2 > \pi_1$; $H_0: \pi_2 \le \pi_1$
6. Not equal, ≠              $\pi_2 \ne \pi_1$          $H_1: \pi_2 \ne \pi_1$; $H_0: \pi_2 = \pi_1$
Note that the expressions for $H_0$ usually contain either an equality (=) or a
non-strict inequality (≤ or ≥).
The original objective is to either accept or reject H1. However, the measurement
in hypothesis testing is rather difficult for this hypothesis since the value is stated
in a specified interval.
Let us look at Table 2.3 which shows the three categories of statements for the
testing of population proportion parameter.
Note that every H 0 statement contains the equality sign “=”. Let us look at another
example.
Example 2.2:
A new technique in teaching statistics has been designed by a professor at a local
university. This technique was tested on 150 students at a private college with
60% of the students admitting that the technique is very effective in learning
statistics. For comparison, another group of 120 students in the same private
college is taught using the traditional technique and 40% of them understand
statistics. Write appropriate null and alternative hypotheses to test the
effectiveness of the new technique.
Answer:
From the problem, we observed that:
(b) The rates of 60% and 40% indicate that the population proportion is the
parameter being compared; and
This situation fits the notion that the new technique is effective. Assume that:
(a) The group of 150 students learning statistics using the new method is a
random sample from population 2.
(b) The other 120 students are a random sample from population 1.
From the previous observations and selection of hypothesis statement (U3), the
following is the appropriate hypothesis:
$H_1: \pi_2 > \pi_1$ ⇔ $H_1: \pi_2 - \pi_1 > 0$
$H_0: \pi_2 \le \pi_1$ ⇔ $H_0: \pi_2 - \pi_1 \le 0$
SELF-CHECK 2.2
ACTIVITY 2.2
Name two binomial populations which have the same attributes. Also
state the attributes. Then, compare the proportion of the attributes by
choosing one of the comparisons and mathematical expressions shown in
Table 2.3. Share your answer for discussion in the myINSPIRE forum.
H 0 : π 2 − π1 = 0
Thus, the test statistics used is the difference between the sample proportions
obtained from the respective populations, which is:
Statistics B = P2 − P1
ACTIVITY 2.3
$\sigma_B^2 = P(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)$   (Formula 2.1a)
with $P = \dfrac{n_1 P_1 + n_2 P_2}{n_1 + n_2} = \dfrac{x_1 + x_2}{n_1 + n_2}$   (Formula 2.1b)
Such that:
(a) $P_1 = \dfrac{X_1}{n_1}$ is the attribute proportion for sample 1 of size $n_1$; and
(b) $P_2 = \dfrac{X_2}{n_2}$ is the attribute proportion for sample 2 of size $n_2$.
(b) n2π 2 ≥ 5, n2 (1 − π 2 ) ≥ 5.
Take note that the larger the sample size, the better the approximation.
$Z = \dfrac{P_2 - P_1}{\sqrt{P(1 - P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$   (Formula 2.2)
However, if both conditions are not satisfied, the probability of rejecting H 0 can
be calculated by using the binomial distribution table. Nevertheless, in this topic,
discussion is limited to a random sample of a large sample size.
Good comprehension of the technique and procedure of hypothesis testing for mean
difference is important to help you learn and understand hypothesis testing for
proportion difference. Let us look at Example 2.3.
Example 2.3:
Inspection is carried out on 200 items produced by company A and 2% is found
to be defective. A sample of 300 items from company B contains 3% defective
items. Can we conclude that items from company A are better? Test at the 0.05
level of significance.
Answer:
Step 2: The sample sizes obtained from both populations are large, thus
the distribution for sample proportion is assumed to be normal
distribution.
Thus, we write:
Then,
$H_0: \pi_2 \ge \pi_1$ ⇔ $H_0: \pi_2 - \pi_1 \ge 0$
$H_1: \pi_2 < \pi_1$ ⇔ $H_1: \pi_2 - \pi_1 < 0$
Step 4: Significance level α = 0.05 is given. Thus, from the standard normal
table, the critical value for a one-tailed test (to the left):
zα = −1.645
$P = \dfrac{4 + 9}{500} = \dfrac{13}{500} = 0.026$
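The pooled proportion above and the Z statistic of Formula 2.2 can be sketched in Python as follows. This is an illustration only: the function name `two_proportion_z` is hypothetical, and it assumes that population 2 is company A (4 defective items out of 200) and population 1 is company B (9 defective items out of 300), which matches the left-tailed hypotheses stated for this example.

```python
import math
from statistics import NormalDist

def two_proportion_z(x2, n2, x1, n1):
    """Return (pooled P, Z) for the statistic B = P2 - P1."""
    p2, p1 = x2 / n2, x1 / n1
    p_pool = (x1 + x2) / (n1 + n2)                       # Formula 2.1b
    var_b = p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)    # Formula 2.1a
    return p_pool, (p2 - p1) / math.sqrt(var_b)          # Formula 2.2

# Assumption: population 2 = company A, population 1 = company B
p_pool, z = two_proportion_z(x2=4, n2=200, x1=9, n1=300)
z_crit = NormalDist().inv_cdf(0.05)      # left-tailed critical value at alpha = 0.05

print(f"pooled P = {p_pool:.3f}, z = {z:.4f}, critical value = {z_crit:.3f}")
print("Reject H0" if z < z_crit else "Do not reject H0")
```

For a left-tailed test, $H_0$ is rejected only when z falls below the negative critical value.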
• Binomial population means that the population consists of some subjects with
a certain attribute while others do not have that attribute.
• The simple rule is that the mathematical expression which does not contain an
equality sign is used to formulate the alternative hypothesis. The null hypothesis
is then formulated by complementing the expression in the alternative
hypothesis.
• The test statistic used is the difference between the sample proportions
obtained from the respective populations, which is:
Statistic $B = P_2 - P_1$
• For large samples taken from both populations, the normal distribution can be
used as an approximation.
(a) There is no difference between the proportions of men and women who
will vote for the incumbent in next month’s election.
(b) The percentage of boys who cut classes is greater than the percentage of
girls who cut classes.
(c) The percentage of college students who drive an old car is higher than
the percentage of non-college people of the same age who drive an old
car.
2. Determine the critical region and critical value(s) which would be used to test
the following hypotheses when z is used as the test statistics.
4. What conditions determined that both samples are large and normal
approximation can be applied?
(b) Suppose for some practical reasons, you know that π1 cannot be larger
than π 2 .
(i) Given this knowledge, what should you choose for the alternative
hypothesis? What about the null hypothesis?
3. An article reported that smoking boosts the death risk of diabetics. Suppose
that as a follow-up study, we investigate the smoking rates for male and
female diabetics and obtain the following data.
(a) Formulate the null and alternative hypotheses to test whether the
smoking rate is higher for males than for females.
(c) State the conclusion at a 0.05 level of significance and at a 0.01 level of
significance.
INTRODUCTION
Previously, you have been exposed to several methods of testing and estimating a
population mean. Similarly, researchers might be interested in making inferences on
variation in a population; the correct parameter to be used in this case is the
population variance, $\sigma^2$. There are several reasons why it is important to test
hypotheses concerning the variances of populations, and inferences on variance can be
applied in daily life.
In the financial area, investors use variation in the returns from portfolio stock, bond
or any type of investments as a measure on uncertainties and risks. With this
method, investors will be able to reduce the risk of investment.
Both distributions are non-symmetrical and skewed to the right. The chi-square
distribution is used for testing hypotheses on a single population variance,
$\sigma^2$, or the population standard deviation, $\sigma$, and to construct their
confidence intervals.
On the other hand, the F distribution is required for studies involving two or more
independent populations. The first sample variance, $s_1^2$, is calculated and used as a
point estimate of the first population variance, $\sigma_1^2$, and the second sample
variance, $s_2^2$, is calculated and used as a point estimate of the second
population variance, $\sigma_2^2$.
Theorem 1
Suppose a random sample of size n is chosen from a normal distribution with
mean μ and variance $\sigma^2$. Then the statistic $\chi^2$ can be defined as:
$\chi^2 = \dfrac{(n - 1) s^2}{\sigma^2}$
The bigger the value of v, the flatter the density curve is, which is skewed to the
right.
Using v = 1, 2, 3 and 4, sketch the graph of f (x) versus X separately. What can you
observe? Give your comment.
SELF-CHECK 3.1
ν      Large α: $\chi_c^2$      Small α: $\chi_c^2$
$\Pr(\chi^2 > \chi_c^2) = \alpha, \quad 0 \le \alpha \le 1$
To facilitate the usage of Table 3.1, we use the term "small α" if its values are in
the range $0 \le \alpha \le 0.1$ and the term "large α" if its values are in the
complementary interval $0.9 \le \alpha \le 1$. Usually, the $\chi_c^2$ values are big for small α,
and $\chi_c^2$ is small when α is large. Let us look at two examples in Figure 3.2 where
$\Pr(\chi^2_{(\nu)} \ge \chi_c^2) = \alpha$.
(a) (b)
Based on Table 3.1, it can be seen that for the same v degrees of freedom (v = 5),
the $\chi_c^2$ value approaches zero as the value of α increases in the "large α" case.
On the other hand, the value of $\chi_c^2$ increases and approaches ∞ as the α value
decreases in the "small α" case. This is an important property, especially when α
is used as the significance level in hypothesis testing or in confidence interval
construction for the population variance, $\sigma^2$.
Meanwhile, Table 3.2 shows part of χc2 values for some α values and the
corresponding degrees of freedom v where ν = n –1.
                $\chi^2$ Value
ν      α = 0.95   α = 0.90   α = 0.10   α = 0.05
1
2
.
.
11     4.575      5.578      17.275     19.675
12     5.226      6.304      18.549     21.026
13     5.892      7.042      19.812     22.362
14     6.571      7.790      21.064     23.685
.
Table 3.2 was constructed in a similar way to the t-table that contains the t values.
The top row of Table 3.2 represents the area to the right of the point, as in
Figure 3.4, read against the row with the appropriate degrees of freedom.
If $v = 12$ and $\alpha = 0.05$, then $\chi^2_{0.05} = 21.026$, and if $v = 12$ and $\alpha = 0.95$, then
$\chi^2_{0.95} = 5.226$. Since the chi-square distribution is non-symmetrical, the values on
the left-hand side of the distribution need to be determined from Table 3.2 in the
same way as the values on the right-hand side of the distribution.
Based on the two values $\chi^2_{0.05} = 21.026$ and $\chi^2_{0.95} = 5.226$, it is known that
The values and area for chi-square distribution with v = 12 degrees of freedom are
shown in Figure 3.3(a). Since Pr (χ 2 > 5.226) = 0.95 , hence Pr (χ 2 < 5.226) =
0.05.
Figure 3.3(a): Values and area for chi-square distribution with v = 12 degrees of freedom
It is important to note that the χ 2 point represents its position on the horizontal
axis. As such, all points and probabilities/areas in the Table 3.2 satisfy
$\Pr(\chi^2_{(\nu)} \ge \chi_c^2) = \alpha$.
Figure 3.3(b): Values and area for chi-square distribution when α = 0.05
Example 3.1:
Suppose that $n = 20$; determine $\chi^2_{0.975}$ and $\chi^2_{0.025}$.
Answer:
Figure 3.4 clearly depicts that the area on the right side of point 8.907 is 0.975 and the
area on the right side of point 32.852 is 0.025.
Example 3.2:
Suppose that X and Y are independent random variables distributed as χ 2 (2) and
χ 2 (6) respectively, determine the probability of:
(a) X > 7.38 (b) X < 0.103 (c) Y < 22.46 (d) X + Y > 2.18
Answer:
(a) It is known that X is distributed as $\chi^2(2)$. From Appendix 3.1, an area of
0.025 lies to the right of the point 7.378
(column α = 0.025, row v = 2)
∴ Pr(X > 7.378) = 0.025
(c) $Y \sim \chi^2(6)$. Hence, Pr(Y > 22.46) = 0.001 (from column α = 0.001,
row v = 6)
∴ Pr(Y < 22.46) = 1 – 0.001 = 0.999
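The table look-ups in Example 3.2 can be cross-checked without a table, because for an even number of degrees of freedom the chi-square upper-tail probability has a closed form: $\Pr(\chi^2_{(v)} > x) = e^{-x/2} \sum_{j=0}^{v/2-1} (x/2)^j / j!$. The sketch below uses only the standard library; the function name `chi2_upper_tail_even` is hypothetical, and part (d) relies on the fact that the sum of independent $\chi^2(2)$ and $\chi^2(6)$ variables is $\chi^2(8)$.

```python
import math

def chi2_upper_tail_even(x, v):
    """Pr(chi-square(v) > x) for even v, using the closed-form finite sum."""
    if v <= 0 or v % 2 != 0:
        raise ValueError("this closed form only holds for even degrees of freedom")
    half = x / 2
    terms = sum(half ** j / math.factorial(j) for j in range(v // 2))
    return math.exp(-half) * terms

print(round(chi2_upper_tail_even(7.38, 2), 3))        # (a) Pr(X > 7.38)
print(round(1 - chi2_upper_tail_even(0.103, 2), 3))   # (b) Pr(X < 0.103)
print(round(1 - chi2_upper_tail_even(22.46, 6), 3))   # (c) Pr(Y < 22.46)
print(round(chi2_upper_tail_even(2.18, 8), 3))        # (d) Pr(X + Y > 2.18)
```

The printed values can be compared with the probabilities read from Appendix 3.1.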
EXERCISE 3.1
Statement 1
To test whether a random sample of size n with sample variance $S^2$ was drawn
from a normal population with variance $\sigma^2$, we use the statistic
$\chi^2 = \dfrac{(n - 1) S^2}{\sigma^2}$
which is distributed as $\chi^2(n - 1)$ when the null hypothesis is true.
(a) Suppose that we want to find the rejection region (RR) for a one-sided,
right-tailed (small α) hypothesis test at the 5% significance level (α = 0.05)
with sample size n = 9.
(b) Find the critical region for a one-sided, left-tailed (large α) test at the 1%
level (α = 0.01) with sample size n = 21.
(c) Determine the critical values at the α = 0.05 level with sample size n = 20 for a
two-sided test.
Figure 3.8: First step to determine critical value at α = 0.05 level and
sample size, n = 20 for a two-sided test
Figure 3.9: Second step to determine critical value at α = 0.05 level and
sample size, n = 20 for a two-sided test
Table 3.3 displays the RR for various combinations of null hypothesis, H 0 , and
alternative hypothesis, H1.
Example 3.4:
A normal population has a variance of 9.0. A random sample of size 9 and
variance 8.01 was drawn from a normal population. Determine whether the
variance from this random sample is 9.0. Test at 5% level of significance.
Answer:
There are a few steps to be considered to answer this question:
$H_0: \sigma^2 = 9.0$
$H_1: \sigma^2 \ne 9.0$
$\chi^2 < \chi^2_{97.5\%}(8)$ or $\chi^2 > \chi^2_{2.5\%}(8)$
Test statistic: $\chi^2 = \dfrac{(n - 1) S^2}{\sigma^2} = \dfrac{8(8.01)}{9} = 7.12$
Step 7: Conclusion
In conclusion, we do not reject $H_0$: there is insufficient evidence to say
that the sample is not from a normal population with population
variance $\sigma^2 = 9.0$ (see Figure 3.10).
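The arithmetic of Example 3.4 can be sketched as below. The function name `chi_square_variance_stat` is hypothetical, and the two critical values for v = 8 (2.180 and 17.535) are assumed here from a standard chi-square table rather than taken from the text above.

```python
def chi_square_variance_stat(n, s2, sigma2_0):
    """Chi-square statistic (n - 1) * S^2 / sigma_0^2 for a single variance."""
    return (n - 1) * s2 / sigma2_0

chi2_stat = chi_square_variance_stat(n=9, s2=8.01, sigma2_0=9.0)

# Assumed table values for v = 8: chi-square(0.975) = 2.180, chi-square(0.025) = 17.535
lower_crit, upper_crit = 2.180, 17.535
reject_h0 = chi2_stat < lower_crit or chi2_stat > upper_crit

print(f"chi-square = {chi2_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Because the test is two-sided, the statistic is checked against both tails, each carrying α/2 = 0.025.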
EXERCISE 3.2
3.2 F DISTRIBUTION
Now, let us move on to F distribution by learning about its properties and table. Let
us continue with the lesson.
Theorem 2
Let U and V be two independent random variables with chi-squared distributions
with $v_1$ and $v_2$ degrees of freedom, respectively. Then,
$F = \dfrac{U / v_1}{V / v_2}$
is a random variable that follows the F distribution with $v_1$ and $v_2$ degrees
of freedom.
SELF-CHECK 3.2
$\Pr(F \ge F_{\nu_1,\nu_2;\alpha}) = \alpha$.
The critical value of the F distribution lies on the right side of the function graph
(see Figure 3.11). For each pair of $v_1$ and $v_2$, the first, second and third rows in the
F distribution table are the critical values at the 0.05, 0.025 and 0.01 significance levels.
To determine the critical values on the left side, the following relationship
is applied:
$F_{\nu_1,\nu_2;\alpha} = \dfrac{1}{F_{\nu_2,\nu_1;1-\alpha}}$
$v_1$   $v_2$   $F_{\nu_1,\nu_2;0.05}$
6       5       4.950
6       6       4.284
6       7       3.866
6       8       3.581

$v_1$   $v_2$   $F_{\nu_1,\nu_2;0.05}$
5       6       4.387
6       6       4.284
7       6       4.207
8       6       4.147
Comment: The F critical value also increases when the significance level
decreases, v1 varies and v2 is fixed.
Using the special property of the F distribution, that is,
$F_{\nu_1,\nu_2;\alpha} = \dfrac{1}{F_{\nu_2,\nu_1;1-\alpha}}$, we can find the
values of the F distribution at α = 0.95, 0.975 and 0.99. Let us look at Example 3.5.
Example 3.5:
Determine the value of $F_{10,11;\alpha}$ for α = 0.95.
Answer:
Using the property,
$F_{10,11;0.95} = \dfrac{1}{F_{11,10;1-0.95}} = \dfrac{1}{F_{11,10;0.05}} = \dfrac{1}{2.943} = 0.3398$
EXERCISE 3.3
2. Determine the values (using the F distribution table) that satisfy the
equation below:
(a) (6,14) = 3.50 (b) (10,32) = 2.93
(c) (24,38) = 1.81 (d) (2,24) = 5.61
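$F = \dfrac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2}$
Hence, we write the equation as follows; it is illustrated in Figure 3.12.
$\Pr\left(F_{1-\alpha/2,\,n_1-1,\,n_2-1} < \dfrac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} < F_{\alpha/2,\,n_1-1,\,n_2-1}\right) = 1 - \alpha$
Since $F_{1-\alpha/2,\,n_1-1,\,n_2-1} = \dfrac{1}{F_{\alpha/2,\,n_2-1,\,n_1-1}}$, then,
$\Pr\left(F_{1-\alpha/2}(\nu_1,\nu_2) < \dfrac{\sigma_2^2 S_1^2}{\sigma_1^2 S_2^2} < F_{\alpha/2}(\nu_1,\nu_2)\right) = 1 - \alpha$
⇔ $\Pr\left(\dfrac{S_2^2}{S_1^2} F_{1-\alpha/2}(\nu_1,\nu_2) < \dfrac{\sigma_2^2}{\sigma_1^2} < \dfrac{S_2^2}{S_1^2} F_{\alpha/2}(\nu_1,\nu_2)\right) = 1 - \alpha$
and, taking reciprocals, we obtain Theorem 3.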
Theorem 3
If $s_1^2$ and $s_2^2$ are the variances of independent random samples from normal
populations, then
$\dfrac{S_1^2}{S_2^2} \cdot \dfrac{1}{F_{\alpha/2,\,n_1-1,\,n_2-1}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{S_1^2}{S_2^2} \cdot F_{\alpha/2,\,n_2-1,\,n_1-1}$
is defined as the (1 – α)100% confidence interval for $\dfrac{\sigma_1^2}{\sigma_2^2}$. Let us look at Example 3.6.
Example 3.6:
In one experiment, the first sample with $n_1 = 9$ has a standard deviation $s_1 = 0.5$,
while the second sample with $n_2 = 7$ has a standard deviation $s_2 = 0.7$.
Determine a 98% confidence interval for the ratio $\dfrac{\sigma_1^2}{\sigma_2^2}$.
Answer:
From the F distribution table, we obtained the values of $f_{0.01,9,7} = 6.72$ and
$f_{0.01,7,9} = 5.61$ for α = 0.02. Therefore, substitute the values above into the
following equation,
The confidence interval for the ratio $\dfrac{\sigma_1^2}{\sigma_2^2}$ contains the value 1;
therefore, $\sigma_1^2 = \sigma_2^2$ is plausible.
EXERCISE 3.4
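Now, if both samples come from normal populations with equal variance $\sigma^2$, then
$\dfrac{(n - 1) S_X^2}{\sigma^2} \sim \chi^2(n - 1)$ and $\dfrac{(m - 1) S_Y^2}{\sigma^2} \sim \chi^2(m - 1)$
Hence, $\dfrac{(n - 1) S_X^2 \,/\, \left[(n - 1)\sigma^2\right]}{(m - 1) S_Y^2 \,/\, \left[(m - 1)\sigma^2\right]}$ is an $F(n - 1, m - 1)$ variable.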
To test whether two random samples of sizes n and m, with sample variances $S_X^2$ and
$S_Y^2$ respectively, were taken from normal populations with equal variance, we use the
statistic $F = \dfrac{\hat{\sigma}_X^2}{\hat{\sigma}_Y^2}$, which is distributed as
$F(n - 1, m - 1)$ when the null hypothesis is true
(that is, the samples have equal variance), with unbiased estimated variances
$\hat{\sigma}_X^2 = \dfrac{(n - 1) S_X^2}{n - 1}$ and $\hat{\sigma}_Y^2 = \dfrac{(m - 1) S_Y^2}{m - 1}$
respectively. Table 3.4 summarises the test on
variance comparison.
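Null hypothesis: $H_0: \sigma_1^2 = \sigma_2^2 \Leftrightarrow H_0: \dfrac{\sigma_1^2}{\sigma_2^2} = 1$
Test statistic: $F = \dfrac{S_1^2}{S_2^2}$
    (a) One-sided test-right
        Alternative hypothesis: $H_1: \sigma_1^2 > \sigma_2^2 \Leftrightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} > 1$
        Critical region: $\dfrac{S_1^2}{S_2^2} \ge f_{\alpha}(\nu_1, \nu_2)$
    (b) One-sided test-left
        Alternative hypothesis: $H_1: \sigma_1^2 < \sigma_2^2 \Leftrightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} < 1$
        Critical region: $\dfrac{S_1^2}{S_2^2} \le f_{1-\alpha}(\nu_1, \nu_2)$
    (c) Two-sided test
        Alternative hypothesis: $H_1: \sigma_1^2 \ne \sigma_2^2 \Leftrightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} \ne 1$
        Critical region: $\dfrac{S_1^2}{S_2^2} \ge f_{\alpha/2}(\nu_1, \nu_2)$ or $\dfrac{S_1^2}{S_2^2} \le f_{1-\alpha/2}(\nu_1, \nu_2)$

Null hypothesis: $H_0: \sigma_1^2 \ge \sigma_2^2 \Leftrightarrow H_0: \dfrac{\sigma_1^2}{\sigma_2^2} \ge 1$
    Alternative hypothesis: $H_1: \sigma_1^2 < \sigma_2^2 \Leftrightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} < 1$
    Critical region: $\dfrac{S_1^2}{S_2^2} \le f_{1-\alpha}(\nu_1, \nu_2)$

Null hypothesis: $H_0: \sigma_1^2 \le \sigma_2^2 \Leftrightarrow H_0: \dfrac{\sigma_1^2}{\sigma_2^2} \le 1$
    Alternative hypothesis: $H_1: \sigma_1^2 > \sigma_2^2 \Leftrightarrow H_1: \dfrac{\sigma_1^2}{\sigma_2^2} > 1$
    Critical region: $\dfrac{S_1^2}{S_2^2} \ge f_{\alpha}(\nu_1, \nu_2)$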
Let us look at Example 3.7 which shows you the situation in comparing two
different variances.
Example 3.7:
Random samples of sizes n1 = 16 and n2 = 25 were taken from two normal
populations. The sample variances are $s_1^2 = 48$ and $s_2^2 = 26$ respectively.
Using α = 0.05 and appropriate hypotheses, carry out a test to determine
whether the variance of the first population is greater than the variance of
the second population.
Answer:
The steps to solve the problem are:
Population 1 Population 2
(a) Shape: Normal (a) Shape: Normal
(b) Mean: μ1 (unknown) (b) Mean: μ 2 (unknown)
(c) Standard deviation: σ 1 (unknown) (c) Standard deviation: σ 2 (unknown)
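$$H_1: \sigma_1^2 > \sigma_2^2 \Leftrightarrow H_1: \frac{\sigma_1^2}{\sigma_2^2} > 1$$
where
$$H_0: \sigma_1^2 \le \sigma_2^2 \Leftrightarrow H_0: \frac{\sigma_1^2}{\sigma_2^2} \le 1$$
$$F = \frac{s_1^2}{s_2^2} = \frac{48}{26} = 1.85$$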
Step 7: Conclusion
Hence, there is not enough evidence to say that $\sigma_1^2 > \sigma_2^2$; the data are
consistent with the variances of both populations being equal.
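The test in Example 3.7 can be sketched in a few lines. This is a minimal illustration, not the module's worked solution: the function name `f_test_right_tailed` is hypothetical, and the critical value 2.11 is an assumed table value for f_0.05(15, 24) that is not quoted in the text above.

```python
def f_test_right_tailed(s1_sq, s2_sq, critical_value):
    """Right-tailed F test of H0: sigma1^2 <= sigma2^2 (see Table 3.4).

    Returns the test statistic and whether H0 is rejected.
    """
    f_stat = s1_sq / s2_sq
    return f_stat, f_stat >= critical_value


# Example 3.7: s1^2 = 48 (n1 = 16), s2^2 = 26 (n2 = 25), alpha = 0.05.
# ASSUMPTION: 2.11 is the tabulated f_0.05(15, 24); check it against your table.
f_stat, reject_h0 = f_test_right_tailed(48, 26, 2.11)
print(round(f_stat, 2), "reject H0:", reject_h0)
```

Because the statistic falls below the assumed critical value, the sketch does not reject H0, which agrees with the conclusion of the example.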
EXERCISE 3.5
Sample 1 Sample 2
n1 = 16 n2 = 25
x1 = 48.7 x2 = 39.2
(a) X1 + X 2 (b) X1 + X 3
$$E\left[ \frac{(X_1-\mu_1)^2}{\sigma_1^2} + \frac{(X_2-\mu_2)^2}{\sigma_2^2} \right] = 2$$
Sample 1 24.3 46.0 56.6 40.3 64.1 69.5 48.1 37.1 56.5 50.6
Sample 2 31.9 42.8 55.4 52.3 46.5 42.0 45.5 42.4 32.0 51.5
• The chi-square distribution is the sampling distribution of the variable $(n-1)s^2/\sigma^2$ and possesses the following properties:
– The chi-square value is always greater than or equal to 0, unlike z and t
values, which can be negative.
– The distribution is non-symmetrical.
– The distribution varies according to the sample size. This means the
shape of the distribution depends strongly on the value ν, which is the
degrees of freedom in a sampling situation.
– The mean of a chi-square distribution is equal to its degrees of freedom.
– It is used to make inferences about the variance of a population.
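Two of the properties above (non-negative values, mean equal to the degrees of freedom) can be checked by simulation. The sketch below relies on a standard fact that is not stated in this module: a chi-square variable with ν degrees of freedom can be generated as the sum of ν squared standard normal values. The seed, the choice ν = 5 and the number of draws are arbitrary assumptions.

```python
import random


def simulate_chi_square(df, rng):
    """Draw one chi-square value as a sum of df squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(2024)   # fixed seed so the run is repeatable
df = 5                      # hypothetical degrees of freedom
draws = [simulate_chi_square(df, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
print("smallest draw:", min(draws))    # never negative
print("sample mean  :", sample_mean)   # should be close to df
```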
Freund, J. E. (2003). Mathematical statistics (7th ed.). Upper Saddle River, NJ:
Prentice-Hall, Inc.
Mann, P. S. (2005). Introductory statistics using technology (5th ed.). Upper Saddle
River, NJ: John Wiley & Sons, Inc.
Source: http://www.z-table.com/chi-square-table.html
INTRODUCTION
Can we use the methods learned in Topics 1 and 2 to compare three or more
populations? Well, there are situations when we are interested to broaden our test
scope to more than two populations. The following are two situations which
compare three or more populations:
Each production method may result in producing a different mean strength and a
researcher would like to test the equality of several means. This investigation could
use a procedure known as analysis of variance (ANOVA). ANOVA is a basic
method in experimental design and is also widely used in inferential statistics.
(a) Independent variables are also known as factors. A study may involve 1, 2
or more factors e.g. type of tuition classes, time of classes and students’
commitment.
(b) Each factor may involve different factor levels (categories) (see Table 4.1 for
examples).
Factor | Level
Type of tuition classes | Method 1: Tuition; Method 2: Intensive class
Tuition class time | Time 1: Afternoon; Time 2: Night; Time 3: Weekends
(c) The combination of one level of a factor with a level of another factor is
known as a treatment or run.
Notes:
(a) For a single factor experiment, the factor level and treatment carry the same
meaning.
(b) In this book, only a single-factor experiment is considered. For multiple
factors, please refer to the appropriate reference books.
Therefore, the focus of Topic 4 is to compare the means of three or more
populations, restricted to one variable or one factor. Examples of experiments
involving the comparison of means for three or more populations include:
(a) A college principal is interested in comparing the marks obtained by students
from Years 1, 2 and 3;
(b) An engineer wants to compare the current flow for five different conductors;
and
(c) A pharmacist is interested in comparing the effectiveness of three types of
medicine used to treat patients in a hospital.
You have learned about display plots in the SBST1203 Introductory Statistics
module. Now, the focus is more on the uses of the box-plot and dot-plot which
provide a visual display before calculation on mean comparison is carried out.
Figure 4.1 displays the box-plot graph for comparison of mean speed according to
the types of car.
Figure 4.1 shows that the five cars have different means or medians and
different variances for their speed. Are the observed differences statistically
significant? The plot alone cannot tell us.
Therefore, ANOVA is used to carry out a numerical significance test on the
equality of the means. The null hypothesis is that there is no difference in the
mean speed among the cars. If the null hypothesis is rejected, this shows that the
mean speed of at least one car differs significantly from the others. If we are
interested, we can then determine which car's mean speed caused the difference.
There are several assumptions that we should consider before running the ANOVA
procedure, which are:
(c) The populations from which the samples are drawn have unknown but equal
variances, that is $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$.
Figure 4.2 displays the graph shape for populations that satisfy the previous
assumptions of ANOVA.
Let us carefully observe Figure 4.3 and comment on the means; μ1, μ2 , and μ3.
Example 4.1:
The following is a one-way experiment/test. Measurement in the table is the final
examination score for 12 students according to type/category of tuition class.
Notes:
3. There are three groups in this experiment with dot plot as shown in
Figure 4.4.
4. There exists variation within each respective group (as observed from
variation in marks value).
5. There exists variation between groups (observe the varying position of the
group centre).
SELF-CHECK 4.1
ACTIVITY 4.1
EXERCISE 4.1
Table shows two data sets which are data 1 and data 2. By using these
datasets, construct a dot-plot graph. Give your comment.
Data 1 Data 2
A B C A B C
14.9 14.4 14.5 16.6 13.2 13.5
14.9 14.4 14.5 16.8 14.7 16.9
14.9 14.4 14.5 13.2 15.7 15.4
14.9 14.4 14.5 15.3 12.3 12.8
14.9 14.4 14.5 12.6 16.1 13.9
Total 74.5 72.0 72.5 74.5 72.0 72.5
Mean 14.9 14.4 14.5 14.9 14.4 14.5
Variance 0 0 0 3.71 2.63 2.71
Overall total 219.0 219.0
Mean total 14.6 14.6
Table 4.2: Measurement Collected from Observations Under Study (Subject) i and j
Treatment | Observation 1 | Observation 2 | … | Observation n
1 | $x_{11}$ | $x_{12}$ | … | $x_{1n}$
2 | $x_{21}$ | $x_{22}$ | … | $x_{2n}$
⋮ | ⋮ | ⋮ | … | ⋮
k | $x_{k1}$ | $x_{k2}$ | … | $x_{kn}$
Table 4.2 shows the measurements $x_{ij}$ collected from the observations under study
(subject), where i = 1, 2, …, k indexes the treatment and j = 1, 2, …, n indexes the
observation. The sample sizes are not necessarily equal, and the total sample size is
$N = \sum_{i=1}^{k} n_i$. The overall mean is obtained by dividing the total of all
observations by the total sample size, that is $\bar{x} = \sum_{i=1}^{k} T_i / N$, where
$T_i$ is the total of the observations in treatment i.
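Note that the total sum of squares SST can be written as:
$$\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x})^2 = \sum_{i=1}^{k}\sum_{j=1}^{n} \left[ (\bar{x}_i-\bar{x}) + (x_{ij}-\bar{x}_i) \right]^2 \qquad \text{Formula 4.1}$$
or
$$\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x})^2 = n\sum_{i=1}^{k} (\bar{x}_i-\bar{x})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2 + 2\sum_{i=1}^{k}\sum_{j=1}^{n} (\bar{x}_i-\bar{x})(x_{ij}-\bar{x}_i) \qquad \text{Formula 4.2}$$
However, the cross-product term in Formula 4.2 is zero since
$$\sum_{j=1}^{n} (x_{ij}-\bar{x}_i) = x_{i\cdot} - n\bar{x}_i = x_{i\cdot} - n\left(x_{i\cdot}/n\right) = 0,$$
where $x_{i\cdot}$ is the total of the observations in treatment i.
Hence, we have:
$$\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x})^2 = n\sum_{i=1}^{k} (\bar{x}_i-\bar{x})^2 + \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2 \qquad \text{Formula 4.3}$$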
Formula 4.3 states that the total variability in the data, as measured by the total sum
of squares, can be partitioned into the sum of squared deviations of the treatment
means from the overall mean and the sum of squared deviations of the observations
within each treatment from their treatment mean. The differences between the observed
treatment means and the overall mean measure the differences between treatments,
while the differences between the observations within a treatment and the treatment
mean can only be caused by random error. As such, we can write Formula 4.3 (in
symbols) as:
$$SST = SS(Tr) + SSE$$
where SS(Tr) is the sum of squares due to treatments (that is, between treatments)
and SSE is the sum of squares due to error (that is, within treatments). Since there
are kn = N total observations, SST has N – 1 degrees of freedom. SS(Tr) has
k – 1 degrees of freedom since there are k factor levels (and k treatment means).
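The partition in Formula 4.3 can be verified numerically. The following sketch uses Data 2 from Exercise 4.1 (three treatments with five observations each); the variable names are illustrative assumptions, and the code simply evaluates the three sums of squares from their definitions.

```python
# Data 2 from Exercise 4.1: three treatments (A, B, C), five observations each.
groups = {
    "A": [16.6, 16.8, 13.2, 15.3, 12.6],
    "B": [13.2, 14.7, 15.7, 12.3, 16.1],
    "C": [13.5, 16.9, 15.4, 12.8, 13.9],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Total sum of squares: every observation against the overall mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

# Treatment sum of squares: each treatment mean against the overall mean.
ss_tr = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Error sum of squares: each observation against its own treatment mean.
ss_e = sum(
    (x - sum(values) / len(values)) ** 2
    for values in groups.values()
    for x in values
)

print(round(sst, 2), round(ss_tr, 2), round(ss_e, 2))
```

Up to floating-point rounding, `sst` equals `ss_tr + ss_e`, which is the identity stated in Formula 4.3.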
It is instructive to examine both terms on the right side of the basic
identity for analysis of variance (Formula 4.3). Consider the following sum of
squares for error:
$$SSE = \sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2 = \sum_{i=1}^{k} \left[ \sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2 \right]$$
The term inside the square brackets is the sample variance of the i-th treatment
multiplied by n – 1, since
$$S_i^2 = \frac{\sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2}{n-1}, \qquad i = 1, 2, \dots, k$$
Now, the sample variances can be pooled to get a single estimate of the common
population variance, that is,
$$\frac{(n-1)S_1^2 + (n-1)S_2^2 + \dots + (n-1)S_k^2}{(n-1) + (n-1) + \dots + (n-1)} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n} (x_{ij}-\bar{x}_i)^2}{\sum_{i=1}^{k} (n-1)} = SSE / (N-k)$$
In the same way, if there is no difference between the k treatment means, we can use
the variation of the treatment means about the overall mean to estimate $\sigma^2$. In
particular, when n is equal for each of the k treatments,
$$SS(Tr)/(k-1) = \frac{n\sum_{i=1}^{k} (\bar{x}_i-\bar{x})^2}{k-1}$$
is an estimate of $\sigma^2$ when the treatment means are the same. If the sample sizes
are unequal, we have
$$SS(Tr)/(k-1) = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i-\bar{x})^2}{k-1}$$
Note that the analysis of variance identity (Formula 4.3) provides two estimates of
$\sigma^2$, one based on the variation within treatments and one based on the variation
between treatments. If there are no differences in treatment means, both estimates
should be almost equal. If the estimates are not the same, we suspect that the
observed difference is due to differences between the treatment means. The
quantities
$$MS(Tr) = \frac{SS(Tr)}{k-1} \quad \text{and} \quad MSE = \frac{SSE}{N-k}$$
are the mean square of treatment and the mean square of error respectively.
Example 4.2:
Using data on students’ test marks according to classes (full marks = 20),
calculate the mean squares between classes, mean squares between students
(errors) and total deviation of overall marks.
Answer:
The results are as below:
We obtained:
$$SS(Tr) = \sum_{j} n_j (\bar{x}_j-\bar{x})^2 = 5(14.9-14.6)^2 + 5(14.4-14.6)^2 + 5(14.5-14.6)^2 = 5(0.09 + 0.04 + 0.01) = 0.7$$
Degrees of freedom, df = k – 1 = 3 – 1 = 2
As such,
$$MS(Tr) = \frac{SS(Tr)}{k-1} = \frac{0.7}{2} = 0.35$$
This means that the variation in measurement between the three samples, that is
the variation of students’ marks between classes, is 0.35.
This means that the variation in measurement within the three samples, that is
the variation of marks between students in their respective classes, is 0.14.
Hence,
$$SST = \sum_{i}\sum_{j} (x_{ij}-\bar{x})^2 = (15.2-14.6)^2 + (15.4-14.6)^2 + \dots + (14.6-14.6)^2 = 2.38$$
In this topic, the term mean square is synonymous with the term variance.
Therefore, when discussing the variation in measurement for a set of data, the
term mean square is used instead of the word variance. When the mean square of
treatments is greater than the mean square of errors, we have an early indication
that there may be a significant difference between the effects of the treatments
on the experimental results. The following subtopic will demonstrate how such
differences can be verified using an appropriate statistical test.
ACTIVITY 4.2
(a) Determine
(i) Sum of squares of treatment + Sum of squares of error
(ii) df[SS(Tr)] + df[SSE]
(b) Compare the value of mean square of treatment ( MSTr ) and mean
square of error ( MS E ).
EXERCISE 4.2
G1 G2 G3 G4
65 75 59 94
87 69 78 89
73 83 67 80
79 81 62 88
81 72 83
69 79 76
90
H0 : μ1 = μ2 = ... = μk
Note that ANOVA does not provide information on how much the population means
differ, nor on which population mean differs. If H0 is true and the three previous
assumptions are satisfied, we can say the three samples were taken from the same
population, as shown in Figure 4.5(a), and the contribution of the treatments to the
total overall variability is zero.
On the other hand, if H 0 is false, the variation within the observed dependent
variable is mostly due to difference in treatment and its contribution is viewed as
SS(Tr ) percentage towards the total deviation. The greater the SS(Tr ) percentage is,
the closer it is towards the truth of H1 that is the mean for each population is not
equal. This event implies that the samples taken were from the following
populations as shown in Figure 4.5(b).
(a) The degrees of freedom for the treatment, ν1 = k – 1, where k is the number of
samples (groups/treatments); and
(b) The degrees of freedom for the error, ν2 = N – k, where N is the total number
of observations in all samples, that is n1 + n2 + ... + nk = N (for unequal
sample sizes) or kn = N (for equal sample sizes).
Determine the critical value, that is $F_{\nu_1,\nu_2,\alpha}$ (obtained from the F distribution table
at http://www.z-table.com/f-distribution-table.html). Reject H0 when $F > F_{\nu_1,\nu_2,\alpha}$, where
$$F = \frac{MS(Tr)}{MSE} = \frac{SS(Tr)/(k-1)}{SSE/(N-k)}$$
EXERCISE 4.3
SELF-CHECK 4.2
The mean model is the model usually used for single-factor testing that is the
model used to compare means ( μ j ) of factor levels. This model is
Error SS E N–k MS E
Total SST N–1
The total sum of squares = Sum of squares treatment + Sum of squares error
Example 4.3:
A private accounting firm carried out a study to investigate whether the efficiency
of its employees is related to their former schools. A few selected accountants
from the company chose four schools at random, and the number of mistakes
made by other accountants over a two-week period was recorded as below:
Carry out an ANOVA test at 0.01 significance level. Is there any significant
difference in evaluating the employees’ efficiency?
Answer:
The factor involved here is the former school; the factor level is the four original
schools selected by the accountants.
$$F = \frac{MS(Tr)}{MSE} = \frac{SS(Tr)/(k-1)}{SSE/(N-k)} = \frac{94.8333/3}{297/20} = 2.129$$
Step 6: Conclusion
Hence, we do not have enough evidence to state that the mean numbers of
mistakes made by the accountants differ. In conclusion, there is no
significant difference in employees’ efficiency based on schools.
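The F statistic of Example 4.3 can be reproduced from the sums of squares. This is a sketch under stated assumptions: the helper `anova_f` is hypothetical, the error degrees of freedom are taken as 20 (N = 24 with k = 4), which is the value that reproduces F = 2.129, and the critical value 4.94 is an assumed table value for F(3, 20) at α = 0.01 that the text does not quote.

```python
def anova_f(ss_tr, ss_e, k, n_total):
    """Return (MS(Tr), MSE, F) for a one-way ANOVA with k treatments."""
    ms_tr = ss_tr / (k - 1)          # mean square of treatment
    ms_e = ss_e / (n_total - k)      # mean square of error
    return ms_tr, ms_e, ms_tr / ms_e


# Sums of squares as used in Example 4.3.
ms_tr, ms_e, f_stat = anova_f(94.8333, 297, k=4, n_total=24)

# ASSUMPTION: tabulated F value for (3, 20) degrees of freedom at alpha = 0.01.
critical_value = 4.94
print(round(f_stat, 3), "reject H0:", f_stat > critical_value)
```

Since the statistic is below the assumed critical value, H0 is not rejected, in line with the conclusion of the example.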
How was it? Can you understand? If not, please re-read. Then, try out the
following exercise to strengthen your understanding. You can also visit the
following websites to find out more about chi-square and F distribution at
http://mathforum.org/library/drmath/view/52808.html.
EXERCISE 4.4
4. Determine:
(a) The critical value for ANOVA test at α = 0.01 if the test
consists of 6 samples with 34 items in each sample.
(b) The critical value for ANOVA test at α = 0.05 if the test
consists of 4 samples with 44 items in each sample.
5. Calculate:
(a) Test statistics, F when MS E =14.6 and MS(Tr ) = 35.7 (use the
information in Question 3(a)).
• There are three assumptions that need to be verified before the ANOVA
technique can be carried out, namely:
– The population distribution approaches normality;
– The samples are chosen at random and are independent; and
– The population variances are equal.
• There are six steps to carry out hypothesis testing procedures, namely:
– S1: Determine the null and alternative hypotheses;
– S2: Choose a significance level;
– S3: Determine the rejection region;
– S4: Calculate the test statistics;
– S5: Test result; and
– S6: Conclusion.
Mean square between samples, MS(Tr) – A measure of the variation among the samples
taken from different populations.
Mean square within samples, MSE – A measure of the variation within the data of all
samples taken from different populations.
SST – The total sum of squares given by the sum of SS(Tr) and SSE.
INTRODUCTION
Previous topics have discussed analysis and hypothesis on quantitative and
continuous data where the statistical techniques discussed required measurement
values such as weight, height, diameter, distance, total money or total score/marks
for a test.
Topic 5 will explore the methods to analyse categorical data. Firstly, what is a
categorical variable? A categorical variable is a variable that classifies or
categorises each individual into exactly one of several cells or classes.
Copyright © Open University Malaysia (OUM)
98 TOPIC 5 CHI-SQUARED TESTS
For example, in a public poll, respondents’ feedback on certain issues are recorded
and the data is in categorical form, that is whether the respondents “agree”,
“disagree” or have “no opinion”. Other examples are as follows:
(a) An experimenter who is carrying out a study on leukaemia patients can record
the number of cancer patients according to the patient’s family category; and
Each of these examples involves a categorical variable, and the data recorded are the
frequencies that fall into each category of the variable. The chi-square test will be
used to carry out categorical data analysis by comparing the observed frequencies with
the expected frequencies under a pre-specified null hypothesis. Let us find out more in
the following subtopics.
ACTIVITY 5.1
Multinomial Experiment
3. The probability that the outcome of a single trial will fall in a particular cell,
say, cell i, is $p_i$, where i = 1, 2, …, k, and remains the same from trial to
trial, with $p_1 + p_2 + \dots + p_k = 1$.
6. In n trials, the expected number that falls into the j-th category under the
null hypothesis is $E_j = np_j$.
Example 5.1:
Suppose that an unbiased die was tossed 120 times and each outcome was
recorded in the following frequency table.
Answer:
Face | 1 | 2 | 3 | 4 | 5 | 6 | Total
Frequency | 20 | 10 | 10 | 20 | 20 | 40 | 120
Step 2: Determine the Significance Level and the Rejection Region (RR)
The test was performed at the 5% level of significance; therefore, we reject
the null hypothesis when the test statistic
$$X = \sum \frac{(O-E)^2}{E} > \chi^2_{5\%}(5) = 11.070$$
with degrees of freedom, ν = number of columns – 1 = 6 – 1 = 5.
Face | 1 | 2 | 3 | 4 | 5 | 6 | Total
The observed frequency, O | 20 | 10 | 10 | 20 | 20 | 40 | 120
The expected frequency, E (when the null hypothesis is true) | 20 | 20 | 20 | 20 | 20 | 20 | 120
O – E | 0 | –10 | –10 | 0 | 0 | 20 | 0
$(O-E)^2/E$ | 0 | 5 | 5 | 0 | 0 | 20 | 30
Hence, we obtain $X = \sum (O-E)^2/E = 30$. Under H0, $\sum (O-E)^2/E$ is
distributed as a $\chi^2$ distribution. To calculate the test value in this case,
the values of the 6 pairs (O, E) are needed. The total frequency of the 6 pairs
must be 120 and the differences O – E must sum to zero. This means that only
6 – 1 of the observed pairs are independent (the degrees of freedom) when
calculating the $\chi^2$ test value.
In other words, there are six cells to fill subject to 1 restriction, namely that
the total frequency for the 6 cells must be 120. For this reason, 6
choices – 1 restriction = 5 degrees of freedom.
Step 5: Conclusion
Hence, we have strong evidence to state that the die is biased, that is,
its outcomes are not uniformly distributed.
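The goodness-of-fit calculation for the die in Example 5.1 can be sketched as follows. The helper name `chi_square_statistic` is an illustrative assumption; the observed frequencies and the critical value 11.070 are taken from the example.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [20, 10, 10, 20, 20, 40]   # faces 1 to 6, 120 tosses in total
n = sum(observed)
expected = [n / 6] * 6                # fair die: E_j = n * p_j with p_j = 1/6

x_stat = chi_square_statistic(observed, expected)
critical_value = 11.070               # chi-square value at 5% with 5 df (from the text)
print(x_stat, "reject H0:", x_stat > critical_value)
```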
EXERCISE 5.1
Monday 12
Tuesday 9
Wednesday 11
Thursday 10
Friday 9
Saturday 9
(a) The chi-square test statistic approaches the chi-square distribution with k – 1
degrees of freedom in the goodness-of-fit test and (r – 1)(k – 1) degrees of freedom
in the test of independence and the test of homogeneity.
$$X_{cc} = \sum \frac{(|O-E| - 0.5)^2}{E}$$
This continuity correction is required as a result of approximating a discrete
distribution by a continuous probability distribution.
Example 5.2:
A lecturer would like to investigate the frequency of students taking
elective classes at a local university. Consider whether the following
number of students is distributed as a binomial distribution. Use 5%
significance level.
Number of Elective Courses | 0 | 1 | 2 | 3 | 4 | 5 or 6 | Total
Number of Students | 12 | 16 | 8 | 3 | 1 | 0 | 40
Answer:
H1 : Otherwise
X | 0 | 1 | 2 | 3 | 4 | 5 or 6
Expected Values | 11.51 | 15.93 | 9.19 | 2.83 | 0.49 | 0.05

X | 0 | 1 | 2 or more | Total
Observed | 12 | 16 | 12 | 40
Expected | 11.51 | 15.93 | 12.56 | 40
$$X = \frac{(12-11.51)^2}{11.51} + \frac{(16-15.93)^2}{15.93} + \frac{(12-12.56)^2}{12.56} = 0.05$$
Step 5: Conclusion
Hence, we do not have strong evidence to reject the null
hypothesis. In conclusion, the data are consistent with the number of
elective classes taken by students following a binomial distribution.
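The expected values in Example 5.2 can be reproduced with a short calculation. The sketch below assumes a binomial model with 6 trials whose success probability is estimated from the sample mean (p = mean / 6); this estimation step is an assumption, chosen because it reproduces the expected values shown in the answer. The classes are then pooled into 0, 1 and "2 or more" as in the answer.

```python
from math import comb

n_courses = 6                                  # maximum number of elective courses
observed = {0: 12, 1: 16, 2: 8, 3: 3, 4: 1}    # no student took 5 or 6 courses
n_students = 40

# ASSUMPTION: p is estimated from the sample mean, p_hat = mean / 6.
mean = sum(x * f for x, f in observed.items()) / n_students
p_hat = mean / n_courses


def binomial_expected(x):
    """Expected number of students taking x courses under B(6, p_hat)."""
    prob = comb(n_courses, x) * p_hat**x * (1 - p_hat) ** (n_courses - x)
    return n_students * prob


expected = [binomial_expected(x) for x in range(n_courses + 1)]

# Pool the classes "2 or more" so that no expected frequency is too small.
obs_pooled = [12, 16, 12]
exp_pooled = [expected[0], expected[1], sum(expected[2:])]
x_stat = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))

print([round(e, 2) for e in expected[:5]], round(x_stat, 2))
```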
Note:
The sample mean will not be used. It is sufficient to use ΣΟ to get the
expected frequencies. There is only one restriction, which is total, ΣΟ.
(i) p unknown; or
(ii) p known, and here, we will only consider Poisson distribution for one
case.
Example 5.3:
Test whether the Poisson distribution can be fitted to the frequency
distribution below:
X 0 1 2 3 4 5 6 or more
Frequency 19 26 27 13 11 2 0
Answer:
⇔ $H_0: X \sim P(\lambda)$
where
$$\hat{\lambda} = \bar{x} = \frac{173}{98} = 1.765$$
H1 : Otherwise
X 0 1 2 3 4 or more Total
Observed 19 26 27 13 13 98
Expected 16.8 29.6 26.1 15.4 10.1 98
Hence,
Step 5: Conclusion
This means there is not enough evidence to reject H0, and we may state
that the Poisson distribution fits this data adequately.
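The Poisson fit in Example 5.3 can be sketched in the same way. The estimate of λ and the pooled class "4 or more" follow the answer; the test statistic itself is not shown in the text, so the value computed here is only a check under stated assumptions. The critical value 7.815 assumes a 5% significance level and 3 degrees of freedom (5 classes, minus 1 for the total, minus 1 for estimating λ).

```python
from math import exp, factorial

frequency = [19, 26, 27, 13, 11, 2, 0]     # X = 0, 1, ..., 5 and "6 or more"
n = sum(frequency)                          # total number of observations
lam = sum(x * f for x, f in enumerate(frequency)) / n   # sample mean, 173 / 98


def poisson_expected(x):
    """Expected frequency of the value x under a Poisson(lam) model."""
    return n * exp(-lam) * lam**x / factorial(x)


# Classes 0, 1, 2, 3 and a pooled class "4 or more", as in the answer.
expected = [poisson_expected(x) for x in range(4)]
expected.append(n - sum(expected))
observed = frequency[:4] + [sum(frequency[4:])]

x_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# ASSUMPTION: tabulated chi-square value at 5% with 3 degrees of freedom.
critical_value = 7.815
print([round(e, 1) for e in expected], round(x_stat, 2), x_stat > critical_value)
```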
(i) Calculate the x and s values for the given frequency distribution;
(ii) Using x and s values as approximation to μ and σ along with the total
frequency given, construct a theoretical normal distribution; and
(iii) Compare the observed frequency with expected frequency using the chi-
square test with 3 restrictions.
Example 5.4
The following table shows information on height (measured to the nearest cm)
for 694 nine-year old girls.
Answer:
⇔ $H_0: X \sim N(\mu, \sigma^2)$ where $\bar{x} = \hat{\mu} = 134.356$ and $s = \hat{\sigma} = 6.195$
H1 : Otherwise
Hence,
Step 5: Conclusion
We have evidence to state that the normality assumption on the
observed data is not satisfied.
SELF-CHECK 5.1
EXERCISE 5.2
X 0 1 2 3 4 5
Frequency 1 6 14 33 31 15
Number of
0 1 2 3 4 5 >5
Storm
Number of
Station (f) 10 11
that Reports
2 4 74 28 10 2 0
However, if n > 40, each expected frequency in the r × c table has a value of more than
1 (r is the number of rows for the levels of the first factor, c is the number of columns
for the levels of the second factor).
Tax Reform \ Political Affiliation | Party A | Party B | Party C | Total
For | 308 | 190 | 102 | 600
Against | 92 | 160 | 148 | 400
Total | 400 | 350 | 250 | 1,000
Table 5.1 is also called a 2 × 3 table since it consists of two rows and three
columns. The two categorical variables involved are political affiliation (at three
levels, which are party A, party B and party C) and views on tax reform (at
two levels, “For” or “Against”). The values inside the table (308, 190 and the rest)
are the counts at the intersection of each type of political affiliation and each view
on tax reform. These values are observed frequencies as they represent the results
obtained in the study, namely the number of individuals in each of the six categories
or cells. For example, the number of people who are members of party A and agree on
tax reform is 308 (row 1, column 1) while the number of people who are members of
party C and disagree on the tax reform is 148 (row 2, column 3).
Table 5.2: Cross-classification Table between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Overall Total)
Tax Reform \ Political Affiliation | Party A | Party B | Party C | Total
For | 30.8 | 19 | 10.2 | 60
Against | 9.2 | 16 | 14.8 | 40
Total | 40 | 35 | 25 | 100

Table 5.3: Cross-classification Table between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Total Rows)
Tax Reform \ Political Affiliation | Party A | Party B | Party C | Total
For | 51.33 | 31.67 | 17 | 100
Against | 23 | 40 | 37 | 100
Total | 40 | 35 | 25 | 100

Table 5.4: Cross-classification Table between Political Affiliation and Views on Tax
Reform (Percentage Values in the Table are Based on Total Columns)
Tax Reform \ Political Affiliation | Party A | Party B | Party C | Total
For | 77 | 54.29 | 40.8 | 60
Against | 23 | 45.71 | 59.2 | 40
Total | 100 | 100 | 100 | 100
What conclusions can be drawn from Tables 5.2, 5.3 and 5.4? There are various
conclusions which can be drawn from these tables. Some of these are:
(b) 51.33% of those who support tax reform are from party A (Table 5.3).
(c) 77% of individuals from party A support tax reform (Table 5.4).
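The three kinds of percentage (based on the overall total, the row totals and the column totals) can be computed from the observed frequencies in Table 5.1. The sketch below is only illustrative; the dictionary layout and the helper `percent` are assumptions made for this example.

```python
counts = {
    "For":     {"Party A": 308, "Party B": 190, "Party C": 102},
    "Against": {"Party A": 92,  "Party B": 160, "Party C": 148},
}
parties = ["Party A", "Party B", "Party C"]

row_totals = {view: sum(row.values()) for view, row in counts.items()}
col_totals = {p: sum(counts[view][p] for view in counts) for p in parties}
grand_total = sum(row_totals.values())


def percent(part, whole):
    """Percentage of part in whole, rounded to two decimal places."""
    return round(100 * part / whole, 2)


# Percentages based on the overall total, the row totals and the column totals.
overall_pct = {v: {p: percent(counts[v][p], grand_total) for p in parties} for v in counts}
row_pct = {v: {p: percent(counts[v][p], row_totals[v]) for p in parties} for v in counts}
col_pct = {v: {p: percent(counts[v][p], col_totals[p]) for p in parties} for v in counts}

print(overall_pct["For"]["Party A"], row_pct["For"]["Party A"], col_pct["For"]["Party A"])
```

Reading the same cell three ways makes the distinction concrete: the row percentage describes party membership among those with a given view, while the column percentage describes views within a given party.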
$$E_{ij} = N \left( \frac{C_j}{N} \right) \left( \frac{R_i}{N} \right) \qquad \text{Formula 5.1}$$
that is
Political Affiliation
Tax Reform
Party A Agreement Party A Agreement
For 308 190 102 600 (R1)
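Let us say that we are interested in determining the expected frequency for
individuals from party A who agree on tax reform, that is (A and “For”). Assuming
the variables are independent, Formula 5.1 gives
$$E_{11} = 1000 \left( \frac{400}{1000} \right) \left( \frac{600}{1000} \right) = 240.$$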
EXERCISE 5.3
< 30 68 42
30 or more 31 59
Check the number of rows (r) and column (c) in the related table.
Calculate the degrees of freedom for the test, v = (r – 1) (c – 1).
$$\chi^2 = \sum \frac{(O-E)^2}{E}.$$
Step 4: Test Result
The results of the test depend on information in Steps 2 and 3.
Step 5: Conclusion
Based on Step 4, conclude whether the two variables are independent.
Example 5.5:
We will use the example given in Subtopic 5.2. Based on the data and result
obtained, construct and test the appropriate hypothesis (use α = 0.05).
Answer:
Hence,
v = (2 – 1) × (3 – 1) = 2
(b) Compute the chi-square statistic value, that is $\chi^2 = \sum (O-E)^2/E$.
$$X = \frac{(308-240)^2}{240} + \frac{(190-210)^2}{210} + \dots + \frac{(148-100)^2}{100} = 91.329$$
Step 5: Conclusion
Since the H 0 is rejected, we conclude that the two variables are
dependent where the individuals’ opinions on tax reform depend on
their political affiliation.
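The calculation in Example 5.5 can be reproduced from the observed frequencies alone. This minimal sketch computes the expected frequencies with Formula 5.1, then the chi-square statistic and the degrees of freedom; the variable names are illustrative assumptions.

```python
observed = [
    [308, 190, 102],   # For:     Party A, Party B, Party C
    [92, 160, 148],    # Against: Party A, Party B, Party C
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Formula 5.1: E_ij = N * (C_j / N) * (R_i / N), which simplifies to R_i * C_j / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected[0], round(chi_sq, 3), dof)
```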
EXERCISE 5.4
In this test, we test the hypothesis that the population proportions within each
category are the same (homogeneous). This applies when either the row or column
totals are predetermined. The data are in the form of a two-way contingency table, in
which one classification is by the variable and the other is by population. It is
important to stress that the assumptions and the statements of the null and
alternative hypotheses differ from those of the test of independence, but the analysis
technique is the same. Let us refer to Example 5.6 for a clear explanation.
Example 5.6:
A two-year study has been carried out on 120 heart-problem patients who are
given two types of drugs, A and B. After a certain time-off period, the conditions
of the patients are classified as “no change”, “show improvement” and
“recovering”. The following table illustrates the distribution of patients.
Determine, at α = 5%, whether the distribution of the patients’ conditions is the
same for both types of drug.
Drug Type \ Patients’ Condition | No Change | Show Improvement | Recovering | Total
A | 15 | 22 | 33 | 70
B | 20 | 18 | 12 | 50
Total | 35 | 40 | 45 | 120
Answer:
H1 : Otherwise.
Observed frequencies, with expected frequencies in brackets:
Drug Type \ Patients’ Condition | No Change | Show Improvement | Recovering | Total
A | 15 (20.42) | 22 (23.33) | 33 (26.25) | 70
B | 20 (14.58) | 18 (16.67) | 12 (18.75) | 50
Total | 35 | 40 | 45 | 120
$$X = \frac{(15-20.42)^2}{20.42} + \frac{(22-23.33)^2}{23.33} + \dots + \frac{(12-18.75)^2}{18.75} = 7.801$$
Step 5: Conclusion
We conclude that the condition of the patients depends on the type of
drug that they received.
EXERCISE 5.5
A random sample of 100 female students and 100 male students at a local
university were taken for an interview on their favourite sports. Of the
male students, 33% prefer football, 38% favour basketball, 24% love
baseball and the rest like tennis. Meanwhile, for the female students, their
preferences are quite balanced with 38% into football, 21% liking
basketball, 15% preferring baseball and the rest favouring tennis.
Determine the classification variables and population involved. Explain
an appropriate test with any calculation.
ACTIVITY 5.3
$$X_{cc} = \sum \frac{(|O-E| - 0.5)^2}{E}$$
This corrected statistic is used not only in a 2 × 2 table but also for any data
employing the X test statistic with only one degree of freedom. There are times
when the corrected statistic takes a value which is very different from the
value of the ordinary X test statistic, and this results in the acceptance of H0. In
this case, a bigger sample size is needed and the test should be repeated, or other
appropriate methods should be used.
Note
It is important to know that the continuity correction will always cause a
reduction in the X value, a fact that can be verified from the form of the
statistic. If the test value is in favour of H0 acceptance, we do not have to
calculate the $X_{cc}$ value, as the correction would not change the test result.
Example 5.7:
Students in two Form 1 classes, A and B, sat for an examination and the following
results were recorded.
Results \ Form 1 | Class A | Class B
Passed | 72 | 64
Failed | 17 | 23
Carry out a hypothesis test to determine whether there is any difference between
examination results for the two classes using Yates’s continuity correction. Use a
5% significance level.
Answer:
H1 : Otherwise
(b) Hence,
Step 5: Conclusion
In conclusion, there is no significant difference in students’ results
between classes A and B.
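The corrected statistic for Example 5.7 can be checked with a short sketch. The function name `yates_chi_square` is hypothetical, the expected frequencies are computed from the row and column totals under the assumption of no difference between classes, and the critical value 3.841 is an assumed table value for the chi-square distribution at 5% with one degree of freedom.

```python
def yates_chi_square(table):
    """Continuity-corrected chi-square statistic for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (abs(observed - expected) - 0.5) ** 2 / expected
    return stat


results = [[72, 64],   # Passed: Class A, Class B
           [17, 23]]   # Failed: Class A, Class B
x_cc = yates_chi_square(results)

# ASSUMPTION: tabulated chi-square value at 5% with 1 degree of freedom.
critical_value = 3.841
print(round(x_cc, 3), "reject H0:", x_cc > critical_value)
```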
SELF-CHECK 5.3
EXERCISE 5.6
Factor II \ Factor I | Type 1 | Type 2 | Total
A | a | b | nA
B | c | d | nB
Total | n1 | n2 | n
Number of Goals | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Number of Teams | 3 | 5 | 14 | 9 | 3 | 4 | 1 | 1 | 0
Number of Defective Items, X | 0 | 1 | 2 | 3 | 4 | 5
Frequency | 22 | 37 | 20 | 13 | 6 | 2
Grade Average \ Year of Study | Year 1 | Year 2 | Year 3
<2.0 | 14 | 16 | 15
2.0–3.0 | 10 | 11 | 11
>3.0 | 26 | 23 | 24

Population \ Category | 1 | 2 | 3 | 4 | Total
1 | 16 | 38 | 5 | 41 | 100
2 | 24 | 41 | 12 | 23 | 100
3 | 19 | 36 | 15 | 30 | 100

Gender \ Opinion | Agree | Does not Agree
Male | 32 | 11
Female | 68 | 89
$$\chi^2 = \sum \frac{(O-E)^2}{E}$$
$$E_{ij} = N \left( \frac{C_j}{N} \right) \left( \frac{R_i}{N} \right)$$
$$X_{cc} = \sum \frac{(|O-E| - 0.5)^2}{E}$$
is used when the 2 × 2 contingency table has only one degree of freedom.
Multinomial experiment – An experiment with n trials for which: (1) the trials
are identical, (2) there are more than two possible
outcomes per trial, (3) the trials are independent, and
(4) the probabilities of the various outcomes remain
constant for each trial.
INTRODUCTION
Firstly, what is correlation?
Both variables are usually denoted as X and Y and their distributions approximate
to normal distribution. There are three types of relationships between X and Y,
which are explained in Table 6.1.
Type | Meaning
Positive linear correlation | As one variable increases, the other variable tends to increase linearly.
Negative linear correlation | One variable tends to decrease linearly as the other variable increases.
No correlation | There is no linear relationship between the two variables.
These three relationships between two variables can be categorised into four
conditions:
(a) Perfect;
(b) Strong;
(c) Weak; and
(d) No linear relationship.
The linear relationship for these four conditions can be clearly visualised by using
a graphical display, such as a two-way scatter plot.
However, judgment based on the graph is very subjective and at times, not accurate.
As such, to accurately determine the condition of this relationship, a quantitative
measurement known as correlation coefficient is needed. It is usually denoted by
ρ for population, which is usually unknown. This population parameter is
estimated by the sample correlation coefficient, r. The value of r always lies
between –1 and 1:
–1.00 ≤ r ≤ +1.00
Table 6.2 explains the r values for each of the four conditions with regard to the
three types of relationships between two variables.
2 –1.00 < r < –0.50 There exists a strong negative linear relationship.
+0.50 < r < +1.00 There exists a strong positive linear relationship.
There are two types of sample correlation coefficients, which are Pearson
correlation coefficient and Spearman correlation coefficient. Their application
depends on the types of data. The Pearson correlation coefficient is used for
quantitative data, in discrete and continuous forms.
Meanwhile, the Spearman correlation coefficient is used for rank data. So, the
name Spearman rank correlation coefficient is also sometimes used in other
discussions.
SELF-CHECK 6.1
ACTIVITY 6.1
State the type of correlation (positive, negative or none) that we can
expect for each of the following pairs:
(a) Students’ grade and height;
(b) An individual’s weight and cholesterol level found in blood;
(c) The amount of ice-cream sold and ambient temperature; and
(d) Price of rubber and amount of rainfall.
The horizontal axis represents the X variable and the vertical axis represents
the Y variable. Let us look at some figures that show you the plots that could
explain the relationship between two variables. First, look at Figure 6.1.
Figure 6.1(a) shows the X and Y variables having a perfectly positive linear
relationship as every value of the Y variable increases as the value of X increases
and all points fall on a straight line.
Now, let us move on to Figure 6.2 which shows you two types of strong linear
relationship.
Figures 6.2(a) and 6.2(b) display a strong positive and negative linear relationship
respectively. These are said to be strong conditions as most of the points fall near
the straight line. Next is Figure 6.3. It shows you two types of weak linear
relationship.
Figure 6.3: (a) Weak positive linear correlation; (b) Weak negative linear correlation
As we can see, Figures 6.3(a) and 6.3(b) display a weak positive and negative linear
relationship respectively. This is because most of the points fall far from the straight
line.
Lastly, let us look at Figure 6.4. It shows no linear correlation between the X and Y
variables.
You can try Exercise 6.1 on the application of the two-way scatter plot.
SELF-CHECK 6.2
EXERCISE 6.1
For each pair of two-way scatter plot, identify which one has a higher
value of correlation coefficient, r, and state its direction.
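where

$$s_x = \text{standard deviation of } X = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

$$s_y = \text{standard deviation of } Y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$$

$$r_p = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$   Formula 6.2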
xi      yi      xi yi      xi²      yi²
x1      y1      x1 y1      x1²      y1²
x2      y2      x2 y2      x2²      y2²
x3      y3      x3 y3      x3²      y3²
⋅       ⋅       ⋅          ⋅        ⋅
⋅       ⋅       ⋅          ⋅        ⋅
⋅       ⋅       ⋅          ⋅        ⋅
Σxi     Σyi     Σxi yi     Σxi²     Σyi²
Example 6.1:
A teacher would like to prove to her students that playing computer games
affected academic performance. She claimed that the more time they spent
playing computer games, the lower their examination marks would be. A random
sample of 10 students was taken and their data was recorded as below:
Answer:
To prove a negative linear relationship between time spent (in hours per week)
and students’ examination marks, we should calculate the Pearson correlation
coefficient, rp . In this problem, we define time spent on computer games as the
X variable and students’ examination marks as the Y variable. Next, we can
construct this table:
xi      yi      xi yi      xi²      yi²
4       26      104        16       676
10      17      170        100      289
14      7       98         196      49
12      12      144        144      144
4       30      120        16       900
5       40      200        25       1,600
8       20      160        64       400
11      15      165        121      225
13      10      130        169      100
15      5       75         225      25
Σ: 96   182     1,366      1,076    4,408
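$$r_p = \frac{n \sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right]\left[n \sum y_i^2 - \left(\sum y_i\right)^2\right]}}$$

$$= \frac{10(1{,}366) - 96(182)}{\sqrt{\left[10(1{,}076) - (96)^2\right]\left[10(4{,}408) - (182)^2\right]}}$$

$$= -0.927$$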
The Pearson correlation coefficient value –0.927 shows a strong negative linear
relationship between time spent on playing computer games and students’
performance. Hence, we can conclude that if a student spends a large amount of
time playing computer games, this will affect his or her academic performance.
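The arithmetic in Example 6.1 can be checked with a short script. The following is a minimal sketch in Python, assuming the x and y values are those listed in the working table above; the function name `pearson_r` is only illustrative and is not part of the module.

```python
import math

# Data from the working table in Example 6.1:
# x = hours per week spent on computer games, y = examination marks.
x = [4, 10, 14, 12, 4, 5, 8, 11, 13, 15]
y = [26, 17, 7, 12, 30, 40, 20, 15, 10, 5]


def pearson_r(xs, ys):
    """Pearson correlation coefficient using the computational form (Formula 6.2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_p = pearson_r(x, y)
print(round(r_p, 3))  # -0.927
```

The intermediate sums produced inside the function (96, 182, 1,366, 1,076 and 4,408) are the column totals of the table.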
EXERCISE 6.2
Farm Area A1 A2 A3 A4 A5 A6 A7
Compute the correlation between the fertiliser use and crop production.
Give the conclusion on the value of the correlation coefficient obtained.
H0 : ρp = 0
H1 : 1. ρp > 0
     2. ρp < 0
     3. ρp ≠ 0

Test statistic:

$$T = r_p \sqrt{\frac{n - 2}{1 - r_p^2}}$$

Reject H0 when:
1. T > tα,v
2. T < –tα,v
3. |T| > tα/2,v
where v = n – 2 is the degrees of freedom.
Example 6.2:
Refer back to Example 6.1. Perform the Pearson correlation coefficient ρ p
significance test at a 0.05 significance level.
Answer:
We will use one-tailed hypothesis testing since the sample correlation
coefficient rp is negative. Hence,
H0 : ρp = 0
H1 : ρp < 0
Test statistic:

$$T = r_p \sqrt{\frac{n - 2}{1 - r_p^2}} = -0.927 \sqrt{\frac{10 - 2}{1 - (-0.927)^2}} = -6.99$$

Reject H0 when T < –t0.05,8.
Since T = –6.99 < –t0.05,8 = –1.86, we reject H0 and we have strong evidence
to conclude that ρp < 0. At the 5% significance level, there is a significant
negative linear relationship: students who spend more time playing computer
games tend to have poorer academic performance, possibly because less time is
spent on revision.
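The test statistic in Example 6.2 can be reproduced as follows. This is a minimal sketch, assuming the critical value 1.86 is read from a t table as in the example; the function name is illustrative.

```python
import math


def correlation_t_statistic(r, n):
    """T = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


r_p, n = -0.927, 10
t_stat = correlation_t_statistic(r_p, n)

# Critical value t_{0.05, 8} as quoted in Example 6.2 (taken from a t table).
t_critical = 1.86

# Left-tailed test: reject H0 when T < -t_critical.
reject_h0 = t_stat < -t_critical
print(round(t_stat, 2), reject_h0)  # -6.99 True
```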
EXERCISE 6.3
where
$$r_s = 1 - \frac{6 \sum D^2}{n\left(n^2 - 1\right)}$$   Formula 6.4
with D = U – V that is the difference between ranks U and V. The calculation process
using Formula 6.4 can be further simplified if the values are placed in Table 6.2.
xi      ui      yi      vi      Di      Di²
x1      u1      y1      v1      D1      D1²
x2      u2      y2      v2      D2      D2²
x3      u3      y3      v3      D3      D3²
⋅       ⋅       ⋅       ⋅       ⋅       ⋅
⋅       ⋅       ⋅       ⋅       ⋅       ⋅
⋅       ⋅       ⋅       ⋅       ⋅       ⋅
xn      un      yn      vn      Dn      Dn²
                                        ΣD²
Let us look at Example 6.3.
Example 6.3:
A teacher aims to investigate if there exists any difference between male and female
students on the level of difficulty faced in Form 5 subjects taken by social science
students. The teacher asks 10 male and 10 female students to provide their score on
the level of difficulty in 10 subjects, which are Additional Mathematics (AM),
History (HIS), Modern Mathematics (MM), Geography (GEO), English Language
(EL), Accounting Principle (AP), Islamic Studies (IS), Science (SC), Finance (FIN)
and Malay Language (ML). Each student is requested to give a score of 1–5 for each
subject whereby Score 1 means the easiest and Score 5 means the most difficult. Is
there any relationship between male and female students on the level of difficulty
for the subjects that they took?
Answer:
Firstly, we need to determine the X and Y variables. Define total subjects’ scores
by male students as X variable and Y variable as total subjects’ scores by female
students. Prior to obtaining the Spearman rank correlation coefficient, rs , we
need to convert the data into rank form. In deciding the ranks, the scores can be
arranged in a descending order, that is, the highest score is given rank 1, the
second highest is rank 2 and the lowest score is rank 10. The subjects’ rank for
male (U) and female (V) students along with their differences (D) are displayed
in the following table:
xi      ui      yi      vi      Di      Di²
45      1       34      5       –4      16
33      5       15      10      –5      25
20      8       30      6       2       4
24      7       38      4       3       9
43      2       26      7       –5      25
39      3       45      2       1       1
13      10      17      9       1       1
36      4       20      8       –4      16
28      6       48      1       5       25
15      9       40      3       6       36
                                ΣD² = 158

$$r_s = 1 - \frac{6 \sum D^2}{n\left(n^2 - 1\right)} = 1 - \frac{6(158)}{10\left(10^2 - 1\right)} = 0.0424$$
The Spearman correlation coefficient value is 0.0424. This means there is almost
no linear relationship between the opinions of male and female students on the
level of difficulty of the subjects that they took.
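The ranking and the Spearman calculation in Example 6.3 can be sketched in Python as below. The helper `ranks_descending` is an illustrative name and assumes there are no tied scores, which holds for this data set; tied values would need averaged ranks, which this sketch does not handle.

```python
# Total difficulty scores from the table in Example 6.3.
x = [45, 33, 20, 24, 43, 39, 13, 36, 28, 15]  # male students
y = [34, 15, 30, 38, 26, 45, 17, 20, 48, 40]  # female students


def ranks_descending(values):
    """Rank 1 goes to the highest score. Assumes all values are distinct."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_rs(xs, ys):
    """Spearman rank correlation via Formula 6.4; also returns the sum of D^2."""
    n = len(xs)
    u = ranks_descending(xs)
    v = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


r_s, sum_d2 = spearman_rs(x, y)
print(sum_d2, round(r_s, 4))  # 158 0.0424
```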
EXERCISE 6.4
Ten athletes were ranked at the beginning of a sports match. After the
match, their position in the match was recorded. The data is in the table
below:
Athlete                 1   2   3   4   5   6   7   8   9   10
Ranking                 1   2   3   4   5   6   7   8   9   10
Position in the Match   3   5   2   1   10  4   9   7   8   6
Obtain the correlation between ranking and their position in the match.
Comment on the value of the correlation coefficient.
SELF-CHECK 6.3
H0 : ρs = 0
H1 : 1. ρs > 0
     2. ρs < 0
     3. ρs ≠ 0

Test statistic:

$$T = r_s \sqrt{\frac{n - 2}{1 - r_s^2}}$$

Reject H0 when:
1. T > tα,v
2. T < –tα,v
3. |T| > tα/2,v
Example 6.4:
Refer to Example 6.3. Test the significance of the Spearman correlation coefficient
rs at a 0.05 level of significance.
Answer:
We will use a one-sided hypothesis test since the sample correlation coefficient
rs is positive. As such,
H0 : ρs = 0
H1 : ρs > 0
Test statistic:

$$T = r_s \sqrt{\frac{n - 2}{1 - r_s^2}} = 0.0424 \sqrt{\frac{10 - 2}{1 - (0.0424)^2}} = 0.12$$

Since T = 0.12 < t0.05,8 = 1.86, we do not reject the null hypothesis (ρs = 0).
Hence, there is not enough evidence to conclude that the population parameter ρs
is greater than zero. In other words, the data show no significant relationship
between the opinions of male and female students.
Now, do you know when the Pearson and Spearman coefficients are used? If you
are still not sure, please reread this topic carefully.
EXERCISE 6.5
• Three types of relationship exist between X and Y. These are positive linear
correlation, negative linear correlation and no correlation.
• These three relationships between two variables can be categorised into perfect,
strong, weak and no linear relationship.
• A two-way scatter plot can be used to roughly show the relationship between
two variables, whether there is a positive linear, negative linear or no
relationship.
• The Pearson (rp ) correlation coefficient is used for quantitative data, for both
discrete and continuous forms. It is generated from the Pearson product moment
for n pairs of variables (X, Y).
Freund, J. E. (2003). Mathematical statistics (7th ed.). Upper Saddle River, NJ:
Prentice-Hall, Inc.
Mann, P. S. (2005). Introductory statistics using technology (5th ed.). Upper Saddle
River, NJ: John Wiley & Sons, Inc.
William, M., & Terry, S. (2003). A second course in statistics: Regression analysis.
(6th ed.). Upper Saddle River, NJ: Prentice-Hall, Inc.
INTRODUCTION
In Topic 6, you were introduced to methods to visually check the relationship
between two variables by using a two-way scatter plot and to measure the strength
of the relationship by using correlation coefficient. If a relationship exists, we
would like to know the meaning of the relationship. Once we have determined the
relationship in terms of formula, we will be able to predict the value of a variable
given the value of the other variable.
Let us learn about simple linear regression in the following subtopics. Happy
reading!
From the regression model, we can predict the y value for a given value of x.
SELF-CHECK 7.1
ACTIVITY 7.1
Discuss in the myINSPIRE forum the independent variables for the
following dependent variables:
(a) Profit made by a firm;
(b) Students’ final examination grade; and
(c) Selling price of a house.
y = β0 + β1x + ∈
Hence, the manager can estimate the selling price using the model below:
y = 25,000 + 90x Formula 7.1
where y = Selling price and x = House size in square feet. If the house is 2,000
square feet, the price would be RM205,000, that is:
y = 25,000 + 90(2,000) = 205,000
However, this is only an estimated price and the actual price (based on observation)
would be between RM180,000 and RM250,000. For this reason, to reflect the actual
situation, another simple linear regression model replaces the previous model, that
is:
y = 25,000 + 90x + ∈ Formula 7.2
where ∈ is a random variable for errors representing all other variables which are
not considered in the model (see Figure 7.1).
In other words, the selling price for the same size will also differ due to other factors
such as location, number of bedrooms, toilets and other unknown factors.
where ŷ is the predicted/fitted value for y, β̂0 is the estimation for population
parameter β 0 and β̂1 is the estimation for population parameter β1. The estimation
model (Formula 7.3) is a linear formula with the β̂1 parameter as the regression
slope and β̂0 parameter as the y-intercept, which is the y value when x is zero (see
Figure 7.1).
However, in most cases, when x = 0, the y value does not carry any significant
meaning and at times, it is not possible for x = 0 to happen. The slope of a straight
line is a fixed value that explains the changes (increasing or decreasing) in the y value
given a one unit change in the x value.
Errors (see Figure 7.1) are obtained from the difference between y observed values
with ŷ fitted values. This is denoted by ∈i for i = 1, 2…n and the formula is:
E(y) = β0 + β1x   Formula 7.6
σ(y) = σ∈   Formula 7.7
Observe from Formula 7.6, mean E(y) depends on x, but the standard deviation does
not depend on anything. This is because σ∈ is fixed for all x values. The visual
display of simple linear regression is shown in Figure 7.2.
EXERCISE 7.1
Given the regression equation ŷ = –12.84 + 36.18x, state the values of β̂0
and β̂1 and explain both values. Next, calculate the residuals using the
following data:
x 8.3 8.3 12.1 12.1 17.0 17.0 17.0 24.3 24.3 24.3 33.6
y 227 312 362 521 640 539 728 945 738 759 1263
When the straight line fails to capture all the data (points (x,y) on the graph), what
must we do to obtain the best straight line? This best straight line refers to the fitted
straight line that we build in the two-way scatter plot that best represents the
relationship between the two variables. This fitted line is the straight line that
lies closest to the points (x,y), that is, the line for which the errors between the
estimated points on the line and the actual observed points are minimised.
However, the total error Σ∈i does not represent the distance between the estimated
and observed points. Let us look at an example to see why Σ(yi − ŷi) is not
suitable to represent the distance between the estimated and observed points in
Figure 7.3.
With reference to Figures 7.3(a) and 7.3(b), we can see that the positions of the two
data sets, data (a) and (b), are different. The total errors for data (a) and data (b) are
calculated as in Table 7.1.
Table 7.1: Total Errors for Data (a) and Data (b)
Data (a): ∈i = yi − ŷi      Data (b): ∈i = yi − ŷi
8 – 6 = 2                   7 – 6 = 1
1 – 5 = –4                  6 – 5 = 1
6 – 4 = 2                   2 – 4 = –2
Σ∈i = 0                     Σ∈i = 0
Total errors are zero for both data (a) and (b), and this always holds. This figure
suggests that the distance of data points (a) and (b) from the regression line is the
same.
However, from both graphs in Figure 7.3, we can see this is not true. The positions
of data points (a) and (b) relative to the regression line differ: data points (b)
are closer to the regression line than data points (a). Hence, Σ∈i is not suitable
to be used as a selection criterion.
How can we solve this problem? We can solve this by squaring each error before
summing them up. Table 7.2 gives you the values of Σ(yi − ŷi)² for data (a)
and (b).
Table 7.2: The Values of Σ(yi − ŷi)² for Data (a) and Data (b)
Data (a)                    Data (b)
(8 − 6)² = 4                (7 − 6)² = 1
(1 − 5)² = 16               (6 − 5)² = 1
(6 − 4)² = 4                (2 − 4)² = 4
Σ(∈i)² = 24                 Σ(∈i)² = 6
Based on the Σ(yi − ŷi)² values for both data (a) and (b), the sum of squared
errors for data (b) is smaller than that for data (a). This shows that the points for
data (b) are nearer to the regression line, so this line is the better fitted line. This
method of obtaining the best fitted line by minimising the sum of squared errors is
known as the least squares method.
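The contrast between Tables 7.1 and 7.2 can be reproduced in a few lines. This is a minimal sketch in Python, assuming the (observed, fitted) pairs are the ones read from Table 7.1; the function names are illustrative.

```python
# (observed y, fitted y) pairs taken from Table 7.1.
data_a = [(8, 6), (1, 5), (6, 4)]
data_b = [(7, 6), (6, 5), (2, 4)]


def total_error(pairs):
    """Sum of raw errors; positive and negative errors cancel out."""
    return sum(y - y_hat for y, y_hat in pairs)


def sum_squared_errors(pairs):
    """Sum of squared errors; every deviation contributes a non-negative amount."""
    return sum((y - y_hat) ** 2 for y, y_hat in pairs)


for name, pairs in (("a", data_a), ("b", data_b)):
    print(name, total_error(pairs), sum_squared_errors(pairs))
# a 0 24
# b 0 6
```

Both data sets have a total error of zero, but only the squared version separates them, which is why the least squares criterion is used.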
To fit the regression line, we need to compute the estimates for regression
coefficients β 0 and β1. Using the least squares method, the formula for regression
coefficient β1 is as in Formula 7.8.
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$   Formula 7.8
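where x̄ and ȳ are the sample means of x and y, and n is the number of observations.
After getting the estimate for β1, we can derive the value for β0. The formula to
get β̂0 is shown in Formula 7.9.

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$   Formula 7.9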
Example 7.1:
For the following data, find the value of regression coefficients β̂0 and βˆ1 , and
write down the fitted regression model:
x 3 7 6 6 10 12 12 12 13 13 14 15
y 33 38 24 61 52 45 29 65 82 63 50 79
Answer:
To facilitate the calculation of parameter values β̂0 and βˆ1 , we can form a table
as below:
xi       yi       xi yi     xi²      yi²
3        33       99        9        1,089
7        38       266       49       1,444
6        24       144       36       576
6        61       366       36       3,721
10       52       520       100      2,704
12       45       540       144      2,025
12       29       348       144      841
12       65       780       144      4,225
13       82       1,066     169      6,724
13       63       819       169      3,969
14       50       700       196      2,500
15       79       1,185     225      6,241
Σ: 123   621      6,833     1,421    36,059
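Here, x̄ = Σxi / n = 123 / 12 = 10.25 and ȳ = Σyi / n = 621 / 12 = 51.75. Now, we can
get the β̂1 regression coefficient using this formula:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{6833 - 12(10.25)(51.75)}{1421 - 12(10.25)^2} = 2.92$$

Then β̂0 = ȳ − β̂1x̄ = 51.75 − 2.92(10.25) = 21.82, so the fitted regression model is
ŷ = 21.82 + 2.92x.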
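The least squares estimates for Example 7.1 can be verified with the sketch below. It assumes the intercept is obtained from the usual relation β̂0 = ȳ − β̂1x̄; the function name `fit_simple_linear` is illustrative.

```python
# Data from Example 7.1.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]


def fit_simple_linear(xs, ys):
    """Least squares slope (Formula 7.8) and intercept (y_bar - slope * x_bar)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_linear(x, y)
print(round(b1, 2))  # 2.92
print(round(b0, 2))  # 21.83 (the module obtains 21.82 by using the rounded slope 2.92)
```

The small difference in the intercept comes only from rounding the slope before substituting it.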
SELF-CHECK 7.2
EXERCISE 7.2
x 60 62 64 65 66 67 68 70 72
We are going to test the parameter for the population regression slope β1 using the
β̂1 regression coefficient. The hypothesis-testing process for testing population
parameter β1 is similar to that of testing mean and variance. We will begin with a
hypothesis statement. The null hypothesis claims there is no linear relationship,
which means the slope of the regression line is zero. If we do not reject the null
hypothesis, this means the population regression line may be a horizontal line, in
which the y value does not change with the changes in the x value. In this case,
information on x does not help in predicting the y value.
On the other hand, if the null hypothesis is rejected, there is enough evidence to say
β1 is not zero, that is either β1 > 0 or β1 < 0. This shows that the regression line has
a tendency to increase or decrease, and this helps in predicting the y value using the
x value.
H0 : β1 = 0
H1 : 1. β1 > 0
     2. β1 < 0
     3. β1 ≠ 0

Test statistic:

$$T = \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)}$$

Reject H0 when:
1. T > tα,v
2. T < –tα,v
3. |T| > tα/2,v
s(β̂1) is the standard deviation for β̂1. The formula to get the standard deviation
for β̂1 is:
Example 7.2:
Based on the data in Example 7.1, test at the 0.05 significance level whether there is
enough evidence to say that there is a linear relationship between x and y, that is
(β1 ≠ 0). Construct a 95% confidence interval for β1.
Answer:
The hypothesis statement:
H0 : β1 = 0
H1 : β1 ≠ 0
Prior to obtaining the test statistic value, we need to calculate the value of s(β̂1):
s(β̂1) = 1.26
Test statistic:

$$T = \frac{\hat{\beta}_1}{s(\hat{\beta}_1)} = \frac{2.92}{1.26} = 2.317$$

Reject H0 when |T| > t0.025,10 = 2.228.
Since the test statistic (T = 2.317) > 2.228 (t0.025,10), we reject H0. Hence,
we can conclude that β1 is not zero, that is, there is enough evidence of the existence
of a linear relationship between x and y.
$$\hat{\beta}_1 - t_{0.025,10}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{0.025,10}\, s(\hat{\beta}_1)$$

$$2.92 - 2.228(1.26) \le \beta_1 \le 2.92 + 2.228(1.26)$$

$$0.113 \le \beta_1 \le 5.727$$

This confidence interval shows, with 95% confidence, that the y value increases by
between 0.113 and 5.727 on average for each one-unit increment in x. The wide range
for β1 is due to the small sample size.
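The numbers in Example 7.2 can be re-derived with the sketch below. It rests on an assumption that is not printed in this section: that the standard deviation of the slope is s(β̂1) = s∈ / √Σ(xi − x̄)², the usual textbook form. The inputs s∈ = 15.99 and Σ(xi − x̄)² = 160.25 are the values quoted in Example 7.4, and 2.228 is the table value t0.025,10.

```python
import math

# Values quoted in Examples 7.1, 7.2 and 7.4 of this topic.
b1 = 2.92        # estimated slope
s_e = 15.99      # standard error of the estimate (Example 7.4)
s_xx = 160.25    # sum of (x_i - x_bar)^2
t_crit = 2.228   # t_{0.025, 10} from the t table

# ASSUMPTION: usual textbook formula for the standard deviation of the slope,
# rounded to two decimals as in the module.
s_b1 = round(s_e / math.sqrt(s_xx), 2)

t_stat = b1 / s_b1
lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1

print(s_b1)                                    # 1.26
print(round(t_stat, 3), abs(t_stat) > t_crit)  # 2.317 True
print(round(lower, 3), round(upper, 3))        # 0.113 5.727
```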
EXERCISE 7.3
This is similar to saying that we are quantifying the contribution of x in predicting
the value of y. Let us refer back to Figure 7.4. For each x value, for example x0, we
can separate the deviation of y0 from the mean ȳ into two parts: one part for explained
variation and the other part for unexplained variation. Total variation is the
sum of squares of deviations of the y points from their mean, that is, Σ(yi − ȳ)².
This can be derived from the variation term in y, that is:

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$

$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

Unexplained variation is the sum of squares of deviations of observed y from the ŷ
estimate, that is, Σ(yi − ŷi)², and explained variation is the sum of squares of
deviations of fitted values from the mean, that is, Σ(ŷi − ȳ)².

$$R^2 = \frac{\hat{\beta}_0 \sum y_i + \hat{\beta}_1 \sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2}$$   Formula 7.10
The coefficient of determination is always non-negative, that is 0 ≤ R² ≤ 1, and is
usually expressed as a percentage, that is, by multiplying by 100%. For example, if the
value of R² is 0.57, we say that 57% of the variation in y can be explained by the
fitted regression. The remaining 43% cannot be explained. The higher the value of
R² (approaching 1), the better the simple linear regression model fits the data.
Example 7.3:
Based on the data in Example 7.1, calculate the coefficient of determination and
interpret its meaning if y = sales and x = number of radio advertisements.
Answer:
The coefficient of determination is:
$$R^2 = \frac{21.82(621) + 2.92(6{,}833) - 12(51.75)^2}{36{,}059 - 12(51.75)^2} = 0.3481$$

This means the fitted regression model can explain only 34.81% of the variation in
sales; the remaining 65.19% of the variation in sales is due to other factors.
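Formula 7.10 can be evaluated directly from the raw data of Example 7.1. The sketch below is only a check of the arithmetic; it uses unrounded coefficients, and the variable names are illustrative.

```python
# Data from Example 7.1.
x = [3, 7, 6, 6, 10, 12, 12, 12, 13, 13, 14, 15]
y = [33, 38, 24, 61, 52, 45, 29, 65, 82, 63, 50, 79]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Least squares coefficients, kept unrounded.
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
b0 = y_bar - b1 * x_bar

# Coefficient of determination (Formula 7.10).
r_squared = (b0 * sum(y) + b1 * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)
print(round(r_squared, 4))  # 0.3481
```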
ACTIVITY 7.2
What might happen if the formulated hypothesis is not correct? Discuss
the answer for this question in the myINSPIRE forum.
EXERCISE 7.4
The following shows some graphs that present deviations from assumptions made.
Let us look at Figure 7.5.
Figure 7.5 is a plot of ∈i versus the fitted values ŷi or xi to determine whether the
linearity assumption is fulfilled. The graph shows that the plotted data form a
curve and hence, we can conclude that the relationship is non-linear.
Meanwhile, Figure 7.6 (plot of ∈i versus the fitted values ŷi or xi) shows deviation
of the model from the assumption that the random errors have constant variance.
The plot of data shows a bell-shape pattern. This means that, instead of having a
constant variance, the random errors are actually proportional to the ŷ values. The
random errors have constant variance if the graph shows a random pattern or no trend.
ACTIVITY 7.3
What should we do to the model if there is a violation of assumptions?
Discuss the answer for this question in the myINSPIRE forum.
EXERCISE 7.5
Inverse       x* = 1/x               y = β0 + β1x*
Hyperbolic    y* = 1/y; x* = 1/x     y* = β0x* + β1
A two-way scatter plot is very useful to ascertain whether a model has a linear
or non-linear form. Hence, it is good to know the shape of exponential, power,
inverse and hyperbolic functions (refer to Figure 7.8). Look at both Table 7.3 and
Figure 7.8 on suitable transformation needs.
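To see why such a transformation helps, consider the hyperbolic case. The sketch below assumes the hyperbolic model has the form y = x / (β0 + β1x), so that 1/y = β0(1/x) + β1 is linear in the transformed variables. The coefficient values 0.5 and 1.5 and the x values are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a straight line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical coefficients for a hyperbolic model y = x / (beta0 + beta1 * x).
beta0, beta1 = 0.5, 1.5
x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [v / (beta0 + beta1 * v) for v in x]

# Transform both variables: x* = 1/x and y* = 1/y.
x_star = [1 / v for v in x]
y_star = [1 / v for v in y]

# After the transformation, y* = beta0 * x* + beta1 is a straight line.
intercept, slope = fit_line(x_star, y_star)
print(round(slope, 6), round(intercept, 6))  # 0.5 1.5
```

The fitted slope recovers β0 and the fitted intercept recovers β1, so ordinary simple linear regression can be applied to the transformed data.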
EXERCISE 7.6
(b) $y = 2e^{3.1x}$
(d) $y = \dfrac{x}{0.4 + 2x}$
where y = selling price and x = house size (in square feet). The x values are between
x a and xb . If we would like to predict the selling price of a house where the built-
up area is 2,000 square feet, where the value 2,000 > xa , we can use the regression
model with x value = 2,000. Based on the regression formula, the manager can
predict that the selling price for each house with 2,000 square feet is RM205,000.
However, this selling price is a forward estimation and it does not explain the
position of that value with respect to actual selling price. In other words, is the
estimation value close to the actual value or very different? This relates to the
reliability aspect of a certain prediction. To get information on the position of
estimation values versus actual values, we need to use intervals. There are two
types of intervals used, prediction interval for any dependent variable y, and
estimation interval for estimated value of y.
ACTIVITY 7.4
List any other situations that required prediction or estimation. Share your
answer for discussion in the myINSPIRE forum.
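$$\hat{y} \pm t_{\alpha/2}\, s_{\epsilon} \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$   Formula 7.12

where
ŷ = Future estimated value of dependent variable calculated from
    ŷ = β̂0 + β̂1xg

$$s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$$   Formula 7.13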
Example 7.4:
Refer to the data in Example 7.1. Calculate the 95% prediction interval for
x = 20 and explain its meaning if y = Sales, and x = Number of radio
advertisements.
Answer:
Referring to Example 7.1, the simple linear regression model is:
ŷ = 21.82 + 2.92x
To get the standard error of the estimate, we need the ŷ values. These can be
generated by using the regression model ŷ = 21.82 + 2.92x. The data is as in the
table:
x    3      7      6      6      10     12     12     12     13     13     14     15
y    33     38     24     61     52     45     29     65     82     63     50     79
ŷ    30.58  42.26  39.34  39.34  51.02  56.86  56.86  56.86  59.78  59.78  62.70  65.62

Hence,

$$s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{2{,}556.946}{10}} = 15.99$$
Since α = 0.05, tα/2 = t0.025 = 2.228. Thus, the 95% prediction interval for
xg = 20 is

$$\hat{y} \pm t_{\alpha/2}\, s_{\epsilon} \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$80.22 \pm (2.228)(15.99) \sqrt{1 + \frac{1}{12} + \frac{(20 - 10.25)^2}{160.25}}$$

$$80.22 \pm 46.13$$

The lower and upper limits of the prediction interval are 34.09 and 126.35
respectively. This shows the minimum predicted sales are 34 units and the
maximum is 126 units when 20 advertisements were broadcast on the radio.
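The prediction interval in Example 7.4 can be checked with the following sketch, which plugs the rounded values quoted in the example into Formula 7.12. The function name `prediction_interval` is illustrative.

```python
import math


def prediction_interval(b0, b1, x_g, s_e, t_crit, n, x_bar, s_xx):
    """Prediction interval for a single new y at x_g (Formula 7.12)."""
    y_hat = b0 + b1 * x_g
    half_width = t_crit * s_e * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / s_xx)
    return y_hat, half_width


# Values quoted in Examples 7.1 and 7.4.
y_hat, half_width = prediction_interval(
    b0=21.82, b1=2.92, x_g=20, s_e=15.99, t_crit=2.228,
    n=12, x_bar=10.25, s_xx=160.25,
)
print(round(y_hat, 2), round(half_width, 2))  # 80.22 46.13
print(round(y_hat - half_width, 2), round(y_hat + half_width, 2))  # 34.09 126.35
```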
EXERCISE 7.7
Refer to the data in Exercise 7.2. Calculate the 99% prediction interval for
x = 86 and provide an explanation for it.
E ( y ) = β0 + β1x
Hence, to estimate a mean value of y, given any x g value, we can use the following
interval:
$$\hat{y} \pm t_{\alpha/2}\, s_{\epsilon} \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$   Formula 7.14

This interval applies when the specific value of x lies within the range of the
independent variable x values, i.e. xb ≤ x ≤ xa. Let us look at our last example.
Example 7.5:
Refer to data in Example 7.1. Calculate the 95% confidence interval for the mean
value of y when x g = 11 and explain its meaning if y = Sales, and x = Number
of radio advertisement.
Answer:
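Values for tα/2 and s∈ can be obtained from Example 7.4, and
ŷ = 21.82 + 2.92(11) = 53.94. Hence, the 95% confidence interval for the mean
value of y when xg = 11 is:

$$\hat{y} \pm t_{\alpha/2}\, s_{\epsilon} \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

$$53.94 \pm (2.228)(15.99) \sqrt{\frac{1}{12} + \frac{(11 - 10.25)^2}{160.25}}$$

$$53.94 \pm 10.50$$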
Notes:
The lower and upper confidence limits for mean value of y are 43.44 and
64.44 respectively. This shows the minimum mean sales is 43 units while the
maximum is 64 units when 11 radio advertisements were broadcast.
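The same check can be made for the confidence interval of the mean in Example 7.5. This sketch uses Formula 7.14 with the rounded values quoted in Examples 7.1 and 7.4; the function name is illustrative. The only difference from the prediction interval is that the leading "1 +" under the square root is absent, which is why this interval is much narrower.

```python
import math


def mean_confidence_interval(b0, b1, x_g, s_e, t_crit, n, x_bar, s_xx):
    """Confidence interval for the mean of y at x_g (Formula 7.14)."""
    y_hat = b0 + b1 * x_g
    half_width = t_crit * s_e * math.sqrt(1 / n + (x_g - x_bar) ** 2 / s_xx)
    return y_hat, half_width


# Values quoted in Examples 7.1 and 7.4.
y_hat, half_width = mean_confidence_interval(
    b0=21.82, b1=2.92, x_g=11, s_e=15.99, t_crit=2.228,
    n=12, x_bar=10.25, s_xx=160.25,
)
print(round(y_hat, 2), round(half_width, 2))  # 53.94 10.5
print(round(y_hat - half_width, 2), round(y_hat + half_width, 2))  # 43.44 64.44
```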
EXERCISE 7.8
Refer to Exercise 7.2. Calculate the 99% confidence interval for mean y
when x = 69 and provide an explanation for it.
• Regression analysis concepts deal with finding the best relationship between a
dependent variable, Y, and independent variable, X, quantifying the strength of
that relationship, and the use of methods that allow for prediction of the
response values (Y) given values of the regressor x.
• The least squares method is used to get parameter estimates for the slope of
regression line and intercept on y-axis.
• If the simple linear regression model obtained is adequate for the data, the model
can be used to estimate the dependent variable value for any specific
independent variable value. This model can also be used to predict the mean
value of a dependent variable.
– Coefficient of determination, R 2 ;
Freund, J. E. (2003). Mathematical statistics (7th ed.). Upper Saddle River, NJ:
Prentice-Hall, Inc.
Mann, P. S. (2005). Introductory statistics using technology (5th ed.). Upper Saddle
River, NJ: John Wiley & Sons, Inc.
William, M., & Terry, S. (2003). A second course in statistics: Regression analysis.
(6th ed.). Upper Saddle River, NJ: Prentice-Hall, Inc.
INTRODUCTION
Most practical applications of regression analysis utilise models that are more
complex than the simple linear regression discussed in Topic 7, where only two
variables, the dependent variable (Y) and independent variable (X), are considered.
In simple linear regression, it is important to fully understand the following items:
Y′ = β0 + β1X   Formula 8.1
Y = β0 + β1X + ∈   Formula 8.2a
Y = Y′ + ∈   Formula 8.2b
(c) It can be understood from the model in Formula 8.2b that Yˆ is a random
variable that follows a certain population distribution. Here, ∈ is assumed to
be independent of each other for any Y value with a zero mean. From the
previous formulas, for a given X value, the expectation is:
E(Y | X) = β0 + β1X   Formula 8.3
(d) In regression analysis, the first thing to do is to estimate the Yˆ line using the
model in Formula 8.2a, that is, to get estimations for β0 and β1 using
methods such as ordinary least squares which uses n pairs of observation
values (y,x).
(e) β0 and β1 are known as coefficients for Yˆ in Formula 8.2a that need to be
estimated through Formula 8.2b using methods such as ordinary least squares
which uses n pairs of observation values (y,x). How far Yˆ is able to explain
variation in Y values is usually associated with how accurate the estimation of
both parameters is. This is usually measured with the R 2 coefficient where
the closer is its value to 1, the better is Yˆ in explaining variation in Y, and
vice-versa.
In another situation, a manager may want to find out the effect/relationship between
advertising cost and total space allocated in an advertising board (as two
independent variables, X 1 and X 2 respectively) on the amount of monthly product
sales (as the dependent variable, Y). Thus, Formulas 8.1, 8.2a and 8.2b would be:
Y ′ = β 0 + β1 X 1 + β 2 X 2 Formula 8.4
Y = β 0 + β1 X 1 + β 2 X 2 + ∈ Formula 8.5a
Y = Y ′+ ∈ Formula 8.5b
Formula 8.4 or its equivalent involving two or more independent variables is called
a multiple regression formula. Thus, the model in Formula 8.5a or 8.5b is termed
a multiple regression model. In the multiple regression method, there are usually
n observations for ( y, x 1 , x2 ). These observations are used to estimate β0 , β1 , β 2
using a method such as ordinary least squares. Similar to the linear regression case,
the goodness of fit of Y ′ in explaining variation in Y values depends on the
precision of the β0 , β1 , β 2 estimation. The R 2 coefficient of determination is also
used to measure this. However, the formula used is different in this case. This matter
will be further discussed in the subsequent subtopic. Let us continue with the lesson.
where
Y ′ = β 0 + β1 X 1 + β 2 X 2 + ... + β k X k
However, in this module, we will only focus on the multiple regression model
with two independent variables. For cases involving more than two independent
variables, the calculation is usually done with the help of a statistical package such
as Statistical Package for Social Sciences (SPSS); you can refer to any related books
to read more about this.
ACTIVITY 8.1
Search on the Internet for SPSS and find keywords that describe it. Share
your findings in the myINSPIRE forum.
Y = Y ′+ ∈
where
Y ′ = β 0 + β1 X 1 + β 2 X 2
For the ith observation, i = 1,2, ..., n, the above model value (a number, written
in a small letter) is:
yi = β 0 + β1 x1i + β 2 x2i + ∈i Formula 8.7a
where
yi′ = β 0 + β1 x1i + β 2 x2i Formula 8.8
Hence, using Formulas 8.7b and 8.8, errors between observations and their
estimations are given as ∈i = yi − y′i. The error value can take a negative sign.
Using the outcome of Formula 8.9, separating and equating the formula to
zero, we obtained the following three formulas:
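$$\sum_{1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{1}^{n} x_{1i} + \hat{\beta}_2 \sum_{1}^{n} x_{2i}$$   Formula 8.10

$$\sum_{1}^{n} y_i x_{1i} = \hat{\beta}_0 \sum_{1}^{n} x_{1i} + \hat{\beta}_1 \sum_{1}^{n} x_{1i}^2 + \hat{\beta}_2 \sum_{1}^{n} x_{1i} x_{2i}$$   Formula 8.11

$$\sum_{1}^{n} y_i x_{2i} = \hat{\beta}_0 \sum_{1}^{n} x_{2i} + \hat{\beta}_1 \sum_{1}^{n} x_{2i} x_{1i} + \hat{\beta}_2 \sum_{1}^{n} x_{2i}^2$$   Formula 8.12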
Example 8.1:
Given a set of n = 10 observations (y, x1, x2), Table 8.2 is usually constructed for
manual calculation of parameter estimation (when n is only moderately large, so that
calculation work such as this can still be carried out by hand).
y    x1    x2    d    d1    d2    d²    dd1    dd2    d1²    d2²    d1d2
For large n, the calculation can be done with computer assistance using a statistical
package such as Excel, Minitab or SPSS. Please take note that different
packages may generate slightly different answers due to rounding error.
From Formulas 8.13, 8.14 and 8.15, we obtain the following estimates:
$$\hat{\beta}_1 = \frac{(7.46)(6) - (15)(0.5)}{(1.221)(6) - (0.5)^2} = 5.265687 \approx 5.266,$$

$$\hat{\beta}_2 = \frac{(15)(1.221) - (7.46)(0.5)}{(1.221)(6) - (0.5)^2} = 2.061193 \approx 2.0612,$$

$$\hat{\beta}_0 = 30.2 - 5.266(2.77) - 2.0612(3) = 9.430469 \approx 9.430.$$
Model estimation:
yˆ ′ = 9.430 + 5.266 x1 + 2.0612 x2 Formula 8.16
The negative residual value indicates an over-estimation by ŷ′3, since the
value of ŷ′3 > y3.
On the other hand, for i = 7, y7 = 32; ŷ′7 = 9.430 + 5.266(2.7) + 2.061(3) =
29.832 ≈ 29.83, and its residual is ∈7 = y7 − ŷ′7 = 32 − 29.83 = 2.17.
The positive residual value indicates that the value ŷ′7 is under-estimated as
the value of ŷ′7 < y7.
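The estimates in Example 8.1 can be reproduced from deviation sums. The sketch below assumes that d, d1 and d2 denote deviations of y, x1 and x2 from their respective means, and that the ten observations are the (y, x1, x2) rows listed later in this topic for the Excel illustration; with that data the sums 1.221, 6, 0.5, 7.46 and 15 used above are recovered. The helper names are illustrative.

```python
# Ten (y, x1, x2) observations, as listed later in this topic.
y = [30, 36, 28, 29, 27, 28, 32, 27, 34, 31]
x1 = [3.2, 3.4, 2.8, 2.4, 2.5, 2.2, 2.7, 2.6, 3.0, 2.9]
x2 = [2, 4, 3, 4, 2, 3, 3, 2, 4, 3]


def mean(values):
    return sum(values) / len(values)


def deviations(values):
    """Deviations from the mean (assumed meaning of d, d1 and d2)."""
    m = mean(values)
    return [v - m for v in values]


d, d1, d2 = deviations(y), deviations(x1), deviations(x2)
s11 = sum(a * a for a in d1)              # sum of d1^2
s22 = sum(a * a for a in d2)              # sum of d2^2
s12 = sum(a * b for a, b in zip(d1, d2))  # sum of d1*d2
sy1 = sum(a * b for a, b in zip(d, d1))   # sum of d*d1
sy2 = sum(a * b for a, b in zip(d, d2))   # sum of d*d2

det = s11 * s22 - s12 ** 2
b1 = (sy1 * s22 - sy2 * s12) / det
b2 = (sy2 * s11 - sy1 * s12) / det
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

print(round(b0, 3), round(b1, 3), round(b2, 4))  # 9.43 5.266 2.0612
```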
The population parameter estimates, βˆ1 , βˆ2 , follow a sampling distribution with
mean and variance as shown below:
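Mean(β̂1) = β1, and Variance(β̂1) = Var(β̂1) = s²(β̂1),
where

$$s^2(\hat{\beta}_1) = \frac{\sum \epsilon_i^2}{n - k - 1} \cdot \frac{\sum d_2^2}{\left(\sum d_1^2\right)\left(\sum d_2^2\right) - \left(\sum d_1 d_2\right)^2}$$   Formula 8.19

and
Mean(β̂2) = β2, and Variance(β̂2) = Var(β̂2) = s²(β̂2),
where

$$s^2(\hat{\beta}_2) = \frac{\sum \epsilon_i^2}{n - k - 1} \cdot \frac{\sum d_1^2}{\left(\sum d_1^2\right)\left(\sum d_2^2\right) - \left(\sum d_1 d_2\right)^2}$$   Formula 8.20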
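H0 : βi = 0; H1 : βi ≠ 0;
Hypothesis test:

$$T = \frac{\hat{\beta}_i - \beta_i}{s(\hat{\beta}_i)} = \frac{\hat{\beta}_i}{s(\hat{\beta}_i)}, \quad i = 1, 2.$$

$$\hat{\beta}_i - t_{\alpha/2,\,n-k-1}\, s(\hat{\beta}_i) \le \beta_i \le \hat{\beta}_i + t_{\alpha/2,\,n-k-1}\, s(\hat{\beta}_i)$$   Formula 8.21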
Example 8.2:
Testing Significance of Model Parameters (Formula 8.19)
This model has k = 2 parameter estimators and n = 10. From the data in
Example 8.1, the following values are obtained:
y y′ ∈ ∈2 d2
$$\text{Var}(\hat{\beta}_1) = s^2(\hat{\beta}_1) = [1.914297] \frac{[6]}{(1.221)(6) - (0.5)^2} = 1.6232,$$

Standard deviation, s(β̂1) ≈ 1.274
and

$$\text{Var}(\hat{\beta}_2) = s^2(\hat{\beta}_2) = [1.914297] \frac{[1.221]}{(1.221)(6) - (0.5)^2} = 0.330322$$

Standard deviation, s(β̂2) ≈ 0.5747
The test statistic value, t2 = β̂2 / s(β̂2) = (2.061193) / 0.5747 ≈ 3.5866 >
tα/2(7) = 2.3646, so H0 is rejected at the 5% level; hence, it is significant.
There is enough evidence at the 5% significance level that x2 contributes to
the variation in y values.
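The standard deviations and test statistics in Example 8.2 can be recomputed from Formulas 8.19 and 8.20. The sketch below takes the sum of squared residuals as 13.40 (the rounded value quoted with Formula 8.23a) and the deviation sums from Example 8.1, so the results agree with the module only up to rounding. The critical value 2.3646 is the table value quoted above.

```python
import math

# Quantities quoted in Examples 8.1 and 8.2.
n, k = 10, 2
sse = 13.40                    # sum of squared residuals (rounded)
s11, s22, s12 = 1.221, 6, 0.5  # sums of d1^2, d2^2 and d1*d2
b1, b2 = 5.265687, 2.061193    # estimated coefficients
t_crit = 2.3646                # t_{alpha/2} with n - k - 1 = 7 degrees of freedom

mse = sse / (n - k - 1)
det = s11 * s22 - s12 ** 2

s_b1 = math.sqrt(mse * s22 / det)  # Formula 8.19
s_b2 = math.sqrt(mse * s11 / det)  # Formula 8.20

t1 = b1 / s_b1
t2 = b2 / s_b2

print(round(s_b1, 3), round(s_b2, 4))  # 1.274 0.5747
print(round(t1, 2), round(t2, 2))      # 4.13 3.59
print(abs(t1) > t_crit, abs(t2) > t_crit)  # True True
```

Both coefficients exceed the critical value, so each of x1 and x2 is significant at the 5% level under this sketch.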
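\[
R^2 = \frac{\sum (y' - \bar{y})^2}{\sum (y - \bar{y})^2} = 1 - \frac{\sum \epsilon^2}{\sum (y - \bar{y})^2} \qquad \text{Formula 8.23a}
\]
Without calculating y′, the coefficient is given by the following formula:
\[
R^2 = \frac{\hat{\beta}_1 \sum d_1 d_y + \hat{\beta}_2 \sum d_2 d_y}{\sum d_y^2} \qquad \text{Formula 8.23b}
\]
\[
R^2 = 1 - \frac{\sum \epsilon^2}{\sum (y - \bar{y})^2} = 1 - \frac{13.40}{83.6} = 0.839713 \approx 0.8397
\]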
This means the estimated model (Formula 8.16) can only explain 83.97% of
variation in y values; and the rest cannot be explained by the model and is usually
contained in the error ∈.
\[
R^2 = \frac{(5.265687)(7.46) + (2.061193)(15)}{83.6} = 0.839712 \approx 0.8397
\]
\bar{R}^2 (R^2-adjusted)
\[
\bar{R}^2 = 1 - (1 - R^2)\,\frac{(n - 1)}{(n - k - 1)} \qquad \text{Formula 8.23c}
\]
\[
SSR = \sum \hat{y}_i^2
\]
When k ≥ 1, \bar{R}^2 ≤ R^2;
\[
\bar{R}^2 = 1 - (1 - 0.839712)\,\frac{(10 - 1)}{(10 - 2 - 1)} = 0.793915
\]
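The two routes to R² and the adjusted value can be compared numerically. This sketch assumes the sums quoted in the example (residual sum of squares 13.40, total sum of squares 83.6, cross-product sums 7.46 and 15); the variable names are illustrative.

```python
n, k = 10, 2
sse, syy = 13.40, 83.6               # sum of squared residuals, sum (y - ybar)^2
b1, b2 = 5.265687, 2.061193
S1y, S2y = 7.46, 15.0                # sum d1*dy, sum d2*dy

r2_from_residuals = 1 - sse / syy                  # Formula 8.23a
r2_from_coeffs = (b1 * S1y + b2 * S2y) / syy       # Formula 8.23b
r2_adjusted = 1 - (1 - r2_from_coeffs) * (n - 1) / (n - k - 1)   # Formula 8.23c

print(round(r2_from_residuals, 4), round(r2_from_coeffs, 6), round(r2_adjusted, 6))
```

Both routes agree to about four decimal places; the small remaining gap comes from the rounded value 13.40.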
Testing Significance of Overall Regression Model
The F test can be used to test the significance of the overall regression model. It is
based on the ratio of the variance explained by the model to the remaining
unexplained variance. The F distribution is used, with k and n − k − 1 degrees of
freedom, where k is the number of parameters estimated excluding the constant β0.
(A few books consider β0 as a parameter; hence, the F degrees of freedom become
k − 1 and n − k.) The test statistic is given by:
\[
F_{k,\,n-k-1} = \frac{\sum \hat{y}_i^{\prime 2} / k}{\sum \epsilon_i^2 / (n - k - 1)} = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)} \qquad \text{Formula 8.24}
\]
or
H 0 : β1 = β 2 = 0; Formula 8.25b
Test result: If the F-probability is < 0.05, reject H0 at the 5% significance level. This
means there is significant evidence that the regression parameters β0, β1, β2,
especially the last two, are not all zero. Consequently, the coefficient of
determination R² is significantly different from zero.
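For the running example, the F statistic in its R² form can be evaluated as below. The sketch assumes R² = 0.839712, n = 10 and k = 2 from Example 8.2; the critical value 4.74 for F(2, 7) at the 5% level is taken from standard F tables and is not quoted in the module.

```python
n, k = 10, 2
r2 = 0.839712                         # coefficient of determination from Example 8.2

# Formula 8.24 in its R-squared form.
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

f_crit = 4.74                         # assumed: tabulated F(0.05; 2, 7)
reject_h0 = f_stat > f_crit

print(round(f_stat, 2), reject_h0)
```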
SELF-CHECK 8.1
EXERCISE 8.1
30 3.2 2
36 3.4 4
28 2.8 3
29 2.4 4
27 2.5 2
28 2.2 3
32 2.7 3
27 2.6 2
34 3 4
31 2.9 3
3. Click on “Tools”.
4. Click on “Data analysis”.
5. Double click on “Regression”.
6. Enter y range as $A$2:$A$11
7. Enter x range as $B$2:$C$11
8. Choose: Label, confidence level, residuals, residual plot, normal probability
plots.
9. Click on “OK”.
Observation   Predicted 30    Residuals       Standard Residuals   Percentile     30
1             35.71396896      0.286031042     0.22368127           5.555555556   27
2             30.44733925     -2.447339246    -1.91386207          16.66666667    27
3             30.10421286     -1.10421286     -0.86351376          27.77777778    28
4             26.86363636      0.136363636     0.10663875          38.88888889    28
5             27.08148559      0.918514412     0.71829432          50             29
6             29.88636364      2.113636364     1.65290058          61.11111111    31
7             27.42461197     -0.424611973    -0.33205398          72.22222222    32
8             33.47006652      0.529933481     0.41441724          83.33333333    34
9             31.00831486     -0.008314856    -0.00650236          94.44444444    36
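Take note that there are differences between the values obtained manually and those
obtained using Excel. Rounding explains only small discrepancies. The output above
lists 9 observations under the headings “Predicted 30” and “30”, which suggests that
the first data row (y = 30) was read as a label, because “Label” was selected while the
input ranges start at the first observation.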
1. Multiple R
This quantity is often referred to as the multiple correlation between y and all
independent variables, without any conditions imposed on any independent
variable. Squaring this value gives the multiple R².
2. R Square
Measures the goodness of fit of the y′ model to the observed y. A high value
means that the regression model explains the variation in Y well, accounting for
as much as (R² × 100)% of it.
On the other hand, a small value indicates a poor fit of the regression model.
(b) This means there exists a linear relationship between Y and X1 and X 2
simultaneously.
(d) The R² value is significantly different from zero, which indicates that there
exists a linear relationship between Y and X1 and X2.
4. t Test
This test is used to evaluate whether an individual regression coefficient
(β1, β2) is significantly different from zero at the α level, by comparing the p
probability value with the α value. For example, assume that α = 0.05; if p < α,
reject H0 : β1 = 0 at the 5% level. This means that β1 ≠ 0, and x1 contributes
significantly to the variation in Y. In the given example, we found that β1 and
β2 are both significantly different from zero at the 5% level.
5. Residual Output
One way to assess the adequacy of a model is by looking at the parameters
involved and checking whether the assumptions in model construction are
fulfilled. This can be done by plotting the residuals against the predicted values. If
the assumptions are met, the residuals will usually not show any
particular pattern in the residuals plot.
This means that residuals are random around the horizontal line, as shown in
Figure 8.1.
The normality assumption is satisfied if the normal plot follows a straight line.
Referring to Figure 8.2, not all points fall on the straight line; hence, the
regression model can satisfy only about 94% of the normality assumption.
ACTIVITY 8.2
Based on your understanding of what you have learned, try to think of two
or three independent variables for the following dependent variables.
(a) Profit made by a company;
(b) Students’ final examination grade; and
(c) Selling price of a house.
EXERCISE 8.2
1. (a) Perform a manual analysis and use Formulas 8.13, 8.14 and
8.15 to obtain the parameter estimates for β0 , β1 and β2 for
the multiple regression model for the following data:
Y : 10, 24, 40, 20, 15
X 1 : 2, 3, 7, 3, 4
X 2 : 5, 6, 6, 5, 3
X 1 : 2, 3, 7, 3, 4; X 2 : 5, 6, 6, 5, 3
X 1 : 2, 4, 4, 3, 4; X 2 : 2, 6, 8, 6, 3
• The least squares method is used to obtain the estimates of the regression model
parameters and y-intercept. Next, hypothesis testing is carried out on the
regression coefficients to determine whether there exists enough evidence to
state the existence of a linear relationship.
• The goodness of fit between two variables can be obtained using the R²
coefficient of determination.
• The deviation of a model from the assumptions made can be identified using a
residuals plot that should not display any distinct pattern (that is random) if the
model assumptions are met.
• If the multiple linear regression model obtained is suitable and fits the data, this
model can be used to obtain the estimated values of a dependent variable for
any given independent variable values. This model can later be used for making
predictions for any independent variable value as well as estimating the values
of dependent variables.
Multiple linear regression – A regression model with one dependent and two or
more independent variables that assumes a straight-
line relationship.
INTRODUCTION
In testing a single population using parametric methods, the t-test and z-test (for
large sample size) are used to determine whether population mean, μ, is equivalent
to or different from a certain mean value, μ0 . Non-parametric methods for testing
a single population also check whether there exists significant difference in terms
of location or position of a population with a given value of a measure of location.
Hence, the mean is replaced by the median as the pertinent location parameter under
test. This is because the median value is not influenced by outlier values or skewed
distribution shape. The median is the middle value of the data, separating a
population into two equal halves.
To test a population based on the median, we will test the hypothesis whether
the median of a population under study, say, denoted by τ (read as “tau”), is
significantly different from a specific median, say, τ0 (read as “tau nought”).
There are two types of non-parametric statistical tests that will be discussed in this
topic to test the position or location of a population using median value. These tests
are sign test and signed-rank test. Let us find out more about them in the following
subtopics.
ACTIVITY 9.1
On the other hand, null hypothesis refers to a statement that we wish to reject (note
that null comes from the English word “nullify” which means reject or invalidate).
For example, in a study to test consumers’ preferences on a certain product, the test
will decide whether more than half of the consumers sampled chose this product or,
equivalently, fewer than half chose the other products. If x measures consumer
preference for the product, then under the null hypothesis the probability that x
is larger (or smaller) than the median value must be equal to 1/2 (see
Figure 9.1).
For the previous example, the expression term in constructing the alternative
hypothesis statement H1 : θ > 0.5 or H1:τ > τ 0 is “larger than”.
In summary, the expressions in the previous example can lead us to decide the
set of null and alternative hypotheses for a single-population test, as stated in
Table 9.1.
Table 9.1: Expression Terms with Related Null and Alternative Hypotheses
SELF-CHECK 9.1
ACTIVITY 9.2
State the expression terms and define the appropriate null and alternative
hypotheses for the following cases:
Statement with Expressions          Alternative Hypothesis     Null Hypothesis
(a) A supervisor recorded 9 observations of
battery lifetime before new recharge
required. Determine if this battery
operates with a median of 1.8 hours
before the next charging.
(b) The following data were obtained from a
non-normal population. Determine if
the median is less than 5.2 cm.
Median is the observed value that is located in the middle of data when all
observations are ranked in sequence regardless of an increasing or decreasing order.
If the sample size is even, the median would be the mean value of the two
observations at the centre. Similar to the mean value, median is also a measure of
location for a distribution. Hence, the sign test is sometimes known as test of
distribution location.
In the sign test, we replace each sample value exceeding median value, τ 0 with a
plus sign and each value less than τ 0 with a minus sign, that is:
The sample value which is equal to τ 0 value will be replaced with the “0” sign.
This situation occurs if we deal with rounded data even though the population is
continuous. Observations replaced with “0” sign will not be used in subsequent
analysis. When this occurs, the sample size for analysis decreases by
the number of “0” (zero) signs. Table 9.2 summarises this information.
In the sign test, the test statistic, S, is the random variable x representing the number
of “+” signs in the random sample. If the null hypothesis τ = τ0 is true, the
probability that a sample value results in either a “+” or a “–” sign is equal to 1/2.
Therefore, we are actually testing H0 that the number of “+” signs, S, is a value of
a random variable having the binomial distribution with the parameter θ = 1/2, that
is:
\[
\theta = \Pr(x_i - \tau_0 > 0) = \Pr(x_i > \tau_0) = \frac{1}{2}
\]
For the previous example, we shall reject the null hypothesis H0 : τ = 100 or
H0 : θ = 0.5 in favour of H1 : θ > 0.5 only if the proportion of “+” signs is
sufficiently greater than 1/2, that is, when S is large.
Let us learn the procedure for test of location for a single population:
(a) State the null and alternative hypotheses to test a population location
parameter:
One-sided test: H0 : θ = 1/2 versus H1 : θ > 1/2
[or H0 : θ = 1/2 against H1 : θ < 1/2]
Two-sided test: H0 : θ = 1/2 versus H1 : θ ≠ 1/2
or
One-sided test: H0 : τ = τ0 versus H1 : τ > τ0
[or H0 : τ = τ0 against H1 : τ < τ0]
Two-sided test: H0 : τ = τ0 versus H1 : τ ≠ τ0.
Example 9.1:
Suppose that we would like to test if the median of a population is less than 51.
From the following observations, calculate the test statistics and significance
level.
36 43 52 51 51 48 57 50
Answer:
To decide whether τ is smaller than 51, test
H0 : τ = 51 versus H1 : τ < 51
36   43   52   51   51   48   57   50
–    –    +    0    0    –    +    –
Example 9.2:
It is suspected that the percentage of active bacteria obtained from a sewerage
specimen at an area has a median of 40. The active bacteria percentages in a
random sample of 9 specimens are given below:
41 33 43 52 37 44 49 53 40
Is there enough evidence based on the data provided to claim that the median for
active bacteria percentage exceeded 40? Carry out a sign test by using α = 0.05.
Answer:
To determine if the median of the percentage of active bacteria, τ, exceeds 40, test:
H0 : τ = 40 versus H1 : τ > 40 at α = 0.05.
Assigning values exceeding 40 with “+” sign and values less than 40 with “–”
sign,
41 33 43 52 37 44 49 53 40
+ – + + – + + + 0
Using the sign test, the test statistic, S = number of observed “+” signs = 6.
Hence, S is distributed as binomial with n = 9 – 1 = 8 and θ = 0.5. If the x variable
is distributed as binomial with n = 8 and θ = 0.5, then:
p-value = Pr(x ≥ S) = Pr(x ≥ 6) = 1 – Pr(x ≤ 5)
= 1 – 0.8555 = 0.1445
Since α = 0.05 < p-value = 0.1445, do not reject H0. There is not
enough evidence to say that the median for active bacteria percentage exceeds 40
at the 5% significance level.
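The exact binomial p-value in Example 9.2 can be reproduced with a few lines of code. This is a minimal sketch for the one-sided (right) case only; the function name `sign_test_upper_p` is a hypothetical helper, not part of any library.

```python
from math import comb


def sign_test_upper_p(values, tau0):
    """One-sided (right) sign test: returns (S, n, p-value) with S = number of '+' signs."""
    plus = sum(1 for v in values if v > tau0)
    minus = sum(1 for v in values if v < tau0)
    n = plus + minus                              # values equal to tau0 are discarded
    # P(X >= S) for X ~ Binomial(n, 0.5)
    p = sum(comb(n, j) for j in range(plus, n + 1)) / 2 ** n
    return plus, n, p


bacteria = [41, 33, 43, 52, 37, 44, 49, 53, 40]
s, n, p_value = sign_test_upper_p(bacteria, 40)
print(s, n, round(p_value, 4))
```

Because the p-value exceeds α = 0.05, the sketch leads to the same decision as the worked answer.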
SELF-CHECK 9.2
ACTIVITY 9.3
The procedure for single population test of location for a large sample is as follows:
(a) State the null and alternative hypotheses for testing a population location
(refer to hypothesis for small n).
\[
Z = \frac{S - \mathrm{Mean}(S)}{\sqrt{\mathrm{var}(S)}} = \frac{S - 0.5n}{0.5\sqrt{n}}
\]
One-sided test: Z ≥ zα
Two-sided test: |Z| ≥ zα/2
zα and zα/2 values can be obtained from Appendix 3.1 (see Topic 3), the standard
normal table.
Example 9.3:
The following data are the amount of sulphur oxide (in tons) emitted by an
industrial plant in 40 days. Perform a sign test to determine whether τ < 21.5 at
0.01 significance level.
17 15 20 29 19 18 22 25 27 9 24 20 17 6 24
14 15 23 24 26 19 23 28 19 16 22 24 17 20 14
13 19 10 23 18 31 13 20 17 24
Answer:
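Step 1: H0 : τ = 21.5 versus H1 : τ < 21.5
Step 2: For a one-sided test, reject H0 if the test statistic z > z0.01 = 2.33, where
\[
z = \frac{S - n\theta}{\sqrt{n\theta(1 - \theta)}}
\]
With n = 40, θ = 0.5 and S = 24 (the number of values below 21.5),
z = (24 − 20)/√10 = 1.26.
Step 3: Since z = 1.26 < 2.33, we do not reject H0. There is not enough
evidence that the median sulphur oxide emission is less than 21.5 tons at
the 0.01 significance level.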
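The value z = 1.26 in Example 9.3 can be traced back to the data. The sketch below assumes that, for the alternative τ < 21.5, S counts the observations below 21.5 (the “–” signs); that reading is inferred from the reported z value and is not stated explicitly in the answer.

```python
import math

sulphur = [17, 15, 20, 29, 19, 18, 22, 25, 27, 9, 24, 20, 17, 6, 24,
           14, 15, 23, 24, 26, 19, 23, 28, 19, 16, 22, 24, 17, 20, 14,
           13, 19, 10, 23, 18, 31, 13, 20, 17, 24]
tau0 = 21.5

minus = sum(1 for v in sulphur if v < tau0)   # observations below the hypothesised median
plus = sum(1 for v in sulphur if v > tau0)
n = plus + minus                              # no observation equals 21.5

z = (minus - 0.5 * n) / (0.5 * math.sqrt(n))  # normal approximation to the binomial
z_crit = 2.33                                 # z(0.01) as quoted in the text
reject_h0 = z > z_crit

print(minus, plus, n, round(z, 2), reject_h0)
```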
EXERCISE 9.1
Given that τ0 = 160, use the sign test to test the null hypothesis
τ = τ0 against the alternative hypothesis τ > τ0 at the 0.05 significance level.
Under the null hypothesis that there is no difference between x values and τ 0 , we
would expect that on average, half the differences would be negative and the other
half positive.
In other words, there will be about n/2 negative differences and about n/2 positive
differences. Next, we rank these positive and negative differences by absolute value,
and assign ranks according to sequence. It is expected that the total rank
corresponding to the positive differences should be equal or nearly equal to the
total rank corresponding to the negative differences. A marked difference between
the total ranks assigned to positive and negative differences is an indication of
differences between x values and τ0.
What are the procedures in applying the Wilcoxon signed-rank test? Let us follow
the following procedure:
(c) Assign rank to the absolute difference (rank 1 for smallest absolute difference;
rank n for the largest).
(d) When the absolute value of two or more differences is the same, assign to
each the average of the ranks that would have been assigned if the differences
were distinguishable.
(f) Differences with 0 value are discarded; hence, the sample size is reduced by
the number of zero differences.
The smaller the total rank value, the stronger the indication that differences exist
between the sample values and τ0. Hence, we can reject H0 when the test
statistic, that is the total rank, say T, is less than or equal to a critical value T0.
The single population test procedure that considers both the magnitude and the sign
of the differences is as follows:
(a) State the null and alternative hypotheses statement for a single
population test.
https://faculty.washington.edu/heagerty/Books/Biostatistics/TABLES/t-
Tables/.
Example 9.4:
Based on Example 9.2, test to determine whether the median of percentages of
active bacteria exceeds 40 at α = 5%. Use Wilcoxon signed-rank to make a
decision.
Answer:
H 0 :τ = 40 versus H1:τ > 40, α = 0.05
xi    xi − τ0   |xi − τ0|   Sign   Rank
41    +1        1           (+)    1
33    –7        7           (–)    5
43    +3        3           (+)    2.5
52    +12       12          (+)    7
37    –3        3           (–)    2.5
44    +4        4           (+)    4
49    +9        9           (+)    6
53    +13       13          (+)    8
40    0         0                  –
From the previous table, the sum of ranks for differences with “+” sign and the sum
of ranks for differences with “–” sign are T+ = 28.5 and T− = 7.5 respectively. Since
there is only one observation with a value equal to the median value, n = 9 – 1 = 8.
For a one-sided test, the test statistic is given as T− = 7.5. From the table of
critical values for the Wilcoxon signed-rank test (see Appendix 9.1), with n = 8,
the critical value is T0.05 = 4. Since T− = 7.5 is not ≤ T0.05 = 4, we do not reject
H0. There is not enough evidence to prove that the median percentage of active
bacteria is more than 40. A similar conclusion is obtained through a sign test.
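The ranking steps of Example 9.4, including the average rank given to tied absolute differences, can be sketched as follows. The helper `signed_rank_sums` is a hypothetical name used only for this illustration.

```python
def signed_rank_sums(values, tau0):
    """Return (n, T_plus, T_minus) for the one-sample Wilcoxon signed-rank test."""
    diffs = [v - tau0 for v in values if v != tau0]   # zero differences are discarded
    ordered = sorted(diffs, key=abs)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        # Tied absolute differences occupy ranks i+1 .. j and share their average.
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2
        i = j
    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return len(diffs), t_plus, t_minus


bacteria = [41, 33, 43, 52, 37, 44, 49, 53, 40]
n, t_plus, t_minus = signed_rank_sums(bacteria, 40)
print(n, t_plus, t_minus)
```

The two rank sums always add up to n(n + 1)/2, which gives a quick check on the ranking.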
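\[
\mu_T = E(T) = \frac{n(n + 1)}{4}, \qquad \sigma_T^2 = \mathrm{Var}(T) = \frac{n(n + 1)(2n + 1)}{24}
\]
The signed-rank test statistic for n ≥ 15 is
\[
Z = \frac{T - \mu_T}{\sigma_T},
\]
distributed as standard normal. For a one-sided test, the test statistic is
\[
Z = \frac{T^- - \mu_T}{\sigma_T} \ \text{(right side)} \quad \text{or} \quad Z = \frac{T^+ - \mu_T}{\sigma_T} \ \text{(left side)}.
\]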
Let us look at the procedure for one location test with large n using the Wilcoxon
signed-rank test:
(a) State the null and alternative hypotheses statement for a single population test.
Using the same definition for T, T+ and T− as previously, the test statistic is
\[
Z = \frac{T - \dfrac{n(n + 1)}{4}}{\sqrt{\dfrac{n(n + 1)(2n + 1)}{24}}},
\]
with T = T− (one-sided right test),
Two-sided test: |Z| ≥ zα/2
Example 9.5:
It is claimed that a type of detergent is the choice of many consumers. To test
this claim, the detergent’s producer has recorded its sales for a month at a
hypermarket as shown below:
Test whether the median sales of this detergent differs from 120 units at 5% level
of significance.
Answer:
H0 : τ = 120 against H1 : τ ≠ 120 at α = 0.05. The median value, τ0 = 120, is
subtracted from each observation. The magnitudes and signs of the differences are
recorded and next, ranks are assigned to each difference. The results are as follows:
yi     yi − τ0   Sign   Rank      yi     yi − τ0   Sign   Rank
85     –35       (–)    25.5      73     –47       (–)    29
99     –21       (–)    17.5      123    +3        (+)    4
120    0         0                119    –1        (–)    2
116    –4        (–)    5.5       85     –35       (–)    25.5
138    +18       (+)    13        128    +8        (+)    9
100    –20       (–)    15.5      150    +30       (+)    23
129    +9        (+)    10        124    +4        (+)    5.5
115    –5        (–)    7         100    –20       (–)    15.5
141    +21       (+)    17.5      101    –19       (–)    14
142    +22       (+)    19        130    +10       (+)    11
121    +1        (+)    2         119    –1        (–)    2
94     –26       (–)    22        127    +7        (+)    8
78     –42       (–)    28        96     –24       (–)    21
152    +32       (+)    24        109    –11       (–)    12
97     –23       (–)    20        83     –37       (–)    27
One observation has been discarded since its difference is 0; hence, n = 29. The
rank sums are T+ = 146 and T− = 289. Since n is large, the normal approximation
is used. The test statistic calculated is:
\[
Z = \frac{T - \dfrac{n(n + 1)}{4}}{\sqrt{\dfrac{n(n + 1)(2n + 1)}{24}}} = \frac{146 - \dfrac{29(30)}{4}}{\sqrt{\dfrac{29(30)(59)}{24}}} = -1.546
\]
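The large-sample calculation of Example 9.5 can be retraced from the tabulated sales figures. In this sketch the third observation is read as 120 (consistent with its zero difference) and the two-sided critical value 1.96 comes from the standard normal table; both are assumptions, and `signed_rank_sums` is a hypothetical helper.

```python
import math


def signed_rank_sums(values, tau0):
    """Return (n, T_plus, T_minus); tied absolute differences share an average rank."""
    diffs = [v - tau0 for v in values if v != tau0]
    ordered = sorted(diffs, key=abs)
    rank_of, i = {}, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2
        i = j
    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return len(diffs), t_plus, t_minus


sales = [85, 99, 120, 116, 138, 100, 129, 115, 141, 142, 121, 94, 78, 152, 97,
         73, 123, 119, 85, 128, 150, 124, 100, 101, 130, 119, 127, 96, 109, 83]
n, t_plus, t_minus = signed_rank_sums(sales, 120)

mu = n * (n + 1) / 4
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (min(t_plus, t_minus) - mu) / sigma       # here the smaller rank sum is T+
reject_h0 = abs(z) >= 1.96                    # assumed two-sided critical value at 5%

print(n, t_plus, t_minus, round(z, 3), reject_h0)
```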
SELF-CHECK 9.3
What are the differences between sign test and Wilcoxon signed-rank test?
EXERCISE 9.2
1. For the Wilcoxon signed-rank test, show that the rank sums of the positive
and negative differences satisfy T+ + T− = n(n + 1)/2, where n is the number of
non-zero differences assigned rank.
Use the signed-rank test at α = 0.05 to test whether the median gas
content is 98.5.
Use the sign test at α = 0.10 to determine whether these pipes follow
the specification required. Compare the result with the Wilcoxon
signed-rank test.
Carry out both the sign and signed-rank tests to determine whether
more than half of the random polynomial 0 – 1 problems require less
than or equal to 1 CPU second. Use α = 0.01.
• Sign test is an easier approach with simpler calculation, while signed-rank test
is more precise as it takes into account the magnitude of differences between
sample values and the specific value of interest in the test, apart from
information on sign or direction of difference.
• A test procedure for a single population, Wilcoxon signed-rank test, takes into
consideration the magnitudes of these differences where ranks are assigned to
observations based on these magnitudes.
Sign test – The sign test is designed to test a hypothesis about the location of a
population distribution. It is often used to test a hypothesis about a
population median and involves the use of matched pairs, for example,
before and after data, in which case it tests for a median difference of
zero.
Source: http://users.stat.ufl.edu/~winner/tables/wilcox_signrank.pdf
INTRODUCTION
A researcher often conducts observations to compare two populations under study,
such as whether both populations come from the same distribution. For example, in
a parametric test, two random samples, X1, ..., Xn1 and Y1, ..., Yn2, are obtained
from two normal populations with means μx and μy respectively and constant
variance. The researcher may be interested in testing H0 : μx = μy versus H1 : μx < μy.
If the null hypothesis holds, the researcher can conclude that both distributions are
distributed as normal with similar mean and variance; in other words, both samples
were taken from the same population.
On the other hand, if the alternative hypothesis is true, then μ x < μ y , that is, the
location parameter of X (selected as the mean) has a smaller value than the location
parameter of Y.
Hence, the X population distribution is located on the left side of the Y distribution.
The dispersion of X and Y distributions is still the same as both variances are
assumed constant.
For example, if we are comparing male and female students’ marks in a basic
statistics course, we are sure that the selection of the first sample from the male
students’ population will neither influence nor be influenced by the second sample
or any sample from the female students’ population, and vice-versa.
On the other hand, if we can associate or match two populations where the selection
of samples from the second population depends on the selection of samples from
the first population, both are said to be dependent.
is not putting on any safety tool (sample 2). In this case, both samples are related
and experimented on the same subject, that is, the same driver is used to obtain
measures of injuries when putting on the safety tools and when not putting on the
tools.
In testing two dependent populations, the sample size n1 and n2 must be equal,
that is n1 = n2 = n due to relationship or similarity in data source/measurement
obtained, as in the case of similar subjects and the comparison made is based on
paired comparison.
SELF-CHECK 10.1
ACTIVITY 10.1
Determine whether the following data represents two dependent
populations:
(a) The average number of accidents that occurred during work at a factory
before and after the implementation of a safety program.
(b) The nicotine content in cigarette brands X and Y.
Give other examples for both types of populations.
Discuss the answer in the myINSPIRE forum.
Example 10.1:
A company manager claims that night-shift workers tend to apply for more sick
leave compared to day-shift workers. Construct a hypothesis to test whether the
number of sick leaves for night-shift workers is higher than for day-shift workers.
Answer:
Define D1 as the distribution of sick leaves applied for by night-shift workers
and D2 as the distribution of sick leaves applied for by day-shift workers. From
the expression “higher”, we can conclude that H1 : D1 > D2 while H 0 : D1 = D2 .
Example 10.2:
It is claimed that students who were not provided sample examination problems
in advance will obtain lower marks compared to those who received them. To
test this claim, 20 students were selected so that each matched pair has almost the
same overall quality point average in other examinations. Construct a hypothesis
to test this claim.
Answer:
Define D1 as the distribution of marks for students who did not have the sample
problems and D2 as the distribution of marks for students who were provided
sample problems. From the expression “lower”, we can test whether
X 1 , X 2 , ..., X n1 values are smaller than Y1 , Y2 , ..., Yn 2 values or D1 is located on
the left side of D2 .
Example 10.3:
Twelve students identified as obese were put on a special diet believed to be able
to reduce body weight. Their weight before and after the diet was monitored for
a month. Construct a hypothesis to test whether this diet is effective.
Answer:
Define D1 as the students’ weight distribution before starting the diet and D2 as
the distribution of weight after the diet. If the diet is effective, the observation
values in D1 must be larger than the observation values in D2 . In other words,
D1 must be located on the right side of D2. Hence, the alternative hypothesis is
H1 : D1 > D2.
The weight distributions before and after the diet can also be viewed through
the distribution of the weight difference between the weight before and the
weight after, that is Di = Xi − Yi. The hypotheses can then be described
as H1 : Di = Xi − Yi > 0 and H0 : Di = 0. We will discuss this further in
Subtopic 10.3.
Some examples of expressions that provide clues in constructing suitable null and
alternative hypotheses for two independent populations are given in Table 10.1.
Table 10.1: Summary of Expression Terms with the Null and Alternative Hypotheses
• “Different than” H1 : D1 ≠ D2 H 0 : D1 = D2
• “Not equal to”
There are two methods for comparing two dependent populations, sign test and
Wilcoxon signed-rank test. Both methods were discussed in Topic 9 for single
population testing. Now, let us look at how to use them for two populations.
\[
\theta = \Pr(X_i > \tau_0) = \Pr(X_i - \tau_0 > 0) = \frac{1}{2}
\]
\[
\theta = \Pr(D_i > 0) = \frac{1}{2} = \Pr(D_i < 0)
\]
\[
H_0: \Pr(D_i > 0) = \Pr(D_i < 0) = \theta = \frac{1}{2} \quad \text{or} \quad H_0: \tau_1 - \tau_2 = 0
\]
Suppose that S represents the number of differences between Xi and Yi marked as “+”.
Then S follows a binomial distribution with θ = 1/2. Thus, the null hypothesis
statement for comparing two paired populations is H0 : θ = 1/2.
How do we perform a sign test for two dependent populations? Let us follow these
steps:
(a) Obtain differences between the first sample and its pair, the second sample.
(b) Assign the “+” or “–” sign according to the result of the differences. The
paired sample with zero difference is discarded.
(c) Count the number of “+” signs for a one-sided (right) test and the number of
“–” signs for a one-sided (left) test.
Next, the rejection region (RR) can be determined by using binomial probability
distribution.
Take note that the testing procedure for two dependent or paired populations is
similar to the testing procedure for a single population.
Example 10.4:
A manager aims to study whether a raise in employees’ salary will reduce the
number of defective products. Data on defective products before and after salary
increment are recorded. Construct appropriate null and alternative hypotheses
and state the test statistics.
Answer:
To determine whether a salary increment results in fewer defective products, test
H0 : θ = 1/2 (the distribution of defective products is the same before and after
salary increment) or H0 : τafter = τbefore, versus H1 : θ < 1/2 (there are fewer
defective products after salary increment than before the increment) or
H1 : τafter < τbefore. From the alternative hypothesis, the test is a one-sided (left)
test. Hence, test statistic = S2 = the number of differences between X and Y with
“–” sign.
Example 10.5:
A fast-food restaurant’s marketing department plans to identify whether a new
ingredient results in tastier fried chicken compared to the original ingredients.
Ten culinary experts are chosen at random to evaluate fried chicken cooked with
and without the new ingredient and are asked to rate the taste at a scale of 1 to 10
(1 represents least delicious and 10 very delicious). The results are as follows:
Answer:
There are two dependent samples since the evaluations of the fried chicken cooked
with original and new ingredients were made by the same subjects (culinary
experts). The hypotheses to test whether the two types of ingredients are different:
H0 : There is no change in the fried chicken’s taste or θ = 1/2 versus
H1 : The new ingredient resulted in tastier fried chicken or θ > 1/2.
Culinary   Original      New          Di = New Ingredient   Sign of
Expert     Ingredients   Ingredient   – Original            Difference
A          3             9            6                     +
B          5             5            0                     0
C          3             6            3                     +
D          1             3            2                     +
E          5             10           5                     +
F          8             4            –4                    –
G          2             2            0                     0
H          8             5            –3                    –
I          4             6            2                     +
J          6             7            1                     +
From the table above, we can see that 6 out of 10 culinary experts found that the
chicken tasted better with the new ingredient, with 2 saying that the original
ingredient tastes better while 2 other experts could not detect any difference.
In a one-sided right test, if H0 is true, a large number of “+” signs or a small number
of “–” signs will result in H0 rejection.
From the binomial table with n = 8 and θ = 0.5, Pr(S ≥ 6) = 0.1445. Since p-value
= 0.1445 is greater than the significance level value, α = 0.05, we do not reject
H0. In conclusion, the fried chicken cooked with the new ingredient is not
significantly tastier than the fried chicken cooked without it at the 0.05
significance level.
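For paired data such as Example 10.5, the sign test works on the differences within each pair. The sketch below takes the ratings from the table above and computes the one-sided (right) p-value; the variable names are illustrative.

```python
from math import comb

original = [3, 5, 3, 1, 5, 8, 2, 8, 4, 6]     # experts A to J, original ingredients
new = [9, 5, 6, 3, 10, 4, 2, 5, 6, 7]         # same experts, new ingredient

diffs = [b - a for a, b in zip(original, new)]
plus = sum(1 for d in diffs if d > 0)
minus = sum(1 for d in diffs if d < 0)
n = plus + minus                               # pairs with zero difference are discarded

# P(S >= plus) for S ~ Binomial(n, 0.5)
p_value = sum(comb(n, j) for j in range(plus, n + 1)) / 2 ** n
reject_h0 = p_value < 0.05

print(plus, minus, n, round(p_value, 4), reject_h0)
```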
small as 10, but for a more precise result, normal approximation is used for
n ≥ 15.
Example 10.6:
A total of 35 customers are randomly chosen and asked whether fried chicken
cooked with the new ingredient tastes differently from the fried chicken cooked
without it. The summary of the results are as follows:
“+” difference = 19
“–” difference = 13
“0” difference (no difference in evaluation) = 3
Answer:
To determine whether the new ingredient is preferable, test H 0 : θ = 0.5 versus
H1 : θ > 0.5.
\[
\text{Test statistic} = \frac{S - 0.5n}{0.5\sqrt{n}} = \frac{2S - n}{\sqrt{n}} = \frac{2(19) - 32}{\sqrt{32}} = 1.061
\]
From the standard normal table, the critical value z0.05 = 1.645 for one-sided test.
Since 1.061 < 1.645, do not reject the null hypothesis.
Remember that the signed-rank test is based on T+, that is, the sum of the ranks
assigned to the positive differences, or T−, that is, the sum of the ranks assigned to
the negative differences, or T, where T = minimum(T+, T−).
The steps in the Wilcoxon signed-rank test for two dependent populations are as
follows:
(c) Assign rank to the absolute values of the differences (1 for smallest
absolute difference; n for the largest).
(d) Differences with the equal value will be assigned the mean rank for ranks
that they jointly occupy.
(f) Zero differences are discarded; hence, the sample size will be reduced by the
number of differences with 0 values.
The test procedure for two paired populations with the signed-rank method is similar
to the test procedure for a single population. Let us look at Example 10.7.
Example 10.7:
In evaluating paper quality, paper smoothness is very important to ensure
customers’ acceptance. Suppose that 10 judges were each given two samples of
paper produced by a factory. The evaluations of the judges were assigned rank 1
to 10, with rank 10 representing the highest quality of smoothness. From the
evaluation results below, test for differences in quality for both paper products at
0.05 significance level.
Judge 1 2 3 4 5 6 7 8 9 10
Product A 6 8 4 9 4 7 6 5 6 8
Product B 4 5 5 8 1 9 2 3 7 2
Answer:
The hypotheses to test for differences in the quality of the two paper products are
H0 : The distributions of evaluation for products A and B are the same or
H0 : τ1 = τ2 versus H1 : The distributions of evaluation for products A and B
are different or H1 : τ1 ≠ τ2.
Applying the signed-rank method, the test procedure generated the following
results:
From the table, the sum of the ranks of the “+” differences is T+ = 46, while the sum
of the ranks of the “–” differences is T− = 9. Since the test is two-sided, the test
statistic is given by:
T = minimum(T+, T−) = T− = 9.
If the sample size is large, let us say n ≥ 15, the distribution of the test statistic T
(either T+ or T−) will approach the normal distribution. To perform the signed-rank
test for large n, T is a random variable with mean and variance,
\[
\mu = E(T) = \frac{n(n + 1)}{4}, \qquad \sigma^2 = \mathrm{Var}(T) = \frac{n(n + 1)(2n + 1)}{24}
\]
Hence, the signed-rank test statistic for n ≥ 15 is
\[
Z = \frac{T - \mu}{\sigma},
\]
which is standard normal. For a single-sided test, the test statistic is
\[
Z = \frac{T^- - \mu}{\sigma} \quad \text{or} \quad Z = \frac{T^+ - \mu}{\sigma}.
\]
Example 10.8:
A company producing energy drinks claimed that the drinks are effective in reducing
body weight. The following are the weights of 16 random samples before and after
four weeks of taking the drinks.
Weight Before 147.0 183.5 232.1 161.6 197.5 206.3 177.0 215.4
Weight After 137.9 176.2 219.0 163.8 193.5 201.4 180.6 203.2
Weight Before 147.7 208.1 166.8 131.9 150.3 197.2 159.8 171.6
Weight After 149.0 195.4 158.5 134.4 149.3 189.1 159.1 173.2
Use the signed-rank test to test at 0.05 level of significance whether the company’s
claim is true.
Answer:
The energy drinks are said to be effective if the weight distribution before > weight
distribution after, that is, the number of “+” sign is less than the number of “–” sign.
To determine whether the energy drinks are effective in reducing body weight, test:
$$H_0: \theta = \tfrac{1}{2} \quad \text{versus} \quad H_1: \theta < \tfrac{1}{2}$$
The Wilcoxon signed-rank procedure for testing two dependent populations gave
the following results:
For this one-sided test, the null hypothesis is rejected if the test statistic is less than the
critical value $-z_{0.05} = -1.645$:
$$Z = \frac{T^- - \mu}{\sigma} = \frac{25 - 68}{\sqrt{374}} = -2.22 < -1.645.$$
Since $z = -2.22 < -z_{0.05} = -1.645$, we reject the null hypothesis. We conclude that the
energy drinks are effective in reducing body weight at the 5% level of significance.
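The figures used in Example 10.8 ($T^- = 25$, $\mu = 68$, $\sigma^2 = 374$) can be reproduced with the sketch below. It assumes the differences are taken as weight before minus weight after, and it relies on the fact that this data set has no zero or tied absolute differences, so plain ranks 1 to 16 are enough.

```python
import math

before = [147.0, 183.5, 232.1, 161.6, 197.5, 206.3, 177.0, 215.4,
          147.7, 208.1, 166.8, 131.9, 150.3, 197.2, 159.8, 171.6]
after = [137.9, 176.2, 219.0, 163.8, 193.5, 201.4, 180.6, 203.2,
         149.0, 195.4, 158.5, 134.4, 149.3, 189.1, 159.1, 173.2]

# Differences (before - after), rounded to one decimal to remove float noise.
diffs = [round(b - a, 1) for b, a in zip(before, after)]

# Rank by absolute size; no ties occur here, so ranks are simply 1..n.
ranked = sorted(diffs, key=abs)
t_minus = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)

n = len(diffs)
mu = n * (n + 1) / 4
variance = n * (n + 1) * (2 * n + 1) / 24
z = (t_minus - mu) / math.sqrt(variance)
print(t_minus, mu, variance, round(z, 2))
```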
EXERCISE 10.1
1. Suppose that the differences in paired data are 15 “+” signs, 5 “–”
signs and 9 “0”, use the sign test for right-end test,
(a) What are the values for n and S?
(b) At 0.05 level of significance, will H 0 be rejected?
Before 3 3 1 5 3 6 2 0 4 3 4 1
After 1 2 3 2 0 4 3 2 1 2 3 0
Use the sign test to test that the new traffic-control system is more
effective than the old system at 0.05 level of significance.
Suppose there are n1 and n2 independent samples from population 1 and population
2 respectively. The procedures for the rank-sum test suggest that we combine n1 +
n2 = n observations and assign rank according to the observed magnitude. Rank 1
will be assigned to observation with the smallest value and rank n to the observation
with the highest value.
Mann and Whitney have suggested a test statistics which also uses the sum of the
ranks for both samples and it can be shown to be equivalent to the Wilcoxon test.
This test, known as the Mann-Whitney U test, has been used extensively since the
availability of table for U critical values.
If two populations are compared using the Mann-Whitney test, the following statistics
are used as the test statistic:
$$U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - W_1 \quad \text{or}$$
$$U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - W_2 \quad \text{or}$$
$$U = \min(U_1, U_2)$$
where $U_1 + U_2 = n_1 n_2$, while $W_1$ and $W_2$ are the sums of the ranks of the values of
the first and second samples respectively.
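As an illustration of these formulas, the sketch below computes $W_1$, $W_2$, $U_1$, $U_2$ and $U$ for the two groups listed in Exercise 10.2. The helper name `rank_sums` is hypothetical; tied observations receive the mean of the ranks they jointly occupy.

```python
group1 = [15, 21, 15, 23, 17, 14, 16]
group2 = [18, 22, 24, 25, 19, 24, 17, 19, 23, 16]


def rank_sums(sample1, sample2):
    """Return (W1, W2), the rank sums of each sample in the combined ordering."""
    combined = sorted(sample1 + sample2)

    def mean_rank(value):
        first = combined.index(value) + 1   # lowest rank occupied by this value
        count = combined.count(value)       # number of tied observations
        return first + (count - 1) / 2

    w1 = sum(mean_rank(v) for v in sample1)
    w2 = sum(mean_rank(v) for v in sample2)
    return w1, w2


n1, n2 = len(group1), len(group2)
w1, w2 = rank_sums(group1, group2)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - w1
u2 = n1 * n2 + n2 * (n2 + 1) / 2 - w2
u = min(u1, u2)
print(w1, w2, u1, u2, u)
```

The check $U_1 + U_2 = n_1 n_2$ holds here (56.5 + 13.5 = 70), and the rank sums agree with the values 41.5 and 111.5 used in the answer to Exercise 10.2.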
From the formulas for U1 and U 2 , U1 will be small when W1 is large. This can
only happen if the population 1 distribution is shifted to the right of the population
2 distribution. Hence, the test statistics U1 will be used when the alternative
hypothesis is D1 > D2 .
(a) Arrange all n = n1 + n2 data from both populations, where n1 is the sample
size for population 1 and n2 for population 2.
(c) Assign rank to the arranged data. Assign rank 1 for data with the smallest
value and n for the largest data.
(d) In the case of ties (identical observations), we replace the observations with
the mean of the ranks that the observations would have if they were
distinguishable (tie-rank).
If the samples chosen were from two identical populations, it was expected that the
sum of the ranks of both samples would not differ too much. If there is an
appreciable difference between the means of the two populations, most of the lower
ranks are likely to go to the values of one sample, while most of the higher ranks
are likely to go to the values of the other sample.
(a) State the null and alternative hypotheses for two independent populations
test.
One-sided test: H 0 : D1 and D2 are equal versus H1 : D1 has shifted to the
right of D2 [or H 0 : D1 and D2 are equal versus H1 : D1 has shifted to the left
of D2 ].
[or $U_2 = n_1 n_2 + \dfrac{n_2(n_2+1)}{2} - W_2$]
Two-sided test: U ≤ U α
Example 10.9:
To find out whether a new serum will arrest the seriousness of leukaemia, nine
laboratory rats with an advanced stage of the disease are selected. The survival
times (in years), from the time the experiment commenced, are as follows:
Answer:
Let D1 be the distribution of survival times of rats receiving treatment and D2
as the distribution of survival times for rats not receiving treatment. To determine
the effectiveness of serum treatment, test:
H 0 : D1 = D2 against H1 : D1 > D2
Original Data 0.5 0.9 1.4 1.9 2.1 2.8 3.1 4.6 5.3
Rank 1 2 3 4 5 6 7 8 9
When $n_1 \ge 15$ and $n_2 \ge 15$, the sampling distribution of $U$ approaches a normal
distribution with mean and variance
$$\mu_U = \frac{n_1(n_1+n_2+1)}{2} - \frac{n_1(n_1+1)}{2} = \frac{n_1 n_2}{2}, \qquad \sigma_U^2 = \frac{n_1 n_2 (n_1+n_2+1)}{12}.$$
Next, the statistic $Z = \dfrac{U - \mu_U}{\sigma_U}$, which is approximately standard normal, can be used
to make a decision about the test. Let us do Exercise 10.2 to strengthen your
understanding of this final topic.
EXERCISE 10.2
(b) n1 = 6 n2 = 4 w2 = 17
H 0 : D1 and D2 are equal H1: D1 and D2 are unequal
Group 1 15 21 15 23 17 14 16
Group 2 18 22 24 25 19 24 17 19 23 16
(a) Does the data support the claim that the campaign is
successful? Test the effectiveness of the campaign at 0.05
significance level.
At α = 0.05, can you prove that alcohol intake resulted in longer time
to solve the problem?
• Two populations for comparison are said to be independent when the samples taken
from one population are not dependent on or influenced by the samples chosen
from the other population.
• To compare two independent populations, the Wilcoxon rank test and the
Mann-Whitney test can be used.
Freund, J. E. (2003). Mathematical statistics (7th ed.). Upper Saddle River, NJ:
Prentice-Hall, Inc.
Mann, P. S. (2005). Introductory statistics using technology (5th ed.). Upper Saddle
River, NJ: John Wiley & Sons, Inc.
Answers
TOPIC 1: COMPARISON OF TWO POPULATION
MEANS
Exercise 1.1
Discuss the solutions in class.
Exercise 1.2
Step 1: The population parameters to be compared and tested are population
means μ A and μB .
Step 2: Both population distributions are assumed normal with both variances
unknown but having the same values.
We write:
H1 : μ A ≠ μB ; and
The complementary, H 0 : μ A = μ B ,
H 0 : μB − μ A = 0 vs. H1 : μB − μ A ≠ 0
Step 4: Even though the RR is not given, the level of significance α is given as
0.05. Thus, we can calculate the probability p-value or construct the RR
for the case with the following situation:
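The pooled variance is
$$S_B^2 = \frac{(n_A-1)s_A^2 + (n_B-1)s_B^2}{n_A+n_B-2} = \frac{(23)8^2 + (21)9^2}{44} = 72.11,$$
where $s_A^2$ and $s_B^2$ are the sample variances of groups A and B.
RR = {t : t ≤ –2.0154} ∪ {t : t ≥ 2.0154}
The standardised difference $\bar{X}_B - \bar{X}_A$ is a t score, where:
$$T = \frac{\bar{X}_B - \bar{X}_A}{S_B\sqrt{\dfrac{1}{n_B}+\dfrac{1}{n_A}}} \qquad (1.6)$$
Step 6: The test decision when $H_0$ is true. Based on the given sample, the value
of the score is given by (1.6) as:
$$T_B = \frac{\bar{X}_B - \bar{X}_A}{S_B\sqrt{\dfrac{1}{n_B}+\dfrac{1}{n_A}}} = \frac{78 - 80}{\sqrt{72.11}\,\sqrt{\dfrac{1}{22}+\dfrac{1}{24}}} = -0.7979$$
Test decision: Since $-2.0154 < T_B = -0.7979 < 2.0154$,
$H_0$ is not rejected.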
Step 7: Test conclusion: Sample information does not give enough evidence to
indicate the difference in mean score of groups A and B. The difference
obtained in the sample score is merely by chance.
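The arithmetic of Steps 4 to 6 can be retraced with the sketch below. The sample sizes (24 and 22), standard deviations (8 and 9), means (80 and 78) and the critical value 2.0154 are read off the worked answer; the variable names are illustrative only.

```python
import math

n_a, n_b = 24, 22            # sample sizes of groups A and B
s_a, s_b = 8.0, 9.0          # sample standard deviations
mean_a, mean_b = 80.0, 78.0  # sample means
critical = 2.0154            # two-sided critical value quoted in the answer

# Pooled variance under the equal-variance assumption.
pooled_var = ((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2)

# t score for the difference of means, as in equation (1.6).
t_score = (mean_b - mean_a) / (math.sqrt(pooled_var) * math.sqrt(1 / n_b + 1 / n_a))

reject_h0 = abs(t_score) >= critical
print(round(pooled_var, 2), round(t_score, 4), reject_h0)
```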
EXERCISE 3.1
The value 0.831 in figure (a) is found in column α = 0.975 and row v = 5, giving the
value $\chi^2_{0.975}(5) = 0.831$.
Figure (b) shows the value 23.68 given in Appendix 3.1 at column α = 0.05
and row v = 14, which means “point 0.05 of the $\chi^2(14)$ distribution is 23.68”,
i.e. $\chi^2_{0.05}(14) = 23.68$.
Exercise 3.2
Step 1: Determine the Test Parameter
The population parameter to be tested is population variance that is σ 2 .
H0 : σ 2 ≥ 42
H1 : σ 2 < 42
Exercise 3.3
1. For α = 5% = 0.05, 1% = 0.01 and 0.1% = 0.001 refer to the first, third and
fourth rows in the F distribution table for every pair of v1 and v2 . Thus,
2. Values on the right of the equation represent values in the distribution table.
Thus, determine their values based on suitability/accuracy value in the table
by referring to the intersection of the column and row according to the degrees
of freedom. Hence, we obtained:
Since the value 3.50 (at v1 = 6 and v2 = 14) is on the second row,
α = 0.025.
Since the value 2.93 (at v1 = 10 and v2 = 32) is on the second row,
α = 0.01.
Since the value 1.81 (at v1 = 24 and v2 = 38) is on the second row,
α = 0.05.
Since the value 5.61 (at v1 = 2 and v2 = 24) is on the second row,
α = 0.01.
Exercise 3.4
From the table, we obtain f 0.01,14,11 = 4.30 and f 0.01,11,14 = 3.87 (since α = 0.02).
Hence:
that is, $3.425 < \dfrac{\sigma_1^2}{\sigma_2^2} < 56.991$.
This means the assumption that the ratio $\dfrac{\sigma_1^2}{\sigma_2^2} = 1$ is not true because the interval estimate of the
ratio does not contain the value 1.
Exercise 3.5
1. The appropriate hypothesis testing using α = 0.02 to determine whether both
populations have equal variance.
Population 1 Population 2
(a) Shape: Normal (a) Shape: Normal
(b) Mean: μ1 (unknown) (b) Mean: μ2 (unknown)
(c) Standard deviation: σ1 (c) Standard deviation: σ 2
(unknown) (unknown)
Terms of Expression H1 H0
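$$H_0: \sigma_1^2 = \sigma_2^2 \iff H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$$
$$H_1: \sigma_1^2 \ne \sigma_2^2 \iff H_1: \frac{\sigma_1^2}{\sigma_2^2} \ne 1$$
The test statistic, $F = \dfrac{s_1^2}{s_2^2} = \dfrac{10.6}{7.3} = 1.45$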
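3. Use properties c(i) and c(ii). Since $X_1$ is distributed as $N(\mu_1, \sigma_1^2)$ (and similarly for $X_2$),
$z_1 = \dfrac{X_1 - \mu_1}{\sigma_1}$ and $z_2 = \dfrac{X_2 - \mu_2}{\sigma_2}$ are each distributed as standard normal,
N(0,1). Hence,
$$E\left(\frac{(X_1-\mu_1)^2}{\sigma_1^2} + \frac{(X_2-\mu_2)^2}{\sigma_2^2}\right) = E\left(Z_1^2 + Z_2^2\right) = E\left(\chi^2(2)\right) = 2$$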
Observe column α = 0.975 with row 40 in the table gives the value
of 24.43. Thus, χα2 = 24.43 (v = 40) is true at α = 0.975.
H 0 : σ 2 = 11
H 1 : σ 2 > 11
Reject $H_0$ if $\chi^2 > \chi^2_{95\%}(8)$.
Sample variance,
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{(9.1 - 10.73)^2 + \dots + (10.4 - 10.73)^2}{14} = 3.39$$
H 0 : σ 2 = 1.9
H 1 : σ 2 ≠ 1.9
Reject $H_0$ if $\chi^2 > \chi^2_{97.5\%}(14)$ or $\chi^2 < \chi^2_{2.5\%}(14)$.
Test statistic, $\chi^2 = \dfrac{(n-1)S^2}{\sigma^2} = \dfrac{14(3.39)}{1.9} = 24.98$
H 0 : σ 2 = 1.9
H 1 : σ 2 > 1.9
χ 2 = χ.5%
2
(14)
Test statistic, $\chi^2 = \dfrac{(n-1)S^2}{\sigma^2} = \dfrac{14(3.39)}{1.9} = 24.98$
$$\frac{1432}{3761} \cdot \frac{1}{2.327} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{1432}{3761} \cdot 2.408$$
that is,
$$0.1636 < \frac{\sigma_1^2}{\sigma_2^2} < 0.9168$$
Since the whole confidence interval lies below 1, we are 95% confident that
the channel 1 sequence has a smaller variance than the channel 2 sequence,
and it should therefore be chosen by the firm.
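The interval limits can be verified numerically. In the sketch below, 1432 and 3761 are assumed to be the two sample variances, and 2.327 and 2.408 are the F table values exactly as they appear in the answer; none of these are looked up independently.

```python
s1_sq, s2_sq = 1432.0, 3761.0   # assumed sample variances for channels 1 and 2
f_divisor = 2.327               # F value used for the lower limit
f_multiplier = 2.408            # F value used for the upper limit

ratio = s1_sq / s2_sq
lower = ratio / f_divisor
upper = ratio * f_multiplier

# If the whole interval is below 1, the data favour a smaller variance for channel 1.
interval_below_one = upper < 1
print(round(lower, 4), round(upper, 4), interval_below_one)
```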
Population 1 Population 2
(a) Shape: Normal (a) Shape: Normal
(b) Mean: μ1 (unknown) (b) Mean: μ2 (unknown)
(c) Standard deviation: σ 1 (c) Standard deviation: σ 2
(unknown) (unknown)
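$$H_0: \sigma_1^2 = \sigma_2^2 \iff H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$$
$$H_1: \sigma_1^2 \ne \sigma_2^2 \iff H_1: \frac{\sigma_1^2}{\sigma_2^2} \ne 1$$
Test statistic, $F = \dfrac{s_1^2}{s_2^2} = \dfrac{12.4}{19.3} = 0.64$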
Population 1 Population 2
(a) Shape: Normal (a) Shape: Normal
(b) Mean: μ1 (unknown) (b) Mean: μ2 (unknown)
(c) Standard deviation: σ 1 (c) Standard deviation: σ 2
(unknown) (unknown)
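$$H_0: \sigma_1^2 = \sigma_2^2 \iff H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$$
$$H_1: \sigma_1^2 > \sigma_2^2 \iff H_1: \frac{\sigma_1^2}{\sigma_2^2} > 1$$
Test statistic, $F = \dfrac{s_1^2}{s_2^2} = \dfrac{177.9}{62.39} = 2.85$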
Exercise 4.1
Data 1 Data 2
From the two figures, both data sets illustrate changes between and within samples
for the variables. In Data 1, changes between samples are larger than changes within
samples.
However, Data 2 shows that changes between samples is not much different from
changes within samples.
Thus, significance tests must be performed. A statistical test used to examine the
equality in population mean should be able to differentiate the between and within
samples variations. Thus, we need to calculate the between and within samples
variations.
Exercise 4.2
The following result is obtained:
Sample                      1        2        3        4
                           65       75       59       94
                           87       69       78       89
                           73       83       67       80
                           79       81       62       88
                           81       72       83        –
                           69       79       76        –
                            –       90        –        –
Total                     454      549      425      351
The Number of Students      6        7        6        4
Average                 75.67    78.43    70.83    87.75
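Total sum of squares:
$$SST = \sum_{j=1}^{4}\sum_{i=1}^{n_j}\left(x_{i,j} - \bar{x}\right)^2 = (65 - 77.35)^2 + (75 - 77.35)^2 + \dots + (88 - 77.35)^2 = 1{,}909.2$$
Between-samples (treatment) sum of squares:
$$SS(Tr) = \sum_{j=1}^{4} n_j\left(\bar{x}_j - \bar{x}\right)^2 = 712.6$$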
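The two sums of squares can be recomputed directly from the four samples in the table. The sketch below is a plain-Python illustration; the within-samples (error) sum of squares is obtained by subtraction and is not a figure quoted in the answer.

```python
samples = {
    1: [65, 87, 73, 79, 81, 69],
    2: [75, 69, 83, 81, 72, 79, 90],
    3: [59, 78, 67, 62, 83, 76],
    4: [94, 89, 80, 88],
}

all_values = [x for group in samples.values() for x in group]
grand_mean = sum(all_values) / len(all_values)

# Total variation: every observation against the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-samples variation: each sample mean against the grand mean.
ss_treatment = sum(
    len(group) * (sum(group) / len(group) - grand_mean) ** 2
    for group in samples.values()
)

# Within-samples variation follows by subtraction.
ss_error = ss_total - ss_treatment
print(round(grand_mean, 2), round(ss_total, 1), round(ss_treatment, 1))
```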
Exercise 4.3
In an ANOVA test, the critical area is determined by α, the degrees of freedom for
treatments and degrees of freedom for errors. Steps 2 and 3 (in Subtopic 4.2.1) need
to be understood prior to getting the critical value.
Exercise 4.4
1. The relevant factors are classes with factor levels or type of class that the
students are in.
4. We obtained:
(a) The critical value for the ANOVA test at α = 0.01 when there are six
samples with 34 observations in total is $F_{5,28,0.01} = 3.75$. This comes from
α = 0.01, the degrees of freedom for the numerator = k – 1 = 6 – 1 = 5 and
the degrees of freedom for the denominator = N – k = 34 – 6 = 28.
(b) The critical value for the ANOVA test at α = 0.05 when there are four
samples with 44 observations is $F_{3,40,0.05} = 2.84$. This comes from α = 0.05, the
degrees of freedom for the numerator = k – 1 = 4 – 1 = 3 and the degrees of
freedom for the denominator = N – k = 44 – 4 = 40.
5. We obtained:
$$F = \frac{MS(Tr)}{MSE} = \frac{35.7}{14.6} = 2.45$$
$$F = \frac{MS(Tr)}{MSE} = \frac{215.23}{73.81} = 2.92$$
6. The relevant factor is the socio-economic status and the factor level is socio-
economic level.
Exercise 5.1
Step 1: Construct Appropriate Hypothesis Statement
H1 : Otherwise.
Day          Observed Frequency, O    Expected Frequency, E    O – E    (O – E)²/E
Monday       12                       10                        2       0.4
Tuesday       9                       10                       –1       0.1
Wednesday    11                       10                        1       0.1
Thursday     10                       10                        0       0
Friday        9                       10                       –1       0.1
Saturday      9                       10                       –1       0.1
Total        60                       60                                0.8
Thus, we obtained $X^2 = \sum \dfrac{(O-E)^2}{E} = 0.8$.
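The goodness-of-fit statistic in the table can be recomputed in a few lines. This sketch assumes, as the table does, that each of the six days has the same expected frequency of 60/6 = 10.

```python
observed = [12, 9, 11, 10, 9, 9]           # Monday to Saturday
total = sum(observed)
expected = [total / len(observed)] * len(observed)   # equal frequencies under H0

contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)
df = len(observed) - 1
print(contributions, round(chi_sq, 2), df)
```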
Exercise 5.2
1. Solve the question using the following steps:
⇔ $H_0: X \sim b(5, p)$ where $p = \dfrac{\bar{x}}{n} = \dfrac{3.32}{5} = 0.664$
n 5
H1 : Otherwise.
X 0 1 2 3 4 5 or 6
Expectation 0.4 4.2 16.7 33.1 32.7 12.9
Since the expected frequencies of the first two classes (X = 0 and 1) are less
than 5, both are combined with the class X = 2, giving a combined expected
frequency of 21.3 (that is, 0.4 + 4.2 + 16.7). The observed and expected
frequencies for the subsequent analysis are then as in the following table:
X ≤2 3 4 5 Total
Observed 21 33 31 15 100
Expected 21.3 33.1 32.7 12.9 100
and
$$X^2 = \frac{(21-21.3)^2}{21.3} + \frac{(33-33.1)^2}{33.1} + \frac{(31-32.7)^2}{32.7} + \frac{(15-12.9)^2}{12.9} = 0.43$$
X 0 1 2 3 4 or more Total
Observed 102 114 74 28 12 330
Expected 99.39 119.27 71.56 28.63 8.59 330
Hence, we obtained:
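$$X^2 = \sum\frac{(O-E)^2}{E} = \frac{(102-99.39)^2}{99.39} + \frac{(114-119.27)^2}{119.27} + \frac{(74-71.56)^2}{71.56} + \frac{(28-28.63)^2}{28.63} + \frac{(12-8.59)^2}{8.59} = 1.75$$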
⇔ $H_0: X \sim N(\mu, \sigma^2)$, where $\bar{x} = \hat{\mu} = 134.356$ and $s = \hat{\sigma} = 6.195$
H1 : Otherwise.
Thus,
$$X^2 = \sum\frac{(O-E)^2}{E} = \frac{(8-6.63)^2}{6.63} + \frac{(10-10.55)^2}{10.55} + \dots + \frac{(2-1.38)^2}{1.38} = 0.83$$
Exercise 5.3
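Using the formula for expected frequencies, $E_{ij} = N \cdot \dfrac{B_i}{N} \cdot \dfrac{L_j}{N} = \dfrac{B_i L_j}{N}$, where $B_i$ and $L_j$
denote the total of row $i$ and the total of column $j$:
                        Type of Car
Age            Local-made                 Import                      Total
>30            110 × 99/200 = 54.45       110 × 101/200 = 55.55       110
30 and above   90 × 99/200 = 44.55        90 × 101/200 = 45.45        90
Total          99                         101                         200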
Exercise 5.4
Step 1: Construct Appropriate Hypothesis Statement
Thus, v = (2 – 1) × (2 – 1) = 1
Since v = 1 degree of freedom, $\chi^2_{0.05} = 3.841$.
(a) The following table gives the expected frequency (and observed
frequency data):
                        Type of Car
Age            Local Made      Import          Total
>30            68 (54.45)      42 (55.55)      110
30 and above   31 (44.55)      59 (45.45)      90
Total          99              101             200
(expected frequencies in parentheses)
$$X^2 = \sum\frac{(O-E)^2}{E} = \frac{(68-54.45)^2}{54.45} + \frac{(42-55.55)^2}{55.55} + \dots + \frac{(59-45.45)^2}{45.45} = 14.84$$
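The expected frequencies of Exercise 5.3 and the statistic of Exercise 5.4 can be reproduced together. The sketch takes the fourth observed count to be 59, which is the value used in the statistic and the only one consistent with the row total 90 and the column total 101.

```python
observed = [[68, 42],   # first age group: local-made, import
            [31, 59]]   # second age group: local-made, import

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: (row total x column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
reject_h0 = chi_sq > 3.841   # critical value for v = 1 at the 0.05 level
print(expected, round(chi_sq, 2), df, reject_h0)
```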
Exercise 5.5
The information can be summarised in the table below:
The variables classified are the tendency/interest on type of sport. Populations are
male and female students. The testing method follows several steps:
Exercise 5.6
1. Solve the question using the following steps:
H1 : Otherwise.
                    Colour Blindness
Category    Normal            Colour Blind    Total
Male        2,210 (2,280)     190 (120)       2,400
Female      2,540 (2,470)     60 (130)        2,600
Total       4,750             250             5,000
(expected frequencies in parentheses)
(b) Thus,
H1 : Otherwise.
Reject $H_0$ if $X^2 = \sum\dfrac{(O-E)^2}{E} > \chi^2_{1\%}(5) = 15.086$, with v = Number of days – 1 =
6 – 1 = 5 degrees of freedom.
Day          Observed Frequency, O    Expected Frequency, E    O – E    (O – E)²/E
Monday       204                      258                      –54      11.3023
Tuesday      292                      258                       34       4.4806
Wednesday    242                      258                      –16       0.9923
Thursday     283                      258                       25       2.4225
Friday       252                      258                       –6       0.1395
Saturday     275                      258                       17       1.1202
Total        1,548                    1,548                      0      20.457
Thus, $X^2 = \sum\dfrac{(O-E)^2}{E} = 20.46$
3. It is known that,
Thus,
Number of Goals, X          0      1      2      3      4      5      6      7      8
Expected Number of Teams    2.883  7.583  9.971  8.741  5.747  3.023  1.325  0.498  0.164
Thus, we get:
$$X^2 = \sum\frac{(O-E)^2}{E} = \frac{(8-10.466)^2}{10.466} + \dots + \frac{(6-5.01)^2}{5.01} = 3.73$$
H1 : Otherwise.
X 0 1 2 3 Total
Observed 22 37 20 21 100
Expected 16.807 36.015 30.87 16.31 100
and
$$X^2 = \frac{(22-16.807)^2}{16.807} + \frac{(37-36.015)^2}{36.015} + \frac{(20-30.87)^2}{30.87} + \frac{(21-16.31)^2}{16.31} = 6.81$$
H1 : Otherwise.
Thus,
$$X^2 = \sum\frac{(O-E)^2}{E} = \frac{(16-15.505)^2}{15.505} + \frac{(21-25.886)^2}{25.886} + \dots + \frac{(10-15.505)^2}{15.505} = 7.945$$
Thus, v = (3 – 1) × (3 – 1) = 4
Since there are v = 4 degrees of freedom, we have $\chi^2_{0.05} = 9.488$
                                 Year of Study
Average Grade Value    Year 1        Year 2        Year 3        Total
< 2.0                  14 (15)       16 (15)       15 (15)       45
2.0–3.0                10 (10.67)    11 (10.67)    11 (10.67)    32
> 3.0                  26 (24.33)    23 (24.33)    24 (24.33)    73
Total                  50            50            50            150
(expected frequencies in parentheses)
$$X^2 = \frac{(14-15)^2}{15} + \frac{(16-15)^2}{15} + \dots + \frac{(24-24.33)^2}{24.33} = 0.39$$
H1 : Otherwise.
Category
1 2 3 4 Total
1 16 (19.67) 38 (38.33) 5 (10.67) 41 (31.33) 100
Population 2 24 (19.67) 41 (38.33) 12 (10.67) 23 (31.33) 100
3 19 (19.67) 36 (38.33) 15 (10.67) 30 (31.33) 100
(b) Hence, we can calculate the value of test statistics and obtain:
$$X^2 = \frac{(16-19.67)^2}{19.67} + \frac{(38-38.33)^2}{38.33} + \dots + \frac{(30-31.33)^2}{31.33} = 12.184$$
H1 : Otherwise.
(b) Thus,
TOPIC 6: CORRELATION
Exercise 6.1
(a) 1, +
(b) 2, +
(c) 2, –
Exercise 6.2
$x_i$   $y_i$   $x_i y_i$   $x_i^2$   $y_i^2$
1 2 2 1 4
2 3 6 4 9
4 4 16 16 16
5 7 35 25 49
6 12 72 36 144
8 10 80 64 100
10 7 70 100 49
Total 36 45 281 246 371
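$$r_p = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$
$$= \frac{7(281) - (36)(45)}{\sqrt{\left(7(246) - (36)^2\right)\left(7(371) - (45)^2\right)}} = 0.703$$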
The Pearson correlation coefficient value of 0.703 shows a strong positive linear
relationship between the frequency of fertiliser usage and crop yields. This means
that the more frequently the farmer distributes the fertiliser, the higher the crop
yield produced.
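The column totals and the coefficient above can be checked with a short sketch that applies the same computational formula to the seven data pairs in the table.

```python
import math

x = [1, 2, 4, 5, 6, 8, 10]     # frequency of fertiliser usage
y = [2, 3, 4, 7, 12, 10, 7]    # crop yield

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
r_p = numerator / denominator
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2, round(r_p, 3))
```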
Exercise 6.3
A one-sided hypothesis test (since the Pearson correlation coefficient value is
positive) is as follows:
H0 : ρp = 0
H1 : ρp > 0
Test statistic: $T = r_p\sqrt{\dfrac{n-2}{1-r_p^2}} = 0.703\sqrt{\dfrac{7-2}{1-(0.703)^2}} = 2.21$
Reject $H_0$ when $T > t_{0.01,5} = 3.365$.
Since the test statistic T < 3.365, we cannot reject the null hypothesis. This means
we do not have enough evidence to say that the Pearson correlation coefficient
is not zero; that is, there is no evidence of a significant relationship between the
frequency of fertiliser distribution and crop yield at the 1% significance level.
Exercise 6.4
$$r_s = 1 - \frac{6\sum D_i^2}{n\left(n^2-1\right)} = 1 - \frac{6(74)}{10\left((10)^2 - 1\right)} = 0.5515$$
The Spearman correlation coefficient value 0.5515 shows there exists a strong
positive linear relationship between athletes’ ranking and their position in a match.
Exercise 6.5
A one-sided hypothesis test (since the Spearman correlation coefficient value is
positive) is as follows:
H0 : ρs = 0
H1 : ρs > 0
Test statistic: $T = r_s\sqrt{\dfrac{n-2}{1-r_s^2}} = 0.5515\sqrt{\dfrac{10-2}{1-(0.5515)^2}} = 1.87$
Reject $H_0$ when $T > t_{0.01,8} = 2.896$.
Since the test statistics T < 2.896, we cannot reject the null hypothesis. This means
there is not enough evidence to say there exists a significant relationship between
athlete ranking and their position in a match at 1% significance level.
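The Spearman coefficient of Exercise 6.4 and the two test statistics of Exercises 6.3 and 6.5 share the same arithmetic, sketched below. The function name `correlation_t` is hypothetical; the critical values 3.365 and 2.896 are the ones quoted in the answers.

```python
import math


def correlation_t(r, n):
    """t statistic for testing a zero correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r**2))


# Spearman coefficient from the sum of squared rank differences (Exercise 6.4).
n_athletes, sum_d_sq = 10, 74
r_s = 1 - 6 * sum_d_sq / (n_athletes * (n_athletes**2 - 1))

t_pearson = correlation_t(0.703, 7)          # Exercise 6.3
t_spearman = correlation_t(r_s, n_athletes)  # Exercise 6.5

reject_pearson = t_pearson > 3.365
reject_spearman = t_spearman > 2.896
print(round(r_s, 4), round(t_pearson, 2), round(t_spearman, 2))
```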
Exercise 7.1
x       y       ŷ = –12.84 + 36.18x       ε = y – ŷ
8.3 227 287.454 –60.454
8.3 312 287.454 24.546
12.1 362 424.938 –62.938
12.1 521 424.938 96.062
17.0 640 602.22 37.78
47.0 539 1,687.62 –1,148.62
17.0 728 602.22 125.78
24.3 945 866.334 78.666
24.3 738 866.334 –128.334
24.3 759 866.334 –107.334
33.6 1,263 1,202.81 60.192
Exercise 7.2
x y xy x2 y2
60 63.6 3,816.0 3,600 4,044.96
62 65.2 4,042.4 3,844 4,251.04
64 66.0 4,224.0 4,096 4,356.00
65 65.5 4,257.5 4,225 4,290.25
66 66.9 4,415.4 4,356 4,475.61
67 67.1 4,495.7 4,489 4,502.41
68 67.4 4,583.2 4,624 4,542.76
70 68.3 4,781.0 4,900 4,664.89
72 70.1 5,047.2 5,184 4,914.01
74 70.0 5,180.0 5,476 4,900.00
Total 668.0 670.1 44,842.4 44,794 44,941.90
From the table, we need to calculate the $\bar{x}$ and $\bar{y}$ values first, that is:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{668.0}{10} = 66.8 \quad \text{and} \quad \bar{y} = \frac{\sum y_i}{n} = \frac{670.1}{10} = 67.01$$
Now, we can get the $\hat{\beta}_1$ regression coefficient using the following formula:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \frac{44{,}842.4 - 10(66.8)(67.01)}{44{,}794 - 10(66.8)^2} = 0.465$$
β̂1 = 0.465 shows that the y value will increase by 0.465 for each one unit increase
in x. βˆ0 = 35.95 refers to the y value when the x value is zero.
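The slope and intercept can be recomputed from the ten data pairs. In this sketch the intercept is obtained from the usual relation $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$; with the unrounded slope it comes to about 35.98, while using the slope rounded to 0.465 gives the 35.95 quoted in the answer.

```python
x = [60, 62, 64, 65, 66, 67, 68, 70, 72, 74]
y = [63.6, 65.2, 66.0, 65.5, 66.9, 67.1, 67.4, 68.3, 70.1, 70.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
s_xx = sum(a * a for a in x) - n * x_bar**2

beta1 = s_xy / s_xx                            # slope
beta0 = y_bar - beta1 * x_bar                  # intercept, unrounded slope
beta0_rounded = y_bar - round(beta1, 3) * x_bar  # intercept as in the answer
print(round(beta1, 3), round(beta0, 2), round(beta0_rounded, 2))
```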
Exercise 7.3
(a) The hypothesis test (one-sided test since β̂1 value is positive):
H0 : β1 = 0
H1 : β1 > 0
Test statistic: $T = \dfrac{\hat{\beta}_1}{s(\hat{\beta}_1)} = \dfrac{0.465}{0.033} = 14.085$
Reject H 0 when
T > t0.05,8 = 1.86
s( β̂1 ) is the standard deviation for β̂1 sampling distribution. The formula to
get the standard deviation for β̂1 is:
= 0.033
Since the test statistics T > t0.05,8 = 1.86, we reject the null hypothesis. We
have enough evidence to say that β1 value is not zero but positive.
(b) The 99% confidence interval for β1 is as follows (with α = 0.01 and t0.005,8 =
3.355):
$$\hat{\beta}_1 - t_{0.005,8}\, s(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{0.005,8}\, s(\hat{\beta}_1)$$
$$0.465 - 3.355(0.033) \le \beta_1 \le 0.465 + 3.355(0.033)$$
$$0.354 \le \beta_1 \le 0.576$$
Exercise 7.4
The coefficient of determination is:
$$R^2 = \frac{\hat{\beta}_0\sum y_i + \hat{\beta}_1\sum x_i y_i - n\bar{y}^2}{\sum y_i^2 - n\bar{y}^2} = \frac{35.95(670.1) + 0.465(44{,}842.4) - 10(67.01)^2}{44{,}941.9 - 10(67.01)^2} = 0.961$$
This means that 96.1% of variation in y can be explained by variation in x and only
3.9% of variation in y is explained by other factors.
Exercise 7.5
(a) There is no particular pattern in this plot. We found that the model has random
error with constant variance. Hence, there is no violation from the linear
model assumptions.
Exercise 7.6
(a)
The plot shows the regression model is in reciprocal function form. Hence,
transformation is x* = 1/x and the linear regression model is y = 2.67 – 0.68x*.
(b)
(c)
(d)
Exercise 7.7
Refer to Exercise 7.2, the simple linear regression model.
ŷ = 35.95 + 0.465x
To get the standard error of the estimate, we need the ŷ values. Using the
regression model ŷ = 35.95 + 0.465x, the ŷ value for each x value is shown in the
table below:
x 60 62 64 65 66 67
ŷ 63.6 65.2 66.0 65.5 66.9 67.1
Hence,
$$s_\varepsilon = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{n-2}} = \sqrt{\frac{1.494}{8}} = 0.432$$
The 99% level gives α = 0.01 and $t_{\alpha/2} = t_{0.005} = 3.355$. Hence, for
$x_g = 86$, the prediction interval is
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{\left(x_g - \bar{x}\right)^2}{\sum\left(x_i - \bar{x}\right)^2}}$$
$$75.94 \pm (3.355)(0.432)\sqrt{1 + \frac{1}{10} + \frac{(86 - 66.8)^2}{171.6}}$$
$$75.94 \pm 2.61$$
It is found that the lower and upper limits of the 99% prediction interval are 73.33
and 78.55 respectively. This means the predicted y value is 73.33 units at the
minimum and 78.55 units at the maximum for x = 86.
Exercise 7.8
The values of $t_{\alpha/2}$ and $s_\varepsilon$ can be obtained from Exercise 7.7, and the fitted value for
$x_g = 69$ is ŷ = 35.95 + 0.465(69) = 68.04. Hence, the interval for the
expected value of y for $x_g = 69$ is:
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{\left(x_g - \bar{x}\right)^2}{\sum\left(x_i - \bar{x}\right)^2}}$$
$$68.04 \pm (3.355)(0.432)\sqrt{\frac{1}{10} + \frac{(69 - 66.8)^2}{171.6}}$$
$$68.04 \pm 0.52$$
It is found that the lower and upper limits of the confidence interval are 67.52 and
68.56 respectively. This means the expected y value is 67.52 units at the minimum
and 68.56 units at the maximum for x = 69.
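The two interval formulas of Exercises 7.7 and 7.8 differ only in the extra "1 +" under the square root. The sketch below uses the rounded quantities from the answers (35.95, 0.465, 0.432, 3.355 and 171.6); the function name `interval` and its `prediction` flag are illustrative.

```python
import math

beta0, beta1 = 35.95, 0.465          # fitted regression line
n, x_bar, s_xx = 10, 66.8, 171.6     # sample size, mean of x, sum of (x - x_bar)^2
s_eps, t_crit = 0.432, 3.355         # standard error of estimate, t(0.005, 8)


def interval(x_g, prediction):
    """Return (fitted value, margin) for a prediction or mean-response interval."""
    y_hat = beta0 + beta1 * x_g
    extra = 1.0 if prediction else 0.0   # individual prediction adds the error variance
    margin = t_crit * s_eps * math.sqrt(extra + 1 / n + (x_g - x_bar) ** 2 / s_xx)
    return y_hat, margin


y86, m86 = interval(86, prediction=True)    # Exercise 7.7
y69, m69 = interval(69, prediction=False)   # Exercise 7.8
print(round(y86, 2), round(m86, 2), round(y69, 2), round(m69, 2))
```

For $x_g = 69$ the fitted line gives ŷ ≈ 68.04, so the interval for the expected value of y is centred there.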
Exercise 8.1
(a) For i = 4, $y_4 = 29$; $\hat{y}_4 = 9.430 + 5.266(2.4) + 2.0612(4) = 30.3132 \approx 30.31$, and the
error is $\varepsilon_4 = y_4 - \hat{y}_4 = 29.0 - 30.31 = -1.31$, an “over-estimate”.
Exercise 8.2
1.
2. Data (a)
Y X1 X2
10 2 5
24 3 6
40 7 6
20 3 5
15 4 3
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.999906
R square 0.999812
Adjusted R square 0.999435
Standard error 0.25713
Observations 4
ANOVA
Df SS MS F Significance F
Regression 2 350.6839 175.3419 2652.047 0.013729
Residual 1 0.066116 0.066116
Total 3 350.75
2. Data (b)
Y X1 X2
10 2 2
25 4 6
30 4 8
20 3 6
15 4 3
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.997365
R square 0.994737
Adjusted R square 0.984211
Standard error 0.811107
Observations 4
ANOVA
df SS MS F Significance F
Regression 2 124.3421 62.17105 94.5 0.072548
Residual 1 0.657895 0.657895
Total 3 125
Exercise 9.1
1. To test whether the population median exceeds 160, test:
Replacing values greater than 160 with “+” sign and values less than 160 with
“–“ sign, we will get,
+ + + + + – – + + + – + + – + + + + +
2. It is known that both S1 and S2 are distributed as binomial with n sample size
and θ = 0.5. For a binomial variable X with n and θ = 0.5, Pr(x ≥ a) =
Pr(x ≤ n – a) since the distribution is symmetrical. Hence, Pr(S1 ≥ c) =
Pr( S2 ≤ n – c).
Exercise 9.2
1. It is known that T + and T − are sum of rank differences with positive and
negative signs respectively. Hence, the sum of rank differences for both ranks
without taking into consideration of “+” or “–” signs is sum of all possible
ranks, that is:
The signed-rank test for small sample size is performed since n = 14. From
the calculation above, the total number of negative differences, T + = 10.
When n = 14, the critical value T0.01 = 13. Since T + is not < T0.01 , hence, we
do not reject H 0 that median hydrocarbon content is 98.5.
4.
Rank Rank
yi yi – 1 yi yi – 1
( yi – 1) ( yi – 1)
Sum of differences with “–” sign is T − = 262, Hence, the test statistics:
$$z = \frac{T^- - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} = \frac{262 - \dfrac{25(26)}{4}}{\sqrt{\dfrac{25(26)(51)}{24}}} = \frac{262 - 162.5}{37.1652} = 2.6772$$
Since Z = 2.6772 is not < $z_\alpha$ = 2.33, we cannot reject $H_0$. The same decision
can be obtained from the sign test.
Exercise 10.1
1. (a) n = Total positive and negative signs = 15 + 5 = 20 (0 or tie is not
counted). For one-sided (right) test, S = Number of positive signs = 15
Sign test:
The test statistic is S = number of “+” signs. By replacing positive differences
with a “+” sign and negative differences with a “–” sign, you will get:
+ + + + + + – + – + + +. With this, n = 12 and S = 10. Using the binomial
distribution table with θ = ½,
Pr(S ≥ 10) = 1 – Pr(S ≤ 9) = 1 – 0.9807 = 0.0193
Since p-value = 0.0193 < 0.05 = α, we reject $H_0$. In conclusion, the new
traffic control system is more effective in reducing the number of accidents at
dangerous junctions at the 0.05 significance level.
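The table lookup Pr(S ≥ 10) for n = 12 and θ = ½ can be confirmed by summing the binomial probabilities directly. The function name `upper_tail` is hypothetical; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def upper_tail(n, s, theta=0.5):
    """Pr(S >= s) for a binomial variable with n trials and success probability theta."""
    return sum(comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(s, n + 1))


p_value = upper_tail(12, 10)
reject_h0 = p_value < 0.05
print(round(p_value, 4), reject_h0)
```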
Exercise 10.2
1. (a) w2 = [(8)(9)/2] – 8 = 28 Y2 = 15 + [(3)(4)/2] – 8 = 13
Y1 = 15 + [(5)(6)/2] – 28 = 2
H1 : Students from video program group and solving real problem obtain
higher score.
α = 0.01
hence, $w_2 = \dfrac{(17)(18)}{2} - 41.5 = 111.5$
From the U table, the critical value at α = 0.01 for a one-sided test with n1 = 7 and
n2 = 10 is 11. Since 13.5 is not < 11, we do not reject the null hypothesis. In
conclusion, there is no significant difference in the scores of the two groups at the
0.01 significance level.
12 5 2 +
Η 0 : Time taken to solve the problem is the same for both groups or D1 = D2
versus H1 : Subjects taking alcohol take longer time to solve the problem or
D1 < D2 .
α = 0.05
Thus, test statistics Y1 with n1 = 9 and n2 = 10, the critical value for α = 0.05
is 24. Since 16 < 24, reject the null hypothesis. Alcohol does have effect on
individuals’ thinking ability at 0.05 level.