
FM Chapter 11-3

Chapter 11 discusses non-parametric tests, focusing on various types such as the sign test and Wilcoxon signed-rank test, which do not assume normality and are used for analyzing continuous data. It provides detailed examples of single-sample tests, including hypotheses formulation and calculations for determining significance levels. The chapter emphasizes the importance of understanding the assumptions and conditions under which these tests are applicable.

Uploaded by

wentaodu010808
Copyright
© © All Rights Reserved

Chapter 11: Non-parametric tests

11.1 Non-parametric tests

Type of test: Single sample

Sign test
• The underlying data are continuous
• The data are independent

Wilcoxon signed-rank test
• The underlying data are symmetric
• The underlying data are continuous
• The data are independent

Type of test: Two sample

Paired sign test
• The data are in matched pairs
• The differences between matched pairs are continuous
• The data are independent

Wilcoxon matched-pairs signed-rank test
• The data are in matched pairs
• The differences between matched pairs are symmetric
• The differences between matched pairs are continuous
• The data are independent

Wilcoxon rank-sum test
• The two samples are independent
• The underlying data are symmetric
• The underlying data are continuous

11.2 Single-sample sign tests

Key Point 11.1


Given n data points, a single-sample sign test is created using X ∼ Bin(n, 0.5). The test statistic can be the number of + signs, that
is, the number of data points greater than the median. We can calculate the probability that X is above this test statistic, below this
test statistic, or either in the case of a two-tailed test.
This can be expressed as
P (X ≤ ts | X ∼ Bin(n, 0.5)) or P (X ≥ ts | X ∼ Bin(n, 0.5))
where ts stands for the test statistic.

WORKED EXAMPLE 11.1
It is believed that the following dataset comes from a population with median 135.

150 130 125 140 170


140 190 180 175 165
160 130 140 140 145
Perform a single-sample sign test, at the 5% significance level, to test this claim.
Answer

H0 : The population median is 135.


H1 : The population median is not 135.
First, state the hypotheses. Notice that this is a two-tailed test.

Value Sign Value Sign


150 + 140 +
140 + 140 +
160 + 175 +
130 − 140 +
190 + 170 +
130 − 165 +
125 − 145 +
180 +
Here, the test statistic is 12, as there are 12 values above the stated median.
Consider X ∼ Bin(15, 0.5):

\[
P(X \ge 12) = \binom{15}{12}(0.5)^{15} + \binom{15}{13}(0.5)^{15} + \binom{15}{14}(0.5)^{15} + \binom{15}{15}(0.5)^{15} = 0.017578
\]

Since 0.017578 < 0.025, the test statistic of 12 is in the critical region and, therefore, we reject H0 .
There is sufficient evidence to suggest the population median is not 135.
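
The calculation in Worked Example 11.1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `binom_upper_tail` and the variable names are illustrative assumptions, not part of the original text.

```python
from math import comb

def binom_upper_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Bin(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k, n + 1))

data = [150, 130, 125, 140, 170,
        140, 190, 180, 175, 165,
        160, 130, 140, 140, 145]
median_0 = 135  # hypothesised median under H0

# A value equal to the hypothesised median would carry no sign; none occur here.
n = sum(1 for x in data if x != median_0)
n_plus = sum(1 for x in data if x > median_0)   # test statistic: number of + signs

p_upper = binom_upper_tail(n_plus, n)
reject_h0 = p_upper < 0.025   # two-tailed test at the 5% level

print(n_plus, n, round(p_upper, 6), reject_h0)
```

With these data the sketch should reproduce the test statistic of 12 and the tail probability 0.017578 quoted above.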

WORKED EXAMPLE 11.2


The data below give the lifetimes, in hours, of a random sample of fuses made by a certain manufacturer.

133, 14, 60, 315, 98, 44, 147, 389, 106, 238, 10, 80
The manufacturer claims that the fuses it produces have a median lifetime of 200 hours whereas a user suspects that the median
is less than this. The appropriate null and alternative hypotheses are H0 : median = 200 and H1 : median < 200 respectively. How
can the null hypothesis be tested?
Figure shows these values plotted on a number line.

Although the evidence is limited, this diagram suggests that the distribution of the lifetimes is far from symmetrical and so it
cannot be normal. Thus a hypothesis test cannot be carried out using a t-test. Consider instead how each value deviates from the
proposed median of 200. The deviations are

−67, −186, −140, +115, −102, −156, −53, +189, −94, +38, −190, −120.
Now consider just the signs of these deviations. These are

−, −, −, +, −, −, −, +, −, +, −, −.
There are fewer plus signs (3) than minus signs (9). This is what you might expect if the manufacturer is overestimating the
median lifetime, but are there so few + signs that the manufacturer’s claim should be rejected?

If the proposed value of 200 for the median is correct, then the probability of a + sign and the probability of a − sign in the list
of signs above are both 1/2. Thus X, the number of + signs, should have a distribution which is B(12, 1/2). From the cumulative
binomial probability tables P (X ≤ 3) = 0.0730. For a test at the 5% significance level this result would not be significant since
0.0730 > 0.05. Thus at the 5% significance level the null hypothesis that the median is equal to 200 cannot be rejected. The test
which has just been carried out is called a single-sample (binomial) sign test.

If there had been more + signs than − signs in this example, then there would be no point in performing a significance test, since
the sample would then give no support to the suspicion that the median lifetime is less than 200 hours.

WORKED EXAMPLE 11.3


A machine is designed to produce rods whose median length is 2 cm. After the machine has been moved to a new position in the
factory the lengths of the first nine rods produced are measured in order to test whether the setting for the median length has
altered. If the lengths, in cm, are 1.89, 1.92, 2.05, 1.88, 1.96, 1.97, 2.01, 1.94, 1.90, test, at the 10% significance level, whether the
median setting differs from 2 cm.

Take H0 : median = 2, H1 : median ≠ 2.

The deviations from the hypothesised median of 2 are −0.11, −0.08, +0.05, −0.12, −0.04, −0.03, +0.01, −0.06, −0.10. In this list
there are 2 plus signs and 7 minus signs. Under H0 , X, the number of pluses, is B(9, 1/2). From the cumulative binomial
probability tables P (X ≤ 2) = 0.0898. For a two-tail test at the 10% significance level, this probability must be compared with
0.05. Since 0.0898 > 0.05, the result is not significant at the 10% level and the null hypothesis that the median is still 2 cm is not rejected.

Alternatively, you could have calculated P (X ≥ 7), which, by symmetry, is equal to 0.0898.

When the data are paired, the sign test can be used to test whether the two samples come from identical populations. This version
of the test is called the paired-sample (binomial) sign test. It provides a non-parametric alternative to the paired-sample
t-test. Worked Example 11.6 in Section 11.4 illustrates the method.

Key Point 11.2
Let S = min(number of + signs, number of − signs), then

\[
E(S) = \frac{n}{2}, \qquad \mathrm{Var}(S) = \frac{n}{4}.
\]

For large n (> 10), S ∼ N(n/2, n/4), so we can use the normal approximation of the binomial with p = 0.5. We must also make sure that
we use a continuity correction. As we are approximating a discrete distribution with a continuous distribution, our z-value is:

\[
z = \frac{S - \mu + 0.5}{\sigma}.
\]
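
As a rough illustration of the continuity-corrected approximation, the sketch below compares it with the exact binomial tail for the fuse data of Worked Example 11.2 (n = 12, S = 3). The function names are hypothetical, and it assumes µ = n/2 and σ = √(n/4) as in the Key Point.

```python
from math import comb, sqrt
from statistics import NormalDist

def sign_test_normal_p(s, n):
    """Approximate lower-tail probability P(S <= s), with continuity correction."""
    mu = n / 2
    sigma = sqrt(n / 4)
    z = (s - mu + 0.5) / sigma
    return z, NormalDist().cdf(z)

def sign_test_exact_p(s, n):
    """Exact P(X <= s) for X ~ Bin(n, 0.5)."""
    return sum(comb(n, r) for r in range(s + 1)) / 2**n

n, s = 12, 3   # 12 fuses, 3 plus signs
z, p_approx = sign_test_normal_p(s, n)
p_exact = sign_test_exact_p(s, n)

print(round(z, 4), round(p_approx, 4), round(p_exact, 4))
```

For this sample the approximate and exact probabilities should agree to about two decimal places, which suggests the approximation is already reasonable just above n = 10.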

11.3 Single-sample Wilcoxon signed-rank test


The Wilcoxon signed-rank test is a non-parametric test which takes into account the sizes of the deviations from the median as well
as their signs. It can be used provided that you can assume that the population has a symmetrical (although not necessarily normal)
distribution. A consequence of this assumption is that mean and median are equal and so the null hypothesis can refer to either of these
parameters.

Consider again the data from Worked Example 11.3 (1.89, 1.92, 2.05, 1.88, 1.96, 1.97, 2.01, 1.94, 1.90). These values are plotted on a number line.
This diagram suggests that you would be justified in assuming that the lengths of the rods are distributed symmetrically, although not
necessarily about the value 2. The deviations from the median of 2 are shown in the first line of the table below. The second line of the table
shows the deviations without their signs, that is |deviation|. The third line shows the ranks of the values in the second line. The bottom
line shows the ranks with the signs of the original values restored. These are the signed ranks.

Deviation −0.11 −0.08 +0.05 −0.12 −0.04 −0.03 +0.01 −0.06 −0.10
|Deviation| 0.11 0.08 0.05 0.12 0.04 0.03 0.01 0.06 0.10
Rank 8 6 4 9 3 2 1 5 7
Signed rank −8 −6 +4 −9 −3 −2 +1 −5 −7

Now consider P , the sum of the positive ranks, and Q, the sum of the negative ranks. In this case P = 5 and Q = 40. If H0 is correct,
then you would expect each rank to be equally likely to be preceded by a + sign or a − sign and so you would expect the values of P
and Q to be similar.
In this example the sum of all the ranks is 1 + 2 + 3 + . . . + 9 = 45, so if H0 were true you would expect P and Q to be roughly equal
to 1/2 × 45 = 22.5. In fact, P is considerably less than this and Q is correspondingly greater.
In order to decide whether this imbalance between P and Q is significant, you take as the test statistic the smaller of P and Q. This
statistic is given the symbol T ; in this example T = 5. The rejection region for T can be found from the table on page 273. This gives
the largest value of T for which H0 is rejected. For a two-tail test at the 10% significance level with n (sample size) = 9, the rejection
region is T ≤ 8. Since the calculated value of T (= 5) falls in the rejection region, the null hypothesis that the median equals 2 is rejected.
Note that this result is the opposite of that obtained in Worked Example 11.3. This is because the Wilcoxon signed-rank test uses aspects of the
data which the sign test ignores and is thus a more discriminating test. It gives a lower probability of a Type II error; that is, it is better
at detecting a false null hypothesis.
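
The ranking procedure just described can be sketched in a few lines. This is a minimal illustration that assumes no tied absolute deviations (true for the rod data); the function name `signed_rank_statistic` is hypothetical.

```python
def signed_rank_statistic(values, median_0):
    """Return (P, Q, T) for a single-sample Wilcoxon signed-rank test.

    Assumes no tied absolute deviations. Zero deviations are dropped.
    """
    # Round to suppress floating-point noise such as 1.9 - 2 = -0.10000000000000009.
    devs = [round(x - median_0, 10) for x in values]
    devs = [d for d in devs if d != 0]
    ranked = sorted(devs, key=abs)   # position 1 holds the smallest |deviation|
    P = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    Q = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
    return P, Q, min(P, Q)

rods = [1.89, 1.92, 2.05, 1.88, 1.96, 1.97, 2.01, 1.94, 1.90]
P, Q, T = signed_rank_statistic(rods, 2)
print(P, Q, T)   # 5 40 5
```

The printed values match P = 5, Q = 40 and T = 5 from the text, and P + Q equals the total rank sum of 45.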
The critical values in the table on page 273 are obtained by considering the sampling distributions of P and Q and hence the sampling
distribution of T . To understand how this would work consider the situation in which the sample size is n = 3 and hence there are three
ranks. If the population is symmetrical, then, under H0 , these ranks are equally likely to be plus or minus.
The table below shows all 8 possible arrangements of the signed ranks and the resulting values of P , Q and T . Assuming that the null hypothesis
is true each arrangement is equally likely.

Signed Ranks P Q T
+1, +2, +3 6 0 0
−1, +2, +3 5 1 1
+1, −2, +3 4 2 2
+1, +2, −3 3 3 3
−1, −2, +3 3 3 3
−1, +2, −3 2 4 2
+1, −2, −3 1 5 1
−1, −2, −3 0 6 0

The table above shows why there is no entry for n = 3 in the table of critical values. For a two-tail test, you are interested in an imbalance between P and Q in either
direction. In this case, the probability that T = 0 is 2/8 = 1/4. Thus, the lowest significance level for which it is possible to give a critical
value for a two-tail test is 25%, when the rejection region is T = 0.
For a one-tail test, you are interested in an imbalance between P and Q in one direction only, say P > Q. Thus, the lowest significance
level for which it is possible to give a critical value for a one-tail test is 1/8 = 12.5%, when the rejection region is T = 0.
The table above also shows that the distribution of T is discrete. As a consequence, the significance levels given in the table on page 273
are only approximate and are not equal to P (Type I error). You met a similar situation in S2 Chapter 7 in connection with significance
tests involving the binomial and Poisson distributions.
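
The enumeration argument generalises directly: under H0 every assignment of signs to the ranks 1, ..., n is equally likely, so the exact distribution of T can be found by brute force for small n. The sketch below is an illustration only (the function name `t_distribution` is an assumption); it reproduces the n = 3 case and checks the two-tail 10% critical value for n = 9.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def t_distribution(n):
    """Exact null distribution of T = min(P, Q) for ranks 1..n."""
    total = n * (n + 1) // 2
    counts = Counter()
    for signs in product((+1, -1), repeat=n):      # all 2**n sign patterns
        P = sum(rank for rank, s in zip(range(1, n + 1), signs) if s > 0)
        Q = total - P
        counts[min(P, Q)] += 1
    return {t: Fraction(c, 2**n) for t, c in sorted(counts.items())}

dist3 = t_distribution(3)
print(dist3[0])        # 1/4, as in the n = 3 table

dist9 = t_distribution(9)
p_le_8 = sum(p for t, p in dist9.items() if t <= 8)
p_le_9 = sum(p for t, p in dist9.items() if t <= 9)
print(float(p_le_8), float(p_le_9))
```

For n = 9, P(T ≤ 8) comes out just below 0.10 while P(T ≤ 9) exceeds it, which is consistent with the rejection region T ≤ 8 used earlier and shows why the tabulated significance levels are only approximate.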
For values of n greater than 20, the distributions of P and Q (when H0 is true) are approximately normal with mean (1/4)n(n + 1) and
variance (1/24)n(n + 1)(2n + 1). Thus, the significance of a large value of T can be tested by standardising to give a value of the standard
normal variable, Z. Since T is a discrete variable, and it is being approximated by a continuous variable, a continuity correction is
required.
Suppose, for example, you obtain T = 104 for a sample of size n = 23 and wish to carry out a one-tail test at the 5% significance level
in a situation where the alternative hypothesis suggests that P > Q. Under H0 , P and Q are both distributed normally with
\[
\text{mean} = \tfrac{1}{4}n(n+1) = \tfrac{1}{4} \times 23 \times 24 = 138,
\]
\[
\text{variance} = \tfrac{1}{24}n(n+1)(2n+1) = \tfrac{1}{24} \times 23 \times 24 \times 47 = 1081.
\]
You are interested in low values of T , which arise because of low values of Q. Using a continuity correction in going from Q to Z,
\[
P(T \le 104) = P(Q \le 104) \approx P\left(Z \le \frac{104.5 - 138}{\sqrt{1081}}\right) = P(Z \le -1.0189\ldots) = 1 - \Phi(1.019) = 0.1541.
\]
For a one-tail test at the 5% level, this probability is compared with 0.05. Since 0.1541 > 0.05, the result is not significant at the 5%
level and the null hypothesis would be accepted. Alternatively, you can calculate the value of Z and see whether it lies in the rejection
region, which is Z ≤ −1.645 for a one-tail test at the 5% significance level.
For a two-tail test at the 5% level, the probability which is calculated would be compared with 0.025. Alternatively, you can see
whether the value of Z lies in the rejection region, which is |Z| ≥ 1.96 for a two-tail test at the 5% significance level.
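
The large-sample calculation for T = 104 and n = 23 can be reproduced as follows. This is a sketch using the standard library's `statistics.NormalDist`; the helper name `signed_rank_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def signed_rank_z(T, n):
    """Continuity-corrected z-value for the signed-rank statistic T."""
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24
    z = (T + 0.5 - mean) / sqrt(variance)
    return mean, variance, z

mean, variance, z = signed_rank_z(104, 23)
p_lower = NormalDist().cdf(z)          # P(Z <= z)

print(mean, variance, round(z, 4), round(p_lower, 4))
one_tail_reject = z <= -1.645          # 5% one-tail rejection region
two_tail_reject = abs(z) >= 1.96       # 5% two-tail rejection region
```

Both comparisons give the same decision as comparing the probability with 0.05 or 0.025: the result is not significant.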

Key Point 11.3


Where P is the sum of the ranks corresponding to the positive differences from the stated median and N is the sum of the ranks
corresponding to the negative differences from the stated median:

T = min(P, N ).

WORKED EXAMPLE 11.4


The weights (in kg) of ten randomly selected Spanish mackerel are recorded:

1.6, 1.1, 2.1, 2.4, 2.2, 2.9, 2.6, 2.3, 2.7, 1.9
Test, at the 5% significance level, whether the median weight is greater than 1.8 kg.
Answer

H0 : The population median weight of Spanish mackerel is 1.8 kg.


H1 : The population median weight of Spanish mackerel is greater than 1.8 kg.

Define the hypotheses.

To perform this test, we first need to rank the magnitude of differences of each data point from the stated population median.
Ignoring signs, start with the smallest difference and give this rank 1, the next smallest difference is given rank 2, and so on.
We can check P and N here using the fact that:
\[
P + N = \frac{n(n+1)}{2}
\]

Weight, Wi    Wi − Median    P     N
1.6           −0.2                 2
1.1           −0.7                 7
2.1            0.3           3
2.4            0.6           6
2.2            0.4           4
2.9            1.1           10
2.6            0.8           8
2.3            0.5           5
2.7            0.9           9
1.9            0.1           1
Sums:                        46    9
So the test statistic here is:
T = min(P, N ) = 9.
We look up the critical value in the statistical tables:

Level of Significance
One-tailed 0.05 0.025 0.01 0.005
Two-tailed 0.1 0.05 0.02 0.01
n=6 2 0 – –
7 3 2 0 –
8 5 3 1 0
9 8 5 3 1
10 10 8 5 3
11 13 10 7 5
Since 9 ≤ 10 (test statistic ≤ critical value), there is sufficient evidence to reject H0 .

We are conducting a one-tailed test here at the 5% significance level. The critical value here is 10.

Be careful, as we require test statistic ≤ critical value to reject H0 here. This is different from the other tests performed in
Chapter 9. We are testing whether the test statistic is significantly smaller than would happen by chance.

There is sufficient evidence to suggest that the population median is greater than 1.8 kg.

Write a conclusion in context.

Key Point 11.4


Given the statistic T = min(P, N ), then:

\[
E(T) = \frac{n(n+1)}{4}, \qquad \mathrm{Var}(T) = \frac{n(n+1)(2n+1)}{24}.
\]
And for large n:
\[
T \sim N\left(\frac{n(n+1)}{4}, \frac{n(n+1)(2n+1)}{24}\right).
\]
We use a continuity correction since we are approximating a discrete distribution with a continuous distribution. Our z-value is:
\[
z = \frac{T - \mu + 0.5}{\sigma}.
\]

WORKED EXAMPLE 11.5


In a clinical trial, the survival times, in weeks, for 19 patients with non-Hodgkin’s lymphoma are recorded:

37 54 73 89 94 110 112 123 129 132


148 151 173 189 201 204 213 276 281
Test, at the 5% significance level, whether the median differs from 150.
Answer

H0 : The population median is 150.


H1 : The population median is different from 150.

State the hypotheses.

Set up the table of ranks for the data. Ignoring signs, start with the smallest difference and give this rank 1, the next smallest
difference is given rank 2, and so on.

Wi     Wi − Med    |Wi − Med|    P     N
37     −113        113                 17
54     −96         96                  16
73     −77         77                  15
89     −61         61                  13
94     −56         56                  12
110    −40         40                  9
112    −38         38                  7
123    −27         27                  6
129    −21         21                  4
132    −18         18                  3
148    −2          2                   2
151    1           1             1
173    23          23            5
189    39          39            8
201    51          51            10
204    54          54            11
213    63          63            14
276    126         126           18
281    131         131           19
Sum:                             86    104

We can check \( P + N = 86 + 104 = 190 = \frac{n(n+1)}{2} \).

T = min(P, N ) = 86

\[
E(T) = \frac{n(n+1)}{4} = \frac{19 \times 20}{4} = 95
\]
\[
\mathrm{Var}(T) = \frac{n(n+1)(2n+1)}{24} = \frac{19 \times 20 \times 39}{24} = 617.5
\]
Calculate E(T ) and Var(T ) so we can approximate to the normal.
\[
z = \frac{T - \mu + 0.5}{\sigma} = \frac{86.5 - 95}{\sqrt{617.5}} = -0.342
\]

P (Z ≤ −0.342) = 0.3662
Since 0.3662 > 0.025, we do not reject H0 .

Since this is negative, but two-tailed, we consider only the bottom tail.
Since this is greater than 2.5%, it is not in the critical region.

We could instead have compared −0.342 with the critical value for the two-tailed test, −1.96.

Since −0.342 > −1.96, we do not reject H0 .

There is insufficient evidence to suggest that the population median differs from 150.

11.4 Paired-sample sign test
WORKED EXAMPLE 11.6
Data are collected on the time, in seconds, it takes nine children to tie up their left shoelace and their right shoelace.

Child Left (s) Right (s)


A 42 45
B 38 36
C 51 52
D 42 39
E 31 35
F 48 49
G 61 62
H 38 39
I 44 45
Test, at the 10% level of significance, whether there is a difference in the time it takes for the children to tie each shoelace.
Answer

H0 : There is no difference in the time taken to tie their left and right shoelaces.
H1 : There is a difference in the time taken to tie their left and right shoelaces.

Define the hypotheses.

Child Left (s) Right (s) Sign
A 42 45 −
B 38 36 +
C 51 52 −
D 42 39 +
E 31 35 −
F 48 49 −
G 61 62 −
H 38 39 −
I 44 45 −
The test statistic is 2.
\[
P(X \le 2) = \binom{9}{0}(0.5)^9 + \binom{9}{1}(0.5)^9 + \binom{9}{2}(0.5)^9 = 0.089844
\]
Since 0.089844 > 0.05, the test statistic of 2 is not in the critical region. Therefore, there is insufficient evidence to reject H0 .
There is insufficient evidence to say there is a difference in the times taken for children to tie their left and right shoelaces.

Set Li − Ri as the difference.


Let the number of + signs be the test statistic.
Use: X ∼ Bin(9, 0.5)
The test is two-tailed, but we need to consider only the lower tail.
Each tail has probability 5%, as the test is two-tailed at the 10% significance level.
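
The paired version differs from the single-sample test only in that the signs come from the differences Li − Ri. A minimal sketch for the shoelace data follows; the variable names are illustrative, and pairs with a zero difference (none here) are dropped as an assumption consistent with the treatment of zeros later in the chapter.

```python
from math import comb

left  = [42, 38, 51, 42, 31, 48, 61, 38, 44]
right = [45, 36, 52, 39, 35, 49, 62, 39, 45]

# Signs of L - R; a pair with zero difference would carry no sign.
diffs = [l - r for l, r in zip(left, right) if l != r]
n = len(diffs)
n_plus = sum(1 for d in diffs if d > 0)       # test statistic

# Lower-tail probability P(X <= n_plus) for X ~ Bin(n, 0.5)
p_lower = sum(comb(n, r) for r in range(n_plus + 1)) / 2**n

alpha = 0.10
reject_h0 = p_lower < alpha / 2               # two-tailed: 5% in each tail
print(n_plus, round(p_lower, 6), reject_h0)   # 2 0.089844 False
```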

WORKED EXAMPLE 11.7
The table below shows the times taken by a random sample of people to perform a simple task on their first and second attempts.
Test, at the 10% significance level, whether most people take less time on the second attempt than on the first attempt.

Person A B C D E F G H
First attempt 6.3 3.5 7.1 3.7 8.4 3.9 4.7 5.2
Second attempt 5.1 3.4 6.2 4.5 7.3 4.0 3.6 5.1

The null and alternative hypotheses are

H0 : the time taken on the second attempt is the same as that on the first attempt;
H1 : the time taken on the second attempt is lower than that on the first attempt.
The differences between the times taken on the first and second attempts are +1.2, +0.1, +0.9, −0.8, +1.1, −0.1, +1.1, +0.1.
There are 2 minus and 6 plus signs. Under H0 , X, the number of − signs, is distributed as B(8, 1/2); thus P (X ≤ 2) = 0.1445. The
test is one-tail, so 0.1445 is compared with 0.10. Since 0.1445 > 0.10, this result is not significant at the 10% level and the null
hypothesis of no difference in times is accepted.

The data analysed are the same as the data which were used in S3 Section 4.3 to illustrate the paired t-test. Unlike the
paired-sample sign test, the paired t-test gave a significant result and the null hypothesis was rejected. It may seem puzzling that
two tests can give different results for the same data. However, you must remember that the assumptions made about the data in
order to apply the tests are not the same.

The sign test makes only a very weak assumption about the population distributions, that you can tell whether each piece of data
lies above or below the median, so you can be certain that it is valid for the data. However, the sign test has the disadvantage
that it only uses a limited amount of the information available from the sample since it considers the sign of the deviations but
not their size. As a result, the probability of making a Type II error (that is, keeping a false null hypothesis) is high when a sign

test is used.

By contrast, the paired t-test, which assumes that the deviations are normally distributed, has a lower probability of giving a
Type II error. This is the reason why the t-test can give a significant result when the sign test does not. If you can be certain that
the t-test is valid, then it is the better test to use.

The paired-sample sign test and the paired t-test are also trying to detect slightly different things. The paired t-test is trying to
detect whether the mean difference is greater than zero. Since this test assumes that the differences are distributed normally, this
is equivalent to saying that the median difference is greater than zero. The paired-sample sign test is trying to detect only whether
the median difference is greater than zero. It says nothing about the mean difference.

By making an assumption about symmetry it is possible to develop a non-parametric test which gives a lower probability of a
Type II error than the sign test, and this is described in the next section.

The sign test can be used to test:

(a) the null hypothesis that a sample comes from a population with a given median (the single-sample sign test);

(b) the null hypothesis that paired data are drawn from the same population (the paired-sample sign test).

The sign test is based on the signs of differences: in case (a) the differences between the observed values and the hypothesised
median, in case (b) between the pairs of values.

If there are n differences, the number of − signs and the number of + signs are both distributed as B(n, 1/2) if the null hypothesis
is true.

To test for significance, the probability p of the observed value (or a more extreme value) of the number of − signs (or
alternatively + signs) is calculated assuming that the null hypothesis is true. For a test at the α% significance level, the null
hypothesis is rejected if 100p < α for a one-tail test or 100p < (1/2)α for a two-tail test.

The sign test makes no assumptions about the population distribution other than that the data can be said to be either above or
below the median.

11.5 Wilcoxon matched-pairs signed-rank test

Key Point 11.5


When we have matched pairs of data of unknown distributions, but the differences between them are thought to be symmetric, it is
appropriate to use a Wilcoxon matched-pairs signed-rank test. We test to see whether the paired-difference median is 0.

WORKED EXAMPLE 11.8


An investigation is carried out into the effectiveness of two types of post-operative pain relief drug: Drug 1 and Drug 2. Seven
adults agree to take Drug 1 on one day, and Drug 2 on the second. The time, in hours, of pain relief is recorded.

Drug 1 Drug 2
A 4.1 3.9
B 3.2 3.3
C 5.3 5.0
D 5.1 4.6
E 4.2 4.6
F 3.8 3.2
G 3.6 4.3
Test, using the matched-pairs Wilcoxon signed-rank test, at the 5% significance level, whether Drug 2 gives longer pain relief than Drug 1.
Answer

H0 : The median difference in pain-relief time between Drug 1 and Drug 2 is zero.

H1 : Drug 2 gives longer pain relief than Drug 1.

Define the hypotheses.


This is a one-tailed test.

      Drug 1    Drug 2    Difference (Drug 1 − Drug 2)    P     N
A     4.1       3.9        0.2                            2
B     3.2       3.3       −0.1                                  1
C     5.3       5.0        0.3                            3
D     5.1       4.6        0.5                            5
E     4.2       4.6       −0.4                                  4
F     3.8       3.2        0.6                            6
G     3.6       4.3       −0.7                                  7
Sum:                                                      16    12

T = min(P, N ) = 12
And so the test statistic is 12.
Find the critical value in the statistical tables:

Level of Significance
One-tailed 0.05 0.025 0.01 0.005
Two-tailed 0.1 0.05 0.02 0.01
n=6 2 0 – –
7 3 2 0 –
8 5 3 1 0
9 8 5 3 1
10 10 8 5 3
11 13 10 7 5
Since 12 > 3 (test statistic > critical value), there is insufficient evidence to reject H0 .
There is insufficient evidence to suggest that Drug 2 gives longer pain relief.

WORKED EXAMPLE 11.9


Ten people enrolled on a new slimming course for six months. Their weights in kilograms before and after the course are shown in
the table below.
Person 1 2 3 4 5 6 7 8 9 10
Before 75.4 78.1 79.7 70.3 72.0 74.1 78.5 74.9 70.3 72.9
After 70.9 71.3 69.5 73.2 72.1 72.0 71.6 73.1 70.8 71.6
Test at the 5% level whether the course is effective.
The differences between the initial and final weights (in kilograms) are given below with their signed ranks beneath them.
Difference 4.5 6.8 10.2 -2.9 -0.1 2.1 6.9 1.8 -0.5 1.3
Signed rank 7 8 10 -6 -1 5 9 4 -2 3
The null and alternative hypotheses are

H0 : median difference = 0;

H1 : median difference > 0.
The sum of the positive ranks is P = 46, and the sum of the negative ranks is Q = 9. Thus the test statistic, T , takes the value 9.
In this case the test is a one-tail test. For n = 10, the rejection region for a test at the 5% level is T ≤ 10. Since the calculated
value of T lies in the rejection region, H0 is rejected. There is evidence at the 5% significance level that the course is effective.
Note that in this example, if H0 is false, you would expect the sum of the negative ranks to be low and the sum of the positive
ranks to be high. If the reverse had been the case and the results gave a higher value for Q than for P, then the course would
obviously not be having the desired effect and a formal hypothesis test would not be appropriate.
In carrying out the tests described in Sections 2.2 and 2.3, it is possible that a difference of zero could have been obtained.
Although zero can be ranked, it does not have a sign, and so such values are ignored in carrying out these tests.
The Wilcoxon signed-rank test can be used to test:

(a) the null hypothesis that a sample comes from a population with a given median (the single-sample signed-rank test);

(b) the null hypothesis that paired data are drawn from the same population (the matched-pairs signed-rank test).

The steps in a signed-rank test are:

Step 1 Calculate differences: in case (a) between the observed values and the hypothesised median, in case (b) between the pairs of
values.

Step 2 Ignoring the signs of the differences, assign ranks to the differences in order of increasing size.

Step 3 Give each rank the same sign as the original value.

Step 4 Find P , the sum of the positive ranks, and Q, the sum of the negative ranks.

Step 5 Take the smaller of P and Q as the test statistic, T .

Step 6 For small samples, find the rejection region from the table on page 273 and reject H0 if the calculated value of T lies in the
rejection region.

For large samples, calculate:

\[
Z = \frac{T + 0.5 - \frac{1}{4}n(n+1)}{\sqrt{\frac{1}{24}n(n+1)(2n+1)}},
\]
where n is the number of differences.

• For a one-tail test at the 100α% significance level, the rejection region is Z ≤ −z where Φ(z) = 1 − α.

• For a two-tail test at the 100α% significance level, the rejection region is Z ≤ −z where Φ(z) = 1 − (1/2)α.

The signed-rank test assumes:

• In case (a), that the population distribution is symmetrical.

• In case (b), that the differences have a symmetrical distribution.
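
Steps 1 to 6 can be sketched for matched pairs as below, using the slimming-course data of Worked Example 11.9. This is an illustration under stated assumptions: no tied absolute differences, zero differences dropped before ranking, and the function name `matched_pairs_signed_rank` is hypothetical. The z-value is computed only to show the large-sample formula; with n = 10 the tabulated critical value is the appropriate reference.

```python
from math import sqrt

def matched_pairs_signed_rank(first, second):
    """Return (n, P, Q, T, z) for the matched-pairs signed-rank test.

    Assumes no tied absolute differences. Zero differences are ignored.
    """
    # Step 1: differences (rounded to suppress floating-point noise)
    diffs = [round(a - b, 10) for a, b in zip(first, second)]
    diffs = [d for d in diffs if d != 0]
    # Steps 2 and 3: rank by |difference|, keeping the sign with each rank
    ranked = sorted(diffs, key=abs)
    # Step 4: sums of positive and negative ranks
    P = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    Q = sum(rank for rank, d in enumerate(ranked, start=1) if d < 0)
    # Step 5: test statistic
    T = min(P, Q)
    # Step 6 (large-sample form): continuity-corrected z-value
    n = len(ranked)
    z = (T + 0.5 - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return n, P, Q, T, z

before = [75.4, 78.1, 79.7, 70.3, 72.0, 74.1, 78.5, 74.9, 70.3, 72.9]
after  = [70.9, 71.3, 69.5, 73.2, 72.1, 72.0, 71.6, 73.1, 70.8, 71.6]

n, P, Q, T, z = matched_pairs_signed_rank(before, after)
print(n, P, Q, T, round(z, 4))
```

The sketch should give P = 46, Q = 9 and T = 9, agreeing with the worked example.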

11.6 Wilcoxon rank-sum test


This section introduces a non-parametric alternative to the two-sample t-test for two independent samples. The two-sample t-test tests
whether two samples come from populations with the same mean. It requires that the populations are normal with equal variances.
The Wilcoxon rank-sum test (or an equivalent test known as the Mann-Whitney U test) tests whether two samples come from
populations with identical distributions and does not require any assumptions about the two populations.
The test works by investigating whether two populations satisfy a condition necessary for them to have identical distributions. Given
two populations with identical distributions, if X is a random measurement from the first population and Y is a random measurement from
the second population, then P (X < Y ) = 1/2. Therefore, if measurements X and Y from two populations are such that P (X < Y ) ≠ 1/2,
the populations cannot be distributed identically. The Wilcoxon rank-sum test provides a test of whether measurements from the first
population are likely to be higher (or lower) than measurements from the second population.
Consider the following situation:
Each week I visit my local supermarket on either Friday or Saturday at the same time of day, but I am not sure which is the better
day to shop. I suspect that the time taken is likely to be less on a Friday, and I decide to collect data to test this hypothesis. The first

line of data below gives the times on four Saturdays, and the second row gives the times on three Fridays, to the nearest minute.

Saturday 74 58 61 50
Friday 38 56 60

Is there any evidence from these data that shopping on Friday is likely to take less time than shopping on Saturday? Appropriate
null and alternative hypotheses are respectively:

H0 : Shopping time has the same distribution on Fridays and Saturdays;

H1 : Shopping is likely to take less time on Friday than on Saturday.


The values are listed below in order of increasing size with their ranks underneath them. The letter S or F in the third row indicates
whether the value refers to a Saturday or a Friday respectively.

Value 38 50 56 58 60 61 74
Rank 1 2 3 4 5 6 7
Day F S F S F S S
If shopping on Friday takes less time than shopping on Saturday, then you would expect low ranks in the table above to be associated
with the letter F . The sum of the ranks for the F values, Σ rF , is 1 + 3 + 5 = 9. The highest possible value for Σ rF is 5 + 6 + 7 = 18
and the lowest possible value is 1 + 2 + 3 = 6. The average of these two extremes is 12, so the observed value of Σ rF = 9 is on the low
side, suggesting that shopping on Friday may take less time than shopping on Saturday.
If Σ rF had been on the high side, then there would be no reason to proceed further since the data could not provide evidence that
shopping on Friday is quicker than shopping on Saturday.
In order to decide whether the value of Σ rF is significantly low, you need to find the value of P (Σ rF ≤ 9) under the assumption
that shopping on the two days takes the same time. Under such an assumption, all possible arrangements of Ss and F s are equally likely.
There are \(\binom{7}{3} = \binom{7}{4} = 35\) such arrangements. The arrangements in which Σ rF ≤ 9 are shown below.

Rank    1  2  3  4  5  6  7    Σ rF
        F  F  F  S  S  S  S    6
        F  F  S  F  S  S  S    7
        F  S  F  F  S  S  S    8
        S  F  F  F  S  S  S    9
        F  F  S  S  F  S  S    8
        F  S  F  S  F  S  S    9
        F  F  S  S  S  F  S    9

There are 7 such arrangements, so P (Σ rF ≤ 9) = 7/35 = 0.2. This result is not significant at, say, the 5% level, and so the null
hypothesis that the distributions of shopping times are the same on Saturday and Friday would be accepted in this case.
Listing all the possible arrangements would become tedious for larger samples, so you will be pleased to know that you can obtain
critical values from the table on page 274. An explanation of how to use the table is given at the top of that page. It is helpful to work
through this explanation using the shopping example.
The sizes of the samples are denoted by m and n, where m ≤ n; so for the shopping example m = 3 (Friday) and n = 4 (Saturday).
The sum of the ranks of the sample of size m is denoted by Rm , so Rm = 1 + 3 + 5 = 9. You also need to calculate

m(n + m + 1) − Rm = 3(4 + 3 + 1) − 9 = 15.


The statistic W is the smaller of the two values 9 and 15, in this case 9.
The quantity m(n + m + 1) − Rm is the rank sum which would have been obtained for the sample of size m if the values had been
ranked starting from the largest value rather than the smallest value, as was done above. Taking W to be the smaller of this quantity
and Rm ensures that W does not depend on the direction in which you rank the values.
The table on page 274 gives the largest value of W which leads to rejection of the null hypothesis. For a one-tail test at the 5%
significance level with m = 3 and n = 4, this is 6, so the rejection region is W ≤ 6. Since 9 does not lie in the rejection region, the null
hypothesis is not rejected. This is the same as the result which was obtained by calculating the probability that $\sum r_F \le 9$.
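
A critical value of this kind can be reproduced by enumeration. The sketch below assumes that the tabulated entry is the largest value c for which P(Rm ≤ c) does not exceed the one-tail significance level; that convention is an assumption here, chosen because it agrees with the value 6 quoted for m = 3 and n = 4. The function name `critical_value` is illustrative.

```python
from itertools import combinations


def critical_value(m, n, alpha):
    """Largest c with P(R_m <= c) <= alpha under the null hypothesis,
    or None if even the smallest possible rank sum is too likely."""
    sums = [sum(c) for c in combinations(range(1, m + n + 1), m)]
    total = len(sums)
    critical = None
    for c in range(min(sums), max(sums) + 1):
        if sum(s <= c for s in sums) / total <= alpha:
            critical = c
        else:
            break  # the tail probability only grows as c increases
    return critical


print(critical_value(3, 4, 0.05))  # 6
```

For a two-tail test the same function would be called with half the significance level, matching the way the one-tailed and two-tailed columns of the table are paired.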

Key Point 11.6


If two samples have sizes m and n, where m ≤ n, and Rm is the sum of the ranks of the items in the sample of size m, then the test statistic is:

W = min(Rm, m(n + m + 1) − Rm)

WORKED EXAMPLE 11.10


Researchers are investigating the effect of vitamin B12 on the size of the brain. A sample of males aged between 25 and 40 years
is selected. Nine of them are known to have low B12 levels and seven are known to have high B12 levels. After a brain scan, the
ratio of brain volume to skull capacity is recorded.

Low B12 levels High B12 levels


0.795 0.786
0.798 0.789
0.802 0.792
0.805 0.796
0.806 0.799
0.807 0.800
0.808 0.803
0.810
0.812
Carry out a Wilcoxon rank-sum test, at the 5% significance level, to see whether the level of vitamin B12 affects the size of the
brain.
Answer

H0 : level of B12 has no effect on brain size.


H1 : level of B12 has an effect on brain size.

We can also state H0 as the samples are from the same population.

Low B12    High B12    Rank (low)    Rank (high)
0.812                  1
0.810                  2
0.808                  3
0.807                  4
0.806                  5
0.805                  6
           0.803                     7
0.802                  8
           0.800                     9
           0.799                     10
0.798                  11
           0.796                     12
0.795                  13
           0.792                     14
           0.789                     15
           0.786                     16
Sum                    53            83
Calculate the test statistic:

Rm = 83 (rank sum from the smaller-sized sample).


m(n + m + 1) − Rm = 7(9 + 7 + 1) − 83 = 36.
The test statistic is the minimum of 83 and 36, which is W = 36.
Find the critical value in the statistical tables:

Level of significance
One-tailed    0.05    0.025    0.01
Two-tailed    0.1     0.05     0.02
m = 7
n = 7         39      36       34
n = 8         41      38       35
n = 9         43      40       37
n = 10        45      42       39
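With m = 7 and n = 9, the critical value for a two-tailed test at the 5% level is 40, so the rejection region is W ≤ 40. Since 36 < 40, there is sufficient evidence to reject H0.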
There is evidence to suggest that the level of vitamin B12 affects brain size.
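
The calculation of W in this example can be sketched in code. This is an illustrative helper, not part of the original text; `rank_sum_statistic` is a made-up name and the code assumes there are no tied values. It ranks from the smallest value upwards, so it obtains Rm = 36 directly, whereas the table above ranks from the largest value and obtains 83. Both routes give the same W.

```python
low_b12 = [0.795, 0.798, 0.802, 0.805, 0.806, 0.807, 0.808, 0.810, 0.812]
high_b12 = [0.786, 0.789, 0.792, 0.796, 0.799, 0.800, 0.803]


def rank_sum_statistic(sample_a, sample_b):
    """Return (m, n, R_m, W) for two samples, assuming no tied values."""
    small, large = sorted((sample_a, sample_b), key=len)
    m, n = len(small), len(large)
    pooled = sorted(small + large)
    # Rank 1 goes to the smallest pooled value.
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    r_m = sum(rank_of[value] for value in small)
    w = min(r_m, m * (n + m + 1) - r_m)
    return m, n, r_m, w


m, n, r_m, w = rank_sum_statistic(low_b12, high_b12)
print(m, n, r_m, w)  # 7 9 36 36
```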

Key Point 11.7


For large n and m (n ≥ 10, m ≥ 10), it is possible to approximate W as a normal distribution:

$$E(W) = \frac{m(n + m + 1)}{2}, \qquad \mathrm{Var}(W) = \frac{mn(n + m + 1)}{12}.$$

We must also make sure that we use a continuity correction. Since we are approximating a discrete distribution with a continuous
distribution, our z-value is:

$$z = \frac{W - \mu + 0.5}{\sigma}.$$

WORKED EXAMPLE 11.11
A company is investigating a new production technique to improve the quality of camera lenses for a phone. Samples of the lenses
are given to a camera expert who is asked to rank the lenses, with rank 1 being the highest quality. The expert does not know
which production technique has been used.

Lens A B C D E F G H I J K L
Method old new new old old new old new old old old new
Rank 12 1 2 9 10 5 21 6 20 22 23 17
Lens M N O P Q R S T U V W X
Method new new old old old new old new old new new old
Rank 14 13 3 4 19 11 24 16 18 8 7 15
Using a suitable approximation as shown in Key Point 11.7, test, at the 5% significance level, whether there is a difference in the
quality of production techniques.
Answer

H0 : There is no difference in the quality of the two samples.


H1 : There is a difference in the quality of the two samples.

m = 11 (new), n = 13 (old)

$$E(R_m) = \frac{m(n + m + 1)}{2} = \frac{11(25)}{2} = 137.5$$

$$\mathrm{Var}(R_m) = \frac{mn(n + m + 1)}{12} = \frac{11 \times 13 \times 25}{12} = 297.92$$

$R_m = 1 + 2 + 5 + 6 + 17 + 14 + 13 + 11 + 16 + 8 + 7 = 100$

$$z = \frac{100.5 - 137.5}{\sqrt{297\tfrac{11}{12}}} = -2.144$$

$P(Z \le -2.144) = 0.0160$
Since 0.0160 < 0.025, the test statistic is in the critical region and so we have sufficient evidence to reject H0 .
We could have compared −2.144 with the critical value for the two-tailed test, −1.96.
Since −2.144 < −1.96, the test statistic is in the critical region and so we have sufficient evidence to reject H0 .
There is a difference in quality between samples of camera lenses made by different production techniques.
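
The arithmetic in this example can be reproduced with the standard library. The sketch below is illustrative only; the variable names are not from the original, and `statistics.NormalDist` is used in place of normal tables.

```python
from math import sqrt
from statistics import NormalDist

# Production method and expert rank for lenses A to X, in order.
methods = ("old new new old old new old new old old old new "
           "new new old old old new old new old new new old").split()
ranks = [12, 1, 2, 9, 10, 5, 21, 6, 20, 22, 23, 17,
         14, 13, 3, 4, 19, 11, 24, 16, 18, 8, 7, 15]

new_ranks = [r for r, method in zip(ranks, methods) if method == "new"]
m = len(new_ranks)            # smaller sample (new technique)
n = len(ranks) - m            # larger sample (old technique)
r_m = sum(new_ranks)
w = min(r_m, m * (n + m + 1) - r_m)

mean = m * (n + m + 1) / 2
variance = m * n * (n + m + 1) / 12
z = (w + 0.5 - mean) / sqrt(variance)   # continuity correction of +0.5
p_lower = NormalDist().cdf(z)

print(m, n, r_m, w)                 # 11 13 100 100
print(f"{z:.3f} {p_lower:.4f}")     # -2.144 0.0160
```

For the two-tailed test at the 5% level, `p_lower` is compared with 0.025, as in the answer above.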

WORKED EXAMPLE 11.12


Two different types of nylon fibre were tested for the amount of stretching under tension. Ten random samples of each fibre, of the
same length and diameter, were stretched by applying a standard load. For fibre 1, the increases in length, x mm, were as follows:

12.84, 14.26, 13.23, 14.75, 15.13, 14.15, 13.37, 12.96, 15.02, 14.38
For fibre 2, the increases in length, y mm, were as follows:

14.27, 13.25, 14.17, 13.11, 14.92, 12.12, 14.21, 13.68, 15.14, 14.81
Test whether the distribution of the increase in length differs for the two types of fibre. Use a 10% significance level.

A two-tail test is appropriate. The null and alternative hypotheses are:

H0 : The increases in length for the two types of fibre have the same distribution;

H1 : The increases in length for the two types of fibre have different distributions.
All 20 values are listed below in order of ascending size, together with their ranks. Values for fibre 1 are marked with an asterisk.

Value   12.12   12.84*  12.96*  13.11   13.23*  13.25   13.37*  13.68   14.15*  14.17
Rank    1       2       3       4       5       6       7       8       9       10
Value   14.21   14.26*  14.27   14.38*  14.75*  14.81   14.92   15.02*  15.13*  15.14
Rank    11      12      13      14      15      16      17      18      19      20
The sums of the ranks for fibre 1 and fibre 2 are 104 and 106 respectively. (It is worth checking that the sum of these values is
equal to 1 + 2 + . . . + 20.) In this case, the sample sizes are equal; that is, m = n = 10. The sum of the ranks for the fibre 1 sample
is 104, so Rm = 104 and m(n + m + 1) − Rm = 10(10 + 10 + 1) − 104 = 106. The latter is just the sum of the ranks for the fibre 2
sample, a result which is obtained because the samples have the same size. Thus, the test statistic is W = 104. For a two-tail test
at the 10% significance level with m = n = 10, the rejection region is W ≤ 82. Since the observed value of W (= 104) does not lie
in the rejection region, the null hypothesis that the samples come from identical populations is accepted.

The data in Example 2.4.1 were analysed in S3 Example 4.2.2 using a two-sample t-test for the difference between population means.
The result of the two-sample t-test was the same; that is, the null hypothesis was accepted. The assumptions which were made in
order to carry out the two-sample t-test were that the populations are normally distributed with equal variance. Thus, accepting
the null hypothesis for this test is equivalent to saying that the samples come from identical populations. If the two-sample t-test
is valid, then its probability of a Type II error is lower than that of the Wilcoxon rank-sum test because of the assumptions made
about the population distributions in the two-sample t-test. However, the difference is not great, and so the Wilcoxon rank-
sum test is a useful alternative to the two-sample t-test when you cannot be sure that the conditions for the latter to be valid are met.

For values of m and n greater than 10, the Wilcoxon rank-sum test can be performed using the fact that $R_m$ and $m(n + m + 1) - R_m$
are distributed approximately normally with mean $\frac{1}{2}m(n + m + 1)$ and variance $\frac{1}{12}mn(m + n + 1)$. As for the Wilcoxon signed-rank
test, a continuity correction should be applied since a discrete distribution is being approximated by a continuous one.

Suppose, for example, you obtain W = Rm = 106 with m = 12 and n = 15 and wish to carry out a one-tail test at the 1%
significance level, where the alternative hypothesis suggests low values of Rm. Under a null hypothesis of identical distributions,
Rm is distributed approximately normally with

$$\text{mean} = \frac{1}{2}m(n + m + 1) = \frac{1}{2} \times 12 \times (12 + 15 + 1) = 168,$$

$$\text{variance} = \frac{1}{12}mn(m + n + 1) = \frac{1}{12} \times 12 \times 15 \times (12 + 15 + 1) = 420.$$

Using a continuity correction in going from Rm to Z,

$$P(W \le 106) = P(R_m \le 106) = P\left(Z \le \frac{106.5 - 168}{\sqrt{420}}\right) = P(Z \le -3.000\ldots) = 0.0013.$$

Since 0.0013 < 0.01, the result is significant, and the null hypothesis is rejected.

For a two-tail test, you would calculate the probability that W is less than or equal to the observed value and then compare this probability
with half of the significance level.

In the tests described in Sections 2.3 and 2.4, it is possible for two values to have the same rank. The method then has to be
modified, but the modification will not be described here.

The Wilcoxon rank-sum test can be used to test the null hypothesis that two samples come from identical populations. The sample
sizes are denoted by m and n, where m ≤ n.
Steps in Carrying Out the Test

Step 1 Rank all the values from both samples in order of increasing size.

Step 2 Find Rm , the sum of the ranks of the items in the sample of size m.

Step 3 Take W as the test statistic, where W is the smaller of Rm and m(n + m + 1) − Rm .

Step 4 For small samples, find the rejection region from the table on page 274 and reject the null hypothesis if the calculated value
of W lies in the rejection region.

For large samples, calculate:

$$Z = \frac{W + 0.5 - \frac{1}{2}m(m + n + 1)}{\sqrt{\frac{1}{12}mn(m + n + 1)}}.$$

• For a one-tail test at the 100α% significance level, the rejection region is Z ≤ −z where Φ(z) = 1 − α.

• For a two-tail test at the 100α% significance level, the rejection region is Z ≤ −z where Φ(z) = 1 − ½α.

The Wilcoxon rank-sum test for identity of distributions makes no assumptions about the population distributions other than that
the data are quantitative.
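
The steps above can be sketched as a single function for the large-sample case. This is a minimal illustration under stated assumptions, not a definitive implementation: the name `rank_sum_z_test` is made up, tied values are assumed not to occur, and the nylon fibre data from the earlier worked example are reused because m = n = 10.

```python
from math import sqrt
from statistics import NormalDist

fibre_1 = [12.84, 14.26, 13.23, 14.75, 15.13, 14.15, 13.37, 12.96, 15.02, 14.38]
fibre_2 = [14.27, 13.25, 14.17, 13.11, 14.92, 12.12, 14.21, 13.68, 15.14, 14.81]


def rank_sum_z_test(sample_a, sample_b, alpha, two_tail=True):
    """Return (W, Z, reject) using the normal approximation; no ties assumed."""
    small, large = sorted((sample_a, sample_b), key=len)
    m, n = len(small), len(large)
    # Step 1: rank all values in order of increasing size.
    pooled = sorted(small + large)
    rank_of = {value: rank for rank, value in enumerate(pooled, start=1)}
    # Step 2: sum the ranks of the sample of size m.
    r_m = sum(rank_of[value] for value in small)
    # Step 3: W is the smaller of R_m and m(n + m + 1) - R_m.
    w = min(r_m, m * (n + m + 1) - r_m)
    # Step 4 (large samples): standardise with a continuity correction.
    z_stat = (w + 0.5 - m * (m + n + 1) / 2) / sqrt(m * n * (m + n + 1) / 12)
    tail = alpha / 2 if two_tail else alpha
    z_crit = NormalDist().inv_cdf(1 - tail)   # z with Phi(z) = 1 - tail
    return w, z_stat, z_stat <= -z_crit


w, z_stat, reject = rank_sum_z_test(fibre_1, fibre_2, alpha=0.10)
print(w, f"{z_stat:.4f}", reject)  # 104 -0.0378 False
```

The decision agrees with the one reached from the table in the worked example: the null hypothesis is not rejected at the 10% level.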

Checklist of learning and understanding


Single-sample sign test
• Given n data points, a sign test is created using X ∼ Bin(n, 0.5). The test statistic can be the number of + signs, that is, the number
of data points greater than the median.

• We can calculate the probability that X is above this test statistic, below this test statistic, or either in the case of a two-tailed test.
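
As a brief illustration of the binomial calculation, the sketch below computes the tail probabilities of X ∼ Bin(n, 0.5). The function name and the figures n = 10 with 8 plus signs are hypothetical, chosen only to show the arithmetic; doubling the smaller tail for the two-tailed case is one common convention and is an assumption here.

```python
from math import comb


def sign_test_p_values(n, plus_count):
    """Return (P(X <= plus_count), P(X >= plus_count), two-tailed p-value)
    for X ~ Bin(n, 0.5)."""
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    lower = sum(pmf[:plus_count + 1])
    upper = sum(pmf[plus_count:])
    two_tailed = min(1.0, 2 * min(lower, upper))
    return lower, upper, two_tailed


# Hypothetical data: 8 of 10 values lie above the hypothesised median.
lower, upper, two_tailed = sign_test_p_values(10, 8)
print(upper, two_tailed)  # 0.0546875 0.109375
```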

Wilcoxon signed-rank test


• A Wilcoxon signed-rank test can be performed when:

• the underlying data are symmetric.


• the underlying data are continuous.

• Where:

• P is the sum of the ranks corresponding to the positive differences from the stated median.
• N is the sum of the ranks corresponding to the negative differences from the stated median.
• T = min(P, N) is the test statistic.
• Given the statistic T = min(P, N), then:

$$E(T) = \frac{n(n + 1)}{4}, \qquad \mathrm{Var}(T) = \frac{n(n + 1)(2n + 1)}{24}.$$

• For large n:

$$T \sim N\left(\frac{n(n + 1)}{4}, \frac{n(n + 1)(2n + 1)}{24}\right),$$

allowing for an approximate z-test to be done using:

$$z = \frac{T - \mu + 0.5}{\sigma}.$$
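
The signed-rank quantities summarised above can be sketched as follows. The function name `signed_rank_z` and the list of differences are hypothetical, used only to show how P, N, T and z fit together; the code assumes no zero differences and no ties in absolute value.

```python
from math import sqrt


def signed_rank_z(differences):
    """Return (P, N, T, z) for differences from the stated median.
    Assumes no zero differences and no tied absolute values."""
    ordered = sorted(differences, key=abs)   # rank 1 = smallest |difference|
    n = len(ordered)
    p = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    n_sum = n * (n + 1) // 2 - p             # ranks of negative differences
    t = min(p, n_sum)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mu + 0.5) / sigma               # continuity correction
    return p, n_sum, t, z


# Hypothetical differences, for illustration only.
diffs = [1.2, -0.4, 2.5, 3.1, -0.9, 1.8, 2.2, -0.3, 0.7, 1.5]
p, n_sum, t, z = signed_rank_z(diffs)
print(p, n_sum, t, f"{z:.3f}")  # 48 7 7 -2.039
```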

Wilcoxon matched-pairs signed-rank test


• A Wilcoxon matched-pairs signed-rank test can be performed when:
• the difference between matched-pairs is symmetric.
• the difference between matched-pairs is continuous.
• Where:
• P is the sum of the ranks corresponding to the positive differences between the matched pairs.
• N is the sum of the ranks corresponding to the negative differences between the matched pairs.
• T = min(P, N ) is the test statistic.
• Given the statistic T = min(P, N), then:

$$E(T) = \frac{n(n + 1)}{4}, \qquad \mathrm{Var}(T) = \frac{n(n + 1)(2n + 1)}{24}.$$

• For large n:

$$T \sim N\left(\frac{n(n + 1)}{4}, \frac{n(n + 1)(2n + 1)}{24}\right),$$

allowing for an approximate z-test with:

$$z = \frac{T - \mu + 0.5}{\sigma}.$$

Wilcoxon rank-sum test

• A Wilcoxon rank-sum test can be performed when the two samples are independent, where:

• the two samples have sizes m and n, where m ≤ n.


• Rm is the sum of the ranks of the items in the sample of size m.
• the test statistic is W = min(Rm , m(n + m + 1) − Rm ).

• Given the test statistic W, then:

$$E(W) = \frac{m(n + m + 1)}{2}, \qquad \mathrm{Var}(W) = \frac{mn(n + m + 1)}{12}.$$

• For large n and m (n ≥ 10, m ≥ 10), it is possible to approximate W as a normal distribution:

$$W \sim N\left(\frac{m(n + m + 1)}{2}, \frac{mn(n + m + 1)}{12}\right),$$

allowing for an approximate z-test with:

$$z = \frac{W - \mu + 0.5}{\sigma}.$$
