The Kruskal-Wallis test:
The Kruskal-Wallis test is a nonparametric test used to compare three or more samples. It
tests the null hypothesis that all populations have identical distribution functions against
the alternative hypothesis that at least two of the populations differ, and differ only with
respect to location (median).
It is the analogue to the F-test used in analysis of variance. While analysis of variance tests
depend on the assumption that all populations under comparison are normally distributed, the
Kruskal-Wallis test places no such restriction on the comparison.
It is a logical extension of the Wilcoxon-Mann-Whitney test.
This test is appropriate for use under the following circumstances:
(a) you have three or more conditions that you want to compare;
(b) each condition is performed by a different group of participants, i.e. you have an independent
measures design with three or more conditions;
(c) the data do not meet the requirements for a parametric test (i.e. use it if the data are not normally
distributed, if the variances for the different conditions are markedly different, or if the data are
measurements on an ordinal scale). If the data meet the requirements for a parametric test, it is better
to use a one-way independent-measures Analysis of Variance (ANOVA) because it is more powerful than
the Kruskal-Wallis test.
Step by step example of the Kruskal-Wallis test:
Example problem:
Does physical exercise alleviate depression? We find some depressed people and check that they are all
equivalently depressed to begin with. Then we allocate each person randomly to one of three groups:
no exercise; 20 minutes of jogging per day; or 60 minutes of jogging per day. At the end of a month, we
ask each participant to rate how depressed they now feel, on a Likert scale that runs from 1 ("totally
miserable") through to 100 ("ecstatically happy"). The appropriate test here is the Kruskal-Wallis test. We
have three separate groups of participants, each of whom gives us a single score on a rating scale.
Ratings are examples of an ordinal scale of measurement, and so the data are not suitable for a
parametric test. The Kruskal-Wallis test will tell us if the differences between the groups are so large
that they are unlikely to have occurred by chance. Here are the data:
Rating on depression scale:

              No exercise   Jogging for    Jogging for
                            20 minutes     60 minutes
              23            22             59
              26            27             66
              51            39             38
              49            29             49
              58            46             56
              37            48             60
              29            49             56
              44            65             62
Mean rating   39.63         40.63          55.75
(SD)          (12.85)       (14.23)        (8.73)
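The summary rows of the table can be checked with a few lines of Python. This is only a sketch: the dictionary keys and variable names are illustrative, and it assumes the SDs in the table are sample standard deviations (dividing by n - 1), which is what `statistics.stdev` computes.

```python
import statistics

# Depression ratings from the table above, one list per group.
groups = {
    "no exercise": [23, 26, 51, 49, 58, 37, 29, 44],
    "20 minutes": [22, 27, 39, 29, 46, 48, 49, 65],
    "60 minutes": [59, 66, 38, 49, 56, 60, 56, 62],
}

# Mean and sample standard deviation (n - 1 in the denominator) per group.
summary = {
    name: (statistics.mean(scores), statistics.stdev(scores))
    for name, scores in groups.items()
}

for name, (mean, sd) in summary.items():
    print(f"{name}: mean = {mean:.3f}, SD = {sd:.2f}")
```

The means come out as 39.625, 40.625 and 55.75, which the table reports to two decimal places.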
Step 1: Rank all of the scores, ignoring which group they belong to. The
procedure for ranking is as follows: the lowest score gets the lowest rank. If two
or more scores are the same then they are "tied". "Tied" scores get the average of
the ranks that they would have obtained, had they not been tied. Here are the scores
again, now with their ranks in brackets:
                    C1: no exercise   C2: jogging for 20 min   C3: jogging for 60 min
                    23 (2)            22 (1)                   59 (20)
                    26 (3)            27 (4)                   66 (24)
                    51 (16)           39 (9)                   38 (8)
                    49 (14)           29 (5.5)                 49 (14)
                    58 (19)           46 (11)                  56 (17.5)
                    37 (7)            48 (12)                  60 (21)
                    29 (5.5)          49 (14)                  56 (17.5)
                    44 (10)           65 (23)                  62 (22)
Mean rank           9.56              9.94                     18.00
(SD)                (6.25)            (6.84)                   (5.09)
Sum of ranks (Tc)   76.5              79.5                     144
In detail, this is how the ranks are arrived at for these scores.
(a) "22" is the lowest score. This gets a rank of 1.
(b) "23" is the next lowest score. This gets a rank of 2.
(c) "26" is the next lowest score. This gets a rank of 3.
(d) "27" is the next lowest score. This gets a rank of 4.
(e) There are two instances of "29". This is a "tie". They both get the average of the ranks that they
would have been allocated, had they been different from each other. So the next two ranks are 5 and 6.
The average of 5 and 6 is 11/2 = 5.5. Both instances of "29" therefore get a rank of 5.5.
(f) "37" is the next lowest score. This gets a rank of 7 (because we've just "used up" ranks 5 and 6).
(g) "38" is the next lowest score, and it gets a rank of 8.
(h) "39" is the next lowest score, and it gets a rank of 9.
(i) "44" gets a rank of 10, "46" gets a rank of 11, and "48" gets a rank of 12.
(j) There are three instances of "49", so this is another tie. They each get the average of the next three
unused ranks ( (13+14+15) / 3 = 14).
(k) "51" is the next lowest score, and it gets the next "unused" rank, which is 16.
(l) There are two instances of "56", so they get the average of the next two unused ranks ( (17+18) /2 =
17.5).
(m) "58" gets the next unused rank, which is 19.
(n) "59" gets a rank of 20, "60" gets 21, "62" gets 22, "65" gets 23, and 66 gets 24.
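The ranking procedure in (a) to (n) can be sketched in Python as follows. The function name `average_ranks` and the variable names are illustrative choices, not part of the original handout; the sketch pools all 24 scores and gives tied scores the average of the ranks they occupy.

```python
def average_ranks(scores):
    """Return ranks (1 = lowest); tied scores get the mean of the ranks they span."""
    # Indices of the scores, ordered from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` to cover every score equal to the one at `pos`.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Positions pos..end are 0-based, so they occupy ranks pos+1..end+1.
        mean_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


groups = {
    "no exercise": [23, 26, 51, 49, 58, 37, 29, 44],
    "20 minutes": [22, 27, 39, 29, 46, 48, 49, 65],
    "60 minutes": [59, 66, 38, 49, 56, 60, 56, 62],
}

# Rank all scores together, ignoring which group they belong to.
pooled = [score for scores in groups.values() for score in scores]
ranks = average_ranks(pooled)

# Tied scores share one rank, so a score -> rank lookup is unambiguous.
rank_of = dict(zip(pooled, ranks))
print(rank_of[29], rank_of[49], rank_of[56])
```

A useful check on any hand ranking: the ranks must add up to N(N + 1)/2, which is 24 x 25 / 2 = 300 here.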
Step 2: Find "Tc", the total of the ranks for each group. Just add together all of the
ranks for each group in turn. Here,
Tc1 (the rank total for the "no exercise" group) is 76.5.
Tc2 (the rank total for the "20 minutes" group) is 79.5.
Tc3 (the rank total for the "60 minutes" group) is 144.
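As a rough check on these totals, here is a minimal Python sketch. The helper `rank` is an illustrative name; it uses the shortcut that a score's average rank equals the number of lower scores plus half of (the number of equal scores + 1), which gives the same ranks as the step-by-step procedure in Step 1.

```python
groups = {
    "no exercise": [23, 26, 51, 49, 58, 37, 29, 44],
    "20 minutes": [22, 27, 39, 29, 46, 48, 49, 65],
    "60 minutes": [59, 66, 38, 49, 56, 60, 56, 62],
}
pooled = [score for scores in groups.values() for score in scores]


def rank(score):
    """Average rank of `score` among all pooled scores (ties share a rank)."""
    below = sum(s < score for s in pooled)
    equal = sum(s == score for s in pooled)  # includes the score itself
    return below + (equal + 1) / 2


# Tc: add together the ranks of the scores in each group.
rank_totals = {
    name: sum(rank(s) for s in scores) for name, scores in groups.items()
}
print(rank_totals)
```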
Step 3: Find "H".
$$H = \left[ \frac{12}{N(N+1)} \left( \sum \frac{T_c^2}{n_c} \right) \right] - 3(N+1)$$
N is the total number of participants (all groups combined). We have 24 participants (3 groups of
8).
Tc is the rank total for each group. Tc1 = 76.5, Tc2 = 79.5, and Tc3 = 144.
nc is the number of participants in each group. Here, nc1 = 8, nc2 = 8 and nc3 = 8.
For our data,
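$$H = \left[ \frac{12}{24(24+1)} \left( \sum \frac{T_c^2}{n_c} \right) \right] - 3(24+1)$$

$\sum \frac{T_c^2}{n_c}$ means the following:

$$\sum \frac{T_c^2}{n_c} = \frac{76.5^2}{8} + \frac{79.5^2}{8} + \frac{144^2}{8}$$

$$\sum \frac{T_c^2}{n_c} = 731.5313 + 790.0313 + 2592.0000 = 4113.5625$$

$$H = \left[ \frac{12}{600} (4113.5625) \right] - 75$$

$$H = 7.27$$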
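The arithmetic for H can be reproduced with a short Python sketch. The function name `kruskal_wallis_h` is illustrative; it implements the formula exactly as used in this worked example, with no adjustment for tied scores.

```python
def kruskal_wallis_h(rank_totals, group_sizes):
    """H = 12 / (N(N+1)) * sum(Tc^2 / nc) - 3(N+1), without any tie adjustment."""
    n_total = sum(group_sizes)  # N: all participants combined
    # Sum over groups of (rank total squared) / (group size).
    weighted = sum(t ** 2 / n for t, n in zip(rank_totals, group_sizes))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Rank totals (Tc) from Step 2 and the size of each group (nc).
h = kruskal_wallis_h([76.5, 79.5, 144], [8, 8, 8])
print(round(h, 2))
```

With these inputs the unrounded value is approximately 7.27125, which rounds to the 7.27 obtained by hand.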
Step 4: Find the degrees of freedom, which is the number of groups minus one. Here we have
three groups, and so we have 2 d.f.
Step 5:
Assessing the significance of H depends on the number of participants and the
number of groups. If you have three groups, with five or fewer participants in each group, then you
need to use the special table for small sample sizes (which is on my website).
If you have more than five participants per group, then treat H as Chi-Square. H is statistically
significant if it is equal to or larger than the critical value of Chi-Square for your particular d.f.
Here, we have eight participants per group, and so we treat H as Chi-Square. H is
7.27, with 2 d.f. Here's the relevant part of the Chi-Square table:
Table of critical Chi-Square values:
df    P = .05    P = .01    P = .001
1     3.84       6.63       10.83
2     5.99       9.21       13.82
3     7.82       11.35      16.27
Look along the row that corresponds to your number of degrees of freedom.
In this case, that is the row for 2 d.f. Compare the obtained value of H to each of the
critical values in that row, starting on the left-hand side and stopping once H is no longer
equal to or larger than the critical value. So here, we start by comparing our H of 7.27
to 5.99. With 2 degrees of freedom, a value of Chi-Square of 5.99 or larger occurs by
chance only 5 times in a hundred: i.e. it has a p of .05. Our obtained value of 7.27 is larger than
this, so a value as large as our H will occur by chance with a probability of less than 0.05.
Move on, and compare our H to the next value in the row, 9.21. A value of 9.21 or larger will occur
by chance one time in a hundred, i.e. with a p of .01. However, our H of 7.27 is smaller than 9.21.
This tells us that our H is not large enough to be significant at the 0.01 level: its probability
of occurring by chance is greater than 0.01.
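The table lookup can be sketched in Python as below. The variable names are illustrative, and the critical values are simply copied from the 2 d.f. row of the table. As an extra, the sketch also computes an approximate p-value using the fact that, for exactly 2 degrees of freedom, the upper-tail probability of Chi-Square is exp(-H/2); this shortcut does not apply to other degrees of freedom.

```python
import math

h = 7.27  # obtained value of H, with 2 d.f.

# Critical Chi-Square values for 2 d.f., copied from the table above.
critical_df2 = {0.05: 5.99, 0.01: 9.21, 0.001: 13.82}

# Significance levels whose critical value H equals or exceeds.
reached = [p for p, critical in critical_df2.items() if h >= critical]
print("significant at:", reached)

# For 2 d.f. only: P(Chi-Square >= h) = exp(-h / 2).
p_value = math.exp(-h / 2)
print(f"approximate p = {p_value:.3f}")
```

The p-value comes out at roughly 0.026, which agrees with the table lookup: smaller than .05 but larger than .01.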
Conclusion:
The likelihood of obtaining a value of H as large as the one we've found, purely by chance, is
somewhere between 0.05 and 0.01 - i.e. pretty unlikely, and so we would conclude that there is a
difference of some kind between our three groups. Note that the Kruskal-Wallis test merely tells
you that the groups differ in some way: you need to inspect the group means or medians to decide
precisely how they differ. However, in this particular case the interpretation seems fairly
straightforward: exercise does seem to reduce self-reported depression (higher ratings on this
scale mean feeling less depressed), but only in the case of participants who are doing an hour of
it. There seems to be little difference between those participants who took 20 minutes of
exercise per day and those who did not exercise at all.