Statistics
Mean, median, and mode are three kinds of "averages". There are many "averages" in statistics, but
these are, I think, the three most common, and are certainly the three you are most likely to encounter
in your pre-statistics courses, if the topic comes up at all.
The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the
number of numbers. The "median" is the "middle" value in the list of numbers. To find the median,
your numbers have to be listed in numerical order, so you may have to rewrite your list first. The
"mode" is the value that occurs most often. If no number is repeated, then there is no mode for the
list.
The "range" is just the difference between the largest and smallest values.
Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13
The mean is the usual average:
(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15
Note that the mean isn't a value from the original list. This is a common result. You should not
assume that your mean will be one of your original numbers.
The median is the middle value, so I'll have to rewrite the list in order:
13, 13, 13, 13, 14, 14, 16, 18, 21
There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number.
So the median is 14.
Copyright Elizabeth Stapel 2004-2011 All Rights Reserved
The mode is the number that is repeated more often than any other, so 13 is the mode.
The largest value in the list is 21, and the smallest is 13, so the range is 21 - 13 = 8.
mean: 15
median: 14
mode: 13
range: 8
Note: The formula for the place to find the median is "( [the number of data points] + 1) ÷ 2", but you
don't have to use this formula. You can just count in from both ends of the list until you meet in the
middle, if you prefer. Either way will work.
Find the mean, median, mode, and range for the following list of values:
1, 2, 4, 7
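The mean is the usual average: (1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5
The list has an even number of values, so the median is the average of the two middle values: (2 + 4) ÷ 2 = 6 ÷ 2 = 3
No value is repeated, so this list has no mode.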
The largest value in the list is 7, the smallest is 1, and their difference is 6, so the range is 6.
mean: 3.5
median: 3
mode: none
range: 6
The list values were whole numbers, but the mean was a decimal value. Getting a decimal value for
the mean (or for the median, if you have an even number of data points) is perfectly okay; don't round
your answers to try to match the format of the other numbers.
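If you'd like to check answers like these by computer, here is a minimal Python sketch. The function name summarize is made up for illustration, and the sketch follows the convention used above that a list with no repeated value has no mode (it returns an empty list in that case, and every tied value when several values share the top count).

```python
from collections import Counter
from statistics import median

def summarize(values):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    counts = Counter(values)
    top = max(counts.values())
    # Assumption: a value only counts as a mode if it is actually repeated.
    modes = sorted(v for v, c in counts.items() if c == top) if top > 1 else []
    return {
        "mean": sum(values) / len(values),
        "median": median(values),          # sorts the list internally
        "modes": modes,
        "range": max(values) - min(values),
    }

# The two worked examples from the text.
print(summarize([13, 18, 13, 14, 13, 16, 14, 21, 13]))
print(summarize([1, 2, 4, 7]))
```

Returning a list of modes rather than a single number keeps the "no mode" and "two modes" cases visible instead of hiding them.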
Find the mean, median, mode, and range for the following list of values:
The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5th
value; that is, I'll need to average the fifth and sixth numbers to find the median:
The mode is the number repeated most often. This list has two values that are repeated three
times.
While unusual, it can happen that two of the averages (the mean and the median, in this case) will
have the same value.
Note: Depending on your text or your instructor, the above data set may be viewed as having no
mode (rather than two modes), since no single solitary number was repeated more often than any
other. I've seen books that go either way; there doesn't seem to be a consensus on the "right"
definition of "mode" in the above case. So if you're not certain how you should answer the "mode" part
of the above example, ask your instructor before the next test.
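About the only hard part of finding the mean, median, and mode is keeping straight which "average" is
which. Just remember the following:
mean: the regular meaning of "average"
median: the middle value
mode: the value that occurs most often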
(In the above, I've used the term "average" rather casually. The technical definition of "average" is the
arithmetic mean: adding up the values and then dividing by the number of values. Since you're
probably more familiar with the concept of "average" than with "measure of central tendency", I used
the more comfortable term.)
A student has gotten the following grades on his tests: 87, 95, 76, and 88. He wants
an 85 or better overall. What is the minimum grade he must get on the last test in order
to achieve that average?
Let x be the grade on the fifth and last test. The five grades must average 85:
(87 + 95 + 76 + 88 + x) ÷ 5 = 85
87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79
He needs at least a 79 on the last test.
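The same algebra can be written as a tiny Python sketch. The function name required_score is hypothetical, and it assumes exactly one test remains and that all tests count equally.

```python
def required_score(scores, target_mean):
    """Score needed on one more equally weighted test to reach target_mean."""
    total_needed = target_mean * (len(scores) + 1)  # 85 * 5 = 425
    return total_needed - sum(scores)               # 425 - 346

print(required_score([87, 95, 76, 88], 85))  # 79
```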
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
The median is the middle value, which in our case is the 11th one, which is in the 61 - 65 group:
But if we want an estimated Median value we need to look more closely at the 61 - 65 group.
We call it "61 - 65", but it really includes values from 60.5 up to (but not including) 65.5.
Why? Well, the values are in whole seconds, so a real time of 60.5 is measured as 61. Likewise 65.4 is
measured as 65.
At 60.5 we already have 9 runners, and by the next boundary at 65.5 we have 17 runners. By drawing
a straight line in between we can pick out where the median frequency of n/2 runners is:
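And this handy formula does the calculation:
\[ \text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w \]
where:
L is the lower class boundary of the group containing the median
n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width
For our example:
L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5
\[ \text{Estimated Median} = 60.5 + \frac{(21/2) - 9}{8} \times 5 \]
= 60.5 + 0.9375
= 61.4375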
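As a quick check of that arithmetic, here is a minimal Python sketch of the estimated-median formula. The function and parameter names are made up for illustration; the values passed in are the ones from the race example.

```python
def estimated_median(lower_boundary, n, cum_freq_before, group_freq, width):
    """Estimated median for grouped data: L + ((n/2) - B) / G * w."""
    return lower_boundary + ((n / 2) - cum_freq_before) / group_freq * width

# Race example: L = 60.5, n = 21, B = 9, G = 8, w = 5
print(estimated_median(60.5, 21, 9, 8, 5))  # 61.4375
```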
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
The groups (51-55, 56-60, etc.), also called class intervals, are of width 5.
The midpoints are in the middle of each class: 53, 58, 63 and 68.
Think about the 7 runners in the group 56 - 60: all we know is that they ran somewhere between 56
and 60 seconds:
So we take an average and assume that all seven of them took 58 seconds.
Midpoint Frequency
53 2
58 7
63 8
68 4
Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8 people took 63 sec and 4 people took 68 sec". In
other words we imagine the data looks like this:
53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68
Then we add them all up and divide by 21. The quick way to do it is to multiply each midpoint by its
frequency:
Midpoint (x)   Frequency (f)   Midpoint × Frequency (fx)
53             2               106
58             7               406
63             8               504
68             4               272
Totals:        21              1288
And then our estimate of the mean time to complete the race is:
\[ \text{Estimated Mean} = \frac{1288}{21} = 61.333\ldots \]
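The midpoint-times-frequency shortcut is easy to sketch in Python. The function name estimated_mean is hypothetical; the midpoints and frequencies are the ones from the race table.

```python
def estimated_mean(midpoints, frequencies):
    """Estimated mean for grouped data: sum of (midpoint * frequency) / sum of frequencies."""
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    return total / sum(frequencies)

# Race example: 1288 / 21
print(round(estimated_mean([53, 58, 63, 68], [2, 7, 8, 4]), 3))  # 61.333
```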
Seconds Frequency
51 - 55 2
56 - 60 7
61 - 65 8
66 - 70 4
We can easily find the modal group (the group with the highest frequency), which is 61 - 65.
We can say "the modal group is 61 - 65".
But the actual Mode may not even be in that group! Or there may be more than one mode. Without the
raw data we don't really know.
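But we can estimate the Mode using the following formula:
\[ \text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w \]
where:
L is the lower class boundary of the modal group
f_{m-1} is the frequency of the group before the modal group
f_m is the frequency of the modal group
f_{m+1} is the frequency of the group after the modal group
w is the group width
In this example:
L = 60.5
f_{m-1} = 7
f_m = 8
f_{m+1} = 4
w = 5
\[ \text{Estimated Mode} = 60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5 \]
= 60.5 + (1/5) × 5
= 61.5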
(Compare that with the true Mean, Median and Mode of 61.38..., 61 and 62 that we got at the very
start.)
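Here is the estimated-mode formula as a minimal Python sketch. The function and parameter names are made up for illustration, and the sketch assumes the modal group has a neighbor on each side, as it does in the race example.

```python
def estimated_mode(lower_boundary, f_before, f_modal, f_after, width):
    """Estimated mode for grouped data, using the modal group and its two neighbors."""
    rise = f_modal - f_before   # how far the modal group stands above the group before it
    fall = f_modal - f_after    # how far it stands above the group after it
    return lower_boundary + (rise * width) / (rise + fall)

# Race example: L = 60.5, frequencies 7, 8, 4, width 5
print(estimated_mode(60.5, 7, 8, 4, 5))  # 61.5
```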
Example: You grew fifty baby carrots using special soil. You dig them up and measure their
lengths (to the nearest mm) and group the results:
Length (mm)   Frequency
150 - 154     5
155 - 159     2
160 - 164     6
165 - 169     8
170 - 174     9
175 - 179     11
180 - 184     6
185 - 189     3
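Mean
Length (mm)   Midpoint (x)   Frequency (f)   fx
150 - 154     152            5               760
155 - 159     157            2               314
160 - 164     162            6               972
165 - 169     167            8               1336
170 - 174     172            9               1548
175 - 179     177            11              1947
180 - 184     182            6               1092
185 - 189     187            3               561
Totals:                      50              8530
\[ \text{Estimated Mean} = \frac{8530}{50} = 170.6 \text{ mm} \]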
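Median
The Median is the mean of the 25th and the 26th length, so is in the 170 - 174 group:
L = 169.5 (the lower class boundary of the 170 - 174 group)
n = 50
B = 5 + 2 + 6 + 8 = 21
G = 9
w = 5
\[ \text{Estimated Median} = 169.5 + \frac{(50/2) - 21}{9} \times 5 \]
= 169.5 + 2.22...
= 171.7 mm (to 1 decimal)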
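Mode
The Modal group is the one with the highest frequency, which is 175 - 179:
L = 174.5 (the lower class boundary of the 175 - 179 group)
f_{m-1} = 9
f_m = 11
f_{m+1} = 6
w = 5
\[ \text{Estimated Mode} = 174.5 + \frac{11 - 9}{(11 - 9) + (11 - 6)} \times 5 \]
= 174.5 + 1.42...
= 175.9 mm (to 1 decimal)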
Age Example
Age is a special case.
When we say "Sarah is 17" she stays "17" up until her eighteenth birthday.
She might be 17 years and 364 days old and still be called "17".
Example: The ages of the 112 people who live on a tropical island are grouped as follows:
Age       Number
0 - 9     20
10 - 19   21
20 - 29   23
30 - 39   16
40 - 49   11
50 - 59   10
60 - 69   7
70 - 79   3
80 - 89   1
A child in the first group 0 - 9 could be almost 10 years old. So the midpoint for this group is 5, not 4.5.
The midpoints are 5, 15, 25, 35, 45, 55, 65, 75 and 85.
Similarly, in the calculations of Median and Mode, we will use the class boundaries 0, 10, 20, etc.
Mean
Age       Midpoint (x)   Number (f)   fx
0 - 9     5              20           100
10 - 19   15             21           315
20 - 29   25             23           575
30 - 39   35             16           560
40 - 49   45             11           495
50 - 59   55             10           550
60 - 69   65             7            455
70 - 79   75             3            225
80 - 89   85             1            85
Totals:                  112          3360
\[ \text{Estimated Mean} = \frac{3360}{112} = 30 \]
Median
The Median is the mean of the ages of the 56th and the 57th people, so is in the 20 - 29 group:
L = 20 (the lower class boundary of the class interval containing the median)
n = 112
B = 20 + 21 = 41
G = 23
w = 10
\[ \text{Estimated Median} = 20 + \frac{(112/2) - 41}{23} \times 10 \]
= 20 + 6.52...
= 26.5 (to 1 decimal)
Mode
The Modal group is the one with the highest frequency, which is 20 - 29:
\[ \text{Estimated Mode} = 20 + \frac{23 - 21}{(23 - 21) + (23 - 16)} \times 10 \]
= 20 + 2.22...
= 22.2 (to 1 decimal)
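The only thing that changes between the carrot example and the age example is where the class boundaries sit, so one way to handle both is to pass the boundaries in directly. The Python sketch below does that; the function name grouped_estimates is hypothetical. It assumes the groups are listed in order, that the modal group is strictly higher than at least one neighbor, and (as a further assumption not stated in the text) that a missing neighbor at either end counts as frequency 0.

```python
def grouped_estimates(boundaries, frequencies):
    """Estimate (mean, median, mode) from class boundaries and frequencies.

    boundaries has one more entry than frequencies: group i runs from
    boundaries[i] up to boundaries[i + 1].
    """
    n = sum(frequencies)
    groups = list(zip(boundaries, boundaries[1:], frequencies))

    # Mean: pretend every value in a group sits at the group's midpoint.
    mean = sum((lo + hi) / 2 * f for lo, hi, f in groups) / n

    # Median: find the group where the running total reaches n/2.
    before = 0
    for lo, hi, f in groups:
        if before + f >= n / 2:
            median = lo + (n / 2 - before) / f * (hi - lo)
            break
        before += f

    # Mode: use the first group with the highest frequency and its neighbors.
    m = frequencies.index(max(frequencies))
    f_m = frequencies[m]
    f_prev = frequencies[m - 1] if m > 0 else 0                     # assumption: 0 at the ends
    f_next = frequencies[m + 1] if m + 1 < len(frequencies) else 0
    lo, hi = boundaries[m], boundaries[m + 1]
    mode = lo + (f_m - f_prev) * (hi - lo) / ((f_m - f_prev) + (f_m - f_next))

    return mean, median, mode

# Ages: boundaries 0, 10, ..., 90
ages = grouped_estimates(list(range(0, 91, 10)), [20, 21, 23, 16, 11, 10, 7, 3, 1])
print([round(v, 1) for v in ages])     # [30.0, 26.5, 22.2]

# Carrots: lengths to the nearest mm, so boundaries 149.5, 154.5, ..., 189.5
carrots = grouped_estimates([149.5 + 5 * i for i in range(9)], [5, 2, 6, 8, 9, 11, 6, 3])
print([round(v, 1) for v in carrots])  # [170.6, 171.7, 175.9]
```

Passing boundaries rather than the printed labels keeps the "age is a special case" decision in the data, not buried in the formula.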
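Summary
For grouped data, we cannot find the exact Mean, Median and Mode; we can only
give estimates.
To estimate the Mean, use the midpoints of the class intervals:
\[ \text{Estimated Mean} = \frac{\sum (\text{midpoint} \times \text{frequency})}{\sum \text{frequency}} \]
To estimate the Median, use:
\[ \text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w \]
where:
L is the lower class boundary of the group containing the median
n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width
To estimate the Mode, use:
\[ \text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w \]
where:
L is the lower class boundary of the modal group
f_{m-1} is the frequency of the group before the modal group
f_m is the frequency of the modal group
f_{m+1} is the frequency of the group after the modal group
w is the group width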
For various important reasons we'll see as we get further into this course, we often want to know
not only what the central tendency is in a set of scores or values (i.e., the mean, the median, or the
mode), but also how bunched up or spread out the scores are. The most widely used
indicator of dispersion is the standard deviation, which, in a nutshell, is based on the deviation of
each score from the mean.
To illustrate, compare the distribution of test scores in Figures 4 and 5. The first is flat and spread
out, while the second is concentrated and bunched up closely around the mean.
Figure 4
Graphic Display of Flat or Spread-Out Score Distribution
Figure 5
Display of a Narrow or Concentrated Distribution
Note that the mean and median of these two quite different distributions are the same (Mean = 150,
Mdn = 150), so simply calculating and reporting those two measures of central tendency would fail
to reveal how different the dispersion of scores is between the two groups. But we can do this by
calculating the standard deviation.
The standard deviation provides us with a measure of just how spread out the scores are: a high
standard deviation means the scores are widely spread; a low standard deviation means they're
bunched up closely on either side of the mean.
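We'll now calculate the standard deviation for both these distributions. The formula for the
standard deviation is:
\[ \text{SD} = \sqrt{\frac{\sum f d^2}{N}} \]
Where:
SD is the standard deviation
f is the frequency of each score
d is the deviation of each score from the mean (X minus the mean)
N is the total number of scores (the sum of the frequencies)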
The numbers we need to calculate the standard deviation for Figure 4, the flat distribution, are in
Table 6.
Table 6
Data for Figure 4, the Flat Distribution
A                B               C              D       E
Test Score (X)   Frequency (f)   X - Mean (d)   fd      fd²
100              8               -50            -400    20,000
110              13              -40            -520    20,800
120              17              -30            -510    15,300
130              20              -20            -400    8,000
140              21              -10            -210    2,100
150              22              0              0       0
160              21              10             210     2,100
170              20              20             400     8,000
180              17              30             510     15,300
190              13              40             520     20,800
200              8               50             400     20,000
SUM              180                                    132,400
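Column B shows how many people got each test score (f).
Column C is the test score minus the mean (X minus the mean or d).
Column D is the frequency times the deviation (f times d).
Column E is the frequency times the squared deviation (f times d squared).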
Of course, to get the deviation of each score from the mean (column C), we have to calculate the
mean, and you already know how to do that. We now have what we need to calculate the standard
deviation for the flat distribution in Figure 4:
\[ \text{SD} = \sqrt{\frac{132{,}400}{180}} \]
or
\[ \text{SD} = \sqrt{736} \approx 27 \]
You can do the last part of this calculation, the square root of 132,400/180 (which is about 736), by using
the square-root button on your little hand calculator.
Now let's compute the standard deviation for the data in Figure 5. The data are in Table 7, and you
follow the same steps we've just completed.
Table 7
Example of a Narrow or Concentrated Distribution
A                B               C              D       E
Test Score (X)   Frequency (f)   X - Mean (d)   fd      fd²
100              0               -50            0       0
110              0               -40            0       0
120              0               -30            0       0
130              10              -20            -200    4,000
140              45              -10            -450    4,500
150              70              0              0       0
160              45              10             450     4,500
170              10              20             200     4,000
180              0               30             0       0
190              0               40             0       0
200              0               50             0       0
SUM              180                                    17,000
\[ \text{SD} = \sqrt{\frac{17{,}000}{180}} \]
or
\[ \text{SD} = \sqrt{94.4} \approx 10 \]
The two standard deviations provide a statistical indication of how different the distributions
are: 27 for the spread-out distribution and 10 for the bunched-up distribution.
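Both results can be reproduced with a short Python sketch that works straight from the score and frequency columns. The function name frequency_sd is made up for illustration, and it divides by the total number of scores (180), matching the calculations above.

```python
from math import sqrt

def frequency_sd(scores, frequencies):
    """Standard deviation from a frequency table: sqrt(sum(f * d^2) / N)."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(scores, frequencies)) / n
    sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(scores, frequencies))
    return sqrt(sum_fd2 / n)

scores = list(range(100, 201, 10))                   # 100, 110, ..., 200
flat = [8, 13, 17, 20, 21, 22, 21, 20, 17, 13, 8]    # Table 6
narrow = [0, 0, 0, 10, 45, 70, 45, 10, 0, 0, 0]      # Table 7

print(round(frequency_sd(scores, flat)))    # 27
print(round(frequency_sd(scores, narrow)))  # 10
```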
So once we know the mean and median, why do we need to know the standard deviation? What
use is it?
The standard deviation is important because, regardless of the mean, it makes a great deal of
difference whether the distribution is spread out over a broad range or bunched up closely around
the mean. For example, suppose you have two classes whose mean reading scores are the same.
With only that information, you would be inclined to teach the two classes in the same way. But
suppose you discover that the standard deviation of one of the classes is 27 and the other is 10, as
in the examples we just finished working with. That means that in the first class (the one
where the standard deviation is 27), you have many students throughout the entire range of performance. You'll need
to have teaching strategies for both the gifted and the challenged. But in the second class (the one
where the standard deviation is 10), you don't have any gifted or challenged students. They're all average, and your
teaching strategy will be entirely different.
This is one of the most important parts of this course in basic statistics. Here we're going to learn
about testing the significance of the difference between means. What does that mean?
Suppose you're the superintendent, and one of your principals bursts into your office
enthusiastically and says, "I know you'll be happy to learn that after our big effort this year in
reading, my third graders improved from 187 to 195 on the state reading test!"
You immediately ask her, "Is the 8-point difference between those means statistically significant?"
When her eyes glaze over and she says, "Huh?" you smile forbearingly (because you've taken
this course in basic statistics, and she hasn't), and you patiently explain to her that simply because
there is a numerical difference between last year's and this year's mean scores doesn't mean that
there is a real difference. It could be due to chance variation in the scores.
So how do we know when the difference between two means is probably a real difference, not one
due to chance? We have to say "probably" because nothing in statistics is absolutely certain (as is
the case with most things in life). But there are statistical tests which can tell us how likely it is that a
difference between two means is due to chance.
One of the most widely used statistical methods for testing the difference between means, and the
one we're going to get you up to speed on, is called the t-test.
Let's go back to the salary data we worked with in Table 1 of Lesson 1, but now let's compare the
mean salary of that group with another group, and ask whether the mean salaries of the two
groups are significantly different.
First, let's look at the formula for the t-test, and determine what we need to make the computation:
Where:
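s_1^2 = the variance for Group 1.
The only thing in this formula you're not familiar with is the symbol s^2, which stands for the
variance. The variance is the same as the standard deviation without the square root, i.e., it's
nothing more than the sum of the squared deviations of all the scores from the mean divided by n - 1.
\[ s^2 = \frac{\sum f d^2}{n - 1} \]
Where:
f is the frequency of each score
d is the deviation of each score from the mean (X minus the mean)
n is the number of scores in the group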
But for now, we'll test the significance of the difference between the mean salaries of two different groups. You
can try the one for dependent samples on your own. (I knew you'd welcome that opportunity.)
Tables 8 and 9 provide the numbers we need to compute the t-test for the difference in mean salaries of the
two groups.
Table 8
Salaries and t-Test Calculation Data for Group 1
A            B               C              D      E
Salary (X)   Frequency (f)   X - Mean (d)   fd     fd²
20           1               -25            -25    625
25           2               -20            -40    800
30           3               -15            -45    675
35           4               -10            -40    400
40           5               -5             -25    125
45           6               0              0      0
50           5               5              25     125
55           4               10             40     400
60           3               15             45     675
65           2               20             40     800
70           1               25             25     625
SUM          36                                    5,250
Table 9
Salaries and t-Test Calculation Data for Group 2
A            B               C              D      E
Salary (X)   Frequency (f)   X - Mean (d)   fd     fd²
20           0               -27            0      0
25           2               -22            -44    968
30           3               -17            -51    867
35           3               -12            -36    432
40           4               -7             -28    196
45           6               -2             -12    24
50           6               3              18     54
55           5               8              40     320
60           3               13             39     507
65           2               18             36     648
70           2               23             46     1,058
SUM          36                                    5,074
You can see from a quick inspection of the two tables that the salary distributions are similar.
There are a few more people making higher salaries in the second group. The mean of the second group (which has been
calculated for you) is slightly higher (47 vs. 45 for the first group). And the variance is smaller (145
vs. 150). So let's plug the numbers into the t-test formula and see what we get.
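The two variances quoted here can be checked with a minimal Python sketch that works from the salary and frequency columns of Tables 8 and 9. The function name frequency_variance is made up for illustration. It uses each group's exact mean (Table 9 uses the rounded mean of 47), so the Group 2 result differs slightly in the decimals but still rounds to 145.

```python
def frequency_variance(values, frequencies):
    """Variance from a frequency table: sum(f * d^2) / (n - 1)."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies))
    return sum_fd2 / (n - 1)

salaries = list(range(20, 71, 5))                 # 20, 25, ..., 70
group1 = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]        # Table 8
group2 = [0, 2, 3, 3, 4, 6, 6, 5, 3, 2, 2]        # Table 9

print(round(frequency_variance(salaries, group1)))  # 150
print(round(frequency_variance(salaries, group2)))  # 145
```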
We now know that t = .222. So what does that mean? Is the difference between the two means
statistically significant or not? To find out whether a t value is significant or not, we
simply look it up in a table that can be found in the appendices of any statistics textbook. The
quick answer in this case is no, it is not statistically significant. That is, the 2-point difference in
the mean salaries of these two groups could likely have occurred by chance.
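The t formula itself is missing from this copy of the text, so the following Python sketch assumes one common form for two independent samples: the difference between the means divided by the square root of (variance 1 / n1 + variance 2 / n2). The function name t_statistic is made up for illustration. With the summary figures quoted above (means 45 and 47, variances 150 and 145, 36 people per group), this assumed form gives a t of about 0.70 in absolute value rather than .222. Both are far below the critical value of roughly 2.0 at the .05 level for samples of this size, so the conclusion is the same either way.

```python
from math import sqrt

def t_statistic(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t under the assumed form: (mean1 - mean2) / sqrt(var1/n1 + var2/n2)."""
    standard_error = sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error

# Summary figures for Group 1 and Group 2 from the text
t = t_statistic(45, 150, 36, 47, 145, 36)
print(round(t, 2))  # -0.7
```

The sign only reflects which group is listed first; it is the size of t that gets compared with the table value.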