Statistics

Uploaded by

Aditya Nanda

Introduction to Statistics

INTRODUCTION TO MEAN, MEDIAN, MODE

Mean, median, and mode are three kinds of "averages". There are many "averages" in statistics, but
these are, I think, the three most common, and are certainly the three you are most likely to encounter
in your pre-statistics courses, if the topic comes up at all.

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the
number of numbers. The "median" is the "middle" value in the list of numbers. To find the median,
your numbers have to be listed in numerical order, so you may have to rewrite your list first. The
"mode" is the value that occurs most often. If no number is repeated, then there is no mode for the
list.

The "range" is just the difference between the largest and smallest values.

Find the mean, median, mode, and range for the following list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean isn't a value from the original list. This is a common result. You should not
assume that your mean will be one of your original numbers.

The median is the middle value, so I'll have to rewrite the list in order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th
number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

Copyright Elizabeth Stapel 2004-2011 All Rights Reserved

The mode is the number that is repeated more often than any other, so 13 is the mode.

The largest value in the list is 21, and the smallest is 13, so the range is 21 − 13 = 8.

mean: 15
median: 14
mode: 13
range: 8

Note: The formula for the place to find the median is "( [the number of data points] + 1) ÷ 2", but you
don't have to use this formula. You can just count in from both ends of the list until you meet in the
middle, if you prefer. Either way will work.
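
The same four quantities can be checked with a few lines of Python. This is only a minimal sketch using the standard `statistics` module (Python 3.8 or later for `multimode`); the variable names are illustrative and not part of the original lesson.

```python
from statistics import mean, median, multimode

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

avg = mean(values)                  # add everything up, divide by how many there are
mid = median(values)                # middle value of the sorted list
modes = multimode(values)           # every value tied for "most often", as a list
spread = max(values) - min(values)  # range: largest minus smallest

print(avg, mid, modes, spread)
```

Note that `median` sorts the list internally, so the values do not have to be put in order first.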
Find the mean, median, mode, and range for the following list of values:

1, 2, 4, 7

The mean is the usual average:

(1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5

The median is the middle number. In this example, the numbers are already listed in numerical order,
so I don't have to rewrite the list. But there is no "middle" number, because there is an even number
of numbers. In this case, the median is the mean (the usual average) of the middle two values:

(2 + 4) ÷ 2 = 6 ÷ 2 = 3

The mode is the number that is repeated most often, but all the numbers in this list appear only once,
so there is no mode.

The largest value in the list is 7, the smallest is 1, and their difference is 6, so the range is 6.

mean: 3.5
median: 3
mode: none
range: 6

The list values were whole numbers, but the mean was a decimal value. Getting a decimal value for
the mean (or for the median, if you have an even number of data points) is perfectly okay; don't round
your answers to try to match the format of the other numbers.

Find the mean, median, mode, and range for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

The mean is the usual average:

(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10 = 105 ÷ 10 = 10.5

The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5th
value; that is, I'll need to average the fifth and sixth numbers to find the median:

(10 + 11) ÷ 2 = 21 ÷ 2 = 10.5

The mode is the number repeated most often. This list has two values that are repeated three
times: 10 and 11.

The largest value is 13 and the smallest is 8, so the range is 13 − 8 = 5.


mean: 10.5
median: 10.5
modes: 10 and 11
range: 5

While unusual, it can happen that two of the averages (the mean and the median, in this case) will
have the same value.

Note: Depending on your text or your instructor, the above data set may be viewed as having no
mode (rather than two modes), since no single solitary number was repeated more often than any
other. I've seen books that go either way; there doesn't seem to be a consensus on the "right"
definition of "mode" in the above case. So if you're not certain how you should answer the "mode" part
of the above example, ask your instructor before the next test.
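
Because the conventions differ, it can help to make the chosen convention explicit. The sketch below is one possible way to do that in Python: it returns every value tied for most frequent, and an empty list when nothing repeats. The function name `modes` and the choice to report "no mode" as an empty list are assumptions made for illustration.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest count, or [] if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every value appears once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

two_modes = modes([8, 9, 10, 10, 10, 11, 11, 11, 12, 13])
no_mode = modes([1, 2, 4, 7])
print(two_modes, no_mode)
```

If your instructor prefers the "no mode" reading for tied lists, the last line of the function could instead return an empty list whenever more than one value shares the top count.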

About the only hard part of finding the mean, median, and mode is keeping straight which "average" is
which. Just remember the following:

mean: regular meaning of "average"


median: middle value
mode: most often

(In the above, I've used the term "average" rather casually. The technical definition of "average" is the
arithmetic mean: adding up the values and then dividing by the number of values. Since you're
probably more familiar with the concept of "average" than with "measure of central tendency", I used
the more comfortable term.)

A student has gotten the following grades on his tests: 87, 95, 76, and 88. He wants
an 85 or better overall. What is the minimum grade he must get on the last test in order
to achieve that average?

The unknown score is "x". Then the desired average is:

(87 + 95 + 76 + 88 + x) ÷ 5 = 85

Multiplying through by 5 and simplifying, I get:

87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79

He needs to get at least a 79 on the last test.
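
The algebra above amounts to "target average times number of tests, minus the points already earned". A minimal Python sketch of that step follows; the function name `needed_score` and its parameters are hypothetical names chosen for this example.

```python
def needed_score(scores_so_far, target_average, total_tests):
    """Smallest score on the one remaining test that reaches the target average."""
    required_total = target_average * total_tests   # 85 * 5 = 425
    return required_total - sum(scores_so_far)      # 425 - 346

needed = needed_score([87, 95, 76, 88], target_average=85, total_tests=5)
print(needed)
```

The sketch assumes exactly one test is left; with several tests remaining, the result would be the total number of points still needed rather than a single score.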



Estimating the Median from Grouped Data


Let's look at our data again:

Seconds Frequency

51 - 55 2

56 - 60 7

61 - 65 8

66 - 70 4

The median is the middle value, which in our case is the 11th one, which is in the 61 - 65 group:

We can say "the median group is 61 - 65"

But if we want an estimated Median value we need to look more closely at the 61 - 65 group.

We call it "61 - 65", but it really includes values from 60.5 up to (but not including) 65.5.

Why? Well, the values are in whole seconds, so a real time of 60.5 is measured as 61. Likewise 65.4 is
measured as 65.

At 60.5 we already have 9 runners, and by the next boundary at 65.5 we have 17 runners. By drawing
a straight line in between we can pick out where the median frequency of n/2 runners is.

And this handy formula does the calculation:

$$\text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w$$

where:

L is the lower class boundary of the group containing the median


n is the total number of values
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width

For our example:

L = 60.5
n = 21
B=2+7=9
G=8
w=5

$$\text{Estimated Median} = 60.5 + \frac{(21/2) - 9}{8} \times 5$$

= 60.5 + 0.9375

= 61.4375
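
As a quick check, the formula can be written as a small Python function. This is a sketch only; the function and parameter names are made up for illustration and simply mirror the symbols L, n, B, G and w defined above.

```python
def estimated_median(lower_boundary, n, cum_before, group_freq, width):
    """L + ((n/2) - B) / G * w for the group that contains the median."""
    return lower_boundary + (n / 2 - cum_before) / group_freq * width

runners_median = estimated_median(lower_boundary=60.5, n=21,
                                  cum_before=9, group_freq=8, width=5)
print(runners_median)  # 61.4375
```

The fraction `(n / 2 - cum_before) / group_freq` is how far into the median group we have to go, as a share of that group's runners; multiplying by the width turns that share into seconds.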

Estimating the Mean from Grouped Data


So all we have left is:

Seconds Frequency
51 - 55 2

56 - 60 7

61 - 65 8

66 - 70 4

The groups (51-55, 56-60, etc.), also called class intervals, are of width 5.
The midpoints are in the middle of each class: 53, 58, 63 and 68.

We can estimate the Mean by using the midpoints.

So, how does this work?

Think about the 7 runners in the group 56 - 60: all we know is that they ran somewhere between 56
and 60 seconds:

Maybe all seven of them did 56 seconds,


Maybe all seven of them did 60 seconds,
But it is more likely that there is a spread of numbers: some at 56, some at 57, etc

So we take an average and assume that all seven of them took 58 seconds.

Let's now make the table using midpoints:

Midpoint Frequency

53 2

58 7

63 8

68 4

Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8 people took 63 sec and 4 took 68 sec". In
other words we imagine the data looks like this:

53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68

Then we add them all up and divide by 21. The quick way to do it is to multiply each midpoint by its
frequency:

Midpoint (x)    Frequency (f)    Midpoint × Frequency (fx)
53              2                106
58              7                406
63              8                504
68              4                272
Totals:         21               1288

And then our estimate of the mean time to complete the race is:

$$\text{Estimated Mean} = \frac{1288}{21} = 61.333\ldots$$

Very close to the exact answer we got earlier.
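
The midpoint method is easy to express in code. The sketch below assumes each group is stored as a (lowest value, highest value, frequency) triple; that layout and the name `estimated_mean` are choices made for this example, not part of the original page.

```python
def estimated_mean(groups):
    """Treat every value in a group as if it sat at the group's midpoint."""
    total_f = sum(f for _, _, f in groups)
    total_fx = sum((low + high) / 2 * f for low, high, f in groups)
    return total_fx / total_f

runners = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]
runners_mean = estimated_mean(runners)
print(round(runners_mean, 3))  # 61.333
```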

Estimating the Mode from Grouped Data


Again, looking at our data:

Seconds Frequency

51 - 55 2

56 - 60 7

61 - 65 8

66 - 70 4

We can easily find the modal group (the group with the highest frequency), which is 61 - 65
We can say "the modal group is 61 - 65"

But the actual Mode may not even be in that group! Or there may be more than one mode. Without the
raw data we don't really know.

But, we can estimate the Mode using the following formula:

$$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$

where:

L is the lower class boundary of the modal group
$f_{m-1}$ is the frequency of the group before the modal group
$f_m$ is the frequency of the modal group
$f_{m+1}$ is the frequency of the group after the modal group
w is the group width

In this example:

L = 60.5
$f_{m-1}$ = 7
$f_m$ = 8
$f_{m+1}$ = 4
w = 5

$$\text{Estimated Mode} = 60.5 + \frac{8 - 7}{(8 - 7) + (8 - 4)} \times 5$$

= 60.5 + (1/5) × 5

= 61.5

Our final result is:

Estimated Mean: 61.333...


Estimated Median: 61.4375
Estimated Mode: 61.5

(Compare that with the true Mean, Median and Mode of 61.38..., 61 and 62 that we got at the very
start.)
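
The mode formula can be sketched in Python in the same way. The names below are illustrative only and follow the symbols used in the formula (the lower boundary, the three frequencies and the group width).

```python
def estimated_mode(lower_boundary, f_before, f_modal, f_after, width):
    """L + (fm - fm-1) / ((fm - fm-1) + (fm - fm+1)) * w for the modal group."""
    rise = f_modal - f_before   # how much the modal group exceeds the group before it
    fall = f_modal - f_after    # how much it exceeds the group after it
    return lower_boundary + rise / (rise + fall) * width

runners_mode = estimated_mode(lower_boundary=60.5, f_before=7,
                              f_modal=8, f_after=4, width=5)
print(runners_mode)
```

One caveat of this sketch: if the modal group and both of its neighbours have the same frequency, `rise + fall` is zero and the division fails, so such a table would need separate handling.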

And that is how it is done.


Now let us look at two more examples, and get some more practice along the way!

Baby Carrots Example

Example: You grew fifty baby carrots using special soil. You dig them up and measure their
lengths (to the nearest mm) and group the results:

Length (mm) Frequency

150 - 154 5

155 - 159 2

160 - 164 6

165 - 169 8

170 - 174 9

175 - 179 11

180 - 184 6

185 - 189 3

Mean

Length (mm)    Midpoint (x)    Frequency (f)    fx
150 - 154      152             5                760
155 - 159      157             2                314
160 - 164      162             6                972
165 - 169      167             8                1336
170 - 174      172             9                1548
175 - 179      177             11               1947
180 - 184      182             6                1092
185 - 189      187             3                561
Totals:                        50               8530

$$\text{Estimated Mean} = \frac{8530}{50} = 170.6 \text{ mm}$$

Median

The Median is the mean of the 25th and the 26th length, so is in the 170 - 174 group:

L = 169.5 (the lower class boundary of the 170 - 174 group)


n = 50
B = 5 + 2 + 6 + 8 = 21
G=9
w=5

$$\text{Estimated Median} = 169.5 + \frac{(50/2) - 21}{9} \times 5$$

= 169.5 + 2.22...

= 171.7 mm (to 1 decimal)

Mode

The Modal group is the one with the highest frequency, which is 175 - 179:

L = 174.5 (the lower class boundary of the 175 - 179 group)


$f_{m-1}$ = 9
$f_m$ = 11
$f_{m+1}$ = 6
w = 5

$$\text{Estimated Mode} = 174.5 + \frac{11 - 9}{(11 - 9) + (11 - 6)} \times 5$$

= 174.5 + 1.42...

= 175.9 mm (to 1 decimal)
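
All three carrot estimates can be reproduced from the frequency table in one pass. The following Python sketch locates the median group and the modal group automatically. It assumes a non-empty table with equal-width groups stored as (lowest value, highest value, frequency) triples, and a `gap` of 1 between neighbouring groups because the lengths were measured to the nearest mm; the function name and data layout are hypothetical.

```python
def grouped_estimates(groups, gap=1.0):
    """Return (mean, median, mode) estimates for equal-width grouped data."""
    n = sum(f for _, _, f in groups)
    width = groups[0][1] - groups[0][0] + gap   # e.g. 154 - 150 + 1 = 5

    # Mean: pretend every value in a group sits at the group's midpoint.
    mean = sum((low + high) / 2 * f for low, high, f in groups) / n

    # Median: walk the cumulative frequency until it reaches n/2.
    before = 0
    for low, _, f in groups:
        if before + f >= n / 2:
            median = (low - gap / 2) + (n / 2 - before) / f * width
            break
        before += f

    # Mode: interpolate inside the group with the highest frequency.
    freqs = [f for _, _, f in groups]
    m = freqs.index(max(freqs))                 # first group with the top frequency
    f_before = freqs[m - 1] if m > 0 else 0
    f_after = freqs[m + 1] if m + 1 < len(freqs) else 0
    rise, fall = freqs[m] - f_before, freqs[m] - f_after
    mode = (groups[m][0] - gap / 2) + rise / (rise + fall) * width

    return mean, median, mode


carrots = [(150, 154, 5), (155, 159, 2), (160, 164, 6), (165, 169, 8),
           (170, 174, 9), (175, 179, 11), (180, 184, 6), (185, 189, 3)]
c_mean, c_median, c_mode = grouped_estimates(carrots)
print(round(c_mean, 1), round(c_median, 1), round(c_mode, 1))  # 170.6 171.7 175.9
```

Treating a missing neighbour as frequency 0 when the modal group is the first or last group is a design choice of this sketch, not something the worked example needs.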

Age Example
Age is a special case.

When we say "Sarah is 17" she stays "17" up until her eighteenth birthday.
She might be 17 years and 364 days old and still be called "17".

This changes the midpoints and class boundaries.

Example: The ages of the 112 people who live on a tropical island are grouped as follows:

Age Number

0-9 20

10 - 19 21

20 - 29 23

30 - 39 16

40 - 49 11

50 - 59 10

60 - 69 7

70 - 79 3

80 - 89 1

A child in the first group 0 - 9 could be almost 10 years old. So the midpoint for this group is 5, not 4.5.

The midpoints are 5, 15, 25, 35, 45, 55, 65, 75 and 85.

Similarly, in the calculations of Median and Mode, we will use the class boundaries 0, 10, 20, etc.

Mean
Age        Midpoint (x)    Number (f)    fx
0 - 9      5               20            100
10 - 19    15              21            315
20 - 29    25              23            575
30 - 39    35              16            560
40 - 49    45              11            495
50 - 59    55              10            550
60 - 69    65              7             455
70 - 79    75              3             225
80 - 89    85              1             85
Totals:                    112           3360

$$\text{Estimated Mean} = \frac{3360}{112} = 30$$

Median

The Median is the mean of the ages of the 56th and the 57th people, so is in the 20 - 29 group:

L = 20 (the lower class boundary of the class interval containing the median)
n = 112
B = 20 + 21 = 41
G = 23
w = 10

$$\text{Estimated Median} = 20 + \frac{(112/2) - 41}{23} \times 10$$

= 20 + 6.52...
= 26.5 (to 1 decimal)

Mode

The Modal group is the one with the highest frequency, which is 20 - 29:

L = 20 (the lower class boundary of the modal class)


$f_{m-1}$ = 21
$f_m$ = 23
$f_{m+1}$ = 16
w = 10

$$\text{Estimated Mode} = 20 + \frac{23 - 21}{(23 - 21) + (23 - 16)} \times 10$$

= 20 + 2.22...

= 22.2 (to 1 decimal)
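
The age convention only changes where the boundaries sit: a group written "0 - 9" runs from 0 up to (but not including) 10. The Python sketch below shows that adjustment for the midpoints, the mean and the median. The tuple layout and variable names are assumptions made for this example.

```python
# (lowest age, highest age, number of people) for each group on the island.
ages = [(0, 9, 20), (10, 19, 21), (20, 29, 23), (30, 39, 16), (40, 49, 11),
        (50, 59, 10), (60, 69, 7), (70, 79, 3), (80, 89, 1)]

n = sum(f for _, _, f in ages)

# Someone aged "9" can be almost 10, so the group really ends at high + 1.
midpoints = [(low + high + 1) / 2 for low, high, _ in ages]
est_mean = sum(mid * f for mid, (_, _, f) in zip(midpoints, ages)) / n

# Median: the lower class boundary is simply the lowest age in the group.
before = 0
for low, high, f in ages:
    if before + f >= n / 2:
        est_median = low + (n / 2 - before) / f * (high + 1 - low)
        break
    before += f

print(midpoints[:3], est_mean, round(est_median, 1))  # [5.0, 15.0, 25.0] 30.0 26.5
```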

Summary
For grouped data, we cannot find the exact Mean, Median and Mode; we can only
give estimates.

To estimate the Mean use the midpoints of the class intervals.

$$\text{Estimated Median} = L + \frac{(n/2) - B}{G} \times w$$

where:

L is the lower class boundary of the group containing the median


n is the total number of data
B is the cumulative frequency of the groups before the median group
G is the frequency of the median group
w is the group width

$$\text{Estimated Mode} = L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w$$

where:

L is the lower class boundary of the modal group
$f_{m-1}$ is the frequency of the group before the modal group
$f_m$ is the frequency of the modal group
$f_{m+1}$ is the frequency of the group after the modal group
w is the group width

A Measure of Dispersion: The Standard Deviation

For various important reasons we'll see as we get further into this course, we often want to know
not only what the central tendency is in a set of scores or values (i.e., the mean, the median, or the
mode), but also how bunched up or spread out the scores are. The most widely used
indicator of dispersion is the standard deviation, which, in a nutshell, is based on the deviation of
each score from the mean.

To illustrate, compare the distribution of test scores in Figures 4 and 5. The first is flat and spread
out, while the second is concentrated and bunched up closely around the mean.

Figure 4
Graphic Display of Flat or Spread-Out Score Distribution

Figure 5
Display of a Narrow or Concentrated Distribution

Note that the mean and median of these two quite different distributions are the same ($\bar{X}$ = 150,
Mdn = 150), so simply calculating and reporting those two measures of central tendency would fail
to reveal how different the dispersion of scores is between the two groups. But we can do this by
calculating the standard deviation.

The standard deviation provides us with a measure of just how spread out the scores are: a high
standard deviation means the scores are widely spread; a low standard deviation means they're
bunched up closely on either side of the mean.

We'll now calculate the standard deviation for both these distributions. The formula for the
standard deviation is:

$$\sigma = \sqrt{\frac{\sum d^2}{N}}$$

Where:

$\sigma$ (little sigma) is the standard deviation.

$d^2$ is a score's deviation from the mean, squared.

$N$ is the number of cases.

The numbers we need to calculate the standard deviation for Figure 4, the flat distribution, are in
Table 6.

Table 6
Data for Figure 4, the Flat Distribution

A                 B                C               D       E
Test Score (X)    Frequency (f)    X - Mean (d)    fd      fd²
100               8                -50             -400    20,000
110               13               -40             -520    20,800
120               17               -30             -510    15,300
130               20               -20             -400    8,000
140               21               -10             -210    2,100
150               22               0               0       0
160               21               10              210     2,100
170               20               20              400     8,000
180               17               30              510     15,300
190               13               40              520     20,800
200               8                50              400     20,000
SUM               180                                      132,400

Column A displays the test scores (X).

Column B shows how many people got each test score (f).

Column C is the test score minus the mean (X minus the mean or d).

Column D is the deviation in column C multiplied by the frequency in column B (fd).

Column E is the squared deviation multiplied by the frequency (fd²).

Of course, to get the deviation of each score from the mean (column C), we have to calculate the
mean, and you already know how to do that. We now have what we need to calculate the standard
deviation for the flat distribution in Figure 4:

$$\sigma = \sqrt{\frac{\sum fd^2}{N}} = \sqrt{\frac{132{,}400}{180}}$$

or

$$\sigma = \sqrt{736} \approx 27$$

You can do the last part of this calculation, the square root of 132,400/180 (which is about 736), by
using the square-root button on your little hand calculator.
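
The whole Table 6 calculation can also be done in a few lines of Python. This is a minimal sketch with illustrative variable names; it divides by the number of cases N, matching the calculation above.

```python
from math import sqrt

# Table 6: test scores and how many people got each one.
scores = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
freqs = [8, 13, 17, 20, 21, 22, 21, 20, 17, 13, 8]

n = sum(freqs)                                                      # number of cases (N)
mean = sum(x * f for x, f in zip(scores, freqs)) / n                # frequency-weighted mean
sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(scores, freqs))   # total of column E
sigma = sqrt(sum_fd2 / n)                                           # divide by N, then take the root

print(n, mean, sum_fd2, round(sigma, 1))  # 180 150.0 132400.0 27.1
```

Swapping in the frequencies from another table is enough to repeat the calculation for a different distribution.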

Now let's compute the standard deviation for the data in Figure 5. The data are in Table 7, and you
follow the same steps we've just completed.

Table 7
Example of a Narrow or Concentrated Distribution

A                 B                C               D       E
Test Score (X)    Frequency (f)    X - Mean (d)    fd      fd²
100               0                -50             0       0
110               0                -40             0       0
120               0                -30             0       0
130               10               -20             -200    4,000
140               45               -10             -450    4,500
150               70               0               0       0
160               45               10              450     4,500
170               10               20              200     4,000
180               0                30              0       0
190               0                40              0       0
200               0                50              0       0
SUM               180                                      17,000

$$\sigma = \sqrt{\frac{\sum fd^2}{N}} = \sqrt{\frac{17{,}000}{180}}$$

or

$$\sigma = \sqrt{94.4} \approx 10$$

The two standard deviations provide a statistical indication of how different the distributions
are: 27 for the spread-out distribution and 10 for the bunched-up distribution.

So once we know the mean and median, why do we need to know the standard deviation? What
use is it?

The standard deviation is important because, regardless of the mean, it makes a great deal of
difference whether the distribution is spread out over a broad range or bunched up closely around
the mean. For example, suppose you have two classes whose mean reading scores are the same.
With only that information, you would be inclined to teach the two classes in the same way. But
suppose you discover that the standard deviation of one of the classes is 27 and the other is 10, as
in the examples we just finished working with. That means that in the first class (the one
where $\sigma$ = 27), you have many students throughout the entire range of performance. You'll need
to have teaching strategies for both the gifted and the challenged. But in the second class (the one
where $\sigma$ = 10), you don't have any gifted or challenged students. They're all average, and your
teaching strategy will be entirely different.

Testing the Difference Between Means: The t-Test

This is one of the most important parts of this course in basic statistics. Here we're going to learn
about testing the significance of difference between means. What does that mean?

Suppose you're the superintendent, and one of your principals bursts into your office
enthusiastically and says, "I know you'll be happy to learn that after our big effort this year in
reading, my third graders improved from 187 to 195 on the state reading test!"

You immediately ask her, "Is the 8-point difference between those means statistically significant?"
When her eyes glaze over and she says, "Huh?" you smile, forbearingly (because you've taken
this course in basic statistics, and she hasn't), and you patiently explain to her that simply because
there is a numerical difference between last year's and this year's mean scores doesn't mean that
there is a real difference. It could be due to chance variation in the scores.

So how do we know when the difference between two means is probably a real difference, not one
due to chance? We have to say "probably" because nothing in statistics is absolutely certain (as is
the case with most things in life). But there are statistical tests which can tell us how likely it is
that a difference between two means is due to chance.

One of the most widely used statistical methods for testing the difference between means, and the
one we're going to get you up-to-speed on, is called the t-test.

Let's go back to the salary data we worked with in Table 1 of Lesson 1, but now let's compare the
mean salary of that group with another group, and ask whether the mean salaries of the two
groups are significantly different.

First, let's look at the formula for the t-test, and determine what we need to make the computation:

Where:

$\bar{X}_1$ is the mean for Group 1.

$\bar{X}_2$ is the mean for Group 2.

$n_1$ is the number of people in Group 1.

$n_2$ is the number of people in Group 2.

$s_1^2$ is the variance for Group 1.

$s_2^2$ is the variance for Group 2.

The only thing in this formula you're not familiar with is the symbol $s^2$, which stands for the
variance. The variance is the standard deviation before the square root is taken, except that here we
divide by n − 1 rather than N; i.e., it's nothing more than the sum of the squared deviations of all
the scores from the mean divided by n − 1.

The formula above is for testing the significance of difference between two independent samples,
i.e., groups of different people. If we wanted to test the difference between, say, the pre-test and
post-test means of the same group of people, we would use a different formula for dependent samples.
That formula is:

Where:

$\sum D$ is the sum of all the individuals' pre-post score differences.

$\sum D^2$ is the sum of all the individuals' pre-post score differences squared.

$n$ is the number of paired observations.

But for now, we'll test the significance of difference between the mean salary of two different groups. You
can try the one for dependent samples on your own. (I knew you'd welcome that opportunity.)

Tables 8 and 9 provide the numbers we need to compute the t-test for the difference in mean salaries of the
two groups.

Table 8
Salaries and t-Test Calculation Data for Group 1

A             B                C               D      E
Salary (X)    Frequency (f)    X - Mean (d)    fd     fd²
20            1                -25             -25    625
25            2                -20             -40    800
30            3                -15             -45    675
35            4                -10             -40    400
40            5                -5              -25    125
45            6                0               0      0
50            5                5               25     125
55            4                10              40     400
60            3                15              45     675
65            2                20              40     800
70            1                25              25     625
SUM           36                                      5,250

The variance ($s^2$):

$$s_1^2 = \frac{\sum fd^2}{n_1 - 1} = \frac{5{,}250}{35} = 150$$

Table 9
Salaries and t-Test Calculation Data for Group 2

A             B                C               D      E
Salary (X)    Frequency (f)    X - Mean (d)    fd     fd²
20            0                -27             0      0
25            2                -22             -44    968
30            3                -17             -51    867
35            3                -12             -36    432
40            4                -7              -28    196
45            6                -2              -12    24
50            6                3               18     54
55            5                8               40     320
60            3                13              39     507
65            2                18              36     648
70            2                23              46     1,058
SUM           36                                      5,074

The variance ($s^2$):

$$s_2^2 = \frac{\sum fd^2}{n_2 - 1} = \frac{5{,}074}{35} \approx 145$$
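
Both variances can be recomputed directly from the frequency columns of Tables 8 and 9. The Python sketch below divides the sum of squared deviations by n − 1. The function name is hypothetical, and for Group 2 it uses the rounded mean of 47 shown in Table 9 rather than the unrounded mean, which is an assumption made so that the figures line up with the table.

```python
def variance_from_table(values, freqs, mean):
    """Sum of f * (X - mean)^2 over the table, divided by n - 1."""
    n = sum(freqs)
    sum_fd2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return sum_fd2 / (n - 1)

salaries = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
group1_freqs = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
group2_freqs = [0, 2, 3, 3, 4, 6, 6, 5, 3, 2, 2]

var1 = variance_from_table(salaries, group1_freqs, mean=45)
var2 = variance_from_table(salaries, group2_freqs, mean=47)  # rounded mean, as in Table 9
print(var1, round(var2))  # 150.0 145
```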

You can see from a quick inspection of the two tables that the salary distributions are similar.
In the second group there are a few more people making higher salaries. The mean of the second group
(which has been calculated for you) is slightly higher (47 vs. 45 for the first group). And its variance
is smaller (145 vs. 150). So let's plug the numbers into the t-test formula and see what we get.

We now know that t = .222. So what does that mean? Is the difference between the two means
statistically significant or not? To find out whether a t-test of any value is significant or not, we
simply look it up in a table that can be found in the appendices of any statistics textbook. The
quick answer in this case is no, it is not statistically significant. That is, the 2-point difference in
the mean salaries of these two groups could likely have occurred by chance.
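
One common form of the independent-samples t statistic built from exactly these ingredients (two means, two variances, two group sizes) is the unpooled-variance form, in which the difference in means is divided by the square root of s₁²/n₁ + s₂²/n₂. The Python sketch below assumes that form. It may not be the variant behind the figure quoted above, so its output need not match that figure. The function name is hypothetical.

```python
from math import sqrt

def t_independent(mean1, var1, n1, mean2, var2, n2):
    """t for two independent samples, unpooled-variance form (an assumption)."""
    standard_error = sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error

# Summary figures for the two salary groups: means 45 and 47,
# variances 150 and 145, and 36 people in each group.
t = t_independent(45, 150, 36, 47, 145, 36)
print(round(t, 3))
```

The sign of t only reflects which group is listed first; it is the absolute value that gets compared with the critical value from a t table.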
