Part II - Data Analysis


Department of Library and Information Science
Federal University Lokoja
LIS 313: Research and Statistical Method in Library & Information Science

Part II

Introduction to Statistics
When a person is ill or has had an accident and received an injury there are many variables
associated with these situations which could be measured. An infection may cause the person's
temperature to rise, a broken bone will cause pain. There may also be psychological
consequences which a researcher may wish to measure, e.g. anxiety or health beliefs. The
methods used to measure these variables will very often be of a quantitative nature. The
researcher will use techniques which allow some form of number to be used to assess or quantify
the condition under investigation. They will seek to investigate the relationships between
variables using systematic controlled observations. These observations of a carefully chosen
sample of the population of interest, and the associated statistical procedures, will enable
researchers to test their hypotheses and verify or refute the theories which attempt to explain
the observations. The techniques used by researchers to test their hypotheses are many and
varied and in quantitative studies will often involve some form of experimental study. Other
types of study may be better approached using survey techniques which may employ a variety
of questionnaires or attitude measures.

A great deal of research related to libraries and information centres is concerned with
measuring associations between variables, e.g. reading habits and performance. These types of
studies may look at relatively simple linear relationships, i.e. correlations, as in the cited example,
but it is possible, using multiple regression techniques, to examine the complex interrelationships
between several factors which may have a bearing on the topic of interest. For example, the
likelihood of a person failing their exams is related to many factors including poor preparation,
information-seeking behaviour, use of relevant and current material, and other lifestyle measures.
The relative contributions of these factors can be teased out using these sorts of techniques.

Regardless of the research techniques used, in quantitative research the aim of the research
activity may be summarised as Understanding, Prediction, and Control. The researcher is
attempting to gain an understanding of the phenomena under study so that they may use this
understanding in order to make predictions about the real world, and thus develop technologies
or procedures which allow a degree of control to be exerted over those phenomena. Thus in
Library and Information Science research we may be seeking to understand the transmission of
information to meet human needs in order to make predictions about how to help patrons find
information that will fill an information gap.

Nature of Data

Numerical data is the essence of quantitative methods. In order to try to understand the
phenomena under study, the researcher will first have to find a means of expressing the
variables to be measured using some form of numerical technique. For most practical purposes,
data can be measured at four different levels; each level has a specific purpose and also has
important implications for the type of analysis to be undertaken. These four levels of
measurement are known as nominal, ordinal, interval and ratio.

Consider a study in which a patron registers at a library in order to borrow books and use the
library facilities. During registration, the patron's name is taken and a patron identification
number is issued. The number is a unique identifier, since there may well be several patrons with the
same name. But the number that is issued is rather like a name in that it identifies the patron
and probably will not be used in a numerical sense; i.e. this patron's identifier has no numerical
significance relative to other patrons' identification numbers. Data of this type, in which
numbers are used as identifiers, are called nominal data, and researchers speak of nominal levels
of measurement. Another example might be where a number is used to identify gender, e.g. 1
= male and 2 = female.

Continuing with this example, suppose that after registration the patron wants to see a reference
librarian. They are asked to wait and are told they are number five
on a waiting list. Although this indicates to the patron that four other people will have to
be seen before their turn, it doesn't indicate very much about how long it will take to see the
reference librarian, since there is no way of knowing how long each of the other four patrons will need
with the reference librarian. This is an example of ordinal data, where the numerical value
indicates something about relative rather than absolute position in a series. Other common
examples of ordinal data include ranks, where absolute numerical values are turned into a
numerical series because we are more interested in relative values than absolute. Many
statistical tests make use of this type of ranked data.

For someone who weighs 150 lb, not only do we know they are heavier than someone with a
weight of 140 lb, but we also know by how much. This is known as interval level
measurement because this numerical system also tells us about the intervals between the units
of measurement. Weight is a special type of interval measurement because it has
an absolute zero point, i.e. you can't have less than zero weight. This special type of scale is
known as a ratio scale.

For most purposes the distinction between interval and ratio scales is not that important, but
knowing the difference between nominal, ordinal, and interval/ratio is of importance because it
helps us to choose the appropriate statistical tool for the analysis of the data.

Variables

Put simply, a variable is something that can have more than one value. In research, particularly
quantitative research where we are using experiments to try to establish cause and effect, certain
variables are especially important:

• Independent Variable (IV)

• Dependent Variable (DV)

• Extraneous Variable (EV)

Consider an experimental study aimed at establishing the effect of electronic scholarly methods on
scholarly communication. We hypothesise that electronic scholarly methods will have beneficial
effects on scholarly communication. In this situation the new electronic scholarly method is the
independent variable and scholarly communication is the dependent variable. However, it is
very likely that the dependent variable will be influenced by other factors as well as the
independent variable.

Scholarly communication may be affected by poor quality of content, an outdated research focus
or other problems that affect scholarly communication. All these other potential sources of
influence are known as extraneous variables. The purpose of experimental design is, as far as
possible, to control these extraneous factors.

Target Populations

The target population in a research study comprises all those potential participants that could
make up the study group. Thus in a study of the provision of online information services, the
target population would be all libraries with a website. Of course a researcher might
want to narrow the target population and might choose (for example) all university libraries
as the target population. It is important to realise that the ability to generalise the findings of a
study will be restricted by the chosen target population. Thus in the above study the researcher
could only generalise the findings back to a population of university libraries with a website,
though it should be added that the findings may well be suggestive of similar results in
populations that do not differ too greatly from the target population.

MEP Pupil Text 9

9 Data Analysis
9.1 Mean, Median, Mode and Range
In Unit 8, you were looking at ways of collecting and representing data. In this unit, you
will go one step further and find out how to calculate statistical quantities which summarise
the important characteristics of the data.

The mean, median and mode are three different ways of describing the average.

• To find the mean, add up all the numbers and divide by the number of numbers.
• To find the median, place all the numbers in order and select the middle number.
• The mode is the number which appears most often.
• The range gives an idea of how the data are spread out and is the difference between
the smallest and largest values.

Worked Example 1
Find
(a) the mean (b) the median (c) the mode (d) the range
of this set of data.
5, 6, 2, 4, 7, 8, 3, 5, 6, 6

Solution
(a) The mean is
$$\frac{5+6+2+4+7+8+3+5+6+6}{10} = \frac{52}{10} = 5.2.$$

(b) To find the median, place all the numbers in order.
2, 3, 4, 5, 5, 6, 6, 6, 7, 8
As there are two middle numbers in this example, 5 and 6,
$$\text{median} = \frac{5+6}{2} = \frac{11}{2} = 5.5.$$

(c) From the list above it is easy to see that 6 appears more than any other number, so
mode = 6 .

(d) The range is the difference between the smallest and largest numbers, in this case
2 and 8. So the range is 8 − 2 = 6 .
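
The four calculations in this worked example can be checked with a short Python sketch. It assumes Python 3.8 or later and uses the standard `statistics` module; the function name `summarise` is chosen here for illustration and is not part of the original text.

```python
from statistics import mean, median, multimode

def summarise(values):
    """Return the mean, median, mode(s) and range of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),          # averages the two middle values if needed
        "modes": multimode(values),        # a list, because a data set can have several modes
        "range": max(values) - min(values),
    }

data = [5, 6, 2, 4, 7, 8, 3, 5, 6, 6]      # the data from Worked Example 1
print(summarise(data))
```

Returning the modes as a list is a deliberate choice: some data sets have more than one most frequent value.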

Worked Example 2
Five people play golf and at one hole their scores are
3, 4, 4, 5, 7.
For these scores, find
(a) the mean (b) the median (c) the mode (d) the range .

Solution
(a) The mean is
$$\frac{3+4+4+5+7}{5} = \frac{23}{5} = 4.6.$$

(b) The numbers are already in order and the middle number is 4. So
median = 4 .

(c) The score 4 occurs most often, so,


mode = 4 .

(d) The range is the difference between the smallest and largest numbers, in this case
3 and 7, so
range = 7 − 3
= 4.

Exercises
1. Find the mean, median, mode and range of each set of numbers below.
(a) 3, 4, 7, 3, 5, 2, 6, 10
(b) 8, 10, 12, 14, 7, 16, 5, 7, 9, 11
(c) 17, 18, 16, 17, 17, 14, 22, 15, 16, 17, 14, 12
(d) 108, 99, 112, 111, 108
(e) 64, 66, 65, 61, 67, 61, 57
(f) 21, 30, 22, 16, 24, 28, 16, 17

2. Twenty children were asked their shoe sizes. The results are given below.

8, 6, 7, 6, 5, 4½, 7½, 6½, 8½, 10
7, 5, 5½, 8, 9, 7, 5, 6, 8½, 6

For this data, find


(a) the mean (b) the median (c) the mode (d) the range.


17. Eight judges each give a mark out of 6 in an ice-skating competition.


Oksana is given the following marks.
5.3, 5.7, 5.9, 5.4, 4.5, 5.7, 5.8, 5.7
The mean of these marks is 5.5, and the range is 1.4.
The rules say that the highest mark and the lowest mark are to be deleted.
5.3, 5.7, 5.9, 5.4, 4.5, 5.7, 5.8, 5.7
(a) (i) Find the mean of the six remaining marks.
(ii) Find the range of the six remaining marks.
(b) Do you think it is better to count all eight marks, or to count only the six
remaining marks? Use the means and the ranges to explain your answer.
(c) The eight marks obtained by Tonya in the same competition have a mean
of 5.2 and a range of 0.6. Explain why none of her marks could be as high
as 5.9. (MEG)

9.2 Finding the Mean from Tables and Tally Charts
Often data are collected into tables or tally charts. This section considers how to find the
mean in such cases.

Worked Example 1
A football team keeps records of the number of goals it scores per match during a season.

No. of Goals Frequency


0 8
1 10
2 12
3 3
4 5
5 2

Find the mean number of goals per match.

Solution
The table above can be used, with a third column added.

No. of Goals    Frequency    No. of Goals × Frequency

0               8            0 × 8 = 0
1               10           1 × 10 = 10
2               12           2 × 12 = 24
3               3            3 × 3 = 9
4               5            4 × 5 = 20
5               2            5 × 2 = 10

TOTALS          40           73
                (Total matches)    (Total goals)

The mean can now be calculated.

$$\text{Mean} = \frac{73}{40} = 1.825.$$
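
The third-column method can be sketched in Python as follows. The dictionary layout (value mapped to frequency) and the function name `mean_from_frequencies` are assumptions made for this illustration.

```python
def mean_from_frequencies(table):
    """Mean of data given as {value: frequency}.

    Mirrors the third column of the table: sum of value x frequency,
    divided by the total frequency.
    """
    total_count = sum(table.values())
    total_value = sum(value * freq for value, freq in table.items())
    return total_value / total_count

# Goals per match from Worked Example 1: {number of goals: frequency}
goals = {0: 8, 1: 10, 2: 12, 3: 3, 4: 5, 5: 2}
print(mean_from_frequencies(goals))
```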


Worked Example 2
The bar chart shows how many cars were sold by a salesman over a period of time.

[Bar chart: frequency (vertical axis, 0 to 6) against cars sold per day (horizontal axis, 0 to 5)]

Find the mean number of cars sold per day.

Solution
The data can be transferred to a table and a third column included as shown.

Cars sold daily Frequency Cars sold × Frequency

0 2 0 × 2 = 0
1 4 1 × 4 = 4
2 3 2 × 3 = 6
3 6 3 × 6 = 18
4 3 4 × 3 = 12
5 2 5 × 2 = 10

TOTALS 20 50

(Total days) (Total number of cars sold)

$$\text{Mean} = \frac{50}{20} = 2.5$$

Worked Example 3
A police station kept records of the number of road traffic accidents in their area each day
for 100 days. The figures below give the number of accidents per day.

1 4 3 5 5 2 5 4 3 2 0 3 1 2 2 3 0 5 2 1
3 3 2 6 2 1 6 1 2 2 3 2 2 2 2 5 4 4 2 3
3 1 4 1 7 3 3 0 2 5 4 3 3 4 3 4 5 3 5 2
4 4 6 5 2 4 5 5 3 2 0 3 3 4 5 2 3 3 4 4
1 3 5 1 1 2 2 5 6 6 4 6 5 8 2 5 3 3 5 4

Find the mean number of accidents per day.


Solution
The first step is to draw out and complete a tally chart. The final column shown below
can then be added and completed.

Number of Accidents Tally Frequency No. of Accidents × Frequency

0 |||| 4 0 × 4 = 0
1 |||| |||| 10 1 × 10 = 10
2 |||| |||| |||| |||| || 22 2 × 22 = 44
3 |||| |||| |||| |||| ||| 23 3 × 23 = 69
4 |||| |||| |||| | 16 4 × 16 = 64
5 |||| |||| |||| || 17 5 × 17 = 85
6 |||| | 6 6 × 6 = 36
7 | 1 7 × 1 = 7
8 | 1 8 × 1 = 8
TOTALS 100 323

$$\text{Mean number of accidents per day} = \frac{323}{100} = 3.23.$$
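
The tally step can also be done by machine. The sketch below uses `collections.Counter` from the Python standard library to build the frequency column from the 100 raw figures listed in the question; the variable names are chosen here for illustration.

```python
from collections import Counter

# The 100 daily accident counts from Worked Example 3, row by row
accidents = [
    1, 4, 3, 5, 5, 2, 5, 4, 3, 2, 0, 3, 1, 2, 2, 3, 0, 5, 2, 1,
    3, 3, 2, 6, 2, 1, 6, 1, 2, 2, 3, 2, 2, 2, 2, 5, 4, 4, 2, 3,
    3, 1, 4, 1, 7, 3, 3, 0, 2, 5, 4, 3, 3, 4, 3, 4, 5, 3, 5, 2,
    4, 4, 6, 5, 2, 4, 5, 5, 3, 2, 0, 3, 3, 4, 5, 2, 3, 3, 4, 4,
    1, 3, 5, 1, 1, 2, 2, 5, 6, 6, 4, 6, 5, 8, 2, 5, 3, 3, 5, 4,
]

tally = Counter(accidents)                 # {number of accidents: frequency}

# Print a simple text tally chart, one row per value
for value in sorted(tally):
    print(value, "|" * tally[value], tally[value])

total_days = sum(tally.values())
total_accidents = sum(value * freq for value, freq in tally.items())
mean_accidents = total_accidents / total_days
print(mean_accidents)
```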

Exercises
1. A survey of 100 households asked how many cars there were in each household.
The results are given below.

No. of Cars Frequency


0 5
1 70
2 21
3 3
4 1

Calculate the mean number of cars per household.

2. The survey of question 1 also asked how many TV sets there were in each household.
The results are given below.

No. of TV Sets Frequency


0 2
1 30
2 52
3 8
4 5
5 3

Calculate the mean number of TV sets per household.



8. The mean of 6 numbers is 12.3. When an extra number is added, the mean changes
to 11.9. What is the extra number?

9. When 5 is added to a set of 3 numbers the mean increases to 4.6. What was the
mean of the original 3 numbers?

10. Three numbers have a mean of 64. When a fourth number is included the mean is
doubled. What is the fourth number?

9.4 Mean, Median and Mode for Grouped Data


The mean and median can be estimated from tables of grouped data.
The class interval which contains the most values is known as the modal class.

Worked Example 1
The table below gives data on the heights, in cm, of 51 children.

Class Interval 140 ≤ h < 150 150 ≤ h < 160 160 ≤ h < 170 170 ≤ h < 180
Frequency 6 16 21 8

(a) Estimate the mean height. (b) Estimate the median height.
(c) Find the modal class.

Solution
(a) To estimate the mean, the mid-point of each interval should be used.

Class Interval Mid-point Frequency Mid-point × Frequency

140 ≤ h < 150 145 6 145 × 6 = 870


150 ≤ h < 160 155 16 155 × 16 = 2480
160 ≤ h < 170 165 21 165 × 21 = 3465
170 ≤ h < 180 175 8 175 × 8 = 1400

Totals 51 8215

$$\text{Mean} = \frac{8215}{51} = 161 \text{ (to the nearest cm)}$$

(b) The median is the 26th value. In this case it lies in the 160 ≤ h < 170 class interval.
The 4th value in the interval is needed. It is estimated as
$$160 + \frac{4}{21} \times 10 = 162 \text{ (to the nearest cm)}$$

(c) The modal class is 160 ≤ h < 170 as it contains the most values.
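
The mid-point estimate of the mean and the interpolated estimate of the median can be sketched in Python. The representation of each class as a `(lower, upper, frequency)` tuple and the function names are assumptions made for this illustration. The median position is taken as (n + 1)/2, which matches the "26th value" used in the solution above.

```python
def grouped_mean(classes):
    """Estimate the mean using the mid-point of each (lower, upper, frequency) class."""
    n = sum(freq for _, _, freq in classes)
    return sum((lower + upper) / 2 * freq for lower, upper, freq in classes) / n

def grouped_median(classes):
    """Estimate the median by linear interpolation inside the class that contains it."""
    n = sum(freq for _, _, freq in classes)
    position = (n + 1) / 2                 # e.g. the 26th value when n = 51
    below = 0                              # number of values in earlier classes
    for lower, upper, freq in classes:
        if below + freq >= position:
            fraction = (position - below) / freq
            return lower + fraction * (upper - lower)
        below += freq
    raise ValueError("no classes given")

# Heights from Worked Example 1
heights = [(140, 150, 6), (150, 160, 16), (160, 170, 21), (170, 180, 8)]
print(round(grouped_mean(heights)), round(grouped_median(heights)))
```

Both results are estimates only, since the individual values inside each class are not known.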


Also note that when we speak of someone by age, say 8, then the person could be any age
from 8 years 0 days up to 8 years 364 days (365 in a leap year!). You will see how this is
tackled in the following example.

Worked Example 2
The ages of children in a primary school were recorded in the table below.

Age 5–6 7–8 9 – 10

Frequency 29 40 38

(a) Estimate the mean. (b) Estimate the median. (c) Find the modal age.

Solution
(a) To estimate the mean, we must use the mid-point of each interval; so, for example
for '5 – 6', which really means
5 ≤ age < 7 ,
the mid-point is taken as 6.

Class Interval Mid-point Frequency Mid-point × Frequency

5–6 6 29 6 × 29 = 174
7–8 8 40 8 × 40 = 320
9 – 10 10 38 10 × 38 = 380
Totals 107 874

$$\text{Mean} = \frac{874}{107} = 8.2 \text{ (to 1 decimal place)}$$

(b) The median is given by the 54th value, which we have to estimate. There are 29
values in the first interval, so we need to estimate the 25th value in the second
interval. As there are 40 values in the second interval, the median is estimated as
being $\frac{25}{40}$ of the way along the second interval. This has width 9 − 7 = 2 years, so the
median is estimated by
$$\frac{25}{40} \times 2 = 1.25$$
from the start of the interval. Therefore the median is estimated as
7 + 1.25 = 8.25 years.

(c) The modal age is the 7 – 8 age group.


Worked Example 1 uses what are called continuous data, since height can be of any value.
(Other examples of continuous data are weight, temperature, area, volume and time.)

The next example uses discrete data, that is, data which can take only particular values,
such as the integers 1, 2, 3, 4, . . . in this case.

The calculations for mean and mode are not affected but estimation of the median
requires replacing the discrete grouped data with an approximate continuous interval.

Worked Example 3
The number of days that children were missing from school due to sickness in one year
was recorded.

Number of days off sick 1–5 6 – 10 11 – 15 16 – 20 21 – 25

Frequency 12 11 10 4 3

(a) Estimate the mean (b) Estimate the median. (c) Find the modal class.

Solution
(a) The estimate is made by assuming that all the values in a class interval are equal to
the midpoint of the class interval.

Class Interval Mid-point Frequency Mid-point × Frequency

1–5 3 12 3 × 12 = 36
6–10 8 11 8 × 11 = 88
11–15 13 10 13 × 10 = 130
16–20 18 4 18 × 4 = 72
21–25 23 3 23 × 3 = 69

Totals 40 395

$$\text{Mean} = \frac{395}{40} = 9.875 \text{ days.}$$

(b) As there are 40 pupils, we need to consider the mean of the 20th and 21st values.
These both lie in the 6–10 class interval, which is really the 5.5–10.5 class interval,
so this interval contains the median.
As there are 12 values in the first class interval, the median is found by considering
the 8th and 9th values of the second interval.
As there are 11 values in the second interval, the median is estimated as being
$\frac{8.5}{11}$ of the way along the second interval.

But the length of the second interval is 10.5 − 5.5 = 5, so the median is estimated by
$$\frac{8.5}{11} \times 5 = 3.86$$
from the start of this interval. Therefore the median is estimated as
5.5 + 3.86 = 9.36.

(c) The modal class is 1–5, as this class contains the most entries.

Exercises
1. A door to door salesman keeps a record of the number of homes he visits each day.

Homes visited 0–9 10 – 19 20 – 29 30 – 39 40 – 49


Frequency 3 8 24 60 21

(a) Estimate the mean number of homes visited.


(b) Estimate the median.
(c) What is the modal class?

2. The weights of a number of students were recorded in kg.

Weight (kg) 30 ≤ w < 35 35 ≤ w < 40 40 ≤ w < 45 45 ≤ w < 50 50 ≤ w < 55


Frequency 10 11 15 7 4

(a) Estimate the mean weight. (b) Estimate the median.


(c) What is the modal class?

3. A stopwatch was used to find the time that it took a group of children to run 100 m.

Time (seconds) 10 ≤ t < 15 15 ≤ t < 20 20 ≤ t < 25 25 ≤ t < 30


Frequency 6 16 21 8

(a) Is the median in the modal class? (b) Estimate the mean.
(c) Estimate the median.
(d) Is the median greater or less than the mean?

4. The distances that children in a year group travelled to school were recorded.

Distance (km) 0 ≤ d < 0.5 0.5 ≤ d < 1.0 1.0 ≤ d < 1.5 1.5 ≤ d < 2.0
Frequency 30 22 19 8

(a) Does the modal class contain the median?


(b) Estimate the median and the mean.
(c) Which is larger, the median or the mean?


9.5 Cumulative Frequency


Cumulative frequencies are useful if more detailed information is required about a set of
data. In particular, they can be used to find the median and inter-quartile range.

The inter-quartile range contains the middle 50% of the sample and describes how
spread out the data are. This is illustrated in Example 2.

Worked Example 1
For the data given in the table, draw up a cumulative frequency table and then draw
a cumulative frequency graph.

Height (cm) Frequency

90 < h ≤ 100 5
100 < h ≤ 110 22
110 < h ≤ 120 30
120 < h ≤ 130 31
130 < h ≤ 140 18
140 < h ≤ 150 6

Solution
The table below shows how to calculate the cumulative frequencies.

Height (cm) Frequency Cumulative Frequency

90 < h ≤ 100 5 5
100 < h ≤ 110 22 5 + 22 = 27
110 < h ≤ 120 30 27 + 30 = 57
120 < h ≤ 130 31 57 + 31 = 88
130 < h ≤ 140 18 88 + 18 = 106
140 < h ≤ 150 6 106 + 6 = 112
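
The running totals in the third column can be produced with `itertools.accumulate` from the Python standard library. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from itertools import accumulate

upper_bounds = [100, 110, 120, 130, 140, 150]   # upper end of each height class (cm)
frequencies = [5, 22, 30, 31, 18, 6]

# Each cumulative frequency is the previous total plus the next frequency
cumulative = list(accumulate(frequencies))

for bound, freq, total in zip(upper_bounds, frequencies, cumulative):
    print(f"h <= {bound}: frequency {freq}, cumulative frequency {total}")
```

Each cumulative frequency is paired with the upper end of its class, which is why the graph is plotted at the points (100, 5), (110, 27) and so on.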

A graph can then be plotted using points as shown below.

[Cumulative frequency graph: cumulative frequency (vertical axis) against height in cm (horizontal axis), with straight line segments joining the points (90,0), (100,5), (110,27), (120,57), (130,88), (140,106) and (150,112)]


Note
A more accurate graph is found by drawing a smooth curve through the points, rather
than using straight line segments.

[Cumulative frequency graph: the same points, (90,0), (100,5), (110,27), (120,57), (130,88), (140,106) and (150,112), joined by a smooth curve]

Worked Example 2
The cumulative frequency graph below gives the results of 120 students on a test.

[Cumulative frequency graph: cumulative frequency (0 to 120) against test score (0 to 100)]


Use the graph to find:


(a) the median score, (b) the inter-quartile range,
(c) the mark which was attained by only 10% of the students,
(d) the number of students who scored more than 75 on the test.

Solution
(a) Since $\frac{1}{2}$ of 120 is 60, the median can be found by starting at 60
on the vertical scale, moving horizontally to the graph line
and then moving vertically down to meet the horizontal scale.
In this case the median is 53.

(b) To find the inter-quartile range, we must consider the middle 50% of the
students.

To find the lower quartile, start at $\frac{1}{4}$ of 120, which is 30.
This gives
Lower Quartile = 43.

To find the upper quartile, start at $\frac{3}{4}$ of 120, which is 90.
This gives
Upper Quartile = 67.

The inter-quartile range is then

Inter-quartile Range = Upper Quartile − Lower Quartile
= 67 − 43
= 24.


(c) Here the mark which was attained by the top 10% is required.
10% of 120 = 12
so start at 108 on the cumulative frequency scale.
This gives a mark of 79.

(d) To find the number of students who scored more than 75, start at 75 on
the horizontal axis.
This gives a cumulative frequency of 103.
So the number of students with a score greater than 75 is
120 − 103 = 17.

As in Worked Example 1, a more accurate estimate for the median and inter-quartile
range is obtained if you draw a smooth curve through the data points.
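
Reading values off a cumulative frequency graph amounts to interpolating between plotted points. The data points behind the Worked Example 2 graph are not given in the text, so the sketch below uses the points from Worked Example 1 instead and assumes straight line segments between them (not a smooth curve). The function name `value_at` is chosen here for illustration.

```python
def value_at(points, target):
    """Horizontal value at which the cumulative frequency reaches `target`.

    `points` is a list of (value, cumulative frequency) pairs in increasing
    order; straight line segments are assumed between neighbouring points.
    """
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target outside the range of the graph")

# Points plotted in Worked Example 1: (height in cm, cumulative frequency)
points = [(90, 0), (100, 5), (110, 27), (120, 57), (130, 88), (140, 106), (150, 112)]
total = points[-1][1]

median = value_at(points, total / 2)
lower_quartile = value_at(points, total / 4)
upper_quartile = value_at(points, 3 * total / 4)
print(median, lower_quartile, upper_quartile, upper_quartile - lower_quartile)
```

A smooth curve through the same points would give slightly different readings, so these figures are estimates.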

Exercises
1. Make a cumulative frequency table for each set of data given below. Then draw a
cumulative frequency graph and use it to find the median and inter-quartile range.
(a) John weighed each apple in a large box. His results are given in this table.

Weight of apple (g)   60 < w ≤ 80   80 < w ≤ 100   100 < w ≤ 120   120 < w ≤ 140   140 < w ≤ 160
Frequency             4             28             33              27              8

(b) Pasi asked the students in his class how far they travelled to school each day.
His results are given below.

Distance (km)   0 < d ≤ 1   1 < d ≤ 2   2 < d ≤ 3   3 < d ≤ 4   4 < d ≤ 5   5 < d ≤ 6
Frequency       5           12          5           6           5           3


[Cumulative frequency graph: cumulative frequency (0 to 120) against distance travelled in miles (0 to 140)]

(c) (i) Use the cumulative frequency curve to estimate the median distance
travelled by the guests.
(ii) Give a reason for the large difference between the mean distance and
the median distance.
(MEG)

9.6 Standard Deviation


The two frequency polygons drawn on the graph below show samples which have the
same mean, but the data in one are much more spread out than in the other.

[Graph: two frequency polygons, one drawn dotted, showing frequency (vertical axis) against length (horizontal axis, 0 to 12)]

The range (highest value – lowest value) gives a simple measure of how much the data
are spread out.


Standard deviation (s.d.) is a much more useful measure and is given by the formula:

$$\text{s.d.} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

where $x_i$ represents each datapoint ($x_1, x_2, \ldots, x_n$),
$\bar{x}$ is the mean,
$n$ is the number of values.

Then $(x_i - \bar{x})^2$ gives the square of the difference between each value and the mean
(squaring exaggerates the effect of data points far from the mean and gets rid of negative
values), and
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$
sums up all these squared differences.

The expression
$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
gives an average value to these differences. If all the data were the same, then each $x_i$
would equal $\bar{x}$ and the expression would be zero.

Finally we take the square root of the expression so that the dimensions of the standard
deviation are the same as those of the data.

So standard deviation is a measure of the spread of the data. The greater its value, the
more spread out the data are. This is illustrated by the two frequency polygons shown
above. Although both sets of data have the same mean, the data represented by the
'dotted' frequency polygon will have a greater standard deviation than the other.

Worked Example 1
Find the mean and standard deviation of the numbers,
6, 7, 8, 5, 9.

Solution
The mean, $\bar{x}$, is given by
$$\bar{x} = \frac{6+7+8+5+9}{5} = \frac{35}{5} = 7.$$


Now the standard deviation can be calculated.

$$\text{s.d.} = \sqrt{\frac{(6-7)^2 + (7-7)^2 + (8-7)^2 + (5-7)^2 + (9-7)^2}{5}}$$
$$= \sqrt{\frac{1+0+1+4+4}{5}} = \sqrt{\frac{10}{5}} = \sqrt{2}$$
$$= 1.414 \text{ (to 3 decimal places)}$$
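
The formula can be written out step by step in Python. The sketch below is an illustration with a function name chosen here; it divides by n, as the formula in this section does. In the standard `statistics` module this corresponds to `pstdev`, whereas `stdev` divides by n − 1 and gives a different result.

```python
from math import sqrt
from statistics import pstdev

def standard_deviation(values):
    """Standard deviation as defined in this section (dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    squared_differences = [(x - mean) ** 2 for x in values]
    return sqrt(sum(squared_differences) / n)

numbers = [6, 7, 8, 5, 9]                 # the data from Worked Example 1
print(round(standard_deviation(numbers), 3))
print(round(pstdev(numbers), 3))          # library equivalent of the same formula
```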

An alternative formula for standard deviation is

$$\text{s.d.} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2}$$

This expression is much more convenient for calculations done without a calculator. The
proof of the equivalence of this formula is given below although it is beyond the scope of
the GCSE syllabus.

Proof
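You can see the proof of the equivalence of the two formulae by noting that

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right)$$
$$= \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} (2 x_i \bar{x}) + \sum_{i=1}^{n} \bar{x}^2$$
$$= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \bar{x}^2 \sum_{i=1}^{n} 1$$

(since the expressions $2\bar{x}$ and $\bar{x}^2$ are common for each term in the summation).

But $\sum_{i=1}^{n} 1 = n$, since you are summing $1 + 1 + \ldots + 1$ ($n$ terms), and
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, by definition, thus

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \left( \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \bar{x}^2 n \right) \quad \text{(substituting } \sum_{i=1}^{n} 1 = n \text{)}$$
$$= \frac{\sum_{i=1}^{n} x_i^2}{n} - 2\bar{x} \left( \frac{\sum_{i=1}^{n} x_i}{n} \right) + \bar{x}^2 \quad \text{(dividing by } n \text{)}$$
$$= \frac{\sum_{i=1}^{n} x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2 \quad \text{(substituting } \bar{x} \text{ for } \frac{\sum_{i=1}^{n} x_i}{n} \text{)}$$
$$= \frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2$$

and the result follows.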

Worked Example 2
Find the mean and standard deviation of each of the following sets of numbers.
(a) 10, 11, 12, 13, 14 (b) 5, 6, 12, 18, 19

Solution
(a) The mean, $\bar{x}$, is given by
$$\bar{x} = \frac{10 + 11 + 12 + 13 + 14}{5} = \frac{60}{5} = 12$$

The standard deviation can now be calculated using the alternative formula.

$$\text{s.d.} = \sqrt{\frac{10^2 + 11^2 + 12^2 + 13^2 + 14^2}{5} - 12^2}$$
$$= \sqrt{146 - 144} = \sqrt{2}$$
$$= 1.414 \text{ (to 3 decimal places).}$$

(b) The mean, $\bar{x}$, is given by
$$\bar{x} = \frac{5 + 6 + 12 + 18 + 19}{5} = 12 \text{ (as in part (a)).}$$

The standard deviation is given by

$$\text{s.d.} = \sqrt{\frac{5^2 + 6^2 + 12^2 + 18^2 + 19^2}{5} - 12^2}$$
$$= \sqrt{178 - 144}$$
$$= 5.831 \text{ (to 3 decimal places).}$$

Note that both sets of numbers have the same mean value, but that set (b) has a much
larger standard deviation. This is expected, as the spread in set (b) is clearly far more
than in set (a).
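
As a numerical check, the definition and the alternative formula can be compared on the two sets from this worked example. This is a sketch with function names chosen for illustration; agreement on these two sets illustrates the equivalence but does not replace the proof.

```python
from math import sqrt

def sd_definition(values):
    """Square root of the mean squared difference from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)

def sd_alternative(values):
    """Square root of (mean of the squares minus the square of the mean)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum(x * x for x in values) / n - mean ** 2)

set_a = [10, 11, 12, 13, 14]
set_b = [5, 6, 12, 18, 19]
for values in (set_a, set_b):
    print(round(sd_definition(values), 3), round(sd_alternative(values), 3))
```

With floating-point arithmetic the alternative formula can lose accuracy when the mean is very large compared with the spread, so the definition is usually the safer choice in software.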

Worked Example 3
The table below gives the number of road traffic accidents per day in a small town.

Accidents per day 0 1 2 3 4 5 6


Frequency 5 8 6 3 2 0 1

Find the mean and standard deviation of this data.

Solution
The necessary calculations for each datapoint, $x_i$, are set out below.

Accidents per day   Frequency
($x_i$)             ($f_i$)      $x_i^2$   $x_i f_i$   $x_i^2 f_i$
0                   5            0         0           0
1                   8            1         8           8
2                   6            4         12          24
3                   3            9         9           27
4                   2            16        8           32
5                   0            25        0           0
6                   1            36        6           36
TOTALS              25                     43          127

From the totals,

$$n = 25, \quad \sum x_i f_i = 43, \quad \sum x_i^2 f_i = 127.$$

The mean, $\bar{x}$, is now given by

$$\bar{x} = \frac{\sum x_i f_i}{n} = \frac{43}{25} = 1.72.$$

The standard deviation is now given by

$$\text{s.d.} = \sqrt{\frac{\sum x_i^2 f_i}{n} - \bar{x}^2} = \sqrt{\frac{127}{25} - 1.72^2} = 1.457.$$

Most scientific calculators have statistical functions which will calculate the mean and
standard deviation of a set of data.
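
The same calculation can be sketched in Python for data held as a frequency table. The function name and the dictionary layout are assumptions made for this illustration, and the frequencies used are those in the solution table for Worked Example 3 (total frequency 25).

```python
from math import sqrt

def mean_and_sd(table):
    """Mean and standard deviation of data given as {value: frequency}."""
    n = sum(table.values())                              # total frequency
    sum_xf = sum(x * f for x, f in table.items())        # sum of x_i * f_i
    sum_x2f = sum(x * x * f for x, f in table.items())   # sum of x_i^2 * f_i
    mean = sum_xf / n
    return mean, sqrt(sum_x2f / n - mean ** 2)

# {accidents per day: frequency}, as in the solution table
accidents = {0: 5, 1: 8, 2: 6, 3: 3, 4: 2, 5: 0, 6: 1}
mean, sd = mean_and_sd(accidents)
print(round(mean, 2), round(sd, 3))
```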

Exercises
1. (a) Find the mean and standard deviation of each set of data given below.

A 51 56 51 49 53 62
B 71 76 71 69 73 82
C 102 112 102 98 106 124

(b) Describe the relationship between each set of numbers and also the relationship
between their means and standard deviations.

2. Two machines, A and B, fill empty packets with soap powder. A sample of boxes
was taken from each machine and the weight of powder (in kg) was recorded.

A 2.27 2.31 2.18 2.2 2.26 2.24


B 2.78 2.62 2.61 2.51 2.59 2.67 2.62 2.68 2.70

(a) Find the mean and standard deviation for each machine.
(b) Which machine is most consistent?

3. Two groups of students were trying to find the acceleration due to gravity.
Each group conducted 5 experiments.

Group A 9.4 9.6 10.2 10.8 10.1


Group B 9.5 9.7 9.6 9.4 9.8

Find the mean and standard deviation for each group, and comment on their results.

4. The number of matches per box was counted for 100 boxes of matches.
The results are given in the table below.
