Part II - Data Analysis
Science
Federal University Lokoja
LIS 313: Research and Statistical Method in Library & Information Science
Part II
Introduction to Statistics
When a person is ill or has had an accident and received an injury there are many variables
associated with these situations which could be measured. An infection may cause the person's
temperature to rise, a broken bone will cause pain. There may also be psychological
consequences which a researcher may wish to measure, e.g. anxiety or health beliefs. The
methods used to measure these variables will very often be of a quantitative nature. The
researcher will use techniques which allow some form of number to be used to assess or quantify
the condition under investigation. They will seek to investigate the relationships between
variables using systematic controlled observations. These observations of a carefully chosen
sample of the population of interest, and the associated statistical procedures, will enable
researchers to test their hypotheses and verify or refute the theories which attempt to explain
the observations. The techniques used by researchers to test their hypotheses are many and
varied and in quantitative studies will often involve some form of experimental study. Other
types of study may be better approached using survey techniques which may employ a variety
of questionnaires or attitude measures.
A great deal of research in libraries and information centres is concerned with
measuring associations between variables, e.g. reading habits and performance. These types of
studies may look at relatively simple linear relationships, i.e. correlations, as in the cited example,
but it is possible, using multiple regression techniques, to examine the complex interrelationships
between several factors which may have a bearing on the topic of interest. For example, the
likelihood of a person failing his exams is related to many factors including poor preparation,
information-seeking behaviour, use of relevant and current material, and other lifestyle measures.
The relative contributions of these factors can be teased out using these sorts of techniques.
Regardless of the research techniques used, in quantitative research the aim of the research
activity may be summarised as Understanding, Prediction, and Control. The researcher is
attempting to gain an understanding of the phenomena under study so that they may use this
understanding in order to make predictions about the real world, and thus develop technologies
or procedures which allow a degree of control to be exerted over those phenomena. Thus in
Library and Information Science research we may be seeking to understand the transmission of
information to meet human needs in order to make predictions about how to help patrons find
information that will fill an information gap.
Nature of Data
Numerical data is the essence of quantitative methods. In order to try to understand the
phenomena under study, the researcher will first have to find a means of expressing the
variables to be measured using some form of numerical technique. For most practical purposes,
data can be measured at four different levels; each level has a specific purpose and also has
important implications for the type of analysis to be undertaken. These four levels of
measurement are known as nominal, ordinal, interval and ratio.
Consider a patron who registers at a library in order to borrow books and use the
library facilities. During registration, the patron's name is taken and a patron identification
number is issued. The number is a unique identifier, since there may well be several patrons with the
same name. But the number that is issued is rather like a name in that it identifies the patron
and probably will not be used in a numerical sense; i.e. this patron's identifier has no numerical
significance relative to other patrons' identification numbers. Data of this type, in which
numbers are used as identifiers, are called nominal data, and researchers speak of the nominal level
of measurement. Another example might be where a number is used to identify gender, e.g. 1
= male and 2 = female.
Continuing with this example, suppose that after registration the patron wants to see a reference
librarian. The patron is asked to wait and is told that they are number five
on a waiting list. Although this tells the patron that four other people will have to
be seen before their turn, it doesn't indicate very much about how long it will take to see the
reference librarian, since there is no way of knowing how long each of the other four patrons will need
with the reference librarian. This is an example of ordinal data, where the numerical value
indicates something about relative rather than absolute position in a series. Other common
examples of ordinal data include ranks, where absolute numerical values are turned into a
numerical series because we are more interested in relative values than absolute. Many
statistical tests make use of this type of ranked data.
For someone who weighs 150 lb, not only do we know they are heavier than someone with a
weight of 140 lb, but we also know by how much. This is known as interval level
measurement because this numerical system also tells us about the intervals between the units
of measurement. Weight is a special type of interval measurement because it has
an absolute zero point, i.e. you can't have less than zero weight. This special type of scale is
known as a ratio scale.
For most purposes the distinction between interval and ratio scales is not that important, but
knowing the difference between nominal, ordinal, and interval/ratio is of importance because it
helps us to choose the appropriate statistical tool for the analysis of the data.
Variables
Put simply, a variable is something that can have more than one value. In research, particularly
quantitative research where we are using experiments to try to establish cause and effect, certain
variables are especially important:
Scholarly communication may be affected by poor quality of content, an outdated research focus,
or other problems that affect it. All these other potential sources of
influence are known as extraneous variables. The purpose of experimental design is, as far as
possible, to control these extraneous factors.
Target Populations
The target population in a research study comprises all those potential participants that could
make up the study group. Thus in a study of the provision of online information services, the
target population would be all libraries with a website. Of course a researcher might
want to narrow the target population and might choose (for example) all university libraries
as the target population. It is important to realise that the ability to generalise the findings of a
study will be restricted by the chosen target population. Thus in the above study the researcher
could only generalise the findings back to a population of university libraries with a website.
It should be added, though, that the findings may well be suggestive of similar results in
populations that do not differ too greatly from the target population.
MEP Pupil Text 9
9 Data Analysis
9.1 Mean, Median, Mode and Range
In Unit 8, you were looking at ways of collecting and representing data. In this unit, you
will go one step further and find out how to calculate statistical quantities which
summarise the important characteristics of the data.
The mean, median and mode are three different ways of describing the average.
• To find the mean, add up all the numbers and divide by the number of numbers.
• To find the median, place all the numbers in order and select the middle number.
• The mode is the number which appears most often.
• The range gives an idea of how the data are spread out and is the difference between
the smallest and largest values.
Worked Example 1
Find
(a) the mean (b) the median (c) the mode (d) the range
of this set of data.
5, 6, 2, 4, 7, 8, 3, 5, 6, 6
Solution
(a) The mean is

$$\frac{5 + 6 + 2 + 4 + 7 + 8 + 3 + 5 + 6 + 6}{10} = \frac{52}{10} = 5.2$$
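(b) In order, the numbers are 2, 3, 4, 5, 5, 6, 6, 6, 7, 8. There are two middle numbers,
5 and 6, so the median is the mean of these two: median = 5.5 .
(c) From the list above it is easy to see that 6 appears more than any other number, so
mode = 6 .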
(d) The range is the difference between the smallest and largest numbers, in this case
2 and 8. So the range is 8 − 2 = 6 .
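
The four quantities found in this example can also be checked by computer. The following is a minimal sketch in Python, assuming the standard `statistics` module is available; the variable names are chosen here for illustration only.

```python
from statistics import mean, median, multimode

# Data from Worked Example 1
data = [5, 6, 2, 4, 7, 8, 3, 5, 6, 6]

data_mean = mean(data)              # add up the numbers, divide by how many there are
data_median = median(data)          # middle value; mean of the two middle values if the count is even
data_modes = multimode(data)        # list of the value(s) that appear most often
data_range = max(data) - min(data)  # difference between largest and smallest values

print(data_mean, data_median, data_modes, data_range)
```

Note that `multimode` returns a list, because a data set can have more than one mode.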
Worked Example 2
Five people play golf and at one hole their scores are
3, 4, 4, 5, 7.
For these scores, find
(a) the mean (b) the median (c) the mode (d) the range .
Solution
(a) The mean is

$$\frac{3 + 4 + 4 + 5 + 7}{5} = \frac{23}{5} = 4.6$$
(b) The numbers are already in order and the middle number is 4. So
median = 4 .
(c) The score 4 appears more often than any other, so
mode = 4 .
(d) The range is the difference between the smallest and largest numbers, in this case
3 and 7, so
range = 7 − 3 = 4 .
Exercises
1. Find the mean, median, mode and range of each set of numbers below.
(a) 3, 4, 7, 3, 5, 2, 6, 10
(b) 8, 10, 12, 14, 7, 16, 5, 7, 9, 11
(c) 17, 18, 16, 17, 17, 14, 22, 15, 16, 17, 14, 12
(d) 108, 99, 112, 111, 108
(e) 64, 66, 65, 61, 67, 61, 57
(f) 21, 30, 22, 16, 24, 28, 16, 17
2. Twenty children were asked their shoe sizes. The results are given below.
8, 6, 7, 6, 5, 4½, 7½, 6½, 8½, 10
7, 5, 5½, 8, 9, 7, 5, 6, 8½, 6
Worked Example 1
A football team keep records of the number of goals it scores per match during a season.
5         2     5 × 2 = 10
TOTALS    40    73

$$\text{Mean} = \frac{73}{40} = 1.825$$
Worked Example 2
The bar chart shows how many cars were sold by a salesman over a period of time.
[Bar chart: Frequency (vertical axis, 0 to 6) against Cars sold per day (horizontal axis, 0 to 5).]
Solution
The data can be transferred to a table and a third column included as shown.
Cars sold per day    Frequency    Cars sold × Frequency
0                    2            0 × 2 = 0
1                    4            1 × 4 = 4
2                    3            2 × 3 = 6
3                    6            3 × 6 = 18
4                    3            4 × 3 = 12
5                    2            5 × 2 = 10
TOTALS               20           50

$$\text{Mean} = \frac{50}{20} = 2.5$$
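
The same calculation can be sketched in Python. This is only an illustration of the method used in the table; the dictionary name `frequency` and the other variable names are assumptions made for the example.

```python
# Frequency table from the worked example: cars sold per day -> number of days
frequency = {0: 2, 1: 4, 2: 3, 3: 6, 4: 3, 5: 2}

total_days = sum(frequency.values())                   # total of the frequency column
total_cars = sum(x * f for x, f in frequency.items())  # total of the "value × frequency" column
mean_cars = total_cars / total_days

print(total_days, total_cars, mean_cars)  # 20 50 2.5
```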
Worked Example 3
A police station kept records of the number of road traffic accidents in their area each day
for 100 days. The figures below give the number of accidents per day.
1 4 3 5 5 2 5 4 3 2 0 3 1 2 2 3 0 5 2 1
3 3 2 6 2 1 6 1 2 2 3 2 2 2 2 5 4 4 2 3
3 1 4 1 7 3 3 0 2 5 4 3 3 4 3 4 5 3 5 2
4 4 6 5 2 4 5 5 3 2 0 3 3 4 5 2 3 3 4 4
1 3 5 1 1 2 2 5 6 6 4 6 5 8 2 5 3 3 5 4
Solution
The first step is to draw out and complete a tally chart. The final column shown below
can then be added and completed.
Accidents per day    Tally                       Frequency    Accidents × Frequency
0                    ||||                        4            0 × 4 = 0
1                    |||| ||||                   10           1 × 10 = 10
2                    |||| |||| |||| |||| ||      22           2 × 22 = 44
3                    |||| |||| |||| |||| |||     23           3 × 23 = 69
4                    |||| |||| |||| |            16           4 × 16 = 64
5                    |||| |||| |||| ||           17           5 × 17 = 85
6                    |||| |                      6            6 × 6 = 36
7                    |                           1            7 × 1 = 7
8                    |                           1            8 × 1 = 8
TOTALS                                           100          323

$$\text{Mean number of accidents per day} = \frac{323}{100} = 3.23$$
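
The tally chart can also be produced by computer. The sketch below assumes Python's standard `collections.Counter` is used to count how often each value occurs; the list `accidents` simply copies the 100 figures given in the question.

```python
from collections import Counter

# The 100 daily accident figures from the question, row by row
accidents = [
    1, 4, 3, 5, 5, 2, 5, 4, 3, 2, 0, 3, 1, 2, 2, 3, 0, 5, 2, 1,
    3, 3, 2, 6, 2, 1, 6, 1, 2, 2, 3, 2, 2, 2, 2, 5, 4, 4, 2, 3,
    3, 1, 4, 1, 7, 3, 3, 0, 2, 5, 4, 3, 3, 4, 3, 4, 5, 3, 5, 2,
    4, 4, 6, 5, 2, 4, 5, 5, 3, 2, 0, 3, 3, 4, 5, 2, 3, 3, 4, 4,
    1, 3, 5, 1, 1, 2, 2, 5, 6, 6, 4, 6, 5, 8, 2, 5, 3, 3, 5, 4,
]

tally = Counter(accidents)  # value -> frequency, the computer's version of the tally chart

for value in sorted(tally):
    print(value, tally[value], value * tally[value])

total_accidents = sum(value * freq for value, freq in tally.items())
mean_accidents = total_accidents / len(accidents)
print(total_accidents, mean_accidents)
```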
Exercises
1. A survey of 100 households asked how many cars there were in each household.
The results are given below.
2. The survey of question 1 also asked how many TV sets there were in each
household. The results are given below.
8. The mean of 6 numbers is 12.3. When an extra number is added, the mean changes
to 11.9. What is the extra number?
9. When 5 is added to a set of 3 numbers the mean increases to 4.6. What was the
mean of the original 3 numbers?
10. Three numbers have a mean of 64. When a fourth number is included the mean is
doubled. What is the fourth number?
Worked Example 1
The table below gives data on the heights, in cm, of 51 children.
Class Interval 140 ≤ h < 150 150 ≤ h < 160 160 ≤ h < 170 170 ≤ h < 180
Frequency 6 16 21 8
(a) Estimate the mean height. (b) Estimate the median height.
(c) Find the modal class.
Solution
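(a) To estimate the mean, the mid-point of each interval should be used.

Class Interval    Mid-point    Frequency    Mid-point × Frequency
140 ≤ h < 150     145          6            145 × 6 = 870
150 ≤ h < 160     155          16           155 × 16 = 2480
160 ≤ h < 170     165          21           165 × 21 = 3465
170 ≤ h < 180     175          8            175 × 8 = 1400
Totals                         51           8215

$$\text{Mean} = \frac{8215}{51} = 161 \text{ (to the nearest cm)}$$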
(b) The median is the 26th value. In this case it lies in the 160 ≤ h < 170 class interval.
The 4th value in the interval is needed. It is estimated as

$$160 + \frac{4}{21} \times 10 = 162 \text{ (to the nearest cm)}$$
(c) The modal class is 160 ≤ h < 170 as it contains the most values.
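
The estimates in this example can be sketched in Python as follows. The function names `grouped_mean` and `grouped_median` are hypothetical, chosen for this illustration, and the median is located at position (n + 1)/2 to match the "26th value" used in the solution above.

```python
def grouped_mean(classes):
    """Estimate the mean from (lower, upper, frequency) triples using mid-points."""
    n = sum(f for _, _, f in classes)
    return sum((lo + hi) / 2 * f for lo, hi, f in classes) / n


def grouped_median(classes):
    """Estimate the median by assuming values are spread evenly within the median class."""
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2   # the 26th value when n = 51
    seen = 0                 # number of values in the classes already passed
    for lo, hi, f in classes:
        if seen + f >= position:
            return lo + (position - seen) / f * (hi - lo)
        seen += f
    raise ValueError("no data in the table")


heights = [(140, 150, 6), (150, 160, 16), (160, 170, 21), (170, 180, 8)]
modal_class = max(heights, key=lambda c: c[2])  # class with the largest frequency

print(round(grouped_mean(heights)))    # 161
print(round(grouped_median(heights)))  # 162
print(modal_class[:2])                 # (160, 170)
```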
Also note that when we speak of someone by age, say 8, then the person could be any age
from 8 years 0 days up to 8 years 364 days (365 in a leap year!). You will see how this is
tackled in the following example.
Worked Example 2
The ages of children in a primary school were recorded in the table below.

Age (years)    5 – 6    7 – 8    9 – 10
Frequency      29       40       38
(a) Estimate the mean. (b) Estimate the median. (c) Find the modal age.
Solution
(a) To estimate the mean, we must use the mid-point of each interval; so, for example
for '5 – 6', which really means
5 ≤ age < 7 ,
the mid-point is taken as 6.
Age       Mid-point    Frequency    Mid-point × Frequency
5 – 6     6            29           6 × 29 = 174
7 – 8     8            40           8 × 40 = 320
9 – 10    10           38           10 × 38 = 380
Totals                 107          874

$$\text{Mean} = \frac{874}{107} = 8.2 \text{ (to 1 decimal place)}$$
(b) The median is given by the 54th value, which we have to estimate. There are 29
values in the first interval, so we need to estimate the 25th value in the second
interval. As there are 40 values in the second interval, the median is estimated as
being $\frac{25}{40}$ of the way along the second interval. This has width 9 − 7 = 2 years, so the
median is estimated by

$$\frac{25}{40} \times 2 = 1.25$$

from the start of the interval. Therefore the median is estimated as
7 + 1.25 = 8.25 years.
Worked Example 1 uses what are called continuous data, since height can be of any value.
(Other examples of continuous data are weight, temperature, area, volume and time.)
The next example uses discrete data, that is, data which can take only a particular value,
such as the integers 1, 2, 3, 4, . . . in this case.
The calculations for mean and mode are not affected but estimation of the median
requires replacing the discrete grouped data with an approximate continuous interval.
Worked Example 3
The number of days that children were missing from school due to sickness in one year
was recorded.
Number of days    1–5    6–10    11–15    16–20    21–25
Frequency         12     11      10       4        3
(a) Estimate the mean (b) Estimate the median. (c) Find the modal class.
Solution
(a) The estimate is made by assuming that all the values in a class interval are equal to
the midpoint of the class interval.
Class     Mid-point    Frequency    Mid-point × Frequency
1–5       3            12           3 × 12 = 36
6–10      8            11           8 × 11 = 88
11–15     13           10           13 × 10 = 130
16–20     18           4            18 × 4 = 72
21–25     23           3            23 × 3 = 69
Totals                 40           395

$$\text{Mean} = \frac{395}{40} = 9.875 \text{ days}$$
(b) As there are 40 pupils, we need to consider the mean of the 20th and 21st values.
These both lie in the 6–10 class interval, which is really the 5.5–10.5 class interval,
so this interval contains the median.
As there are 12 values in the first class interval, the median is found by considering
the 8th and 9th values of the second interval.
As there are 11 values in the second interval, the median is estimated as being
$\frac{8.5}{11}$ of the way along the second interval.
But the length of the second interval is 10.5 − 5.5 = 5 , so the median is estimated by

$$\frac{8.5}{11} \times 5 = 3.86$$

from the start of this interval. Therefore the median is estimated as
5.5 + 3.86 = 9.36 .
(c) The modal class is 1–5, as this class contains the most entries.
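
The median estimate for discrete grouped data can be sketched in Python. The function name `median_from_discrete_classes` is hypothetical; the sketch assumes, as in the solution above, that each class such as 6–10 is widened by 0.5 at each end (giving 5.5–10.5) and that the median sits at position (n + 1)/2, i.e. half-way between the 20th and 21st values when n = 40.

```python
def median_from_discrete_classes(classes):
    """classes: list of (first_value, last_value, frequency) for whole-number data."""
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2   # 20.5 when n = 40
    seen = 0                 # values counted in earlier classes
    for first, last, f in classes:
        lower, upper = first - 0.5, last + 0.5  # continuous class boundaries
        if seen + f >= position:
            return lower + (position - seen) / f * (upper - lower)
        seen += f
    raise ValueError("no data in the table")


days_absent = [(1, 5, 12), (6, 10, 11), (11, 15, 10), (16, 20, 4), (21, 25, 3)]
median_days = median_from_discrete_classes(days_absent)
print(round(median_days, 2))  # 9.36
```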
Exercises
1. A door to door salesman keeps a record of the number of homes he visits each day.
3. A stopwatch was used to find the time that it took a group of children to run 100 m.
(a) Is the median in the modal class? (b) Estimate the mean.
(c) Estimate the median.
(d) Is the median greater or less than the mean?
Distance (km) 0 ≤ d < 0.5 0.5 ≤ d < 1.0 1.0 ≤ d < 1.5 1.5 ≤ d < 2.0
Frequency 30 22 19 8
The inter-quartile range contains the middle 50% of the sample and describes how
spread out the data are. This is illustrated in Example 2.
Height (cm)      Frequency    Cumulative Frequency
90 < h ≤ 100     5            5
100 < h ≤ 110    22           5 + 22 = 27
110 < h ≤ 120    30           27 + 30 = 57
120 < h ≤ 130    31           57 + 31 = 88
130 < h ≤ 140    18           88 + 18 = 106
140 < h ≤ 150    6            106 + 6 = 112
[Cumulative frequency graph: Cumulative Frequency against Height (cm), plotted through the points (90, 0), (100, 5), (110, 27), (120, 57), (130, 88), (140, 106) and (150, 112).]
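
The cumulative frequencies, and the readings usually taken from a graph of them, can be sketched in Python. This is an illustration only: the function name `read_off` is hypothetical, the points are joined by straight line segments, and the median and quartiles are taken at one half, one quarter and three quarters of the total frequency.

```python
from itertools import accumulate

upper_bounds = [100, 110, 120, 130, 140, 150]  # upper end of each height class (cm)
frequencies = [5, 22, 30, 31, 18, 6]

cumulative = list(accumulate(frequencies))     # running totals: 5, 27, 57, 88, 106, 112

# Points of the cumulative frequency graph, starting from (90, 0)
points = [(90, 0)] + list(zip(upper_bounds, cumulative))


def read_off(points, cf):
    """Height at which the straight-line graph reaches cumulative frequency cf."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= cf <= y1:
            return x0 + (cf - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("cumulative frequency is outside the range of the graph")


total = cumulative[-1]
median_height = read_off(points, total / 2)
lower_quartile = read_off(points, total / 4)
upper_quartile = read_off(points, 3 * total / 4)
inter_quartile_range = upper_quartile - lower_quartile

print(cumulative)
print(round(median_height, 1), round(inter_quartile_range, 1))
```

Readings taken by eye from a hand-drawn graph, or from a smooth curve, will differ slightly from these straight-line values.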
Note
A more accurate graph is found by drawing a smooth curve through the points, rather
than using straight line segments.
[Cumulative frequency graph: Cumulative Frequency against Height (cm), with a smooth curve through the points (90, 0), (100, 5), (110, 27), (120, 57), (130, 88), (140, 106) and (150, 112).]
Worked Example 2
The cumulative frequency graph below gives the results of 120 students on a test.
[Cumulative frequency graph: Cumulative Frequency (0 to 120) against Test Score (0 to 100).]
Solution
(a) Since $\frac{1}{2}$ of 120 is 60, the median is found by starting at 60 on the cumulative
frequency axis and reading off the corresponding test score.
Median = 53
(b) To find out the inter-quartile range, we must consider the middle 50% of the
students. To find the lower quartile, start at $\frac{1}{4}$ of 120, which is 30. This gives
Lower Quartile = 43 .
The upper quartile is found in the same way, starting at $\frac{3}{4}$ of 120, which is 90.
(c)
[Two cumulative frequency graphs against Test Score; surviving labels: 108, 79 and 103.]
As in Worked Example 1, a more accurate estimate for the median and inter-quartile
range is obtained if you draw a smooth curve through the data points.
Exercises
1. Make a cumulative frequency table for each set of data given below. Then draw a
cumulative frequency graph and use it to find the median and inter-quartile range.
(a) John weighed each apple in a large box. His results are given in this table.
Weight of apple (g)    60 < w ≤ 80    80 < w ≤ 100    100 < w ≤ 120    120 < w ≤ 140    140 < w ≤ 160
Frequency              4              28              33               27               8
(b) Pasi asked the students in his class how far they travelled to school each day.
His results are given below.
[Cumulative frequency curve: Cumulative Frequency (0 to 120) on the vertical axis.]
(c) (i) Use the cumulative frequency curve to estimate the median distance
travelled by the guests.
(ii) Give a reason for the large difference between the mean distance and
the median distance.
(MEG)
[Two frequency polygons, one solid and one dotted: Frequency against Length (0 to 12).]
The range (highest value – lowest value) gives a simple measure of how much the data
are spread out.
Standard deviation (s.d.) is a much more useful measure and is given by the formula:

$$\text{s.d.} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

where $x_i$ is each data value, $\bar{x}$ is the mean and $n$ is the number of values.
Then $(x_i - \bar{x})^2$ gives the square of the difference between each value and the mean
(squaring exaggerates the effect of data points far from the mean and gets rid of negative
values), and

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

sums these squared differences over all the data.
The expression

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

gives an average value to these differences. If all the data were the same, then each $x_i$
would equal $\bar{x}$ and the expression would be zero.
Finally we take the square root of the expression so that the dimensions of the standard
deviation are the same as those of the data.
So standard deviation is a measure of the spread of the data. The greater its value, the
more spread out the data are. This is illustrated by the two frequency polygons shown
above. Although both sets of data have the same mean, the data represented by the
'dotted' frequency polygon will have a greater standard deviation than the other.
Worked Example 1
Find the mean and standard deviation of the numbers,
6, 7, 8, 5, 9.
Solution
The mean, $\bar{x}$, is given by

$$\bar{x} = \frac{6 + 7 + 8 + 5 + 9}{5} = \frac{35}{5} = 7$$
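The standard deviation is then

$$\text{s.d.} = \sqrt{\frac{(6-7)^2 + (7-7)^2 + (8-7)^2 + (5-7)^2 + (9-7)^2}{5}} = \sqrt{\frac{1 + 0 + 1 + 4 + 4}{5}} = \sqrt{\frac{10}{5}} = \sqrt{2} = 1.414 \text{ (to 3 decimal places)}$$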
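
The steps of this calculation can be sketched in Python. The sketch follows the formula used in this section, which divides by n; Python's `statistics.pstdev` does the same, whereas `statistics.stdev` divides by n − 1 and would give a different answer.

```python
from math import sqrt
from statistics import pstdev

data = [6, 7, 8, 5, 9]

n = len(data)
mean = sum(data) / n                               # 7.0
squared_diffs = [(x - mean) ** 2 for x in data]    # [1.0, 0.0, 1.0, 4.0, 4.0]
sd = sqrt(sum(squared_diffs) / n)                  # square root of the average squared difference

print(mean, round(sd, 3))     # 7.0 1.414
print(round(pstdev(data), 3)) # 1.414, the library's divide-by-n standard deviation
```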
An alternative formula for the standard deviation is

$$\text{s.d.} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2}$$
This expression is much more convenient for calculations done without a calculator. The
proof of the equivalence of this formula is given below although it is beyond the scope of
the GCSE syllabus.
Proof
You can see the proof of the equivalence of the two formulae by noting that
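$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right)$$

$$= \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} (2 x_i \bar{x}) + \sum_{i=1}^{n} \bar{x}^2$$

$$= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \bar{x}^2 \sum_{i=1}^{n} 1$$

(since the expressions $2\bar{x}$ and $\bar{x}^2$ are common for each term in the summation).
But $\sum_{i=1}^{n} 1 = n$, since you are summing $\underbrace{1 + 1 + \dots + 1}_{n \text{ terms}} = n$, and
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, by definition, thus

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n} \left( \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + \bar{x}^2 n \right) \quad \text{(substituting } \textstyle\sum_{i=1}^{n} 1 = n \text{)}$$

$$= \frac{\sum_{i=1}^{n} x_i^2}{n} - 2\bar{x} \left( \frac{\sum_{i=1}^{n} x_i}{n} \right) + \bar{x}^2 \quad \text{(dividing by } n \text{)}$$

$$= \frac{\sum_{i=1}^{n} x_i^2}{n} - 2\bar{x}^2 + \bar{x}^2 \quad \text{(substituting } \bar{x} \text{ for } \tfrac{1}{n} \textstyle\sum_{i=1}^{n} x_i \text{)}$$

$$= \frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2$$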
and the result follows.
Worked Example 2
Find the mean and standard deviation of each of the following sets of numbers.
(a) 10, 11, 12, 13, 14 (b) 5, 6, 12, 18, 19
Solution
(a) The mean, $\bar{x}$, is given by

$$\bar{x} = \frac{10 + 11 + 12 + 13 + 14}{5} = \frac{60}{5} = 12$$

The standard deviation can now be calculated using the alternative formula.

$$\text{s.d.} = \sqrt{\left( \frac{10^2 + 11^2 + 12^2 + 13^2 + 14^2}{5} \right) - 12^2} = \sqrt{146 - 144} = \sqrt{2} = 1.414 \text{ (to 3 decimal places)}$$
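(b) The mean, $\bar{x}$, is given by

$$\bar{x} = \frac{5 + 6 + 12 + 18 + 19}{5} = \frac{60}{5} = 12$$

and the standard deviation by

$$\text{s.d.} = \sqrt{\left( \frac{5^2 + 6^2 + 12^2 + 18^2 + 19^2}{5} \right) - 12^2} = \sqrt{178 - 144} = \sqrt{34} = 5.831 \text{ (to 3 decimal places)}$$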
Note that both sets of numbers have the same mean value, but that set (b) has a much
larger standard deviation. This is expected, as the spread in set (b) is clearly far more
than in set (a).
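
The comparison between the two sets can be sketched in Python. The function names below are hypothetical; one follows the original definition of standard deviation and the other the alternative formula, so the sketch also illustrates that the two agree (up to rounding error in the arithmetic).

```python
from math import sqrt


def sd_definition(values):
    """Square root of the average squared difference from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_alternative(values):
    """Square root of (mean of the squares minus the square of the mean)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum(x * x for x in values) / n - mean ** 2)


set_a = [10, 11, 12, 13, 14]
set_b = [5, 6, 12, 18, 19]

for name, values in (("a", set_a), ("b", set_b)):
    print(name, sum(values) / len(values),
          round(sd_definition(values), 3), round(sd_alternative(values), 3))
```

Both sets have mean 12, but the second set has by far the larger standard deviation.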
Worked Example 3
The table below gives the number of road traffic accidents per day in a small town.
Solution
The necessary calculations for each datapoint, xi , are set out below.
$$\bar{x} = \frac{\sum_{i} x_i f_i}{n} = \frac{43}{25} = 1.72$$

$$\text{s.d.} = \sqrt{\frac{\sum_{i} x_i^2 f_i}{n} - \bar{x}^2} = \sqrt{\frac{127}{25} - 1.72^2} = 1.457$$
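
The final step can be checked with a short Python sketch. It uses only the three totals quoted in the solution (n = 25, the total of x × f = 43 and the total of x² × f = 127); the variable names are assumptions made for the example.

```python
from math import sqrt

n = 25          # total frequency (number of days)
sum_xf = 43     # total of x * f
sum_x2f = 127   # total of x squared * f

mean = sum_xf / n                    # 1.72
sd = sqrt(sum_x2f / n - mean ** 2)   # alternative formula applied to a frequency table

print(mean, round(sd, 3))  # 1.72 1.457
```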
Most scientific calculators have statistical functions which will calculate the mean and
standard deviation of a set of data.
Exercises
1. (a) Find the mean and standard deviation of each set of data given below.
A 51 56 51 49 53 62
B 71 76 71 69 73 82
C 102 112 102 98 106 124
(b) Describe the relationship between each set of numbers and also the
relationship between their means and standard deviations.
2. Two machines, A and B, fill empty packets with soap powder. A sample of boxes
was taken from each machine and the weight of powder (in kg) was recorded.
(a) Find the mean and standard deviation for each machine.
(b) Which machine is most consistent?
3. Two groups of students were trying to find the acceleration due to gravity.
Each group conducted 5 experiments.
Find the mean and standard deviation for each group, and comment on their results.
4. The number of matches per box was counted for 100 boxes of matches.
The results are given in the table below.