School of Engineering Engineering Mathematics 4
(MTH60403/ENG 2123)
Distinguish between discrete and continuous data
Construct frequency and relative frequency tables for
grouped and ungrouped discrete data
Determine class boundaries, class intervals and central
values for discrete and continuous data
Construct a histogram and a frequency polygon
Determine the mean, median and mode of grouped and
ungrouped data
Determine the range, variance and standard deviation of
discrete data
Measure dispersion of data using the normal and
standard normal curves.
Topics to be covered
Introduction
Arrangement of data
Histograms
Measure of central tendency
Dispersion
Frequency polygons
Frequency curves
Normal distribution curve
Standardized normal curve
Introduction
Statistics as a discipline is the development and
application of methods to collect, analyze and interpret
data.
Statistical techniques are used in a wide range of scientific
and social research. Areas that use modern statistical methods
include medicine, economics, finance, marketing research and
manufacturing.
Some fields of inquiry use applied statistics so extensively
that they have specialized terminology. Some examples
include: Data mining, Energy statistics, Engineering
statistics, Reliability engineering, Social statistics etc.
Statistics is concerned with the collection, ordering and analysis of
data. Data consist of sets of recorded observations or values. Any
quantity that can have a number of values is a variable. A variable
may be one of two kinds:
(a) Discrete: a variable that can be counted, or for which there is a
fixed set of values. Examples: number of people in a room, shoe
size of children, number of components in a machine.
(b) Continuous: a variable that can be measured on a continuous
scale, the result depending on the precision of the measuring
instrument or the accuracy of the observer. Examples: weight of
people, output voltage of an analogue system, loads on a beam,
temperature of a coolant, the capacity of a container.
Definition of "continuous data": data which can take any value between two
end points. For example, weights of people can be 60.28 kg, 70.3 kg, ...
A statistical exercise normally consists of four stages:
1. Collection of data (measure and record)
2. Arrangement/ordering and presentation of the data
3. Analysis of the collected data
4. Interpretation of the results and formulation of conclusions.
Arrangement of data
A set of data:
28 31 29 27 30 29 29 26 30 28
28 29 27 26 32 28 32 31 25 30
27 30 29 30 28 29 31 27 28 28
Can be arranged in ascending order:
25 26 26 27 27 27 27 28 28 28
28 28 28 28 29 29 29 29 29 29
30 30 30 30 30 31 31 31 32 32
Once the data is in ascending order:
25 26 26 27 27 27 27 28 28 28
28 28 28 28 29 29 29 29 29 29
30 30 30 30 30 31 31 31 32 32
it can be entered into a table. The number of occasions on which any
particular value occurs is called the frequency, denoted by f.

Value   Number of times (f)
25      1
26      2
27      4
28      7
29      6
30      5
31      3
32      2
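The tabulation above can be sketched in a few lines of Python. This is only an illustration: the names `data`, `frequency` and `table` are chosen here and are not part of the lecture.

```python
from collections import Counter

# The 30 readings from the slide, in the order they were recorded.
data = [
    28, 31, 29, 27, 30, 29, 29, 26, 30, 28,
    28, 29, 27, 26, 32, 28, 32, 31, 25, 30,
    27, 30, 29, 30, 28, 29, 31, 27, 28, 28,
]

# Count how often each value occurs, then list the values in ascending order.
frequency = Counter(data)
table = sorted(frequency.items())

print("Value  f")
for value, f in table:
    print(f"{value:>5}  {f}")
```

The frequencies should add up to the number of readings, which is a quick check that no value was missed.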
When dealing with large numbers of readings, instead
of writing all the values in ascending order, it is more
convenient to compile a tally diagram, recording the
range of values of the variable and adding a stroke for
each occurrence of that reading:
Grouped Data
If the range of values of the variable is large, it is often
helpful to consider these values arranged in regular
groups or classes.
Grouping with Continuous Data
The lengths (in mm) of 40 spindles were measured as below :
20.90 20.57 20.86 20.74 20.82 20.63 20.53 20.89 20.75 20.65
20.71 21.03 20.72 20.41 20.94 20.75 20.79 20.65 21.08 20.89
20.50 20.88 20.97 20.78 20.61 20.92 21.07 21.16 20.80 20.77
20.82 20.72 20.60 20.90 20.86 20.68 20.75 20.88 20.56 20.94
Lowest value = 20.41 and highest value = 21.16, so form classes from
20.40 to 21.20 at 0.10 intervals.
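As a rough sketch, the grouping can be carried out in Python as follows. The class convention used here (each class holds values with lower value <= x < lower value + 0.10) is an assumption, since the lecture's own table is not reproduced in these notes; the names `lengths`, `class_counts` and `grouped` are illustrative.

```python
from collections import Counter

# Lengths (in mm) of the 40 spindles from the slide.
lengths = [
    20.90, 20.57, 20.86, 20.74, 20.82, 20.63, 20.53, 20.89, 20.75, 20.65,
    20.71, 21.03, 20.72, 20.41, 20.94, 20.75, 20.79, 20.65, 21.08, 20.89,
    20.50, 20.88, 20.97, 20.78, 20.61, 20.92, 21.07, 21.16, 20.80, 20.77,
    20.82, 20.72, 20.60, 20.90, 20.86, 20.68, 20.75, 20.88, 20.56, 20.94,
]

# Work in whole hundredths of a millimetre to avoid floating-point edge cases.
hundredths = [round(x * 100) for x in lengths]

LOWEST = 2040  # 20.40 mm, start of the first class
WIDTH = 10     # 0.10 mm class width

# Assumed convention: class k holds 20.40 + 0.10k <= x < 20.50 + 0.10k.
class_counts = Counter((h - LOWEST) // WIDTH for h in hundredths)

grouped = []
for k in range(8):
    lower = (LOWEST + k * WIDTH) / 100
    upper = (LOWEST + (k + 1) * WIDTH) / 100
    grouped.append((lower, upper, class_counts.get(k, 0)))
    print(f"{lower:.2f} <= x < {upper:.2f}: {class_counts.get(k, 0)}")
```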
With continuous data the group boundaries are given
to the same number of significant figures or decimal
places as the data.
[Table: the lengths (in mm) of 40 spindles, measured and arranged
in classes.]
Relative Frequency
If the frequency of any one group is divided by the
sum of the frequencies the ratio is called the relative
frequency of that group. Relative frequencies can be
expressed as percentages:
\[ \frac{1}{40} \times 100 = 2.5\% \qquad \frac{9}{40} \times 100 = 22.5\% \]
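A minimal sketch of the same calculation for every class at once. The list of class frequencies is an assumption: it was counted from the 40 spindle lengths using classes of width 0.10 mm starting at 20.40 mm, and it agrees with the two frequencies (1 and 9 out of 40) used above.

```python
# Assumed class frequencies for the 40 spindle lengths (see lead-in).
frequencies = [1, 4, 6, 10, 9, 6, 3, 1]

total = sum(frequencies)

# Relative frequency of each class, expressed as a percentage.
relative_percent = [100 * f / total for f in frequencies]

for f, r in zip(frequencies, relative_percent):
    print(f"f = {f:>2}  relative frequency = {r:.1f}%")
```

The relative frequencies of all classes should sum to 100%.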
Rounding off Data
If the value 21.7 is expressed to two significant
figures, the result is rounded up to 22. Similarly, 21.4
is rounded down to 21.
To maintain consistency of group boundaries, middle
values will always be rounded up, so that 21.5 is
rounded up to 22 and 42.5 is rounded up to 43.
Therefore, when a result is quoted to two significant
figures as 37 on a continuous scale this includes all
possible values between:
36.50000… and 37.49999…
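The "middle values always round up" convention differs from Python's built-in `round`, which rounds ties to the nearest even number (so `round(42.5)` gives 42). A small sketch of the lecture's convention using the standard `decimal` module; the helper name `round_half_up` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: str) -> int:
    """Round a decimal string to the nearest integer, ties rounded up.

    ROUND_HALF_UP rounds ties away from zero, which means "up" for the
    positive measurements considered here.
    """
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


for v in ["21.7", "21.4", "21.5", "42.5"]:
    print(v, "->", round_half_up(v))
```

Values are passed as strings so that they are represented exactly rather than as binary floating-point approximations.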
Class Boundaries
A class or group boundary lies midway between the data
values. For example, for data in the class or group labelled:
7.1 – 7.3
(a) The class values 7.1 and 7.3 are the lower and upper limits
of the class and their difference gives the class width.
(b) The class boundaries are 0.05 below the lower class limit
and 0.05 above the upper class limit, that is 7.05 and 7.35.
(c) The class interval is the difference between the upper and
lower class boundaries: 7.35 - 7.05 = 0.30.
(d) The central value (or mid-value) of the class interval is one
half of the sum of the upper and lower class boundaries:
(7.05 + 7.35)/2 = 7.20.
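For the class 7.1 – 7.3 these quantities can be checked with a short sketch. The function name `class_summary` and its `precision` argument (the step of the recorded data, 0.1 here) are illustrative assumptions; the central value is computed as the midpoint of the two class boundaries.

```python
from decimal import Decimal


def class_summary(lower_limit: str, upper_limit: str, precision: str):
    """Return (lower boundary, upper boundary, class interval, central value).

    `precision` is the step of the recorded data (0.1 for data given to
    1 dp), so each boundary sits half a step outside the class limits.
    """
    half_step = Decimal(precision) / 2
    lower_boundary = Decimal(lower_limit) - half_step
    upper_boundary = Decimal(upper_limit) + half_step
    interval = upper_boundary - lower_boundary
    central = (lower_boundary + upper_boundary) / 2
    return lower_boundary, upper_boundary, interval, central


lower_b, upper_b, interval, central = class_summary("7.1", "7.3", "0.1")
print(lower_b, upper_b, interval, central)
```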
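These terms can be summarized in a diagram, using
the class 7.1 – 7.3 (inclusive) as an example.
[Diagram: the class 7.1 – 7.3 marked with (a) the class limits,
(b) the class boundaries, (c) the class interval and (d) the central value.]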
Histograms
Frequency histogram
A histogram is a graphical
representation of a frequency
distribution in which vertical
rectangular blocks are drawn so that:
(a) the centre of the base indicates
the central value of the class and
(b) the area of the rectangle
represents the class frequency.
For example, the measurement of the lengths of 50
brass rods gave the following frequency distribution:
This gives rise to the histogram:
A relative frequency histogram is identical in shape to
the frequency histogram but differs in that the vertical
axis measures relative frequency (as a percentage).
Measure of central tendency
Most of the values are clustered within the middle classes, so
knowledge of the centre region of the histogram is important.
We can put a numerical value on this by determining a measure
of central tendency.
There are three common measures of central tendency of a set
of observations:
1) Mean,
2) Mode,
3) Median.
Each is a single value that attempts to quantify the "average"
value around which the values in a data set tend to cluster.
Mean
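The arithmetic mean $\bar{x}$ of a set of n observations is
their average:
\[ \text{mean} = \frac{\text{sum of observations}}{\text{number of observations}} \quad \text{that is} \quad \bar{x} = \frac{\sum x}{n} \]
When calculating from a frequency distribution, this
becomes:
\[ \bar{x} = \frac{\sum xf}{n} = \frac{\sum xf}{\sum f} \]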
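Find the average of the data shown below:

x       f       xf
25      1       25
26      2       52
27      4       108
28      7       196
29      6       174
30      5       150
31      3       93
32      2       64
Total   30      862

\[ \bar{x} = \frac{\sum xf}{n} = \frac{\sum xf}{\sum f} = \frac{862}{30} = 28.73 \]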
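The same calculation as a short Python sketch, using the frequency table of the 30 readings (values 25 to 32). The variable names are illustrative.

```python
# Frequency table for the 30 readings.
values      = [25, 26, 27, 28, 29, 30, 31, 32]
frequencies = [ 1,  2,  4,  7,  6,  5,  3,  2]

# Sum of x*f and total frequency n.
sum_xf = sum(x * f for x, f in zip(values, frequencies))
n = sum(frequencies)

mean = sum_xf / n
print(f"sum xf = {sum_xf}, n = {n}, mean = {mean:.2f}")
```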
Coding for calculating the mean
A good deal of tedious work can be avoided by coding with
a false mean. It involves converting the x-values into
simpler values for the calculation and then converting
back again for the final result.
(a) Choose a convenient value of x near the middle of
the range (the false mean),
(b) Subtract it from every value of x,
(c) Divide by a suitable data interval to give the coded
value $x_c$,
(d) Proceed to find the mean of the coded values, $\bar{x}_c$.
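Find the average of the data shown below using the coding procedure.
[Table: x and f columns with the coded values, annotated with the
false mean, the data interval and steps (a) to (d).]
\[ \bar{x}_c = \frac{\sum x_c f}{\sum f} = \frac{-2.0}{60} = -0.0333 \text{ to 4 dp} \]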
Decoding for calculating the mean
Decoding requires the coding process to be reversed.
This means multiplying by the appropriate data interval
and then adding the false mean:
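\[ \bar{x}_c = \frac{\sum x_c f}{\sum f} = \frac{-2.0}{60} = -0.0333 \text{ to 4 dp, where } x_c = \frac{x - 30.8}{0.2} \]
Therefore:
\[ \bar{x} = (-0.0333) \times 0.2 + 30.8 = 30.79 \text{ to 2 dp} \]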
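Because the table for this example is not reproduced in these notes, the sketch below applies the same coding and decoding steps to the earlier frequency table of 30 readings instead. The false mean of 28 and the data interval of 1 are assumptions chosen for illustration.

```python
# Frequency table for the 30 readings used earlier in the lecture.
values      = [25, 26, 27, 28, 29, 30, 31, 32]
frequencies = [ 1,  2,  4,  7,  6,  5,  3,  2]

FALSE_MEAN = 28  # assumed: a convenient value near the middle of the range
INTERVAL = 1     # assumed: spacing between successive x-values

# Steps (b) and (c): subtract the false mean, divide by the interval.
coded = [(x - FALSE_MEAN) / INTERVAL for x in values]

# Step (d): mean of the coded values.
coded_mean = sum(xc * f for xc, f in zip(coded, frequencies)) / sum(frequencies)

# Decoding: multiply by the interval, then add the false mean.
decoded_mean = coded_mean * INTERVAL + FALSE_MEAN

print(f"coded mean = {coded_mean:.4f}, decoded mean = {decoded_mean:.2f}")
```

The decoded mean agrees with the direct calculation 862/30, as it should for any choice of false mean and interval.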
Coding with a grouped frequency distribution
This procedure is similar where the false mean is the
centre value of a convenient class.
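[Table: grouped frequency distribution with coded values; the column
totals are $\sum f = 50$ and $\sum x_c f = 11$.]
\[ \bar{x}_c = \frac{\sum x_c f}{\sum f} = \frac{11}{50} = 0.22 \]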
Decoding with a grouped frequency distribution
Decoding again requires the coding process to be reversed.
This means multiplying by the appropriate data interval and
then adding the false mean:
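\[ \bar{x}_c = \frac{\sum x_c f}{\sum f} = \frac{11}{50} = 0.22, \text{ where } x_c = \frac{x_m - 2.30}{0.03} \]
Therefore:
\[ \bar{x} = (0.22) \times 0.03 + 2.30 = 2.3066 \]
giving:
\[ \bar{x} = 2.307 \text{ to 3 dp} \]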
Mode of a set of data
The mode of a set of data is that value of the variable
that occurs most often.
The mode of:
2, 2, 6, 7, 7, 7, 10, 13
is clearly 7. The mode may not be unique, for instance
the modes of:
23, 25, 25, 25, 27, 27, 28, 28, 28
are 25 and 28.
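Both examples can be checked with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value that shares the highest frequency. The variable names are illustrative.

```python
from statistics import multimode

# A data set with a single mode.
single = multimode([2, 2, 6, 7, 7, 7, 10, 13])

# A data set with two modes (bimodal).
double = multimode([23, 25, 25, 25, 27, 27, 28, 28, 28])

print(single)
print(double)
```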
Mode of a grouped frequency distribution
The modal class of grouped data is the class with the
greatest frequency.
[Table: example grouped frequency distribution whose modal class
is the third class.]
The mode can also be found by plotting the histogram of the data
(please refer to the textbook, Stroud, pages 1155-1156).
Median of a set of data
The median is the value of the middle datum when the
data is arranged in ascending or descending order.
If there is an even number of values the median is the
average of the two middle data.
The data 4, 7, 8, 9, 12, 15, 26 has a median of 9.
The data 5, 6, 10, 12, 14, 17, 23, 30 has a median of 13,
because the two middle values are 12 and 14:
\[ \frac{12 + 14}{2} = 13 \]
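A quick check of both cases with `statistics.median` from the Python standard library, which sorts the data and averages the two middle values when the count is even. The variable names are illustrative.

```python
from statistics import median

# Odd number of values: the median is the middle datum.
odd_median = median([4, 7, 8, 9, 12, 15, 26])

# Even number of values: the median is the average of the two middle data.
even_median = median([5, 6, 10, 12, 14, 17, 23, 30])

print(odd_median, even_median)
```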
Median with grouped frequency distribution
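In the case of grouped data the median divides the area of the
histogram into two equal halves. The class containing the median
(here the largest block) is split into two parts A and B.
In this frequency distribution the class frequencies are
6, 12, 15, A + B, 13, 9, 5, with A + B = 20.
[Figure: histogram with the median class, from 30.85 to 31.15,
divided into parts A and B.]
Equal totals on each side of the median give:
\[ 6 + 12 + 15 + A = B + 13 + 9 + 5 \]
Note: A + B = 20, so B = 20 - A and therefore A = 7.
\[ \text{width of } A = \frac{7}{20} \times \text{class interval} = 0.35 \times 0.3 = 0.105 \]
Therefore, Median = 30.85 + 0.105
= 30.96 to 2 dp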
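The interpolation can be sketched as a small function. Only the median class boundaries (30.85 to 31.15) appear in the notes; the other lower boundaries below assume that every class has the same width of 0.3, and the name `grouped_median` is hypothetical.

```python
def grouped_median(frequencies, lower_boundaries, width):
    """Median of grouped data by linear interpolation within the median class.

    Assumes every class has a positive frequency and the same width.
    """
    half = sum(frequencies) / 2
    cumulative = 0
    for f, lower in zip(frequencies, lower_boundaries):
        if cumulative + f >= half:
            # Fraction of this class needed to reach half the total frequency.
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must not be empty")


frequencies = [6, 12, 15, 20, 13, 9, 5]
# Assumed lower boundaries, stepping 0.3 either side of 30.85.
lower_boundaries = [29.95, 30.25, 30.55, 30.85, 31.15, 31.45, 31.75]

result = grouped_median(frequencies, lower_boundaries, 0.3)
print(f"median = {result:.3f}")
```

With 80 observations in total, the function looks for the point where the cumulative frequency reaches 40, which falls 7/20 of the way through the fourth class.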
Mean, Mode, Median
How do we know which is the correct measure of location
to use in a given situation?
Mode: This is used when data is qualitative, or quantitative
with either a single mode or bimodal. It is not very
informative if each value occurs only once.
Median: This is used for quantitative data. It is usually used
when there are extreme values.
Mean: This is used for quantitative data and uses all the
pieces of data. It therefore gives a true measure
of the data. However, it is affected by extreme values.
A child at a junior school records the maximum temperature,
in °C, for seven days at his school. The results are given below
15.7 16.1 16.2 47.6 17.4 18.6 16.7
a) Find the mean and median of these data.
b) Why were you not asked about the mode?
The child's teacher realizes that the figure 47.6 should be
17.6.
c) Write down what effect this will have on the median and
mean.
Answers:
a) Mean 21.2; median 16.7
b) Each value occurs only once, so the mode is not informative.
c) The median is unchanged at 16.7; the mean falls to 16.9
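The effect of the single extreme value can be verified with a short sketch using the standard `statistics` module. The list names `recorded` and `corrected` are illustrative.

```python
from statistics import mean, median

# Maximum temperatures in degrees C as first recorded.
recorded = [15.7, 16.1, 16.2, 47.6, 17.4, 18.6, 16.7]

# The same data with the erroneous 47.6 replaced by 17.6.
corrected = [17.6 if t == 47.6 else t for t in recorded]

print(f"recorded:  mean = {mean(recorded):.1f}, median = {median(recorded)}")
print(f"corrected: mean = {mean(corrected):.1f}, median = {median(corrected)}")
```

The mean moves by several degrees while the median does not move at all, which is why the median is preferred when extreme values are present.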
A company consists of seven workers paid at $10 per hour
and their supervisor who is paid at $50 per hour.
a) Find the mode, median and mean of the hourly pay of all eight workers.
Write down, with a reason, which of the mean, mode and
median you should use in the following situations:
b) When asked the typical hourly rate of pay for the company.
c) When trying to persuade a prospective employee to work
for the company.
Answers:
a) Mode 10; median 10; mean 15
b) Mode or median
c) The mean, as it is a higher value and more
likely to persuade the prospective employee
Dispersion
The mean, mode and median give important information
about the central tendency of data but they tell us
nothing about the spread or dispersion about the centre.
For example,
the set 26, 27, 28, 29, 30 has a mean of 28,
and the set 5, 19, 20, 36, 60 also has a mean of 28,
but one is clearly more tightly arranged about the mean
than the other.
We therefore need a measure to indicate the spread of
the values about the mean.
Range
The simplest measure of dispersion is the range – the
difference between the highest and the lowest values.
In the previous two cases,
the range of set 1 is 30 - 26 = 4,
while that of set 2 is 60 - 5 = 55.
The disadvantage of the range, however, is that it deals
only with the extreme values; it does not take into account
the behaviour of the intermediate values.
Standard Deviation
The standard deviation is the most widely used measure of dispersion.
The variance of a set of data is the average of the square of the
difference in value of a datum from the mean:
\[ \text{variance} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n} \]
This has the disadvantage of being measured in the square of the units
of the data. The standard deviation is the square root of the variance:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
[Image source: http://www.scienceofrelationships.com/home/tag/love-letter]
Standard Deviation (Alternative formula)
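Since:
\[
\begin{aligned}
\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}
  &= \frac{\sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{x} + \bar{x}^2)}{n} \\
  &= \frac{\sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2}{n} \\
  &= \frac{\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2}{n} \\
  &= \frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2
\end{aligned}
\]
That is:
\[ \sigma = \sqrt{\overline{x^2} - \bar{x}^2} \]
where $\overline{x^2}$ is the mean of the squared values.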
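As a sanity check, the definition and the alternative formula can be compared numerically on the two sets from the Dispersion example (both have mean 28). The function names are illustrative; both divide by n, matching the formulas in these notes.

```python
from math import isclose, sqrt


def std_dev_definition(xs):
    """Square root of the mean squared deviation from the mean."""
    n = len(xs)
    m = sum(xs) / n
    return sqrt(sum((x - m) ** 2 for x in xs) / n)


def std_dev_shortcut(xs):
    """Square root of (mean of the squares minus square of the mean)."""
    n = len(xs)
    m = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    return sqrt(mean_of_squares - m * m)


tight = [26, 27, 28, 29, 30]
spread = [5, 19, 20, 36, 60]

for name, xs in (("tight", tight), ("spread", spread)):
    print(name, max(xs) - min(xs), round(std_dev_definition(xs), 2),
          round(std_dev_shortcut(xs), 2))
```

The two functions agree to within rounding error. In floating-point arithmetic the shortcut form can lose accuracy when the mean is very large compared with the spread, so the definition is the safer choice for such data.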
A question to try
Find the mean and standard deviation for the following:
a. 1, 2, 3, 4, 5 and 6
b. 1001, 1002, 1003, 1004, 1005 and 1006
c. 0.1, 0.2, 0.3, 0.4, 0.5 and 0.6
Answers (mean, standard deviation):
a. 3.5, 1.71; b. 1003.5, 1.71; c. 0.35, 0.17.
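The three answers can be checked with `statistics.pstdev`, which divides by n as in the formulas of this lecture. Sets b and c are built from set a here to make the pattern visible: adding a constant leaves the standard deviation unchanged, while dividing by 10 divides it by 10.

```python
from statistics import mean, pstdev

a = [1, 2, 3, 4, 5, 6]
b = [x + 1000 for x in a]  # 1001 ... 1006: set a shifted by 1000
c = [x / 10 for x in a]    # 0.1 ... 0.6: set a scaled by 1/10

results = {name: (mean(xs), pstdev(xs)) for name, xs in (("a", a), ("b", b), ("c", c))}

for name, (m, s) in results.items():
    print(f"{name}: mean = {m:.2f}, standard deviation = {s:.2f}")
```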
Frequency Polygons and Frequency Curves
If the centre points of the tops of the rectangular blocks of a
frequency histogram are joined by straight lines, the resulting
figure is called a frequency polygon.
If the frequency polygon is smoothed out, or if we plot the
frequency against the central value of each class and draw
a smooth curve, the result is a frequency curve.
The area A under the curve represents the total frequency of
the variable.
Normal Distribution Curve
When very large numbers of observations are made and
the range is divided into a very large number of ‘narrow’
classes, the resulting frequency curve, in many cases,
approximates closely to a standard curve known as the
normal distribution curve, which has a characteristic
bell-shaped formation.
The normal distribution curve is symmetrical about its centre line,
which coincides with the mean of the observations, so the area to the
left of the mean equals the area to the right: $A_L = A_R$.
Values within 1 standard deviation of the mean
There are two points on the normal distribution curve where the concavity
switches, one from concave to convex and the other from convex to concave.
The horizontal distance of each of these two points from the mean line is one
standard deviation.
Of the area beneath the normal distribution curve, 68% lies within
one standard deviation from the mean.
On a manufacturing run to produce 1000 bolts of
nominal length 32.5 mm, sampling gave a mean of
32.58 mm and a standard deviation of 0.06 mm.
From this observation, $\bar{x} = 32.58$ mm and $\sigma = 0.06$ mm.
\[ \bar{x} - \sigma = 32.58 - 0.06 = 32.52 \qquad \bar{x} + \sigma = 32.58 + 0.06 = 32.64 \]
We conclude that 68% of the bolts, i.e. 680, are likely to have
lengths between 32.52 mm and 32.64 mm.
Values within 2 standard deviations of the mean
Of the area beneath the normal distribution curve, 95% lies within
two standard deviations from the mean.
From the example, $\bar{x} = 32.58$ mm and $\sigma = 0.06$ mm.
\[ \bar{x} - 2\sigma = 32.58 - 0.12 = 32.46 \qquad \bar{x} + 2\sigma = 32.58 + 0.12 = 32.70 \]
We conclude that 95% of the bolts, i.e. 950, are likely to have
lengths between 32.46 mm and 32.70 mm.
Values within 3 standard deviations of the mean
Of the area beneath the normal distribution curve, 99.7% lies within
three standard deviations from the mean.
From the example, $\bar{x} = 32.58$ mm and $\sigma = 0.06$ mm.
\[ \bar{x} - 3\sigma = 32.58 - 0.18 = 32.40 \qquad \bar{x} + 3\sigma = 32.58 + 0.18 = 32.76 \]
We conclude that 99.7% of the bolts, i.e. 997, are likely to have
lengths between 32.40 mm and 32.76 mm.
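The 68%, 95% and 99.7% figures are rounded values of the area under the normal curve within 1, 2 and 3 standard deviations. For an exactly normal distribution that area is erf(k/√2), which the sketch below evaluates with the standard `math` module and applies to the bolt example. The helper name `area_within` is hypothetical.

```python
from math import erf, sqrt


def area_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


mean_length = 32.58  # mm, sample mean from the example
sigma = 0.06         # mm, standard deviation from the example
bolts = 1000

for k in (1, 2, 3):
    low = mean_length - k * sigma
    high = mean_length + k * sigma
    share = area_within(k)
    print(f"{k} sigma: {low:.2f} mm to {high:.2f} mm, "
          f"{share:.2%} of bolts, about {share * bolts:.0f}")
```

The more precise areas are about 68.27%, 95.45% and 99.73%, so the expected counts differ slightly from the rounded 680, 950 and 997 quoted above.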
We can enter the same information in a slightly different
manner, dividing the figure into columns of 1σ width on
each side of the mean.
Standardized Normal Curve
The standardized normal curve is the same shape as the
normal curve but the axis of symmetry is the vertical axis;
the horizontal axis carries a scale of z-values where:
\[ z = \frac{x - \bar{x}}{\sigma} \]
and the area beneath the curve is 1. Its equation is:
\[ \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \]
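A minimal sketch of the two formulas, applied to the bolt example (mean 32.58 mm, standard deviation 0.06 mm). The function names `z_score` and `phi` are illustrative.

```python
from math import exp, pi, sqrt


def z_score(x, mean, sigma):
    """Convert a measurement x to a z-value on the standardized scale."""
    return (x - mean) / sigma


def phi(z):
    """Height of the standardized normal curve at z."""
    return exp(-z * z / 2) / sqrt(2 * pi)


# A bolt of length 32.64 mm lies one standard deviation above the mean.
z = z_score(32.64, 32.58, 0.06)
print(f"z = {z:.2f}, phi(z) = {phi(z):.4f}, phi(0) = {phi(0):.4f}")
```

The curve is symmetric about z = 0, so `phi(z)` and `phi(-z)` are always equal, and its peak value is 1/√(2π), about 0.3989.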
A question to try
A computer operator transfers an hourly wage list from a paper
copy to her computer. The data transferred is given below
$5.50 $6.10 $7.80 $6.10 $9.20 $91.00 $11.30
a) Find the mode, median and range of these data.
b) Find the mean and the standard deviation of these data.
The office manager looks at the figures and decides that
something must be wrong.
c) Write down, with a reason, the mistake that has probably been
made.
d) Recalculate the mean, range and standard deviation with the
corrected data and compare both sets of results.
Answers:
a), b) Mode 6.1; median 7.8; range 85.5; mean 19.57; std dev 29.22
c) $91.00 is far higher than every other wage; it was probably
entered instead of $9.10.
d) Mean 7.87; range 5.8; std dev 1.96
This lecture note is taken from Dr. Abdul Kareem