Session 1 DEN1015H 2013 Lecture Notes

DEN 1015H LECTURE NOTES Session 1

INTRODUCTION

The field of Statistics is the subject that deals with the collection, analysis and
interpretation of numerical information. Statistics is applied to the planning and
analysis of research studies in virtually all subject areas. Biostatistics is a branch
of statistics that deals primarily with the health sciences and the biological
sciences.

Statistical and biostatistical methods are used for summarizing and organizing data
to allow efficient reporting and presentation of the results and also to provide
simple interpretations of the findings. This lecture gives many examples of the
graphical and tabular techniques of descriptive statistics that are used for
describing and summarizing data. The next lecture introduces concepts of
probability. Probability models describe the variability in data and also form the
logical basis of statistical inference, by which conclusions are drawn from data.
The remainder of the course will focus on several methods of statistical inference
that have important applications in biomedical research. Many of the examples
used to illustrate the concepts involve data from recent dental research.

DESCRIPTIVE STATISTICS

1. Defining & Summarizing Data

“Raw” data are observations derived from people, lab animals, lab specimens,
hospitals, etc.

The number of individuals (or animals, specimens, cells, hospitals, etc) is called
the sample size.

A variable relates to anything that is measured, recorded or manipulated in a
study.

In a data set the variables are the columns and the observations are the rows (Fig.
1).

© Dr. Herenia P. Lawrence
Implant Subject Type Irradiation Graft Failure
1 1 1 1 1 1
2 1 2 1 1 0
3 2 1 1 1 1
4 2 2 1 1 1
5 3 1 1 1 1
6 3 2 1 1 1
7 4 1 1 0 0
8 4 2 1 0 0
9 5 1 1 0 1
10 5 2 1 0 1
11 6 1 1 0 1
12 6 2 1 0 1
13 7 1 0 1 1
14 7 2 0 1 0
15 8 1 0 1 0
16 8 2 0 1 0
17 8 2 0 1 0
18 9 1 0 1 1
19 9 2 0 1 1
20 10 1 0 0 0
21 10 2 0 0 0
22 11 1 0 0 0
23 11 2 0 0 0
24 12 1 0 0 0
25 12 2 0 0 0
‘Type’: 1 = maxilla; 2 = mandible. For ‘irradiation’, ‘graft’, and
‘failure’: 1 = yes; 0 = no.

Fig. 1. Hypothetical data set from a study investigating the failure rate of 25
implants placed in grafted and non-grafted mandibular/maxillary bone in 6
irradiated and 6 non-irradiated subjects.

TYPES OF VARIABLES

The first step, before performing statistical tests, is to decide what type of variables
(or data) one is dealing with, since different statistical analyses are needed for each
type of variable. Nominal and ordinal variables consist of counts in categories and
are analyzed using “non-parametric” statistics. Interval and ratio variables consist
of actual quantitative measurements and are analyzed using “parametric” statistics.

There are a number of typologies, but one that has proven useful is given in Table
1 and Fig. 2. The basic distinction is between quantitative data (for which one asks
“how much?”) and qualitative or categorical data (for which one asks “what
type?”). For a review, see suggested readings for today’s class.

Table 1. Examples of types of variables (data)


Qualitative (or categorical): A categorical variable records the category to which an
observation belongs. Numerals (1, 2, 3, etc.) are usually used to record the category
membership but these are not to be confused as having any meaning as numerical values.
If the data could be recorded using letters (A, B, C, etc.) without losing information, then
we have a categorical variable.
a) Nominal (unordered categories, e.g. Caucasian/Black/Hispanic)
Subtype: Binary/Dichotomous (e.g. yes vs. no, dead vs. alive, male vs. female,
treatment vs. control). For purposes of statistical analysis such variables are
usually numerically coded using the numerals 0 and 1.
b) Ordinal (ordered categories, e.g. categories of pain severity: none, mild,
moderate, and severe; stage I, II or III cancer; a Likert scale – strongly
disagree, disagree, neutral, agree, strongly agree).
Please note: you should not use mean scores for these!!!
c) Ranked (e.g. ten leading causes of failure of dental implants, which have been
arranged from the cause that resulted in the greatest number of failures to the
cause that resulted in the fewest. These causes were then assigned consecutive
integers that correspond to their place in the sequence).

Quantitative: For a quantitative variable, differences between possible values have
meaning independent of the values themselves. For example, a difference between a
value of 0 and a value of 1 is comparable to a difference between a value of 1 and a value
of 2.
a) Discrete (integer or whole numbers, e.g. DMF score; number of teeth; number
of children).
b) Continuous
These have two subtypes:
Interval scale – no natural zero (e.g. IQ, degree Celsius, probing pocket depth,
clinical attachment level)
Ratio scale – has a natural zero (e.g. length in metres, salivary flow rate, age,
pulse rate, vital capacity).

OK to compute...                                        Nominal  Ordinal  Interval  Ratio
frequency distribution                                  Yes      Yes      Yes       Yes
median and percentiles                                  No       Yes      Yes       Yes
add or subtract                                         No       No       Yes       Yes
mean, standard deviation, standard error of the mean    No       No       Yes       Yes
ratio, or coefficient of variation                      No       No       No        Yes

In general, the amount of information increases as one goes from nominal to ratio
variables. Classifying interval measures into large categories is akin to throwing
away data.
There are other ways of defining types of variables. In an experiment, the
independent variables are those that are varied by and under the control of the
experimenter; the dependent variables are those that respond to experimental
manipulation. For example, in a clinical trial to determine the effect of periodontal
therapies on attachment gain, the independent variable is the type of therapy and
the dependent variable is the gain in attachment measured in millimeters.
Dependent variables should be clinically important and related to the independent
variables.

Data Types

Qualitative
    Categorical/Nominal [sex (dichotomous), marital status]
    Categorical/Ordinal (stage of cancer, pain rating, Likert scale)
Quantitative
    Discrete (# teeth)
    Continuous
        Interval (36º-38º C, probing pocket depth)
        Ratio (age, pulse rate, vital capacity, VAS)
Dependent (periapical lesion)
Independent (CHX irrigation versus saline)

Fig. 2. Types of Variables

FREQUENCIES, FREQUENCY DISTRIBUTIONS AND GRAPHS
The next step in the process of analyzing data is to describe the data by using a
frequency distribution to reflect the probability of the occurrence of an event. A
frequency distribution consists of a set of frequencies for all possibilities.

Frequencies for categorical variables


Summarizing categorical variables is straightforward, the main task being to count
the number of observations in each category. These counts are called frequencies.
They are often also presented as relative frequencies; that is, as proportions or
percentages of the total number of individuals (sample size).

Example: Type of malocclusion in a sample of 200 schoolchildren (WHO criteria
of malocclusion)

Type of Malocclusion    Frequency    Relative Frequency (%)
Normal                  100          50.0
Moderate                61           30.5
Severe                  39           19.5
Total                   200          100.0

Frequency (or count): the number of people belonging to a category, e.g. number
of students who present moderate malocclusion = 61.
Relative frequency: the proportion of people belonging to a category (frequency
divided by the total sample size). It can be expressed as a
percentage, e.g. the percentage of pupils who have a moderate
degree of malocclusion = 61/200 × 100 = 30.5%.
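
The counting described above can be sketched in a few lines of Python. This is only an illustration using the malocclusion counts from the table; the names `observations` and `frequency_table` are assumptions made for the example and do not come from the course material.

```python
from collections import Counter

# Rebuild the raw observations from the counts in the malocclusion table.
observations = ["Normal"] * 100 + ["Moderate"] * 61 + ["Severe"] * 39


def frequency_table(values):
    """Return {category: (frequency, relative frequency in %)}."""
    counts = Counter(values)
    n = len(values)  # sample size
    return {category: (count, 100 * count / n) for category, count in counts.items()}


table = frequency_table(observations)
for category, (frequency, relative) in table.items():
    print(f"{category:<10}{frequency:>5}{relative:>8.1f}")
```

Each relative frequency is simply the count divided by the sample size and multiplied by 100, so the percentages always add up to 100.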

Bar Diagram & Pie Chart


Bar and pie charts are popular types of graphs used to display frequencies and
relative frequencies.

[Bar chart: number of schoolchildren (axis 0 to 120) by type of malocclusion.
Normal occlusion 100; moderate malocclusion 61; severe malocclusion 39.]

[Pie chart: normal occlusion 50%; moderate malocclusion 30.5%; severe
malocclusion 19.5%.]

Frequency distributions for quantitative variables

Example: DMF scores for a group of 50 8-year-old children.

DMF score    Frequency            Relative         Cumulative    Cumulative Relative
             (no. of children)    Frequency (%)    Frequency     Frequency (%)
0            4                    8                4             8
1            2                    4                6             12
2            3                    6                9             18
3            6                    12               15            30
4            7                    14               22            44
5            10                   20               32            64
6            9                    18               41            82
7            3                    6                44            88
8            1                    2                45            90
9            1                    2                46            92
10           4                    8                50            100
Total        50                   100

Cumulative Frequency: the number of people in the sample with values less than
or equal to a specified value, e.g. 32 of the children have
a DMF score of 5 or less.

Relative Frequency (%): the number of people in the sample taking each value
divided by the total number of people studied and
multiplied by 100, i.e., the proportion of the total sample
in each category. This proportion can be interpreted as
the probability that an individual chosen at random from
the original sample may fall in a particular category or
within a range of categories.

Cumulative Relative Frequency (%): the percentage of people in the sample with
values less than or equal to a specified value, e.g. 64% of the sample
have a DMF score of 5 or less.
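
As an illustrative sketch (not part of the original notes), the columns of the DMF table can be reproduced from the raw frequencies alone. The list `frequencies` holds the counts for DMF scores 0 to 10 taken from the table; the other variable names are assumptions chosen for the example.

```python
from itertools import accumulate

# Frequencies of DMF scores 0..10 from the table (50 children in total).
frequencies = [4, 2, 3, 6, 7, 10, 9, 3, 1, 1, 4]

n = sum(frequencies)                                      # sample size
relative = [100 * f / n for f in frequencies]             # relative frequency (%)
cumulative = list(accumulate(frequencies))                # cumulative frequency
cumulative_relative = [100 * c / n for c in cumulative]   # cumulative relative frequency (%)

for score, row in enumerate(zip(frequencies, relative, cumulative, cumulative_relative)):
    print(score, *row)
```

The cumulative columns are running totals, so the last entry of `cumulative` equals the sample size and the last entry of `cumulative_relative` equals 100.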

Histograms
The histogram is an appropriate method for depicting a frequency distribution for
discrete or continuous data. Values are grouped into intervals, generally of equal
size. These intervals are then represented by bars with (if intervals have equal
width) heights proportional to the frequency of observations contained within
them.

Example: Histogram with equal class intervals

The following frequency distribution shows the age at onset of edentia in a sample
of 200 edentulous persons.

Age at the last birthday Frequency


10-19 8
20-29 56
30-39 70
40-49 43
50-59 18
60-69 4
70-79 1
Total 200

The class intervals are all ten years.

[Histogram: frequency (y-axis, 0 to 80) by age in years (x-axis), one bar per
ten-year class from 10-19 yr to 70-79 yr; bar heights 8, 56, 70, 43, 18, 4, 1.]

Example: Histogram with unequal class intervals

The following data are abstracted from a paper on the age at onset of edentia in
another sample of edentulous persons. The percentage distribution is shown.

Age at the       Class interval    % of total
last birthday    (years)
11-15            5                 0.5
16-19            4                 3.5
20-24            5                 10.5
25-29            5                 17.5
30-34            5                 20.0
35-44            10                29.0
45-54            10                14.0
55-74            20                5.0*
Total                              100.0

*5% divided by 20 and multiplied by 5 equals 1.25

[Histogram: % per 5 years of age (y-axis, 0 to 25) by age in years (x-axis), with
all classes standardized to 5-year intervals; bar heights 0.5, 4.4, 10.5, 17.5, 20,
14.5, 14.5, 7, 7, 1.25, 1.25, 1.25, 1.25.]

To standardize the data into equal intervals of 5 years, you will need to divide the
relative frequency by the width of the interval and then multiply it by 5 (see
example above*).
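
The standardization rule above can be sketched as follows. This is a minimal illustration using the class widths and percentages from the table of unequal class intervals; the names `classes`, `STANDARD_WIDTH` and `standardized_height` are assumptions introduced for the example.

```python
# (class label, class width in years, % of total) from the table of unequal intervals.
classes = [
    ("11-15", 5, 0.5), ("16-19", 4, 3.5), ("20-24", 5, 10.5), ("25-29", 5, 17.5),
    ("30-34", 5, 20.0), ("35-44", 10, 29.0), ("45-54", 10, 14.0), ("55-74", 20, 5.0),
]

STANDARD_WIDTH = 5  # years


def standardized_height(percent, width, standard_width=STANDARD_WIDTH):
    """Relative frequency per `standard_width` years of age."""
    return percent / width * standard_width


heights = {label: standardized_height(pct, width) for label, width, pct in classes}
for label, height in heights.items():
    print(f"{label:<7}{height:6.2f}")
```

Classes that are already 5 years wide keep their original percentage; wider classes are scaled down so that the area of each bar, not its height, represents the relative frequency.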

Frequency Polygon
The frequency polygon is similar to the histogram. It is constructed by placing a
point at the center of each interval of the histogram such that the height of the point
is equal to the frequency or relative frequency associated with that interval.

[Histogram: number of hours worked per week as reported by dentists; percentage
(y-axis, 0 to 60) by number of hours (x-axis, classes 1 to 10 through 51 to 60).]

[Frequency polygon: number of hours worked per week as reported by dentists;
percentage (y-axis, 0 to 60) by number of hours (x-axis, classes 1 to 10 through
51 to 60).]

Stem-and-leaf plot
It resembles a histogram, with the first digit(s) of each datum along the “stem” and
the last digit(s) forming the “leaves”.

Urinary concentration of lead in 15 children from housing estate (μmol/24hr)


0.6, 0.1, 1.1, 0.4, 2.6, 2.0, 0.8, 1.3, 1.2, 1.5, 3.2, 1.7, 1.9, 1.9, 2.2

a) Stem-and-leaf “as they come”

Stem Leaf

0 6 1 4 8
1 1 3 2 5 7 9 9
2 6 0 2
3 2

We then order the leaves, as in “b)”

b) Ordered stem-and-leaf plot

Stem Leaf

0 1 4 6 8
1 1 2 3 5 7 9 9
2 0 2 6
3 2
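
The ordered plot in b) can be reproduced with a short Python sketch. It assumes, as in this example, positive values recorded to one decimal place, so that the units digit is the stem and the first decimal is the leaf; the function name `stem_and_leaf` is an assumption made for the illustration.

```python
from collections import defaultdict

lead = [0.6, 0.1, 1.1, 0.4, 2.6, 2.0, 0.8, 1.3, 1.2, 1.5, 3.2, 1.7, 1.9, 1.9, 2.2]


def stem_and_leaf(values):
    """Map each stem (units digit) to its ordered leaves (first decimal digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)       # e.g. 2.6 -> 26; rounding avoids float truncation
        stem, leaf = divmod(tenths, 10)  # 26 -> stem 2, leaf 6
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(lead)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Sorting the values first is what turns the “as they come” plot into the ordered one.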

Numerical data can be further summarized by measures that describe where the
center of the distribution lies – mean, median, and mode – and measures of how
wide the distribution is – range, percentile, and standard deviation.

MEASURES OF CENTRAL TENDENCY

MEAN the sum of all observations divided by the total number of
observations (also called the sample mean or arithmetic mean), as follows:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$\bar{x}$ is pronounced “x-bar,” where
$x_i$ = the value for the ith subject in the sample
$\Sigma$ = the Greek capital letter sigma, indicating a summation over all $x_i$’s
n = the number of subjects in the sample (also called the sample size).

MEDIAN the middle value of the distribution, i.e., the value for which 50% of
the sample have values less than or equal to the median and 50% have
values greater than or equal to the median (50th percentile). It is
calculated by rank ordering (from lowest to highest) the values and
then determining the value corresponding to the middle rank, i.e., the
rank order (n+1)/2. Thus, if the sample contains an odd number of
subjects, the median will be the value of the subject with the middle
rank. If the sample contains an even number of subjects, the median
value will fall half-way between the values of the two midmost
subjects.

MODE the most common single value, i.e., the peak of the frequency
distribution. A distribution with two or more modes is referred to as
bimodal, trimodal, etc.

Example: Number of teeth in a sample of 10 babies

6, 6, 6, 6, 7, 8, 8, 10, 10, 12

Mean = (6 + 6 + 6 + 6 + 7 + 8 + 8 + 10 + 10 + 12)/10 = 79/10 = 7.9 teeth

Median = 7.5 teeth [the average of the (n/2)th and (n/2 + 1)th observations if n is
even].

Mode = 6 teeth

N.B.: The mode is seldom used. If the sample is small, either it may not be
possible to estimate the mode (e.g. when all the values are different), or the
estimate obtained may be misleading.
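
The three measures for the teeth example can be checked with Python's standard `statistics` module. This is a small sketch for verification only; the variable names are assumptions made for the example.

```python
import statistics

teeth = [6, 6, 6, 6, 7, 8, 8, 10, 10, 12]

mean = statistics.mean(teeth)        # sum of the values divided by n
median = statistics.median(teeth)    # average of the 5th and 6th ordered values (n is even)
modes = statistics.multimode(teeth)  # list of all most common values

print(mean, median, modes)
```

`multimode` returns a list, which makes bimodal or trimodal samples visible instead of silently reporting only one of the modes.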

Properties of the mean and median:

1. If the sample is shifted by a constant c, i.e., c is added to all data values, then
the mean and median are also changed by this same amount.

Example: If a measuring device is not calibrated correctly so that every
measurement is out by c units then the mean of any sample of values will
also be out by c units.

2. If the sample is rescaled by a constant c, i.e., all data values are multiplied
by c, then the mean and median are also multiplied by c.

Example: If the units of serum glucose are changed from mg/dl to g/l
(multiply by 10/1000=1/100), then the mean is changed from 110.1 mg/dl to
1.101 g/l.

Comparison of the mean and the median:

1. The mean is substantially influenced by unusual values (outliers), so it is
most suitable for distributions that are roughly symmetrical. If unusually
large or small values (outliers) can arise, the median will be less influenced
by these.

2. In some cases, the mean must be used for interpretative reasons since it takes
into account each individual observation.

Example: In a needs assessment survey, the mean number of decayed teeth
per child in a sample of children is more relevant than the median for
inferring the total needs of the population.

Possible distributions of data values
Symmetric vs. asymmetric (skewed) distributions

Two asymmetric curves, one with positive skew (Curve A - skewed to the right)
and one with negative (Curve B - skewed to the left) skew.

Symmetrical and bell-shaped distributions differing in terms of kurtosis.

Bimodal

The mean, median, and mode in a symmetric distribution

The mean, median, and mode in a skewed distribution

Histogram of highly skewed data. Note the outlier = 43.

Data Transformations

Logarithmic transformation (natural logarithms, i.e., logarithms to base e = ln, or
logarithms to base 10 = log; only with positive values)
For positively skewed distributions

Square or cubic transformations
For negatively skewed distributions

Example: The following are the number of days spent in hospital by 17 subjects
following an operation:

3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42

[One-way scatter plots: a) raw data, axis 0 to 50; b) ln data, axis 0 to 5.]

Fig. 1. One-way scatter plots of length of hospital stay showing a) raw data and b)
data on a logarithmic scale.

The Geometric Mean

Consider the 17 observations of the duration of stay in hospital plotted on a
one-way scatter plot in Figure 1a above. The distribution is skewed to the right with a
few rather large observations. Because of this skewness, the mean duration (14.65)
would not be a satisfactory measure of the central value; the median (10) would be
more useful. Figure 1b shows a one-way scatter plot of the logarithms of the
observations and now the distribution is more symmetric. The mean log duration
(2.41) is therefore a satisfactory measure of the central value of the distribution of
log duration. The anti-logarithm of this mean (antilog 2.41 = exp(2.41) = 11.13),
known as the geometric mean, is a better measure of the central value of the
distribution of duration than the original mean. In fact the geometric mean is
usually close to the value of the median.

Thus, the geometric mean can be calculated by the following three steps:

1. take the logarithm of all data values;

2. calculate the sample mean of the log data values;

3. the geometric mean is the anti-logarithm of the sample mean found in step 2.

antilog of $\log_{10} x$ = $10^{\log_{10} x}$

antilog of $\ln x$ = $e^{\ln x}$

The geometric mean is used only with data which are heavily positively skewed.
Examples of variables for which logarithms and geometric means are sometimes
useful include concentrations and bacterial counts.
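
The three steps for the geometric mean can be sketched in Python using the hospital-stay data from the example above. Natural logarithms are used here; the variable names are assumptions made for the illustration.

```python
import math
import statistics

days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

logs = [math.log(d) for d in days]   # step 1: take the (natural) logarithm of every value
mean_log = statistics.mean(logs)     # step 2: sample mean of the log values
geometric_mean = math.exp(mean_log)  # step 3: anti-logarithm of that mean

print(round(statistics.mean(days), 2), round(mean_log, 2), round(geometric_mean, 2))
```

The same result is obtained with base-10 logarithms as long as the matching anti-logarithm (a power of 10) is used in step 3. Python 3.8 and later also provide `statistics.geometric_mean`, which should agree with the value computed here.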

MEASURES OF DISPERSION or VARIATION, i.e., STATISTICS USED
TO DESCRIBE THE SPREAD OF A DISTRIBUTION

Range: the interval between the lowest and highest value in the distribution.

Percentile ranges
A percentile range is an interval between two specified percentile points, e.g. the
interquartile range includes the values between the 25th (Q25) and 75th (Q75)
percentiles; the median is equivalent to the 50th (Q50) percentile point, since
(n+1)/2 = (n+1) × 50/100. For a distribution with a large number of observations
the quartiles are most easily found from the cumulative relative frequency by
reading off the values that correspond to 25%, 50%, and 75% (see the cumulative
distribution plot below, which illustrates the DMF score data given earlier).

[Cumulative distribution plot: cumulative relative frequency in % (y-axis, 0 to 120)
by DMF score (x-axis, 0 to 10).]

A box plot (or box and whiskers plot) displays the 1st and 3rd quartiles as a box with
the median at the centre. Lines are drawn from the box to the extreme
observations (although values lying too far from the box are sometimes identified
separately as outliers). Like the histogram, the box plot is useful for checking the
symmetry of a distribution (i.e., whether it has the same shape on either side of the
median).

Example: Using the data from the previous example, i.e., the number of days spent
in hospital by 17 subjects following an operation:

3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42

The smallest value = 3
The lower quartile, Q25 = (n+1) × 0.25 = 4.5th value of the ordered observations,
i.e., the average of the 4th and 5th values = (6 + 8)/2 = 7
The median, Q50 = 10 [(17+1)/2 = the 9th value]
The upper quartile, Q75 = (n+1) × 0.75 = 13.5th value of the ordered
observations, i.e., the average of the 13th and 14th values = (17 + 25)/2 = 21
The largest value = 42

[Box plot of days in hospital (axis 0 to 50), marking Min, Q25, Q50, Q75 and Max.]
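
The five-number summary above can be reproduced with a short sketch that applies the (n+1) × p rule used in these notes, interpolating between neighbouring ordered values when the position is not a whole number. The function name `percentile_point` is an assumption made for the example; statistical packages may use slightly different quantile rules and so give slightly different quartiles.

```python
def percentile_point(values, p):
    """Value at position (n + 1) * p in the ordered sample (positions start at 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p
    lower = int(position)            # whole part of the position
    fraction = position - lower      # fractional part, used for interpolation
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


days = [3, 4, 4, 6, 8, 8, 8, 10, 10, 12, 14, 14, 17, 25, 27, 37, 42]

summary = {
    "min": min(days),
    "Q25": percentile_point(days, 0.25),
    "Q50": percentile_point(days, 0.50),
    "Q75": percentile_point(days, 0.75),
    "max": max(days),
}
print(summary)
```

The interquartile range follows directly as `summary["Q75"] - summary["Q25"]`.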

Another important use of the box plot is for comparing distributions. For example,
the figure below displays box plots of changes in number of decayed, missing, or
filled surfaces (DMFS) in a clinical trial to compare caries-preventive effects of
various chewing gums. The plots are on the same scale, allowing easy comparison
between treatment groups.

Fig. Box plots of change in DMFS by treatment group in a chewing gum study.

Sample Variance: the average squared ‘distance’/difference of each observation
from the mean. By squaring the difference, all terms will be positive (see your
algebra notes).

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

$x_i$ = the value for the ith subject in the sample
$\bar{x}$ = the sample mean
$\Sigma$ = the Greek letter sigma, indicating a summation over all $x_i$’s
n = the sample size

The quantity n-1 is called the number of degrees of freedom of the variance. The
formula uses n-1 because the variance of the sample calculated in this way better
approximates the variance of its target population.

Sample Standard Deviation (abbreviated SD, s.d., or s): the square root of the
variance. The standard deviation will have the same unit of measurement as the
original data. The smaller the standard deviation, the less each score varies from
the mean. The larger the spread of scores, the larger the SD becomes.
Algebraically, the formula looks like this:

$$\text{s.d. (or } s\text{)} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Example 1: The number of teeth in a sample of 9 babies aged 9 months are:

5, 6, 6, 7, 8, 8, 10, 10, 12

The range is 12 - 5 = 7

The sample mean $\bar{x}$ = (5 + 6 + 6 + 7 + 8 + 8 + 10 + 10 + 12)/9 = 72/9 = 8 teeth

N.B.: Many calculators have built-in functions for the mean and standard
deviation. The keys are commonly labeled $\bar{x}$ and $\sigma_{n-1}$, respectively, where
$\sigma$ is the lower case Greek letter sigma.

Calculation of the Variance and Standard Deviation

Observation    Value ($x_i$)    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
1              5                -3                 9
2              6                -2                 4
3              6                -2                 4
4              7                -1                 1
5              8                0                  0
6              8                0                  0
7              10               2                  4
8              10               2                  4
9              12               4                  16
Sum            72               0                  42

The variance is 42/8 = 5.25 teeth²
The standard deviation = square root of the variance = √5.25 = 2.29 teeth
Alternative formula for the variance

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}$$

Example 2: Variance calculation using the second formula

       $x_i$    $x_i^2$
       5        25
       6        36
       6        36
       7        49
       8        64
       8        64
       10       100
       10       100
       12       144
Sum    72       618

$\sum x_i^2$ = 618
$(\sum x_i)^2 / n$ = 72²/9 = 576
Thus, variance = (618 - 576)/(9 - 1) = 5.25 teeth²
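
Both variance formulas can be checked against each other with a few lines of Python. This is only a verification sketch for the teeth example; the variable names are assumptions made for the illustration.

```python
import math

teeth = [5, 6, 6, 7, 8, 8, 10, 10, 12]
n = len(teeth)
mean = sum(teeth) / n

# First formula: sum of squared deviations from the mean, divided by n - 1.
variance_definition = sum((x - mean) ** 2 for x in teeth) / (n - 1)

# Second formula: (sum of squares - (sum of values)^2 / n), divided by n - 1.
variance_shortcut = (sum(x * x for x in teeth) - sum(teeth) ** 2 / n) / (n - 1)

sd = math.sqrt(variance_definition)
print(variance_definition, variance_shortcut, round(sd, 2))
```

The two formulas are algebraically identical; the second avoids computing each deviation and is convenient for hand calculation.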

Properties of the standard deviation and inter-quartile range:

1. If the sample is translated by a constant c, i.e., c is added to all data values,
then the standard deviation and inter-quartile range are not changed.

Example: If a measuring device is not calibrated correctly then the
variability of the values will be unaffected, unlike the mean.

2. If the sample is rescaled by a constant c, i.e., all data values are multiplied
by c, then the standard deviation and inter-quartile range are also multiplied
by c.

Example: In mg/dl, serum glucose has a mean of 110.1 mg/dl and a standard
deviation of 30.0 mg/dl; in g/l, the mean is 1.101 g/l with a standard
deviation of 0.3 g/l.

Comparison of measures of spread:

1. The inter-quartile range is much less influenced by unusually large or small
observations than the standard deviation. In cases when such observations
can arise, summarizing the data using the median and inter-quartile range is
preferable to the mean and standard deviation.

2. For some measurements the coefficient of variation (the standard deviation
divided by the mean) is the most meaningful summary measure. The
coefficient of variation is denoted by CV, and so we have CV = $s/\bar{x}$. The CV
expresses the standard deviation as a percentage of the mean, i.e., CV = 100 ×
($s/\bar{x}$)%. The coefficient of variation is a unitless measure of spread. For
example, for serum glucose, we have CV = 30.0 mg/dl ÷ 110.1 mg/dl = 0.27
or 27%. It is often used along with the geometric mean for data which are
heavily positively skewed.

3. The range (largest value minus smallest) of a set of data is not a good
indicator of spread because...

(a) it is highly sensitive to extreme values,


(b) it does not make efficient use of the data, and
(c) ranges based on different numbers of values cannot be meaningfully
compared.

GENERAL RULES FOR REPORTING SUMMARIES

Always report the number of observations (n) on which the summary is based.
For binary responses (e.g. A, B) report the percentage of As or Bs but not both.

If the central value of a quantitative distribution is measured using the median
(as in positively skewed distributions), give the lower and upper quartiles as
well.

If the central value of a quantitative distribution is measured using the mean,
give the standard deviation as well.

CALCULATING THE MEAN AND STANDARD DEVIATION FROM A
FREQUENCY DISTRIBUTION

The table below shows the distribution of the number of previous pregnancies of a
group of women aged 30-34 taking part in a study of the association between
periodontal disease in expecting women and reproductive outcomes. Eighteen of
the 100 women had no previous pregnancies, 27 had one, 31 had two, 19 had three,
and five had four previous pregnancies.

No. of previous pregnancies    0     1     2     3     4     Total
No. of women                   18    27    31    19    5     100

As, for example, adding 2 thirty-one times is equivalent to adding the product (2 ×
31), the total number of previous pregnancies is calculated by

$\sum x$ = (0 × 18) + (1 × 27) + (2 × 31) + (3 × 19) + (4 × 5)
= 0 + 27 + 62 + 57 + 20 = 166

The average number of previous pregnancies is, therefore:

$\bar{x}$ = 166/100 = 1.66

In the same way:

$\sum x^2$ = (0² × 18) + (1² × 27) + (2² × 31) + (3² × 19) + (4² × 5)
= 0 + 27 + 124 + 171 + 80 = 402

The standard deviation is, therefore:

$$s = \sqrt{\frac{402 - 166^2/100}{99}} = \sqrt{\frac{126.44}{99}} = 1.13$$

If a variable has been grouped when constructing a frequency distribution, its mean
and standard deviation should be calculated using the original values, not the
frequency distribution. There are occasions, however, when only the frequency
distribution is available. In such a case, approximate values for the mean and
standard deviation can be calculated by using the values of the mid-points of the
groups and proceeding as above.
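
The calculation from a frequency distribution can be sketched in Python as follows, using the pregnancy data above. The dictionary `counts` maps each value to its frequency; all names are assumptions made for the example. For grouped data, the keys would be replaced by the mid-points of the groups, giving approximate results as described above.

```python
import math

# Number of previous pregnancies -> number of women.
counts = {0: 18, 1: 27, 2: 31, 3: 19, 4: 5}

n = sum(counts.values())                                            # sample size
sum_x = sum(value * freq for value, freq in counts.items())         # sum of x
sum_x2 = sum(value ** 2 * freq for value, freq in counts.items())   # sum of x squared

mean = sum_x / n
sd = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(n, sum_x, sum_x2, mean, round(sd, 2))
```

Multiplying each value by its frequency is equivalent to listing all 100 observations individually and applying the usual formulas.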

Suggested readings for today’s class:

Norman GR, Streiner DL. Biostatistics. The bare essentials (2nd ed.). Hamilton,
ON: B.C. Decker Inc., 2000. Chapters 1, 2, and 3.

Weintraub JA, Douglass CW, Gillings DB. Biostats. Data analysis for dental
health care professionals (2nd ed.). Research Triangle Park, NC: CAVCO Inc.,
1985. Chapters 4 and 5.

Kim JS, Dailey RJ. Biostatistics for oral healthcare (1st ed.). Ames, IA: Blackwell
Pub. Professional, 2008. Chapters 2 and 3.

DEN 1015H Review Exercises Session 1

1. The data below show the arch lengths of the maxilla of 110 boys aged 6 years. Arch length
is defined as the perpendicular distance between lines tangent to the labial surfaces of the
central incisors and the distal portions of the second primary molars or their permanent
successors. The measurements are in mm.
25.5 31.0 30.0 33.4 30.6 32.0 32.6 30.3 31.6 30.7
31.1 28.0 32.7 32.1 30.1 30.5 29.4 27.6 34.7 29.5
30.6 32.8 32.2 30.2 33.2 30.4 31.0 29.6 31.3 28.9
31.2 31.8 26.4 32.7 30.5 32.3 28.0 31.9 31.2 33.4
27.5 32.9 30.9 31.7 27.3 34.3 28.1 33.6 27.7 30.8
31.3 30.8 32.8 28.3 31.6 28.2 32.4 27.8 33.5 28.9
30.7 32.9 27.4 31.5 29.1 29.7 29.3 30.0 32.5 29.4
29.1 30.5 29.3 30.4 29.2 31.5 26.6 30.2 28.8 29.7
27.2 29.0 33.5 28.4 29.9 30.3 29.6 31.5 28.7 31.4
29.2 28.5 31.4 34.6 30.9 35.5 28.6 29.5 29.5 29.8
28.2 30.5 29.8 31.3 32.7 30.5 33.4 28.2 27.6 32.5

a) Construct a frequency distribution using 1 mm intervals.
b) Calculate relative frequencies (%) in each 1 mm interval.
c) Calculate cumulative relative frequencies (%) for each 1 mm interval.

2. In a caries prevention trial in the elderly, the 150 subjects had the following numbers of teeth
present:
40 subjects had fewer than 5 teeth
60 subjects had 5– teeth
30 subjects had 10– teeth
10 subjects had 15– teeth
10 subjects had more than 25 teeth

Draw a histogram to illustrate the data using 5 teeth as the unit class interval. Describe the
shape of the distribution.

The best measure of the typical number of teeth present for these subjects is

3. The following data represent the number of CNS symptoms reported by 25 dentists who used
a squeeze cloth technique for mercury-rich amalgam mixtures:
1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 8, 9, 9, 10.
Calculate the mean, median and mode as well as the standard deviation for the number of
CNS symptoms reported.

4. The following figures are the maximum pocket depths recorded in 12 women referred for
dental treatment during pregnancy. The depths are in mm.

1.5 1.5 3.0 5.0 9.0 0.5 0.5 5.0 4.0 4.0 6.5 3.0

a) What is the range of recorded pocket depths?


b) Calculate the mean, variance and standard deviation of these maximum pocket depths.
c) Construct a box plot representing the five number summary of the distribution, i.e., the
smallest value, the lower quartile (Q25), the median (Q50), the upper quartile (Q75) and the
largest value.
d) Do the data appear to be skewed?
e) Do the data contain any outlying observations?
f) What is the geometric mean of this distribution?

5. An absent-minded instructor calculated the following statistics for an examination: mean=50,
range=50, number of cases=99, minimum=20, and maximum=70. She then found an
additional examination with a score of 50. Recalculate the statistics, including the additional
exam score.

6. A sample of pocket depths has sample mean 4 mm and sample standard deviation 1 mm. If
the units of the measurements are changed to cm, the new sample mean is

The new sample standard deviation is

7. In the above question, what is the coefficient of variation of the pocket depths?

To answer this question, did you use mm or cm (or does it matter)?

DEN 1015H Solutions to Review Exercises Session 1

1.
Arch Length    Arch Length    Frequency    Relative         Cumulative
(mm)           (Class)                     Frequency (%)    Relative Freq. (%)
25-25.9        1              1            0.9              0.9
26-26.9        2              2            1.8              2.7
27-27.9        3              8            7.3              10.0
28-28.9        4              14           12.7             22.7
29-29.9        5              19           17.3             40.0
30-30.9        6              22           20.0             60.0
31-31.9        7              18           16.4             76.4
32-32.9        8              15           13.6             90.0
33-33.9        9              7            6.4              96.4
34-34.9        10             3            2.7              99.1
35-35.9        11             1            0.9              100.0

2.
Teeth    Class interval    Frequency    Relative         5-teeth Adjusted
                                        Frequency (%)    Relative Freq. (%)
0-4      5                 40           26.7             26.7
5-9      5                 60           40.0             40.0
10-14    5                 30           20.0             20.0
15-24    10                10           6.7              (6.7/10) × 5 = 3.35
25-29    8                 10           6.7              (6.7/8) × 5 = 4.2
30-32                                                    (6.7/8) × 3 = 2.5

The best measure of the typical number of teeth present for these subjects is the median.

[Histogram: % subjects (y-axis, 0 to 45) by # teeth (x-axis) in 5-tooth classes;
bar heights 26.7 (0-4), 40 (5-9), 20 (10-14), 3.35 (15-19), 3.35 (20-24),
4.2 (25-29), 2.5 (30-32). Annotation: positively skewed distribution.]

3. Mean = 3.76
Median = 3.00
Mode = 1 and 3
Standard Deviation = 2.71

4. a) Range = 8.5 mm
b) Mean = 3.625 mm
Variance = 6.415 mm²
Standard deviation = 2.533 mm
c)
Descriptives (POCKET)

                                                  Statistic    Std. Error
Mean                                              3.625        .731
95% Confidence Interval for Mean   Lower Bound    2.016
                                   Upper Bound    5.234
5% Trimmed Mean                                   3.500
Median                                            3.500
Variance                                          6.415
Std. Deviation                                    2.533
Minimum                                           .5
Maximum                                           9.0
Range                                             8.5
Interquartile Range                               3.500
Skewness                                          .696         .637
Kurtosis                                          .340         1.232

[Box plot of POCKET (maximum pocket depth in mm), N = 12.]

d) The data are slightly positively skewed.


e) The expecting mother with 9.0 mm of maximum pocket depth appears to be an outlier.
However, the sample size is too small to draw conclusions.
f) The mean log maximum pocket depth = 0.9735 (you should have logarithmically
transformed all the data first). The geometric mean is the anti-log of this mean = 2.65
mm.

5. Mean=50

Since $\bar{x} = \frac{\sum x_i}{n}$, we have $50 = \frac{\sum x_i}{99}$, so

$\sum x_i$ = 50 × 99 = 4950

Recalculated $\bar{x}$ = (4950 + 50)/100 = 50

Range=50
Number of cases=100
Minimum=20
Maximum=70

6. A sample of pocket depths has sample mean 4 mm and sample standard deviation 1 mm. If
the units of the measurements are changed to cm, the new sample mean is
0.4 cm

The new sample standard deviation is
0.1 cm

7. In the above question, what is the coefficient of variation of the pocket depths?
CV = $s/\bar{x}$ = 1/4 = 0.25

To answer this question, did you use mm or cm (or does it matter)?
It does not matter because CV is unitless.
