Session 1 DEN1015H 2013 Lecture Notes
INTRODUCTION
Statistics is the field that deals with the collection, analysis and
interpretation of numerical information. Statistics is applied to the planning and
analysis of research studies in virtually all subject areas. Biostatistics is a branch
of statistics that deals primarily with the health sciences and the biological
sciences.
Statistical and biostatistical methods are used for summarizing and organizing data
to allow efficient reporting and presentation of the results and also to provide
simple interpretations of the findings. This lecture gives many examples of the
graphical and tabular techniques of descriptive statistics that are used for
describing and summarizing data. The next lecture introduces concepts of
probability. Probability models describe the variability in data and also form the
logical basis of statistical inference, by which conclusions are drawn from data.
The remainder of the course will focus on several methods of statistical inference
that have important applications in biomedical research. Many of the examples
used to illustrate the concepts involve data from recent dental research.
DESCRIPTIVE STATISTICS
“Raw” data are observations derived from people, lab animals, lab specimens,
hospitals, etc.
The number of individuals (or animals, specimens, cells, hospitals, etc.) is called
the sample size.
In a data set the variables are the columns and the observations are the rows (Fig.
1).
© Dr. Herenia P. Lawrence
Implant Subject Type Irradiation Graft Failure
1 1 1 1 1 1
2 1 2 1 1 0
3 2 1 1 1 1
4 2 2 1 1 1
5 3 1 1 1 1
6 3 2 1 1 1
7 4 1 1 0 0
8 4 2 1 0 0
9 5 1 1 0 1
10 5 2 1 0 1
11 6 1 1 0 1
12 6 2 1 0 1
13 7 1 0 1 1
14 7 2 0 1 0
15 8 1 0 1 0
16 8 2 0 1 0
17 8 2 0 1 0
18 9 1 0 1 1
19 9 2 0 1 1
20 10 1 0 0 0
21 10 2 0 0 0
22 11 1 0 0 0
23 11 2 0 0 0
24 12 1 0 0 0
25 12 2 0 0 0
‘Type’: 1 = maxilla; 2 = mandible. For ‘irradiation’, ‘graft’, and
‘failure’: 1 = yes; 0 = no.
Fig. 1. Hypothetical data set from a study investigating the failure rate of 25
implants placed in grafted and non-grafted mandibular/maxillary bone in 6
irradiated and 6 non-irradiated subjects.
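As a rough illustration of the rows-and-columns layout, the data in Fig. 1 can be entered as one record per implant. This is only a sketch: the choice of Python, the lower-case variable names, and the helper `failure_counts` are assumptions made for the example, not part of the original notes.

```python
# Fig. 1 data: one tuple per implant (an observation/row);
# the positions in each tuple are the variables (columns).
COLUMNS = ("implant", "subject", "type", "irradiation", "graft", "failure")
ROWS = [
    (1, 1, 1, 1, 1, 1), (2, 1, 2, 1, 1, 0), (3, 2, 1, 1, 1, 1),
    (4, 2, 2, 1, 1, 1), (5, 3, 1, 1, 1, 1), (6, 3, 2, 1, 1, 1),
    (7, 4, 1, 1, 0, 0), (8, 4, 2, 1, 0, 0), (9, 5, 1, 1, 0, 1),
    (10, 5, 2, 1, 0, 1), (11, 6, 1, 1, 0, 1), (12, 6, 2, 1, 0, 1),
    (13, 7, 1, 0, 1, 1), (14, 7, 2, 0, 1, 0), (15, 8, 1, 0, 1, 0),
    (16, 8, 2, 0, 1, 0), (17, 8, 2, 0, 1, 0), (18, 9, 1, 0, 1, 1),
    (19, 9, 2, 0, 1, 1), (20, 10, 1, 0, 0, 0), (21, 10, 2, 0, 0, 0),
    (22, 11, 1, 0, 0, 0), (23, 11, 2, 0, 0, 0), (24, 12, 1, 0, 0, 0),
    (25, 12, 2, 0, 0, 0),
]
records = [dict(zip(COLUMNS, row)) for row in ROWS]

sample_size = len(records)                          # number of implants
n_subjects = len({r["subject"] for r in records})   # distinct subjects
n_failures = sum(r["failure"] for r in records)     # failure is coded 1 = yes


def failure_counts(records, irradiation):
    """Return (failures, implants) for one irradiation group (hypothetical helper)."""
    group = [r for r in records if r["irradiation"] == irradiation]
    return sum(r["failure"] for r in group), len(group)


print(sample_size, n_subjects, n_failures)
print("irradiated:", failure_counts(records, 1))
print("non-irradiated:", failure_counts(records, 0))
```

Note that the unit of observation here is the implant, not the subject, so the sample size is 25 even though only 12 subjects took part.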
TYPES OF VARIABLES
The first step, before performing statistical tests, is to decide what type of variables
(or data) one is dealing with, since different statistical analyses are needed for each
type of variable. Nominal and ordinal variables consist of counts in categories and
are analyzed using “non-parametric” statistics. Interval and ratio variables consist
of actual quantitative measurements and are analyzed using “parametric” statistics.
There are a number of typologies, but one that has proven useful is given in Table
1 and Fig. 2. The basic distinction is between quantitative data (for which one asks
“how much?”) and qualitative or categorical data (for which one asks “what
type?”). For a review, see suggested readings for today’s class.
In general, the amount of information increases as one goes from nominal to ratio
variables. Classifying interval measures into large categories is akin to throwing
away data.
There are other ways of defining types of variables. In an experiment, the
independent variables are those that are varied by and under the control of the
experimenter; the dependent variables are those that respond to experimental
manipulation. For example, in a clinical trial to determine the effect of periodontal
therapies on attachment gain, the independent variable is the type of therapy and
the dependent variable is the gain in attachment measured in millimeters.
Dependent variables should be clinically important and related to the independent
variables.
Data Types
  Dependent (periapical lesion)
  Independent (CHX irrigation versus saline)

  Qualitative
    Categorical/Nominal [sex (dichotomous), marital status]
    Categorical/Ordinal (stage of cancer, pain rating, Likert scale)
  Quantitative
    Discrete (# teeth)
    Continuous
      Interval (36º-38º C, probing pocket depth)
      Ratio (age, pulse rate, vital capacity, VAS)
FREQUENCIES, FREQUENCY DISTRIBUTIONS AND GRAPHS
The next step in the process of analyzing data is to describe the data by using a
frequency distribution to reflect the probability of the occurrence of an event. A
frequency distribution consists of a set of frequencies for all possibilities.
Frequency (or count): the number of people belonging to a category, e.g. the number
of students who present moderate malocclusion = 61.
Relative frequency: the proportion of people belonging to a category (frequency
divided by the total sample size). It can be expressed as a percentage, e.g. the
percentage of students who have a moderate degree of malocclusion
= 61/200 × 100 = 30.5%.
[Figure: frequencies and relative frequencies of occlusion status. Labels shown:
severe malocclusion 39; moderate malocclusion 61 (30.5%); normal occlusion 50%.]
Frequency distributions for quantitative variables
DMF score   Frequency           Relative        Cumulative   Cumulative Relative
            (no. of children)   Frequency (%)   Frequency    Frequency (%)
0           4                   8               4            8
1           2                   4               6            12
2           3                   6               9            18
3           6                   12              15           30
4           7                   14              22           44
5           10                  20              32           64
6           9                   18              41           82
7           3                   6               44           88
8           1                   2               45           90
9           1                   2               46           92
10          4                   8               50           100
Total       50                  100
Cumulative Frequency: the number of people in the sample with values less than
or equal to a specified value, e.g. 32 of the children have a DMF score of 5 or
less.
Relative Frequency (%): the number of people in the sample taking each value
divided by the total number of people studied and multiplied by 100, i.e., the
proportion of the total sample in each category. This proportion can be
interpreted as the probability that an individual chosen at random from the
original sample may fall in a particular category or within a range of
categories.
Cumulative Relative Frequency (%): the percentage of people in the sample with
values less than or equal to a specified value, e.g. 64% of the sample have a
DMF score of 5 or less.
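The four columns of the DMF table can be rebuilt from the frequencies alone. The sketch below is one possible way to do this in Python; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# DMF scores and the number of children with each score (from the table above).
dmf_scores = list(range(11))
frequency = [4, 2, 3, 6, 7, 10, 9, 3, 1, 1, 4]

n = sum(frequency)                                       # total sample size
relative = [100 * f / n for f in frequency]              # relative frequency (%)
cumulative = list(accumulate(frequency))                 # running total of counts
cumulative_relative = [100 * c / n for c in cumulative]  # cumulative relative (%)

for row in zip(dmf_scores, frequency, relative, cumulative, cumulative_relative):
    print("{:>3} {:>4} {:>6.1f} {:>4} {:>6.1f}".format(*row))
```

The row for a DMF score of 5 reproduces the worked statements in the definitions: 32 children (64% of the sample) have a score of 5 or less.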
Histograms
The histogram is an appropriate method for depicting a frequency distribution for
discrete or continuous data. Values are grouped into intervals, generally of equal
size. These intervals are then represented by bars with (if intervals have equal
width) heights proportional to the frequency of observations contained within
them.
The following frequency distribution shows the age at onset of edentia in a sample
of 200 edentulous persons.
[Figure: histogram of frequency against age at onset (years), in 10-year classes
from 10-19 yr to 70-79 yr.]
Example: Histogram with unequal class intervals
The following data are abstracted from a paper on the age at onset of edentia in
another sample of edentulous persons. The percentage distribution is shown.
[Figure: histogram of % of persons against age at onset (years), with classes
running from 11-15 yr to 71-75 yr.]
To standardize the data into equal intervals of 5 years, you will need to divide the
relative frequency by the width of the interval and then multiply it by 5 (see
example above*).
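The standardization rule can be written as a one-line function. The sketch below uses made-up classes and percentages (they are not the values from the figure), and the function name `standardized_height` is an assumption for illustration.

```python
def standardized_height(relative_frequency, width, unit_width=5):
    """Bar height expressed per `unit_width` years: (relative frequency / width) * unit."""
    return relative_frequency / width * unit_width


# Hypothetical classes, NOT the data from the paper: (lower age, upper age, % of persons).
classes = [(11, 15, 10.0), (16, 20, 20.0), (21, 30, 29.0), (31, 50, 5.0)]

for lower, upper, percent in classes:
    width = upper - lower + 1  # ages in whole years, so 11-15 spans 5 years
    height = standardized_height(percent, width)
    print(f"{lower}-{upper} yr: width {width} yr, {height:.2f}% per 5 years")
```

A class that is already 5 years wide keeps its relative frequency as its height, while a 10-year class is drawn at half its relative frequency, so that the area of each bar stays proportional to the number of observations.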
Frequency Polygon
The frequency polygon is similar to the histogram. It is constructed by placing a
point at the center of each interval of the histogram such that the height of the point
is equal to the frequency or relative frequency associated with that interval.
[Figure: two panels sharing the x-axis "Number of Hours", with classes 1 to 10,
11 to 20, 21 to 30, 31 to 40, 41 to 50, and 51 to 60.]
Stem-and-leaf plot
It resembles a histogram, with the first digit(s) of each datum along the “stem” and
the last digit(s) forming the “leaves”.
Stem Leaf
0 6 1 4 8
1 1 3 2 5 7 9 9
2 6 0 2
3 2
Stem Leaf
0 1 4 6 8
1 1 2 3 5 7 9 9
2 0 2 6
3 2
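A stem-and-leaf plot like the ordered one above can be produced in a few lines. This is a minimal sketch that assumes two-digit whole numbers, with tens as stems and units as leaves; the function name is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: sorted leaves}, using the tens digit as stem and the units digit as leaf."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


# The 15 values read off the unordered display above.
data = [6, 1, 4, 8, 11, 13, 12, 15, 17, 19, 19, 26, 20, 22, 32]

for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Sorting the values first is what turns the first (unordered) display into the second (ordered) one. A stem with no observations would simply be missing from this output, so a fuller version would print empty stems as well.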
Numerical data can be further summarized by measures that describe where the
center of the distribution lies – mean, median, and mode – and measures of how
wide the distribution is – range, percentile, and standard deviation.
MEDIAN the middle value of the distribution, i.e., the value for which 50% of
the sample have values less than or equal to the median and 50% have
values greater than or equal to the median (50th percentile). It is
calculated by rank ordering (from lowest to highest) the values and
then determining the value corresponding to the middle rank, i.e., the
rank order (n+1)/2. Thus, if the sample contains an odd number of
subjects, the median will be the value of the subject with the middle
rank. If the sample contains an even number of subjects, the median
value will fall half-way between the values of the two midmost
subjects.
MODE the most common single value, i.e., the peak of the frequency
distribution. A distribution with two or more modes is referred to as
bimodal, trimodal, etc.
Median = 7.5 teeth [the average of the (n/2)th and (n/2 + 1)th observations if n is
even].
Mode = 6 teeth
N.B.: The mode is seldom used. If the sample is small, either it may not be
possible to estimate the mode (e.g. when all the values are different), or the
estimate obtained may be misleading.
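The rank-based definition of the median can be checked with a short calculation. The sample below is hypothetical: it was chosen only so that it has an even number of values and gives a median of 7.5 and a mode of 6, and it is not the data set used in the lecture.

```python
import statistics

# Hypothetical numbers of teeth for 8 subjects (NOT the lecture's data).
teeth = [4, 6, 6, 7, 8, 9, 10, 12]

ordered = sorted(teeth)
n = len(ordered)
if n % 2 == 1:
    # Odd n: the value with rank (n + 1) / 2 (ranks start at 1, list indices at 0).
    median = ordered[(n + 1) // 2 - 1]
else:
    # Even n: average of the (n/2)th and (n/2 + 1)th ordered observations.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

mean = sum(teeth) / n
modes = statistics.multimode(teeth)  # list of all most-common values

print("mean:", mean, "median:", median, "mode(s):", modes)
```

`statistics.multimode` returns every value tied for the highest frequency, which makes bimodal or trimodal samples visible; when all the values are different it returns the whole sample, which matches the warning that the mode may not be estimable in small samples.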
1. If the sample is shifted by a constant c, i.e., c is added to all data values, then
the mean and median are also changed by this same amount.
2. If the sample is rescaled by a constant c, i.e., all data values are multiplied
by c, then the mean and median are also multiplied by c.
Example: If the units of serum glucose are changed from mg/dl to g/l
(multiply by 10/1000=1/100), then the mean is changed from 110.1 mg/dl to
1.101 g/l.
2. In some cases, the mean must be used for interpretative reasons since it takes
into account each individual observation.
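The shift and rescale properties of the mean and median can be verified numerically. The five glucose values below are hypothetical, picked so that their mean is 110.1 mg/dl as in the example; the shift constant is likewise arbitrary.

```python
import math
import statistics

# Hypothetical serum glucose values in mg/dl (mean 110.1, median 110.0).
glucose_mg_dl = [90.0, 100.5, 110.0, 120.0, 130.0]
mean = statistics.mean(glucose_mg_dl)
median = statistics.median(glucose_mg_dl)

# Property 1: adding a constant c to every value shifts mean and median by c.
c = 5.0
shifted = [x + c for x in glucose_mg_dl]

# Property 2: multiplying every value by a constant rescales mean and median.
# mg/dl -> g/l means multiplying by 10/1000 = 1/100.
rescaled = [x / 100 for x in glucose_mg_dl]

print("original:", mean, median)
print("shifted :", statistics.mean(shifted), statistics.median(shifted))
print("rescaled:", statistics.mean(rescaled), statistics.median(rescaled))
```

Comparisons of floating-point results are best made with a tolerance (for example `math.isclose`) rather than exact equality.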
Possible distributions of data values
Symmetric vs. asymmetric (skewed) distributions
Two asymmetric curves, one with positive skew (Curve A, skewed to the right)
and one with negative skew (Curve B, skewed to the left).
Bimodal
The mean, median, and mode in a symmetric distribution
Data Transformations
Example: The following are the number of days spent in hospital by 17 subjects
following an operation:
[Figure: one-way scatter plots. a) Raw data, axis from 0 to 50. b) Ln data, axis
from 0 to 5.]
Fig. 1. One-way scatter plots of length of hospital stay showing a) raw data and b)
data on a logarithmic scale.
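Thus, the geometric mean can be calculated by the following three steps:
1. take the logarithm of each data value;
2. calculate the sample mean of the logarithms;
3. the geometric mean is the anti-logarithm of the sample mean found in step 2.
antilog of $\overline{\ln x} = e^{\overline{\ln x}}$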
The geometric mean is used only with data which are heavily positively skewed.
Examples of variables for which logarithms and geometric means are sometimes
useful include concentrations and bacterial counts.
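The calculation can be sketched as follows, assuming that the earlier steps are to take the natural logarithm of each value and then average the logarithms (step 3 refers to "the sample mean found in step 2"). The sample is hypothetical and is not the hospital-stay data.

```python
import math
import statistics

# Hypothetical positively skewed sample (e.g. bacterial counts); NOT data from the notes.
values = [1, 2, 4, 8, 16]

logs = [math.log(x) for x in values]     # natural log of each value
mean_log = sum(logs) / len(logs)         # sample mean of the logs
geometric_mean = math.exp(mean_log)      # anti-log of that mean

arithmetic_mean = sum(values) / len(values)
print("geometric mean :", round(geometric_mean, 3))
print("arithmetic mean:", arithmetic_mean)
```

For this skewed sample the geometric mean is smaller than the arithmetic mean, because the large values pull the arithmetic mean upwards. Python 3.8 and later also provide `statistics.geometric_mean`, which should agree with the three-step result. The logarithm is undefined for zero or negative values, so such data need separate handling.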
MEASURES OF DISPERSION or VARIATION, i.e., STATISTICS USED
TO DESCRIBE THE SPREAD OF A DISTRIBUTION
Range: the interval between the lowest and highest value in the distribution.
Percentile ranges
A percentile range is an interval between two specified percentile points, e.g. the
interquartile range includes the values between the 25th (Q25) and 75th (Q75)
percentiles; the median is equivalent to the 50th (Q50) percentile point, since
(n+1)/2 = (n+1) × 50/100. For a distribution with a large number of observations
the quartiles are most easily found from the cumulative relative frequency by
reading off the values that correspond to 25%, 50%, and 75% (see cumulative
distribution plot below illustrating the data on page 7).
[Figure: cumulative relative frequency (%) plotted against DMF score (0 to 10).]
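Reading the quartiles off the cumulative relative frequency can be mimicked in code. The sketch below takes the cumulative percentages from the DMF table and returns the first score at which the cumulative percentage reaches the requested level; the function name is an assumption, and other quartile conventions can give slightly different answers.

```python
# DMF scores and cumulative relative frequencies (%) from the DMF table.
dmf_scores = list(range(11))
cumulative_relative = [8, 12, 18, 30, 44, 64, 82, 88, 90, 92, 100]


def percentile_from_cumulative(values, cumulative_percent, p):
    """Smallest value whose cumulative relative frequency is at least p percent."""
    for value, cum in zip(values, cumulative_percent):
        if cum >= p:
            return value
    raise ValueError("p must not exceed the final cumulative percentage")


q25 = percentile_from_cumulative(dmf_scores, cumulative_relative, 25)
q50 = percentile_from_cumulative(dmf_scores, cumulative_relative, 50)
q75 = percentile_from_cumulative(dmf_scores, cumulative_relative, 75)
print("Q25 =", q25, "median =", q50, "Q75 =", q75)
```

With these data the interquartile range runs from a DMF score of 3 to a DMF score of 6.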
A box plot (or box and whiskers plot) displays the 1st and 3rd quartiles as a box with
the median at the centre. Lines are drawn from the box to the extreme
observations (although values lying too far from the box are sometimes identified
separately as outliers). Like the histogram, the box plot is useful for checking the
symmetry of a distribution (i.e., whether it has the same shape on either side of the
median).
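The numbers behind a box plot can be computed directly. The sketch below uses 17 hypothetical lengths of stay (not the data from the notes) and assumes the common convention that points more than 1.5 times the interquartile range beyond the box are flagged as outliers; the notes do not state which rule is intended.

```python
import statistics

# Hypothetical lengths of stay (days) for 17 subjects; NOT the data from the notes.
days = [3, 4, 4, 5, 6, 6, 7, 8, 9, 10, 10, 11, 12, 14, 16, 20, 45]

# Quartiles using the default "exclusive" method, i.e. ranks based on (n + 1).
q1, median, q3 = statistics.quantiles(days, n=4)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 x IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [d for d in days if d < lower_fence or d > upper_fence]
inside = [d for d in days if lower_fence <= d <= upper_fence]
whiskers = (min(inside), max(inside))  # whisker ends when outliers are drawn separately

print("min, Q1, median, Q3, max:", min(days), q1, median, q3, max(days))
print("whiskers:", whiskers, "outliers:", outliers)
```

In this sample the upper whisker is much longer than the lower one and the single long stay is flagged separately, which is the kind of asymmetry a box plot is meant to reveal.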
Example: Using the data from the previous example, i.e., the number of days spent
in hospital by 17 subjects following an operation:
[Figure: box plot of days in hospital on an axis from 0 to 50 days, with the
minimum and maximum marked.]
Another important use of the box plot is for comparing distributions. For example,
the figure below displays box plots of changes in number of decayed, missing, or
filled surfaces (DMFS) in a clinical trial to compare caries-preventive effects of
various chewing gums. The plots are on the same scale, allowing easy comparison
between treatment groups.
Fig. Box plots of change in DMFS by treatment group in a chewing gum study.
Sample Variance: the average square ‘distance’/difference of each observation
from the mean. By squaring the difference, all terms will be positive (see your
algebra notes).
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
The quantity n-1 is called the number of degrees of freedom of the variance. The
formula uses n-1 because the variance of the sample calculated in this way better
approximates the variance of its target population.
Sample Standard Deviation (abbreviated SD, s.d., or s): the square root of the
variance. The standard deviation will have the same unit of measurement as the
original data. The smaller the standard deviation, the less each score varies from
the mean. The larger the spread of scores, the larger the SD becomes.
Algebraically, the formula looks like this:
s.d. (or s) = $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
5, 6, 6, 7, 8, 8, 10, 10, 12
The range is 12 - 5 = 7
N.B.: Many calculators have built-in functions for the mean and standard
deviation. The keys are commonly labeled $\bar{x}$ and $\sigma_{n-1}$, respectively,
where $\sigma$ is the lower case Greek letter sigma.
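The variance and standard deviation formulas can be applied step by step to the nine values listed above (5, 6, 6, 7, 8, 8, 10, 10, 12). The sketch below is illustrative; the variable names are assumptions.

```python
import math
import statistics

data = [5, 6, 6, 7, 8, 8, 10, 10, 12]

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]   # (x_i - mean)^2 for each value
variance = sum(squared_deviations) / (n - 1)           # divide by n - 1 degrees of freedom
sd = math.sqrt(variance)                               # same units as the original data

print("mean =", mean)
print("sum of squared deviations =", sum(squared_deviations))
print("variance =", variance, " SD =", round(sd, 2))
```

The standard-library functions `statistics.variance` and `statistics.stdev` also divide by n - 1, so they should agree with the manual calculation; `statistics.pvariance` and `statistics.pstdev` divide by n instead.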
Calculation of the Variance and Standard Deviation
Properties of the standard deviation and inter-quartile range:
2. If the sample is rescaled by a constant c, i.e., all data values are multiplied
by c, then the standard deviation and inter-quartile range are also multiplied
by c.
Example: In mg/dl, serum glucose has a mean of 110.1 mg/dl and a standard
deviation of 30.0 mg/dl; in g/l, the mean is 1.101 g/l with a standard
deviation of 0.3 g/l.
3. The range (largest value minus smallest) of a set of data is not a good
indicator of spread because...
GENERAL RULES FOR REPORTING SUMMARIES
Always report the number of observations (n) on which the summary is based.
For binary responses (e.g. A, B) report the percentage of As or Bs but not both.
CALCULATING THE MEAN AND STANDARD DEVIATION FROM A
FREQUENCY DISTRIBUTION
The table below shows the distribution of the number of previous pregnancies of a
group of women aged 30-34 taking part in a study of the association between
periodontal disease in expecting women and reproductive outcomes. Eighteen of
the 100 women had no previous pregnancies, 27 had one, 31 had two, 19 had three,
and five had four previous pregnancies.
As, for example, adding 2 thirty-one times is equivalent to adding the product (2 ×
31), the total number of previous pregnancies is calculated by
$\sum x = (0 \times 18) + (1 \times 27) + (2 \times 31) + (3 \times 19) + (4 \times 5)$
$= 0 + 27 + 62 + 57 + 20 = 166$
If a variable has been grouped when constructing a frequency distribution, its mean
and standard deviation should be calculated using the original values, not the
frequency distribution. There are occasions, however, when only the frequency
distribution is available. In such a case, approximate values for the mean and
standard deviation can be calculated by using the values of the mid-points of the
groups and proceeding as above.
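The calculation from a frequency distribution can be sketched with the pregnancy counts given above. The dictionary layout and names are assumptions for illustration; for grouped data the keys would be the class mid-points, and the results would then be approximate.

```python
import math

# Number of previous pregnancies -> number of women (from the text above).
counts = {0: 18, 1: 27, 2: 31, 3: 19, 4: 5}

n = sum(counts.values())                          # total number of women
total = sum(x * f for x, f in counts.items())     # each value weighted by its frequency
mean = total / n

# Sum of squared deviations, again weighting each value by its frequency.
sum_sq = sum(f * (x - mean) ** 2 for x, f in counts.items())
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print("n =", n, " total =", total, " mean =", mean)
print("variance =", round(variance, 3), " SD =", round(sd, 2))
```

Multiplying each value by its frequency gives the same totals as listing all 100 observations individually, which is the point made about adding 2 thirty-one times.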
Suggested readings for today’s class:
Norman GR, Streiner DL. Biostatistics. The bare essentials (2nd ed.). Hamilton,
ON: B.C. Decker Inc., 2000. Chapters 1, 2, and 3.
Weintraub JA, Douglass CW, Gillings DB. Biostats. Data analysis for dental
health care professionals (2nd ed.). Research Triangle Park, NC: CAVCO Inc.,
1985. Chapters 4 and 5.
Kim JS, Dailey RJ. Biostatistics for oral healthcare (1st ed.). Ames, IA: Blackwell
Pub. Professional, 2008. Chapters 2 and 3.
DEN 1015H Review Exercises Session 1
1. The data below show the arch lengths of the maxilla of 110 boys aged 6 years. Arch length
is defined as the perpendicular distance between lines tangent to the labial surfaces of the
central incisors and the distal portions of the second primary molars or their permanent
successors. The measurements are in mm.
25.5 31.0 30.0 33.4 30.6 32.0 32.6 30.3 31.6 30.7
31.1 28.0 32.7 32.1 30.1 30.5 29.4 27.6 34.7 29.5
30.6 32.8 32.2 30.2 33.2 30.4 31.0 29.6 31.3 28.9
31.2 31.8 26.4 32.7 30.5 32.3 28.0 31.9 31.2 33.4
27.5 32.9 30.9 31.7 27.3 34.3 28.1 33.6 27.7 30.8
31.3 30.8 32.8 28.3 31.6 28.2 32.4 27.8 33.5 28.9
30.7 32.9 27.4 31.5 29.1 29.7 29.3 30.0 32.5 29.4
29.1 30.5 29.3 30.4 29.2 31.5 26.6 30.2 28.8 29.7
27.2 29.0 33.5 28.4 29.9 30.3 29.6 31.5 28.7 31.4
29.2 28.5 31.4 34.6 30.9 35.5 28.6 29.5 29.5 29.8
28.2 30.5 29.8 31.3 32.7 30.5 33.4 28.2 27.6 32.5
2. In a caries prevention trial in the elderly, the 150 subjects had the following numbers of teeth
present:
40 subjects had fewer than 5 teeth
60 subjects had 5– teeth
30 subjects had 10– teeth
10 subjects had 15– teeth
10 subjects had more than 25 teeth
Draw a histogram to illustrate the data using 5 teeth as the unit class interval. Describe the
shape of the distribution.
The best measure of the typical number of teeth present for these subjects is
3. The following data represent the number of CNS symptoms reported by 25 dentists who used
a squeeze cloth technique for mercury-rich amalgam mixtures:
1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 8, 9, 9, 10.
Calculate the mean, median and mode as well as the standard deviation for the number of
CNS symptoms reported.
4. The following figures are the maximum pocket depths recorded in 12 women referred for
dental treatment during pregnancy. The depths are in mm.
1.5 1.5 3.0 5.0 9.0 0.5 0.5 5.0 4.0 4.0 6.5 3.0
6. A sample of pocket depths has sample mean 4 mm and sample standard deviation 1 mm. If
the units of the measurements are changed to cm, the new sample mean is
7. In the above question, what is the coefficient of variation of the pocket depths?
DEN 1015H Solutions to Review Exercises Session 1
2. The best measure of the typical number of teeth present for these subjects is the median.
[Figure: histogram of % subjects per 5-teeth class interval against number of teeth.
Bar heights: 0-4, 26.7; 5-9, 40; 10-14, 20; 15-19, 3.35; 20-24, 3.35; 25-29, 4.2;
30-32, 2.5. Positively skewed distribution.]
3. Mean = 3.76
Median = 3.00
Mode = 1 and 3
Standard Deviation = 2.71
4. a) Range = 8.5 mm
b) Mean = 3.625 mm
Variance = 6.415 mm²
Standard deviation = 2.533 mm
c) [Figure: box plot of the maximum pocket depths, N = 12.]
5. Mean = 50
Since $\bar{x} = \frac{\sum x_i}{n}$,
$50 = \frac{\sum x_i}{99}$
$\sum x_i = 50 \times 99 = 4950$
Recalculated $\bar{x} = \frac{4950 + 50}{100} = 50$
Range=50
Number of cases=100
Minimum=20
Maximum=70
8. A sample of pocket depths has sample mean 4 mm and sample standard deviation 1 mm. If
the units of the measurements are changed to cm, the new sample mean is
0.4 cm
9. In the above question, what is the coefficient of variation of the pocket depths?
$CV = s / \bar{x} = 1 / 4 = 0.25$