Biostatistics
By
Hagazi Gebre (MSc.), Assistant Professor
of Biostatistics
Department of Biostatistics
Mekelle University
Descriptive statistics
Before performing any analyses, you must first get
to know your data
Descriptive statistics are used to summarize data in
the form of tables, graphs and numerical measures
The summary technique used depends on the data
type under consideration
5 Apr 2025 Letekirstos 3
Frequency Distributions
• After collecting data, the first task for a
researcher is to organize and simplify the data so
that it is possible to get a general overview of the
results.
• This is the goal of descriptive statistical
techniques.
• One method for simplifying and organizing data
is to construct a frequency distribution.
Frequency Distributions…
Types:
• Absolute/frequency/number
• Relative/percent
• Cumulative
– Absolute
– Relative
Frequency Distribution for Discrete Random Variables
Example:
• Suppose that we take a sample of size 16 from children in a primary
school and get the following data about the number of their decayed
teeth:
• 3,5,2,4,0,1,3,5,2,3,2,3,3,2,4,1
• To construct a frequency table:
1- Order the values from the smallest to the largest.
• 0,1,1,2,2,2,2,3,3,3,3,3,4,4,5,5
2- Count how many numbers are the same.

No. of decayed teeth   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
0                      1           0.0625               1                      0.0625
1                      2           0.125                3                      0.1875
2                      4           0.25                 7                      0.4375
3                      5           0.3125               12                     0.75
4                      2           0.125                14                     0.875
5                      2           0.125                16                     1
Total                  16
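The two steps (order the values, then count repeats) can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name frequency_table is an assumption and not part of the course material.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative, cumulative, cumulative relative)."""
    counts = Counter(values)          # step 2: count how many numbers are the same
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):      # step 1: go from the smallest to the largest value
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, freq / n, cumulative, cumulative / n))
    return rows

# Number of decayed teeth for the 16 children in the example
teeth = [3, 5, 2, 4, 0, 1, 3, 5, 2, 3, 2, 3, 3, 2, 4, 1]
table = frequency_table(teeth)

for value, freq, rel, cum, cum_rel in table:
    print(f"{value:>5} {freq:>5} {rel:>8.4f} {cum:>5} {cum_rel:>8.4f}")
```

The last cumulative frequency should equal the sample size (16), which is a quick check that no value was lost while counting.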
School of Public Health
Statistical data summarization techniques
Arithmetic mean, cont…
• discrete or continuous data
• do NOT use for ordinal or nominal data
Advantage:
– Simplicity: easy to calculate and understand
– It always exists (can be calculated for any set of numerical data)
– It is unique. For a given set of numerical data there is one and only
one mean.
– It makes use of all values in the data.
– It is stable (means of different samples from the same population do not
fluctuate considerably)
Disadvantage:
– extremely sensitive to unusual values (outliers)
– It cannot be calculated for data that are not quantifiable.
– It cannot be calculated for continuous grouped data with open-ended
classes
Median, cont…
It is the number separating the higher half of a sample, a population, or a probability
distribution from the lower half.
Advantage:
It always exists
Robust or insensitive to outliers (extreme values)
It is unique (there is only one median for a given set of data)
It can be computed for a distribution with open class limit.
It can be computed for ratio, interval and ordinal data.
Disadvantage:
It doesn’t take each and every value into consideration.
It requires arranging the data in order.
Its statistical properties are not as easy to derive.
It allows no algebraic manipulation, e.g. it is not possible to calculate the combined median of two or
more groups.
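The contrast between a mean that is sensitive to outliers and a median that is robust can be seen with a small numerical sketch. The five values and the added extreme value 400 are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical sample (not from the slides)
sample = [70, 75, 80, 85, 90]
# Same sample with one extreme value added
with_outlier = sample + [400]

mean_before, median_before = mean(sample), median(sample)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier: mean =", mean_before, "median =", median_before)
print("with outlier:    mean =", round(mean_after, 2), "median =", median_after)
```

One extreme value pulls the mean far away from the bulk of the data, while the median moves only slightly.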
Exercise 1
Suppose the following data show the maximal static inspiratory
pressure (PI max in cmH2O) of patients with cystic fibrosis
admitted in a certain hospital during one month duration.
80 100 85 110 75
85 45 70 125 110
110 95 80 75 150
95 130 100 100 75
90 75 120 40 95
Compute the arithmetic mean, median, and mode.
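A hand calculation for this exercise can be checked with a short script. This is a sketch using Python's standard statistics module, not the SPSS workflow used in the course; the variable names are assumptions.

```python
from statistics import mean, median, multimode

# PI max (cmH2O) for the 25 patients in Exercise 1
pi_max = [
    80, 100, 85, 110, 75,
    85, 45, 70, 125, 110,
    110, 95, 80, 75, 150,
    95, 130, 100, 100, 75,
    90, 75, 120, 40, 95,
]

pi_mean = mean(pi_max)        # sum of the values divided by n
pi_median = median(pi_max)    # middle value of the ordered data (13th of 25)
pi_modes = multimode(pi_max)  # most frequent value(s); a list, since ties are possible

print("mean:", pi_mean)
print("median:", pi_median)
print("mode(s):", pi_modes)
```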
Variance
average of squares of deviation from the mean
• Variance of a sample: usually subtract 1 from n in the
denominator
$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$
where n - 1 is the effective sample size, also called the
degrees of freedom
Standard deviation
• Problem with variance: its unit of measurement is squared,
because the deviations are squared
– Solution: take the square root of the variance
=> standard deviation
• Sample standard deviation (s):
$$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
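The sample variance and standard deviation can be computed step by step from their definitions. The sketch below applies them to the decayed-teeth data from the frequency distribution example; the function names are assumptions made for illustration.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """Square root of the sample variance, in the original unit of measurement."""
    return math.sqrt(sample_variance(xs))

# Number of decayed teeth for the 16 children in the earlier example
teeth = [3, 5, 2, 4, 0, 1, 3, 5, 2, 3, 2, 3, 3, 2, 4, 1]

s2 = sample_variance(teeth)
s = sample_sd(teeth)
print("variance:", round(s2, 4))
print("standard deviation:", round(s, 4))
```

The results agree with statistics.variance and statistics.stdev from the standard library, which also divide by n - 1.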
What do You Think?
Three samples with the same mean but different spread:
• 7, 7, 7, 7, 7, 7: Mean = 7, SD = 0
• 6, 7, 7, 7, 7, 8: Mean = 7, SD = 0.63
• 2, 3, 7, 8, 9, 13: Mean = 7, SD = 4.05
Coefficient of variation…
• Caution must be exercised when using standard
deviation as a comparative index of dispersion

Weights of newborn elephants (kg)   Weights of newborn mice (kg)
929   853                           0.72   0.42
878   939                           0.63   0.31
895   972                           0.59   0.38
937   841                           0.79   0.96
801   826                           1.06   0.89
n = 10, X̄ = 887.1                   n = 10, X̄ = 0.68
sd = 56.50                          sd = 0.255

Incorrect to say that elephants show greater variation for
birth-weights than mice because of higher standard deviation
So, what do we do?
Coefficient of variation…
• The coefficient of variation expresses the standard deviation
relative to its mean:
$$ cv = \frac{s}{\bar{X}} $$

Weights of newborn elephants (kg)   Weights of newborn mice (kg)
929   853                           0.72   0.42
878   939                           0.63   0.31
895   972                           0.59   0.38
937   841                           0.79   0.96
801   826                           1.06   0.89
n = 10, X̄ = 887.1                   n = 10, X̄ = 0.68
s = 56.50, cv = 0.0637              s = 0.255, cv = 0.375

Mice show greater birth-weight variation
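The comparison can be reproduced from the raw birth weights. This is a minimal sketch with the standard library; the function name coefficient_of_variation is an assumption. Because the script divides by the unrounded mean of the mice data (0.675), it gives a cv of about 0.378 rather than the 0.375 obtained from the rounded mean 0.68.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """Sample standard deviation expressed relative to the mean (unit-free)."""
    return stdev(xs) / mean(xs)

# Birth weights in kg, as listed on the slide
elephants = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

cv_elephants = coefficient_of_variation(elephants)
cv_mice = coefficient_of_variation(mice)

print("elephants: sd =", round(stdev(elephants), 2), "cv =", round(cv_elephants, 4))
print("mice:      sd =", round(stdev(mice), 3), "cv =", round(cv_mice, 3))
```

The standard deviation is larger for the elephants, but the cv is larger for the mice, which is the point of the slide.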
EXERCISE 2
Consider the following two samples that represent
cholesterol measurements (mg/100ml), each on the
same person, but using different measurement
techniques.
Method                 Measurements           Mean
Auto-analyzer (AA)     177 193 195 209 226    200
Micro-enzymatic (ME)   192 197 202 209 200    200
EXERCISE 2…
1. Compute the range and standard deviations for both
methods.
2. Compute the 5th and 95th percentiles for the PI max data.
3. Compute the CV for the age and weight of two groups of
students.
Variable   Mean          Standard deviation
Age        20.63 years   3.15 years
Weight     58.89 kg      8.10 kg
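Items 1 and 3 can be checked with a few lines of Python. This is a sketch for verifying hand calculations, with variable names chosen here for illustration. Item 2 is left out because percentile conventions differ between textbooks and software.

```python
from statistics import stdev

# Item 1: cholesterol measurements (mg/100ml) by method
aa = [177, 193, 195, 209, 226]   # Auto-analyzer
me = [192, 197, 202, 209, 200]   # Micro-enzymatic

range_aa = max(aa) - min(aa)     # range = largest value minus smallest value
range_me = max(me) - min(me)
sd_aa = stdev(aa)                # sample standard deviation (n - 1 in the denominator)
sd_me = stdev(me)

# Item 3: cv = standard deviation / mean, from the summary table
cv_age = 3.15 / 20.63
cv_weight = 8.10 / 58.89

print("AA: range =", range_aa, "sd =", round(sd_aa, 2))
print("ME: range =", range_me, "sd =", round(sd_me, 2))
print("cv age =", round(cv_age, 3), "cv weight =", round(cv_weight, 3))
```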
Individual Assignment
COMPUTATION OF SUMMARY VALUES FOR
DIFFERENT TYPES OF DISTRIBUTION
• Simple frequency distribution
• Grouped frequency distribution
– Mean
– Median
– Percentiles
– Standard deviation
Individual Assignment…
1. Consider the data on number of children per
family and compute the mean, median and
standard deviation
No. of children (x)   0   1    2    3    4    5    6   Total
No. of families (f)   7   35   67   43   32   10   3   197 (n)
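For a simple frequency distribution, each value x is weighted by its frequency f. The sketch below shows one way to organize the computation; the helper value_at is an assumption introduced for illustration.

```python
import math

x = [0, 1, 2, 3, 4, 5, 6]          # number of children
f = [7, 35, 67, 43, 32, 10, 3]     # number of families

n = sum(f)
mean_children = sum(xi * fi for xi, fi in zip(x, f)) / n

def value_at(position):
    """Value of the observation at a 1-based position in the ordered data."""
    cumulative = 0
    for xi, fi in zip(x, f):
        cumulative += fi
        if position <= cumulative:
            return xi
    raise ValueError("position is larger than n")

# Median: average of the two middle positions (they coincide when n is odd)
median_children = (value_at((n + 1) // 2) + value_at(n // 2 + 1)) / 2

# Sample standard deviation: frequency-weighted squared deviations over n - 1
ss = sum(fi * (xi - mean_children) ** 2 for xi, fi in zip(x, f))
sd_children = math.sqrt(ss / (n - 1))

print("n =", n)
print("mean =", round(mean_children, 3))
print("median =", median_children)
print("sd =", round(sd_children, 3))
```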
Individual Assignment…
2. Consider the following data on serum uric acid
(SUA) in mg per 100ml of 267 healthy male
subjects
SUA(x) f
3.0-3.4 2
3.5-3.9 15
4.0-4.4 33
4.5-4.9 40
5.0-5.4 54
5.5-5.9 47
6.0-6.4 38
6.5-6.9 16
7.0-7.4 15
7.5-7.9 3
8.0-8.4 1
8.5-8.9 3
Total 267
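For a grouped frequency distribution, the usual textbook approach treats every observation as if it sat at its class midpoint and interpolates the median within the median class. The sketch below makes those assumptions explicit: class boundaries are taken halfway between adjacent class limits (for example 4.95 to 5.45), and the function names are illustrative.

```python
# Lower class limits 3.0, 3.5, ..., 8.5; each class spans 0.4 (e.g. 3.0-3.4)
lowers = [3.0 + 0.5 * i for i in range(12)]
freqs = [2, 15, 33, 40, 54, 47, 38, 16, 15, 3, 1, 3]

CLASS_SPAN = 0.4    # upper limit minus lower limit within a class
GAP = 0.1           # gap between one upper limit and the next lower limit
WIDTH = 0.5         # class width measured between class boundaries

def grouped_mean(lowers, freqs):
    """Frequency-weighted mean of the class midpoints."""
    midpoints = [lo + CLASS_SPAN / 2 for lo in lowers]
    return sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)

def grouped_median(lowers, freqs):
    """Linear interpolation inside the class that contains the n/2-th value."""
    half = sum(freqs) / 2
    cumulative = 0
    for lo, f in zip(lowers, freqs):
        if cumulative + f >= half:
            lower_boundary = lo - GAP / 2
            return lower_boundary + (half - cumulative) / f * WIDTH
        cumulative += f

sua_mean = grouped_mean(lowers, freqs)
sua_median = grouped_median(lowers, freqs)
print("n =", sum(freqs))
print("grouped mean =", round(sua_mean, 3))
print("grouped median =", round(sua_median, 3))
```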
Extreme Values
Symmetric: Mean = Median
Skewed right: Mean > Median
Skewed left: Mean < Median
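The three cases can be illustrated with small samples. The data below are hypothetical and chosen only so that each pattern is easy to see; comparing mean and median is a rule of thumb, not a formal test of skewness.

```python
from statistics import mean, median

def compare_mean_median(values):
    """Report how the mean compares with the median for a sample."""
    m, md = mean(values), median(values)
    if m > md:
        return "mean > median"
    if m < md:
        return "mean < median"
    return "mean = median"

# Hypothetical samples (not from the slides)
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 2, 3, 3, 3, 4, 20]          # one large value in the right tail
skewed_left = [1, 17, 18, 18, 19, 19, 19, 20]     # one small value in the left tail

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(name, "->", compare_mean_median(data))
```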
Diagrammatic Presentation of data
Class exercise
• Identify diagrams and corresponding data
types (submit on a piece of paper)
Diagrams…
• Box-and-Whisker plot
– Special use: exploratory data analysis
– Used to see the relationship between a numerical and a
categorical variable
• Line graph
– Mainly for time series data
• Scatter plot
– To see the relationship between two numerical
variables
Bar chart:
• A bar chart provides a picture of data that could also be
reasonably displayed in tabular format.
• Bar charts can be created for nominal, ordinal, or continuous
data, although they are most frequently used with nominal data.
• If used with continuous data, the chart could be called a
histogram instead of a bar chart.
• The bar chart can show the number or proportion of people
(observations) by levels of a nominal or ordinal variable.
5 Apr 2025 Hagazi 36
Bar chart:
One separate bar for each category
Height of each bar is the frequency of that category
The widths of the bar are the same for all the levels of
the nominal or ordinal variable, and the width has no
meaning.
The levels of the nominal or ordinal variable are usually
separated by several spaces that make it easier to view
the data.
The bars are usually presented vertically, although they
could also be presented horizontally.
Representing the simple frequency table
using the bar chart
We can represent the above simple frequency table
using the bar chart.
Ordinal or nominal data
Height of each bar is the frequency of that category
[Bar chart: Frequency (vertical axis) by Number of decayed teeth (horizontal axis, 0 to 5); bar heights 1, 2, 4, 5, 2, 2]
Pie Chart
Instead of “stacks” rising up from the
horizontal axis (bar chart), we could plot
the shares of a pie.
Recalling that a circle has 360 degrees:
50% means 180 degrees
25% means 90 degrees
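The angle of each slice is its relative frequency times 360 degrees. As a sketch, the conversion is applied here to the decayed-teeth frequencies from the earlier example; the function name pie_angles is an assumption.

```python
def pie_angles(frequencies):
    """Map each category to the angle (in degrees) of its pie slice."""
    total = sum(frequencies.values())
    return {category: freq / total * 360 for category, freq in frequencies.items()}

# Frequency of each number of decayed teeth (n = 16)
teeth_freq = {0: 1, 1: 2, 2: 4, 3: 5, 4: 2, 5: 2}
angles = pie_angles(teeth_freq)

for category, angle in angles.items():
    print(f"{category} decayed teeth: {angle} degrees")
```

A category with 25% of the observations (here, 2 decayed teeth: 4 of 16) gets a 90 degree slice, and the angles add up to 360.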
Histogram
• Discrete or continuous data divided into
categories
• Graphical representation of frequency distribution
• Height of each bar is the frequency of that
category
• Assess skewness and modality of the data
Cont…
A histogram is similar to a bar chart but is used with
interval/ratio variables.
The values are grouped into intervals (often called bins
or classes) that are usually of equal width.
Rectangles are drawn above each interval, and the
height of rectangle represents the number of
observations in that interval.
If all the intervals are of equal width, then the height of
the rectangle, as well as its area, represents the frequency
of the interval.
In contrast to bar charts, there are no spaces between the
rectangles unless there are no observations in some
interval.
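Grouping values into equal-width intervals and counting the observations in each can be sketched as follows. The PI max data from Exercise 1 are used; the choice of intervals (width 20, starting at 40) is an assumption made for illustration, and the text output only stands in for a real histogram.

```python
def histogram_counts(values, start, width, n_bins):
    """Count observations in equal-width intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# PI max (cmH2O) from Exercise 1
pi_max = [80, 100, 85, 110, 75, 85, 45, 70, 125, 110, 110, 95, 80,
          75, 150, 95, 130, 100, 100, 75, 90, 75, 120, 40, 95]

START, WIDTH, N_BINS = 40, 20, 6   # assumed intervals: 40-59, 60-79, ..., 140-159
counts = histogram_counts(pi_max, START, WIDTH, N_BINS)

# One row per interval; the number of '#' marks is the frequency
for k, count in enumerate(counts):
    low = START + k * WIDTH
    print(f"{low:>3}-{low + WIDTH - 1:<3} {'#' * count}")
```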
Stem and Leaf Plot
• A Stem and Leaf Plot is a special table where
each data value is split into a "stem" (the first
digit or digits) and a "leaf" (usually the last
digit).
• Example:
• "32" is split into "3" (stem) and "2" (leaf).
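Splitting each value into a stem and a leaf is easy to sketch for whole numbers: the stem is the value without its last digit, and the leaf is the last digit. The data below are hypothetical, and the function name stem_and_leaf is an assumption.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit); leaves are the last digits."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

# Hypothetical data (not from the slides)
data = [47, 32, 58, 41, 35, 63, 47, 50]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

The value 32 ends up on the row with stem 3 as the leaf 2.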
Box plots
• discrete or continuous data
• displays the 25th, 50th and 75th percentiles of
the data also known as the first, second and
third quartiles respectively
• whiskers extend to adjacent values which are
not outliers
• outliers indicated as circles
• box shows the interquartile range of the data
• can be used to assess skewness
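The numbers behind a box plot can be computed without drawing it. The sketch below uses the PI max data from Exercise 1 and makes two assumptions that the slides do not specify: quartiles come from the "inclusive" method of Python's statistics.quantiles (software packages differ in their quartile rules), and outliers are values more than 1.5 times the interquartile range beyond the box.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, whisker ends (adjacent values) and outliers for a box plot."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr      # assumed outlier rule: 1.5 x IQR beyond the box
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),   # most extreme values that are not outliers
        "outliers": outliers,
    }

# PI max (cmH2O) from Exercise 1
pi_max = [80, 100, 85, 110, 75, 85, 45, 70, 125, 110, 110, 95, 80,
          75, 150, 95, 130, 100, 100, 75, 90, 75, 120, 40, 95]

summary = box_plot_summary(pi_max)
for key, value in summary.items():
    print(key, "=", value)
```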
Line graph
• a line graph can be used to show the value of a
variable over time.
• the values of the variable are given on the vertical
axis, and time is given on the horizontal axis.
• two continuous measures
• each x value has only one corresponding y value
• useful for looking at patterns over time
• can be used to compare 2 or more groups
Line Graph
Year   MMR/1000
1960   50
1970   45
1980   26
1990   15
2000   12
[Line graph: MMR/1000 (vertical axis, 0 to 60) against Year (horizontal axis, 1960 to 2000)]
Figure (1): Maternal mortality rate of (country), 1960-2000
Two-way scatter plots
Used to assess the relationship between two discrete or
continuous measures.
The scatter plot pictorially represents the relation between two
continuous variables.
The nature of the relationship is described as positive, negative or no
relationship.
In a scatter plot, a plotted point represents the values of two
variables for an individual.
[Figure: scatter plot axes with a single plotted point (2, 5), i.e. x = 2 and y = 5]
SPSS Practice