Lecture 2 - Descriptive Statistics

The document provides an overview of descriptive statistics, including frequency distributions, measures of central tendency (mean, median, mode), and measures of variability (variance, standard deviation). It discusses the advantages and disadvantages of these statistical measures and presents various methods for data visualization, such as bar charts, histograms, and box plots. Additionally, it includes exercises for practical application of the concepts discussed.


Biostatistics

By
Hagazi Gebre (MSc.), Assistant Professor of Biostatistics
Department of Biostatistics
Mekelle University
Descriptive statistics
• Before performing any analyses, you must first get to know your data.
• Descriptive statistics are used to summarize data in the form of tables, graphs and numerical measures.
• The summary technique used depends on the data type under consideration.

5 Apr 2025 Letekirstos 3


Frequency Distributions
• After collecting data, the first task for a researcher is to organize and simplify the data so that it is possible to get a general overview of the results.
• This is the goal of descriptive statistical techniques.
• One method for simplifying and organizing data is to construct a frequency distribution.



Frequency Distributions…
Types:
• Absolute (frequency, i.e. the number of observations)
• Relative (proportion or percent)
• Cumulative
– Absolute
– Relative



Frequency Distribution for Discrete Random Variables
Example:
• Suppose that we take a sample of size 16 from children in a primary school and get the following data about the number of their decayed teeth:
• 3,5,2,4,0,1,3,5,2,3,2,3,3,2,4,1
• To construct a frequency table:
1- Order the values from the smallest to the largest.
• 0,1,1,2,2,2,2,3,3,3,3,3,4,4,5,5
2- Count how many numbers are the same.

No. of decayed teeth   Frequency   Relative frequency   Cumulative frequency   Cumulative relative frequency
0                      1           0.0625               1                      0.0625
1                      2           0.125                3                      0.1875
2                      4           0.25                 7                      0.4375
3                      5           0.3125               12                     0.75
4                      2           0.125                14                     0.875
5                      2           0.125                16                     1
Total                  16
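The two steps above (order the values, then count repeats) can be sketched in Python. This is only an illustrative sketch: the function name `frequency_table` is an assumption made here, not something from the lecture.

```python
from collections import Counter

# Sample from the slide: number of decayed teeth in 16 children.
teeth = [3, 5, 2, 4, 0, 1, 3, 5, 2, 3, 2, 3, 3, 2, 4, 1]


def frequency_table(values):
    """Return rows of (value, freq, rel_freq, cum_freq, cum_rel_freq)."""
    counts = Counter(values)          # step 2: count equal values
    n = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):      # step 1: smallest to largest
        freq = counts[value]
        cumulative += freq
        rows.append((value, freq, freq / n, cumulative, cumulative / n))
    return rows


table = frequency_table(teeth)
for row in table:
    print("{:>5} {:>5} {:>8.4f} {:>5} {:>8.4f}".format(*row))
```

Each printed row holds the value, its absolute frequency, relative frequency, cumulative frequency and cumulative relative frequency, in that order.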

School of Public Health


Statistical data summarization techniques

Arithmetic mean (cont.)
• Appropriate for discrete or continuous data
• Do NOT use for ordinal or nominal data
Advantages:
– Simplicity: easy to calculate and understand
– It always exists (it can be calculated for any set of numerical data)
– It is unique: for a given set of numerical data there is one and only one mean
– It makes use of all values in the data
– It is stable (means of different samples from the same population do not fluctuate considerably)
Disadvantages:
– It is extremely sensitive to unusual values (outliers)
– It cannot be calculated for data that are not quantifiable
– It cannot be calculated for continuous grouped data with open-ended classes
Median (cont.)
It is the number separating the higher half of a sample, a population, or a probability distribution from the lower half.

Advantages:
– It always exists
– It is robust, i.e. insensitive to outliers (extreme values)
– It is unique (there is only one median for a given set of data)
– It can be computed for a distribution with open-ended classes
– It can be computed for ratio, interval and ordinal data
Disadvantages:
– It doesn’t take each and every value into consideration
– It requires the data to be arranged in order
– Its statistical properties are not as easy to derive
– It does not allow algebraic manipulation, e.g. it is not possible to calculate the combined median of two or more groups from the group medians alone
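The contrast between the mean's sensitivity to outliers and the median's robustness can be seen in a short Python sketch. The numbers are hypothetical and chosen only for illustration; they are not from the lecture.

```python
from statistics import mean, median

# Hypothetical values, chosen only for illustration.
sample = [4, 5, 5, 6, 7]
with_outlier = sample + [60]   # add one extreme value

# The mean is pulled strongly towards the extreme value ...
print(mean(sample), mean(with_outlier))
# ... while the median barely moves.
print(median(sample), median(with_outlier))
```

One extreme value moves the mean from 5.4 to 14.5, whereas the median changes only from 5 to 5.5.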

Exercise 1
Suppose the following data show the maximal static inspiratory
pressure (PI max, in cmH2O) of patients with cystic fibrosis
admitted to a certain hospital during one month.

80 100 85 110 75
85 45 70 125 110
110 95 80 75 150
95 130 100 100 75
90 75 120 40 95
Compute the arithmetic mean, median, and mode.
Exercise 1 with answer.docx
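One way to check a hand calculation for this exercise is with Python's standard `statistics` module. This is a sketch of one possible approach, not the worked answer from the attached file; the variable name `pi_max` is an assumption, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# PI max values (cmH2O) from the exercise, row by row.
pi_max = [
    80, 100, 85, 110, 75,
    85, 45, 70, 125, 110,
    110, 95, 80, 75, 150,
    95, 130, 100, 100, 75,
    90, 75, 120, 40, 95,
]

print("mean  :", mean(pi_max))
print("median:", median(pi_max))
# multimode returns every value tied for the highest frequency.
print("mode  :", multimode(pi_max))
```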
Variance
average of the squared deviations from the mean

• Variance of a sample: usually subtract 1 from n in the denominator

$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$

where n − 1 is the effective sample size, also called the degrees of freedom



Standard deviation
• Problem with variance: its unit of measurement is the square of the original unit, because the deviations are squared
– Solution: take the square root of the variance => standard deviation
• Sample standard deviation (s)

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
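The sample variance and standard deviation formulas can be written out directly in Python and compared with the standard library. This is a minimal sketch: the helper names `sample_variance` and `sample_sd` and the data values are assumptions made for illustration.

```python
import math
from statistics import stdev, variance


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(xs))


# Hypothetical data, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(values), variance(values))
print(sample_sd(values), stdev(values))
```

`statistics.variance` and `statistics.stdev` also use the n − 1 denominator, so the hand-written and library results should agree up to floating-point rounding.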



What do you think?

[Figure: three data sets, each with mean = 7, with SD = 0, SD = 0.63 and SD = 4.04 respectively]



Coefficient of variation…
• Caution must be exercised when using standard deviation as a comparative index of dispersion

Weights of newborn elephants (kg)     Weights of newborn mice (kg)
929    853                            0.72    0.42
878    939                            0.63    0.31
895    972                            0.59    0.38
937    841                            0.79    0.96
801    826                            1.06    0.89
n = 10, $\bar{X}$ = 887.1             n = 10, $\bar{X}$ = 0.68
sd = 56.50                            sd = 0.255

It is incorrect to say that elephants show greater variation in birth weight than mice merely because their standard deviation is higher. So, what do we do?
Coefficient of variation…
• The coefficient of variation expresses the standard deviation relative to its mean:

$$ cv = \frac{s}{\bar{X}} $$

Weights of newborn elephants (kg)     Weights of newborn mice (kg)
929    853                            0.72    0.42
878    939                            0.63    0.31
895    972                            0.59    0.38
937    841                            0.79    0.96
801    826                            1.06    0.89
n = 10, $\bar{X}$ = 887.1             n = 10, $\bar{X}$ = 0.68
s = 56.50, cv = 0.0637                s = 0.255, cv = 0.375

Mice show greater birth-weight variation.
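The elephant and mouse comparison can be reproduced with a few lines of Python. This is an illustrative sketch; the function name `coefficient_of_variation` is an assumption made here.

```python
from statistics import mean, stdev


def coefficient_of_variation(xs):
    """Sample standard deviation divided by the sample mean (unit-free)."""
    return stdev(xs) / mean(xs)


# Birth weights from the slide.
elephants = [929, 853, 878, 939, 895, 972, 937, 841, 801, 826]
mice = [0.72, 0.42, 0.63, 0.31, 0.59, 0.38, 0.79, 0.96, 1.06, 0.89]

cv_elephants = coefficient_of_variation(elephants)
cv_mice = coefficient_of_variation(mice)
print(f"elephants: {cv_elephants:.4f}")
print(f"mice:      {cv_mice:.4f}")
```

The value for mice comes out slightly above the slide's 0.375, because the slide divides the rounded figures 0.255 and 0.68, while the code uses the unrounded mean. The conclusion is unchanged: relative to their mean, mice vary more.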
EXERCISE 2
Consider the following two samples that represent
cholesterol measurements (mg/100ml), each on the
same person, but using different measurement
techniques.
Method                  Measurements           Mean
Auto-analyzer (AA)      177 193 195 209 226    200
Micro-enzymatic (ME)    192 197 202 209 200    200



EXERCISE 2…
1. Compute the range and standard deviations for both methods.
2. Compute the 5th and 95th percentiles for the PI max data.
3. Compute the CV for the age and weight of two groups of students.

Variable    Mean           Standard deviation
Age         20.63 years    3.15 years
Weight      58.89 kg       8.10 kg

Exercise 2 with answer.docx
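Parts 1 and 3 of this exercise can be checked with a short Python sketch. It is one possible approach, not the worked answer from the attached file; the helper name `value_range` and the variable names are assumptions. Part 2 is left out because percentile conventions differ between textbooks and software.

```python
from statistics import stdev


def value_range(xs):
    """Largest value minus smallest value."""
    return max(xs) - min(xs)


# Cholesterol measurements (mg/100ml) from the exercise.
aa = [177, 193, 195, 209, 226]   # Auto-analyzer
me = [192, 197, 202, 209, 200]   # Micro-enzymatic

for name, xs in (("AA", aa), ("ME", me)):
    print(name, "range =", value_range(xs), "sd =", round(stdev(xs), 2))

# Part 3: CV = standard deviation / mean, using the summary table.
cv_age = 3.15 / 20.63
cv_weight = 8.10 / 58.89
print("CV age    =", round(cv_age, 3))
print("CV weight =", round(cv_weight, 3))
```

Both methods have the same mean, so here the range and standard deviation can be compared directly. The CV is needed for age versus weight because the two variables are measured in different units.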


Individual Assignment
COMPUTATION OF SUMMARY VALUES FOR DIFFERENT TYPES OF DISTRIBUTION
• Simple frequency distribution
• Grouped frequency distribution
– Mean
– Median
– Percentiles
– Standard deviation

Summary stat_simple and grouped frequency.docx
Individual Assignment…
1. Consider the data on number of children per family and compute the mean, median and standard deviation.

No. of children (x)    0    1    2    3    4    5    6    Total
No. of families (f)    7   35   67   43   32   10    3    197 (n)
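For a simple frequency distribution, each value x is weighted by its frequency f. The sketch below shows one way to compute the three summaries in Python; the helper name `frequency_median` and the variable names are assumptions made for illustration.

```python
import math

x = [0, 1, 2, 3, 4, 5, 6]          # number of children
f = [7, 35, 67, 43, 32, 10, 3]     # number of families
n = sum(f)

# Mean: sum of f*x divided by n.
mean_children = sum(xi * fi for xi, fi in zip(x, f)) / n

# Sample SD: sum of f*(x - mean)^2 divided by n - 1, then square root.
sd_children = math.sqrt(
    sum(fi * (xi - mean_children) ** 2 for xi, fi in zip(x, f)) / (n - 1)
)


def frequency_median(xs, fs):
    """Median from a frequency table, using cumulative frequencies."""
    total = sum(fs)
    # 1-based positions of the middle observation(s).
    lo, hi = (total + 1) // 2, (total + 2) // 2

    def value_at(position):
        cumulative = 0
        for xi, fi in zip(xs, fs):
            cumulative += fi
            if cumulative >= position:
                return xi

    return (value_at(lo) + value_at(hi)) / 2


median_children = frequency_median(x, f)
print(n, round(mean_children, 2), median_children, round(sd_children, 2))
```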



Individual Assignment…
2. Consider the following data on serum uric acid
(SUA) in mg per 100ml of 267 healthy male
subjects
SUA(x) f
3.0-3.4 2
3.5-3.9 15
4.0-4.4 33
4.5-4.9 40
5.0-5.4 54
5.5-5.9 47
6.0-6.4 38
6.5-6.9 16
7.0-7.4 15
7.5-7.9 3
8.0-8.4 1
8.5-8.9 3
Total 267
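For a grouped frequency distribution, a common textbook approach is to use class midpoints for the mean and standard deviation, and linear interpolation within the median class for the median. The sketch below follows that approach. It assumes class boundaries halfway between adjacent class limits (for example 4.95 between 4.9 and 5.0); the names `GAP`, `WIDTH` and `grouped_median` are hypothetical.

```python
import math

# (lower limit, upper limit, frequency) for each SUA class on the slide.
classes = [
    (3.0, 3.4, 2), (3.5, 3.9, 15), (4.0, 4.4, 33), (4.5, 4.9, 40),
    (5.0, 5.4, 54), (5.5, 5.9, 47), (6.0, 6.4, 38), (6.5, 6.9, 16),
    (7.0, 7.4, 15), (7.5, 7.9, 3), (8.0, 8.4, 1), (8.5, 8.9, 3),
]
GAP = 0.1    # assumed gap between an upper limit and the next lower limit
WIDTH = 0.5  # class width measured between class boundaries

n = sum(f for _, _, f in classes)
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

# Mean and sample SD, treating every observation as its class midpoint.
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
grouped_sd = math.sqrt(
    sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
)


def grouped_median(classes, gap, width):
    """Interpolate within the class that contains the n/2-th observation."""
    total = sum(f for _, _, f in classes)
    half = total / 2
    cumulative = 0
    for lo, _, f in classes:
        if cumulative + f >= half:
            lower_boundary = lo - gap / 2
            return lower_boundary + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


print(n, round(grouped_mean, 2), round(grouped_sd, 2),
      round(grouped_median(classes, GAP, WIDTH), 2))
```

These are approximations: the true values depend on the raw observations, which the grouped table no longer contains.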

Extreme Values

Symmetric: Mean = Median

Skewed right: Mean > Median

Skewed left: Mean < Median
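The rule of thumb above can be expressed as a small Python check. This is only a sketch of the slide's rule: comparing the mean with the median is a rough indicator, not a formal skewness test, and the function name and data are hypothetical.

```python
from statistics import mean, median


def skew_direction(xs):
    """Apply the slide's rule of thumb by comparing mean and median."""
    m, med = mean(xs), median(xs)
    if m > med:
        return "skewed right"
    if m < med:
        return "skewed left"
    return "mean = median"


# Hypothetical samples, for illustration only.
print(skew_direction([1, 2, 3, 4, 5]))       # balanced around 3
print(skew_direction([1, 2, 2, 3, 12]))      # one large extreme value
print(skew_direction([1, 10, 11, 11, 12]))   # one small extreme value
```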



Diagrammatic Presentation of data
Class exercise

• Identify diagrams and corresponding data types (submit on a piece of paper)

Diagrams…
• Box-and-Whisker plot
– Special use: exploratory data analysis
– Used to see the relationship between a numerical and a categorical variable
• Line graph
– Mainly for time series data
• Scatter plot
– To see the relationship between two numerical variables



Bar chart:

• A bar chart provides a picture of data that could also be reasonably displayed in tabular format.
• Bar charts can be created for nominal, ordinal, or continuous data, although they are most frequently used with nominal data.
• If used with continuous data, the chart could be called a histogram instead of a bar chart.
• The bar chart can show the number or proportion of people (observations) by levels of a nominal or ordinal variable.



Bar chart:
• One separate bar for each category
• Height of each bar is the frequency of that category
• The widths of the bars are the same for all the levels of the nominal or ordinal variable, and the width has no meaning.
• The levels of the nominal or ordinal variable are usually separated by several spaces that make it easier to view the data.
• The bars are usually presented vertically, although they could also be presented horizontally.
Representing the simple frequency table using the bar chart

• We can represent the above simple frequency table using the bar chart.
• Ordinal or nominal data
• Height of each bar is the frequency of that category

[Figure: bar chart of frequency (vertical axis) against number of decayed teeth (horizontal axis, 0 to 5); bar heights 1, 2, 4, 5, 2, 2]



Pie Chart
• Instead of “stacks” rising up from the horizontal axis (as in a bar chart), we could plot the shares of a pie.
• Recalling that a circle has 360 degrees:
– 50% means 180 degrees
– 25% means 90 degrees
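The conversion from shares to angles is a single multiplication: angle = proportion × 360. A minimal Python sketch, applied to the decayed-teeth frequencies from the earlier example (the function name `pie_angles` is an assumption):

```python
def pie_angles(frequencies):
    """Map each category to its slice angle in degrees."""
    total = sum(frequencies.values())
    return {
        category: 360 * count / total
        for category, count in frequencies.items()
    }


# Frequencies of 0..5 decayed teeth from the earlier frequency table.
teeth_freq = {0: 1, 1: 2, 2: 4, 3: 5, 4: 2, 5: 2}
angles = pie_angles(teeth_freq)

for category, angle in angles.items():
    print(f"{category} decayed teeth: {angle:.1f} degrees")
```

A category with 25% of the observations (here, 2 decayed teeth: 4 of 16 children) gets a 90-degree slice, in line with the slide.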



Histogram
• Discrete or continuous data divided into categories
• Graphical representation of frequency distribution
• Height of each bar is the frequency of that category
• Assess skewness and modality of the data



Histogram (cont.)
• A histogram is similar to a bar chart but is used with interval/ratio variables.
• The values are grouped into intervals (often called bins or classes) that are usually of equal width.
• Rectangles are drawn above each interval, and the height of each rectangle represents the number of observations in that interval.
• If all the intervals are of equal width, then the height of the rectangle, as well as its area, represents the frequency of the interval.
• In contrast to bar charts, there are no spaces between the rectangles unless there are no observations in some interval.
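Grouping values into equal-width intervals can be sketched in plain Python. The example reuses the PI max data from Exercise 1. The bin start (40), the width (20) and the function name `bin_counts` are choices made here for illustration, not taken from the lecture.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in equal-width intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


pi_max = [
    80, 100, 85, 110, 75, 85, 45, 70, 125, 110, 110, 95, 80,
    75, 150, 95, 130, 100, 100, 75, 90, 75, 120, 40, 95,
]

counts = bin_counts(pi_max, start=40, width=20, n_bins=6)

# Crude text histogram: one '#' per observation, no gaps between bins.
for i, count in enumerate(counts):
    lower = 40 + 20 * i
    print(f"[{lower:>3}, {lower + 20:>3}) {'#' * count}")
```

Because the bins have equal width, the length of each row is proportional to the frequency, as it would be for the rectangle heights in a drawn histogram.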
Stem and Leaf Plot
• A Stem and Leaf Plot is a special table where
each data value is split into a "stem" (the first
digit or digits) and a "leaf" (usually the last
digit).
• Example: "32" is split into "3" (stem) and "2" (leaf).
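Splitting each value into stem and leaf can be sketched with integer division. The sketch assumes non-negative whole numbers and uses the last digit as the leaf; the function name and data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group the last digit (leaf) of each value under its leading digits (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 32 -> stem 3, leaf 2
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical data, for illustration only.
data = [32, 35, 41, 47, 47, 50]
plot = stem_and_leaf(data)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```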



Box plots
• discrete or continuous data
• displays the 25th, 50th and 75th percentiles of the data, also known as the first, second and third quartiles respectively
• whiskers extend to adjacent values which are
not outliers
• outliers indicated as circles
• box shows the interquartile range of the data
• can be used to assess skewness
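The numbers behind a box plot can be computed as follows. Two assumptions are made here that the slide does not state: outliers are defined by the common 1.5 × IQR rule, and quartiles use the "inclusive" method of `statistics.quantiles` (Python 3.8 or later). Other software may use different conventions and give slightly different quartiles.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers (1.5 * IQR rule, assumed)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme values that are not outliers.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }


# PI max data (cmH2O) from Exercise 1.
pi_max = [
    80, 100, 85, 110, 75, 85, 45, 70, 125, 110, 110, 95, 80,
    75, 150, 95, 130, 100, 100, 75, 90, 75, 120, 40, 95,
]

summary = box_plot_summary(pi_max)
print(summary)
```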





Line graph
• a line graph can be used to show the value of a
variable over time.
• the values of the variable are given on the vertical
axis, and the horizontal is the time variable.
• two continuous measures
• each x value has only one corresponding y value
• useful for looking at patterns over time
• can be used to compare 2 or more groups



Line Graph

Year    MMR
1960    50
1970    45
1980    26
1990    15
2000    12

[Figure: line graph of MMR/1000 (vertical axis, 0 to 60) against year (horizontal axis, 1960 to 2000)]

Figure (1): Maternal mortality rate of (country), 1960-2000





Two-way scatter plots
• Used to assess the relationship between two discrete or continuous measures.
• The scatter plot pictorially represents the relation between two continuous variables.
• The nature of the relationship is described as positive, negative or no relationship.
• In a scatter plot, a plotted point represents the values of two variables for an individual.

[Figure: a single point plotted at (2, 5), i.e. x = 2 on the horizontal axis and y = 5 on the vertical axis]



SPSS Practice
