BASIC CONCEPTS IN
STATISTICS
Dr. Omosivie Maduka
Consultant Public Health Physician
Senior Lecturer,
Department of Preventive and Social Medicine
University of Port Harcourt
BASIC CONCEPTS
What is Statistics?
• Statistics is the scientific method of collecting,
organizing, summarizing, analyzing, interpreting
and presenting data.
• Research is therefore the process of arriving at
dependable solutions to problems through
statistics.
• The branch of statistics that deals primarily with
the biological sciences and medical/health-related
disciplines is called biostatistics, medical statistics
or biometrics.
DATA AND VARIABLE
Data is a Latin word and the plural form of datum, but since statistics is
about groups of persons or objects, the word data predominates.
Data is information made up of a unit of analysis, a variable and a value.
Unit of analysis is the subject of interest.
Variable is the characteristic of the subject of interest that
tends to vary, change or fluctuate.
Value is the reading, recording or measurement obtained
from a data set.
The terms data and variable are often used interchangeably, but
a variable is a component of data that can take on
different values.
Classification of Data/Variable
1. Quantitative (Numerical)
• Continuous (Infinite)
• Discrete (Finite)
2. Qualitative (Categorical)
• Dichotomous (Binary)
• Nominal
• Ordinal
POPULATIONS AND SAMPLES
Population
• A population is a set of persons (or objects) having a common
observable characteristic (popular population).
• A population can also refer to the observable
characteristics of persons or things (statistical population).
• In statistics, two types of population can be distinguished
based on population size: infinite and finite.
• Infinite populations can be thought of as large populations,
while finite populations are smaller.
• The distinction is arbitrary, although some researchers
regard populations of 10,000 or more as large
populations and those of fewer than 10,000 as
small populations.
Sample
• A sample is a subset of a population.
• The distinction between population and sample
is crucial to understanding research.
• More often than not, the researcher is
not able to observe all the units
constituting a population, for cost and logistic reasons.
• He/she can still conduct the research by observing a
representative sample of the population and then
extrapolating the results obtained
from the sample to the population.
PARAMETER AND STATISTIC
Parameter
• A parameter is defined as any summarization of
the elements of a population
Statistic
• A statistic is defined as any summarization of the
elements of a sample
• The distinction between parameters and statistics
is so fundamental to statistical thinking that two
different conventions are commonly employed
for their representation.
• The most popular convention uses Greek letters for
parameters and Roman letters for statistics, as shown
below.
Symbols for representing parameter and statistic
Summary characteristic    Parameter    Statistic
Mean                      μ            x̄
Standard deviation        σ            s
Variance                  σ²           s²
Proportion                π            p
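The distinction in the table can be sketched in code. This is a minimal illustration, assuming a hypothetical population of 1,000 simulated weights; the numbers, the seed and the variable names are assumptions made for the example, not data from the lecture.

```python
import random
import statistics

# Hypothetical population: weights (kg) of 1,000 people, simulated for illustration only.
random.seed(42)
population = [random.gauss(70, 10) for _ in range(1000)]

# Parameters summarize the whole population (mu, sigma).
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population SD divides by N

# Statistics summarize a sample drawn from that population (x-bar, s).
sample = random.sample(population, 50)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # sample SD divides by n - 1

print(f"Parameter: mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"Statistic: x_bar = {x_bar:.1f}, s = {s:.1f}")
```

In practice the parameter is usually unknown because the whole population cannot be observed; it is computed here only so that it can be compared with the statistic.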
SCALES OF MEASUREMENT IN
STATISTICS
• The scales of measurement were first described
by Stanley S. Stevens in his article entitled “On the
Theory of Scales of Measurement” in 1946.
• According to Stevens, the measurement process
can be conceived of as existing on four different
levels, which he referred to as the nominal,
ordinal, interval and ratio scales.
(Mnemonic: “NOIR”)
Nominal Scale
• This assigns observations to categories without ranking
• Examples of variables measured: sex (male/female), treatment type
(surgery/chemotherapy), blood group (A, B, AB, O).
Ordinal Scale
• This assigns observations to ranked categories, but does not indicate how much more
or how much less one category is than another
• Examples of variables measured: severity of disease (mild/moderate/severe), BMI category
(underweight, normal weight, overweight, obese).
Interval Scale
• This assigns ranked values with the attribute of how much
more and how much less, but without a true zero origin
• Examples of variables measured: temperature measured in degrees Celsius (Centigrade) or Fahrenheit
Ratio Scale
• This assigns ranked values with the attribute of how much
more and how much less, and with a true zero origin
• Examples of variables measured: weight, height, blood sugar, temperature measured on the
thermodynamic (Kelvin) scale.
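The practical difference between the interval and ratio scales can be sketched with temperature. This is a minimal illustration: the two temperatures and the helper name `celsius_to_kelvin` are assumptions chosen for the example.

```python
# Two temperatures recorded on the Celsius (interval) scale; values are illustrative.
t1_celsius, t2_celsius = 10.0, 20.0

def celsius_to_kelvin(t_celsius):
    """Convert Celsius to the Kelvin (ratio) scale, which has a true zero."""
    return t_celsius + 273.15

# Differences are meaningful on both scales and agree.
diff_celsius = t2_celsius - t1_celsius
diff_kelvin = celsius_to_kelvin(t2_celsius) - celsius_to_kelvin(t1_celsius)

# Ratios are only meaningful on the ratio scale.
ratio_celsius = t2_celsius / t1_celsius   # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_kelvin = celsius_to_kelvin(t2_celsius) / celsius_to_kelvin(t1_celsius)

print(diff_celsius, round(diff_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

The difference is the same on both scales, but the ratio changes once the zero point changes, which is why ratios should only be interpreted on a scale with a true zero.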
Descriptive and Inferential
Statistics
• Descriptive statistics is made up of various techniques used
to summarize the information contained in a set of data.
• Thus, descriptive statistics, as the name implies, deals with
the description of data.
• Inferential statistics is made up of various techniques used
to provide information about parameter values based on
observations made on the values of statistics.
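Fig 1: The relationship between population and sample,
parameter and statistic, and inferential and descriptive
statistics
[Figure: a population with an unknown parameter (μ = ?) and a sample with a known statistic (x̄ = 73 kg). Descriptive statistics summarize the sample; inferential statistics use the sample statistic to draw conclusions about the population parameter.]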
The relationship between descriptive and inferential statistics
[Figure labels: Descriptive statistics, Probability, Inferential statistics]
Reminder: There are two
types of data
Each type of data behaves differently with implications for
distribution and summarization
• Qualitative data
• Quantitative data
The Behavioural Characteristics of Data
Sets that Encourage Summarization
• Values follow some form of distribution and this
can be presented numerically, in tabular form and
in graphical form.
• Values tend to cluster around a central point
• Values also exhibit variability from each other and
from the central point
Understanding the concept of
distribution
A distribution is the pattern observed in a
collection of values for a variable.
• Frequency
• Relative frequency
• Cumulative frequency
• Cumulative relative frequency
Parity   Frequency   Relative        Cumulative   Cumulative relative
                     frequency (%)   frequency    frequency (%)
0        15          30              15           30
1        11          22              26           52
2        8           16              34           68
3        6           12              40           80
4        5           10              45           90
≥5       5           10              50           100
Total    50          100
Distribution of parity among women with ovarian cancer
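The four columns of the parity table can be reproduced from the frequencies alone. The sketch below uses the counts from the table; the variable names are illustrative.

```python
from itertools import accumulate

# Parity categories and frequencies taken from the parity table.
parity = ["0", "1", "2", "3", "4", ">=5"]
frequency = [15, 11, 8, 6, 5, 5]

total = sum(frequency)
relative = [100 * f / total for f in frequency]            # relative frequency (%)
cumulative = list(accumulate(frequency))                    # cumulative frequency
cumulative_relative = [100 * c / total for c in cumulative]  # cumulative relative frequency (%)

for row in zip(parity, frequency, relative, cumulative, cumulative_relative):
    print("{:>4} {:>4} {:>6.0f} {:>4} {:>6.0f}".format(*row))
```

Each relative frequency is the category count divided by the total, and each cumulative column is a running sum of the column before it.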
Displaying frequency
distributions
1. Graphical form
2. Tabular form
Comparing Qualitative and Quantitative
Data for Distributions
Frequency Distribution of Qualitative Data
Graphically
- Bar chart
- Pie chart
Tabular
- Frequency distribution
- Relative frequency distribution
Frequency Distribution of Quantitative Data
Graphically
- Frequency polygon
- Histogram
Tabular
- Frequency distribution
- Relative frequency distribution
- Cumulative frequency distribution
- Cumulative relative frequency distribution
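Bar chart
[Figure: bar chart]
Pie chart
[Figure: pie chart titled “Distribution of males and females”, with slices for Male and Female labelled 100 (40%) and 150 (60%)]
Histogram
[Figure: histogram]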
Summarizing quantitative data
• Measures of central tendency
• Measures of dispersion (variability)
• Measures of relative position: quantiles
• Measures of distribution shape: skewness & kurtosis
Measures of
Central Tendency
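Summarizing qualitative data
Qualitative data are summarized by relating a numerator to a denominator:
• Is the numerator included in the denominator?
  - No: the measure is a ratio. Example: maternal mortality ratio
  - Yes: is time included in the denominator?
    - No: the measure is a proportion. Example: prevalence
    - Yes: the measure is a rate. Example: incidence rate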
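The three measures differ only in what goes into the denominator. The sketch below works through one example of each; every count is hypothetical and chosen only to keep the arithmetic simple.

```python
# All counts below are hypothetical, chosen only to illustrate the arithmetic.

# Ratio: the numerator is NOT part of the denominator.
maternal_deaths = 50
live_births = 10_000
maternal_mortality_ratio = 100_000 * maternal_deaths / live_births  # per 100,000 live births

# Proportion: the numerator IS part of the denominator; no time in the denominator.
existing_cases = 120
people_examined = 2_000
prevalence = 100 * existing_cases / people_examined                 # percent

# Rate: the numerator is part of the denominator, which includes time (person-years).
new_cases = 30
person_years_at_risk = 6_000
incidence_rate = 1_000 * new_cases / person_years_at_risk           # per 1,000 person-years

print(maternal_mortality_ratio, prevalence, incidence_rate)
```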
Summarizing Quantitative Data
• Mean: Many Entries, Average Number
• Median – number in the middle
• Mode – most frequently occurring
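The three measures can be computed with Python's standard `statistics` module. This is a minimal sketch on hypothetical parity values, chosen so that the three measures differ.

```python
import statistics

# Hypothetical parity values for nine women, for illustration only.
values = [0, 0, 0, 1, 1, 2, 3, 4, 7]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequently occurring value

print(mean, median, mode)
```

Here the mean is 2, the median is 1 and the mode is 0, so the three measures need not agree on where the "centre" of a data set lies.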
Mean
Advantages of mean
• Simple to calculate
• Used for further statistical calculations
• All the values in the data are taken into consideration, therefore more
representative than mode and median
• Relatively reliable because it does not vary much when repeated samples
are taken from the same population, i.e. smaller sampling error
Disadvantages of mean
• Takes longer to calculate
• Not easily understood by non-statisticians
• Does not always represent actual scores belonging to some members
of the population
• When used with discrete variables, it often yields unrealistic values
• Affected by extreme values (outliers)
Median
Advantages of median
• It is easy to calculate
• It often represents an actual score belonging to some members of
the population
• Not affected by extreme values (outliers)
• Easily understood by many people
Disadvantage of median
• Not representative because not all data are considered in the
calculation
Mode
Advantages of mode
• It is easy to calculate
• It often represents an actual score belonging to some members of
the population
• Not affected by extreme values (outliers)
• For qualitative (nominal) data, only the mode can be meaningfully employed
Disadvantages of mode
• It gives a misleading picture of a distribution that
does not have a regular shape
• Not representative because not all data are considered in the
calculation
• It may not exist
• It may not be unique, e.g. in a multimodal distribution
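The effect of extreme values on each measure can be checked directly. The sketch below uses hypothetical weights and adds a single outlier; the data are assumptions made for the example.

```python
import statistics

# Hypothetical weights (kg); the second list adds one extreme value.
weights = [60, 62, 65, 65, 68]
weights_with_outlier = weights + [150]

for label, data in [("without outlier", weights), ("with outlier", weights_with_outlier)]:
    print(label,
          "mean =", round(statistics.mean(data), 1),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data))
```

The single extreme value pulls the mean from 64 to about 78.3, while the median and mode both stay at 65.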
Measures of Dispersion (Variability)
• Range
• Deviation
• Variance
• Standard deviation
• Inter-quartile range
• Coefficient of variation
Range
• This is the simplest measure of variation.
• It is the difference between the largest and smallest observations in a sample.
Because it depends on only these two values, it is unreliable.
• It is not sensitive to other characteristics of data variability.
• It tends to increase as sample size increases.
Interquartile range
• This is the difference between the upper and lower quartiles.
• It is not sensitive to extreme outlying observations.
• It increases as variability increases.
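The contrast between the range and the interquartile range can be sketched as follows. The data are hypothetical, and the quartiles come from `statistics.quantiles` with its default "exclusive" method; other quartile conventions can give slightly different values.

```python
import statistics

def spread(data):
    """Return (range, interquartile range) for a list of numbers."""
    q1, _median, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    return max(data) - min(data), q3 - q1

# Hypothetical observations, for illustration only.
values = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 13]
values_with_outlier = values[:-1] + [100]   # replace the largest value with an extreme one

print(spread(values))
print(spread(values_with_outlier))
```

Replacing the largest value with an extreme one changes the range from 11 to 98 but leaves the interquartile range at 6.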
Deviation
• The deviation of the ith observation $x_i$ from the sample mean $\bar{x}$ is the
difference between them: $x_i - \bar{x}$.
• For any sample, the sum of all deviations about the mean, $\sum_{i=1}^{n} (x_i - \bar{x})$,
equals 0.
• For this reason, summary measures of variation use either absolute
values or squares of the deviations.
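A short derivation of why the deviations always sum to zero, using only the definition of the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, so that $\sum_{i=1}^{n} x_i = n\bar{x}$:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Positive and negative deviations cancel exactly, so the raw sum carries no information about spread.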
Variance
• Variance is the average of the squared deviations of the data from the mean.
The variance of n observations is:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• The units of measurement are the squares of those of the original data,
so variance is difficult to interpret.
• Note that in calculating the variance, n - 1 rather than n is used.
The n - 1 is called the degrees of freedom.
Standard Deviation
• This is the positive square root of the variance. The sample
standard deviation is denoted by s.
Coefficient of variation
• This expresses s as a percentage of the sample mean.
• It is independent of the units of observation.
$CV = \frac{s}{\bar{x}} \times 100$
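That the coefficient of variation does not depend on the units can be checked numerically. This is a minimal sketch on hypothetical heights recorded in centimetres and in metres; the function name `coefficient_of_variation` is illustrative.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

# Hypothetical heights of five people, in centimetres and in metres.
heights_cm = [150, 160, 165, 170, 180]
heights_m = [h / 100 for h in heights_cm]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
print(round(cv_cm, 2), round(cv_m, 2))
```

Changing the unit rescales s and the mean by the same factor, so their ratio is unchanged.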
How to compute mean, variance and standard deviation
x     $x - \bar{x}$    $(x - \bar{x})^2$    y     $y - \bar{y}$    $(y - \bar{y})^2$
3     -2               4                    4     -4               16
4     -1               1                    6     -2               4
5     0                0                    8     0                0
6     1                1                    10    2                4
7     2                4                    12    4                16
$\sum x = 25$, $\sum (x - \bar{x}) = 0$, $\sum (x - \bar{x})^2 = 10$
$\sum y = 40$, $\sum (y - \bar{y}) = 0$, $\sum (y - \bar{y})^2 = 40$
$\bar{x} = 25/5 = 5$, $s_x^2 = 10/4 = 2.5$, $s_x = \sqrt{2.5} \approx 1.6$
$\bar{y} = 40/5 = 8$, $s_y^2 = 40/4 = 10$, $s_y = \sqrt{10} \approx 3.2$
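The worked example can be checked step by step in code. The sketch below follows the same steps (mean, deviations, squared deviations, division by n - 1) on the x and y values from the table; the helper name `summarize` is illustrative.

```python
import math
import statistics

# Values from the worked example.
x = [3, 4, 5, 6, 7]
y = [4, 6, 8, 10, 12]

def summarize(data):
    """Return mean, sample variance (n - 1 denominator) and standard deviation."""
    n = len(data)
    mean = sum(data) / n
    deviations = [value - mean for value in data]
    variance = sum(d ** 2 for d in deviations) / (n - 1)
    return mean, variance, math.sqrt(variance)

for name, data in [("x", x), ("y", y)]:
    mean, variance, sd = summarize(data)
    print(f"{name}: mean = {mean}, variance = {variance}, sd = {sd:.1f}")
    # Cross-check against the standard library's sample variance.
    assert math.isclose(variance, statistics.variance(data))
```

Every y value is twice the corresponding x value plus a constant shift of -2, which is why the standard deviation of y is twice that of x.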