MLS 314
BIOSTATISTICS I
By
MEDUGU, JESSY THOMAS
[email protected]
DEPT. OF MEDICAL LABORATORY SCIENCE,
UNIVERSITY OF MAIDUGURI.
Course outline
What is Biostatistics?
The role of Statistics in Medical and Health
Sciences.
Definitions and Terminologies in Biostatistics
Data Collection Methods
Descriptive and Inferential Statistics
Measures of Central Tendency
Measures of Dispersion
What is Statistics?
Statistics is the scientific study of methods of collection,
collation, analysis, presentation and interpretation of data.
Biostatistics is the application of statistical methods and
procedures in the study and understanding of medicine, health
and biological sciences.
The terms Biostatistics and Medical Statistics are often
used interchangeably.
Statistics in Medical Sciences & Public Health
Medicine and medical sciences are becoming increasingly
quantitative rather than qualitative.
The planning, conduct and interpretation of medical and public
health research are dependent on statistical methods.
Statistics influences public health decisions and actions.
Statistics pervades the medical and health literature.
Statistics in Medical Sciences & Public Health cont’d…
Determining the magnitude of disease burden in a population.
Identifying risk factors and population at risk.
Assessing the impact of public health interventions.
Modeling and hypothesizing to enable prediction of future health-
related events.
Biostatistics concepts are adopted as a basis for defining treatment
protocols.
Biostatistics concepts are adopted in trials of new drugs,
pharmaceuticals and vaccines.
Definitions and Terminologies in Biostatistics
Variable
A variable is any entity or item or characteristic that is able or liable to
change or vary.
Depending on the characteristic of the variable, it can be classified as:
Quantitative variable
Qualitative variable
A quantitative (numerical) variable is either continuous or discrete.
A qualitative (categorical) variable is put as nominal or ordinal.
A continuous quantitative variable can take any numerical value within a range, e.g.
height, weight, glucose level and account balance.
Discrete quantitative variables are integers, typically counts, e.g. family
size, coliform counts, number of rooms and number of wives.
Nominal qualitative variables are categories that have no natural order and cannot
be described numerically, e.g. colour of skin (black, brown, yellow) and
gender (male or female).
Ordinal qualitative variables are categories that are mutually exclusive, ordered
and can be ranked, e.g. social class (low, medium and high), disease severity (mild,
moderate and severe) and level of education.
Population
In research or biostatistics, population is used in a wider sense than usual.
It refers to a group of people, animals or objects of research or statistical
interest, e.g. diabetics, hypertensive individuals, etc.
Sample
This is a subset of the population that is systematically chosen so as to serve
as a true representative of the population.
Universal sample (sampling): this occurs when the population is small enough
for every member to be included in the research.
Why do we sample?
Limited time
To avoid excessive expense or resource requirements
To avoid cumbersomeness
Data: data is a collection of related variables or observations
which when analyzed will give useful information for decision
making.
Data point: a single observation.
Based on scale of measurement, data are either quantitative
or qualitative.
Based on source, data are primary or secondary.
Validity of data:
Accuracy and reliability of a test; the extent to which the test
measures what it is supposed to measure.
Researchers depend on various types of validity to verify the
effectiveness of measurement procedures used.
Types of validity: External, construct, criterion, content and
face validity.
Threats to validity of data
Confounders
Selection bias
Observer variation
Data Collection Methods
Data collection techniques must be employed to enable us
systematically collect information on our study units (people,
objects, events or phenomena).
If data are collected haphazardly, it becomes difficult to
answer our research questions.
It is the first step that must be taken seriously and be
conducted accurately before collation of data for onward
analysis, presentation and interpretation.
Data Collection Methods Cont’d…
The following are data collection methods:
1. Observation/measurements
2. Interviews (SIS or SSIS)
3. Self-administered questionnaires
4. Focus group discussions
5. Key Informant Interview (In-depth interviews)
6. Use of documentary sources
Observation
This is a technique that involves systematically selecting,
watching and recording of events, objects or phenomena (e.g.
blood pressure, weight, use of protective devices, student’s
behaviour).
Participant observation: the observer takes part in the situation
he/she observes.
Non-participant observation (concealed): the observer watches
the situation, does not participate and remains concealed.
Non-participant observation (open, non-concealed): the
observer watches the situation openly (does not hide his/her
presence) but does not participate.
Measurements
If observations are made using a defined scale or equipment,
the technique is referred to as measurement, e.g. using a
weighing scale for weight or an improved Neubauer counting
chamber for white cell counts.
Advantages of Observation Techniques
Gives more detailed and context-related information
Permits collection of information on facts not mentioned in the
questionnaire.
Permits testing of the reliability of responses to questionnaires.
Measurement allows actual quantification of targets.
Measurements cont’d…
Disadvantages of Observation Techniques
Ethical issues concerning confidentiality.
Observer bias may occur
The presence of the data collector can influence the situation
observed.
Thorough training of research assistants is required.
Interviews
An interview is a data collection technique that involves oral
questioning of respondents (using SIS or SSIS).
SIS (Structured Interview Schedule): involves the use of fixed
list of questions to be asked in standard sequence, with fixed or
pre-categorized responses.
SSIS (Semi-structured Interview Schedule): this technique allows
for flexibility in ordering of the questions. An interviewer may ask
additional questions to gain more useful information.
Self-administered Questionnaires
This is a technique in which written questions are presented and
are to be answered by the respondents in writing.
The types of questionnaires used are:
Unstructured
Semi-structured
Structured
Written questionnaires can be administered in different ways,
including:
Through the mail
Gathering respondents in group (s)
Hand delivering to respondents and collecting them later.
Focus Group Discussion (FGD)
In FGD, a discussion among 6-10 persons (with similar characteristics)
is organized.
The discussion is usually guided by a facilitator during which
members talk freely and spontaneously about a certain topic.
The aim of the discussion is to capture perceptions, attitudes and
ideas of the participants.
The discussions are either taped (audio-recorded) or captured by note-takers.
Key Informant Interview (In-depth Interview)
KII allows people that have monopoly of knowledge/information to
tell stories, provide insight about an issue.
A key informant could be: a knowledgeable community leader, health
expert or experienced driver or cleaner.
The interview may include one or two informative members of a
target group.
In this technique, the data collector must be diplomatic, blend into the
culture and adopt the use of euphemisms.
Use of Documentary Sources
This method involves retrieving data already collected (but not
analyzed) from existing sources.
Examples of such sources include: medical records, archives,
records from college of medical sciences and state LGA records
(e.g. data on disease surveillance).
A documentary source can be programme-specific data e.g.
immunization coverage, HIV/AIDS clients receiving services from
SIDHAS project.
Use of Documentary Sources cont’d…
Advantages of using documentary sources
Inexpensive (the data have already been collected)
Permits examination of past trends
Disadvantages of using documentary sources
Issue of access
Confidentiality
Biased information
Missing information
Descriptive and Inferential Statistics
Descriptive Statistics simply describes the attributes of a data
set:
Frequency of occurrence of certain values
Typical or representative values
Degree of spread or scatter e.t.c.
Inferential Statistics refers to making generalizations about a
population based on the attributes of a sample, using the
knowledge of probability.
Measures of Central Tendency
Measures of Central Tendency (Location) cont’d…
Measures of CT
Arithmetic mean
Median
Mode
Arithmetic mean (AM)
AM is customarily just called Mean.
It is the simple average of a given set of values (numbers).
It is the sum of given values divided by the total number of the
values.
The mean is calculated by the formula: x̄ = ∑x/n, where ∑x is the sum
of the values and n is the number of values.
Advantages of AM
It is based on all the values in the series
It is easy to understand and simple to calculate
It is not influenced by the position of values in the series.
The mean is used in a number of inferential statistics.
Limitations of AM
The mean is easily influenced by extreme values.
It is not a good measure of location when the data are skewed.
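The definition and the main limitation of the AM can be sketched in a few lines of Python. The glucose readings below are hypothetical values chosen only for illustration; `statistics.mean` is from the Python standard library.

```python
from statistics import mean

# Hypothetical fasting glucose readings (mmol/L) for five patients.
glucose = [4.8, 5.1, 5.4, 5.0, 5.2]

# Arithmetic mean: sum of the values divided by the number of values.
am = sum(glucose) / len(glucose)

# One extreme reading pulls the mean away from the typical values.
with_outlier = glucose + [15.0]
am_outlier = mean(with_outlier)

print(round(am, 2), round(am_outlier, 2))
```

A single extreme value moves the mean from 5.1 to 6.75, although five of the six readings lie between 4.8 and 5.4.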
Assignment 1
Write a short note on any two other types of mean other than the
arithmetic mean.
• Note: this should not be more than a page.
Median
This is the item that divides a set of data into two equal halves.
If a given series of measurements or observations is arranged
in increasing order, the median is the middle one.
If n is the number of observations and n is odd, the median is the
((n + 1)/2)th value of the ordered data.
If n is even, the median is the mean of the (n/2)th and (n/2 + 1)th values.
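A minimal sketch of finding the median by position, assuming the common convention that the two middle values are averaged when the number of observations is even. The function name `median_by_position` and the data sets are illustrative only; `statistics.median` from the standard library follows the same convention.

```python
from statistics import median

def median_by_position(values):
    """Median of ungrouped data; averages the two middle values when n is even."""
    ordered = sorted(values)          # the data must be arranged in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n + 1) / 2)th value, counting from 1
    # mean of the (n/2)th and (n/2 + 1)th values, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [3, 4, 7, 9, 15, 4, 5]      # n = 7
even_set = [5, 7, 8, 9, 2, 6]         # n = 6

print(median_by_position(odd_set), median_by_position(even_set))
print(median(odd_set), median(even_set))
```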
Median cont’d…
For grouped data with class intervals, the
formula is:
Median = L1 + [(N/2 – (∑f)1) / fmedian] × c
Where L1 = lower class boundary of the median class.
N = total number of observations (sum of all frequencies).
(∑f)1 = sum of the frequencies of all classes lower than
the median class.
fmedian = frequency of the median class.
c = size (width) of the median class.
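The grouped-data median, Median = L1 + [(N/2 – (∑f)1) / fmedian] × c, can be sketched as follows. The class boundaries and frequencies are hypothetical, and the function name `grouped_median` is an illustrative choice rather than part of any library.

```python
def grouped_median(classes):
    """Median of grouped data.

    classes: list of (lower_boundary, upper_boundary, frequency),
    ordered from the lowest class to the highest.
    """
    n = sum(f for _, _, f in classes)        # N, the total frequency
    half = n / 2
    cum_below = 0                            # (sum f)1: frequencies below the current class
    for lower, upper, f in classes:
        if cum_below + f >= half:            # first class whose cumulative frequency reaches N/2
            width = upper - lower            # c, the size of the median class
            return lower + (half - cum_below) / f * width
        cum_below += f
    raise ValueError("no classes supplied")

# Hypothetical age distribution (years), given as class boundaries and frequencies.
classes = [
    (9.5, 19.5, 5),
    (19.5, 29.5, 8),
    (29.5, 39.5, 12),
    (39.5, 49.5, 10),
    (49.5, 59.5, 5),
]

print(round(grouped_median(classes), 2))
```

Here N = 40, so N/2 = 20. The cumulative frequency first reaches 20 in the class 29.5 to 39.5, giving 29.5 + [(20 – 13)/12] × 10.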
Advantages of Median
Median is not easily affected by extreme values.
It can be obtained graphically.
It is a measure of rank or position.
It gives clear idea of the distribution of the data.
Disadvantages of Median
It may not be representative if there are few data.
Beyond descriptive statistics, median is rarely used in inferential
statistics.
It may require rearrangement of the data involved. This may be
cumbersome if a large sample is involved.
Mode
The mode is that value in a data set which occurs most frequently.
It is identified by counting the number of times each value occurs in
the set and selecting that value which occurs most often.
E.g. the mode of this set of 7 observations: 3,4,7,9,15,4,5 is 4.
But if two or more values share the highest occurrence (frequency), the
data set is bimodal or multimodal: each such value is a mode, so the mode
is not unique.
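A minimal sketch of identifying the mode by counting how often each value occurs. It assumes that every value tied for the highest frequency is reported as a mode, which is what `statistics.multimode` (Python 3.8 or later) returns; the second data set is hypothetical.

```python
from collections import Counter
from statistics import multimode

observations = [3, 4, 7, 9, 15, 4, 5]

# Count the number of times each value occurs, then keep the most frequent value(s).
counts = Counter(observations)
top = max(counts.values())
modes = [value for value, count in counts.items() if count == top]

# A hypothetical set in which two values are equally frequent.
tied = [3, 4, 4, 7, 7]
tied_modes = multimode(tied)

print(modes, tied_modes)
```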
Mode cont’d…
Mode for grouped, discrete data with class intervals
To determine a value for the mode here, the interval with the highest
frequency is identified first.
It is called the modal interval or modal class.
The formula for the mode is then used:
Mode = L1 + [∆1 / (∆1 + ∆2)] × C
Where: L1 = lower class boundary of modal class
∆1 = excess of modal frequency over the
frequency of the immediately lower class (∆1 = fm – fm-1)
∆2 = excess of modal frequency over the
frequency of the immediately upper class (∆2 = fm – fm+1)
C = class width or size
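The grouped-data mode, Mode = L1 + [∆1 / (∆1 + ∆2)] × C, can be sketched as below. The classes are hypothetical and `grouped_mode` is an illustrative name. Two choices in the sketch are assumptions not stated in the notes: a missing neighbouring class is treated as having frequency 0, and if several classes tie for the highest frequency the first one is used.

```python
def grouped_mode(classes):
    """Mode of grouped data.

    classes: list of (lower_boundary, upper_boundary, frequency),
    ordered from the lowest class to the highest.
    """
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))              # modal class (first one if there is a tie)
    lower, upper, fm = classes[m]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = fm - f_prev                         # delta 1
    d2 = fm - f_next                         # delta 2
    # Note: fails with ZeroDivisionError if both neighbours equal the modal frequency.
    return lower + d1 / (d1 + d2) * (upper - lower)

# Hypothetical age distribution (years), given as class boundaries and frequencies.
classes = [
    (9.5, 19.5, 5),
    (19.5, 29.5, 8),
    (29.5, 39.5, 12),
    (39.5, 49.5, 10),
    (49.5, 59.5, 5),
]

print(round(grouped_mode(classes), 2))
```

The modal class is 29.5 to 39.5 with ∆1 = 12 – 8 = 4 and ∆2 = 12 – 10 = 2, so the mode is 29.5 + (4/6) × 10.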
Advantages of Mode
It can be obtained graphically.
It is not affected by extreme values.
It is easy to understand and compute.
The data does not need any ordering before mode can be
obtained.
Disadvantages of Mode
It is not an ideal measure of CT.
It is not useful in further statistical processing.
It does not make use of all the values in the distribution.
It may not be unique because 2 or more values may be equally
frequent.
Measures of Dispersion
Apart from knowing the typical or representative values (CT), it
is often of interest to know the extent of variability exhibited
by the various values.
The dispersion of a set of observations refers to the variety that
the values of the observations exhibit.
If all the values are the same, there is no dispersion; if they are
not the same, dispersion is present in the data.
The study of Statistics is all about variability.
Measures of Dispersion
Range
Mean deviation
Variance
Standard deviation
Coefficient of variation
Standard error
Range
Range is the simplest measure of dispersion.
It is the difference between the largest (Xh) and smallest (XL)
values; usually denoted as R = Xh – XL
E.g. the Range of 5,7,8,9,2 is 9- 2 = 7.
Range has two main disadvantages:
It only takes into account the two extreme values. Variability of
the intermediate values is ignored.
It tends to increase as the number of observations increases.
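A short sketch of the range and of its first disadvantage. The second data set is hypothetical: it shares the same two extremes as the first but is distributed very differently.

```python
values = [5, 7, 8, 9, 2]

# Range: largest value minus smallest value.
data_range = max(values) - min(values)

# Same extremes, very different intermediate values: the range cannot tell them apart.
clustered = [2, 2, 2, 9, 9]
clustered_range = max(clustered) - min(clustered)

print(data_range, clustered_range)
```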
Mean Deviation (MD)
MD measures the average spread of values from the AM.
MD is referred to as the average amount by which a value in a given
data set differs from the AM.
It is otherwise known as Mean Absolute Deviation.
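For ungrouped data: MD = ∑|x – x̄|/n, where x̄ is the AM and n the number of values.
For grouped data: MD = ∑f|x – x̄|/∑f, where x is the class midpoint and f the class frequency.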
Features of MD
It is straightforward to calculate.
It is less distorted by extreme values than the standard deviation, because the deviations are not squared.
It makes use of every value in the data set.
It cannot be used for further statistical processing.
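The two MD formulas (the average absolute deviation from the AM, unweighted for ungrouped data and frequency-weighted for grouped data) can be sketched as follows. The function names and the grouped example are illustrative assumptions; for grouped data, x is taken to be the class midpoint.

```python
def mean_deviation(values):
    """Mean absolute deviation from the arithmetic mean (ungrouped data)."""
    n = len(values)
    m = sum(values) / n
    return sum(abs(x - m) for x in values) / n

def grouped_mean_deviation(midpoints, freqs):
    """Mean absolute deviation for grouped data, weighting each midpoint by its frequency."""
    total = sum(freqs)
    m = sum(f * x for x, f in zip(midpoints, freqs)) / total
    return sum(f * abs(x - m) for x, f in zip(midpoints, freqs)) / total

values = [5, 7, 8, 9, 2]
md = mean_deviation(values)

# Hypothetical grouped data: class midpoints and their frequencies.
md_grouped = grouped_mean_deviation([1, 2, 3], [1, 2, 1])

print(round(md, 2), md_grouped)
```

For the ungrouped set the AM is 6.2 and the absolute deviations sum to 10.8, so MD = 10.8/5 = 2.16.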
Variance and Standard Deviation
The variance and standard deviation are estimates that assess how
scattered the individual measurements are from the mean.
They are defined in terms of the deviations (x – x̄) of the observations
from the mean x̄.
The variance can be described as the estimated average of the
squared deviations.
Standard deviation is the square root of variance.
The standard deviation is the most important measure of dispersion
used in statistical analysis.
Variance and Standard Deviation cont’d…
When analyzing a sample, in most cases
we do not have information on the whole population.
But if we do, the formulas for the variance and standard deviation,
called the population variance and population standard deviation
respectively, are:
σ² = ∑(x – μ)²/N
σ = √[∑(x – μ)²/N]
where μ = population mean and N = population size.
Variance and Standard Deviation cont’d…
When analyzing a sample without information on the whole
population, the formula for the (sample) variance is:
s² = ∑(x – x̄)²/(n – 1), where x̄ = sample mean and n = sample size.
Variance and Standard Deviation cont’d…
When analyzing a sample without information on the whole
population, the formula for the (sample) standard deviation is:
s = √[∑(x – x̄)²/(n – 1)], where x̄ = sample mean and n = sample size.
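The difference between the population and sample versions can be sketched as below, assuming the usual conventions: the sum of squared deviations is divided by N for a whole population and by n – 1 for a sample. The data are illustrative; `pvariance`, `pstdev`, `variance` and `stdev` are standard-library functions that follow these conventions.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

values = [5, 7, 8, 9, 2]
n = len(values)
m = sum(values) / n

# Sum of squared deviations from the mean.
ss = sum((x - m) ** 2 for x in values)

pop_var = ss / n            # population variance: divide by N
samp_var = ss / (n - 1)     # sample variance: divide by n - 1
pop_sd = sqrt(pop_var)      # standard deviation is the square root of the variance
samp_sd = sqrt(samp_var)

# The standard library gives the same results.
print(pop_var, pvariance(values))
print(samp_var, variance(values))
print(pop_sd, pstdev(values))
print(samp_sd, stdev(values))
```

The sample variance is always the larger of the two, because the same sum of squares is divided by a smaller number.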
Coefficient of Variation (CV)
CV is the ratio of sample standard deviation to sample mean
multiplied by 100.
It expresses the SD as a percentage of the sample mean.
It answers the question: “what percentage of the sample mean is
the SD”?
It is used to compare variability in different data sets,
including those measured in different units (it is unit-less).
It is denoted by: CV = (s/x̄) × 100%, where s = sample SD and x̄ = sample
mean.
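A minimal sketch of using the CV to compare variability across different units. The weights and heights are hypothetical, and the helper name `coefficient_of_variation` is illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample SD expressed as a percentage of the sample mean."""
    return stdev(values) / mean(values) * 100

# Hypothetical measurements on the same five people, in different units.
weights_kg = [60, 65, 70, 75, 80]
heights_cm = [150, 160, 170, 180, 190]

cv_weight = coefficient_of_variation(weights_kg)
cv_height = coefficient_of_variation(heights_cm)

print(round(cv_weight, 1), round(cv_height, 1))
```

The heights have the larger SD in absolute terms, but relative to their mean they vary less than the weights, which the CV makes visible.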
Standard Error (SE)
SE is a measure of how precisely the population mean is estimated by the
sample mean.
It is a measure of the precision of a sample in estimating population
parameter.
The size of SE depends both on how much variation there is in the
population and on the size of the sample.
The larger the sample size n, the smaller the SE.
It is given by the formula: SE = s/√n, where s = sample SD and n = sample
size.
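The formula SE = s/√n and the effect of sample size can be sketched as follows. The haemoglobin values are hypothetical and `standard_error` is an illustrative helper name.

```python
from math import sqrt
from statistics import stdev

def standard_error(values):
    """Standard error of the mean: sample SD divided by the square root of n."""
    return stdev(values) / sqrt(len(values))

# Hypothetical haemoglobin values (g/dL).
hb = [12.1, 13.4, 11.8, 14.0, 12.7, 13.1]
se_hb = standard_error(hb)

# With the SD held fixed, quadrupling the sample size halves the SE.
s = 10.0
se_n25 = s / sqrt(25)
se_n100 = s / sqrt(100)

print(round(se_hb, 3), se_n25, se_n100)
```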
Assignment 2
End of Module 1