5/20/2021
Dr. Faran Emmanuel
University of Manitoba
THE POWER OF STATISTICS
Statistical Thinking for Success in Life and Career
By Michael I. Parzen and Emanuel Parzen
May 30, 2013
Statistical thinking helps one’s success in life and career
by quantifying uncertainty using probability. It is
important to distinguish between outcomes that
are conceivable (i.e. zero probability), possible
(i.e. positive probability for an interval of similar
outcomes), and probable (i.e. positive probability).
Statistical thinking is used to answer questions about
what one knows and how one knows it, based on analysis
of data more than expert opinions.
FROM EPIDEMIOLOGY TO BIOSTATISTICS
▪ Statistics is the body of techniques and procedures dealing with the collection,
organization, analysis, interpretation, and presentation of information that
can be stated numerically. It is the science and art of understanding and analyzing
data to obtain reliable results and conclusions.
▪ Biostatistics is the application of statistics to problems in the
biological sciences, health, and medicine.
SO WHAT DOES STATISTICS DO???
▪ Statistics uses sample statistics to estimate population parameters, also termed
population characteristics.
▪ One common example is the population mean… we calculate sample mean to
estimate population mean. Likewise we look at the proportion of people having a
specific disease within our sample and estimate population prevalence
▪ Although population parameters are sometimes considered unobservable, they
are taken to be fixed and potentially measurable quantities using survey statistics.
▪ Sample statistics vary from one sample to another… that’s why they are treated as variables
THIS IS WHAT RESEARCHERS DO
▪ Identify an issue or a problem
▪ Literature review to understand the issue
▪ Develop rationale to support the investigation
▪ Develop a Hypothesis or Research question
▪ Develop Research Protocol & Questionnaires
▪ Ethical approval from an ERB
▪ Pre-test the questionnaire & research process
▪ Finalize data collection process and study teams
▪ Field Data collection
▪ Data management and analysis
▪ Conclusions and Recommendations
▪ Dissemination and Response
Side notes on the diagram: Usually a statistic; Usually review data…; Numbers; Hypothesis testing; Sample Size, Variables, Questions; Preliminary analysis; Many steps…; Crisp Numbers
FROM EPIDEMIOLOGY TO BIOSTATISTICS
Data vs Information
DATA (questionnaire info, consists of numbers) → ANALYSIS → INFORMATION (results and conclusions)
▪ The methods and tools of biostatistics are used to analyze the data for
decision making
▪ They are used to make valid inferences from known samples about the populations from
which they were drawn.
DATA ANALYSIS
There are various types of Statistical Analysis:
▪ Descriptive analysis: used to describe the data set
▪ Inferential analysis: used to generate conclusions about the
population’s characteristics based on the sample data
o Differences analysis: used to compare the mean of the responses
of one group to that of another group
o Associative analysis: determines the strength and direction of
relationships between two or more variables
▪ Predictive analysis: allows one to make forecasts for future events
UNDERSTANDING VARIABLES
▪ Any population/sample characteristic that we want to study
is called a “variable”, e.g., age, sex, years of education, HIV status,
income, etc.
▪ The term "variable" makes sense because the value of the characteristic
varies from one subject to another. This happens because of inherent variation
among individuals and because of errors, called measurement errors, made in
measuring and recording a subject's value on a characteristic.
▪ Dependent (outcome) and Independent (predictors) variables
▪ We start by looking at the types of variables we have in our data set
TYPES OF VARIABLES/DATA
2 major types
▪ QUALITATIVE (Categorical)
o NOMINAL: Identification of subjects, e.g., Study ID, Gender, Address, Name of the school
o ORDINAL: Order of value exists, e.g., income categories, educational categories, state of health
▪ QUANTITATIVE (Numerical/Scale)
o DISCRETE: Absolute values… no continuity, e.g., No of children, No of patients seen
o CONTINUOUS: A score or value within a scale. The difference between each value has a real meaning
NUMERIC APPROACH
Data is usually described in form of numbers – Tables are the most common presentation of data
▪ QUANTITATIVE
o Measure of central tendency: mean, median, mode
o Measure of dispersion: Variance, Standard Deviation, Range, Inter-quartile Range
o Distribution: Skewness, Kurtosis
▪ QUALITATIVE (Categorical)
o Frequencies/Proportions
DESCRIPTIVE – NUMERIC – MEASURES OF CENTRAL TENDENCY
▪ Measures of central tendency are measures of the location of the middle
or the center of a distribution
▪ Why describe central tendency?
Data often cluster around a central value that lies between the two
extremes. This single number can describe the value of scores in the
entire data set.
▪ Three measures of central tendency are usually used:
1) Mean
2) Median
3) Mode
DESCRIPTIVE – NUMERIC – MEASURES OF CENTRAL TENDENCY - MEAN
▪ THE MEAN is the most commonly used, also known as “average”
o Population mean μ, and sample mean x̄
o Sum of all the scores divided by the number of scores
o Weighted mean - weight given to each value according to its importance
o Most commonly used
o Used when the variable is normally distributed
o Reported as Mean (sd)
▪ MEDIAN is the middle score when all scores in the data set are arranged in order.
o Half the scores lie above and half lie below the median.
o Used when the distribution is skewed (e.g., income)
o Reported as Median (range)
▪ MODE is the most frequently occurring number in a set of data.
o If there are two modes, the data set is bimodal.
o If there are more than two modes, the data set is said to be multimodal.
o Usually used with categorical data
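A minimal sketch of the three measures, using Python's standard `statistics` module. The ages and the weights below are hypothetical values made up for illustration; they are not data from these slides.

```python
import statistics

# Hypothetical ages of 9 participants (illustrative data only)
ages = [22, 25, 25, 27, 30, 31, 25, 40, 90]

mean_age = statistics.mean(ages)      # sum of all scores / number of scores
median_age = statistics.median(ages)  # middle score once the scores are sorted
modes = statistics.multimode(ages)    # most frequent value(s); two values would mean bimodal

# Weighted mean: each value counts according to its (assumed) importance
values = [70, 80, 90]
weights = [1, 2, 3]
weighted_mean = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(mean_age, median_age, modes, round(weighted_mean, 2))
```

In this made-up sample the single extreme age (90) pulls the mean up to 35 while the median stays at 27, which is the reason the median is preferred for skewed variables.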
DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION
▪ Let's look at an example:
▪ Group A: 0, 5, 10, 15, 20, 25, 30
▪ Group B: 11, 14, 14, 15, 16, 16, 19
▪ Group C: 0, 0, 15, 15, 15, 30, 30
▪ All three groups have the same mean (15). So can we say that all three
groups are the same?
▪ Other than the mean/average, another measure is needed to describe the
characteristics of a group…
▪ Let us look at another example…………..
DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION
▪ A variable’s spread is the degree to which values on the variable differ
from each other.
▪ The more the values differ from each other, the greater the spread of the data.
▪ If every score on the variable were about equal, the variable would have
very little spread.
▪ Variability and dispersion are synonyms for spread.
▪ There are various measures of dispersion, e.g., Range, Variance, Standard
Deviation, Interquartile Range (IQR).
DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION - RANGE
▪ The range is the simplest measure of spread or dispersion.
▪ It is equal to the difference between the largest and the smallest values.
e.g., 100, 74, 68, 68, 57, 56
Range = H - L = 100 - 56 = 44
▪ Range is very sensitive to extreme scores since it is based on only two
values.
▪ The range should almost never be used as the only measure of spread,
but can be informative if used as a supplement to other measures of
spread
DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION – VARIANCE
• Variance is the sum of the squared deviations from the mean divided by N.
• Population variance is given by $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
• Sample variance is given by $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION
STANDARD DEVIATION
▪ The most commonly used measure of dispersion
▪ To calculate the standard deviation, simply take the square root of the
variance.
▪ Population standard deviation is given by $\sigma = \sqrt{\sigma^2}$
▪ Sample standard deviation is given by $s = \sqrt{s^2}$
▪ Average or mean is always presented along with standard deviation
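A short sketch applying range, variance and standard deviation to Groups A, B and C from the earlier example, using Python's standard `statistics` module. The dictionary layout and variable names are assumptions made for illustration.

```python
import statistics

# Groups A, B and C from the dispersion example (all have mean 15)
groups = {
    "A": [0, 5, 10, 15, 20, 25, 30],
    "B": [11, 14, 14, 15, 16, 16, 19],
    "C": [0, 0, 15, 15, 15, 30, 30],
}

summary = {}
for name, scores in groups.items():
    summary[name] = {
        "mean": statistics.mean(scores),
        "range": max(scores) - min(scores),          # highest minus lowest
        "pop_var": statistics.pvariance(scores),     # squared deviations / N
        "sample_var": statistics.variance(scores),   # squared deviations / (n - 1)
        "sample_sd": statistics.stdev(scores),       # square root of sample variance
    }
    print(name, summary[name])
```

The means are identical, but the sample variances differ widely (about 116.7 for A, 6 for B and 150 for C), so the groups are clearly not the same. Groups A and C also share the same range (30) even though C is more spread out, which shows why the range alone is not enough.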
DESCRIPTIVE – NUMERIC – MEASURES OF DISPERSION – INTERQUARTILE RANGE
▪ The interquartile range is a measure of where the “middle fifty percent” of the values lies in a data set.
▪ Where range is a measure of where the beginning and end are in a set, an
interquartile range is a measure of where the bulk of the values lie. That’s why it’s
preferred over many other measures of spread when reporting things like school
performance or SAT scores.
▪ The interquartile range formula is the first quartile subtracted from the
third quartile:
IQR = Q3 – Q1.
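A minimal sketch of IQR = Q3 – Q1 using `statistics.quantiles` from the Python standard library, applied to Groups A and B from the dispersion example. Note that software packages use slightly different conventions for locating quartiles, so other tools may give slightly different values; the default ("exclusive") method is assumed here.

```python
import statistics

group_a = [0, 5, 10, 15, 20, 25, 30]
group_b = [11, 14, 14, 15, 16, 16, 19]

def iqr(data):
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

iqr_a = iqr(group_a)
iqr_b = iqr(group_b)
print(iqr_a, iqr_b)
```

With this convention the middle half of Group A spans 20 units while the middle half of Group B spans only 2, even though both groups have the same mean.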
DESCRIPTIVE – NUMERIC – DISTRIBUTION
▪ When we talk of the distribution of data, it
basically means “how is a variable or a
characteristic distributed in the
population?”
▪ For example, “what is the distribution of
age among this class?”
▪ We can develop a histogram showing how
many people are within each age category
(each year or class intervals)
[Figure: histogram of age, horizontal axis running from 20 to 46]
▪ This histogram shows the age distribution
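A small sketch of how the counts behind such a histogram could be produced. The ages, the starting value and the class-interval width are all hypothetical choices made for illustration, not the data shown on the slide.

```python
from collections import Counter

# Hypothetical ages for a class (illustrative only)
ages = [21, 23, 24, 24, 26, 27, 29, 30, 31, 31, 33, 35, 38, 41, 45]
start = 20   # assumed lower limit of the first class interval
width = 4    # assumed class-interval width in years

# Map each age to the index of its class interval and count members per interval
counts = Counter((age - start) // width for age in ages)

for k in sorted(counts):
    low = start + k * width
    print(f"{low}-{low + width - 1}: {'#' * counts[k]}")
```

Each printed row is one bar of the histogram: the label is the class interval and the number of `#` marks is the number of people in it.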
NORMAL DISTRIBUTION
▪ The most famous probability distribution in statistics.
▪ Also called “Gaussian distribution” or “bell shaped curve”
▪ It is continuous, smooth, bell shaped and has only one peak (unimodal)
▪ The curve is symmetrical about the mean (shape is same on both sides)
▪ The mean, median and mode are equal and located at the center of the distribution
▪ Two parameters define the normal distribution, the mean (μ) and the standard
deviation (σ).
▪ Since it is a probability distribution, total area under the curve is 1.00 or 100%
WHY IS NORMAL DISTRIBUTION IMPORTANT
▪ Countless phenomena follow (or closely approximate) a normal
distribution, e.g., height, weight, serum cholesterol, body temperature of healthy
persons
▪ Much statistical theory and methodology has been developed on this assumption,
and it is the basis for inferential statistics
SKEWNESS OF DATA
▪ Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same to
the left and right of the center.
▪ The skewness for a normal distribution is zero. Negative values for the
skewness indicate data that are skewed left and positive values for the
skewness indicate data that are skewed right.
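A minimal sketch of how the sign of skewness behaves. It assumes one common definition, the moment coefficient (third central moment divided by the 1.5 power of the second central moment); statistical packages may apply a small-sample correction and report slightly different numbers. The data sets are hypothetical.

```python
def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (one common definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [0, 5, 10, 15, 20, 25, 30]       # same shape on both sides of the mean
right_skewed = [1, 2, 2, 3, 3, 3, 20]        # long tail towards high values
left_skewed = [-x for x in right_skewed]     # mirror image: long tail towards low values

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```

The symmetric set gives zero, the set with a long right tail gives a positive value, and its mirror image gives a negative value.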
TYPES OF DISTRIBUTION/CURVE
[Figure: three curves. Normal distribution: mean and median are the same. Positive skew: the median lies to the left of the mean. Negative skew: the mean lies to the left of the median.]
▪ A distribution is skewed if one of its tails is longer than the other.
▪ The distribution can be positively skewed. This means that it has a long tail in the
positive direction… "skewed to the right"
▪ The distribution can be negatively skewed if it has a long tail in the negative direction….
"skewed to the left"
OUTLIERS
▪ An outlier is any observation that falls more than 3 standard deviations away
from the mean.
▪ Outliers are extremely important because they can significantly skew
distributions that are otherwise normal.
▪ A decision needs to be taken about how to deal with outliers.
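A minimal sketch of the "more than 3 standard deviations from the mean" rule. The readings are hypothetical, and `find_outliers` is an illustrative helper name, not a library function.

```python
import statistics

def find_outliers(values, threshold=3.0):
    """Return observations lying more than `threshold` sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / sd > threshold]

# Hypothetical readings: 20 values close to 120 and one extreme value
readings = [118, 122, 119, 121, 120, 117, 123, 120, 119, 121,
            118, 122, 120, 121, 119, 120, 122, 118, 121, 119, 200]

outliers = find_outliers(readings)
print(outliers)
```

One caveat of this rule: the extreme value itself inflates the mean and the standard deviation, so in very small samples no observation can reach 3 standard deviations and the rule may flag nothing.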
APPLICATION OF NORMAL DISTRIBUTION
▪ Knowing a distribution alone is not enough… Scientists use this to answer
research questions
▪ … we just said that countless phenomena follow (or closely approximate)
normal distribution e.g., height, weight, serum cholesterol, body temp of
healthy persons
▪ E.g., if systolic blood pressure is normally distributed with a mean of 120 and
a standard deviation of 12, what proportion of people will have a normal systolic
blood pressure? (Hypertensive is >135, while low blood pressure is <95.)
STANDARD NORMAL DISTRIBUTION
▪ A normal distribution is transformed to the standard normal distribution by the formula:
$z = \frac{x - \mu}{\sigma}$
where x is a score from the original normal distribution, μ is the mean of the original normal distribution,
and σ is the standard deviation
▪ The standard normal distribution is also called the z distribution.
▪ A z score is the number of standard deviations an observation is away from the mean.
By using this formula we can convert any
variable’s distribution into a standard
normal distribution with a mean of zero
and standard deviation of 1.
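A sketch of the systolic blood pressure question (mean 120, standard deviation 12, "normal" taken as between 95 and 135), using `statistics.NormalDist` from the Python standard library in place of a printed z table. Variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 120, 12          # values from the blood pressure example
low, high = 95, 135          # limits for "normal" systolic blood pressure

# Step 1: convert the limits to z scores, z = (x - mu) / sigma
z_low = (low - mu) / sigma
z_high = (high - mu) / sigma

# Step 2: area under the standard normal curve between the two z scores
std_normal = NormalDist()    # mean 0, standard deviation 1
p_normal = std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Cross-check: the same area computed directly on the original scale
p_direct = NormalDist(mu, sigma).cdf(high) - NormalDist(mu, sigma).cdf(low)

print(round(z_low, 2), round(z_high, 2), round(p_normal, 3))
```

The z scores are about -2.08 and 1.25, and roughly 88% of people fall between the two limits. Both routes give the same area, which is the point of standardizing.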
Z SCORE TABLE
AREA UNDER THE CURVE
One half of the distribution is a
mirror image of the other half.
Z SCORE TABLE – USING THE TABLE
For instance, suppose you scored 120 on a test with a
mean of 100 and a standard deviation of 10.
What is your z score?
z = (120 – 100) / 10 = 20/10 = 2
The z score tells you how many standard
deviations from the mean your score is. In this
example, your score is 2 standard
deviations above the mean.
What proportion of people scored above 120?
What proportion of people scored between 90 and 120?
What proportion scored above 100?
Z SCORE TABLE
AREA UNDER THE CURVE
▪ What is the 90th percentile?
(X = μ + Zσ)
▪ To solve this, look for 90% in the table
and check the z score for it.
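A sketch that answers the test-score questions (mean 100, standard deviation 10), with `statistics.NormalDist` from the Python standard library standing in for the printed z table. Variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 100, 10
scores = NormalDist(mu, sigma)

z = (120 - mu) / sigma                           # z score for a test score of 120

p_above_120 = 1 - scores.cdf(120)                # area to the right of z = 2
p_90_to_120 = scores.cdf(120) - scores.cdf(90)   # area between z = -1 and z = 2
p_above_100 = 1 - scores.cdf(100)                # area to the right of the mean

# 90th percentile: find the z with 90% of the area below it, then X = mu + Z * sigma
z_90 = NormalDist().inv_cdf(0.90)
x_90 = mu + z_90 * sigma

print(z, round(p_above_120, 4), round(p_90_to_120, 4), p_above_100, round(x_90, 1))
```

About 2.3% of people score above 120, about 81.9% score between 90 and 120, exactly half score above 100, and the 90th percentile is a score of about 112.8.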
Z DISTRIBUTION : THE CATCH
▪ The z distribution works on 2 major assumptions:
▪ The sample size is more than 30, i.e., N>30
▪ We know the value of σ.
IS THERE A PROBLEM WITH THAT ??
▪ Even if the sample size is high, in reality there will rarely be a situation
where you know the true population variance or standard deviation…
T DISTRIBUTION
▪ The t distribution (Student’s t-distribution) is a probability distribution that is used
to estimate population parameters when
i. the sample size is small
ii. and/or when the population variance is unknown.
▪ The t statistic (also known as the t score) is given by:
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the
sample, and n is the sample size. The distribution of the t statistic is called the t
distribution or the Student t distribution.
T DISTRIBUTION
▪ There are many different “t” distributions.
▪ The particular form of the “t” distribution
is determined by its degrees of freedom.
▪ Degrees of freedom refers to the number of independent observations in a set of data.
▪ When estimating a mean score or a proportion from a single sample, the number of
independent observations is equal to the sample size minus one. Hence, the
distribution of the t statistic from samples of size 8 would be described by a t
distribution having 8 - 1 or 7 degrees of freedom. Similarly, a t distribution having 15
degrees of freedom would be used with a sample of size 16.
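A minimal sketch of computing a t statistic and its degrees of freedom for a single sample of size 8, matching the n = 8, df = 7 case above. The sample values and the hypothesised population mean of 120 are assumptions made for illustration.

```python
import math
import statistics

# Hypothetical sample of n = 8 measurements (illustrative only)
sample = [118, 125, 130, 121, 128, 135, 119, 124]
mu_0 = 120                     # assumed population mean to compare against

n = len(sample)
x_bar = statistics.mean(sample)          # sample mean
s = statistics.stdev(sample)             # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - population mean) / (s / sqrt(n))
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1                               # degrees of freedom for a single sample

print(x_bar, round(s, 3), round(t_stat, 3), df)
```

The resulting t statistic (about 2.44) would then be compared against the t table row for 7 degrees of freedom rather than against the z table.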
T DISTRIBUTION
Following are the characteristics of the t distribution:
(1) The t statistic lies between −∞ < t < ∞.
(2) The probability distribution appears to be symmetric about t = 0.
(3) The probability distribution appears to be bell-shaped.
(4) The density curve looks like a standard normal curve, but the tails of the t-
distribution are "heavier" than the tails of the normal distribution. That is, we are
more likely to get extreme t-values than extreme z-values. There are no outliers.
(5) As the degrees of freedom increase, the t-distribution approaches the
standard normal z-distribution.
T SCORE TABLE
AREA UNDER THE CURVE
▪ What would be the t score if the area
under the curve is 90% with n=24,
and with n=1000?
▪ What would be the t score if the area
under the curve is 95% with n=100?
▪ It is interesting to note that when the
sample size is large, the distribution
of the t statistic matches the distribution of z.
CHI SQUARE DISTRIBUTION
▪ When the data is categorical, Chi Square statistic is most commonly used for
testing relationships between categorical variables.
▪ Chi square distribution (χ2) is a probability distribution widely used in statistical
inference
▪ The chi-squared distribution (also chi-square or χ2-distribution) with k degrees of
freedom is the distribution of a sum of the squares of “k” independent standard
normal random variables.
CHI SQUARE DISTRIBUTION
▪ The distribution shape
depends on the degrees of
freedom.
▪ The higher the degrees of
freedom, the more closely the Chi square
distribution (χ2) resembles a
normal distribution.
▪ For a table with r rows and c columns,
DF = (r - 1) * (c - 1). Thus in
a 2X2 table the DF is 1, i.e.,
(2-1)*(2-1) = 1
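A sketch of a chi square calculation for a 2X2 table, including the degrees of freedom from the formula above. The observed counts are hypothetical, and the statistic is assumed to be the standard Pearson form, the sum over all cells of (observed - expected)² / expected, without a continuity correction.

```python
# Hypothetical 2X2 table: rows = exposed / not exposed, columns = disease yes / no
observed = [
    [30, 70],
    [10, 90],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if rows and columns were unrelated
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

# DF = (r - 1) * (c - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi_square, df)
```

For these made-up counts the statistic is 12.5 with 1 degree of freedom, which would be compared with the tabulated critical value for df = 1.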
CHI SQUARE TABLE
▪ No negative values
▪ The value of χ2 is different with each degree of freedom
▪ What will be the value of the χ2 statistic if df=1 and the confidence level is 95%?
CHI SQUARE AND STANDARD NORMAL DISTRIBUTION
▪ If a random variable (Z) has a standard normal distribution, then Z² follows a χ2
distribution with one degree of freedom
▪ The χ2 distribution is related to the standard normal distribution. The simplest chi-squared
distribution is the distribution of the square of a standard normal variable.
χ2 at 0.05 (df = 1) = 3.84 = 1.96², where z at 0.05 (two-sided) = 1.96
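A quick numerical check of this relationship, using `statistics.NormalDist` from the Python standard library. It assumes the 0.05 level for z is two-sided (2.5% in each tail), which is what gives 1.96.

```python
from statistics import NormalDist

alpha = 0.05

# Two-sided critical z: 2.5% of the area in each tail
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Squaring the critical z gives the chi square critical value for df = 1
chi2_crit_df1 = z_crit ** 2

print(round(z_crit, 2), round(chi2_crit_df1, 2))
```

The squared critical z (1.96² ≈ 3.84) reproduces the chi square table value for one degree of freedom.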
PROBABILITY DISTRIBUTIONS
DISCRETE
• Binomial distribution (Logistic regression)
CONTINUOUS
• The “F” Distribution (ANOVA)
WHICH DISTRIBUTION HAS A LARGER VARIANCE?
Descriptive statistics
▪ GRAPHICAL APPROACH: Visual presentation of data in form of figures and illustrations … more commonly done for reports
o Bar Charts
o Histograms/Frequency Polygons
o Pie Charts
o Scatter Plots
▪ NUMERIC APPROACH: Data is usually described in form of numbers – Tables are the most common presentation of data
o Measure of central tendency: mean, median, mode
o Measure of dispersion: Variance, Standard Deviation, Range, Inter-quartile Range
o Distribution: Skewness, Kurtosis
DESCRIPTIVE – GRAPHICAL – BAR CHARTS
[Figure: stacked bar chart, vertical axis 0% to 100%. Bars: BHWL, BNU, DGK, GJRN, GJRT, HYD, KHI, KSUR, LRK, MPK, NWB, PSH, QTA, RWP, SHKP, SLKT, SKKR, TRBT, OVERALL. Legend: Home, KK, Brothel, Hotel/Msg, Street, Phone]
Best use: whole numbers, nominal or discrete data
DESCRIPTIVE – GRAPHICAL – HISTOGRAM
Best use : Continuous data
DESCRIPTIVE – GRAPHICAL – PIE CHARTS
[Figure: pie chart. Home 29%, KK 36%, Brothel 1%, Hotel/Msg 3%, Street 14%, Cell Phone* 17%]
Best use: Percentages
DESCRIPTIVE – GRAPHICAL – BOX PLOTS
Best use : more descriptive, shows measure of Central tendency + dispersion
DESCRIPTIVE – GRAPHICAL – SCATTERPLOTS
Best use : relationship between two variables
HIV PREVALENCE TRENDS – THE POWER OF GRAPHICS
[Figure: line chart of HIV prevalence by group and survey year, vertical axis 0% to 45%, plotting the values below]
Group   2005    2007    2008    2011    2016
IDUs    10.8%   15.8%   20.8%   36.7%   38.4%
TGs     0.8%    2.1%    6.4%    7.3%    7.2%
MSM     0.4%    1.5%    0.9%    3.1%    5.6%
FSWs    0.4%    0.2%    0.5%    0.8%    2.2%
DESCRIPTIVE – GRAPHICAL – BUBBLE CHARTS
DESCRIPTIVE ANALYSIS
▪ Understand our data set
▪ Scale/Numerical data:
o Check for distribution: skewness and kurtosis (symmetrical or skewed; remember the z distribution and the t distribution)
o Measures of central tendency (Mean, Median, Mode)
o Measures of dispersion (Standard deviation, variance, range, IQR)
o DECIDE – Do you want to make categories or present as such
▪ Qualitative/Categorical data:
o Check for proportions within each category
o Relationships between categorical variables are tested using the Chi square distribution
▪ How do you present your descriptive data
o Numeric presentation (Tables)
o Graphical presentation (Figures/graphs/illustrations)
THANKS
PLEASE READ THESE BASIC CONCEPTS…
I will be happy to answer any questions
[email protected]