Statistics for Experimental Design
Topic 1:
A Review of Basic Statistical Concepts
Mohammad Darainy
We want to study a population by using a sample that is representative of that population (no biased sampling).
Population and Sample
§Population: The entire set of things of interest.
§Parameter: A property descriptive of the population. Ex: mean, median
§Population mean
§Sample: A subset of the population. Typically this provides the data we will look at.
§Estimate (or statistic): A property of a sample
§Sample mean
Descriptive vs. Inferential Statistics
§Descriptive Statistics:
§Summarize/describe the properties of
samples (or populations when they are
completely known)
§Inferential Statistics:
§Draw conclusions/make inferences about
the properties of populations from sample
data
Example:
§ Descriptive: the sample mean, Mean(X) = 167
§ Inferential: Hypothesis! Is the population mean Mean(μ) = 150?
Variable: Represents a characteristic of an individual in a sample or population
§ How many people in this sample are female?
§ How many people in this sample are overweight?
§ What is the IQ level of the people in this sample?
§ What was the grade of this sample in PSYC 204?
Variables (measurement level)
Qualitative
§ Nominal (no rank): Gender (Female, Male); Occupation (Student, Teacher)
§ Ordinal (rank): Walking speed (Very slow, Slow, Normal, Fast)
Quantitative
§ Interval (no meaningful zero): IQ level (80, 85, 90, 95, 100, 102)
§ Ratio (meaningful zero): Age (23, 25, 36); Weight (44, 58)
Types of Variables
• Dependent variables (Y):
– Outcomes/Responses
– Predicted variables
• Independent variables (X):
– Aka factors in experimental designs
– Aka predictors/covariates
Types of Variables
• Dependent variables (Y):
Walking speed
• Independent variables (X):
Age
A marketing researcher wants to test the effect of
a new ad on consumers’ preference ratings.
Random sampling, then random assignment:
Group 1 (treatment): Ad
Group 2 (control): No Ad
Y = Consumer preference (1-10)
X = Ad (0 = no, 1 = yes)
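The random assignment step in this design can be sketched in a few lines. This is only an illustration: the participant IDs, the group size of 20, and the helper name `randomly_assign` are assumptions, not part of the original example.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into treatment (X = 1) and control (X = 0)."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # The first half of the shuffled list sees the ad; the rest do not
    return {pid: (1 if i < half else 0) for i, pid in enumerate(shuffled)}

# Hypothetical IDs standing in for a random sample of consumers
sample = [f"P{i:02d}" for i in range(1, 21)]
assignment = randomly_assign(sample, seed=1)

treatment = sorted(p for p, x in assignment.items() if x == 1)
control = sorted(p for p, x in assignment.items() if x == 0)
print(len(treatment), len(control))
```

Because the shuffle decides who sees the ad, any pre-existing difference between consumers is spread across both groups by chance rather than by the researcher's choice.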
In this course
§ We focus on the relationships between
one dependent variable and
one/multiple independent variables.
– DV: Continuous (typically, normally distributed)
– IVs: Categorical/continuous
– Ad Example:
  – DV = Continuous (Preference: 1-10)
  – IV = Categorical (Ad: 0/1)
Descriptive Statistics
Note: get the mean and standard deviation to fully describe a normally distributed sample.
§ Summarize/describe the properties of
samples (or populations when they are
completely known)
§ How are the data distributed?
– Where is the center? (central tendency)
– What is the range? (variation)
– What is the shape of the distribution?
(shape)
Descriptive Statistics
To summarize/describe the samples
Central tendency: Mean, Median, Mode
Variation: Range, Variance, Standard deviation
Shape: Skewness (0 for normal distribution), Kurtosis (pointy or flat curve)
Measures of Central Tendency
§ Mean
§ Median
§ Mode
Measures of Central Tendency : Mean
The average; sum of values divided by sample size (N)
$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \dots + X_N}{N}$
For Data points: 6,8,8,10,12,12,15,50
$\bar{X} = \frac{6+8+8+10+12+12+15+50}{8} = \frac{121}{8} = 15.125$
Variance = mean of squared deviation scores (mean square)
$\sigma^2 = \frac{\sum (X - \mu)^2}{N_p}$, where σ² = population variance, µ = population mean, N_p = population size
$\frac{\sum (X - \bar{X})^2}{N}$: a descriptive statistic only; biased, because on average it is < σ²
$\frac{\sum (X - \bar{X})^2}{N - 1}$: both a descriptive statistic and an unbiased estimate of σ²
N – 1 = degrees of freedom
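The remark that dividing by N is biased while dividing by N – 1 is unbiased can be checked with a small simulation. The sketch below is not from the slides: it assumes a normal population with σ = 10 (so σ² = 100), samples of size N = 5, and a hypothetical helper name.

```python
import random

def average_variance_estimates(sigma=10.0, n=5, n_samples=20000, seed=0):
    """Average the N-divisor and (N - 1)-divisor variances over many random samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        squared_deviations = sum((x - mean) ** 2 for x in sample)
        biased_total += squared_deviations / n          # divides by N
        unbiased_total += squared_deviations / (n - 1)  # divides by N - 1
    return biased_total / n_samples, unbiased_total / n_samples

biased_avg, unbiased_avg = average_variance_estimates()
print(round(biased_avg, 1), round(unbiased_avg, 1))
```

Under these assumptions the N-divisor average should land near 80 (σ² times (N – 1)/N) and the (N – 1)-divisor average near 100, which is the population variance.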
Measures of Central Tendency : Mean
§ Mean is affected by extreme values (outliers).
Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Measures of Central Tendency : Median
The exact middle value
§ Calculation:
– If there are an odd number of observations,
find the middle value.
– If there are an even number of observations,
find the middle two values and average them.
Median Rank = (N + 1)/2
6, 8, 8, 10, 12, 12, 15, 50
Here N = 8, so the median rank is (8 + 1)/2 = 4.5: average the 4th and 5th values.
Median = (10 + 12)/2 = 11
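The odd/even rule and the median rank can be written out directly. This is a minimal sketch with a hypothetical function name; Python's built-in `statistics.median` gives the same median.

```python
def median_by_rank(values):
    """Return (median rank, median) using the (N + 1) / 2 rule."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (n + 1) / 2  # 1-based position of the median
    if n % 2 == 1:
        # Odd N: the rank is a whole number, so take that value
        return rank, ordered[int(rank) - 1]
    # Even N: the rank falls between two values, so average them
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return rank, (lower + upper) / 2

data = [6, 8, 8, 10, 12, 12, 15, 50]
rank, median = median_by_rank(data)
print(rank, median)  # 4.5 11.0
```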
Measures of Central Tendency : Median
§ Median is NOT affected by extreme values (outliers).
Note: take out outliers to get an accurate mean.
Data 1, 2, 3, 4, 5: Median = 3
Data 1, 2, 3, 4, 10: Median = 3
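The contrast between the two measures can be reproduced with the two five-value data sets used for the mean (1, 2, 3, 4, 5 and 1, 2, 3, 4, 10). The helper name `center_summary` is an assumption made for this sketch.

```python
import statistics

def center_summary(values):
    """Return (mean, median) for a list of numbers."""
    return statistics.mean(values), statistics.median(values)

without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]  # 5 replaced by the extreme value 10

print(center_summary(without_outlier))  # (3, 3)
print(center_summary(with_outlier))     # (4, 3)
```

The mean moves from 3 to 4 when one value becomes extreme, while the median stays at 3.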
Measures of Central Tendency : Mode
§ The most frequently observed value
6, 8, 8, 10, 12, 12, 15, 50
Modes of this distribution are 8 and 12
§ Not affected by extreme values
§ Used for either numerical or categorical data
§ There may be no mode
§ There may be several modes
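These properties of the mode can be illustrated with a short function. It is a sketch under one stated assumption: when no value repeats, it reports no mode (an empty list), which is one common convention and matches "there may be no mode".

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([6, 8, 8, 10, 12, 12, 15, 50]))         # [8, 12]
print(modes([1, 2, 3]))                             # []
print(modes(["Student", "Teacher", "Student"]))     # ['Student']
```

The last call shows that the mode also works for categorical data, where a mean or median would not be defined.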
Which measure of central tendency is
the best?
• Mean is generally used, unless extreme
values (outliers) exist.
• Median is often used, since the median is
not sensitive to extreme values.
– Example: Median home prices may be
reported for a region – less sensitive to
outliers.
Measures of Variation
Measures of variation
give information on the
spread or variability of
data values.
• Range
• Variance
• Standard Deviation
Same center,
different variation
Range
-Considers only the starting point and end point
-Does not show how the data are spread
e.g., a range of 10 numbers between 2 and 26:
2 3 4 … … 26 …
OR
2 … … 24 25 26
Variance
• Average (approximately) of ‘squared’ deviations of values from the mean
Sample variance (unbiased):
$S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$
where $\bar{X}$ = mean
N = sample size
$X_i$ = i-th value of the variable X
Standard Deviation
§ Most commonly used measure of variation
§ A statistic that measures the dispersion of a
dataset relative to its mean.
§ Has the same units as the original data
– Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$
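As a worked instance of the sample variance and sample standard deviation formulas, the sketch below applies them to the data set used earlier for the mean (6, 8, 8, 10, 12, 12, 15, 50). The function names are illustrative; `statistics.variance` and `statistics.stdev` use the same N – 1 divisor.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

data = [6, 8, 8, 10, 12, 12, 15, 50]
print(round(sample_variance(data), 2))  # 206.7
print(round(sample_sd(data), 2))        # 14.38
```

The variance is in squared units, whereas the standard deviation of about 14.38 is back on the scale of the original values.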
Comparing Standard Deviations
[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570
Shape of a Distribution
• Describes how data are distributed
• Measures of shape
– Symmetric or skewed
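[Figure: three distributions]
Symmetric: Mean, Median, and Mode coincide
Positively skewed: Mode < Median < Mean (the mean is pulled toward the right tail)
Negatively skewed: Mean < Median < Mode (the mean is pulled toward the left tail)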
Normal (Gaussian or bell-shaped)
Distribution
• In most statistical techniques for
experimental designs, the dependent
variable (Y) is assumed to be continuous
and normally distributed.
• If normally distributed,
• Mean=Median=Mode
• Mean(μ) and Standard Deviation(σ) are sufficient to describe a
normal distribution.
Note: like a skewness test
$Y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right)$
Normal distribution
µ ± σ: 68% of the values in the population or the sample
µ ± 2σ: 95%
µ ± 3σ: 99.7%
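The 68%, 95%, and 99.7% figures follow from the normal density and can be recomputed with the standard library. In this sketch the function names are assumptions; the coverage uses the identity P(|Z| < k) = erf(k/√2), which holds for any normal distribution once distances are measured in standard deviations.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def coverage_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```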
Standard score (Z-score)
§ Definition: number of standard deviations from the mean.
$z = \frac{X - \bar{X}}{s}$ or $z = \frac{X - \mu}{\sigma}$
§ The standard score (z-score) follows the standard normal distribution if your original data are normally distributed: μ = 0, σ = 1, z ~ N(0,1)
The purpose of the z statistic is to transform any normal distribution to the standard normal distribution. The shape of our curve doesn't change, only the units.
[Figure: a normal curve for X with σ = 5 and axis 157, 162, 167, 172, 177, 182, 187, next to the standard normal curve with σ = 1 and axis -3, -2, -1, 0, +1, +2, +3]
§ This transformation is useful because we can easily examine how extreme our sample score (X) is by simply looking at the corresponding z score.
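The change of units can be seen by standardizing the axis values from the figure. This sketch assumes µ = 172 (the center of the X axis) and σ = 5, both read off the figure rather than stated in the text.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

mu, sigma = 172, 5  # assumption: values read off the figure's axis
raw_scores = [157, 162, 167, 172, 177, 182, 187]
z_scores = [z_score(x, mu, sigma) for x in raw_scores]
print(z_scores)  # [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
```

The spacing and order of the scores are unchanged; only the scale is relabelled in standard-deviation units.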
Example: You get a GRE score of 800. Is that
good? Will you get into grad school?
$z = \frac{X - \mu}{\sigma}$
[Figure: normal curve over GRE scores from 300 to 900]
• Let’s say µ = 600, σ = 100
• z = (800 - 600)/100 = 2
[Figure: the same curve with the area above 800 shaded; this area is 2.28% of the distribution]
With the help of a z table, the area above z = 2.0 is only 0.0228. That means that you scored in the top 2.28%!
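The z table lookup can be reproduced numerically. The sketch below uses the example's values (µ = 600, σ = 100, score 800) and computes the upper-tail area from the complementary error function; the function name `upper_tail` is an assumption.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve above z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

mu, sigma, score = 600, 100, 800
z = (score - mu) / sigma
area = upper_tail(z)
print(z, round(area, 4))  # 2.0 0.0228
```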
Sampling Distribution of the Mean
• Three Types of Distributions
• Population Distributions
The distribution of all scores in the population. Imagine we are
interested in the height of all currently enrolled McGill students. The
resulting frequency distribution will be our population distribution.
Male ~ N(175.1 , 4.5)
Female ~ N(162.3 , 4.0)
Combined ~ (μ = 168.7 , σ = 7.7)
* The true standard deviations are
male = 7.42 and female = 7.11 which
were changed for effect.
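The combined values can be checked against the two group distributions. This sketch rests on two assumptions that the slide does not state: that the second number in N( , ) is a standard deviation, and that males and females each make up half of the population. The function name is hypothetical.

```python
import math

def mixture_mean_sd(groups):
    """groups: list of (weight, mean, sd) with weights summing to 1."""
    mean = sum(w * m for w, m, _ in groups)
    # Total variance = within-group variance + spread of group means around the overall mean
    variance = sum(w * (s ** 2 + (m - mean) ** 2) for w, m, s in groups)
    return mean, math.sqrt(variance)

# Assumption: a 50/50 split between the two groups
groups = [(0.5, 175.1, 4.5), (0.5, 162.3, 4.0)]
combined_mean, combined_sd = mixture_mean_sd(groups)
print(round(combined_mean, 1), round(combined_sd, 1))  # 168.7 7.7
```

Under those assumptions the result agrees with the slide's μ = 168.7 and σ = 7.7. Most of the combined spread comes from the gap between the two group means, not from the spread within each group.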
Sampling Distribution of the Mean
• Sample Distributions
Draw a McGill student at random and measure his/her
height.
Put him/her back (replacement)
Draw again and measure the height of the student
Do you expect the two heights to be identical?
Suppose we repeat this procedure, but draw 50 students
each time instead of one
Do you expect the two sets of heights to be identical?
Random Variation: Two samples drawn randomly from
the same population will practically never be identical.
Height Distribution: Samples (n=50)
Sampling Distribution of the Mean
• Sampling Distributions
Draw two McGill students at random and measure their
heights
Put them back (replacement)
Draw two students again and measure their heights
Do you expect the means of these two samples to be
identical?
Again, what about for two samples of 50 students?
Random Variation in Sample Statistics:
Just like individual observations vary randomly between
samples, so do the statistics generated from those samples.
And just like the variation among observations can be
described by probability distributions, so can the variation
in the sample statistics.
Sampling Distributions: The distribution of a
statistic generated from samples.
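The thought experiment of repeatedly drawing 50 students can be simulated. The sketch below is only an illustration: it assumes the height population described earlier (Male ~ N(175.1, 4.5), Female ~ N(162.3, 4.0), read as mean and standard deviation), an equal chance of drawing either group, and hypothetical function names.

```python
import random
import statistics

def draw_height(rng):
    """Draw one height in cm; assumes a 50/50 mix of the two groups."""
    if rng.random() < 0.5:
        return rng.gauss(175.1, 4.5)
    return rng.gauss(162.3, 4.0)

def sample_means(n, n_samples, seed=0):
    """Mean height of each of n_samples random samples of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean([draw_height(rng) for _ in range(n)])
        for _ in range(n_samples)
    ]

means = sample_means(n=50, n_samples=2000)
print(round(min(means), 1), round(max(means), 1))  # sample means differ between samples
print(round(statistics.fmean(means), 1), round(statistics.stdev(means), 2))
```

The list `means` is an empirical sampling distribution of the mean: each entry is one sample's statistic, and their spread is much narrower than the spread of individual heights.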
Why are Sampling Distributions Important?
They are the foundation for statistical inference and
hypothesis testing
Every statistic has a sampling distribution:
Means, standard deviations, medians, maxima/minima,
etc.
In this course we are interested in the Sampling Distribution
of the Mean.
Sampling Distribution of the Mean
To explore sampling distributions, let’s use the following online
applet.
http://onlinestatbook.com/stat_sim/sampling_dist/
Explore on your own: What aspects of the population and sample
distributions affect the resulting sampling distribution of the Mean?
Central Limit Theorem
[Figure: sampling distribution of the mean, centered at $\mu_{\bar{X}}$ with standard deviation $\sigma_{\bar{X}}$]
Normality of the Sampling Distribution of the Mean
Central Limit Theorem
When the sample size is large (i.e., N > 30):
Even if the variable is not normally distributed, the sampling distribution of the mean approaches normality, with $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{N}}$.
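The theorem can be checked by simulation with a clearly non-normal variable. This sketch assumes an exponential population with rate 1 (so µ = 1 and σ = 1, strongly right-skewed) and samples of N = 36; these choices and the function name are illustrative, not from the slides.

```python
import math
import random
import statistics

def simulate_sample_means(n=36, n_samples=5000, seed=0):
    """Means of repeated samples from an exponential population (mu = 1, sigma = 1)."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

n = 36
means = simulate_sample_means(n=n)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
predicted_sd = 1.0 / math.sqrt(n)  # sigma / sqrt(N)

# Share of sample means within 2 predicted standard deviations of mu
within_2 = sum(abs(m - 1.0) < 2 * predicted_sd for m in means) / len(means)
print(round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_sd, 3))
print(round(within_2, 3))
```

If the theorem holds here, the mean of the sample means should sit near 1, their standard deviation near 1/6, and roughly 95% of them within two standard deviations of µ, even though the underlying variable is far from normal.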