Sampling distribution
By: Amare M (MPH/Biostatistics)
January 2024
Objectives
Define sampling distributions (mean and proportion)
Understand how a sampling distribution is constructed
Appreciate the properties of sampling distributions
Understand the Central Limit Theorem
Sampling distribution
• A sampling distribution is a distribution of all possible values
of a statistic computed from samples of the same size
randomly selected from the same population
• Serves to answer probability questions about sample statistics
• A sampling distribution can be constructed when the
population is finite
• However, constructing a sampling distribution is difficult
with a large population and impossible with an infinite
population
• Take a sample of size n from a population of size N and
calculate the statistic (e.g., the mean or proportion)
• Take another sample of the same size n from the same
population and calculate the sample statistic again
• Repeat many times (ideally for all possible samples)
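The repeated-sampling procedure above can be sketched in a few lines of Python. This is only an illustration: the population of 1000 values, the sample size n=25, the number of repeats, and the helper name `simulate_sample_means` are all assumptions made for the example, not part of the lecture.

```python
import random
import statistics

def simulate_sample_means(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n and return each sample's mean."""
    rng = random.Random(seed)
    # rng.sample draws n distinct individuals (no replacement within a sample).
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

# Hypothetical population of N = 1000 values (illustrative only).
pop_rng = random.Random(42)
population = [pop_rng.gauss(70, 10) for _ in range(1000)]

sample_means = simulate_sample_means(population, n=25, repeats=2000)

print("population mean:     ", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("range of sample means:",
      round(min(sample_means), 2), "to", round(max(sample_means), 2))
```

The list `sample_means` is an empirical approximation of the sampling distribution of the mean; enumerating all possible samples is only practical for very small populations.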
• Do you expect all the sample means to be the same?
• No, because there is sampling error (variability)
• Sampling variability: the value of any statistic varies in
repeated random sampling
• Statistical inference: we make inferences or generalizations
about population parameters based on observed sample
statistics.
• Suppose we want to generate an estimate of a continuous variable
in a population (e.g., weight, HDL cholesterol level).
• It is very typical to estimate the mean of a continuous variable in a
population.
• The mean of a representative sample is a very good estimate of the
unknown population mean.
• If a second sample is selected, that sample might produce a slightly
different estimate (i.e., the mean of the second sample might be
slightly different than the mean of the first).
• Whenever we perform statistical inference, we must recognize that
we are working with incomplete information: specifically,
only a fraction of the population.
• When we make estimates about population parameters based on
sample statistics, it is extremely important to quantify the precision
in our estimates.
• This is done using probability and, in particular, the probability
models we have just discussed (e.g., the normal probability distribution)
• Consider the following small population consisting of N=6 patients
who recently underwent total hip replacement.
• We are interested in a patient’s self-reported pain-free function,
rated on a scale of 0 to 100, with higher scores representing better
function (e.g., 0=severely limited and painful functioning to
100=completely pain-free functioning), measured 3 months post-
procedure.
• The data are shown below and are ordered from smallest to largest:
25 50 80 85 90 100
• A box-whisker plot of the population data is presented
below (Figure 1)
• The distribution of pain-free function scores is
slightly skewed, with the majority of patients
reporting high scores.
• Suppose we did not have the population data and
instead were interested in estimating the mean pain-
free function score based on a sample.
• Suppose we planned to take a sample of size n=4.
• The table below shows all possible samples of size n=4
from the population of N=6 when sampling without
replacement.
• (Sampling without replacement means that we select an
individual and, with that person set aside, we select a second
from those remaining, and so on.
• In contrast, when sampling with replacement, we make a
selection, record that selection, and place that person
back before making a second selection.
• When sampling with replacement, the same individual
can be selected into the sample multiple times.)
• The right column shows the sample mean based on the
four observations contained in that sample
All samples of size n=4
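The full list of samples can be regenerated from the six population scores. The sketch below is one way to do it, assuming the samples are listed in the order produced by `itertools.combinations`; that ordering gives a first sample with mean 60 and a last sample with mean 88.75, but the numbering of the samples in between is an assumption.

```python
from itertools import combinations

scores = [25, 50, 80, 85, 90, 100]        # population from the slides, N = 6
samples = list(combinations(scores, 4))   # every sample of size n = 4, no replacement

print(f"{'Sample':<8}{'Scores':<22}{'Mean':>7}")
for i, s in enumerate(samples, start=1):
    sample_mean = sum(s) / len(s)
    print(f"{i:<8}{str(s):<22}{sample_mean:>7.2f}")

print("number of samples:", len(samples))
```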
• The probability of selecting any particular sample in
the table above is 1/15 ≈ 0.07.
• Suppose by chance we select Sample 1.
• The mean of Sample 1 is 60.
• If we based our estimate of the unknown population mean
on Sample 1, and particularly on the sample mean of
Sample 1, we would underestimate the true population
mean (µ= 71.7).
• If we selected Sample 15, we would overestimate the
population mean because the mean of Sample 15 is 88.8.
• The collection of all possible sample means (in this example,
there are 15 distinct samples that are produced by sampling
four individuals at random without replacement) is called the
sampling distribution of the sample means.
• We consider it a population because it includes all possible
values produced in this case by a specific sampling scheme.
• If we compute the mean and standard deviation of this
population of sample means, we get a mean of µ_x̄ = 71.7
and a standard deviation denoted σ_x̄.
• The subscripts here are to distinguish these parameters from
those based on the population data (x).
• To be consistent, the parameters based on the population
data could include a subscript x.
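As a rough check, the mean and standard deviation of the 15 sample means can be computed directly. The sketch assumes both the six scores and the 15 sample means are treated as complete populations, so the standard deviation uses divisor N (`statistics.pstdev`); a different divisor convention would give a slightly different value.

```python
from itertools import combinations
from statistics import fmean, pstdev

scores = [25, 50, 80, 85, 90, 100]   # population from the slides
sample_means = [sum(s) / len(s) for s in combinations(scores, 4)]

# Parameters of the population data (subscript x).
mu_x, sigma_x = fmean(scores), pstdev(scores)
# Parameters of the population of sample means (subscript x-bar).
mu_xbar, sigma_xbar = fmean(sample_means), pstdev(sample_means)

print(f"population data: mean = {mu_x:.1f}, SD = {sigma_x:.1f}")
print(f"sample means:    mean = {mu_xbar:.1f}, SD = {sigma_xbar:.1f}")
```

The two means agree, while the standard deviation of the sample means is much smaller than that of the population data.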
• Notice that the mean of the sample means is µ_x̄ = 71.7, which is
precisely the value of the population mean (µ).
• This will always be the case.
• Specifically, the mean of the sampling distribution of the sample
means will always be equivalent to the population mean.
• This is important as it indicates that, on average, the sample mean
is equal to the population mean.
• This is the definition of an unbiased estimator.
• Unbiasedness is a desirable property in an estimator.
• Notice also that the variability in the sample means is much
smaller than the variability in the population; this will also
always be the case.
• A box-whisker plot of the population of sample means is
shown in Figure 2.
• Notice that the distribution of the sample means is
more symmetric and has a much more restricted
range (60 to 88.8) than the distribution of the
population data (25 to 100) shown in Figure 1.
• The importance of these observations is stated
formally in the Central Limit Theorem
Central Limit Theorem
• Suppose we have a population with known mean and
standard deviation, µ and σ, respectively
• The distribution of the population can be normal or
it can be non-normal (e.g., skewed toward the high or low
end, or flat)
• If we take simple random samples of size n from the
population with replacement, then for large samples
(usually defined as samples with n ≥ 30), the sampling
distribution of the sample means is approximately normally
distributed with a mean of µ_x̄ = µ and a standard deviation
of σ_x̄ = σ/√n
• Regardless of the distribution of the population (normal
or not), as long as the sample is sufficiently large
(usually n≥30), then the distribution of the sample
means is approximately normal
• If the outcome in the population is normal, then the
result holds for samples of any size (i.e., the sampling
distribution of the sample means is normal even for
samples of size less than 30)
• If the outcome in the population is dichotomous, then
the result holds for samples that meet the following criterion:
o np≥5 and nq≥5, where n is the sample size and p is the
probability of success and q is the probability of failure
in the population; q=1-p
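A small simulation can illustrate the theorem for a non-normal population. Everything specific here is an assumption chosen for the example: a right-skewed exponential population with µ = 1 and σ = 1, samples of size n = 40, and 5000 repeated samples. Independent draws stand in for sampling with replacement.

```python
import math
import random
import statistics

rng = random.Random(1)      # fixed seed so the run is reproducible
n, repeats = 40, 5000       # sample size and number of samples (assumed)
mu, sigma = 1.0, 1.0        # mean and SD of an exponential(rate=1) population

# Each inner list is one sample; its mean is one point of the sampling distribution.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

se_theory = sigma / math.sqrt(n)              # sigma / sqrt(n)
mean_obs = statistics.fmean(sample_means)
sd_obs = statistics.pstdev(sample_means)

# Share of sample means within 1.96 theoretical SDs of mu;
# this is close to 0.95 when the sampling distribution is roughly normal.
within = sum(abs(m - mu) <= 1.96 * se_theory for m in sample_means) / repeats

print(f"mean of sample means: {mean_obs:.3f} (theory: {mu})")
print(f"SD of sample means:   {sd_obs:.3f} (theory: {se_theory:.3f})")
print(f"share within 1.96 SD: {within:.3f}")
```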
Example-1
• Suppose we measure a characteristic in a population and that
this characteristic follows a normal distribution with a mean
of 75 and standard deviation of 8. If we take simple random
samples with replacement of size n=10 from this population;
a) Compute the mean of the sample means and the standard
deviation of the sample means
b) Is the distribution of sample means normal?
Answers:
a) The mean of sample means is 75 and the standard deviation of
sample means is 2.5,
i.e., µ_x̄ = µ = 75 and σ_x̄ = σ/√n = 8/√10 ≈ 2.5
b) Yes. Because the population is normal, the distribution of
sample means is also normal
Figure panels: the population distribution; the sampling distribution (n=10)
• If we take simple random samples with replacement of
size n=5, we get a similar distribution
• The distribution of sample means becomes approximately
normal
• The mean of sample means is 75 and the standard deviation of
sample means is 8/√5 ≈ 3.6
• NB: the variability in sample means is larger for samples of
size 5 as compared to samples of size 10
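The arithmetic for both sample sizes follows from σ/√n and can be checked with a short sketch. The helper name `sd_of_sample_means` is made up for this example, and the probability question at the end (a sample mean above 78) is a hypothetical illustration rather than part of the original exercise.

```python
import math
from statistics import NormalDist

def sd_of_sample_means(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

mu, sigma = 75, 8   # population mean and SD from Example-1
se_by_n = {n: sd_of_sample_means(sigma, n) for n in (5, 10)}
for n, se in se_by_n.items():
    print(f"n = {n:>2}: SD of sample means = {se:.2f}")

# Hypothetical question (not in the slides): P(sample mean > 78) when n = 10.
p_above_78 = 1 - NormalDist(mu, se_by_n[10]).cdf(78)
print(f"P(sample mean > 78 | n = 10) = {p_above_78:.3f}")
```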
• The distribution of the sample means (n=5)
Example-2
• Suppose we measure a characteristic in a population and that
this characteristic is dichotomous with 30% of the population
classified as a success (i.e., p=0.30). The characteristic might
represent disease status, the presence or absence of a genetic
abnormality, or the success of a medical procedure.
• If we take simple random samples with replacement of size
n=20 from this binomial population and for each sample we
compute the sample mean, the distribution of sample means
should be approximately normal because min[np, n(1-p)] =
min(6, 14) = 6 ≥ 5
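• The number of successes per sample has mean np = 20(0.30) = 6
and standard deviation √(npq) = √4.2 ≈ 2.05; equivalently, the
sample proportion has mean p = 0.30 and standard deviation
√(pq/n) ≈ 0.10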
• The Central Limit Theorem criterion is met: np = 6 ≥ 5 and nq = 14 ≥ 5
• Suppose we take simple random samples with replacement of
size n=10.
• In this scenario, we do not meet the sample size requirement
for the results of the Central Limit Theorem to hold, because
min[np, n(1-p)] = min(3, 7) = 3 < 5
• The distribution of sample means based on samples of size
n=10 is not quite normally distributed.
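The sample size criterion for a dichotomous outcome is easy to check in code. The sketch below applies the np ≥ 5 and nq ≥ 5 rule of thumb to n=20 and n=10 with p=0.30, and reports the standard binomial results for the number of successes (mean np, SD √(npq)) and for the sample proportion (mean p, SD √(pq/n)). The function names are illustrative assumptions.

```python
import math

def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb: np >= threshold and n(1-p) >= threshold."""
    return min(n * p, n * (1 - p)) >= threshold

def count_mean_sd(n, p):
    """Mean and SD of the number of successes in a sample of size n."""
    return n * p, math.sqrt(n * p * (1 - p))

def proportion_mean_sd(n, p):
    """Mean and SD of the sample proportion for a sample of size n."""
    return p, math.sqrt(p * (1 - p) / n)

p = 0.30
for n in (20, 10):
    count_mean, count_sd = count_mean_sd(n, p)
    prop_mean, prop_sd = proportion_mean_sd(n, p)
    print(f"n = {n}: criterion met = {normal_approx_ok(n, p)}, "
          f"count mean = {count_mean:.1f}, count SD = {count_sd:.2f}, "
          f"proportion mean = {prop_mean:.2f}, proportion SD = {prop_sd:.3f}")
```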
The sampling distribution (n=20)
Non-normal sampling distribution
• The sample size must be larger for the distribution to
approach normality
The sampling distribution (n=10)
Thanks!