Introduction to Statistical Inference
Edwin Leuven
Introduction
- Define key terms that are associated with inferential statistics.
- Revise concepts related to random variables, the sampling distribution and the Central Limit Theorem.
Introduction
Until now we’ve mostly dealt with descriptive statistics and with
probability.
In descriptive statistics one investigates the characteristics of the
data
- using graphical tools and numerical summaries
The frame of reference is the observed data
In probability, the frame of reference is all data sets that could have
potentially emerged from a population
Introduction
The aim of statistical inference is to learn about the population
using the observed data
This involves:
- computing something with the data
  - a statistic: a function of the data
- interpreting the result
  - in probabilistic terms: the sampling distribution of the statistic
Introduction
[Diagram: Probability runs from the Population to the Sample (Data); Inference runs from the Sample back to the Population.]

Population: f_X(x), with parameter E[X] = µ
Sample (Data): (x1, . . . , xn), with statistic x̄ = (x1 + . . . + xn)/n
Point estimation
We want to estimate a population parameter using the observed
data.
- e.g. some measure of variation, an average, min, max, a quantile, etc.
Point estimation attempts to obtain a best guess for the value of
that parameter.
An estimator is a statistic (function of data) that produces such a
guess.
By “best” we usually mean an estimator whose sampling distribution is more concentrated about the population parameter value than those of other estimators.
Hence, the choice of a specific statistic as an estimator depends on
the probabilistic characteristics of this statistic in the context of the
sampling distribution.
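As an illustration (a minimal sketch with a made-up numeric sample x), an estimator in R is simply a function applied to the data:

x = c(4.2, 5.1, 3.8, 4.9, 5.5, 4.4)  # hypothetical sample
mean(x)            # point estimate of the population mean
median(x)          # point estimate of the population median
quantile(x, .25)   # point estimate of the first quartile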
Confidence Interval
We can also quantify the uncertainty (sampling distribution) of our
point estimate.
One way of doing this is by constructing an interval that is likely to
contain the population parameter.
Such an interval, computed on the basis of the data, is called a confidence interval.
The sampling probability that the confidence interval will indeed
contain the parameter value is called the confidence level.
We construct confidence intervals for a given confidence level.
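For illustration, a sketch of a 95% confidence interval for a proportion based on the Normal approximation (the values n = 1,000 and p̂ = 0.54 anticipate the polling example below):

n = 1000
p_hat = 0.54
se = sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
c(p_hat - 1.96 * se, p_hat + 1.96 * se)  # roughly [0.509, 0.571]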
Hypothesis Testing
The scientific paradigm involves the proposal of new theories that
presumably provide a better description of the laws of Nature.
If the empirical evidence is inconsistent with the predictions of the old theory but not with those of the new theory
- then the old theory is rejected in favor of the new one
- otherwise, the old theory maintains its status
Statistical hypothesis testing is a formal method, built on this paradigm, for determining which of the two hypotheses should prevail.
Statistical hypothesis testing
Each of the two hypotheses, the old and the new, predicts a different
distribution for the empirical measurements.
In order to decide which of the distributions is more in tune with the data, a statistic is computed.
This statistic t is called the test statistic.
A threshold c is set and the old theory is rejected if t > c.
Hypothesis testing consists in asking a binary question about the
sampling distribution of t
Statistical hypothesis testing
This decision rule is not error-proof, since the test statistic may fall by chance on the wrong side of the threshold.
Suppose we know the sampling distribution of the test statistic t
We can then set the probability of making an error to a given level
by setting c
The probability of erroneously rejecting the currently accepted
theory (the old one) is called the significance level of the test.
The threshold is selected in order to assure a small enough
significance level.
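As a sketch (borrowing the polling example from later slides), a test of the old theory p = 0.5 against p > 0.5 at the 5% significance level:

# test statistic: standardized difference from p = 0.5 under the old theory
n = 1000
p_hat = 0.54
t = (p_hat - 0.5) / sqrt(0.5 * (1 - 0.5) / n)
# threshold c chosen so that Pr(reject | old theory true) = 0.05
c = qnorm(0.95)
t > c  # TRUE here: reject the old theory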
Multiple measurements
The method of hypothesis testing is also applied in other practical settings where decisions must be made.
Consider a randomized trial of a new treatment for a medical condition where
- the treated get the new treatment
- the controls get the old treatment
and we measure their response.
We now have 2 measurements that we can compare.
We will use statistical inference to make a decision about whether
the new treatment is better.
Statistics
Statistical inference, be it point estimation, confidence intervals, or hypothesis testing, is based on statistics computed from the data.
A statistic is a formula applied to the data, and we think of it as a statistical summary of the data.
Examples of statistics are
- the sample average and
- the sample standard deviation
For a given dataset a statistic has a single numerical value,
but it will be different for a different random sample!
The statistic is therefore a random variable.
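A quick sketch (with a made-up Normal population) of how the same statistic takes a different value in each random sample:

# three random samples from the same population: three different sample means
mean(rnorm(100, mean = 5, sd = 2))
mean(rnorm(100, mean = 5, sd = 2))
mean(rnorm(100, mean = 5, sd = 2))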
Statistics
It is important to distinguish between
1. the statistic (a random variable)
2. the realisation of the statistic for a given sample (a number)
We therefore denote the statistic with capital letters, e.g. the sample mean:
- X̄ = (X1 + . . . + Xn)/n
and the realisation of the statistic with small letters:
- x̄ = (x1 + . . . + xn)/n
Example: Polling
Imagine we want to predict whether the left bloc or the right bloc will get a majority in parliament.
Key quantities:
- N = 4,166,612, the population size
- p = (# people who support the right) / N
- 1 − p = (# people who support the left) / N
We can ask the following questions:
1. What is p?
2. Is p > 0.5?
3. We estimate p but are we sure?
Example: Polling
We poll a random sample of n = 1,000 people from the population
without replacement:
- choose person 1 at random from the N, choose person 2 at random from the N − 1 remaining, etc.
- or, choose a random set of n people from all (N choose n) = N!/(n!(N − n)!) possible sets

Let
    Xi = 1 if person i supports the right
    Xi = 0 if person i supports the left
and denote our data by x1 , . . . , xn
Then we can estimate p by
p̂ = (x1 + . . . + xn )/n
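A minimal sketch (simulating a hypothetical poll, since in practice p is unknown and we only observe the sample):

x = rbinom(1000, 1, .54)  # assumed true support p = 0.54
p_hat = mean(x)           # equals (x1 + . . . + xn)/n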
Example: Polling
To construct the poll we randomly sampled the population
With a random sample, each of the N people in the population is equally likely to be the ith person sampled, therefore
    E[Xi] = 1 · p + 0 · (1 − p) = p

and therefore

    E[p̂] = E[(X1 + . . . + Xn)/n] = (E[X1] + . . . + E[Xn])/n = p
The “average” value of p̂ is p, and we say that p̂ is unbiased
Unbiasedness refers to the average error over repeated sampling,
and not the error for the observed data!
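A sketch checking unbiasedness by simulation (assuming p = 0.54):

# the average of p_hat over many repeated samples is close to p
p_hats = replicate(10000, mean(rbinom(1000, 1, .54)))
mean(p_hats)  # close to 0.54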
Example: Polling
Say 540 support the right, so p̂ = 0.54
Does this mean that in the population:
- p = 0.54?
- p > 0.5?
The data are a realization of a random sample and p̂ is therefore a
random variable!
For a given sample we will therefore have estimation error

    estimation error = p̂ − p ≠ 0
which comes from the difference between our sample and the
population
Example: Polling
When sampling with replacement the Xi are independent, and
- Var[p̂] = p(1 − p)/n

When sampling without replacement the Xi are not independent:

    Pr(Xi = 1 | Xj = 1) = (N1 − 1)/(N − 1) ≠ N1/(N − 1) = Pr(Xi = 1 | Xj = 0)

where N1 = pN is the number of right supporters, and we can show that
- Var[p̂] = (p(1 − p)/n) · (1 − (n − 1)/(N − 1))
For N = 4,166,612, n = 1,000, and p = 0.54, the standard deviation of p̂ ≈ 0.016.
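A sketch of this calculation:

# s.d. of p_hat when sampling without replacement
N = 4166612; n = 1000; p = 0.54
sqrt(p * (1 - p) / n * (1 - (n - 1) / (N - 1)))  # roughly 0.0158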
But what is the distribution of p̂?
The Sampling Distribution
Statistics vary from sample to sample
The sampling distribution of a statistic
- is the nature of this variability
- can sometimes be determined and often approximated

The distribution of the values we get when computing a statistic in (infinitely) many random samples is called the sampling distribution of that statistic.
The Sampling Distribution
We can sample from
- a population
  - eligible voters in Norway today
- a model (theoretical population)
  - Pr(vote right bloc) = p
The sampling distribution of a statistic depends on the population
distribution of values of the variables that are used to compute the
statistic.
Sampling Distribution of Statistics
Theoretical models describe the distribution of a measurement as a
function of one or more parameters.
For example (see the sketch after this list),
- in n trials with success probability p, the total number of successes follows a Binomial distribution with parameters n and p
- if events happen at rate λ per unit time, then the number of events occurring in a time interval of length t follows a Poisson distribution with parameter λt
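A minimal sketch of these two models in R (with made-up parameter values):

dbinom(5, size = 10, prob = 0.5)  # Pr(5 successes in 10 trials with p = 0.5)
dpois(3, lambda = 2 * 1.5)        # Pr(3 events when the rate is 2 and t = 1.5)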
Sampling Distribution of Statistics
More generally the sampling distribution of a statistic depends on
- the sample size
- the population distribution of the data used to construct the statistic
and can be complicated!

We can sometimes learn about the sampling distribution of a statistic by
- deriving the finite sample distribution
- approximating it with a Normal distribution in large samples
- approximating it through numerical simulation
Finite sample distributions
Sometimes we can derive the finite sample distribution of a statistic
Let the fraction of people voting right in the population be p
Because we know the distribution of the data (up to the unknown
parameter p) we can derive the sampling distribution
In a random sample of size n the probability of observing k people voting for the right can be derived and follows a binomial distribution:

    Pr(X = k) = (n choose k) · p^k · (1 − p)^(n−k)
This depends on p which is unknown.
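A sketch (with assumed values n = 1,000 and p = 0.54) evaluating this distribution in R:

n = 1000; p = 0.54
dbinom(540, size = n, prob = p)  # Pr(X = 540), i.e. Pr(p_hat = 0.54)
sum(dbinom(500:580, n, p))       # Pr(500 <= X <= 580)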
This approach is however often not feasible:
the statistic may be complicated, depend on several variables, and the population distribution of these variables may be unknown.
Theoretical Distributions of Observations (Models)
Distribution    Sample Space      f(x)
Binomial        0, 1, . . . , n   (n choose k) p^k (1 − p)^(n−k)
Poisson         0, 1, 2, . . .    λ^k exp(−λ)/k!
Uniform         [a, b]            1/(b − a)
Exponential     [0, ∞)            λ exp(−λx)
Normal          (−∞, ∞)           (1/√(2πσ²)) exp(−((x − µ)/σ)²/2)

Distribution    E[X]         Var(X)        R
Binomial        np           np(1 − p)     d,p,q,rbinom
Poisson         λ            λ             d,p,q,rpois
Uniform         (a + b)/2    (b − a)²/12   d,p,q,runif
Exponential     1/λ          1/λ²          d,p,q,rexp
Normal          µ            σ²            d,p,q,rnorm
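A sketch of the four R prefixes (d = density, p = CDF, q = quantile, r = random draws) for, e.g., the Normal distribution:

dnorm(0)      # density of N(0, 1) at 0
pnorm(1.96)   # Pr(Z <= 1.96), about 0.975
qnorm(0.975)  # 97.5% quantile, about 1.96
rnorm(5)      # 5 random draws from N(0, 1)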
Example: Polling
hist(replicate(10000, mean(rbinom(1000, 1, .54))),
     main = "", xlab = "p_hat", prob = TRUE, breaks = 50)

[Histogram: the simulated sampling distribution of p̂ (density scale), roughly bell-shaped and centered near 0.54, ranging from about 0.48 to 0.60.]
The Normal Approximation
In general, the sampling distribution of a statistic is not the same as
the sampling distribution of the measurements from which it is
computed.
If the statistic is
1. (a function of) a sample average and
2. the sample is large
then we can often approximate the sampling distribution with a
Normal distribution
Example: Polling
In the graph p̂ looked like it had a Normal distribution with mean 0.54 and s.d. 0.016
If N ≫ n then the Xi are approximately independent, and if n is large then

    √n (p̂ − p) ∼ N(0, p(1 − p))

or equivalently

    p̂ ∼ N(p, p(1 − p)/n)
by the Central Limit Theorem
Example: Polling
curve(dnorm(x, mean = .54, sd = 0.016),
      col = "darkblue", lwd = 2, add = TRUE, yaxt = "n")

[Plot: the N(0.54, 0.016²) density overlaid on the histogram of p̂; the curve tracks the simulated sampling distribution closely.]
Approximation through numerical simulation
Computerized simulations can be carried out to approximate
sampling distributions.
With a model we can draw many random samples, compute the statistic, and characterize its sampling distribution.
Assume price ∼ Exponential(λ) and consider samples of size n = 201.

Then E[price] = 1/λ and Var[price] = 1/λ², and therefore the standard deviation of the average price is

    √((1/λ²)/201) ≈ 0.0705/λ
Approximation through numerical simulation
Remember that 95% of the probability density of a normal
distribution is within 1.96 s.d. of its mean.
The Normal approximation for the sampling distribution of the
average price suggests
    1/λ ± 1.96/(λ√n)

should contain 95% of the distribution.
Approximation through numerical simulation
We may use simulations in order to validate this approximation
Assume λ = 1/12,000

# fraction of simulated sample means within 1.96 s.d. of the mean
X.bar = replicate(10^5, mean(rexp(201, 1/12000)))
mean(abs(X.bar - 12000) <= 1.96 * 0.0705 * 12000)
## [1] 0.95173
This shows that the Normal approximation is adequate in this example.
How about other values of n or λ?
Approximation through numerical simulation
Simulations may also be used in order to compute probabilities in
cases where the Normal approximation does not hold.
Consider the following statistic
    (min(xi) + max(xi))/2
where Xi ∼ Uniform(3, 7) and n = 100
What interval contains 95% of its sampling distribution?
Approximation through numerical simulation
Let us carry out the simulation that produces an approximation of
the central region that contains 95% of the sampling distribution of
the mid-range statistic for the Uniform distribution:
# simulate the sampling distribution of the mid-range statistic
mid.range <- rep(0, 10^5)
for (i in 1:10^5) {
  X <- runif(100, 3, 7)
  mid.range[i] <- (max(X) + min(X)) / 2
}
quantile(mid.range, c(0.025, 0.975))
##      2.5%     97.5%
## 4.9409107 5.0591218
Observe that (approximately) 95% of the sampling distribution of the statistic is in the range [4.9409, 5.0591].
Approximation through numerical simulation
Simulations can be used in order to compute any numerical
summary of the sampling distribution of a statistic
To obtain the expectation and the standard deviation of the
mid-range statistic of a sample of 100 observations from the
Uniform(3, 7) distribution:
mean(mid.range)
## [1] 4.9998949
sd(mid.range)
## [1] 0.027876151
Approximation through numerical simulation
Simulations can also be used to approximate sampling distributions using only the data:
1. draw a random sample of size n with replacement from our data
2. compute our statistic
3. do 1. & 2. many times
The resulting distribution of statistics across our resamples is an
approximation of the sampling distribution of our statistic
The idea is that a random sample of a random sample from the population is again a random sample from the population.
This is called the bootstrap and computes the sampling distribution
without a model!
Approximation through numerical simulation
n = 1000
data = rbinom(n, 1, .54)  # true distribution, usually unknown

estimates = rep(0, 999)
for (i in 1:999) {
  id = sample(1:n, n, replace = TRUE)  # resample the data with replacement
  estimates[i] = mean(data[id])        # recompute the statistic
}
sd(estimates)  # bootstrap estimate of the s.d. of p_hat
## [1] 0.015946413

sqrt(.54 * (1 - .54) / 1000)  # true value, usually unknown
## [1] 0.015760711
Summary
Today we looked at the elements of statistical inference
- Estimation:
  - Determining the distribution, or some characteristic of it. (What is our best guess for p?)
- Confidence intervals:
  - Quantifying the uncertainty of our estimate. (What is a range of values to which we're reasonably sure p belongs?)
- Hypothesis testing:
  - Asking a binary question about the distribution. (Is p > 0.5?)
Summary
In statistical inference we think of data as a realization of a random
process
There are many reasons why we think of our data as (ex-ante)
random:
1. We introduced randomness in our data collection (random
sampling, or random assigning treatment)
2. We are actually studying a random phenomenon (coin tosses or
dice rolls)
3. We treat as random the part of our data that we don’t
understand (errors in measurements)
In the coming weeks we will take a closer look at how this randomness affects what we can learn about the population from the data.