
Probability and Statistics

Econ 2560, Spring 2024

Prof. Josh Abel

(Chapters 2 and 3 [esp. 3.1-3.4])


Motivation

Empirical/econometric analysis requires grappling with uncertainty
Samples are too small to pin down objects with certainty
Even a census doesn’t remove all doubt
Probability and statistics are the formal studies of uncertainty
2 sides of the same coin
Probability: given a model with uncertainty, what data might I see?
Statistics: given observed data, what model is operative?
Lessons from the “simple” setting of estimating a mean carry over to more complex problems later in the semester
Motivation (2)

This slide deck is motivated by the following thought experiment:
“Suppose we observe the average income of a small sample drawn from the population.
“What is our best guess for the average income of the population?
“What range of numbers besides our best guess could also be reasonable?”
We will build up answers to these questions methodically
Random variables

Random variable (RV): a numerical function of an uncertain outcome
Suppose we flip a coin 3 times. Some RVs are:
# of Heads (0, 1, 2, 3)
# of Heads on 2nd flip (0, 1)
Ratio of # of Heads on 3rd flip to # of Heads on 1st flip (0, 1, ∞)
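To make this concrete, here is a minimal Python sketch (not from the slides; all names are illustrative) that draws one uncertain outcome and evaluates the three RVs listed above:

```python
import random

# One uncertain outcome: three coin flips
flips = [random.choice(["H", "T"]) for _ in range(3)]

# Each RV is just a number computed from that outcome
num_heads = sum(f == "H" for f in flips)             # 0, 1, 2, or 3
heads_on_2nd = int(flips[1] == "H")                  # 0 or 1
h1, h3 = int(flips[0] == "H"), int(flips[2] == "H")
# Ratio of heads on 3rd flip to heads on 1st flip: 0, 1, or infinity
ratio = h3 / h1 if h1 else (float("inf") if h3 else float("nan"))  # 0/0 left undefined
print(flips, num_heads, heads_on_2nd, ratio)
```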
Distribution

The distribution of a “discrete” RV gives the probability that each potential value of the RV will indeed be realized
Will discuss “continuous” RVs shortly
These probabilities:
Must be no smaller than 0 and no larger than 1 (p_i ∈ [0, 1])
Must sum to 1 (Σ_i p_i = 1)
Distribution with fair coin, independent flips
Distribution with 80-20 coin, independent flips
Distribution with fair coin, perfectly correlated flips
Mean

A distribution can be a complex object
Imagine if we did 1,000 coin flips...
An object commonly used to summarize a distribution is its mean
Probability-weighted average (also, “expected value”)
E[X] = Σ_i x_i · p_i
Often denoted µ_X
Most econometric analyses try to estimate means
Key mathematical fact: means are linear
E[a·X + b·Y + c] = a·E[X] + b·E[Y] + c
(E[3·X + 4·Y − 9] = 3·E[X] + 4·E[Y] − 9)
(Proof)

E[a·X + b·Y + c] = Σ_i p_i · (a·x_i + b·y_i + c)
= Σ_i p_i·a·x_i + Σ_i p_i·b·y_i + Σ_i p_i·c
= a·Σ_i p_i·x_i + b·Σ_i p_i·y_i + c·Σ_i p_i
= a·E[X] + b·E[Y] + c
Distribution with fair coin, independent flips

E[X] = 0.125·0 + 0.375·1 + 0.375·2 + 0.125·3 = 1.5

Distribution with 80-20 coin, independent flips

E[X] = 0.008·0 + 0.096·1 + 0.384·2 + 0.512·3 = 2.4

Distribution with fair coin, perfectly correlated flips

E[X] = 0.5·0 + 0·1 + 0·2 + 0.5·3 = 1.5

Mean only captures central tendency
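As a quick check of these three calculations, a short Python sketch (illustrative, not from the slides) that applies E[X] = Σ_i x_i · p_i to each distribution:

```python
# Each distribution maps a value of X (# of heads) to its probability
dists = {
    "fair, independent":     {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125},
    "80-20, independent":    {0: 0.008, 1: 0.096, 2: 0.384, 3: 0.512},
    "fair, perfectly corr.": {0: 0.5,   1: 0.0,   2: 0.0,   3: 0.5},
}
for name, dist in dists.items():
    mean = sum(x * p for x, p in dist.items())  # E[X] = sum_i x_i * p_i
    print(f"{name}: E[X] = {mean}")             # 1.5, 2.4, 1.5
```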
Variance

Mean gives central tendency but no sense of “spread”
Variance is a separate object, summarizing spread
Probability-weighted average squared deviation from the mean
var(X) = Σ_i p_i · (x_i − µ_X)²
Often denoted σ²_X
Standard deviation is just the square root of the variance: σ_X
Key mathematical fact:
var(a·X + b·Y + c) = a²·var(X) + b²·var(Y) + 2·a·b·cov(X, Y)
(var(6·X − 3·Y + 8) = 36·var(X) + 9·var(Y) + 2·6·(−3)·cov(X, Y))
(Proof)

var(a·X + b·Y + c) = Σ_i p_i · [(a·x_i + b·y_i + c) − (a·µ_X + b·µ_Y + c)]²
= Σ_i p_i · [a·(x_i − µ_X) + b·(y_i − µ_Y)]²
= Σ_i p_i · [a²·(x_i − µ_X)² + b²·(y_i − µ_Y)² + 2·a·b·(x_i − µ_X)·(y_i − µ_Y)]
= a²·Σ_i p_i·(x_i − µ_X)² + b²·Σ_i p_i·(y_i − µ_Y)² + 2·a·b·Σ_i p_i·(x_i − µ_X)·(y_i − µ_Y)
= a²·var(X) + b²·var(Y) + 2·a·b·Σ_i p_i·(x_i − µ_X)·(y_i − µ_Y)
The last sum is the covariance.
Covariance

Covariance measures whether 2 RVs move together
cov(X, Y) = Σ_i p_i · (x_i − µ_X) · (y_i − µ_Y)
Denoted σ_XY
If X and Y are “typically” above (or below) their respective means at the same times, cov(X, Y) > 0
If X is “typically” above its mean when Y is below its mean (and vice versa), cov(X, Y) < 0
If cov(X, Y) = 0, we say X and Y are uncorrelated
Key mathematical facts:
cov(a + b·X, c + d·Y) = b·d·cov(X, Y)
(cov(1 + 6·X, 7 − 3·Y) = 6·(−3)·cov(X, Y))
cov(X, X) = var(X)
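The variance rule and the covariance term can be verified by brute force. The sketch below (my own illustration, using the 3-flip example with X = total heads and Y = heads on the 1st flip) enumerates all 8 equally likely outcomes:

```python
from itertools import product

outcomes = list(product([0, 1], repeat=3))   # 8 equally likely flip patterns
p = 1 / len(outcomes)

X = [sum(o) for o in outcomes]               # total # of heads
Y = [o[0] for o in outcomes]                 # heads on 1st flip

mean = lambda v: sum(p * vi for vi in v)
mu_x, mu_y = mean(X), mean(Y)
var = lambda v, mu: sum(p * (vi - mu) ** 2 for vi in v)
cov_xy = sum(p * (xi - mu_x) * (yi - mu_y) for xi, yi in zip(X, Y))

a, b, c = 6, -3, 8
lhs = var([a * xi + b * yi + c for xi, yi in zip(X, Y)],
          a * mu_x + b * mu_y + c)
rhs = a**2 * var(X, mu_x) + b**2 * var(Y, mu_y) + 2 * a * b * cov_xy
print(lhs, rhs)   # equal: 36*0.75 + 9*0.25 + 2*6*(-3)*0.25 = 20.25
```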
(Co)variance, visualized
Conditional distribution

Can also ask, “What outcomes will I see for Y, given that I see X = x?”
E.g. # of Heads (Y), given flip 1 yielded 0 Heads (X)
This is a conditional distribution
The conditional mean is just the mean of the conditional distribution
Denoted E[Y | X]
“Continuous” RVs

Often we work with variables defined on continuous intervals
E.g. income
Earlier concepts for discrete RVs still apply...
...but have to think about probabilities a little differently
Probability of any single outcome (e.g. $96,724.426580235...) is zero!
Distributions for continuous RVs are represented with density functions
Used to find probabilities for ranges of outcomes
E[X] = ∫_{−∞}^{∞} x · f(x) dx
var(X) = ∫_{−∞}^{∞} (x − µ_X)² · f(x) dx

Probability density function
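To connect the integral definitions to something computable, here is a sketch (assumed example: a N(10, 4) variable; scipy is my choice, not the slides’) that recovers the mean and variance by numerical integration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = norm(loc=10, scale=2).pdf                      # density of N(10, 4)

mean, _ = quad(lambda x: x * f(x), -np.inf, np.inf)
var, _ = quad(lambda x: (x - mean) ** 2 * f(x), -np.inf, np.inf)
print(mean, var)                                   # ~10.0 and ~4.0

# Probabilities come from ranges, not points:
prob, _ = quad(f, 8, 12)                           # Pr(8 < X < 12) ~ 0.683
```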
Normal distribution

Previous slide showed a Normal distribution
Defined by just 2 parameters, mean and variance: N(µ, σ²)
If X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1)
N(0, 1) is the standard Normal distribution and is very well understood
The Normal distribution shows up everywhere, as we will see
(Proof)

Suppose Y ∼ N(2, 4). Define X = (Y − 2)/√4 ↔ Y = 2·X + 2.
E[X] = E[(Y − 2)/√4] = (1/√4)·(E[Y] − 2) = 0
var(X) = var((Y − 2)/√4) = (1/4)·var(Y) = 1
So X ∼ N(0, 1)
2.5% = Pr(X < −1.96) = Pr(2·X + 2 < 2·(−1.96) + 2) = Pr(Y < −1.92)
2.5% = Pr(X > 1.96) = Pr(2·X + 2 > 2·1.96 + 2) = Pr(Y > 5.92)
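These tail probabilities are easy to confirm numerically; a sketch with scipy (my library choice):

```python
from scipy.stats import norm

Y = norm(loc=2, scale=2)     # Y ~ N(2, 4): mean 2, sd sqrt(4) = 2
print(Y.cdf(-1.92))          # Pr(Y < -1.92) ~ 0.025
print(1 - Y.cdf(5.92))       # Pr(Y >  5.92) ~ 0.025
print(norm.ppf(0.975))       # 1.96, the standard Normal cutoff used above
```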
Statistics

Suppose you have data on earnings and years of education
Want to compare average earnings of high school grads (Educ = 12) to those who fell just short (Educ = 11)
E[Y_i | Educ_i = 12] − E[Y_i | Educ_i = 11]
µ_12 − µ_11
Current Population Survey data
Estimators

CPS collects data from a random sample of the population
People are chosen randomly (i.e. independently) from the population at large (i.e. identically distributed)
i.i.d.
An estimator is a rule/function that takes data as an input and generates an estimate as an output
Because the data are drawn at random, the estimator is itself a RV
Denoted µ̂(X)
Sample mean

To start, let’s try to estimate µ_12
We don’t have the population to compute the mean, so let’s use the sample mean as our estimator
X̄ = (1/n) · Σ_i X_i
The sample mean is appealing because under weak assumptions, it is:
An unbiased estimator of the population mean
A consistent estimator of the population mean
And under some slightly stronger assumptions, we know its sampling distribution
Sample mean is unbiased

Bias = E[µ̂(X) − µ]
= E[µ̂(X)] − µ
= E[(1/n)·Σ_i X_i] − µ
= (1/n)·Σ_i E[X_i] − µ
= (1/n)·(n·µ) − µ
= 0
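A small Monte Carlo sketch (hypothetical setup, not from the slides) illustrates the distinction drawn on the next slide: right on average, yet possibly far off in any one sample:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 10.0, 5, 100_000
draws = rng.exponential(scale=mu, size=(reps, n))   # i.i.d. draws with mean mu
xbars = draws.mean(axis=1)                          # one estimate per sample
print(xbars.mean())   # ~10.0: unbiased, E[Xbar] = mu
print(xbars.std())    # large: any single tiny sample can still miss badly
```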
Consistency

Unbiased: the sample mean is correct “on average” in a sample of any size
But in your particular draw of the data, it may be very wrong!
Informally, consistency means that as the sample size gets very large (n → ∞), you are assured of not only being right “on average” – the sample mean in your particular (large!) sample will be spot-on for the true average
This is the Law of Large Numbers
Because (1/n)·Σ_i X_i is unbiased (for the population mean), the key to showing consistency is that var(X̄) → 0 as n → ∞
Sample mean is consistent

var(X̄) = var((1/n)·Σ_i X_i)
= (1/n²) · [Σ_i var(X_i) + Σ_i Σ_{j≠i} cov(X_i, X_j)]
= (1/n²) · [n·σ²_X + 0]
= σ²_X / n
lim_{n→∞} var(X̄) = 0
Current Population Survey data

[Figure: running average hourly earnings (y-axis, roughly 10–30) against sample size (x-axis, 0–1000)]
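A simulated analogue of this figure (illustrative parameters; not the actual CPS data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3.0, sigma=0.5, size=1000)   # skewed, earnings-like draws
running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
print(running_mean[9], running_mean[99], running_mean[999])
# Settles toward the true mean exp(3 + 0.5**2 / 2) ~ 22.76 as n grows
```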
Distribution of the sample mean

Just solved for Ȳ’s mean (µ_Y) and variance (σ²_Y / n)
Those are coarse summary stats – can we get the whole distribution?
It depends
If Y_i is Normal (N(µ_Y, σ²_Y)), then Ȳ is Normal: Ȳ ∼ N(µ_Y, σ²_Y / n)
But in general, the distribution of Ȳ can be very complex
Bad news: usually not reasonable to think our variables are Normal
Good news: as the sample grows, the distribution simplifies back to Normal!
This is the remarkable Central Limit Theorem.
Central Limit Theorem, illustrated
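A sketch of the CLT in action (assumed parameters, not from the slides): even though the underlying RV is highly skewed, standardized sample means behave like N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 50_000
samples = rng.exponential(scale=1.0, size=(reps, n))   # skewed population, mean 1, sd 1
z = (samples.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))  # standardized sample means
print(np.mean(np.abs(z) > 1.96))   # ~0.05, as the N(0,1) approximation predicts
```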
Summary of Ȳ’s distribution

                          n small                    n → ∞
Y_i ∼ N(µ_Y, σ²_Y)        Ȳ ∼ N(µ_Y, σ²_Y/n)        Ȳ ∼ N(µ_Y, σ²_Y/n)
Y_i not Normal            ??                         Ȳ ∼ N(µ_Y, σ²_Y/n)
Large sample approximations (n → ∞)

LLN says that when the sample is massive, X̄ = µ_X
But if the sample is not massive...
Must acknowledge that probably X̄ ≠ µ_X
But it should be “close”
To quantify “close,” need the distribution of X̄
CLT gives us the (approximate) distribution of X̄ when n is large-but-not-so-large-that-we-believe-LLN
Even with n at just 100, N(µ_X, σ²_X/n) can be an excellent approximation for X̄’s distribution
Spares us from having to figure out some complicated distributions!
Back to CPS data...

We are now ready to estimate µ_12, µ_11, and (µ_12 − µ_11)
This has 2 parts:
Give a best guess (point estimate)
Quantify the uncertainty about that guess
Point estimate: the sample mean

Seems reasonable to use the sample mean as the best guess:
Ȳ_12 = (1/n_12) · Σ_{i=1}^{n_12} Y_i
We know this is unbiased and consistent
Can show that Ȳ minimizes this loss function:
Loss(µ̂) = Σ_{i=1}^{n} (Y_i − µ̂)²
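A numerical check of this claim (hypothetical data; minimize_scalar is my choice of optimizer, not the slides’):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([12.0, 15.5, 9.0, 22.0, 14.5])   # hypothetical hourly earnings
loss = lambda mu: np.sum((y - mu) ** 2)       # squared-error loss
res = minimize_scalar(loss)
print(res.x, y.mean())   # both ~14.6: the minimizer is the sample mean
```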
Point estimate: results

In the data:
µ̂_12 = Ȳ_12 = $16.62/hr
µ̂_11 = Ȳ_11 = $12.18/hr
Estimated µ_12 − µ_11 = Ȳ_12 − Ȳ_11 = $4.44/hr
Sample standard deviation

To quantify the uncertainty around our point estimates, we want to know the variances of the estimators
We know var(Ȳ) = σ²_Y / n, but what is σ²_Y?
We have to estimate it
The following is an unbiased, consistent estimator of σ²_Y:
s²_Y = (1/(n−1)) · Σ_{i=1}^{n} (Y_i − Ȳ)²
Therefore, we will use it as our estimator: σ̂²_Y = s²_Y, and σ̂²_Ȳ = s²_Y / n
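In code (hypothetical data), the pieces line up as follows; note that ddof=1 gives the n − 1 divisor:

```python
import numpy as np

y = np.array([12.0, 15.5, 9.0, 22.0, 14.5])   # hypothetical sample
n = len(y)
s2 = y.var(ddof=1)     # s_Y^2: divides by n-1, unbiased for sigma_Y^2
se2 = s2 / n           # estimated variance of the sample mean
print(y.mean(), s2, se2)
```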
Results

        Ȳ       σ̂²_Y    n     σ̂²_Ȳ
µ̂_12   16.62   72.95   782   0.09
µ̂_11   12.18   31.44   49    0.64

To give some intuitive meaning to these results, we use 2 main approaches:
Hypothesis testing
Confidence intervals
Hypothesis testing

Hypothesis: those with 11 years of education earn $15/hr
H0: µ_11 = µ_H0 = 15
Even if H0 is true, our estimate will never hit 15 on the nose due to sampling variation
Hypothesis testing quantifies whether Ȳ’s deviation from 15 is:
just due to random sampling;
or, alternatively, because µ_11 ≠ 15
Test H0 by assuming it’s true
If Ȳ looks “extreme” assuming H0 is true, it’s probably false
Test statistic

Central to hypothesis testing is the t-stat:
t = (Ȳ − µ_H0) / σ̂_Ȳ
If Ȳ ∼ N(µ_H0, σ̂²_Ȳ), then t ∼ N(0, 1)
E.g. if Ȳ ∼ N(15, 0.64), then t = (Ȳ − 15)/0.8 ∼ N(0, 1)
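The next slides’ numbers can be reproduced directly; a sketch using scipy (my library choice):

```python
from scipy.stats import norm

ybar, mu_H0, se = 12.18, 15.0, 0.8      # se = sqrt(0.64)
t = (ybar - mu_H0) / se                 # -3.525
p = 2 * norm.cdf(-abs(t))               # ~0.0004 (slides: 0.00044 after rounding)
print(t, p)
```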
Testing our hypothesis

So if µ_11 = 15, we have t = (12.18 − 15)/0.8 = −3.525
The probability of a N(0, 1) RV being this extreme (< −3.525 or > 3.525) is very low
Visualizing the hypothesis test
Z table
Testing our hypothesis

So if µ_11 = 15, we have t = (12.18 − 15)/0.8 = −3.525
The probability of a N(0, 1) RV being this extreme (< −3.525 or > 3.525) is very low
Probability is 0.00022 + 0.00022 = 0.00044
So we would reject H0 at the 5% level because this probability is less than 5%
If H0 were true, we would see an outcome this extreme less than 1% of the time
It is very unlikely that µ_11 = 15
p-value

We call that probability the p-value and define it as:
p = Pr(|Z| > |t|) = 2·Φ(−|t|),
where Z ∼ N(0, 1) and Φ(z) = Pr(Z ≤ z).
Typically, we reject H0 if p < 0.05.
Our p is 0.00044, so we would reject the hypothesis that µ_11 = 15

Visualizing the p-value
Large sample approximation

Ȳ:
                          n small                    n → ∞
Y_i ∼ N(µ_Y, σ²_Y)        Ȳ ∼ N(µ_Y, σ²_Y/n)        Ȳ ∼ N(µ_Y, σ²_Y/n)
Y_i not Normal            ??                         Ȳ ∼ N(µ_Y, σ²_Y/n)

t-stat:
                          n small                    n → ∞
Y_i ∼ N(µ_Y, σ²_Y)        t ∼ t_{n−1}                t ∼ N(0, 1)
Y_i not Normal            ??                         t ∼ N(0, 1)

Almost always more reasonable to assume n is large rather than Y ∼ N, so we will rely on the Normal approximation rather than the exact t-distribution.
Confidence intervals

A hypothesis test rules out (or doesn’t) a specific µ_H0 of interest
Does so at some pre-specified significance level (e.g. 5%)
A confidence interval identifies all potential values of µ_H0 that would not be ruled out
It’s like a hypothesis test on steroids
3 equivalent statements:
“I reject H0 at the 5% level.”
“The p-value of H0 is less than 0.05.”
“µ_H0 is not contained in the 95% CI.”
Constructing a 95% confidence interval

We start with the test statistic:
t = (Ȳ − µ_H0) / σ̂_Ȳ
We reject H0 at the 5% level if:
Pr(|Z| > |t|) < 0.05
Therefore, we reject H0 if t < −1.96 or t > 1.96

Z table

So, we fail to reject the following µ_H0 values:
95% CI = [Ȳ ± 1.96·σ̂_Ȳ] = [Ȳ − 1.96·σ̂_Ȳ, Ȳ + 1.96·σ̂_Ȳ]
If µ = µ_H0, the probability of Ȳ being so far away that µ_H0 is rejected is just 5%
So, Pr(µ ∈ [Ȳ ± 1.96·σ̂_Ȳ]) = 95%
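A sketch (values taken from the results table that follows) reproducing the 95% CIs:

```python
import numpy as np

for label, ybar, s2, n in [("mu_12", 16.62, 72.95, 782),
                           ("mu_11", 12.18, 31.44, 49)]:
    se = np.sqrt(s2 / n)                       # sigma-hat of the sample mean
    lo, hi = ybar - 1.96 * se, ybar + 1.96 * se
    print(f"{label}: [{lo:.1f}, {hi:.1f}]")    # [16.0, 17.2] and [10.6, 13.8]
```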
Results

        Ȳ       σ̂²_Y   n     σ̂²_Ȳ   95% CI          Reject   z-score   p-val
µ̂_12   16.62   73.0   782   0.09    [16.0, 17.2]    Y        5.30      < 0.0001
µ̂_11   12.18   31.4   49    0.64    [10.6, 13.8]    Y        −3.52     0.0004

“Reject,” “z-score,” and “p-val” all refer to a null hypothesis that the population mean is equal to 15.
Testing the difference in means

“Do people with 11 vs. 12 years of education earn different amounts?”
Point estimate:
Estimated µ_12 − µ_11 = µ̂_12 − µ̂_11 = Ȳ_12 − Ȳ_11 = 16.62 − 12.18 = 4.44
Uncertainty:
var(Ȳ_12 − Ȳ_11) = σ̂²_Ȳ12 + σ̂²_Ȳ11 − 2·σ̂_{Ȳ12,Ȳ11} = 0.09 + 0.64 − 0 = 0.73
(independent samples, so the covariance term is 0)
Hypothesis test for a difference of 0:
t = (4.44 − 0)/√0.73 = 5.20, p-value << 0.01
Confidence interval for µ_12 − µ_11: [2.77, 6.11]
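The full two-sample computation, end to end (scipy assumed; numbers from the slides):

```python
import numpy as np
from scipy.stats import norm

diff = 16.62 - 12.18                       # point estimate: 4.44
se = np.sqrt(0.09 + 0.64)                  # sd of the difference; cov term is 0
t = diff / se                              # ~5.20
p = 2 * norm.cdf(-abs(t))                  # far below 0.01
ci = (diff - 1.96 * se, diff + 1.96 * se)  # ~[2.77, 6.11]
print(t, p, ci)
```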
