
Week 3: Nonparametric Estimation

Dongwoo Kim
Simon Fraser University
Jan 18, 2023
Quote

“Young man, in mathematics you don’t understand things. You just get
used to them.”

• John von Neumann

Introduction

Parametric methods assume that data are generated from a known finite-dimensional model.

• Y = Xβ + U, U ∼ N(0, σ²)
• Finite number of parameters: β and σ²
• Parametric assumptions are hard to justify
• But they provide a good approximation in many cases

Semiparametric models consist of a known finite-dimensional part and an unknown infinite-dimensional part.

1. Effectively parametric case: Y = Xβ + U and E[XU] = 0
   • β is the object of interest and can be estimated by OLS
   • The distribution of U is the nonparametric nuisance part
2. Important nonparametric part: Y = m(Xβ) + U and E[U|X] = 0
   • Both the parameter β and the nonparametric function m are of interest

Introduction

Nonparametric methods assume that data are generated from an unknown infinite-dimensional model. The researcher may impose separability restrictions, shape restrictions, or smoothness restrictions.

Examples:

1. Additive model: Y = m(X) + U and E[U|X] = 0
   • m is a fully nonparametric function
   • Shape restrictions, e.g. monotonicity or concavity, can be imposed
2. Nonseparable model: Y = m(X, U) and U ⊥ X
   • Additional assumptions are needed to identify m
   • For example: m is strictly increasing in U, and U is scalar and uniformly distributed on [0, 1]

Useful Notation

Let a_n and b_n be non-stochastic sequences.

Definition (Big oh)
a_n = O(1) if a_n remains bounded as n → ∞, i.e. there is a constant C such that |a_n| ≤ C for all large n. Similarly, a_n = O(b_n) if a_n/b_n = O(1).

Examples:

1. If a_n = n/(n + 1), then a_n = O(1) since a_n ≤ 1 for all n.
2. If a_n = n + 5 and b_n = n, then a_n = O(b_n) because a_n ≤ 2b_n for n ≥ 5, or a_n ≤ 6b_n for all n ≥ 1.

Useful Notation

Definition (Small oh)
a_n = o(1) if a_n → 0 as n → ∞. Similarly, a_n = o(b_n) if a_n/b_n → 0.

Examples:

1. If a_n = 10/(n + 1), then a_n = o(1) since a_n → 0 as n → ∞.
2. If a_n = 1/n² and b_n = 1/n, then a_n = o(b_n) because a_n/b_n = 1/n → 0.

Useful Notation

Definition (Bounded in Probability)
A sequence of real random variables {X_n}_{n=1}^∞ is bounded in probability if, for every ε > 0, there exist a constant M and a positive integer N such that

P[||X_n|| > M] ≤ ε  for all n ≥ N.

1. X_n = O_p(1) if X_n is bounded in probability
2. X_n = o_p(1) if X_n →p 0
3. X_n = O_p(Y_n) if X_n/Y_n is bounded in probability
4. X_n = o_p(Y_n) if X_n/Y_n →p 0

Example: the sample mean estimator

Let X be a scalar random variable and define µ := E[X]. The sample mean estimator is X̄ = (1/n) Σ_{i=1}^n X_i. By the WLLN,

X̄ − µ = (1/n) Σ_{i=1}^n (X_i − µ) →p E(X_i − µ) = 0

Therefore, X̄ − µ is O_p(1) and o_p(1).

Example: the sample mean estimator

By the CLT,

√n (X̄ − µ) = (1/√n) Σ_{i=1}^n (X_i − µ) →d N(0, σ²)

where σ² = Var(X_i).

Define Z := √n (X̄ − µ), so E(Z) = 0. For any ε > 0, Chebyshev's inequality gives

P[|Z| ≥ M] ≤ σ²/M²,   so choosing M = σ/√ε yields   P[|Z| ≥ σ/√ε] ≤ σ²/(σ²/ε) = ε.

This means that √n (X̄ − µ) = (X̄ − µ)/(1/√n) is O_p(1), which implies

∴ X̄ − µ = O_p(1/√n)
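A minimal simulation sketch (not from the slides; the normal data-generating process and the number of replications are illustrative) shows this rate: the spread of X̄ − µ shrinks with n, while the spread of √n(X̄ − µ) stays near σ.

```python
# Sketch: Monte Carlo check that Xbar - mu = O_p(1/sqrt(n)).
# The N(1, 2^2) data and 1,000 replications are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

for n in [100, 1_000, 10_000]:
    # 1,000 replications of the sample mean based on n observations each
    xbar = rng.normal(mu, sigma, size=(1_000, n)).mean(axis=1)
    dev = xbar - mu
    print(f"n={n:6d}  sd(Xbar - mu)={dev.std():.4f}  "
          f"sd(sqrt(n)*(Xbar - mu))={(np.sqrt(n) * dev).std():.4f}")
# The first column shrinks like 1/sqrt(n); the second stays near sigma = 2.
```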

Nonparametric Kernel Density Estimation

A random variable X has an unknown distribution, and a sample of realizations of X is available:

{X_1, · · · , X_n}

• The full distribution of X can be estimated with no distributional assumptions
  1. Discrete X: P̂[X = x] = (1/n) Σ_{i=1}^n 1[X_i = x]
  2. Continuous X: P̂[X ≤ x] = F̂_X(x) = (1/n) Σ_{i=1}^n 1[X_i ≤ x]
• The empirical CDF is not smooth, so it is hard to derive a density function from it
• Kernel-based nonparametric methods estimate a smooth density function for continuous random variables
• They are a building block for many advanced nonparametric methods
• Empirical applications: 1) distribution of income, 2) distribution of income conditional on education, etc.

Naive Density Estimator

Goal: estimate the density function f(x) at a particular point x_0.

• At x_0, the probability mass is zero if X is continuously distributed (no observation exactly at x_0)
• Using a small bin [x_0 − h, x_0 + h] with bandwidth h > 0,

  P[X ∈ [x_0 − h, x_0 + h]] = ∫_{x_0−h}^{x_0+h} f(x) dx ≈ 2h f(x_0)

• Therefore the density at x_0 is approximated by

  f̂(x_0) = (1/2h) P̂[X ∈ [x_0 − h, x_0 + h]]
         = (1/2h) (1/n) Σ_{i=1}^n 1[x_0 − h ≤ X_i ≤ x_0 + h]
         = (1/nh) Σ_{i=1}^n (1/2) 1[−h ≤ X_i − x_0 ≤ h]
         = (1/nh) Σ_{i=1}^n (1/2) 1[−1 ≤ (X_i − x_0)/h ≤ 1]
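A minimal sketch of this naive bin-counting estimator (not from the slides; the standard normal data and the bandwidth h = 0.5 are illustrative):

```python
# Naive density estimator: (1/2h) times the fraction of observations in
# [x0 - h, x0 + h]. The data and bandwidth below are illustrative.
import numpy as np

def naive_density(x0, data, h):
    in_bin = np.abs(data - x0) <= h      # 1[-h <= X_i - x0 <= h]
    return in_bin.mean() / (2.0 * h)     # (1/2h) * P_hat[X in the bin]

rng = np.random.default_rng(1)
data = rng.normal(size=5_000)            # true density: standard normal
print(naive_density(0.0, data, h=0.5))   # near 0.38, vs phi(0) ~ 0.399 (bin-width bias)
```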
Naive Density Estimator: Intuition

• This is the naive estimator of Fix and Hodges (1951).
• The estimator is an unweighted average over values of X close to x_0.
• The probability that X = x_0 is zero, but the probability that X falls in the bin [x_0 − h, x_0 + h] is positive.
• If f is continuous, f evaluated at points near x_0 will be close to f(x_0).
• For fixed h, the Strong Law of Large Numbers (SLLN) implies

  f̂(x_0) →a.s. (1/2h) P[X ∈ [x_0 − h, x_0 + h]].

• As h → 0, the right-hand side converges to the true density:

  f(x_0) = lim_{h→0} (1/2h) P[X ∈ [x_0 − h, x_0 + h]].

• Let the bandwidth h = h_n depend on the sample size n. If h_n → 0 (slowly enough) as n → ∞, the naive estimator f̂(x_0) converges to f(x_0) (consistency).

General Kernel Function

Defining K(ψ) = (1/2) 1[−1 ≤ ψ ≤ 1] (the uniform kernel), the naive estimator can be written as

f̂(x_0) = (1/(n h_n)) Σ_{i=1}^n K((X_i − x_0)/h_n).

• The naive estimator uses the uniform kernel, i.e. the density of a uniformly distributed random variable.
• It gives equal weight to all points within the bin.
• We could do better by giving larger weights to values closer to x_0.
• How? Choose a different kernel function K!

Examples

1. Gaussian kernel: K(ψ) = (1/√(2π)) exp(−ψ²/2) = φ(ψ)
2. Epanechnikov kernel: K(ψ) = (3/4)(1 − ψ²) 1[−1 ≤ ψ ≤ 1]
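A minimal sketch of the general kernel estimator with interchangeable kernels (not from the slides; the data and bandwidth are illustrative):

```python
# General kernel density estimator f_hat(x0) = (1/(n*h)) * sum K((X_i - x0)/h),
# with the three kernels above. Data and h are illustrative.
import numpy as np

def kde_at(x0, data, h, kernel):
    psi = (data - x0) / h
    return kernel(psi).mean() / h

uniform      = lambda p: 0.5 * (np.abs(p) <= 1)
gaussian     = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)
epanechnikov = lambda p: 0.75 * (1 - p**2) * (np.abs(p) <= 1)

rng = np.random.default_rng(2)
data = rng.normal(size=5_000)                # true density: standard normal
for name, K in [("uniform", uniform), ("gaussian", gaussian), ("epanechnikov", epanechnikov)]:
    print(name, round(kde_at(0.0, data, h=0.4, kernel=K), 3))  # all near phi(0) ~ 0.399
```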

Finite Sample Properties

Assumption 1
1. The observations {X_1, · · · , X_n} are drawn iid.
2. f(x) is three-times differentiable.
3. The kernel function is symmetric around zero and satisfies
   i. ∫ K(ψ) dψ = 1
   ii. ∫ ψ² K(ψ) dψ = µ₂ ≠ 0
   iii. ∫ K²(ψ) dψ = µ < ∞
4. h_n → 0 as n → ∞
5. n h_n → ∞ as n → ∞

Theorem 1
Under Assumption 1, for any x that is an interior point in the support of X,

E[(f̂(x) − f(x))²] = (h_n⁴/4) [µ₂ f''(x)]² + µ f(x)/(n h_n) + o(h_n⁴ + 1/(n h_n)).
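The two terms trade off: the squared-bias term grows with h_n while the variance term shrinks. A small numeric sketch (not from the slides) plugs a Gaussian kernel and standard normal data into the leading terms of this formula, using the known constants µ₂ = 1, µ = 1/(2√π), and f''(x) = (x² − 1)φ(x); the o(·) remainder is ignored.

```python
# Evaluate the leading terms of the MSE in Theorem 1 at x = 0.5 for standard
# normal data and a Gaussian kernel (mu2 = 1, mu = 1/(2*sqrt(pi)),
# f''(x) = (x^2 - 1)*phi(x)). A sketch that only plugs numbers into the formula.
import numpy as np

def amse(h, n, x=0.5):
    phi = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # f(x)
    f2  = (x**2 - 1) * phi                           # f''(x)
    mu2, mu = 1.0, 1.0 / (2 * np.sqrt(np.pi))
    return (h**4 / 4) * (mu2 * f2) ** 2 + mu * phi / (n * h)

n = 1_000
grid = np.linspace(0.05, 1.5, 200)
best = grid[np.argmin(amse(grid, n))]
print(f"approximate AMSE-minimizing h for n={n}: {best:.3f}")
# Squared bias rises in h, variance falls in h; the minimizer shrinks with n
# (at the rate n^(-1/5), which motivates the rule-of-thumb bandwidth later).
```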

Proof of Theorem 1

The mean squared error (MSE) decomposes as

E[(f̂(x) − f(x))²] = E[(f̂(x) − E[f̂(x)] + E[f̂(x)] − f(x))²]
                  = E[(f̂(x) − E[f̂(x)])²] + E[(E[f̂(x)] − f(x))²]
                  = E[(f̂(x) − E[f̂(x)])²] + (E[f̂(x)] − f(x))²
                  = var(f̂(x)) + [bias(f̂(x))]²

Note that the interaction term disappears because E[f̂(x)] − f(x) is a constant:

2 E[(f̂(x) − E[f̂(x)])(E[f̂(x)] − f(x))] = 2 (E[f̂(x)] − f(x)) E[f̂(x) − E[f̂(x)]]
                                        = 2 (E[f̂(x)] − f(x)) (E[f̂(x)] − E[f̂(x)])
                                        = 0

Proof of Theorem 1: Bias

Define w_i := (1/h_n) K((X_i − x)/h_n), so that f̂(x) = (1/n) Σ_{i=1}^n w_i. Let ψ = (X_i − x)/h_n.

E(w_i) = E[(1/h_n) K((X_i − x)/h_n)] = ∫ (1/h_n) K((X_i − x)/h_n) f(X_i) dX_i
       = ∫ (1/h_n) K(ψ) f(x + h_n ψ) d(x + h_n ψ)
       = ∫ K(ψ) f(x + h_n ψ) dψ   (by the change of variables theorem)

Therefore,

E[f̂(x)] = (1/n) Σ_{i=1}^n E[w_i] = E[w_i].

Proof of Theorem 1: Bias

Change of Variables Theorem
Let I ⊆ R be an interval and φ : [a, b] → I a differentiable function with integrable derivative. Suppose that f : I → R is continuous. Then

∫_{φ(a)}^{φ(b)} f(x) dx = ∫_a^b f(φ(t)) φ'(t) dt.

In our case, we let φ(ψ) = x + h_n ψ:

∫ (1/h_n) K(ψ) f(x_i) dx_i = ∫ (1/h_n) K(ψ) f(φ(ψ)) φ'(ψ) dψ
                           = ∫ (1/h_n) K(ψ) f(φ(ψ)) h_n dψ
                           = ∫ K(ψ) f(φ(ψ)) dψ
                           = ∫ K(ψ) f(x + h_n ψ) dψ

Proof of Theorem 1: Bias

Bias(f̂(x)) = E(f̂(x)) − f(x)
           = E(w_i) − f(x)
           = ∫ K(ψ) f(x + h_n ψ) dψ − f(x)
           = ∫ K(ψ) f(x + h_n ψ) dψ − ∫ K(ψ) f(x) dψ   (by Assumption 1.3.i)

∴ Bias(f̂(x)) = ∫ K(ψ) [f(x + h_n ψ) − f(x)] dψ

The larger the bandwidth h_n, the larger the bias.

Proof of Theorem 1: Bias

A second-order Taylor series expansion of f(x + h_n ψ) around x gives

f(x + h_n ψ) ≈ f(x) + h_n ψ f'(x) + (h_n² ψ²/2) f''(x)

∴ Bias(f̂(x)) ≈ ∫ K(ψ) [h_n ψ f'(x) + (h_n² ψ²/2) f''(x)] dψ
             = h_n f'(x) ∫ ψ K(ψ) dψ + (h_n² f''(x)/2) ∫ ψ² K(ψ) dψ
             = h_n² f''(x) µ₂ / 2

where the first integral vanishes by symmetry of K and the second equals µ₂.

Proof of Theorem 1: Variance

Var(f̂(x)) = Var((1/n) Σ_{i=1}^n w_i) = (1/n²) Σ_{i=1}^n Var(w_i)   (by A1, iid)
          = (1/n) Var(w_i)
          = (1/n) E(w_i²) − (1/n) [E(w_i)]²

E(w_i²) = E[(1/h_n²) K²((X_i − x)/h_n)] = ∫ (1/h_n²) K²((X_i − x)/h_n) f(X_i) dX_i
        = ∫ (1/h_n²) K²(ψ) f(x + h_n ψ) d(x + h_n ψ)
        = (1/h_n) ∫ K²(ψ) f(x + h_n ψ) dψ

∴ Var(f̂(x)) = (1/(n h_n)) ∫ K²(ψ) f(x + h_n ψ) dψ − (1/n) [∫ K(ψ) f(x + h_n ψ) dψ]²
Proof of Theorem 1: Variance

Plugging in the Taylor expansion f(x + h_n ψ) ≈ f(x) + h_n ψ f'(x) + (h_n² ψ²/2) f''(x) =: TE,

Var(f̂(x)) ≈ (1/(n h_n)) ∫ K²(ψ) TE dψ − (1/n) [∫ K(ψ) TE dψ]²
          = (1/(n h_n)) ∫ K²(ψ) TE dψ − (1/n) [f(x) + (h_n² µ₂/2) f''(x)]²

The second term converges to 0 more quickly than the first (it is of order 1/n versus 1/(n h_n)). Inside the first term, the pieces involving h_n ψ f'(x) and (h_n² ψ²/2) f''(x) vanish as h_n → 0, so to leading order

Var(f̂(x)) ≈ (1/(n h_n)) ∫ K²(ψ) [f(x) + h_n ψ f'(x) + (h_n² ψ²/2) f''(x)] dψ
          ≈ (1/(n h_n)) f(x) ∫ K²(ψ) dψ = µ f(x)/(n h_n)

Choice of Kernel Function

• The choice of kernel is usually not crucial.
• For commonly used kernels, the first moment equals zero and the second moment is finite.
• Many other kernel functions have been developed in the theoretical literature; higher-order kernels are one example.
• Higher-order kernels tend to result in lower bias.
• In empirical applications they are rarely used: the gains tend to be small and the costs large.

Choice of Bandwidth

Bandwidth choice is of great importance for estimation performance.

• There is a huge literature on optimal bandwidth selection.
• The default kernel-based nonparametric estimators in Stata and R use a rule-of-thumb optimal bandwidth based on the convergence rate (homework!).
• Data-driven bandwidth selection generally works better but is computationally demanding.
• Cross-validation is a popular way to find a data-driven optimal bandwidth.
• Minimize the mean integrated squared error (MISE):

  MISE = E[∫ (f̂(x) − f(x))² dx]
       = E[∫ (f̂(x)² − 2 f̂(x) f(x) + f(x)²) dx]
       = E[∫ f̂(x)² dx − ∫ 2 f̂(x) f(x) dx + ∫ f(x)² dx]

  where the last term is a constant (it does not depend on the bandwidth).

Choice of Bandwidth

• We only need to minimize over the first two terms:

  min_{h_n > 0}  E[∫ f̂(x)² dx − 2 ∫ f̂(x) f(x) dx]

• Empirical implementation: notice that ∫ f̂(x) f(x) dx = E_X[f̂(X)]. So we can find the optimal h_n by minimizing

  J(h_n) = ∫ f̂(x)² dx − (2/n) Σ_{i=1}^n f̂_(−i)(X_i)

  where f̂_(−i) is the leave-one-out kernel estimator (computed without observation i).
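A minimal sketch of this least-squares cross-validation criterion for a Gaussian-kernel density estimator (not from the slides); the ∫ f̂(x)² dx term is approximated on a grid, and the data and bandwidth grid are illustrative:

```python
# Leave-one-out least-squares cross-validation J(h) for a Gaussian-kernel KDE.
# A sketch: the integral of f_hat^2 is approximated on a fine grid (a closed
# form exists for the Gaussian kernel but is omitted for brevity).
import numpy as np

gauss = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    # f_hat(x) = (1/(n h)) * sum K((X_i - x)/h), vectorized over points x
    return gauss((data[None, :] - np.asarray(x)[:, None]) / h).mean(axis=1) / h

def J(h, data, grid):
    n = len(data)
    fhat_sq = np.trapz(kde(grid, data, h) ** 2, grid)       # approx. of int f_hat^2
    # leave-one-out f_hat_(-i)(X_i): drop the i = j term from the kernel sum
    Kmat = gauss((data[None, :] - data[:, None]) / h) / h
    loo = (Kmat.sum(axis=1) - Kmat.diagonal()) / (n - 1)
    return fhat_sq - 2 * loo.mean()

rng = np.random.default_rng(3)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 400)
hs = np.linspace(0.1, 1.0, 30)
print("CV bandwidth:", hs[np.argmin([J(h, data, grid) for h in hs])])
```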

Summary

• As the sample size grows, h_n → 0 and the bias disappears.
• The variance converges to 0 as n h_n → ∞, and hence so does the MSE.
• Therefore, f̂(x) consistently estimates f(x).
• Under further regularity conditions, we can show asymptotic normality of f̂(x):

  √(n h_n) (f̂(x) − f(x)) →d N(0, f(x) ∫ K²(ψ) dψ).

• Note that the convergence rate is √(n h_n), the so-called nonparametric rate. Convergence is slower than the parametric √n rate, meaning that a larger sample size is required for precise estimation.

Multivariate Kernel Density Estimation

So far we have covered univariate density estimation. It extends easily to the multivariate setting, where we estimate the joint density of several random variables,

f(X_1, X_2, · · · , X_k).

Again, the probability mass at (x_1, · · · , x_k) is zero, but within a small bin

X_1 ∈ [x_1 − h, x_1 + h], · · · , X_k ∈ [x_k − h, x_k + h]

the probability is positive.

Multivariate Kernel Density Estimation

For the two-variable case,

P[X_1 ∈ [x_1 − h, x_1 + h], X_2 ∈ [x_2 − h, x_2 + h]] = ∫_{x_1−h}^{x_1+h} ∫_{x_2−h}^{x_2+h} f(X_1, X_2) dX_2 dX_1
                                                     ≈ (2h)² f(x_1, x_2)

Naive uniform kernel estimator:

f̂(x_1, x_2) = (1/(n h²)) Σ_{i=1}^n (1/2) 1[x_1 − h ≤ X_1i ≤ x_1 + h] · (1/2) 1[x_2 − h ≤ X_2i ≤ x_2 + h]
            = (1/(n h²)) Σ_{i=1}^n (1/2) 1[−1 ≤ (X_1i − x_1)/h ≤ 1] · (1/2) 1[−1 ≤ (X_2i − x_2)/h ≤ 1]

Multivariate Kernel Density Estimation

General kernel estimator:

f̂(x_1, · · · , x_k) = (1/(n h^k)) Σ_{i=1}^n K((X_1i − x_1)/h) · · · K((X_ki − x_k)/h)

• The convergence rate is now √(n h^k).
• The rate becomes polynomially slower as k goes up.
• Curse of dimensionality: the sample size needed for precise estimation explodes as k goes up!
• Note that the bandwidth h need not be the same for all variables:

  f̂(x_1, · · · , x_k) = (1/(n h_1 h_2 · · · h_k)) Σ_{i=1}^n K((X_1i − x_1)/h_1) · · · K((X_ki − x_k)/h_k)
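A minimal sketch of the product-kernel estimator for k = 2 with per-variable bandwidths (not from the slides; the Gaussian kernel, data, and bandwidths are illustrative):

```python
# Bivariate product-kernel density estimator:
# f_hat(x1, x2) = (1/(n h1 h2)) * sum K((X1i - x1)/h1) K((X2i - x2)/h2).
# A sketch with a Gaussian kernel; the bandwidths are illustrative.
import numpy as np

gauss = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def kde2(x1, x2, X1, X2, h1, h2):
    w = gauss((X1 - x1) / h1) * gauss((X2 - x2) / h2)   # product kernel weights
    return w.mean() / (h1 * h2)

rng = np.random.default_rng(4)
X1, X2 = rng.normal(size=5_000), rng.normal(size=5_000)  # independent N(0,1) draws
print(kde2(0.0, 0.0, X1, X2, h1=0.4, h2=0.4))            # near phi(0)^2 ~ 0.159
```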

Kernels for Discrete Valued Random Variables

Suppose X takes only discrete values,

X ∈ {x_1, · · · , x_J}.

The frequency estimator consistently estimates the probability mass function:

P̂[X = x_j] = (1/n) Σ_{i=1}^n 1[X_i = x_j]

This is already good enough, but we can do better.

Kernels for Discrete Valued Random Variables

The discrete kernel estimator is

P̂[X = x_j] = (1/n) Σ_{i=1}^n K_h(X_i, x_j)

where

K_h(X_i, x_j) = 1 − h         if X_i = x_j
              = h/(J − 1)     otherwise

and h ∈ [0, (J − 1)/J] is the bandwidth.

This estimator slightly under-weights the observations that exactly equal x_j and spreads the remaining weight over the other support points.

With a cross-validated optimal bandwidth, it achieves a better MSE than the frequency estimator.
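A minimal sketch of this discrete kernel estimator next to the frequency estimator (not from the slides; the support, probabilities, and bandwidth h = 0.05 are illustrative rather than cross-validated):

```python
# Discrete kernel estimator: weight 1 - h on observations equal to xj and
# h/(J - 1) on all others. The bandwidth h = 0.05 is illustrative, not cross-validated.
import numpy as np

def discrete_kernel_pmf(xj, data, support, h):
    J = len(support)
    match = (data == xj)
    return (match * (1 - h) + (~match) * h / (J - 1)).mean()

rng = np.random.default_rng(5)
support = np.array([0, 1, 2])
data = rng.choice(support, size=2_000, p=[0.5, 0.3, 0.2])
for xj in support:
    freq = (data == xj).mean()   # frequency estimator for comparison
    print(xj, round(freq, 3), round(discrete_kernel_pmf(xj, data, support, h=0.05), 3))
```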

Nonparametric Conditional Mean estimator

The simplest non-trivial nonparametric model, a nonparametric generalization of OLS, is nonparametric conditional mean regression:

Y = m(X) + U,   E[U|X] = 0

The conditional mean of Y given X = x is

E[Y|X = x] = m(x) + E[U|X = x] = m(x)

Goal: approximate m arbitrarily closely with a large enough sample.

We consider the univariate X case for now.

Nonparametric Conditional Mean estimator: Discrete Case

The conditional mean of Y is by definition

E[Y|X = x] = E[Y · 1[X = x]] / P[X = x].

Therefore, the sample analogue nonparametric mean estimator is

m̂(x) = [(1/n) Σ_{i=1}^n Y_i 1[X_i = x]] / [(1/n) Σ_{i=1}^n 1[X_i = x]] = Σ_{i=1}^n Y_i 1[X_i = x] / Σ_{i=1}^n 1[X_i = x].

Using this estimator, you may guess that the kernel mean estimator for continuous X would be

m̂(x) = Σ_{i=1}^n Y_i K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h).

Nonparametric Conditional Mean estimator: Continuous Case

The conditional mean of Y is by definition

E[Y|X = x] = ∫ y f_{Y|X}(y|x) dy = ∫ y [f_{YX}(y, x) / f_X(x)] dy.

Replacing the density functions f_{YX}(y, x) and f_X(x) with kernel estimators,

m̂(x) = ∫ y [(1/(n h²)) Σ_{i=1}^n K((Y_i − y)/h) K((X_i − x)/h)] / [(1/(n h)) Σ_{i=1}^n K((X_i − x)/h)] dy
     = (1/h) Σ_{i=1}^n [∫ y K((Y_i − y)/h) dy] K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h)
     = (1/h) Σ_{i=1}^n h Y_i K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h)
     = Σ_{i=1}^n Y_i K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h)
Nonparametric Conditional Mean estimator: Continuous Case

The last step comes from a change of variables. Notice that K((Y_i − y)/h) = K((y − Y_i)/h) by symmetry of K. Define ψ = (y − Y_i)/h and φ(ψ) = y = Y_i + hψ. Then

∫ y K((Y_i − y)/h) dy = ∫ y K(ψ) dy
                      = ∫ (Y_i + hψ) K(ψ) φ'(ψ) dψ
                      = h ∫ (Y_i + hψ) K(ψ) dψ
                      = h [Y_i ∫ K(ψ) dψ + h ∫ ψ K(ψ) dψ]
                      = h [Y_i · 1 + h · 0]
                      = h Y_i

Nonparametric Conditional Mean estimator: Continuous Case

The kernel conditional mean estimator for multivariate X is

m̂(x) = Σ_{i=1}^n Y_i K_X((X_i − x)/h) / Σ_{i=1}^n K_X((X_i − x)/h)

where K_X((X_i − x)/h) = K((X_1i − x_1)/h) · · · K((X_ki − x_k)/h).

This is called the Nadaraya-Watson kernel mean estimator (also known as the local constant estimator).

• It equals a weighted average of Y over values of X, where the weight on observation i is

  K_X((X_i − x)/h) / Σ_{j=1}^n K_X((X_j − x)/h)

• Values of X_i “close” to x receive high weights.
• The definition of “close” depends on the bandwidth h.
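A minimal sketch of the Nadaraya-Watson estimator for univariate X (not from the slides; the Gaussian kernel, the regression function m(x) = sin(x), the noise level, and the bandwidth are illustrative):

```python
# Nadaraya-Watson (local constant) estimator:
# m_hat(x) = sum Y_i K((X_i - x)/h) / sum K((X_i - x)/h).
# A sketch with a Gaussian kernel; m(x) = sin(x), the noise, and h are illustrative.
import numpy as np

gauss = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def nw(x, X, Y, h):
    w = gauss((X - x) / h)           # kernel weights; largest for X_i near x
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=2_000)
Y = np.sin(X) + rng.normal(scale=0.3, size=X.size)   # Y = m(X) + U with m = sin
for x0 in [-1.0, 0.0, 1.0]:
    print(x0, round(nw(x0, X, Y, h=0.3), 3), "vs m(x) =", round(np.sin(x0), 3))
```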

Comments on Nadaraya-Watson Estimator

• The density f of X and the conditional mean function m need to be sufficiently smooth.
• Under conditions similar to those for the kernel density estimator, the NW estimator is consistent and asymptotically normal:

  √(n h) (m̂(x) − E_X[m̂(x)]) →d N(0, (σ²/f(x)) ∫ K²(ψ) dψ)

  if the error U is homoskedastic (E[U²|X] = σ²).
• It is much more complex to derive the bias and variance.
• Around the boundary of the support of X, the kernel estimator works poorly. The convention is to compute it on a subset of the support of X, e.g. the 5th to 95th percentile of X.

Summary

Pros:

• Kernel-based nonparametric estimation is computationally simple and theoretically attractive.
• It is readily available in many packages, e.g. R, Stata, and Matlab.
• As long as the sample size is large enough, the weak restrictions imposed by nonparametric estimators let the data speak for themselves.

Cons:

• Curse of dimensionality: the computational burden explodes as the sample size or dim X goes up.
• Results are sensitive to the choice of bandwidth.
• Endogeneity is hard to treat. Nonparametric IV regression is possible but in general suffers from huge variance. Possible treatments are proposed in Chetverikov and Wilhelm (2017) and Chetverikov, Kim and Wilhelm (2018).

