Week 3: Nonparametric Estimation
Dongwoo Kim
Simon Fraser University
Jan 18, 2023
Quote
“Young man, in mathematics you don’t understand things. You just get
used to them.” — John von Neumann
Introduction
Parametric methods assume that data are generated from a known finite
dimensional model.
• Y = Xβ + U, U ∼ N(0, σ²)
• Finite number of parameters: β and σ²
• Hard to justify parametric assumptions
• Provide good approximation in many cases
Introduction
Examples:
Useful Notation
(Examples of the stochastic order notation, e.g. Op(·), used in the following slides.)
Example: the sample mean estimator
$$\bar{X} - \mu = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu) \;\xrightarrow{\;p\;}\; E(X_i - \mu) = 0$$
Example: the sample mean estimator
Define $Z \equiv \sqrt{n}(\bar{X} - \mu)$, so that $E(Z) = 0$ and $\mathrm{Var}(Z) = \sigma^2$. By Chebyshev's inequality,
$$P[|Z| \geq \varepsilon] \leq \frac{\sigma^2}{\varepsilon^2} \quad\Rightarrow\quad P\!\left[|Z| \geq \frac{\sigma}{\sqrt{\varepsilon}}\right] \leq \frac{\sigma^2}{\sigma^2/\varepsilon} = \varepsilon.$$
This means that $\sqrt{n}(\bar{X} - \mu) = \dfrac{\bar{X} - \mu}{1/\sqrt{n}}$ is $O_p(1)$, which implies
$$\therefore\; \bar{X} - \mu = O_p\!\left(\frac{1}{\sqrt{n}}\right).$$
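A quick simulation illustrates the two rates above; the normal distribution, sample sizes, and replication count below are illustrative choices, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

# For several sample sizes, draw many samples and look at the spread of X̄ − µ
# and of √n(X̄ − µ): the first shrinks like 1/√n, the second stays stable (Op(1)).
for n in [100, 1_000, 10_000]:
    xbar = rng.normal(mu, sigma, size=(1_000, n)).mean(axis=1)   # 1,000 replications
    print(n,
          np.std(xbar - mu),                 # ≈ σ/√n, shrinks with n
          np.std(np.sqrt(n) * (xbar - mu)))  # ≈ σ, does not shrink
```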
Nonparametric Kernel Density Estimation
Naive Density Estimator
• For a small bandwidth $h$,
$$P[X \in [x_0 - h, x_0 + h]] = \int_{x_0 - h}^{x_0 + h} f(x)\,dx \approx 2h f(x_0).$$
• Therefore the density at $x_0$ is approximated by
$$\begin{aligned}
\hat{f}(x_0) &= \frac{1}{2h}\,\hat{P}\left[X \in [x_0 - h, x_0 + h]\right] \\
&= \frac{1}{2h}\cdot\frac{1}{n}\sum_{i=1}^{n} 1[x_0 - h \leq X_i \leq x_0 + h] \\
&= \frac{1}{hn}\sum_{i=1}^{n} \frac{1}{2}\cdot 1[-h \leq X_i - x_0 \leq h] \\
&= \frac{1}{hn}\sum_{i=1}^{n} \frac{1}{2}\cdot 1\!\left[-1 \leq \frac{X_i - x_0}{h} \leq 1\right]
\end{aligned}$$
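A minimal NumPy sketch of this naive estimator (the N(0,1) data, evaluation point, and bandwidth are illustrative assumptions):

```python
import numpy as np

def naive_density(x0, data, h):
    """Naive estimator: fraction of observations in [x0 - h, x0 + h], scaled by 1/(2h)."""
    return np.mean(np.abs(data - x0) <= h) / (2.0 * h)

rng = np.random.default_rng(1)
data = rng.normal(size=2_000)            # sample from N(0, 1)
print(naive_density(0.0, data, h=0.3))   # should be near the true density φ(0) ≈ 0.3989
```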
Naive Density Estimator: Intuition
General Kernel Function
By defining $K(\psi) = \frac{1}{2}\, 1[-1 \leq \psi \leq 1]$ (uniform kernel),
$$\hat{f}(x_0) = \frac{1}{n h_n}\sum_{i=1}^{n} K\!\left(\frac{X_i - x_0}{h_n}\right).$$
Examples
1. Gaussian kernel: $K(\psi) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{\psi^2}{2}\right) = \phi(\psi)$
2. Epanechnikov kernel: $K(\psi) = \frac{3}{4}(1 - \psi^2)\, 1[-1 \leq \psi \leq 1]$
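The same sketch with the kernels above swapped in; the data and bandwidth are again illustrative:

```python
import numpy as np

def gaussian_kernel(psi):
    return np.exp(-0.5 * psi**2) / np.sqrt(2.0 * np.pi)

def epanechnikov_kernel(psi):
    return 0.75 * (1.0 - psi**2) * (np.abs(psi) <= 1)

def kde(x0, data, h, kernel):
    """f̂(x0) = (1/(n h)) Σ K((X_i − x0)/h)."""
    return np.mean(kernel((data - x0) / h)) / h

rng = np.random.default_rng(2)
data = rng.normal(size=2_000)
print(kde(0.0, data, 0.3, gaussian_kernel),
      kde(0.0, data, 0.3, epanechnikov_kernel))   # both near φ(0) ≈ 0.3989
```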
Finite Sample Properties
Assumption 1
1. The observations {X1 , · · · , Xn } are drawn i.i.d.
2. f(x) is three-times differentiable.
3. The kernel function is symmetric around zero and satisfies
   i. $\int_{\mathbb{R}} K(\psi)\,d\psi = 1$
   ii. $\int_{\mathbb{R}} \psi^2 K(\psi)\,d\psi = \mu_2 \neq 0$
   iii. $\int_{\mathbb{R}} K^2(\psi)\,d\psi = \mu < \infty$
4. hn → 0 as n → ∞
5. nhn → ∞ as n → ∞
Theorem 1
Under Assumption 1, for any x that is an interior point in the support of X,
$$E[(\hat{f}(x) - f(x))^2] = \frac{h_n^4}{4}\,[\mu_2 f''(x)]^2 + \frac{\mu f(x)}{n h_n} + o\!\left(h_n^4 + (n h_n)^{-1}\right).$$
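A Monte Carlo check of Theorem 1 under an assumed setup (N(0,1) data, Gaussian kernel, x = 0, and arbitrary n and h): the simulated MSE of f̂(0) should be of the same magnitude as the theorem's two leading terms.

```python
import numpy as np

rng = np.random.default_rng(3)

def kde(x0, data, h):
    # Gaussian-kernel density estimator at x0
    psi = (data - x0) / h
    return np.mean(np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)) / h

n, h, x0, reps = 1_000, 0.3, 0.0, 2_000
f0 = 1 / np.sqrt(2 * np.pi)          # true density of N(0,1) at 0
fpp0 = -f0                           # f''(0) for the standard normal
mu2 = 1.0                            # ∫ ψ² K(ψ) dψ for the Gaussian kernel
mu = 1 / (2 * np.sqrt(np.pi))        # ∫ K(ψ)² dψ for the Gaussian kernel

estimates = np.array([kde(x0, rng.normal(size=n), h) for _ in range(reps)])
mc_mse = np.mean((estimates - f0) ** 2)
theory_mse = (h**4 / 4) * (mu2 * fpp0) ** 2 + mu * f0 / (n * h)
print(mc_mse, theory_mse)            # the two numbers should be close
```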
Proof of Theorem 1
Write $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} w_i$ with $w_i \equiv \frac{1}{h_n} K\!\left(\frac{X_i - x}{h_n}\right)$, and decompose the mean squared error into squared bias plus variance:
$$E[(\hat{f}(x) - f(x))^2] = \left[\mathrm{Bias}(\hat{f}(x))\right]^2 + \mathrm{Var}(\hat{f}(x)).$$
Proof of Theorem 1: Bias
$$\begin{aligned}
E(w_i) &= E\!\left[\frac{1}{h_n} K\!\left(\frac{X_i - x}{h_n}\right)\right] = \int \frac{1}{h_n} K\!\left(\frac{X_i - x}{h_n}\right) f(X_i)\,dX_i \\
&= \int \frac{1}{h_n} K(\psi)\, f(x + h_n\psi)\, d(x + h_n\psi) \\
&= \int K(\psi)\, f(x + h_n\psi)\, d\psi \quad \text{(by the change of variables theorem)}
\end{aligned}$$
Therefore,
$$E[\hat{f}(x)] = \frac{1}{n}\sum_{i=1}^{n} E[w_i] = E[w_i].$$
Proof of Theorem 1: Bias
$$\int_{\phi(a)}^{\phi(b)} f(x)\,dx = \int_{a}^{b} f(\phi(t))\,\phi'(t)\,dt$$
Proof of Theorem 1: Bias
Since $\int K(\psi)\,d\psi = 1$,
$$\therefore\; \mathrm{Bias}(\hat{f}(x)) = E[\hat{f}(x)] - f(x) = \int K(\psi)\left[f(x + h_n\psi) - f(x)\right]d\psi$$
Proof of Theorem 1: Bias
$$f(h_n\psi + x) \approx f(x) + h_n\psi f'(x) + \frac{h_n^2\psi^2}{2} f''(x)$$
$$\begin{aligned}
\therefore\; \mathrm{Bias}(\hat{f}(x)) &\approx \int K(\psi)\left[h_n\psi f'(x) + \frac{h_n^2\psi^2}{2} f''(x)\right]d\psi \\
&= \int K(\psi)\left[h_n\psi f'(x)\right]d\psi + \int K(\psi)\,\frac{h_n^2\psi^2}{2} f''(x)\,d\psi \\
&= h_n f'(x)\underbrace{\int K(\psi)\,\psi\,d\psi}_{=0 \text{ by symmetry}} + \frac{h_n^2 f''(x)}{2}\int K(\psi)\,\psi^2\,d\psi \\
&= h_n^2\,\frac{f''(x)\,\mu_2}{2}
\end{aligned}$$
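A numerical check of this leading bias term under assumed choices (standard normal f, Epanechnikov kernel, x = 0.5, h = 0.2): the exact bias integral and the h²f''(x)µ₂/2 approximation should nearly coincide for small h.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative setup: f is the standard normal density, K is the Epanechnikov kernel.
f = norm.pdf
fpp = lambda x: (x**2 - 1.0) * norm.pdf(x)            # f''(x) for the standard normal
K = lambda p: 0.75 * (1.0 - p**2) * (np.abs(p) <= 1)  # Epanechnikov kernel

x, h = 0.5, 0.2

# Exact bias of the kernel density estimator at x: ∫ K(ψ)[f(x + hψ) − f(x)] dψ
exact_bias, _ = quad(lambda p: K(p) * (f(x + h * p) - f(x)), -1, 1)

# Leading term from the Taylor expansion: h² f''(x) µ₂ / 2, with µ₂ = ∫ ψ² K(ψ) dψ
mu2, _ = quad(lambda p: p**2 * K(p), -1, 1)           # = 1/5 for the Epanechnikov kernel
approx_bias = 0.5 * h**2 * fpp(x) * mu2

print(exact_bias, approx_bias)    # the two numbers should be close
```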
Proof of Theorem 1: Variance
$$\begin{aligned}
\mathrm{Var}(\hat{f}(x)) &= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} w_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(w_i) \quad \text{(by A1)} \\
&= \frac{1}{n}\mathrm{Var}(w_i) \\
&= \frac{1}{n}E(w_i^2) - \frac{1}{n}\left[E(w_i)\right]^2
\end{aligned}$$
$$\begin{aligned}
E(w_i^2) &= E\!\left[\frac{1}{h_n^2} K^2\!\left(\frac{X_i - x}{h_n}\right)\right] = \int \frac{1}{h_n^2} K^2\!\left(\frac{X_i - x}{h_n}\right) f(X_i)\,dX_i \\
&= \int \frac{1}{h_n^2} K^2(\psi)\, f(x + h_n\psi)\, d(x + h_n\psi) \\
&= \frac{1}{h_n}\int K^2(\psi)\, f(x + h_n\psi)\, d\psi
\end{aligned}$$
$$\therefore\; \mathrm{Var}(\hat{f}(x)) = \frac{1}{n h_n}\int K^2(\psi)\, f(x + h_n\psi)\,d\psi - \frac{1}{n}\left[\int K(\psi)\, f(x + h_n\psi)\,d\psi\right]^2$$
Proof of Theorem 1: Variance
$$f(h_n\psi + x) \approx f(x) + h_n\psi f'(x) + \frac{h_n^2\psi^2}{2} f''(x) := TE$$
$$\begin{aligned}
\therefore\; \mathrm{Var}(\hat{f}(x)) &\approx \frac{1}{n h_n}\int K^2(\psi)\, TE\, d\psi - \frac{1}{n}\left[\int K(\psi)\, TE\, d\psi\right]^2 \\
&= \frac{1}{n h_n}\int K^2(\psi)\, TE\, d\psi - \underbrace{\frac{1}{n}\left[f(x) + \frac{h_n^2\mu_2}{2} f''(x)\right]^2}_{\text{converges to 0 more quickly than the first term}} \\
&\to \frac{1}{n h_n}\int K^2(\psi)\Big[f(x) + \underbrace{h_n\psi f'(x) + \frac{h_n^2\psi^2}{2} f''(x)}_{\text{vanish faster as } h_n \to 0}\Big]d\psi \\
&\to \frac{1}{n h_n}\, f(x)\underbrace{\int K^2(\psi)\,d\psi}_{=\mu}
\end{aligned}$$
Choice of Kernel Function
Choice of Bandwidth
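A sketch of one standard criterion that follows directly from Theorem 1 (assuming this is the bandwidth choice the slides have in mind): choose $h_n$ to minimize the leading MSE terms,
$$\min_{h}\; \frac{h^4}{4}[\mu_2 f''(x)]^2 + \frac{\mu f(x)}{n h}
\;\Rightarrow\; h^3[\mu_2 f''(x)]^2 - \frac{\mu f(x)}{n h^2} = 0
\;\Rightarrow\; h_n^{*} = \left(\frac{\mu f(x)}{[\mu_2 f''(x)]^2}\right)^{1/5} n^{-1/5},$$
so the MSE-optimal bandwidth shrinks at the rate $n^{-1/5}$ and the resulting MSE is of order $n^{-4/5}$.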
Summary
• Note that the convergence rate is $\sqrt{n h_n}$, which is called the nonparametric rate. This is slower than the parametric rate $\sqrt{n}$, meaning that a larger sample size is required for precise estimation.
Multivariate Kernel Density Estimation
$f(X_1, X_2, \cdots, X_k)$
Multivariate Kernel Density Estimation
• For a small bandwidth $h$,
$$P\left[X_1 \in [x_1 - h, x_1 + h],\; X_2 \in [x_2 - h, x_2 + h]\right] \approx (2h)^2 f(x_1, x_2),$$
so the bivariate analogue of the naive estimator is
$$\hat{f}(x_1, x_2) = \frac{1}{n h^2}\sum_{i=1}^{n} \frac{1\!\left[-1 \leq \frac{X_{1i} - x_1}{h} \leq 1\right]}{2}\cdot\frac{1\!\left[-1 \leq \frac{X_{2i} - x_2}{h} \leq 1\right]}{2}$$
Multivariate Kernel Density Estimation
$$\hat{f}(x_1, \cdots, x_k) = \frac{1}{n h^k}\sum_{i=1}^{n} K\!\left(\frac{X_{1i} - x_1}{h}\right)\cdots K\!\left(\frac{X_{ki} - x_k}{h}\right)$$
• Convergence rate is now $\sqrt{n h^k}$.
• The rate becomes polynomially slower as k goes up.
• Curse of dimensionality: the sample size needed for precise estimation explodes when k goes up!
• Note that the bandwidth h need not be the same for all variables:
$$\hat{f}(x_1, \cdots, x_k) = \frac{1}{n h_1 h_2 \cdots h_k}\sum_{i=1}^{n} K\!\left(\frac{X_{1i} - x_1}{h_1}\right)\cdots K\!\left(\frac{X_{ki} - x_k}{h_k}\right)$$
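A sketch of this product-kernel estimator with per-variable bandwidths (the Gaussian kernel, bivariate N(0,1) data, and the bandwidths are illustrative assumptions):

```python
import numpy as np

def product_kde(x0, data, h):
    """f̂(x0) = (1/(n h1···hk)) Σ_i Π_j K((X_ji − x0_j)/h_j), with a Gaussian K."""
    psi = (data - x0) / h                                  # shape (n, k)
    k_vals = np.exp(-0.5 * psi**2) / np.sqrt(2 * np.pi)    # kernel value per coordinate
    return np.mean(np.prod(k_vals, axis=1)) / np.prod(h)

rng = np.random.default_rng(4)
data = rng.normal(size=(5_000, 2))          # two independent N(0,1) variables
x0 = np.array([0.0, 0.0])
h = np.array([0.4, 0.5])                    # bandwidths need not be equal
print(product_kde(x0, data, h))             # true value: φ(0)² ≈ 0.1592
```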
Kernels for Discrete Valued Random Variables
X ∈ {x1 , · · · , xJ }
Kernels for Discrete Valued Random Variables
$$\hat{P}[X = x_j] = \frac{1}{n}\sum_{i=1}^{n} K_h(X_i, x_j)$$
where
$$K_h(X_i, x_j) = \begin{cases} 1 - h & \text{if } X_i = x_j \\ \dfrac{h}{J - 1} & \text{otherwise} \end{cases}$$
and $h \in \left[0, \frac{J-1}{J}\right]$ is the bandwidth.
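A sketch of this discrete-kernel frequency estimator (the three categories, their probabilities, and h = 0.1 are illustrative; setting h = 0 gives back the plain sample frequency):

```python
import numpy as np

def discrete_kernel_prob(xj, data, h, J):
    """P̂[X = xj] with K_h(Xi, xj) = 1 − h if Xi = xj, and h/(J − 1) otherwise."""
    match = (data == xj)
    return np.mean(np.where(match, 1.0 - h, h / (J - 1)))

rng = np.random.default_rng(5)
data = rng.choice([0, 1, 2], size=1_000, p=[0.5, 0.3, 0.2])   # J = 3 categories
for xj in [0, 1, 2]:
    print(xj, discrete_kernel_prob(xj, data, h=0.1, J=3))     # smoothed frequencies
```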
Nonparametric Conditional Mean estimator
$$Y = m(X) + U, \qquad E[U|X] = 0$$
Nonparametric Conditional Mean estimator: Discrete Case
• When X is discrete, a natural estimator of m(x) = E[Y|X = x] is the sample average of $Y_i$ over the observations with $X_i = x$:
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i\, 1[X_i = x]}{\sum_{i=1}^{n} 1[X_i = x]}.$$
Using this estimator, you may guess that the kernel mean estimator for continuous X would be
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)}.$$
Nonparametric Conditional Mean estimator: Continuous Case
By replacing the density functions $f_{YX}(y, x)$ and $f_X(x)$ with their kernel estimators,
$$\begin{aligned}
\hat{m}(x) &= \int y\; \frac{\frac{1}{n h^2}\sum_{i=1}^{n} K\!\left(\frac{Y_i - y}{h}\right) K\!\left(\frac{X_i - x}{h}\right)}{\frac{1}{n h}\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)}\, dy \\
&= \frac{\frac{1}{h}\sum_{i=1}^{n}\left[\int y\, K\!\left(\frac{Y_i - y}{h}\right) dy\right] K\!\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)} \\
&= \frac{\frac{1}{h}\sum_{i=1}^{n} h\, Y_i\, K\!\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)} \\
&= \frac{\sum_{i=1}^{n} Y_i\, K\!\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K\!\left(\frac{X_i - x}{h}\right)}
\end{aligned}$$
Nonparametric Conditional Mean estimator: Continuous Case
The step $\int y\, K\!\left(\frac{Y_i - y}{h}\right) dy = h Y_i$ comes from a change of variables. Notice that $K\!\left(\frac{Y_i - y}{h}\right) = K\!\left(\frac{y - Y_i}{h}\right)$ by symmetry. Define $\psi = \frac{y - Y_i}{h}$ and $\phi(\psi) = y = Y_i + h\psi$.
$$\begin{aligned}
\int y\, K\!\left(\frac{Y_i - y}{h}\right) dy &= \int y\, K(\psi)\, dy \\
&= \int (Y_i + h\psi)\, K(\psi)\, \phi'(\psi)\, d\psi \\
&= h \int (Y_i + h\psi)\, K(\psi)\, d\psi \\
&= h\Big[Y_i \underbrace{\int K(\psi)\, d\psi}_{=1} + h \underbrace{\int \psi K(\psi)\, d\psi}_{=0}\Big] \\
&= h Y_i
\end{aligned}$$
Nonparametric Conditional Mean estimator: Continuous Case
$$\hat{m}(x) = \frac{\sum_{i=1}^{n} Y_i\, K_X\!\left(\frac{X_i - x}{h}\right)}{\sum_{i=1}^{n} K_X\!\left(\frac{X_i - x}{h}\right)}$$
where $K_X\!\left(\frac{X_i - x}{h}\right) = K\!\left(\frac{X_{1i} - x_1}{h}\right)\cdots K\!\left(\frac{X_{ki} - x_k}{h}\right)$.
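A minimal sketch of the Nadaraya–Watson estimator in the form just derived (Gaussian kernel; the data-generating process m(x) = sin(x) and the bandwidth are illustrative assumptions):

```python
import numpy as np

def nadaraya_watson(x0, X, Y, h):
    """m̂(x0) = Σ_i Y_i K((X_i − x0)/h) / Σ_i K((X_i − x0)/h)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights (constants cancel)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=2_000)
Y = np.sin(X) + rng.normal(scale=0.3, size=2_000)   # Y = m(X) + U with m(x) = sin(x)
print(nadaraya_watson(1.0, X, Y, h=0.2))            # should be close to sin(1) ≈ 0.841
```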
Comments on Nadaraya-Watson Estimator
Summary
Pros:
• No parametric functional-form assumptions, which are often hard to justify.
Cons:
• Slower (nonparametric) convergence rate, so larger samples are needed for precise estimation.
• Curse of dimensionality as the number of conditioning variables grows.
• Results depend on the choice of kernel and bandwidth.