
Week 3: Nonparametric Estimation

Dongwoo Kim
Simon Fraser University
Jan 18, 2023
Quote

“Young man, in mathematics you don’t understand things. You just get
used to them.”

• John von Neumann

Introduction

Parametric methods assume that data are generated from a known finite-dimensional model.

• Y = Xβ + U, U ∼ N(0, σ²)
• Finite number of parameters: β and σ²
• Parametric assumptions are hard to justify
• But they provide a good approximation in many cases

Semiparametric models consist of a known finite-dimensional part and an unknown infinite-dimensional part.

1. Effectively parametric case: Y = Xβ + U and E[XU] = 0
   • β is the object of interest and can be estimated by OLS
   • The distribution of U is the nonparametric nuisance part
2. Important nonparametric part: Y = m(Xβ) + U and E[U|X] = 0
   • Both the parameter β and the nonparametric function m are of interest

Introduction

Nonparametric methods assume that data are generated from an unknown infinite-dimensional model. The researcher may impose separability restrictions, shape restrictions, or smoothness restrictions.

Examples:

1. Additive model: Y = m(X) + U and E[U|X] = 0
   • m is a fully nonparametric function
   • Shape restrictions, e.g. monotonicity or concavity, can be imposed
2. Nonseparable model: Y = m(X, U) and U ⊥ X
   • Additional assumptions are needed to identify m
   • For example: m is strictly increasing in U, and U is scalar and uniformly distributed on [0, 1]

Useful Notation

Let a_n and b_n be non-stochastic sequences.

Definition (Big oh)
a_n = O(1) if a_n remains bounded as n → ∞, i.e. there is a constant C such that |a_n| ≤ C for all large n. Similarly, a_n = O(b_n) if a_n/b_n = O(1).

Examples:

1. If a_n = n/(n + 1), then a_n = O(1) since a_n ≤ 1 for all n.
2. If a_n = n + 5 and b_n = n, then a_n = O(b_n) because a_n ≤ 2b_n for n ≥ 5, or a_n ≤ 6b_n for all n ≥ 1.

Useful Notation

Definition (Small oh)
a_n = o(1) if a_n → 0 as n → ∞. Similarly, a_n = o(b_n) if a_n/b_n → 0.

Examples:

1. If a_n = 10/(n + 1), then a_n = o(1) since a_n → 0 as n → ∞.
2. If a_n = 1/n² and b_n = 1/n, then a_n = o(b_n) because a_n/b_n = 1/n → 0.

Useful Notation

Definition (Bounded in Probability)
A sequence of real random variables {X_n}_{n=1}^∞ is bounded in probability if, for every ε > 0, there exist a constant M and a positive integer N such that

P[||X_n|| > M] ≤ ε  for all n ≥ N.

1. X_n = O_p(1) if X_n is bounded in probability
2. X_n = o_p(1) if X_n →p 0
3. X_n = O_p(Y_n) if X_n/Y_n is bounded in probability
4. X_n = o_p(Y_n) if X_n/Y_n →p 0

Example: the sample mean estimator

Let X be a scalar random variable and define µ := E[X]. The sample mean estimator is X̄ = (1/n) Σ_{i=1}^n X_i. By the WLLN,

X̄ − µ = (1/n) Σ_{i=1}^n (X_i − µ) →p E(X_i − µ) = 0

Therefore, X̄ − µ is O_p(1) and o_p(1).

Example: the sample mean estimator

By the CLT,

√n (X̄ − µ) = (1/√n) Σ_{i=1}^n (X_i − µ) →d N(0, σ²)

where σ² = Var(X_i).

Define Z := √n (X̄ − µ), so E(Z) = 0. For any ε > 0, Chebyshev's inequality gives

P[|Z| ≥ M] ≤ σ²/M²,   so choosing M = σ/√ε yields   P[|Z| ≥ σ/√ε] ≤ σ²/(σ²/ε) = ε.

This means that √n (X̄ − µ) = (X̄ − µ)/(1/√n) is O_p(1), which implies

∴ X̄ − µ = O_p(1/√n)
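A minimal simulation sketch (not from the slides; the normal data-generating process and the number of replications are illustrative) shows this rate: the spread of X̄ − µ shrinks with n, while the spread of √n(X̄ − µ) stays near σ.

```python
# Sketch: Monte Carlo check that Xbar - mu = O_p(1/sqrt(n)).
# The N(1, 2^2) data and 1,000 replications are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0

for n in [100, 1_000, 10_000]:
    # 1,000 replications of the sample mean based on n observations each
    xbar = rng.normal(mu, sigma, size=(1_000, n)).mean(axis=1)
    dev = xbar - mu
    print(f"n={n:6d}  sd(Xbar - mu)={dev.std():.4f}  "
          f"sd(sqrt(n)*(Xbar - mu))={(np.sqrt(n) * dev).std():.4f}")
# The first column shrinks like 1/sqrt(n); the second stays near sigma = 2.
```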

Nonparametric Kernel Density Estimation

A random variable X has an unknown distribution, and a sample of realizations of X is available:

{X_1, · · · , X_n}

• The full distribution of X can be estimated with no distributional assumptions
  1. Discrete X: P̂[X = x] = (1/n) Σ_{i=1}^n 1[X_i = x]
  2. Continuous X: P̂[X ≤ x] = F̂_X(x) = (1/n) Σ_{i=1}^n 1[X_i ≤ x]
• The empirical CDF is not smooth, so it is hard to derive a density function from it
• Kernel-based nonparametric methods estimate a smooth density function for continuous random variables
• They are a building block for many advanced nonparametric methods
• Empirical applications: 1) distribution of income, 2) distribution of income conditional on education, etc.

Naive Density Estimator

Goal: estimate the density function f(x) at a particular point x_0.

• At x_0, the probability mass is zero if X is continuously distributed (no observation exactly at x_0)
• Using a small bin [x_0 − h, x_0 + h] with bandwidth h > 0,

  P[X ∈ [x_0 − h, x_0 + h]] = ∫_{x_0−h}^{x_0+h} f(x) dx ≈ 2h f(x_0)

• Therefore the density at x_0 is approximated by

  f̂(x_0) = (1/2h) P̂[X ∈ [x_0 − h, x_0 + h]]
         = (1/2h) (1/n) Σ_{i=1}^n 1[x_0 − h ≤ X_i ≤ x_0 + h]
         = (1/nh) Σ_{i=1}^n (1/2) 1[−h ≤ X_i − x_0 ≤ h]
         = (1/nh) Σ_{i=1}^n (1/2) 1[−1 ≤ (X_i − x_0)/h ≤ 1]
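A minimal sketch of this naive bin-counting estimator (not from the slides; the standard normal data and the bandwidth h = 0.5 are illustrative):

```python
# Naive density estimator: (1/2h) times the fraction of observations in
# [x0 - h, x0 + h]. The data and bandwidth below are illustrative.
import numpy as np

def naive_density(x0, data, h):
    in_bin = np.abs(data - x0) <= h      # 1[-h <= X_i - x0 <= h]
    return in_bin.mean() / (2.0 * h)     # (1/2h) * P_hat[X in the bin]

rng = np.random.default_rng(1)
data = rng.normal(size=5_000)            # true density: standard normal
print(naive_density(0.0, data, h=0.5))   # near 0.38, vs phi(0) ~ 0.399 (bin-width bias)
```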
Naive Density Estimator: Intuition

• This is the naive estimator of Fix and Hodges (1951).
• The estimator is an unweighted average over values of X close to x_0.
• The probability that X = x_0 is zero, but the probability that X falls in the bin [x_0 − h, x_0 + h] is positive.
• If f is continuous, f evaluated at points near x_0 will be close to f(x_0).
• For fixed h, the Strong Law of Large Numbers (SLLN) implies

  f̂(x_0) →a.s. (1/2h) P[X ∈ [x_0 − h, x_0 + h]].

• As h → 0, the right-hand side converges to the true density:

  f(x_0) = lim_{h→0} (1/2h) P[X ∈ [x_0 − h, x_0 + h]].

• Let the bandwidth h = h_n depend on the sample size n. If h_n → 0 (slowly enough) as n → ∞, the naive estimator f̂(x_0) converges to f(x_0) (consistency).

General Kernel Function

Defining K(ψ) = (1/2) 1[−1 ≤ ψ ≤ 1] (the uniform kernel), the naive estimator can be written as

f̂(x_0) = (1/(n h_n)) Σ_{i=1}^n K((X_i − x_0)/h_n).

• The naive estimator uses the uniform kernel, i.e. the density of a uniformly distributed random variable.
• It gives equal weight to all points within the bin.
• We could do better by giving larger weights to values closer to x_0.
• How? Choose a different kernel function K!

Examples

1. Gaussian kernel: K(ψ) = (1/√(2π)) exp(−ψ²/2) = φ(ψ)
2. Epanechnikov kernel: K(ψ) = (3/4)(1 − ψ²) 1[−1 ≤ ψ ≤ 1]
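A minimal sketch of the general kernel estimator with interchangeable kernels (not from the slides; the data and bandwidth are illustrative):

```python
# General kernel density estimator f_hat(x0) = (1/(n*h)) * sum K((X_i - x0)/h),
# with the three kernels above. Data and h are illustrative.
import numpy as np

def kde_at(x0, data, h, kernel):
    psi = (data - x0) / h
    return kernel(psi).mean() / h

uniform      = lambda p: 0.5 * (np.abs(p) <= 1)
gaussian     = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)
epanechnikov = lambda p: 0.75 * (1 - p**2) * (np.abs(p) <= 1)

rng = np.random.default_rng(2)
data = rng.normal(size=5_000)                # true density: standard normal
for name, K in [("uniform", uniform), ("gaussian", gaussian), ("epanechnikov", epanechnikov)]:
    print(name, round(kde_at(0.0, data, h=0.4, kernel=K), 3))  # all near phi(0) ~ 0.399
```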

Finite Sample Properties

Assumption 1
1. The observations {X_1, · · · , X_n} are drawn iid.
2. f(x) is three-times differentiable.
3. The kernel function is symmetric around zero and satisfies
   i. ∫ K(ψ) dψ = 1
   ii. ∫ ψ² K(ψ) dψ = µ₂ ≠ 0
   iii. ∫ K²(ψ) dψ = µ < ∞
4. h_n → 0 as n → ∞
5. n h_n → ∞ as n → ∞

Theorem 1
Under Assumption 1, for any x that is an interior point in the support of X,

E[(f̂(x) − f(x))²] = (h_n⁴/4) [µ₂ f''(x)]² + µ f(x)/(n h_n) + o(h_n⁴ + 1/(n h_n)).
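The two terms trade off: the squared-bias term grows with h_n while the variance term shrinks. A small numeric sketch (not from the slides) plugs a Gaussian kernel and standard normal data into the leading terms of this formula, using the known constants µ₂ = 1, µ = 1/(2√π), and f''(x) = (x² − 1)φ(x); the o(·) remainder is ignored.

```python
# Evaluate the leading terms of the MSE in Theorem 1 at x = 0.5 for standard
# normal data and a Gaussian kernel (mu2 = 1, mu = 1/(2*sqrt(pi)),
# f''(x) = (x^2 - 1)*phi(x)). A sketch that only plugs numbers into the formula.
import numpy as np

def amse(h, n, x=0.5):
    phi = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # f(x)
    f2  = (x**2 - 1) * phi                           # f''(x)
    mu2, mu = 1.0, 1.0 / (2 * np.sqrt(np.pi))
    return (h**4 / 4) * (mu2 * f2) ** 2 + mu * phi / (n * h)

n = 1_000
grid = np.linspace(0.05, 1.5, 200)
best = grid[np.argmin(amse(grid, n))]
print(f"approximate AMSE-minimizing h for n={n}: {best:.3f}")
# Squared bias rises in h, variance falls in h; the minimizer shrinks with n
# (at the rate n^(-1/5), which motivates the rule-of-thumb bandwidth later).
```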

Proof of Theorem 1

The mean squared error (MSE) decomposes as

E[(f̂(x) − f(x))²] = E[(f̂(x) − E[f̂(x)] + E[f̂(x)] − f(x))²]
                  = E[(f̂(x) − E[f̂(x)])²] + E[(E[f̂(x)] − f(x))²]
                  = E[(f̂(x) − E[f̂(x)])²] + (E[f̂(x)] − f(x))²
                  = var(f̂(x)) + [bias(f̂(x))]²

Note that the interaction term disappears because E[f̂(x)] − f(x) is a constant:

2 E[(f̂(x) − E[f̂(x)])(E[f̂(x)] − f(x))] = 2 (E[f̂(x)] − f(x)) E[f̂(x) − E[f̂(x)]]
                                        = 2 (E[f̂(x)] − f(x)) (E[f̂(x)] − E[f̂(x)])
                                        = 0

Proof of Theorem 1: Bias

Define w_i := (1/h_n) K((X_i − x)/h_n), so that f̂(x) = (1/n) Σ_{i=1}^n w_i. Let ψ = (X_i − x)/h_n.

E(w_i) = E[(1/h_n) K((X_i − x)/h_n)] = ∫ (1/h_n) K((X_i − x)/h_n) f(X_i) dX_i
       = ∫ (1/h_n) K(ψ) f(x + h_n ψ) d(x + h_n ψ)
       = ∫ K(ψ) f(x + h_n ψ) dψ   (by the change of variables theorem)

Therefore,

E[f̂(x)] = (1/n) Σ_{i=1}^n E[w_i] = E[w_i].

Proof of Theorem 1: Bias

Change of Variables Theorem
Let I ⊆ R be an interval and φ : [a, b] → I a differentiable function with integrable derivative. Suppose that f : I → R is continuous. Then

∫_{φ(a)}^{φ(b)} f(x) dx = ∫_a^b f(φ(t)) φ'(t) dt.

In our case, we let φ(ψ) = x + h_n ψ:

∫ (1/h_n) K(ψ) f(x_i) dx_i = ∫ (1/h_n) K(ψ) f(φ(ψ)) φ'(ψ) dψ
                           = ∫ (1/h_n) K(ψ) f(φ(ψ)) h_n dψ
                           = ∫ K(ψ) f(φ(ψ)) dψ
                           = ∫ K(ψ) f(x + h_n ψ) dψ

Proof of Theorem 1: Bias

Bias(f̂(x)) = E(f̂(x)) − f(x)
           = E(w_i) − f(x)
           = ∫ K(ψ) f(x + h_n ψ) dψ − f(x)
           = ∫ K(ψ) f(x + h_n ψ) dψ − ∫ K(ψ) f(x) dψ   (by Assumption 1.3.i)

∴ Bias(f̂(x)) = ∫ K(ψ) [f(x + h_n ψ) − f(x)] dψ

The larger the bandwidth h_n, the larger the bias.

Proof of Theorem 1: Bias

A second-order Taylor series expansion of f(x + h_n ψ) around x gives

f(x + h_n ψ) ≈ f(x) + h_n ψ f'(x) + (h_n² ψ²/2) f''(x)

∴ Bias(f̂(x)) ≈ ∫ K(ψ) [h_n ψ f'(x) + (h_n² ψ²/2) f''(x)] dψ
             = h_n f'(x) ∫ ψ K(ψ) dψ + (h_n² f''(x)/2) ∫ ψ² K(ψ) dψ
             = h_n² f''(x) µ₂ / 2

where the first integral vanishes by symmetry of K and the second equals µ₂.

Proof of Theorem 1: Variance

Var(f̂(x)) = Var((1/n) Σ_{i=1}^n w_i) = (1/n²) Σ_{i=1}^n Var(w_i)   (by A1, iid)
          = (1/n) Var(w_i)
          = (1/n) E(w_i²) − (1/n) [E(w_i)]²

E(w_i²) = E[(1/h_n²) K²((X_i − x)/h_n)] = ∫ (1/h_n²) K²((X_i − x)/h_n) f(X_i) dX_i
        = ∫ (1/h_n²) K²(ψ) f(x + h_n ψ) d(x + h_n ψ)
        = (1/h_n) ∫ K²(ψ) f(x + h_n ψ) dψ

∴ Var(f̂(x)) = (1/(n h_n)) ∫ K²(ψ) f(x + h_n ψ) dψ − (1/n) [∫ K(ψ) f(x + h_n ψ) dψ]²
Proof of Theorem 1: Variance

Plugging in the Taylor expansion f(x + h_n ψ) ≈ f(x) + h_n ψ f'(x) + (h_n² ψ²/2) f''(x) =: TE,

Var(f̂(x)) ≈ (1/(n h_n)) ∫ K²(ψ) TE dψ − (1/n) [∫ K(ψ) TE dψ]²
          = (1/(n h_n)) ∫ K²(ψ) TE dψ − (1/n) [f(x) + (h_n² µ₂/2) f''(x)]²

The second term converges to 0 more quickly than the first (it is of order 1/n versus 1/(n h_n)). Inside the first term, the pieces involving h_n ψ f'(x) and (h_n² ψ²/2) f''(x) vanish as h_n → 0, so to leading order

Var(f̂(x)) ≈ (1/(n h_n)) ∫ K²(ψ) [f(x) + h_n ψ f'(x) + (h_n² ψ²/2) f''(x)] dψ
          ≈ (1/(n h_n)) f(x) ∫ K²(ψ) dψ = µ f(x)/(n h_n)

Choice of Kernel Function

• The choice of kernel is usually not crucial.
• For commonly used kernels, the first moment equals zero and the second moment is finite.
• Many other kernel functions have been developed in the theoretical literature; higher-order kernels are one example.
• Higher-order kernels tend to result in lower bias.
• In empirical applications they are rarely used: the gains tend to be small and the costs large.

Choice of Bandwidth

Bandwidth choice is of great importance for estimation performance.

• There is a huge literature on optimal bandwidth selection.
• The default kernel-based nonparametric estimators in Stata and R use a rule-of-thumb optimal bandwidth based on the convergence rate (homework!).
• Data-driven bandwidth selection generally works better but is computationally demanding.
• Cross-validation is a popular way to find a data-driven optimal bandwidth.
• Minimize the mean integrated squared error (MISE):

  MISE = E[∫ (f̂(x) − f(x))² dx]
       = E[∫ (f̂(x)² − 2 f̂(x) f(x) + f(x)²) dx]
       = E[∫ f̂(x)² dx − ∫ 2 f̂(x) f(x) dx + ∫ f(x)² dx]

  where the last term is a constant (it does not depend on the bandwidth).

Choice of Bandwidth

• We only need to minimize over the first two terms:

  min_{h_n > 0}  E[∫ f̂(x)² dx − 2 ∫ f̂(x) f(x) dx]

• Empirical implementation: notice that ∫ f̂(x) f(x) dx = E_X[f̂(X)]. So we can find the optimal h_n by minimizing

  J(h_n) = ∫ f̂(x)² dx − (2/n) Σ_{i=1}^n f̂_(−i)(X_i)

  where f̂_(−i) is the leave-one-out kernel estimator (computed without observation i).
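A minimal sketch of this least-squares cross-validation criterion for a Gaussian-kernel density estimator (not from the slides); the ∫ f̂(x)² dx term is approximated on a grid, and the data and bandwidth grid are illustrative:

```python
# Leave-one-out least-squares cross-validation J(h) for a Gaussian-kernel KDE.
# A sketch: the integral of f_hat^2 is approximated on a fine grid (a closed
# form exists for the Gaussian kernel but is omitted for brevity).
import numpy as np

gauss = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def kde(x, data, h):
    # f_hat(x) = (1/(n h)) * sum K((X_i - x)/h), vectorized over points x
    return gauss((data[None, :] - np.asarray(x)[:, None]) / h).mean(axis=1) / h

def J(h, data, grid):
    n = len(data)
    fhat_sq = np.trapz(kde(grid, data, h) ** 2, grid)       # approx. of int f_hat^2
    # leave-one-out f_hat_(-i)(X_i): drop the i = j term from the kernel sum
    Kmat = gauss((data[None, :] - data[:, None]) / h) / h
    loo = (Kmat.sum(axis=1) - Kmat.diagonal()) / (n - 1)
    return fhat_sq - 2 * loo.mean()

rng = np.random.default_rng(3)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 400)
hs = np.linspace(0.1, 1.0, 30)
print("CV bandwidth:", hs[np.argmin([J(h, data, grid) for h in hs])])
```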

Summary

• As the sample size grows, h_n → 0 and the bias disappears.
• The variance converges to 0 as n h_n → ∞, and hence so does the MSE.
• Therefore, f̂(x) consistently estimates f(x).
• Under further regularity conditions, we can show asymptotic normality of f̂(x):

  √(n h_n) (f̂(x) − f(x)) →d N(0, f(x) ∫ K²(ψ) dψ).

• Note that the convergence rate is √(n h_n), the so-called nonparametric rate. Convergence is slower than the parametric √n rate, meaning that a larger sample size is required for precise estimation.

Multivariate Kernel Density Estimation

So far we have covered univariate density estimation. It extends easily to the multivariate setting, where we estimate the joint density of several random variables,

f(X_1, X_2, · · · , X_k).

Again, the probability mass at (x_1, · · · , x_k) is zero, but within a small bin

X_1 ∈ [x_1 − h, x_1 + h], · · · , X_k ∈ [x_k − h, x_k + h]

the probability is positive.

Multivariate Kernel Density Estimation

For the two-variable case,

P[X_1 ∈ [x_1 − h, x_1 + h], X_2 ∈ [x_2 − h, x_2 + h]] = ∫_{x_1−h}^{x_1+h} ∫_{x_2−h}^{x_2+h} f(X_1, X_2) dX_2 dX_1
                                                     ≈ (2h)² f(x_1, x_2)

Naive uniform kernel estimator:

f̂(x_1, x_2) = (1/(n h²)) Σ_{i=1}^n (1/2) 1[x_1 − h ≤ X_1i ≤ x_1 + h] · (1/2) 1[x_2 − h ≤ X_2i ≤ x_2 + h]
            = (1/(n h²)) Σ_{i=1}^n (1/2) 1[−1 ≤ (X_1i − x_1)/h ≤ 1] · (1/2) 1[−1 ≤ (X_2i − x_2)/h ≤ 1]

Multivariate Kernel Density Estimation

General kernel estimator:

f̂(x_1, · · · , x_k) = (1/(n h^k)) Σ_{i=1}^n K((X_1i − x_1)/h) · · · K((X_ki − x_k)/h)

• The convergence rate is now √(n h^k).
• The rate becomes polynomially slower as k goes up.
• Curse of dimensionality: the sample size needed for precise estimation explodes as k goes up!
• Note that the bandwidth h need not be the same for all variables:

  f̂(x_1, · · · , x_k) = (1/(n h_1 h_2 · · · h_k)) Σ_{i=1}^n K((X_1i − x_1)/h_1) · · · K((X_ki − x_k)/h_k)
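A minimal sketch of the product-kernel estimator for k = 2 with per-variable bandwidths (not from the slides; the Gaussian kernel, data, and bandwidths are illustrative):

```python
# Bivariate product-kernel density estimator:
# f_hat(x1, x2) = (1/(n h1 h2)) * sum K((X1i - x1)/h1) K((X2i - x2)/h2).
# A sketch with a Gaussian kernel; the bandwidths are illustrative.
import numpy as np

gauss = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def kde2(x1, x2, X1, X2, h1, h2):
    w = gauss((X1 - x1) / h1) * gauss((X2 - x2) / h2)   # product kernel weights
    return w.mean() / (h1 * h2)

rng = np.random.default_rng(4)
X1, X2 = rng.normal(size=5_000), rng.normal(size=5_000)  # independent N(0,1) draws
print(kde2(0.0, 0.0, X1, X2, h1=0.4, h2=0.4))            # near phi(0)^2 ~ 0.159
```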

Kernels for Discrete Valued Random Variables

Suppose X takes only discrete values,

X ∈ {x_1, · · · , x_J}.

The frequency estimator consistently estimates the probability mass function:

P̂[X = x_j] = (1/n) Σ_{i=1}^n 1[X_i = x_j]

This is already good enough, but we can do better.

Kernels for Discrete Valued Random Variables

The discrete kernel estimator is

P̂[X = x_j] = (1/n) Σ_{i=1}^n K_h(X_i, x_j)

where

K_h(X_i, x_j) = 1 − h         if X_i = x_j
              = h/(J − 1)     otherwise

and h ∈ [0, (J − 1)/J] is the bandwidth.

This estimator slightly under-weights the observations that exactly equal x_j and spreads the remaining weight over the other support points.

With a cross-validated optimal bandwidth, it achieves a better MSE than the frequency estimator.
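A minimal sketch of this discrete kernel estimator next to the frequency estimator (not from the slides; the support, probabilities, and bandwidth h = 0.05 are illustrative rather than cross-validated):

```python
# Discrete kernel estimator: weight 1 - h on observations equal to xj and
# h/(J - 1) on all others. The bandwidth h = 0.05 is illustrative, not cross-validated.
import numpy as np

def discrete_kernel_pmf(xj, data, support, h):
    J = len(support)
    match = (data == xj)
    return (match * (1 - h) + (~match) * h / (J - 1)).mean()

rng = np.random.default_rng(5)
support = np.array([0, 1, 2])
data = rng.choice(support, size=2_000, p=[0.5, 0.3, 0.2])
for xj in support:
    freq = (data == xj).mean()   # frequency estimator for comparison
    print(xj, round(freq, 3), round(discrete_kernel_pmf(xj, data, support, h=0.05), 3))
```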

Nonparametric Conditional Mean estimator

The simplest non-trivial nonparametric model, a nonparametric generalization of OLS, is nonparametric conditional mean regression:

Y = m(X) + U,   E[U|X] = 0

The conditional mean of Y given X = x is

E[Y|X = x] = m(x) + E[U|X = x] = m(x)

Goal: approximate m arbitrarily closely with a large enough sample.

We consider the univariate X case for now.

Nonparametric Conditional Mean estimator: Discrete Case

The conditional mean of Y is by definition

E[Y|X = x] = E[Y · 1[X = x]] / P[X = x].

Therefore, the sample analogue nonparametric mean estimator is

m̂(x) = [(1/n) Σ_{i=1}^n Y_i 1[X_i = x]] / [(1/n) Σ_{i=1}^n 1[X_i = x]] = Σ_{i=1}^n Y_i 1[X_i = x] / Σ_{i=1}^n 1[X_i = x].

Using this estimator, you may guess that the kernel mean estimator for continuous X would be

m̂(x) = Σ_{i=1}^n Y_i K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h).

Nonparametric Conditional Mean estimator: Continuous Case

The conditional mean of Y is by definition

E[Y|X = x] = ∫ y f_{Y|X}(y|x) dy = ∫ y [f_{YX}(y, x) / f_X(x)] dy.

Replacing the density functions f_{YX}(y, x) and f_X(x) with kernel estimators,

m̂(x) = ∫ y [(1/(n h²)) Σ_{i=1}^n K((Y_i − y)/h) K((X_i − x)/h)] / [(1/(n h)) Σ_{i=1}^n K((X_i − x)/h)] dy
     = (1/h) Σ_{i=1}^n [∫ y K((Y_i − y)/h) dy] K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h)
     = (1/h) Σ_{i=1}^n h Y_i K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h)
     = Σ_{i=1}^n Y_i K((X_i − x)/h) / Σ_{i=1}^n K((X_i − x)/h)
Nonparametric Conditional Mean estimator: Continuous Case

The last step comes from a change of variables. Notice that K((Y_i − y)/h) = K((y − Y_i)/h) by symmetry of K. Define ψ = (y − Y_i)/h and φ(ψ) = y = Y_i + hψ. Then

∫ y K((Y_i − y)/h) dy = ∫ y K(ψ) dy
                      = ∫ (Y_i + hψ) K(ψ) φ'(ψ) dψ
                      = h ∫ (Y_i + hψ) K(ψ) dψ
                      = h [Y_i ∫ K(ψ) dψ + h ∫ ψ K(ψ) dψ]
                      = h [Y_i · 1 + h · 0]
                      = h Y_i

Nonparametric Conditional Mean estimator: Continuous Case

The kernel conditional mean estimator for multivariate X is

m̂(x) = Σ_{i=1}^n Y_i K_X((X_i − x)/h) / Σ_{i=1}^n K_X((X_i − x)/h)

where K_X((X_i − x)/h) = K((X_1i − x_1)/h) · · · K((X_ki − x_k)/h).

This is called the Nadaraya-Watson kernel mean estimator (also known as the local constant estimator).

• It equals a weighted average of Y over values of X, where the weight on observation i is

  K_X((X_i − x)/h) / Σ_{j=1}^n K_X((X_j − x)/h)

• Values of X_i “close” to x receive high weights.
• The definition of “close” depends on the bandwidth h.
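A minimal sketch of the Nadaraya-Watson estimator for univariate X (not from the slides; the Gaussian kernel, the regression function m(x) = sin(x), the noise level, and the bandwidth are illustrative):

```python
# Nadaraya-Watson (local constant) estimator:
# m_hat(x) = sum Y_i K((X_i - x)/h) / sum K((X_i - x)/h).
# A sketch with a Gaussian kernel; m(x) = sin(x), the noise, and h are illustrative.
import numpy as np

gauss = lambda p: np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def nw(x, X, Y, h):
    w = gauss((X - x) / h)           # kernel weights; largest for X_i near x
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=2_000)
Y = np.sin(X) + rng.normal(scale=0.3, size=X.size)   # Y = m(X) + U with m = sin
for x0 in [-1.0, 0.0, 1.0]:
    print(x0, round(nw(x0, X, Y, h=0.3), 3), "vs m(x) =", round(np.sin(x0), 3))
```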

Comments on Nadaraya-Watson Estimator

• The density f of X and the conditional mean function m need to be sufficiently smooth.
• Under conditions similar to those for the kernel density estimator, the NW estimator is consistent and asymptotically normal:

  √(n h) (m̂(x) − E_X[m̂(x)]) →d N(0, (σ²/f(x)) ∫ K²(ψ) dψ)

  if the error U is homoskedastic (E[U²|X] = σ²).
• It is much more complex to derive the bias and variance.
• Around the boundary of the support of X, the kernel estimator works poorly. The convention is to compute it on a subset of the support of X, e.g. the 5th to 95th percentile of X.

Summary

Pros:

• Kernel-based nonparametric estimation is computationally simple and theoretically attractive.
• It is readily available in many packages, e.g. R, Stata, and Matlab.
• As long as the sample size is large enough, the weak restrictions imposed by nonparametric estimators let the data speak for themselves.

Cons:

• Curse of dimensionality: the computational burden explodes as the sample size or dim X goes up.
• Results are sensitive to the choice of bandwidth.
• Endogeneity is hard to treat. Nonparametric IV regression is possible but in general suffers from huge variance. Possible treatments are proposed in Chetverikov and Wilhelm (2017) and Chetverikov, Kim and Wilhelm (2018).

