
Fitting probability distributions to data

A: The normal distribution


Distributional modeling
A useful way to understand a data set:
• Fit a probability distribution to it.
• Simple and compact.
• Captures the big picture while smoothing out the wrinkles in the data.
• In subsequent applications, use the distribution as a proxy for the data.

Which distributions to use?


There exist a few distributions of great universality which occur in a surprisingly
large number of problems. The three principal distributions, with ramifications
throughout probability theory, are the binomial distribution, the normal distri-
bution, and the Poisson distribution. – William Feller.

We’ll see others as well. And in higher dimensions, we’ll use various combinations of 1-d models: products and mixtures.

The normal distribution

The normal (or Gaussian) N(µ, σ²) has mean µ, variance σ², and density function

p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).

• 68.3% of the distribution lies within one standard deviation of the mean, µ ± σ
• 95.5% lies within µ ± 2σ
• 99.7% lies within µ ± 3σ
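As a quick numerical check of the density formula and the 68–95–99.7 rule, here is a minimal sketch using numpy and scipy.stats (a tooling assumption; the parameter values are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 170.0, 10.0          # illustrative parameters
X = norm(loc=mu, scale=sigma)    # N(mu, sigma^2)

# Density at the mean equals 1 / sqrt(2*pi*sigma^2)
print(X.pdf(mu), 1 / np.sqrt(2 * np.pi * sigma**2))

# Probability mass within 1, 2, 3 standard deviations of the mean
for k in (1, 2, 3):
    mass = X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma)
    print(f"within {k} sigma: {mass:.3f}")   # ~0.683, 0.954, 0.997
```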
Gaussians are everywhere
[Figure: U.S. height distribution in centimeters, shown as histograms for women and men; x-axis: height bins in cm, y-axis: number of people in thousands.]

Central Limit Theorem: Let X1, X2, . . . be independent with EXi = µi and var(Xi) = vi. Then

\frac{(X_1 + \cdots + X_n) - (\mu_1 + \cdots + \mu_n)}{\sqrt{v_1 + \cdots + v_n}} \;\xrightarrow{d}\; N(0, 1)
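A short simulation illustrating the statement (a sketch with numpy; the uniform distribution and sample sizes are my choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 20000

# i.i.d. Uniform(0,1) draws: each has mean 1/2 and variance 1/12
X = rng.uniform(0.0, 1.0, size=(trials, n))
Z = (X.sum(axis=1) - n * 0.5) / np.sqrt(n / 12.0)   # standardized sums

print(Z.mean(), Z.std())          # ~0 and ~1
print(np.mean(np.abs(Z) <= 1))    # ~0.683, as for a standard normal
```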

Fitting a Gaussian to data

Given: Data points x1 , . . . , xn to which we want to fit a distribution.


What Gaussian distribution N(µ, σ²) should we choose?
B: The Poisson distribution

The Poisson distribution


A distribution over the non-negative integers {0, 1, 2, . . .}

Poisson(λ), with λ > 0:

\Pr(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}

• Mean: EX = λ
• Variance: E(X − λ)² = λ
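A quick check of the pmf and the mean/variance identities using scipy.stats (a tooling assumption):

```python
import numpy as np
from scipy.stats import poisson

lam = 3.87                       # e.g. the Rutherford rate below
X = poisson(mu=lam)

k = np.arange(10)
print(X.pmf(k))                  # e^{-lam} * lam^k / k!
print(X.mean(), X.var())         # both equal lambda
```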
How the Poisson arises

Count the number of events (collisions, phone calls, etc) that occur in a certain
interval of time. Call this number X , and say it has expected value λ.

Now suppose we divide the interval into small pieces of equal length.

If the probability of an event occurring in a small interval is:


• independent of what happens in other small intervals, and
• the same across small intervals,
then, in the limit of infinitesimally small pieces, X ∼ Poisson(λ).

Poisson: examples
Rutherford’s experiments with radioactive disintegration (1920)

[Diagram: a radioactive substance emitting particles toward a counter.]

• N = 2608 intervals of 7.5 seconds
• Nk = # intervals with k particles
• Mean: 3.87 particles per interval
• P(3.87): expected counts under Poisson(3.87), i.e., 2608 · Pr(X = k) (and 2608 · Pr(X ≥ 9) for the last column)

k        0     1     2     3     4     5     6     7     8     ≥9
Nk       57    203   383   525   532   408   273   139   45    43
P(3.87)  54.4  211   407   526   508   394   254   140   67.9  46.3
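The expected-count row can be reproduced directly from the Poisson pmf (a sketch assuming scipy; rounding may differ slightly from the slide):

```python
import numpy as np
from scipy.stats import poisson

N, lam = 2608, 3.87
observed = np.array([57, 203, 383, 525, 532, 408, 273, 139, 45, 43])   # the Nk row

# Expected counts: N * Pr(X = k) for k = 0..8, plus N * Pr(X >= 9) for the last bin
pmf = poisson.pmf(np.arange(9), lam)
expected = N * np.append(pmf, 1.0 - pmf.sum())

print(observed)
print(np.round(expected, 1))   # close to the P(3.87) row above
```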
Flying bomb hits on London in WWII

[Photo: Bundesarchiv, Bild 146-1975-117-26 / Lysiak / CC-BY-SA 3.0]

• Area divided into 576 regions, each 0.25 km²
• Nk = # regions with k hits
• Mean: 0.93 hits per region
• P(0.93): expected counts under Poisson(0.93)

k        0      1      2      3      4     ≥5
Nk       229    211    93     35     7     1
P(0.93)  226.8  211.4  98.54  30.62  7.14  1.57

Fitting a Poisson distribution to data

Given samples x1 , . . . , xn , what Poisson(λ) model to choose?


C: Maximum likelihood estimation

Maximum likelihood estimation

Let P = {Pθ : θ ∈ Θ} be a class of probability distributions (Gaussians, Poissons, etc).


Maximum likelihood principle: pick the θ ∈ Θ that makes the data maximally
likely, that is, maximizes Pr(data|θ) = Pθ (data).

Three steps:
1 Write down an expression for the likelihood, Pr(data|θ).
2 Maximizing this is the same as maximizing its log, the log-likelihood.
3 Solve for the maximum-likelihood parameter θ.
Maximum likelihood estimation of the Poisson
P = {Poisson(λ) : λ > 0}. We observe x1 , . . . , xn .
• Write down an expression for the likelihood, Pr(data|λ).

• Maximizing this is the same as maximizing its log, the log-likelihood.

• Solve for the maximum-likelihood parameter λ.
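For reference, a worked version of these three steps (standard algebra, not shown on the slide):

\Pr(\text{data} \mid \lambda) = \prod_{i=1}^{n} e^{-\lambda} \frac{\lambda^{x_i}}{x_i!}

LL(\lambda) = -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big) \ln\lambda - \sum_{i=1}^{n} \ln(x_i!)

\frac{d\,LL}{d\lambda} = -n + \frac{1}{\lambda} \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat\lambda = \frac{1}{n} \sum_{i=1}^{n} x_i

So the maximum-likelihood Poisson parameter is simply the empirical mean of the observations.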

Maximum likelihood estimation of the normal


You see n data points x1, . . . , xn ∈ R, and want to fit a Gaussian N(µ, σ²) to them.
• Maximum likelihood: pick µ, σ to maximize

\Pr(\text{data} \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)

• Work with the log, since it makes things easier:

LL(\mu, \sigma^2) = \frac{n}{2} \ln \frac{1}{2\pi\sigma^2} \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 .

• Setting the derivatives to zero, we get

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad\qquad
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2
These are simply the empirical mean and variance.
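A minimal numpy sketch on synthetic data (the data, seed, and true parameters are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # synthetic data: true mu=5, sigma=2

mu_hat = x.mean()                     # empirical mean
var_hat = np.mean((x - mu_hat)**2)    # empirical variance (divides by n, not n-1)

print(mu_hat, np.sqrt(var_hat))       # roughly 5 and 2
```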
D: The binomial distribution

The binomial distribution


Binomial(n, p): # of heads from n independent coin tosses of bias (heads prob) p.

For X ∼ binomial(n, p),

EX = np

var(X) = np(1 − p)

\Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}
Fitting a binomial distribution to data
Example: Survey on food tastes.
• You choose 1000 people at random and ask them whether they like sushi.
• 600 say yes.
What is a good estimate for the fraction of people who like sushi? Clearly, 60%.

More generally, say you observe n tosses of a coin of unknown bias, and k come up
heads. What distribution binomial(n, p) is the best fit to this data?
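The slide’s answer for the survey (60%) is exactly the maximum-likelihood estimate p = k/n. A quick numerical confirmation, assuming scipy (the grid search is only for illustration):

```python
import numpy as np
from scipy.stats import binom

n, k = 1000, 600                 # survey: 1000 people asked, 600 said yes
p_hat = k / n                    # maximum-likelihood estimate: 0.6

# Sanity check: p = 0.6 maximizes the binomial likelihood of observing k = 600
grid = np.linspace(0.01, 0.99, 99)
print(p_hat, grid[np.argmax(binom.pmf(k, n, grid))])   # 0.6 and ~0.6
```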
Maximum likelihood: a small caveat

You have two coins of unknown bias.

• You toss the first coin 10 times, and it comes out heads every time.
You estimate its bias as p1 = 10/10 = 1.
• You toss the second coin 10 times, and it comes out heads once.
You estimate its bias as p2 = 1/10 = 0.1.

Now you are told that one of the coins was tossed 20 times and 19 of them came out
heads. Which coin do you think it is?

• Likelihood under p1: Pr(19 heads out of 20 tosses | bias = 1) = 0, since a coin of bias 1 can never produce a tail.

• Likelihood under p2: Pr(19 heads out of 20 tosses | bias = 0.1) = \binom{20}{19}(0.1)^{19}(0.9) ≈ 1.8 × 10⁻¹⁸.

So maximum likelihood says it must be the second coin, however implausible that seems: the estimate p1 = 1 assigns zero probability to ever seeing a tail.
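The two likelihoods, computed with scipy (a tooling assumption):

```python
from scipy.stats import binom

# Likelihood of "19 heads in 20 tosses" under each estimated bias
print(binom.pmf(19, 20, 1.0))   # 0.0  -- a bias-1 coin can never show a tail
print(binom.pmf(19, 20, 0.1))   # ~1.8e-18
```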

Laplace smoothing
A smoothed version of maximum-likelihood: when you toss a coin n times and observe
k heads, estimate the bias as
p = \frac{k+1}{n+2}.
We will later justify this in a Bayesian setting.

Laplace’s law of succession: What is the probability that the sun won’t rise tomorrow?
• Let p be the probability that the sun won’t rise on a randomly chosen day.
We want to estimate p.
• For the past 5000 years (= 1825000 days), the sun has risen every day.
Using Laplace smoothing, estimate
p = \frac{1}{1825002}.
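A two-line sketch of the rule (the function name is mine):

```python
def laplace_estimate(k, n):
    """Laplace-smoothed estimate of a coin's bias after k heads in n tosses."""
    return (k + 1) / (n + 2)

print(laplace_estimate(10, 10))        # 11/12 instead of 1
print(laplace_estimate(1, 10))         # 2/12 instead of 0.1
print(laplace_estimate(0, 1825000))    # 1/1825002: the law-of-succession example
```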
Normal approximation to the binomial


When a coin of bias p is tossed n times, let Sn be the number of heads.


• We know Sn has mean np and variance np(1 − p).
• By the central limit theorem: as n grows, the distribution of Sn looks increasingly like a Gaussian with this mean and variance, i.e.,

\frac{S_n - np}{\sqrt{np(1-p)}} \;\xrightarrow{d}\; N(0, 1).
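A quick comparison of the exact binomial pmf with the Gaussian density (a sketch assuming scipy; n and p are illustrative):

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.3
mu, sd = n * p, np.sqrt(n * p * (1 - p))

k = np.arange(15, 46)
exact = binom.pmf(k, n, p)
approx = norm.pdf(k, loc=mu, scale=sd)   # Gaussian density evaluated at the integers

print(np.max(np.abs(exact - approx)))    # small, and shrinks as n grows
```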

Poisson approximation to the binomial

Toss n independent coins with biases p1, . . . , pn and let Sn be the number of heads.

Le Cam’s inequality:

\sum_{k=0}^{\infty} \left| \Pr(S_n = k) - e^{-\lambda} \frac{\lambda^k}{k!} \right| \;\le\; 2 \sum_{i=1}^{n} p_i^2,

where λ = p1 + · · · + pn.

Poisson limit theorem: If all pi = λ/n, then

S_n \xrightarrow{d} \text{Poisson}(\lambda).

Also called “the law of rare events”.
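A small numerical illustration of the limit (a sketch assuming scipy; λ and the values of n are my choice):

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0
k = np.arange(15)
for n in (10, 100, 1000):
    # Binomial(n, lam/n) pmf vs Poisson(lam) pmf
    gap = np.max(np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)))
    print(n, gap)    # the gap shrinks as n grows (Le Cam bound: 2*lam^2/n)
```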


E: The multinomial distribution

The multinomial distribution


Imagine a k-faced die, with probabilities p1 , . . . , pk .
Toss such a die n times, and count the number of times each of the k faces occurs:

Xj = # of times face j occurs

The distribution of X = (X1 , . . . , Xk ) is called the multinomial.

• Parameters: p1 , . . . , pk ≥ 0, with p1 + · · · + pk = 1.
• EX = (np1 , np2 , . . . , npk ).
• \Pr(n_1, \ldots, n_k) = \binom{n}{n_1, n_2, \ldots, n_k} p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}, where

\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!},

the # of ways to place balls numbered {1, . . . , n} into bins numbered {1, . . . , k} so that bin j receives exactly nj balls.
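A short look at this distribution via scipy.stats (a tooling assumption; the die and counts are illustrative):

```python
import numpy as np
from scipy.stats import multinomial

n, p = 10, [0.2, 0.3, 0.5]                # a 3-faced "die" tossed 10 times
dist = multinomial(n, p)

print(dist.pmf([2, 3, 5]))                # (10! / (2! 3! 5!)) * 0.2^2 * 0.3^3 * 0.5^5
print(n * np.array(p))                    # EX = (np1, np2, np3)
print(dist.rvs(size=3, random_state=0))   # a few sampled count vectors
```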
Example: text documents
Bag-of-words: vectorial representation of text documents.

"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."

Selected coordinates of the resulting count vector (from the slide): despair 1, evil 2, happiness 0, foolishness 1, . . .

• Fix V = some vocabulary.


• Treat words in the document as independent draws from a multinomial over V:

p = (p_1, \ldots, p_{|V|}), \quad \text{such that } p_i \ge 0 \text{ and } \sum_i p_i = 1

How would we estimate the parameters of a multinomial?
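The natural answer mirrors the binomial case: the maximum-likelihood estimate of pj is the fraction of words equal to word j, and Laplace (add-one) smoothing avoids zero probabilities for unseen words. A sketch on a hypothetical toy document (the (count + 1)/(n + |V|) form is the standard add-one generalization, not spelled out on the slides):

```python
from collections import Counter

# Hypothetical toy document; in practice V would be a fixed vocabulary
doc = "it was the best of times it was the worst of times".split()
V = sorted(set(doc))

counts = Counter(doc)
n = len(doc)

# Maximum likelihood: p_j = (# occurrences of word j) / (total # of words)
p_ml = {w: counts[w] / n for w in V}

# Add-one (Laplace) smoothing, so unseen words don't get probability zero
p_laplace = {w: (counts[w] + 1) / (n + len(V)) for w in V}

print(p_ml)
print(p_laplace)
```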

F: Alternatives to maximum likelihood?


Alternatives to maximum likelihood
Choosing a model in {Pθ : θ ∈ Θ} given observations x1 , x2 , . . . , xn .
• Maximum likelihood.
The default, most common, choice.

• Method of moments.
Pick the model whose moments E_{X∼Pθ}[f(X)] match their empirical estimates.

• Bayesian estimation.
Return the maximum a-posteriori distribution, or the overall posterior.

• Maximum entropy.
We’ll see this soon.

• Other optimization-based or game-theoretic criteria.


As in generative adversarial nets, for instance.

Desiderata for probability estimators


Overall goal: Given data x1 , . . . , xn , want to choose a model Pθ , θ ∈ Θ.

• Let T (x1 , . . . , xn ) be some estimator of θ.


• Suppose X1 , . . . , Xn are i.i.d. draws from Pθ . Ideally T (X1 , . . . , Xn ) ≈ θ.

Some typical desiderata, if X1 , . . . , Xn ∼ Pθ .

1 Unbiased: ET (X1 , . . . , Xn ) = θ.
2 Asymptotically consistent: T (X1 , . . . , Xn ) → θ as n → ∞.
3 Low variance: var(T (X1 , . . . , Xn )) is small.
4 Computationally feasible: Is T (X1 , . . . , Xn ) easy to compute?

Do maximum-likelihood estimators possess these properties?


Are maximum likelihood estimators unbiased?

In general, no.

Example: Fit a normal distribution to observations X1, . . . , Xn ∼ N(µ, σ²).


• Maximum likelihood estimates:

\hat\mu = \frac{X_1 + \cdots + X_n}{n}
\qquad\qquad
\hat\sigma^2 = \frac{(X_1 - \hat\mu)^2 + \cdots + (X_n - \hat\mu)^2}{n}

• Can check that E[\hat\mu] = \mu but

E[\hat\sigma^2] = \frac{n-1}{n} \, \sigma^2 .
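A quick simulation of this bias (a sketch with numpy; the parameters are my choice):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 1.0, 5, 200_000

X = rng.normal(mu, sigma, size=(trials, n))
mu_hat = X.mean(axis=1)
var_hat = ((X - mu_hat[:, None])**2).mean(axis=1)   # ML estimate: divides by n

print(mu_hat.mean())    # ~0: the mean estimate is unbiased
print(var_hat.mean())   # ~0.8 = (n-1)/n * sigma^2, not 1: the variance estimate is biased
```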

Maximum likelihood: asymptotically consistent?


Not always, but under some conditions, yes.
Rough intuition:
• Given data X1 , . . . , Xn ∼ Pθ∗ , want to choose a model Pθ , θ ∈ Θ.
• We pick the θ that maximizes

\frac{1}{n} LL(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ln P_\theta(X_i)
\;\longrightarrow\; E_{X \sim P_{\theta^*}}[\ln P_\theta(X)]
= E_{X \sim P_{\theta^*}}[\ln P_{\theta^*}(X)] - K(P_{\theta^*}, P_\theta)

where K is the KL divergence. So for large n, maximizing the likelihood amounts to minimizing K(Pθ*, Pθ), which is zero exactly when Pθ = Pθ*.
Postscript: some other canonical distributions

We’ve seen the normal, Poisson, binomial, and multinomial.

Some others:
1 Gamma: two-parameter family of distributions over R+
2 Beta: two-parameter family of distributions over [0, 1]
3 Dirichlet: k-parameter family of distributions over the k-probability simplex

All of these are exponential families of distributions.
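For completeness, all three families are available in scipy.stats (a tooling assumption; the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import gamma, beta, dirichlet

rng = np.random.default_rng(0)

print(gamma(a=2.0, scale=1.5).rvs(3, random_state=rng))           # samples in R+
print(beta(a=2.0, b=5.0).rvs(3, random_state=rng))                # samples in [0, 1]
print(dirichlet(alpha=[1.0, 2.0, 3.0]).rvs(2, random_state=rng))  # each row sums to 1
```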
