CS 215
Data Analysis and Interpretation
Estimation
Suyash P. Awate
Sample
• Definition:
If random variables X1, …, XN are i.i.d.,
then they constitute a random sample of size N from the common
distribution
• N = “sample size”
• One set of observed data is one instance/realization of the sample
• i.e., {x1, …, xN}
• The common distribution from which data was “drawn” is usually
unknown
Statistic
• Definition:
Let X1, …, XN denote a sample associated with random variable X
(i.e., all of X1, …, XN have the same distribution as X).
Let T(X1, …, XN) be a function of the sample.
Then, the random variable T is called a statistic.
• For the drawn sample {x1, …, xN},
the value t := T(x1, …, xN) is an instance of the statistic
Model
• Statistical model
• Typically, a probabilistic description of real-world phenomena
• Description involves a distribution that may involve some parameters
• e.g., P(X; θ)
• Describes/represents a data-generation process
• Designed by people
• Unlike data that is observed/measured/acquired
• Nature doesn’t generate models
Estimation
• Estimation theory
• A branch of statistics that deals with estimating the values of parameters
(underlying a statistical model) based on measured/empirical data
• While data generation starts with parameters and leads to data,
estimation starts with data and leads to parameters
• Estimation problem
• Given: Data
• Assumption: Data was generated from a parametric family of distributions
(i.e., a family of models)
• Goal: To infer the distribution parameters
(i.e., the distribution/model instance from the family of distributions/models)
that the data was generated from
Estimator, Estimate
• Estimator
• A deterministic (not stochastic) rule/formula/function/algorithm
for calculating/computing an estimate of a given quantity
(e.g., a parameter value)
based on observed data
• Sometimes the estimator is obtained as a closed-form expression
• But not always
• An estimator T(X1, …, XN) is also a statistic
• Estimate
• A value resulting from applying the estimator to data
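• Illustration (an added sketch, not from the slides): the estimator is a deterministic function; the estimate is its value on one realized sample. The function name sample_mean is illustrative.

```python
import numpy as np

def sample_mean(xs):
    """Estimator: a deterministic rule mapping observed data to a number."""
    return sum(xs) / len(xs)

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # one realization {x1, ..., xN}
estimate = sample_mean(data)                     # one instance of the statistic
print(estimate)
```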
Estimator Mean, Variance, Bias
• Let X1, …, XN be a sample on a random variable X with PDF/PMF P(X; θ)
• Let T(X1, …, XN) be an estimator for a parameter whose true value is θ
• Mean of the estimator (definition):
Expected value of T, i.e., E[T]
• Bias of the estimator (definition)
Bias(T) := E[T] – θ
• Unbiased estimator (definition)
is one where Bias(T) = 0, i.e., E[T] = θ
• Variance of the estimator (definition)
Var(T) := E[(T − E[T])²]
• Mean squared error (MSE) of the estimator (definition)
• Expected value of the squared error: MSE(T) := E[(T − θ)²]
Estimator MSE, Bias, Variance
• MSE(T) := E[(T − θ)²]
  = E[(T − E[T] + E[T] − θ)²]
  = E[(T − E[T])²] + E[(E[T] − θ)²] + E[2(T − E[T])(E[T] − θ)]
  = Var(T) + (Bias(T))² + 0
  = Variance + Bias²
• Bias-variance decomposition/“tradeoff”:
• If two estimators T1 and T2 have the same MSE,
  then if one estimator (say, T1) has a smaller bias magnitude,
  it (i.e., T1) also has a larger variance
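• Illustration (an added sketch, not from the slides): a Monte Carlo check of MSE = Variance + Bias², using the uncorrected sample variance of a Gaussian as the estimator T.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true = 4.0          # true parameter theta
N, trials = 10, 200_000    # sample size, Monte Carlo repetitions

# T = uncorrected sample variance, computed on many independent samples
samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))
T = samples.var(axis=1, ddof=0)

mse   = np.mean((T - sigma2_true) ** 2)
bias  = T.mean() - sigma2_true
var_T = T.var()
print(mse, var_T + bias ** 2)  # the two numbers agree up to Monte Carlo error
```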
Estimator Mean, Variance, Bias
• Let X1, …, XN be a sample on a random variable X with PDF/PMF P(X; θ)
• Let T(X1, …, XN) be an estimator for a parameter whose true value is θ
• Consistent estimator (definition)
• Estimator TN = T(X1, …, XN) is consistent if ∀ε > 0, lim_{N→∞} P(|TN − θ| ≥ ε) = 0
• Thus, TN is said to “converge in probability” to θ
• Law of large numbers: for all ε > 0, as N→∞, P(|X̄N − μ| ≥ ε) → 0
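• Illustration (an added sketch, not from the slides): the empirical probability P(|X̄N − μ| ≥ ε) shrinking as N grows, for the sample mean of an exponential distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials = 1.5, 0.1, 1000  # exponential with scale mu has mean mu

for N in [10, 100, 1000, 10000]:
    xbar = rng.exponential(scale=mu, size=(trials, N)).mean(axis=1)
    # empirical estimate of P(|T_N - mu| >= eps); it shrinks toward 0 with N
    print(N, np.mean(np.abs(xbar - mu) >= eps))
```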
Likelihood Function
• Let X1, …, XN be a sample on a random variable X with PDF/PMF P(X; θ)
• Definition: Likelihood function L(θ; X1, …, XN) := ∏_{i=1}^{N} P(X_i; θ)
• We want to use the likelihood function to estimate θ from the sample
• Sometimes, analysis relies on log(L(θ; X1, …, XN)),
leveraging that log(.) is strictly monotonically increasing within (0,∞)
• Some assumptions (#)
1. Different values of θ correspond to different CDFs associated with P(X; θ)
• i.e., parameter θ identifies a unique CDF
2. All PMFs/PDFs have common support for all parameters θ
• i.e., support of X cannot depend on θ
• Under these assumptions, the likelihood function has a nice property
(as discussed next)
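• Illustration (an added sketch, not from the slides; the Bernoulli case appears later in these slides): evaluating the log-likelihood of a sample over a grid of candidate θ values, with the peak landing near the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3
x = rng.binomial(1, theta_true, size=200)   # Bernoulli sample {x1, ..., xN}

thetas = np.linspace(0.01, 0.99, 99)        # grid of candidate parameters
# log L(theta) = sum_i log P(x_i; theta) for the Bernoulli PMF
loglik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t))
                   for t in thetas])
print(thetas[np.argmax(loglik)])            # peaks near theta_true
```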
Likelihood Function
• Theorem: Let θtrue be the parameter value that led to sample X1, …, XN.
  Assume E_{P(X;θtrue)}[P(X; θ)/P(X; θtrue)] exists (e.g., it is finite). Then,
  lim_{N→∞} P(L(θtrue; X1, …, XN) > L(θ; X1, …, XN); θtrue) = 1, ∀θ ≠ θtrue
• Proof:
  • Event L(θtrue; X1, …, XN) > L(θ; X1, …, XN) ≡ (1/N) ∑_{i=1}^{N} log[P(X_i; θ)/P(X_i; θtrue)] < 0
  • We want to show that, as N→∞, this event (with strict inequality) has prob. 1
  • Because of the law of large numbers:
    lim_{N→∞} (1/N) ∑_{i=1}^{N} log[P(X_i; θ)/P(X_i; θtrue)] = E_{P(X;θtrue)}[log(P(X; θ)/P(X; θtrue))]
    (Law of large numbers: for all ε > 0, as N→∞, P(|ȲN − μ| ≥ ε) → 0)
  • Common support implies the probability ratio is >0 and <∞. So the sum and the expectation exist.
    Then, log(.) is strictly concave within (0,∞). Then, Jensen’s inequality makes the
    above expectation strictly < log E_{P(X;θtrue)}[P(X; θ)/P(X; θtrue)]
    (Jensen’s inequality: when g(.) is strictly concave, E_{P(X)}[g(h(X))] < g(E_{P(X)}[h(X)]))
Likelihood Function
• Theorem: Let θtrue be the parameter value that led to sample X1, …, XN.
  Assume E_{P(X;θtrue)}[P(X; θ)/P(X; θtrue)] exists (e.g., it is finite). Then,
  lim_{N→∞} P(L(θtrue; X1, …, XN) > L(θ; X1, …, XN); θtrue) = 1, ∀θ ≠ θtrue
• Proof (continued):
  • Consider the summation/integration underlying log E_{P(X;θtrue)}[P(X; θ)/P(X; θtrue)]
  • The expectation sums/integrates only over the support of P(X; θtrue).
    Thinking empirically, instances of x ∼ P(X; θtrue) never lie outside the support of the PMF/PDF.
    The first P(X; θtrue) term indicates a PMF/PDF; the second one indicates a transformation.
  • When the support of P(X; θtrue) is a superset of the support of P(X; θ),
    the summation/integral underlying the expectation evaluates to 1,
    and log E_{P(X;θtrue)}[P(X; θ)/P(X; θtrue)] = log 1 = 0
  • If, ∀θ ≠ θtrue, we want the expectation to evaluate to 1,
    then all PMFs/PDFs P(X; θ) need to have the same support.
Maximum Likelihood (ML) Estimation
• Definition:
An estimator T = T(X1, …, XN) is a “maximum likelihood (ML) estimator”
if T := arg max_θ L(θ; X1, …, XN)
• “arg maxθ g(θ)”: the argument (i.e., θ) that maximizes the function g(.)
• “maxθ g(θ)”: the maximum possible value of the function g(.) across all θ
• Properties of ML estimation
• Sometimes, ML estimator may not exist, or it may not be unique
• When assumptions (#) hold, and max of likelihood function exists & is unique,
then ML estimator is a consistent estimator
• When the sample size is finite, the convergence guarantee is lost
• When sample size is finite, this behavior holds for most methods,
unless very strong assumptions (usually not holding in practice) are made on the data
• In practice, a large enough sample size takes the ML estimate T sufficiently close to
θtrue so that the ML estimate T is still useful
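• Illustration (an added sketch, not from the slides): when no closed form is available, the likelihood can be maximized numerically; here a Gaussian negative log-likelihood is minimized with scipy.optimize.minimize and checked against the closed-form ML estimates.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params):
    mu, sigma = params
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# numerical maximization of the likelihood (minimization of the NLL);
# the bound on sigma keeps the optimizer inside the valid parameter space
result = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                  bounds=[(None, None), (1e-6, None)])
print(result.x)                       # close to the closed-form values below
print(data.mean(), data.std(ddof=0))  # closed-form ML estimates for a Gaussian
```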
MLE for Bernoulli
• Let θ := probability of success
• θ must lie within [0,1]
• Likelihood function L(θ) := ∏_{i=1}^{N} θ^{X_i} (1 − θ)^{(1 − X_i)}
• ML estimate for θ is what ?
• At maximum of L(θ):
• First derivative must be zero
• This gives one equation in one unknown θ
• Second derivative must be negative
• ML estimate is sample mean, i.e., ∑_{i=1}^{N} X_i / N
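• A short worked derivation (added; standard calculus steps following the recipe above):

```latex
\[ \log L(\theta) = \sum_{i=1}^{N} \big[ X_i \log\theta + (1 - X_i)\log(1-\theta) \big] \]
\[ \frac{d\log L}{d\theta} = \frac{\sum_i X_i}{\theta} - \frac{N - \sum_i X_i}{1-\theta} = 0
   \;\Rightarrow\; \hat{\theta} = \frac{1}{N}\sum_{i=1}^{N} X_i \]
% Second derivative: -\sum_i X_i/\theta^2 - (N - \sum_i X_i)/(1-\theta)^2 < 0, so a maximum.
```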
MLE for Binomial
• Let θ := probability of success; P(X = k; θ, M) = C(M,k) θ^k (1 − θ)^{(M − k)}
• θ must lie within [0,1]
• Let M := number of Bernoulli tries for each Binomial random variable
• Let { Xi : i = 1, …, N} model repeated draws from Binomial, where
Xi models number of successes in i-th draw from Binomial
• ML estimate for θ is sample mean ∑_{i=1}^{N} X_i / (NM)
• Interpretation:
• N independent Binomials draws,
where each Binomial has M independent Bernoulli draws,
is equivalent to NM independent Bernoulli draws
• Total number of successes in NM Bernoulli trials is ∑_{i=1}^{N} X_i
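• A short worked derivation (added; the binomial coefficient does not involve θ, so it drops out of the maximization):

```latex
\[ \log L(\theta) = \sum_{i=1}^{N} \Big[ \log\binom{M}{X_i}
   + X_i \log\theta + (M - X_i)\log(1-\theta) \Big] \]
\[ \frac{d\log L}{d\theta} = \frac{\sum_i X_i}{\theta} - \frac{NM - \sum_i X_i}{1-\theta} = 0
   \;\Rightarrow\; \hat{\theta} = \frac{\sum_{i=1}^{N} X_i}{NM} \]
```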
MLE for Poisson
• Parameter is the average rate of arrivals/hits λ; P(X = k; λ) = λ^k e^{−λ} / k!
• ML estimate is sample mean ∑_{i=1}^{N} X_i / N
• Note that λ is both the mean and the variance of the Poisson random variable
• So, sample variance can also estimate λ
• But computing sample variance needs computing sample mean anyway
• Also, sample mean is an “efficient” estimator (more on this later)
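• A short worked derivation (added; standard steps):

```latex
\[ \log L(\lambda) = \sum_{i=1}^{N} \big[ X_i \log\lambda - \lambda - \log(X_i!) \big] \]
\[ \frac{d\log L}{d\lambda} = \frac{\sum_i X_i}{\lambda} - N = 0
   \;\Rightarrow\; \hat{\lambda} = \frac{1}{N}\sum_{i=1}^{N} X_i \]
```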
Sample-Variance Estimator
• The sample-variance estimate S² for σ² is biased
• Asymptotically (as N→∞) unbiased
• So, the corrected estimator of variance Sc² := S²·N/(N−1) is unbiased
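• Illustration (an added sketch, not from the slides): NumPy's ddof argument switches between the uncorrected (biased) and corrected (unbiased) estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_true, N, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, 2.0, size=(trials, N))
s2_uncorrected = samples.var(axis=1, ddof=0)  # divides by N     -> biased
s2_corrected   = samples.var(axis=1, ddof=1)  # divides by N - 1 -> unbiased

print(s2_uncorrected.mean())  # approx sigma2_true * (N-1)/N = 3.2
print(s2_corrected.mean())    # approx sigma2_true = 4.0
```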
Sample-Variance Estimator
• What about the estimator of standard deviation σ defined as σ̂ := √(Sc²) ?
• Is E[σ̂] = σ ?
• Sqrt(.) is a strictly concave function within (0,∞)
• Apply Jensen’s inequality:
  E[√(Sc²)] < √(E[Sc²]) = σ
• Excepting the degenerate case when the distribution has variance 0
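• Illustration (an added sketch, not from the slides): even the square root of the corrected variance estimator underestimates σ on average, as Jensen's inequality predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_true, N, trials = 2.0, 5, 200_000

samples = rng.normal(0.0, sigma_true, size=(trials, N))
s_corrected = np.sqrt(samples.var(axis=1, ddof=1))  # sqrt of unbiased variance

print(s_corrected.mean())  # strictly below sigma_true = 2.0
```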
Sample-Variance Estimator
• Variance of sample variance
• Variance of (uncorrected or corrected) sample-variance
tends to zero asymptotically (as N→∞)
• When (finite-variance) conditions underlying the law of large numbers hold
• https://en.wikipedia.org/wiki/Variance#Distribution_of_the_sample_variance
• https://mathworld.wolfram.com/SampleVarianceDistribution.html
• Then, (uncorrected or corrected) sample variance is a consistent estimator
Sample-Covariance Estimator
• Consider a joint PDF/PMF P(X,Y) with Cov(X,Y) = E[XY] – E[X]E[Y]
• Let E[XY] = μxy , E[X] = μx , E[Y] = μy
• Let (Xi,Yi) and (Xj,Yj) be i.i.d. (e.g., Xi independent of Xj and Yj for all i≠j)
• Sample-covariance estimator: Ĉ := (1/n) ∑_{i=1}^{n} X_i Y_i − [(1/n) ∑_{i=1}^{n} X_i] [(1/n) ∑_{i=1}^{n} Y_i]
• E[(1/n) ∑_{i=1}^{n} X_i Y_i] = (1/n) ∑_{i=1}^{n} E[X_i Y_i] = (1/n)·n·μxy = μxy
• E[((1/n) ∑_{i=1}^{n} X_i)((1/n) ∑_{i=1}^{n} Y_i)] = (1/n²) ∑_i E[X_i Y_i] + (1/n²) ∑_{i≠j} E[X_i] E[Y_j]
  = (1/n²)·n·μxy + (1/n²)·n(n−1)·μx μy = (1/n) μxy + ((n−1)/n) μx μy
• So, expectation of sample-covariance = ((n−1)/n)(μxy − μx μy)
• Asymptotically unbiased. Corrected version will be unbiased.
• Can be shown to be consistent
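• Illustration (an added sketch, not from the slides): a Monte Carlo check of the (n−1)/n bias factor for the uncorrected sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
cov_true = 0.8  # Cov(X, Y) for the bivariate Gaussian below

cov_matrix = [[1.0, cov_true], [cov_true, 1.0]]
xy = rng.multivariate_normal([0.0, 0.0], cov_matrix, size=(trials, n))
x, y = xy[..., 0], xy[..., 1]

# uncorrected sample covariance: mean(XY) - mean(X) * mean(Y)
C = (x * y).mean(axis=1) - x.mean(axis=1) * y.mean(axis=1)
print(C.mean(), cov_true * (n - 1) / n)  # the two agree: bias factor (n-1)/n
```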
MLE for Gaussian
• Parameters are mean μ and standard deviation 𝜎
• Likelihood function L(μ,𝜎) is a function of 2 variables
• Maximizing likelihood function L(μ,𝜎) is equivalent to
maximizing log-likelihood function log(L(μ,𝜎))
• Because the log(.) function is strictly monotonically increasing within (0,∞)
• Need to solve for 2 equations in 2 unknowns
• ML estimate for μ is sample mean
• ML estimate for σ² is sample variance
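• A short worked derivation (added; two equations in two unknowns):

```latex
\[ \log L(\mu, \sigma) = -N \log\big(\sigma\sqrt{2\pi}\big)
   - \sum_{i=1}^{N} \frac{(X_i - \mu)^2}{2\sigma^2} \]
\[ \frac{\partial \log L}{\partial \mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_i X_i
   \qquad
   \frac{\partial \log L}{\partial \sigma} = 0 \;\Rightarrow\;
   \hat{\sigma}^2 = \frac{1}{N}\sum_i (X_i - \hat{\mu})^2 \]
```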
MLE for Half-Normal
• PDF (assuming the standard scale-σ parameterization; the slide's formula was an image):
  P(x; σ) = (√2 / (σ√π)) exp(−x² / (2σ²)), for x ≥ 0
• ML estimate is: σ̂ = √( (1/N) ∑_{i=1}^{N} X_i² ), i.e., the root mean square of the data
• This isn’t the sample mean,
  isn’t the sample std. dev.,
  isn’t the sample median
MLE for Laplace
• PDF (location μ, scale b): P(x; μ, b) = (1/(2b)) exp(−|x − μ| / b)
• ML estimates
  • For location parameter μ:
    sample median
  • For scale parameter b:
    mean/average absolute deviation
    (MAD/AAD)
    from the median
MLE for Uniform Distribution (Continuous)
• Parameters are: lower limit ‘a’ and upper limit ‘b’ (a < b)
• Support of PDF depends on parameters
• Let data from U(a,b) be {x1, …, xN}, sorted in increasing order, & x1 < xN
• What are ML estimates ?
• First, data must lie within [a,b]
• a ≤ x1 , else likelihood function = 0
• b ≥ xN , else likelihood function = 0
• Likelihood function L(a, b; {x1, …, xN}) := (1/(b−a))^N
• Log-likelihood function log(L(a, b; {x1, …, xN})) = −N·log(b−a)
• Partial derivative w.r.t. ‘a’ is N/(b–a) > 0
• Partial derivative w.r.t. ‘b’ is (–N/(b–a)) < 0
• L(a,b) is maximum when a = x1 and b = xN
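• Illustration (an added sketch, not from the slides): the ML estimates are the sample extremes, which sit slightly inside the true interval for finite N.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 2.0, 7.0
x = rng.uniform(a_true, b_true, size=1000)

a_hat, b_hat = x.min(), x.max()  # ML estimates: the sample extremes
print(a_hat, b_hat)              # slightly inside [a_true, b_true]
```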
MLE for Uniform Distribution (Continuous)
• Parameters are: lower limit ‘a’ and upper limit ‘b’ (a < b)
• Let data from U(a,b) be {x1, …, xN}, sorted in increasing order, & x1<xN
• Analysis of consistency
• For the estimator of ‘b’: ∀ε > 0 with ε < (b−a), consider P(b − max_{i=1,…,N} x_i ≥ ε)
  = P(b − x_1 ≥ ε) P(b − x_2 ≥ ε) ⋯ P(b − x_N ≥ ε)
  = P(x_1 ≤ b − ε) ⋯ P(x_N ≤ b − ε) = ((b − ε − a)/(b − a))^N,
  which → 0 as N→∞
  (Recall: estimator TN = T(X1, …, XN) is consistent if ∀ε > 0, lim_{N→∞} P(|TN − θ| ≥ ε) = 0)
• For the estimator of ‘a’: ∀ε > 0 with ε < (b−a), consider P(min_{i=1,…,N} x_i − a ≥ ε)
  = P(x_1 ≥ a + ε) P(x_2 ≥ a + ε) ⋯ P(x_N ≥ a + ε)
  = (1 − P(x_1 ≤ a + ε)) ⋯ (1 − P(x_N ≤ a + ε)) = ((b − a − ε)/(b − a))^N,
  which → 0 as N→∞
MLE for Uniform Distribution (Continuous)
• Parameters are: lower limit ‘a’ and upper limit ‘b’ (a < b)
• Let data from U(a,b) be {x1, …, xN}, sorted in increasing order, & x1<xN
• Analysis of bias (recall: Bias(T) := E[T] − θ)
• Without loss of generality, let a≥0 (shifted random variable)
• For non-negative random variable, apply tail-sum formula
E[max_{i=1,…,N} x_i] = ∫_{t=0}^{∞} (1 − P(max_{i=1,…,N} x_i ≤ t)) dt
= ∫_{t=0}^{a} 1 dt + ∫_{t=a}^{b} (1 − P(max_{i=1,…,N} x_i ≤ t)) dt + ∫_{t=b}^{∞} (1 − 1) dt
= a + ∫_{t=a}^{b} (1 − ((t − a)/(b − a))^N) dt
= a + (b − a) − (b − a)/(N + 1)
= b − (b − a)/(N + 1)      (check that this makes sense for N = 1)
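• Illustration (an added sketch, not from the slides): a Monte Carlo check of E[max_i x_i] = b − (b − a)/(N + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N, trials = 2.0, 7.0, 10, 200_000

max_x = rng.uniform(a, b, size=(trials, N)).max(axis=1)
print(max_x.mean(), b - (b - a) / (N + 1))  # simulation matches the formula
```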
Linear Regression
• Given: Data {(x_i, y_i)}_{i=1}^{N}
• Linear model: Y_i = α_true + β_true X_i + η_i,
  where errors η_i (in measuring Y_i; not X_i)
  are zero-mean i.i.d. Gaussian random variables
• Goal: Estimate α_true, β_true
• Log-likelihood function
  • L(α, β; {(x_i, y_i)}_{i=1}^{N}) := log ∏_i G(y_i; α + β x_i, σ²)
• Partial derivative w.r.t. α is 0 implies: α = ȳ − β x̄ (bar denotes sample mean)
• Partial derivative w.r.t. β is 0 implies: ∑_i (y_i − α − β x_i) x_i = 0
• Substituting the expression for α gives:
  β = ∑_i (y_i − ȳ) x_i / ∑_i (x_i − x̄) x_i = (mean(x·y) − x̄·ȳ) / (mean(x²) − x̄²)
    = SampleCov(X,Y) / SampleVar(X)
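• Illustration (an added sketch, not from the slides): the closed-form ML estimates computed directly, checked against NumPy's least-squares fit (np.polyfit), which solves the same problem.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true, sigma = 1.0, 2.5, 0.5
x = rng.uniform(0.0, 10.0, size=200)
y = alpha_true + beta_true * x + rng.normal(0.0, sigma, size=200)

# closed-form ML estimates from the slide
beta_hat = (((x * y).mean() - x.mean() * y.mean())
            / ((x ** 2).mean() - x.mean() ** 2))
alpha_hat = y.mean() - beta_hat * x.mean()

print(alpha_hat, beta_hat)
print(np.polyfit(x, y, deg=1))  # returns [slope, intercept]: same values
```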
Linear Regression
(Recall: slope m := Cov(X,Y) / Var(X); intercept c := E[Y] − Cov(X,Y)·E[X] / Var(X))
• Analysis of estimates
• Slope β = SampleCov(X,Y) / SampleVar(X)
• Unbiased (see next slide)
(ratio of sample-covariance and sample-variance is same with/without correction)
• Can be shown to be consistent (see next slide)
• Intercept α = ȳ − β x̄
  • We already know that ȳ and x̄ are unbiased and consistent estimators of E[Y] and E[X]
• Unbiased
• If 𝛽 is unbiased
• Can be shown to be consistent
• If 𝛽 is consistent
Linear Regression
• β = [(1/N) ∑_i (x_i − x̄)(y_i − ȳ)] / SampleVar(X)
    = [(1/N) ∑_i (x_i − x̄) y_i − (1/N) ∑_i (x_i − x̄) ȳ] / SampleVar(X)
    = [(1/N) ∑_i (x_i − x̄) y_i] / SampleVar(X)      (because ∑_i (x_i − x̄) = 0)
• But, as per the model, y_i = α_true + β_true x_i + η_i. Substituting y_i gives:
  β = [(1/N) ∑_i (x_i − x̄)(α_true + β_true x_i + η_i)] / SampleVar(X)
    = [(1/N) ∑_i (x_i − x̄)(β_true x_i + η_i)] / SampleVar(X)
    = [(1/N) β_true ∑_i (x_i − x̄)(x_i − x̄) + (1/N) β_true ∑_i (x_i − x̄) x̄ + (1/N) ∑_i (x_i − x̄) η_i] / SampleVar(X)
    = β_true + [∑_i (x_i − x̄) η_i] / (N·SampleVar(X))
• So, E[β] = β_true, because E[η_i] = 0. So, unbiased.
• Var(β) = [∑_i (x_i − x̄)² Var(η_i)] / (N·SampleVar(X))²
         = [N·SampleVar(X)·σ²] / (N·SampleVar(X))²
         = σ² / (N·SampleVar(X))
• So, consistent (using Chebyshev’s inequality), since Var(β) → 0 as N→∞
Linear Regression
• Interpretation of estimates
• Line passes through (x̄, ȳ)
  • If x := x̄, then y = α + β x̄ = ȳ − β x̄ + β x̄ = ȳ
• “Residuals” η̂_i sum to 0
  • ∑_i η̂_i = ∑_i (y_i − α − β x_i) = N ȳ − N(ȳ − β x̄) − β N x̄ = 0
• Slope β = SampleCov(X,Y) / SampleVar(X)
  • “Centering” the data:
    β is a weighted average of the “slope” (y_i − ȳ)/(x_i − x̄) for specific points
  • Larger weight for datum (x_i, y_i) if its x_i coordinate is farther from the center x̄
  • Weights are non-negative and sum to 1 (convex combination)
• Intercept α = ȳ − β x̄
  • From the center (x̄, ȳ), the line with estimated slope β intersects the ‘y’ axis at ȳ − β x̄
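• Illustration (an added sketch, not from the slides): verifying numerically that β equals the convex combination of pointwise slopes (y_i − ȳ)/(x_i − x̄) with weights proportional to (x_i − x̄)².

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 1.0 + 2.5 * x + rng.normal(0.0, 0.5, size=50)

xbar, ybar = x.mean(), y.mean()
beta = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)

# convex-combination view: weights w_i proportional to (x_i - xbar)^2;
# assumes no x_i equals xbar exactly (true almost surely for continuous data)
w = (x - xbar) ** 2 / np.sum((x - xbar) ** 2)
slopes = (y - ybar) / (x - xbar)   # pointwise "slopes" through the center
print(beta, np.sum(w * slopes))    # identical (up to floating point)
```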
Linear Regression
• Effect of outliers
A Poem on MLE
• https://www.math.utep.edu/faculty/lesser/MLE.html
On Preparation for Events (Exams) in Life
• From the Iron Man
• “I don’t really prepare for anything like an event.”
• “The goal is to be at a certain level of fitness.”
• “I should be able to run a full marathon whenever I want.”
• “That is the constant level of fitness that I aspire to.”
• “I keep my fitness level as a goal, not an event as a goal.”
• “There is no such thing as a good shortcut.”
• “If you want to be healthy,
and you want to be fit,
and you want to be happy,
you have to work hard.”
• https://youtu.be/x_96xVfdzu0?t=303