Slide set Expectation
∙ Definition and properties
∙ Correlation and covariance
∙ Linear MSE estimation
∙ Summary
© Copyright Abbas El Gamal
Definition of expectation
∙ We already introduced the notion of expectation (mean) of a r.v.
∙ We generalize this definition and discuss it in more depth
∙ Let X ∈ X be a discrete r.v. with pmf pX (x) and g(x) be a function of x
The expectation or expected value of g(X) is defined as
  E(g(X)) = ∑_{x∈X} g(x) pX (x)
∙ For a continuous r.v. X ∼ fX (x), the expected value of g(X) is defined as
  E(g(X)) = ∫_{−∞}^{∞} g(x) fX (x) dx
∙ Examples:
g(X) = c, a constant, then E(g(X)) = c
g(X) = X, E(X) is the mean of X
g(X) = Xᵏ, k = 1, 2, . . ., E(Xᵏ) is the kth moment of X
g(X) = (X − E(X))², E[(X − E(X))²] is the variance of X
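∙ As a quick numerical illustration of the discrete definition, the sketch below evaluates E(g(X)) directly; the fair-die pmf and the choices of g are made-up examples:

    import numpy as np

    # Assumed example: X is a fair six-sided die, X ∈ {1, ..., 6}
    x  = np.arange(1, 7)
    px = np.full(6, 1/6)

    def expect(g, x, px):
        # E(g(X)) = sum over x of g(x) pX(x)
        return np.sum(g(x) * px)

    print(expect(lambda v: v, x, px))               # mean E(X) = 3.5
    print(expect(lambda v: v**2, x, px))            # second moment E(X²) ≈ 15.17
    print(expect(lambda v: (v - 3.5)**2, x, px))    # variance ≈ 2.92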
Fundamental theorem of expectation
∙ Let X ∼ pX (x) and Y = g(X) ∼ pY (y), then
EY(Y) = ∑_{y∈Y} y pY (y) = ∑_{x∈X} g(x) pX (x) = EX(g(X))
Similarly, for continuous r.v.s X ∼ fX (x) and Y = g(X) ∼ fY (y), EY (Y) = EX (g(X))
∙ Hence, E(g(X)) can be found using either fX (x) or fY (y)
It is often easier to use fX (x) than to first find fY (y) then use it to find E(Y)
∙ Proof: We prove the theorem for discrete r.v.s
EY(Y) = ∑_y y pY (y)
      = ∑_y y ∑_{x: g(x)=y} pX (x)
      = ∑_y ∑_{x: g(x)=y} y pX (x)
      = ∑_y ∑_{x: g(x)=y} g(x) pX (x) = ∑_x g(x) pX (x) = EX(g(X))
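∙ A small numerical check of the theorem, using a made-up pmf and g(x) = x²: computing E(g(X)) from pX gives the same value as first forming pY and then computing EY(Y):

    import numpy as np

    # Assumed pmf on {−2, −1, 0, 1, 2}; g(x) = x² maps several x values to the same y
    x  = np.array([-2, -1, 0, 1, 2])
    px = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
    g  = lambda v: v**2

    lhs = np.sum(g(x) * px)          # E(g(X)) computed directly from pX

    ys  = np.unique(g(x))            # support of Y = g(X)
    py  = np.array([px[g(x) == y].sum() for y in ys])   # pY(y) = sum of pX(x) over {x: g(x) = y}
    rhs = np.sum(ys * py)            # EY(Y) computed from pY

    print(lhs, rhs)                  # both 1.2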
Expectation is linear
∙ For any constants a and b
E[a g₁(X) + b g₂(X)] = a E(g₁(X)) + b E(g₂(X))
This follows from the definition of expectation as sum / integral
∙ Example: Let X be a r.v. with known mean and variance.
Find the mean and variance of aX + b
By linearity of expectation, the mean: E(aX + b) = a E(X) + b
The variance:
  Var(aX + b) = E[((aX + b) − E(aX + b))²]
              = E[(aX + b − a E(X) − b)²]
              = E[a²(X − E(X))²]
              = a² E[(X − E(X))²] = a² Var(X)
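∙ A quick Monte Carlo sanity check of E(aX + b) = a E(X) + b and Var(aX + b) = a² Var(X); the exponential distribution and the constants are arbitrary assumed choices:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=1_000_000)   # any distribution works here
    a, b = 3.0, -1.0

    print(np.mean(a*x + b), a*np.mean(x) + b)        # E(aX + b) vs a E(X) + b
    print(np.var(a*x + b), a**2 * np.var(x))         # Var(aX + b) vs a² Var(X)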
Recap
∙ Let X ∼ fX (x), Y = g(X) be a function of X; the expectation of g(X) is defined as
E[g(X)] = ∫_{−∞}^{∞} g(x) fX (x) dx
        = ∫_{−∞}^{∞} y fY (y) dy    (fundamental theorem of expectation)
∙ Expectation is linear, i.e.,
E[a g₁(X) + b g₂(X)] = a E[g₁(X)] + b E[g₂(X)]
Correlation and covariance
∙ We can define expectation for a function of two r.v.s: Let (X, Y) ∼ fX,Y (x, y)
  and g(x, y) be a function of (x, y). The expectation of g(X, Y) is defined as
  E(g(X, Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fX,Y (x, y) dx dy
  The function g(X, Y) may be X, Y, X², X + Y, max{X, Y}, . . .
∙ The correlation of X and Y is defined as E(XY)
∙ The covariance of X and Y is defined as
Cov(X, Y) = E [(X − E(X))(Y − E(Y))]
= E [XY − X E(Y) − Y E(X) + E(X) E(Y)]
= E(XY) − E[X E(Y)] − E[Y E(X)] + E(X) E(Y) linearity of expectation
= E(XY) − E(Y) E(X) − E(X) E(Y) + E(X) E(Y) linearity of expectation
= E(XY) − E(X) E(Y)
If X = Y, then Cov(X, Y) = Var(X)
∙ X and Y are said to be uncorrelated if Cov(X, Y) = 0, i.e., if E(XY) = E(X) E(Y)
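∙ The two expressions for the covariance can be compared numerically; the jointly distributed pair below (Y equal to X plus independent noise) is an assumed example:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=500_000)
    y = x + rng.normal(scale=0.5, size=500_000)          # assumed correlated pair

    cov_def = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X − E(X))(Y − E(Y))]
    cov_alt = np.mean(x * y) - x.mean() * y.mean()       # E(XY) − E(X) E(Y)
    print(cov_def, cov_alt, np.cov(x, y, bias=True)[0, 1])   # all ≈ Var(X) = 1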
Example
∙ Let
  fX,Y (x, y) = 2 for x, y ≥ 0, x + y ≤ 1, and fX,Y (x, y) = 0 otherwise,
  i.e., (X, Y) is uniform over the triangle with vertices (0, 0), (1, 0), (0, 1)
  Find Var(X), Var(Y), Cov(X, Y)
∙ The means of X and Y are
  E(X) = E(Y) = ∫_0^1 ∫_0^{1−y} 2x dx dy = ∫_0^1 (1 − y)² dy = 1/3
  The second moment of X is
  E(X²) = ∫_0^1 ∫_0^{1−y} 2x² dx dy = ∫_0^1 (2/3)(1 − y)³ dy = 1/6 = E(Y²)
  Hence the variance is Var(X) = E(X²) − [E(X)]² = 1/6 − 1/9 = 1/18 = Var(Y)
  The correlation of X and Y is
  E(XY) = ∫_0^1 ∫_0^{1−y} 2xy dx dy = ∫_0^1 y ∫_0^{1−y} 2x dx dy = ∫_0^1 y(1 − y)² dy = 1/12
  Hence, Cov(X, Y) = E(XY) − E(X) E(Y) = 1/12 − 1/9 = −1/36
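∙ These numbers can be verified by Monte Carlo; the sketch below samples (X, Y) uniformly on the triangle by folding the unit square, a standard trick assumed here for convenience:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    u, v = rng.uniform(size=n), rng.uniform(size=n)
    flip = u + v > 1                          # fold the unit square onto the triangle x + y ≤ 1
    x = np.where(flip, 1 - u, u)
    y = np.where(flip, 1 - v, v)

    print(x.mean())                           # ≈ 1/3
    print(x.var())                            # ≈ 1/18 ≈ 0.0556
    print(np.mean(x*y) - x.mean()*y.mean())   # ≈ −1/36 ≈ −0.0278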
The correlation coefficient
∙ The correlation coefficient of X and Y is defined as
ρX,Y = Cov(X, Y)/√(Var(X) Var(Y)) = Cov(X, Y)/(σX σY)
For the previous example: ρX,Y = (−1/36)/(1/18) = −1/2
∙ If (X, Y) are uncorrelated, i.e., Cov(X, Y) = 0, then ρX,Y = 0
∙ −1 ≤ ρX,Y ≤ 1. To show this, consider
  E[((X − E(X))/σX ± (Y − E(Y))/σY)²] ≥ 0; expanding and using linearity of expectation,
  E[(X − E(X))²]/Var(X) + E[(Y − E(Y))²]/Var(Y) ± 2 E[(X − E(X))(Y − E(Y))]/(σX σY) ≥ 0
  1 + 1 ± 2ρX,Y ≥ 0  ⇒  −2 ≤ 2ρX,Y ≤ 2  ⇒  −1 ≤ ρX,Y ≤ 1
∙ From the above, ρX,Y = ±1 iff (X − E(X))/σX = ±(Y − E(Y))/σY, i.e.,
  iff (X − E(X)) is a linear function of (Y − E(Y))
∙ In general, ρX,Y tells us how well X can be estimated by a linear function of Y
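∙ The correlation coefficient and its bound can be checked numerically on the triangle example, reusing the same Monte Carlo sampler as before:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000
    u, v = rng.uniform(size=n), rng.uniform(size=n)
    flip = u + v > 1                          # same triangle sampler as above
    x, y = np.where(flip, 1 - u, u), np.where(flip, 1 - v, v)

    rho = np.corrcoef(x, y)[0, 1]
    print(rho)                                # ≈ −1/2 for the triangle example
    assert -1 <= rho <= 1                     # the bound shown above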
Visualizing correlation in data
[Scatter plots of data samples: uncorrelated (ρ = 0), positively correlated, negatively correlated, and completely correlated]
Independent versus uncorrelated
∙ Let X and Y be independent, then for any functions g(X) and h(Y),
E[g(X) h(Y)] = E[g(X)] E[h(Y)]
∙ Proof: Let’s assume that X ∼ fX (x) and Y ∼ fY (y), then
E[g(X)h(Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) fX,Y (x, y) dx dy
            = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) fX (x) fY (y) dx dy    (by independence)
            = (∫_{−∞}^{∞} g(x) fX (x) dx)(∫_{−∞}^{∞} h(y) fY (y) dy) = E[g(X)] E[h(Y)]
∙ In particular, if X, Y are independent, E(XY) = E(X) E(Y), i.e., Cov(X, Y) = 0
∙ Hence, independent ⇒ uncorrelated
∙ However, if X and Y are uncorrelated they are not necessarily independent
Example
∙ Let X, Y ∈ {−2, −1, 1, 2} such that
  pX,Y (1, 2) = 1/4,  pX,Y (−1, −2) = 1/4,
  pX,Y (−2, 1) = 1/4,  pX,Y (2, −1) = 1/4,
  pX,Y (x, y) = 0 otherwise,
  i.e., the pmf places probability 1/4 on each of the four points (1, 2), (−1, −2), (−2, 1), (2, −1)
  Are X and Y independent? Are they uncorrelated?
∙ Clearly X and Y are not independent, since if we know the outcome of one,
  we completely know the outcome of the other
  To check if they are uncorrelated, we find the covariance:
  E(X) = E(Y) = 1 ⋅ (1/4) + (−1) ⋅ (1/4) + (−2) ⋅ (1/4) + 2 ⋅ (1/4) = 0,
  E(XY) = 1 ⋅ 2 ⋅ (1/4) + (−1) ⋅ (−2) ⋅ (1/4) + 2 ⋅ (−1) ⋅ (1/4) + (−2) ⋅ 1 ⋅ (1/4) = 0
  Thus, Cov(X, Y) = 0 and X and Y are uncorrelated!
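∙ A direct check of this pmf confirms that Cov(X, Y) = 0 while the joint pmf does not factor into the marginals:

    import numpy as np

    # The four equally likely support points of (X, Y) from the example
    pts = np.array([(1, 2), (-1, -2), (-2, 1), (2, -1)])
    p   = np.full(4, 1/4)
    x, y = pts[:, 0], pts[:, 1]

    print(np.sum(x * p), np.sum(y * p), np.sum(x * y * p))   # E(X) = E(Y) = E(XY) = 0, so Cov = 0
    # Not independent: pX,Y(1, 2) = 1/4, but pX(1) pY(2) = (1/4)(1/4) = 1/16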
Signal estimation
∙ Consider the following signal estimation problem:
X → Sensor → Y → Estimator → X̂
∙ The signal X may be location, scene illumination, temperature, pressure, . . .
The sensor may be lidar, camera, temperature / pressure sensor, . . .
The sensor output Y is a noisy observation of X
∙ This setting may also represent prediction / forecasting, e.g.,
  Y is the solar output power in hour t and X is the solar output power in hour t + 1 (see HW)
∙ Upon observing Y, the estimator tries to find a good estimate X̂ of X
∙ There are different types of estimators that one can use depending on
The goodness / fidelity criterion
Knowledge about the statistics of (X, Y)
Linear MSE estimation
∙ Consider the following signal estimation problem
X → Sensor → Y → Estimator → X̂
∙ There are different types of estimators that one can use depending on
The goodness / fidelity criterion
Knowledge about the statistics of (X, Y)
∙ The most popular fidelity criterion is the mean squared error (MSE) between X̂ and X,
  MSE = E[(X − X̂)²], the smaller the better
∙ To find the optimal X̂, we need to know the distribution of (X, Y) (Slide set )
∙ We often have estimates only of the means, variances, and covariance of (X, Y)
∙ It turns out that with this information we can find the best linear MSE estimate, i.e.,
  the X̂ = aY + b that minimizes MSE = E[(X − X̂)²]
∙ We refer to this estimator as the linear MMSE estimate
Linear MSE estimation
∙ Theorem: The linear MMSE estimate of X given Y is
  X̂ = (Cov(X, Y)/Var(Y)) (Y − E(Y)) + E(X)
     = ρX,Y σX (Y − E(Y))/σY + E(X)
  and the minimum MSE is
  MMSE = Var(X) − Cov²(X, Y)/Var(Y) = (1 − ρ²X,Y) σX²
∙ Properties of the linear MMSE estimate:
  E(X̂) = E(X), i.e., the estimate is unbiased
  If ρX,Y = 0, i.e., X and Y are uncorrelated, then X̂ = E(X), i.e., ignore the observation!
  If ρX,Y = ±1, i.e., (Y − E(Y))/σY = ±(X − E(X))/σX, then the linear estimate is perfect
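∙ The theorem translates directly into a short routine; the moment values in the usage line are made up for illustration:

    def linear_mmse(y, mx, my, var_x, var_y, cov_xy):
        # Linear MMSE estimate of X from an observation y, given first and second moments
        a = cov_xy / var_y
        xhat = a * (y - my) + mx
        mmse = var_x - cov_xy**2 / var_y      # = (1 − ρ²X,Y) Var(X)
        return xhat, mmse

    # Assumed moments: E(X) = 1, E(Y) = 0, Var(X) = 4, Var(Y) = 2, Cov(X, Y) = 2
    print(linear_mmse(y=1.5, mx=1.0, my=0.0, var_x=4.0, var_y=2.0, cov_xy=2.0))   # (2.5, 2.0)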
Proof of theorem
∙ We want to find a, b that minimize E[(X − aY − b)²]
  We can take partial derivatives and set them to 0, but let's do it slightly differently
∙ First suppose we wish to estimate X by the best constant b, i.e., the b that minimizes
  MSE = E[(X − b)²]
∙ The answer is b = E(X) and the minimum MSE = Var(X), i.e., absent any
observations, the mean is the MMSE estimate of X and the variance is its MSE
∙ We can show this using calculus or in a nicer way as follows:
E[(X − b)²] = E[((X − E(X)) + (E(X) − b))²]
            = E[(X − E(X))² + (E(X) − b)² + 2(E(X) − b)(X − E(X))]
            = E[(X − E(X))²] + (E(X) − b)² + 2(E(X) − b) E(X − E(X))
            = E[(X − E(X))²] + (E(X) − b)² ≥ E[(X − E(X))²],
with equality iff b = E(X)
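∙ The claim that b = E(X) is the best constant estimate can also be seen numerically by scanning over b; the distribution below is an arbitrary assumed choice:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=3.0, size=200_000)    # assumed E(X) = 2, Var(X) = 9

    bs  = np.linspace(0.0, 4.0, 401)
    mse = [np.mean((x - b)**2) for b in bs]
    print(bs[np.argmin(mse)], x.mean())   # minimizing b ≈ E(X)
    print(np.min(mse), x.var())           # minimum MSE ≈ Var(X)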
Proof of theorem (continued)
∙ We want to find a, b that minimize E[(X − aY − b)²]
∙ Suppose a has been chosen; what b minimizes E[((X − aY) − b)²]?
∙ From the above result, we have b = E(X − aY) = E(X) − a E(Y)    (1)
∙ So, we want to choose a to minimize
  MSE = E[((X − aY) − E(X − aY))²], which is the same as
  E[((X − E(X)) − a(Y − E(Y)))²] = Var(X) + a² Var(Y) − 2a Cov(X, Y)
  This is a quadratic function of a and is minimized for
  a = Cov(X, Y)/Var(Y) = ρX,Y σX σY/σY² = ρX,Y σX/σY    (2)
∙ Substituting from (1) and (2), the linear MMSE estimate and the MMSE are
  X̂ = aY + b = (ρX,Y σX/σY) Y + E(X) − (ρX,Y σX/σY) E(Y),
  MMSE = Var(X) + a² Var(Y) − 2a Cov(X, Y)
       = σX² + ρ²X,Y σX² − 2 (ρX,Y σX/σY) ⋅ ρX,Y σX σY = (1 − ρ²X,Y) σX²
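∙ The minimizing slope a = Cov(X, Y)/Var(Y) can be confirmed by a brute-force scan over a; the correlated pair below is an assumed example:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)
    y = 0.8 * x + rng.normal(scale=0.6, size=100_000)    # assumed pair: Cov(X, Y) = 0.8, Var(Y) = 1.0

    a_grid = np.linspace(0.0, 2.0, 201)
    mse = [np.mean(((x - x.mean()) - a * (y - y.mean()))**2) for a in a_grid]
    print(a_grid[np.argmin(mse)])                        # ≈ 0.8
    print(np.cov(x, y, bias=True)[0, 1] / y.var())       # Cov(X, Y)/Var(Y), also ≈ 0.8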
The additive noise channel
∙ When you measure a signal, e.g., location, scene, temperature, pressure, . . . ,
the measuring device/circuit adds noise to the signal
∙ We model this system by an additive noise channel:
  X → ⊕ → Y = X + Z, where the noise Z is added at ⊕
  The input signal X has known mean μX and variance σX²,
  the additive noise Z has zero mean and known variance σZ², and
  the output (observation) is Y = X + Z, where X and Z are uncorrelated
∙ Find the linear MMSE estimate of the signal X given the output Y and its MSE
The additive noise channel
∙ The best linear MSE estimate is
  X̂ = (Cov(X, Y)/Var(Y)) (Y − E(Y)) + E(X)
∙ So we need to find E(Y), Var(Y), and Cov(X, Y) in terms of μX, σX², σZ²:
  E(Y) = E(X + Z) = E(X) + E(Z) = μX + 0 = μX
  Var(Y) = E[(Y − E(Y))²] = E[((X − μX) + Z)²] = σX² + σZ²
  Cov(X, Y) = E[(X − μX)(X + Z − μX)] = E[(X − μX)² + (X − μX)Z] = σX² + 0 = σX²
∙ Thus, the linear MMSE estimate is
  X̂ = (σX²/(σX² + σZ²)) (Y − μX) + μX = (σX²/(σX² + σZ²)) Y + (σZ²/(σX² + σZ²)) μX
  So if the signal-to-noise ratio SNR = σX²/σZ² is high, we put more weight on Y,
  and if it is low, we put more weight on μX
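∙ A small helper makes the SNR weighting explicit; the numbers in the usage lines are assumed for illustration:

    def additive_noise_lmmse(y, mu_x, var_x, var_z):
        # Linear MMSE estimate of X from Y = X + Z (Z zero mean, uncorrelated with X)
        w = var_x / (var_x + var_z)           # weight on the observation Y
        return w * y + (1 - w) * mu_x

    # High SNR: the estimate follows the observation; low SNR: it falls back to μX
    print(additive_noise_lmmse(y=3.0, mu_x=0.0, var_x=100.0, var_z=1.0))   # ≈ 2.97
    print(additive_noise_lmmse(y=3.0, mu_x=0.0, var_x=0.01, var_z=1.0))    # ≈ 0.03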
The additive noise channel
∙ From the theorem, the minimum MSE is
MMSE = Var(X) − Cov²(X, Y)/Var(Y)
∙ From the model, we have Var(X) = σX², Cov(X, Y) = σX², Var(Y) = σX² + σZ²
∙ Hence, the minimum MSE is
MMSE = σX² − σX⁴/(σX² + σZ²)
     = σX² σZ²/(σX² + σZ²) = σX²/(SNR + 1)
So the MMSE decreases from σX² to zero as the SNR increases from 0 to ∞
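∙ Tabulating σX² σZ²/(σX² + σZ²) = σX²/(SNR + 1) for a few assumed SNR values shows the MMSE shrinking as the SNR grows:

    var_x = 1.0
    for snr in [0.1, 1.0, 10.0, 100.0]:
        var_z = var_x / snr                       # SNR = σX²/σZ²
        mmse  = var_x * var_z / (var_x + var_z)   # = σX²/(SNR + 1)
        print(snr, mmse, var_x / (snr + 1))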
Summary
∙ Expectation is linear
∙ Covariance and correlation coefficient
∙ Independent ⇒ uncorrelated; reverse doesn’t hold in general
∙ Application: linear estimation
∙ The additive noise channel