Bayesian Decision Theory
Selim Aksoy
Department of Computer Engineering
Bilkent University
[email protected]
CS 551, Fall 2016
© 2016, Selim Aksoy (Bilkent University)
Bayesian Decision Theory
I Bayesian Decision Theory is a fundamental statistical
approach that quantifies the tradeoffs between various
decisions using probabilities and costs that accompany
such decisions.
I First, we will assume that all probabilities are known.
I Then, we will study the cases where the probabilistic
structure is not completely known.
Fish Sorting Example Revisited
I State of nature is a random variable.
I Define w as the type of fish we observe (state of nature,
class) where
I w = w1 for sea bass,
I w = w2 for salmon.
I P (w1 ) is the a priori probability that the next fish is a sea
bass.
I P (w2 ) is the a priori probability that the next fish is a salmon.
Prior Probabilities
I Prior probabilities reflect our knowledge of how likely each
type of fish will appear before we actually see it.
I How can we choose P (w1 ) and P (w2 )?
I Set P (w1 ) = P (w2 ) if they are equiprobable (uniform priors).
I May use different values depending on the fishing area, time
of the year, etc.
I Assume there are no other types of fish, so that
P(w1) + P(w2) = 1
(exclusivity and exhaustivity).
Making a Decision
I How can we make a decision with only the prior
information?
Decide w1 if P(w1) > P(w2); otherwise decide w2.
I What is the probability of error for this decision?
P (error ) = min{P (w1 ), P (w2 )}
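I A minimal sketch of this prior-only rule (the prior values below are made up for illustration):

# Decision using priors alone: always pick the more probable class.
priors = {"sea bass": 0.6, "salmon": 0.4}    # assumed values

decision = max(priors, key=priors.get)       # decide w1 if P(w1) > P(w2)
p_error = min(priors.values())               # P(error) = min{P(w1), P(w2)}
print(decision, p_error)                     # sea bass 0.4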
Class-Conditional Probabilities
I Let’s try to improve the decision using the lightness
measurement x.
I Let x be a continuous random variable.
I Define p(x|wj) as the class-conditional probability density
function (the density of x given that the state of nature is wj),
for j = 1, 2.
I p(x|w1 ) and p(x|w2 ) describe the difference in lightness
between populations of sea bass and salmon.
Class-Conditional Probabilities
Figure 1: Hypothetical class-conditional probability density functions for two
classes.
Posterior Probabilities
I Suppose we know P (wj ) and p(x|wj ) for j = 1, 2, and
measure the lightness of a fish as the value x.
I Define P (wj |x) as the a posteriori probability (probability of
the state of nature being wj given the measurement of
feature value x).
I We can use the Bayes formula to convert the prior
probability to the posterior probability
P(wj|x) = p(x|wj) P(wj) / p(x)
where p(x) = Σ_{j=1}^{2} p(x|wj) P(wj).
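I A minimal NumPy sketch of the Bayes formula above, assuming (purely for illustration) Gaussian class-conditional densities for lightness with made-up means, spreads, and priors:

import numpy as np

def gaussian(x, mu, sigma):
    # Univariate normal density used as an assumed class-conditional p(x|wj).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2/3, 1/3])                    # P(w1), P(w2) (assumed)
x = 11.0                                         # measured lightness (assumed)
likelihoods = np.array([gaussian(x, 10.0, 2.0),  # p(x|w1), assumed parameters
                        gaussian(x, 14.0, 2.0)]) # p(x|w2), assumed parameters

evidence = np.sum(likelihoods * priors)          # p(x) = sum_j p(x|wj) P(wj)
posteriors = likelihoods * priors / evidence     # Bayes formula P(wj|x)
print(posteriors, posteriors.sum())              # posteriors sum to 1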
Making a Decision
I p(x|wj ) is called the likelihood and p(x) is called the
evidence.
I How can we make a decision after observing the value of x?
Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2.
I Rewriting the rule gives: decide w1 if
p(x|w1) / p(x|w2) > P(w2) / P(w1),
and decide w2 otherwise.
I Note that, at every x, P (w1 |x) + P (w2 |x) = 1.
Probability of Error
I What is the probability of error for this decision?
P(error|x) = P(w1|x) if we decide w2, and P(w2|x) if we decide w1.
I What is the average probability of error?
P(error) = ∫_{−∞}^{∞} p(error, x) dx = ∫_{−∞}^{∞} P(error|x) p(x) dx
I Bayes decision rule minimizes this error because
P (error |x) = min{P (w1 |x), P (w2 |x)}.
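I A sketch of this integral for two assumed Gaussian class-conditionals with equal priors, evaluated with a simple Riemann sum (all parameters are made up):

import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([0.5, 0.5])                 # assumed equal priors
xs = np.linspace(-10.0, 30.0, 20001)          # grid covering most of the mass
dx = xs[1] - xs[0]

joint = np.vstack([gaussian(xs, 10.0, 2.0) * priors[0],   # p(x|w1) P(w1)
                   gaussian(xs, 14.0, 2.0) * priors[1]])  # p(x|w2) P(w2)

# P(error|x) p(x) = min_j p(x|wj) P(wj); integrate over x.
bayes_error = np.sum(joint.min(axis=0)) * dx
print(bayes_error)                            # about 0.16 for these made-up parameters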
Bayesian Decision Theory
I How can we generalize to
I more than one feature?
I replace the scalar x by the feature vector x
I more than two states of nature?
I just a difference in notation
I allowing actions other than just decisions?
I allow the possibility of rejection
I different risks in the decision?
I define how costly each action is
Bayesian Decision Theory
I Let {w1 , . . . , wc } be the finite set of c states of nature
(classes, categories).
I Let {α1 , . . . , αa } be the finite set of a possible actions.
I Let λ(αi |wj ) be the loss incurred for taking action αi when
the state of nature is wj .
I Let x be the d-component vector-valued random variable
called the feature vector.
Bayesian Decision Theory
I p(x|wj ) is the class-conditional probability density function.
I P (wj ) is the prior probability that nature is in state wj .
I The posterior probability can be computed as
P(wj|x) = p(x|wj) P(wj) / p(x)
where p(x) = Σ_{j=1}^{c} p(x|wj) P(wj).
Conditional Risk
I Suppose we observe x and take action αi .
I If the true state of nature is wj , we incur the loss λ(αi |wj ).
I The expected loss of taking action αi is
R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x),
which is also called the conditional risk.
Minimum-Risk Classification
I The general decision rule α(x) tells us which action to take
for observation x.
I We want to find the decision rule that minimizes the overall
risk
R = ∫ R(α(x)|x) p(x) dx.
I Bayes decision rule minimizes the overall risk by selecting
the action αi for which R(αi |x) is minimum.
I The resulting minimum overall risk is called the Bayes risk
and is the best performance that can be achieved.
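I A minimal sketch of selecting the minimum-risk action from a loss matrix and posteriors; the numbers are made up, and the third row illustrates an assumed rejection action with a small constant cost:

import numpy as np

# loss[i, j] = lambda(alpha_i | w_j): cost of action alpha_i when the class is w_j.
loss = np.array([[0.0, 2.0],    # alpha_1: decide w1
                 [1.0, 0.0],    # alpha_2: decide w2
                 [0.3, 0.3]])   # alpha_3: reject (assumed constant cost)

posteriors = np.array([0.55, 0.45])    # P(w1|x), P(w2|x) for some observed x (assumed)

cond_risk = loss @ posteriors          # R(alpha_i|x) = sum_j lambda(alpha_i|w_j) P(w_j|x)
best_action = np.argmin(cond_risk)     # Bayes rule: take the action with minimum conditional risk
print(cond_risk, best_action)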
Two-Category Classification
I Define
I α1 : deciding w1 ,
I α2 : deciding w2 ,
I λij = λ(αi |wj ).
I Conditional risks can be written as
R(α1 |x) = λ11 P (w1 |x) + λ12 P (w2 |x),
R(α2 |x) = λ21 P (w1 |x) + λ22 P (w2 |x).
Two-Category Classification
I The minimum-risk decision rule becomes: decide w1 if
(λ21 − λ11) P(w1|x) > (λ12 − λ22) P(w2|x),
and decide w2 otherwise.
I Assuming λ21 > λ11 (an error costs more than a correct decision),
this corresponds to deciding w1 if
p(x|w1) / p(x|w2) > [(λ12 − λ22) P(w2)] / [(λ21 − λ11) P(w1)]
⇒ comparing the likelihood ratio to a threshold that is
independent of the observation x.
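I A sketch of this likelihood-ratio test with the loss-adjusted threshold, using assumed losses, priors, and Gaussian likelihood parameters:

import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0    # lambda_ij (assumed; errors cost more than correct decisions)
p1, p2 = 2/3, 1/3                          # priors (assumed)

# Threshold on the likelihood ratio, independent of x.
theta = (l12 - l22) * p2 / ((l21 - l11) * p1)

x = 11.5                                   # observed feature value (assumed)
ratio = gaussian(x, 10.0, 2.0) / gaussian(x, 14.0, 2.0)   # p(x|w1) / p(x|w2)
decision = "w1" if ratio > theta else "w2"
print(ratio, theta, decision)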
Minimum-Error-Rate Classification
I Actions are decisions on classes (αi is deciding wi ).
I If action αi is taken and the true state of nature is wj , then
the decision is correct if i = j and in error if i ≠ j.
I We want to find a decision rule that minimizes the
probability of error.
Minimum-Error-Rate Classification
I Define the zero-one loss function
λ(αi|wj) = 0 if i = j, and 1 if i ≠ j,   for i, j = 1, . . . , c
(all errors are equally costly).
I Conditional risk becomes
R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x)
        = Σ_{j≠i} P(wj|x)
        = 1 − P(wi|x).
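I A small sketch showing that, under the zero-one loss, picking the minimum-risk action coincides with picking the maximum-posterior class (the posterior values are made up):

import numpy as np

c = 3
posteriors = np.array([0.2, 0.5, 0.3])             # P(w_j|x), assumed values
zero_one_loss = 1.0 - np.eye(c)                    # lambda(alpha_i|w_j) = 0 if i == j else 1

cond_risk = zero_one_loss @ posteriors             # equals 1 - P(w_i|x) for each i
assert np.allclose(cond_risk, 1.0 - posteriors)
print(np.argmin(cond_risk) == np.argmax(posteriors))   # True: the same decision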
Minimum-Error-Rate Classification
I Minimizing the risk requires maximizing P (wi |x) and results
in the minimum-error decision rule
Decide wi if P(wi|x) > P(wj|x) for all j ≠ i.
I The resulting error is called the Bayes error and is the best
performance that can be achieved.
Minimum-Error-Rate Classification
Figure 2: The likelihood ratio p(x|w1 )/p(x|w2 ). The threshold θa is computed
using the priors P (w1 ) = 2/3 and P (w2 ) = 1/3, and a zero-one loss function.
If we penalize mistakes in classifying w2 patterns as w1 more than the
converse, we should increase the threshold to θb .
Discriminant Functions
I A useful way of representing classifiers is through
discriminant functions gi (x), i = 1, . . . , c, where the classifier
assigns a feature vector x to class wi if
gi(x) > gj(x) for all j ≠ i.
I For the classifier that minimizes conditional risk
gi (x) = −R(αi |x).
I For the classifier that minimizes error
gi (x) = P (wi |x).
Discriminant Functions
I These functions divide the feature space into c decision
regions (R1 , . . . , Rc ), separated by decision boundaries.
I Note that the results do not change even if we replace every
gi (x) by f (gi (x)) where f (·) is a monotonically increasing
function (e.g., logarithm).
I This may lead to significant analytical and computational
simplifications.
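I A sketch of this invariance: using the posterior directly, or its logarithm (a monotonically increasing transform with the constant evidence term dropped), gives the same decision; the likelihoods and priors below are assumed values:

import numpy as np

likelihoods = np.array([0.05, 0.20, 0.10])   # p(x|w_i) at some x (assumed)
priors = np.array([0.5, 0.2, 0.3])           # P(w_i) (assumed)

g_posterior = likelihoods * priors / np.sum(likelihoods * priors)  # g_i(x) = P(w_i|x)
g_log = np.log(likelihoods) + np.log(priors)                       # f(g_i(x)) with f = log, evidence dropped

print(np.argmax(g_posterior) == np.argmax(g_log))   # True: the decision is unchanged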
The Gaussian Density
I The Gaussian can be viewed as a model in which the feature
vectors for a given class are continuous-valued, randomly
corrupted versions of a single typical or prototype vector.
I Some properties of the Gaussian:
I Analytically tractable.
I Completely specified by the 1st and 2nd moments.
I Has the maximum entropy of all distributions with a given
mean and variance.
I Many processes are asymptotically Gaussian (Central Limit
Theorem).
I Linear transformations of a Gaussian are also Gaussian.
I Uncorrelatedness implies independence.
Univariate Gaussian
I For x ∈ R:
p(x) = N(µ, σ²)
     = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ) / σ)² ]
where
µ = E[x] = ∫_{−∞}^{∞} x p(x) dx,
σ² = E[(x − µ)²] = ∫_{−∞}^{∞} (x − µ)² p(x) dx.
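I A quick numerical check of this density (a Riemann-sum sketch with assumed µ and σ): it integrates to 1, and roughly 95% of the mass lies within 2σ of the mean, as illustrated in Figure 3.

import numpy as np

mu, sigma = 10.0, 2.0                       # assumed parameters
xs = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = xs[1] - xs[0]
p = np.exp(-0.5 * ((xs - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

print(np.sum(p) * dx)                       # ~1.0 (total probability)
mask = np.abs(xs - mu) <= 2 * sigma
print(np.sum(p[mask]) * dx)                 # ~0.954 (mass within 2 sigma)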
Univariate Gaussian
Figure 3: A univariate Gaussian distribution has roughly 95% of its area in
the range |x − µ| ≤ 2σ.
Multivariate Gaussian
I For x ∈ R^d:
p(x) = N(µ, Σ)
     = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ −(1/2) (x − µ)^T Σ^{-1} (x − µ) ]
where
µ = E[x] = ∫ x p(x) dx,
Σ = E[(x − µ)(x − µ)^T] = ∫ (x − µ)(x − µ)^T p(x) dx.
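I A minimal sketch of evaluating this density in d = 2 with an assumed mean and covariance:

import numpy as np

def multivariate_gaussian(x, mu, cov):
    # p(x) = exp(-0.5 (x-mu)^T cov^{-1} (x-mu)) / ((2 pi)^{d/2} |cov|^{1/2})
    d = mu.shape[0]
    diff = x - mu
    maha2 = diff @ np.linalg.inv(cov) @ diff        # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * maha2) / norm

mu = np.array([0.0, 1.0])                           # assumed mean
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])                        # assumed covariance
print(multivariate_gaussian(np.array([0.5, 0.5]), mu, cov))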
Multivariate Gaussian
Figure 4: Samples drawn from a two-dimensional Gaussian lie in a cloud
centered on the mean µ. The loci of points of constant density are the
ellipses for which (x − µ)T Σ−1 (x − µ) is constant, where the eigenvectors of
Σ determine the direction and the corresponding eigenvalues determine the
length of the principal axes. The quantity r2 = (x − µ)T Σ−1 (x − µ) is called
the squared Mahalanobis distance from x to µ.
Linear Transformations
I Recall that, given x ∈ R^d, A ∈ R^{d×k}, and y = A^T x ∈ R^k:
if x ∼ N(µ, Σ), then y ∼ N(A^T µ, A^T Σ A).
I As a special case, the whitening transform
Aw = Φ Λ^{-1/2}
where
I Φ is the matrix whose columns are the orthonormal
eigenvectors of Σ,
I Λ is the diagonal matrix of the corresponding eigenvalues,
gives a covariance matrix equal to the identity matrix I.
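I A sketch of the whitening transform for an assumed covariance matrix, checking that the transformed covariance is the identity:

import numpy as np

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])                      # assumed covariance Sigma

eigvals, eigvecs = np.linalg.eigh(cov)            # Lambda (diagonal entries) and Phi (columns)
A_w = eigvecs @ np.diag(eigvals ** -0.5)          # A_w = Phi Lambda^{-1/2}

print(A_w.T @ cov @ A_w)                          # ~ identity matrix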
Discriminant Functions for the Gaussian
Density
I Discriminant functions for minimum-error-rate classification
can be written as
gi (x) = ln p(x|wi ) + ln P (wi ).
I For p(x|wi ) = N (µi , Σi )
gi(x) = −(1/2) (x − µi)^T Σi^{-1} (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(wi).
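I A sketch of this discriminant function for one class, with assumed parameters:

import numpy as np

def gaussian_discriminant(x, mu, cov, prior):
    # g_i(x) = -0.5 (x-mu)^T cov^{-1} (x-mu) - (d/2) ln 2pi - 0.5 ln|cov| + ln P(w_i)
    d = mu.shape[0]
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

mu = np.array([1.0, 2.0])                              # assumed class mean
cov = np.array([[1.5, 0.3], [0.3, 0.8]])               # assumed class covariance
print(gaussian_discriminant(np.array([1.2, 1.7]), mu, cov, prior=0.4))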
Case 1: Σi = σ²I
I Discriminant functions are
gi(x) = wi^T x + wi0   (a linear discriminant)
where
wi = (1/σ²) µi,
wi0 = −(1/(2σ²)) µi^T µi + ln P(wi)
(wi0 is the threshold or bias for the i’th category).
Case 1: Σi = σ²I
I Decision boundaries are the hyperplanes gi (x) = gj (x), and
can be written as
w^T (x − x0) = 0
where
w = µi − µj,
x0 = (1/2)(µi + µj) − (σ² / ‖µi − µj‖²) ln(P(wi)/P(wj)) (µi − µj).
I Hyperplane separating Ri and Rj passes through the point
x0 and is orthogonal to the vector w.
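I A sketch of this boundary for two classes with assumed means, an assumed shared σ², and assumed priors; the sign of w^T (x − x0) decides the class:

import numpy as np

mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 2.0])   # assumed means
sigma2 = 1.5                                               # assumed shared variance
P_i, P_j = 0.7, 0.3                                        # assumed priors

w = mu_i - mu_j
x0 = (0.5 * (mu_i + mu_j)
      - sigma2 / np.sum(w ** 2) * np.log(P_i / P_j) * w)   # point the hyperplane passes through

x = np.array([1.0, 3.0])                                   # test point (assumed)
print("decide w_i" if w @ (x - x0) > 0 else "decide w_j")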
Case 1: Σi = σ²I
Figure 5: If the covariance matrices of two distributions are equal and
proportional to the identity matrix, then the distributions are spherical in d
dimensions, and the boundary is a generalized hyperplane of d − 1
dimensions, perpendicular to the line separating the means. The decision
boundary shifts as the priors are changed.
Case 1: Σi = σ²I
I A special case, when the priors P(wi) are equal for i = 1, . . . , c,
is the minimum-distance classifier that uses the decision rule:
assign x to wi* where i* = arg min_{i=1,...,c} ‖x − µi‖.
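I A minimal sketch of the minimum-distance classifier with assumed class means:

import numpy as np

means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [1.0, 4.0]])                  # assumed mu_i for c = 3 classes

def classify(x):
    # assign x to w_{i*} where i* = argmin_i ||x - mu_i||
    return np.argmin(np.linalg.norm(means - x, axis=1))

print(classify(np.array([2.5, 0.5])))           # index of the nearest mean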
Case 2: Σi = Σ
I Discriminant functions are
gi(x) = wi^T x + wi0   (a linear discriminant)
where
wi = Σ^{-1} µi,
wi0 = −(1/2) µi^T Σ^{-1} µi + ln P(wi).
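I A sketch of these linear discriminants for an assumed shared covariance matrix, assumed means, and assumed priors:

import numpy as np

cov = np.array([[1.0, 0.4],
                [0.4, 2.0]])                          # assumed shared Sigma
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]  # assumed mu_i
priors = [0.6, 0.4]                                   # assumed P(w_i)
cov_inv = np.linalg.inv(cov)

def g(x, mu, prior):
    # g_i(x) = w_i^T x + w_i0 with w_i = Sigma^{-1} mu_i and
    # w_i0 = -0.5 mu_i^T Sigma^{-1} mu_i + ln P(w_i)
    w = cov_inv @ mu
    w0 = -0.5 * mu @ cov_inv @ mu + np.log(prior)
    return w @ x + w0

x = np.array([1.0, 1.0])                              # test point (assumed)
scores = [g(x, m, p) for m, p in zip(means, priors)]
print(np.argmax(scores))                              # decided class index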
Case 2: Σi = Σ
I Decision boundaries can be written as
w^T (x − x0) = 0
where
w = Σ^{-1} (µi − µj),
x0 = (1/2)(µi + µj) − [ln(P(wi)/P(wj)) / ((µi − µj)^T Σ^{-1} (µi − µj))] (µi − µj).
I Hyperplane passes through x0 but is not necessarily
orthogonal to the line between the means.
Case 2: Σi = Σ
Figure 6: Gaussian distributions with equal but non-spherical covariance
matrices. The decision hyperplanes are not necessarily perpendicular to
the line connecting the means.
Case 3: Σi = arbitrary
I Discriminant functions are
gi(x) = x^T Wi x + wi^T x + wi0   (a quadratic discriminant)
where
Wi = −(1/2) Σi^{-1},
wi = Σi^{-1} µi,
wi0 = −(1/2) µi^T Σi^{-1} µi − (1/2) ln |Σi| + ln P(wi).
I Decision boundaries are hyperquadrics.
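I A sketch of the quadratic discriminant with arbitrary, assumed per-class covariances:

import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    # g_i(x) = x^T W_i x + w_i^T x + w_i0 with the coefficients defined above
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mu
    w0 = -0.5 * mu @ cov_inv @ mu - 0.5 * np.log(np.linalg.det(cov)) + np.log(prior)
    return x @ W @ x + w @ x + w0

params = [  # (mean, covariance, prior) per class; all values assumed
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.5]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.7], [0.7, 1.5]]), 0.5),
]
x = np.array([1.0, 1.5])
print(np.argmax([quadratic_discriminant(x, m, c, p) for m, c, p in params]))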
Case 3: Σi = arbitrary
Figure 7: Arbitrary Gaussian distributions lead to Bayes decision boundaries
that are general hyperquadrics.
Case 3: Σi = arbitrary
Figure 8: Arbitrary Gaussian distributions lead to Bayes decision boundaries
that are general hyperquadrics.
Error Probabilities and Integrals
I For the two-category case
P(error) = P(x ∈ R2, w1) + P(x ∈ R1, w2)
         = P(x ∈ R2|w1) P(w1) + P(x ∈ R1|w2) P(w2)
         = ∫_{R2} p(x|w1) P(w1) dx + ∫_{R1} p(x|w2) P(w2) dx.
Error Probabilities and Integrals
I For the multicategory case
P(error) = 1 − P(correct)
         = 1 − Σ_{i=1}^{c} P(x ∈ Ri, wi)
         = 1 − Σ_{i=1}^{c} P(x ∈ Ri|wi) P(wi)
         = 1 − Σ_{i=1}^{c} ∫_{Ri} p(x|wi) P(wi) dx.
Error Probabilities and Integrals
Figure 9: Components of the probability of error for equal priors and the
non-optimal decision point x∗ . The optimal point xB minimizes the total
shaded area and gives the Bayes error rate.
Receiver Operating Characteristics
I Consider the two-category case and define
I w1 : target is present,
I w2 : target is not present.
Table 1: Confusion matrix.

                          Assigned
                 w1                   w2
True   w1   correct detection    mis-detection
       w2   false alarm          correct rejection
I Mis-detection is also called false negative or Type II error.
I False alarm is also called false positive or Type I error.
Receiver Operating Characteristics
I If we use a parameter (e.g., a threshold) in our decision, the plot
of the correct detection rate versus the false alarm rate for
different values of the parameter is called the receiver operating
characteristic (ROC) curve.
Figure 10: Example receiver
operating characteristic (ROC) curves
for different settings of the system.
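I A sketch of sweeping a decision threshold to trace an ROC curve, assuming Gaussian scores for the target-present (w1) and target-absent (w2) cases; the score distributions are made up:

import numpy as np

rng = np.random.default_rng(0)
scores_w1 = rng.normal(2.0, 1.0, 1000)     # scores when the target is present (assumed model)
scores_w2 = rng.normal(0.0, 1.0, 1000)     # scores when the target is absent (assumed model)

thresholds = np.linspace(-4.0, 6.0, 101)
tpr = [(scores_w1 > t).mean() for t in thresholds]   # correct detection rate
fpr = [(scores_w2 > t).mean() for t in thresholds]   # false alarm rate

for t, d, f in zip(thresholds[::20], tpr[::20], fpr[::20]):
    print(f"threshold={t:+.1f}  detection={d:.2f}  false alarm={f:.2f}")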
Summary
I To minimize the overall risk, choose the action that
minimizes the conditional risk R(α|x).
I To minimize the probability of error, choose the class that
maximizes the posterior probability P (wj |x).
I If there are different penalties for misclassifying patterns
from different classes, the posteriors must be weighted
according to such penalties before taking action.
I Do not forget that these decisions are the optimal ones
under the assumption that the “true” values of the
probabilities are known.