Bayesian Networks
Philipp Koehn
6 April 2017
Outline 1
● Bayesian Networks
● Parameterized distributions
● Exact inference
● Approximate inference
bayesian networks
Bayesian Networks 3
● A simple, graphical notation for conditional independence assertions
and hence for compact specification of full joint distributions
● Syntax
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ “directly influences”)
– a conditional distribution for each node given its parents:
P(Xi ∣ Parents(Xi))
● In the simplest case, conditional distribution represented as
a conditional probability table (CPT) giving the
distribution over Xi for each combination of parent values
Example 4
● Topology of network encodes conditional independence assertions:
● Weather is independent of the other variables
● Toothache and Catch are conditionally independent given Cavity
Example 5
● I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary
doesn’t call. Sometimes it’s set off by minor earthquakes.
Is there a burglar?
● Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
● Network topology reflects “causal” knowledge
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example 6
[Figure: the burglary network topology with its conditional probability tables]
Compactness 7
● A conditional probability table for Boolean Xi with k Boolean parents has 2^k
rows for the combinations of parent values
● Each row requires one number p for Xi = true
(the number for Xi = false is just 1 − p)
● If each variable has no more than k parents,
the complete network requires O(n ⋅ 2^k) numbers
● I.e., grows linearly with n, vs. O(2^n) for the full joint distribution
● For burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
Global Semantics 8
● Global semantics defines the full joint distribution as the product of the local
conditional distributions:
P(x1, . . . , xn) = ∏_{i=1}^{n} P(xi ∣ parents(Xi))
● E.g., P (j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P (j∣a)P (m∣a)P (a∣¬b, ¬e)P (¬b)P (¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
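● A minimal Python sketch of this product (an illustration only; the CPT values are the standard textbook numbers for this network, consistent with the 0.9, 0.7, 0.001, 0.999, 0.998 used above, since the figure itself is not reproduced here):

# Joint probability of one full assignment via the global semantics:
# P(x1,...,xn) = product over i of P(xi | parents(Xi)).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = True | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = True | Alarm)

def p(prob_true, value):
    """Probability of a Boolean value, given P(var = True)."""
    return prob_true if value else 1.0 - prob_true

def joint(b, e, a, j, m):
    return (p(P_B, b) * p(P_E, e) * p(P_A[(b, e)], a)
            * p(P_J[a], j) * p(P_M[a], m))

print(joint(b=False, e=False, a=True, j=True, m=True))  # ≈ 0.00063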
Local Semantics 9
● Local semantics: each node is conditionally independent
of its nondescendants given its parents
● Theorem: Local semantics ⇔ global semantics
Markov Blanket 10
● Each node is conditionally independent of all others given its
Markov blanket: parents + children + children’s parents
Constructing Bayesian Networks 11
● Need a method such that a series of locally testable assertions of
conditional independence guarantees the required global semantics
1. Choose an ordering of variables X1, . . . , Xn
2. For i = 1 to n
add Xi to the network
select parents from X1, . . . , Xi−1 such that
P(Xi ∣ Parents(Xi)) = P(Xi ∣ X1, . . . , Xi−1)
● This choice of parents guarantees the global semantics:
P(X1, . . . , Xn) = ∏_{i=1}^{n} P(Xi ∣ X1, . . . , Xi−1) (chain rule)
= ∏_{i=1}^{n} P(Xi ∣ Parents(Xi)) (by construction)
Example 12–16
● Suppose we choose the ordering M, J, A, B, E
● P (J∣M ) = P (J)? No
● P (A∣J, M ) = P (A∣J)? P (A∣J, M ) = P (A)? No
● P (B∣A, J, M ) = P (B∣A)? Yes
● P (B∣A, J, M ) = P (B)? No
● P (E∣B, A, J, M ) = P (E∣A)? No
● P (E∣B, A, J, M ) = P (E∣A, B)? Yes
Example 17
● Deciding conditional independence is hard in noncausal directions
● (Causal models and conditional independence seem hardwired for humans!)
● Assessing conditional probabilities is hard in noncausal directions
● Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed
Example: Car Diagnosis 18
● Initial evidence: car won’t start
● Testable variables (green), “broken, so fix it” variables (orange)
● Hidden variables (gray) ensure sparse structure, reduce parameters
Example: Car Insurance 19
Compact Conditional Distributions 20
● CPT grows exponentially with number of parents
CPT becomes infinite with continuous-valued parent or child
● Solution: canonical distributions that are defined compactly
● Deterministic nodes are the simplest case:
X = f (P arents(X)) for some function f
● E.g., Boolean functions
NorthAmerican ⇔ Canadian ∨ US ∨ Mexican
● E.g., numerical relationships among continuous variables
∂Level/∂t = inflow + precipitation − outflow − evaporation
Compact Conditional Distributions 21
● Noisy-OR distributions model multiple noninteracting causes
– parents U1 . . . Uk include all causes (can add leak node)
– independent failure probability qi for each cause alone
⟹ P(X ∣ U1 . . . Uj, ¬Uj+1 . . . ¬Uk) = 1 − ∏_{i=1}^{j} qi
Cold  Flu  Malaria   P(Fever)   P(¬Fever)
F     F    F         0.0        1.0
F     F    T         0.9        0.1
F     T    F         0.8        0.2
F     T    T         0.98       0.02 = 0.2 × 0.1
T     F    F         0.4        0.6
T     F    T         0.94       0.06 = 0.6 × 0.1
T     T    F         0.88       0.12 = 0.6 × 0.2
T     T    T         0.988      0.012 = 0.6 × 0.2 × 0.1
● Number of parameters linear in number of parents
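● A small Python sketch of the noisy-OR rule, using the per-cause inhibition probabilities q from the table above:

# Noisy-OR: P(¬Fever | active causes) is the product of the active causes'
# inhibition probabilities q_i, so P(Fever) = 1 − that product.
Q = {'Cold': 0.6, 'Flu': 0.2, 'Malaria': 0.1}

def noisy_or(active_causes):
    p_no_effect = 1.0
    for cause in active_causes:
        p_no_effect *= Q[cause]
    return 1.0 - p_no_effect

print(noisy_or([]))                          # 0.0
print(noisy_or(['Cold', 'Flu']))             # 0.88
print(noisy_or(['Cold', 'Flu', 'Malaria']))  # 0.988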
Hybrid (Discrete+Continuous) Networks 22
● Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)
● Option 1: discretization—possibly large errors, large CPTs
Option 2: finitely parameterized canonical families
● 1) Continuous variable, discrete+continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous Child Variables 23
● Need one conditional density function for child variable given continuous
parents, for each possible assignment to discrete parents
● Most common is the linear Gaussian model, e.g.,:
P(Cost = c ∣ Harvest = h, Subsidy? = true)
= N(a_t h + b_t, σ_t)(c)
= (1 / (σ_t √(2π))) exp( −(1/2) ((c − (a_t h + b_t)) / σ_t)² )
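● A Python sketch of this density; the parameters a_t, b_t, σ_t are placeholders (no specific values appear on the slide), and the ones used in the call are purely illustrative:

import math

def linear_gaussian(c, h, a_t, b_t, sigma_t):
    """P(Cost = c | Harvest = h, Subsidy? = true): a Gaussian in c whose
    mean a_t*h + b_t varies linearly with the parent value h."""
    mean = a_t * h + b_t
    return (1.0 / (sigma_t * math.sqrt(2 * math.pi))
            * math.exp(-0.5 * ((c - mean) / sigma_t) ** 2))

# e.g., with illustrative parameters a_t = -1, b_t = 10, sigma_t = 2:
print(linear_gaussian(c=6.0, h=4.0, a_t=-1.0, b_t=10.0, sigma_t=2.0))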
Continuous Child Variables 24
● All-continuous network with LG distributions
⟹ full joint distribution is a multivariate Gaussian
● A discrete+continuous LG network is a conditional Gaussian network, i.e., a
multivariate Gaussian over all continuous variables for each combination of
discrete variable values
Discrete Variable w/ Continuous Parents 25
● Probability of Buys? given Cost should be a “soft” threshold:
● Probit distribution uses integral of Gaussian:
Φ(x) = ∫_{−∞}^{x} N(0, 1)(x) dx
P (Buys? = true ∣ Cost = c) = Φ((−c + µ)/σ)
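● The standard normal CDF Φ can be computed from the error function, so a minimal Python sketch of this model is (µ and σ below are illustrative values, not from the slide):

import math

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys_probit(c, mu, sigma):
    """P(Buys? = true | Cost = c) = Phi((-c + mu) / sigma)."""
    return phi((-c + mu) / sigma)

print(p_buys_probit(c=5.0, mu=6.0, sigma=1.0))  # ≈ 0.84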
Why the Probit? 26
● It’s sort of the right shape
● Can view as hard threshold whose location is subject to noise
Discrete Variable 27
● Sigmoid (or logit) distribution also used in neural networks:
P(Buys? = true ∣ Cost = c) = 1 / (1 + exp(−2(−c + µ)/σ))
● Sigmoid has similar shape to probit but much longer tails:
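● The corresponding Python sketch (same illustrative µ and σ as in the probit example above):

import math

def p_buys_logit(c, mu, sigma):
    """P(Buys? = true | Cost = c) = 1 / (1 + exp(-2(-c + mu)/sigma))."""
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

print(p_buys_logit(c=5.0, mu=6.0, sigma=1.0))  # ≈ 0.88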
inference
Inference Tasks 29
● Simple queries: compute posterior marginal P(Xi∣E = e)
e.g., P(NoGas ∣ Gauge = empty, Lights = on, Starts = false)
● Conjunctive queries: P(Xi, Xj ∣E = e) = P(Xi∣E = e)P(Xj ∣Xi, E = e)
● Optimal decisions: decision networks include utility information;
probabilistic inference required for P (outcome∣action, evidence)
● Value of information: which evidence to seek next?
● Sensitivity analysis: which probability values are most critical?
● Explanation: why do I need a new starter motor?
Inference by Enumeration 30
● Slightly intelligent way to sum out variables from the joint without actually
constructing its explicit representation
● Simple query on the burglary network
P(B∣j, m)
= P(B, j, m)/P (j, m)
= αP(B, j, m)
= α ∑e ∑a P(B, e, a, j, m)
● Rewrite full joint entries using product of CPT entries:
P(B∣j, m)
= α ∑e ∑a P(B)P (e)P(a∣B, e)P (j∣a)P (m∣a)
= αP(B) ∑e P (e) ∑a P(a∣B, e)P (j∣a)P (m∣a)
● Recursive depth-first enumeration: O(n) space, O(d^n) time
Enumeration Algorithm 31
function ENUMERATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value xi of X do
      extend e with value xi for X
      Q(xi) ← ENUMERATE-ALL(VARS[bn], e)
  return NORMALIZE(Q(X))

function ENUMERATE-ALL(vars, e) returns a real number
  if EMPTY?(vars) then return 1.0
  Y ← FIRST(vars)
  if Y has value y in e
      then return P(y ∣ Pa(Y)) × ENUMERATE-ALL(REST(vars), e)
      else return ∑y P(y ∣ Pa(Y)) × ENUMERATE-ALL(REST(vars), e_y)
           where e_y is e extended with Y = y
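● A compact Python sketch of this procedure for the burglary network (CPT values as in the joint-probability sketch earlier; the dictionary encoding and variable ordering are just one possible choice):

# Exact inference by enumeration on the burglary network.
VARS = ['B', 'E', 'A', 'J', 'M']          # topological order
PARENTS = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
CPT = {'B': {(): 0.001}, 'E': {(): 0.002},
       'A': {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001},
       'J': {(True,): 0.90, (False,): 0.05},
       'M': {(True,): 0.70, (False,): 0.01}}

def prob(var, value, e):
    """P(var = value | parents), with the parents' values taken from e."""
    p_true = CPT[var][tuple(e[p] for p in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(vars, e):
    if not vars:
        return 1.0
    Y, rest = vars[0], vars[1:]
    if Y in e:
        return prob(Y, e[Y], e) * enumerate_all(rest, e)
    return sum(prob(Y, y, e) * enumerate_all(rest, {**e, Y: y})
               for y in (True, False))

def enumeration_ask(X, e):
    q = {x: enumerate_all(VARS, {**e, X: x}) for x in (True, False)}
    total = sum(q.values())
    return {x: v / total for x, v in q.items()}

print(enumeration_ask('B', {'J': True, 'M': True}))  # ≈ {True: 0.284, False: 0.716}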
Evaluation Tree 32
● Enumeration is inefficient: repeated computation
e.g., computes P (j∣a)P (m∣a) for each value of e
Inference by Variable Elimination 33
● Variable elimination: carry out summations right-to-left,
storing intermediate results (factors) to avoid recomputation
P(B ∣ j, m)
= α P(B) ∑e P(e) ∑a P(a ∣ B, e) P(j ∣ a) P(m ∣ a)
  (the five factors correspond to B, E, A, J, and M respectively)
= α P(B) ∑e P(e) ∑a P(a ∣ B, e) P(j ∣ a) fM(a)
= α P(B) ∑e P(e) ∑a P(a ∣ B, e) fJ(a) fM(a)
= α P(B) ∑e P(e) ∑a fA(a, b, e) fJ(a) fM(a)
= α P(B) ∑e P(e) fĀJM(b, e) (sum out A)
= α P(B) fĒĀJM(b) (sum out E)
= α fB(b) × fĒĀJM(b)
Variable Elimination Algorithm 34
function ELIMINATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a belief network specifying joint distribution P(X1, . . . , Xn)
  factors ← [ ]; vars ← REVERSE(VARS[bn])
  for each var in vars do
      factors ← [MAKE-FACTOR(var, e) ∣ factors]
      if var is a hidden variable then factors ← SUM-OUT(var, factors)
  return NORMALIZE(POINTWISE-PRODUCT(factors))
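● A minimal Python sketch of the two factor operations the algorithm relies on, with a factor stored as (variable list, table); this is a simplified illustration over Boolean variables, not the full algorithm:

from itertools import product

def pointwise_product(f1, f2):
    """Multiply two factors; the result ranges over the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for vals in product([True, False], repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = (t1[tuple(a[v] for v in vars1)]
                       * t2[tuple(a[v] for v in vars2)])
    return out_vars, table

def sum_out(var, f):
    """Sum a variable out of a factor."""
    vars_, t = f
    out_vars = [v for v in vars_ if v != var]
    table = {}
    for vals, p in t.items():
        key = tuple(v for v, name in zip(vals, vars_) if name != var)
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# e.g., f_J(a) * f_M(a) from the burglary derivation, then summing out A:
f_J = (['A'], {(True,): 0.90, (False,): 0.05})
f_M = (['A'], {(True,): 0.70, (False,): 0.01})
print(sum_out('A', pointwise_product(f_J, f_M)))  # ([], {(): 0.6305}), up to rounding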
Irrelevant Variables 35
● Consider the query P (JohnCalls∣Burglary = true)
P(J ∣ b) = α P(b) ∑e P(e) ∑a P(a ∣ b, e) P(J ∣ a) ∑m P(m ∣ a)
Sum over m is identically 1; M is irrelevant to the query
● Theorem 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)
● Here
– X = JohnCalls, E = {Burglary}
– Ancestors({X} ∪ E) = {Alarm, Earthquake}
⇒ M aryCalls is irrelevant
● Compare this to backward chaining from the query in Horn clause KBs
Irrelevant Variables 36
● Definition: moral graph of Bayes net: marry all parents and drop arrows
● Definition: A is m-separated from B by C iff separated by C in the moral graph
● Theorem 2: Y is irrelevant if m-separated from X by E
● For P (JohnCalls∣Alarm = true), both
Burglary and Earthquake are irrelevant
Complexity of Exact Inference 37
● Singly connected networks (or polytrees)
– any two nodes are connected by at most one (undirected) path
– time and space cost of variable elimination are O(d^k n)
● Multiply connected networks
– can reduce 3SAT to exact inference ⟹ NP-hard
– equivalent to counting 3SAT models ⟹ #P-complete
approximate inference
Inference by Stochastic Simulation 39
● Basic idea
– Draw N samples from a sampling distribution S
– Compute an approximate posterior probability P̂
– Show this converges to the true probability P
● Outline
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Likelihood weighting: use evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process
whose stationary distribution is the true posterior
Sampling from an Empty Network 40
function PRIOR-SAMPLE(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution P(X1, . . . , Xn)
  x ← an event with n elements
  for i = 1 to n do
      xi ← a random sample from P(Xi ∣ parents(Xi))
           given the values of Parents(Xi) in x
  return x
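● A Python sketch of PRIOR-SAMPLE for the sprinkler network used in the following examples; the CPT values below are the usual textbook ones, consistent with the 0.5, 0.9, 0.8, 0.9 and 0.1, 0.99 numbers that appear on the surrounding slides:

import random

# Sprinkler network: Cloudy -> Sprinkler, Cloudy -> Rain, {Sprinkler, Rain} -> WetGrass.
VARS = ['Cloudy', 'Sprinkler', 'Rain', 'WetGrass']      # topological order
PARENTS = {'Cloudy': [], 'Sprinkler': ['Cloudy'], 'Rain': ['Cloudy'],
           'WetGrass': ['Sprinkler', 'Rain']}
CPT = {'Cloudy': {(): 0.5},
       'Sprinkler': {(True,): 0.1, (False,): 0.5},
       'Rain': {(True,): 0.8, (False,): 0.2},
       'WetGrass': {(True, True): 0.99, (True, False): 0.90,
                    (False, True): 0.90, (False, False): 0.0}}

def p_true(var, event):
    """P(var = True | parents), with the parents' values taken from event."""
    return CPT[var][tuple(event[p] for p in PARENTS[var])]

def prior_sample():
    x = {}
    for var in VARS:                   # parents are sampled before their children
        x[var] = random.random() < p_true(var, x)
    return x

print(prior_sample())   # e.g. {'Cloudy': True, 'Sprinkler': False, ...}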
Examples 41–47
[Figures: a step-by-step run of PRIOR-SAMPLE on the sprinkler network, sampling each variable in topological order given its parents]
Sampling from an Empty Network 48
● Probability that PRIOR-SAMPLE generates a particular event
S_PS(x1 . . . xn) = ∏_{i=1}^{n} P(xi ∣ parents(Xi)) = P(x1 . . . xn)
i.e., the true prior probability
● E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
● Let N_PS(x1 . . . xn) be the number of samples generated for event x1, . . . , xn
● Then we have lim_{N→∞} P̂(x1, . . . , xn) = lim_{N→∞} N_PS(x1, . . . , xn)/N
= S_PS(x1, . . . , xn)
= P(x1 . . . xn)
● That is, estimates derived from PRIOR-SAMPLE are consistent
● Shorthand: P̂(x1, . . . , xn) ≈ P(x1 . . . xn)
Rejection Sampling 49
● P̂(X∣e) estimated from samples agreeing with e
function REJECTION-SAMPLING(X, e, bn, N) returns an estimate of P(X ∣ e)
  local variables: N, a vector of counts over X, initially zero
  for j = 1 to N do
      x ← PRIOR-SAMPLE(bn)
      if x is consistent with e then
          N[x] ← N[x] + 1 where x is the value of X in x
  return NORMALIZE(N[X])
● E.g., estimate P(Rain∣Sprinkler = true) using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = f alse
● P̂(Rain ∣ Sprinkler = true) = NORMALIZE(⟨8, 19⟩) = ⟨0.296, 0.704⟩
● Similar to a basic real-world empirical estimation procedure
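● A Python sketch of rejection sampling, reusing the prior_sample definition from the PRIOR-SAMPLE sketch above:

def rejection_sampling(X, evidence, n):
    """Estimate P(X | evidence) from prior samples consistent with the evidence."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        x = prior_sample()
        if all(x[var] == val for var, val in evidence.items()):
            counts[x[X]] += 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()} if total else None

print(rejection_sampling('Rain', {'Sprinkler': True}, 10000))  # ≈ {True: 0.3, False: 0.7}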
Analysis of Rejection Sampling 50
● P̂(X ∣ e) = α N_PS(X, e) (algorithm defn.)
= N_PS(X, e)/N_PS(e) (normalized by N_PS(e))
≈ P(X, e)/P(e) (property of PRIOR-SAMPLE)
= P(X ∣ e) (defn. of conditional probability)
● Hence rejection sampling returns consistent posterior estimates
● Problem: hopelessly expensive if P (e) is small
● P (e) drops off exponentially with number of evidence variables!
Likelihood Weighting 51
● Idea: fix evidence variables, sample only nonevidence variables,
and weight each sample by the likelihood it accords the evidence
function LIKELIHOOD-WEIGHTING(X, e, bn, N) returns an estimate of P(X ∣ e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
      x, w ← WEIGHTED-SAMPLE(bn, e)
      W[x] ← W[x] + w where x is the value of X in x
  return NORMALIZE(W[X])

function WEIGHTED-SAMPLE(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
      if Xi has a value xi in e
          then w ← w × P(Xi = xi ∣ parents(Xi))
          else xi ← a random sample from P(Xi ∣ parents(Xi))
  return x, w
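● A Python sketch of likelihood weighting, again reusing VARS, p_true, and the import of random from the sprinkler-network sketch above:

def weighted_sample(evidence):
    w, x = 1.0, {}
    for var in VARS:
        p = p_true(var, x)
        if var in evidence:
            x[var] = evidence[var]
            w *= p if evidence[var] else 1.0 - p   # weight by likelihood of the evidence
        else:
            x[var] = random.random() < p
    return x, w

def likelihood_weighting(X, evidence, n):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        x, w = weighted_sample(evidence)
        totals[x[X]] += w
    z = sum(totals.values())
    return {v: t / z for v, t in totals.items()}

print(likelihood_weighting('Rain', {'Sprinkler': True, 'WetGrass': True}, 10000))
# ≈ {True: 0.32, False: 0.68}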
Likelihood Weighting Example 52–58
[Figures: a step-by-step run of WEIGHTED-SAMPLE on the sprinkler network with the evidence Sprinkler = true and WetGrass = true held fixed]
w = 1.0
w = 1.0 × 0.1
w = 1.0 × 0.1 × 0.99 = 0.099
Likelihood Weighting Analysis 59
● Sampling probability for WEIGHTED-SAMPLE is
S_WS(z, e) = ∏_{i=1}^{l} P(zi ∣ parents(Zi))
● Note: pays attention to evidence in ancestors only
⟹ somewhere “in between” prior and posterior distribution
● Weight for a given sample z, e is
w(z, e) = ∏_{i=1}^{m} P(ei ∣ parents(Ei))
● Weighted sampling probability is
S_WS(z, e) w(z, e) = ∏_{i=1}^{l} P(zi ∣ parents(Zi)) ∏_{i=1}^{m} P(ei ∣ parents(Ei))
= P(z, e) (by standard global semantics of network)
● Hence likelihood weighting returns consistent estimates
but performance still degrades with many evidence variables
because a few samples have nearly all the total weight
Approximate Inference using MCMC 60
● “State” of network = current assignment to all variables
● Generate next state by sampling one variable given Markov blanket
Sample each variable in turn, keeping evidence fixed
function MCMC-ASK(X, e, bn, N) returns an estimate of P(X ∣ e)
  local variables: N[X], a vector of counts over X, initially zero
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
      for each Zi in Z do
          sample the value of Zi in x from P(Zi ∣ mb(Zi))
              given the values of MB(Zi) in x
          N[x] ← N[x] + 1 where x is the value of X in x
  return NORMALIZE(N[X])
● Can also choose a variable to sample at random each time
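● A Python sketch of Gibbs sampling for this kind of query, computing P(Zi ∣ mb(Zi)) from the sprinkler-network definitions (VARS, p_true, random) in the PRIOR-SAMPLE sketch above; the CHILDREN table is the obvious one for that network:

CHILDREN = {'Cloudy': ['Sprinkler', 'Rain'], 'Sprinkler': ['WetGrass'],
            'Rain': ['WetGrass'], 'WetGrass': []}

def p_value(var, value, event):
    p = p_true(var, event)
    return p if value else 1.0 - p

def p_given_markov_blanket(var, value, event):
    """Unnormalized P(var = value | mb(var)): own CPT entry times the children's."""
    e = dict(event, **{var: value})
    score = p_value(var, value, e)
    for child in CHILDREN[var]:
        score *= p_value(child, e[child], e)
    return score

def mcmc_ask(X, evidence, n):
    nonevidence = [v for v in VARS if v not in evidence]
    x = dict(evidence, **{v: random.choice([True, False]) for v in nonevidence})
    counts = {True: 0, False: 0}
    for _ in range(n):
        for z in nonevidence:
            pt = p_given_markov_blanket(z, True, x)
            pf = p_given_markov_blanket(z, False, x)
            x[z] = random.random() < pt / (pt + pf)
            counts[x[X]] += 1          # count one visited state per resampled variable
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

print(mcmc_ask('Rain', {'Sprinkler': True, 'WetGrass': True}, 5000))
# ≈ {True: 0.32, False: 0.68}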
The Markov Chain 61
● With Sprinkler = true, WetGrass = true, there are four states:
● Wander about for a while, average what you see
MCMC Example 62
● Estimate P(Rain∣Sprinkler = true, W etGrass = true)
● Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.
● E.g., visit 100 states
31 have Rain = true, 69 have Rain = f alse
● P̂(Rain∣Sprinkler = true, W etGrass = true)
= N ORMALIZE(⟨31, 69⟩) = ⟨0.31, 0.69⟩
● Theorem: chain approaches stationary distribution:
long-run fraction of time spent in each state is exactly
proportional to its posterior probability
Markov Blanket Sampling 63
● Markov blanket of Cloudy is Sprinkler and Rain
● Markov blanket of Rain is
Cloudy, Sprinkler, and WetGrass
● Probability given the Markov blanket is calculated as follows:
P(x′i ∣ mb(Xi)) = P(x′i ∣ parents(Xi)) × ∏_{Zj ∈ Children(Xi)} P(zj ∣ parents(Zj))
● Easily implemented in message-passing parallel systems, brains
● Main computational problems
– difficult to tell if convergence has been achieved
– can be wasteful if Markov blanket is large:
P (Xi∣mb(Xi)) won’t change much (law of large numbers)
Summary 64
● Bayes nets provide a natural representation for (causally induced)
conditional independence
● Topology + CPTs = compact representation of joint distribution
● Generally easy for (non)experts to construct
● Canonical distributions (e.g., noisy-OR) = compact representation of CPTs
● Continuous variables Ô⇒ parameterized distributions (e.g., linear Gaussian)
● Exact inference by variable elimination
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology
● Approximate inference by LW, MCMC
– LW does poorly when there is lots of (downstream) evidence
– LW, MCMC generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables