
Undirected Graphical Models

10/36-702

Contents

1 Marginal Correlation Graphs

2 Partial Correlation Graphs

3 Conditional Independence Graphs
  3.1 Gaussian
  3.2 Multinomials and Log-Linear Models
  3.3 The Nonparametric Case

4 A Deeper Look At Conditional Independence Graphs
  4.1 Markov Properties
  4.2 Clique Decomposition
  4.3 Directed vs. Undirected Graphs
  4.4 Faithfulness

Graphical models are a way of representing the relationships between features (variables).
There are two main brands: directed and undirected. We shall focus on undirected graphical
models. See Figure 1 for an example of an undirected graph.

Undirected graphs come in different flavors, such as:

1. Marginal Correlation Graphs.


2. Partial Correlation Graphs.
3. Conditional Independence Graphs.

In each case, there are parametric and nonparametric versions.

Let X1, . . . , Xn ∼ P where Xi = (Xi(1), . . . , Xi(d))^T ∈ R^d. The vertices (nodes) of the graph refer to the d features. Each node of the graph corresponds to one feature. Edges represent relationships between the features. The graph is represented by G = (V, E) where V = (V1, . . . , Vd) are the vertices and E are the edges. We can regard the edges E as a d × d matrix where E(j, k) = 1 if there is an edge between feature j and feature k and 0 otherwise. Alternatively, you can regard E as a list of pairs where (j, k) ∈ E if there is an edge between j and k. We write

    X ⊥⊥ Y

to mean that X and Y are independent. In other words, p(x, y) = p(x)p(y). We write

    X ⊥⊥ Y | Z

to mean that X and Y are independent given Z. In other words, p(x, y|z) = p(x|z)p(y|z).

1 Marginal Correlation Graphs

In a marginal correlation graph (or association graph) we put an edge between Vj and Vk if |ρ(j, k)| ≥ ε where ρ(j, k) is some measure of association. Often we use ε = 0 in which case there is an edge iff ρ(j, k) ≠ 0. We also write ρ(Xj, Xk) to mean the same as ρ(j, k).

The parameter ρ(j, k) is required to have the following property:

    X ⊥⊥ Y implies that ρ(X, Y ) = 0.

In general, the reverse may not be true. We will say that ρ is strong if

    X ⊥⊥ Y if and only if ρ(X, Y ) = 0.

We would like ρ to have several properties: it should be easy to compute, it should be robust to outliers, and there should be some way to calculate a confidence interval for the parameter. Here is a summary of the association measures we will consider:

Figure 1: A Protein network. From: Maslov and Sneppen (2002). Specificity and Stability
in Topology of Protein Networks. Science, 296, 910-913.

               Strong   Robust   Fast   Confidence Interval
    Pearson      ×        ×       ✓            ✓
    Kendall      ×        ✓       ✓            ✓
    Dcorr        ✓        ×       ×         sort of
    τ∗           ✓        ✓       ×            ✓

Pearson Correlation. A common choice of ρ is the Pearson correlation. For two variables X and Y the Pearson correlation is

    ρ(X, Y ) = Cov(X, Y ) / (σX σY).    (1)

The sample estimate is

    r(X, Y ) = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / (sX sY).

When dealing with d features X(1), . . . , X(d), we write ρ(j, k) ≡ ρ(X(j), X(k)). The sample correlation is denoted by rjk.

To test H0 : ρ(j, k) = 0 versus H1 : ρ(j, k) ≠ 0 we can use an asymptotic test or an exact test. The asymptotic test works like this. Define

    Zjk = (1/2) log[(1 + rjk)/(1 − rjk)].

Fisher proved that

    Zjk ≈ N(θjk, 1/(n − 3))

where

    θjk = (1/2) log[(1 + ρjk)/(1 − ρjk)].

We reject H0 if |Zjk| > z_{α/2}/√(n − 3). In fact, to control for multiple testing, we should reject when |Zjk| > z_{α/(2m)}/√(n − 3) where m = \binom{d}{2}. The confidence interval is Cn = [a, b] where a = exp(Zjk − z_{α/2}/√(n − 3)) and b = exp(Zjk + z_{α/2}/√(n − 3)). A simultaneous confidence set for all the correlations can be obtained using the high dimensional bootstrap which we describe later.

An exact test can be obtained by using a permutation test. Permute one of the variables and recompute the correlation. Repeat B times to get r_{jk}^{(1)}, . . . , r_{jk}^{(B)}. The p-value is

    p = (1/B) Σ_s I(|r_{jk}^{(s)}| ≥ |rjk|).

Reject if p ≤ α/m.
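To make the recipe concrete, here is a minimal sketch in Python (numpy only; the function names, the Bonferroni threshold α/m, and the toy data are choices made for illustration, not prescriptions from the notes):

```python
import numpy as np
from itertools import combinations

def perm_pvalue(x, y, B=1000, rng=None):
    """Permutation p-value for H0: the Pearson correlation is zero."""
    rng = np.random.default_rng() if rng is None else rng
    r_obs = np.corrcoef(x, y)[0, 1]
    r_perm = np.array([np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(B)])
    return np.mean(np.abs(r_perm) >= np.abs(r_obs))

def correlation_graph(X, alpha=0.05, B=1000):
    """d x d edge matrix E for the marginal correlation graph."""
    n, d = X.shape
    m = d * (d - 1) // 2                                   # number of tests
    E = np.zeros((d, d), dtype=int)
    for j, k in combinations(range(d), 2):
        if perm_pvalue(X[:, j], X[:, k], B=B) <= alpha / m:  # Bonferroni correction
            E[j, k] = E[k, j] = 1
    return E

# toy example: features 0 and 1 are dependent, feature 2 is independent of both
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
X = np.column_stack([Z[:, 0], Z[:, 0] + 0.1 * rng.normal(size=200), Z[:, 1]])
print(correlation_graph(X))
```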

Kendall’s τ. The Pearson correlation is not very robust to outliers. A more robust measure of association is Kendall’s tau, defined by

    τ(X, Y ) = E[ sign((X1 − X2)(Y1 − Y2)) ].

Kendall’s τ can be interpreted as: probability(concordant) − probability(discordant). See this plot:

[Figure: scatterplots illustrating a concordant pair and a discordant pair.]

τ can be estimated by

    τ̂(X, Y ) = \binom{n}{2}^{-1} Σ_{s<t} sign[(Xs − Xt)(Ys − Yt)].

A statistic of this form is called a U-statistic. Under H0, τ̂jk ≈ N(0, 4/(9n)), so we reject when √(9n/4) |τ̂jk| > z_{α/(2m)}. Alternatively, use the permutation test.

Distance Correlation. There are various nonparametric measures of association. The most common are the distance correlation and the RKHS correlation. The squared distance covariance between two random vectors X and Y is defined by (Szekely et al 2007)

    γ²(X, Y ) = Cov(||X − X′||, ||Y − Y′||) − 2 Cov(||X − X′||, ||Y − Y′′||)    (2)

where (X, Y ), (X′, Y′) and (X′′, Y′′) are independent pairs. The distance correlation is

    ρ²(X, Y ) = γ²(X, Y ) / √(γ²(X, X) γ²(Y, Y )).

It can be shown that

    γ²(X, Y ) = (1/(c1 c2)) ∫ |φ_{X,Y}(s, t) − φX(s) φY(t)|² / (||s||^{1+d} ||t||^{1+d}) ds dt    (3)

where c1, c2 are constants and φ denotes the characteristic function. Another expression (Lyons 2013) for γ is

    γ²(X, Y ) = E[δ(X, X′) δ(Y, Y′)]

where

    δ(X, X′) = d(X, X′) − 2 ∫ d(X, u) dP(u) + ∫∫ d(u, v) dP(u) dP(v)

and d(x, y) = ||x − y||. In fact, other metrics d can be used.

Lemma 1 We have that 0 ≤ ρ(X, Y ) ≤ 1 and ρ(X, Y ) = 0 if and only if X ⊥⊥ Y.

An estimate of γ is

    γ̂²(X, Y ) = (1/n²) Σ_{j,k} Ajk Bjk

where

    Ajk = ajk − aj· − a·k + a··,    Bjk = bjk − bj· − b·k + b··.

Here, ajk = ||Xj − Xk|| and aj·, a·k, a·· are the row, column and grand means of the matrix {ajk}. The limiting distribution of γ̂²(X, Y ) is complicated. But we can easily test H0 : γ(X, Y ) = 0 using a permutation test.
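The double-centering formula translates directly into code. Here is a minimal sketch (Python/numpy; the function names are mine, and the permutation test mirrors the one used for the Pearson correlation):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dcov2(x, y):
    """Plug-in estimate of the squared distance covariance via double centering."""
    a = squareform(pdist(x.reshape(len(x), -1)))   # a_{jk} = ||X_j - X_k||
    b = squareform(pdist(y.reshape(len(y), -1)))
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

def dcor(x, y):
    """Sample distance correlation."""
    return np.sqrt(dcov2(x, y) / np.sqrt(dcov2(x, x) * dcov2(y, y)))

def dcov_perm_test(x, y, B=500, rng=None):
    """Permutation p-value for H0: gamma(X, Y) = 0."""
    rng = np.random.default_rng() if rng is None else rng
    obs = dcov2(x, y)
    perm = np.array([dcov2(x, rng.permutation(y)) for _ in range(B)])
    return np.mean(perm >= obs)

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = x**2 + 0.5 * rng.normal(size=300)    # dependent but essentially uncorrelated
print(dcor(x, y), dcov_perm_test(x, y))
```

The double centering makes each test O(n²) in time and memory, which is one reason the table above lists the distance correlation as not fast.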

Another nonparametric measure of independence based on RKHS is

    γ²(X, Y ) = E[Kh(X, X′) Kh(Y, Y′)] + E[Kh(X, X′)] E[Kh(Y, Y′)]
                − 2 E[ ∫ Kh(X, u) dP(u) ∫ Kh(Y, v) dP(v) ]

for a kernel Kh. See Gretton et al (2008).


To apply any of these methods to graphs, we need to test all \binom{d}{2} correlations.

The Bergsma-Dassios τ∗ Correlation. Bergsma and Dassios (2014) extended Kendall’s τ into a strong correlation. The definition is

    τ∗(X, Y ) = E[a(X1, X2, X3, X4) a(Y1, Y2, Y3, Y4)]    (4)

where

    a(z1, z2, z3, z4) = sign(|z1 − z2| + |z3 − z4| − |z1 − z3| − |z2 − z4|).

Lemma 2 τ∗(X, Y ) ≥ 0. Further, τ∗(X, Y ) = 0 if and only if X ⊥⊥ Y.

An estimate of τ∗ is

    τ̂∗ = \binom{n}{4}^{-1} Σ a(Xi, Xj, Xk, Xℓ) a(Yi, Yj, Yk, Yℓ)    (5)

where the sum is over all distinct quadruples.

The τ ∗ parameter can also be given an interpretation in terms of concordant and discordant
points if we define them as follows:

[Figure: scatterplots of quadruples illustrating two concordant groups and a discordant group.]

Then

    τ∗ = [2 P(concordant) − P(discordant)] / 3.

This statistic is related to the distance covariance as follows:

    τ∗(X, Y ) = E[a(X1, X2, X3, X4) a(Y1, Y2, Y3, Y4)]
    γ²(X, Y ) = (1/4) E[b(X1, X2, X3, X4) b(Y1, Y2, Y3, Y4)]

where

    b(z1, z2, z3, z4) = |z1 − z2| + |z3 − z4| − |z1 − z3| − |z2 − z4|

and a(z1, z2, z3, z4) = sign(b(z1, z2, z3, z4)).

To test H0 : X ⊥⊥ Y we use a permutation test. Recently Dhar, Dassios and Bergsma (2016) showed that τ̂∗ has good power and is quite robust.

Confidence Intervals. Constructing confidence intervals for γ is not easy. The problem is that the statistics have different limiting distributions depending on whether the null H0 : X ⊥⊥ Y is true or not. For example, if H0 is true then

    n(γ̂² − γ²) ⇝ Σ_{j=1}^∞ λj [(Zj + aj)² − 1]

where Z1, Z2, . . . ∼ N(0, 1) and {λj, aj}_{j=1}^∞ are (unknown) constants. A similar result holds for γ̂. This is called a Gaussian chaos. On the other hand, when H0 is false, we have that

    √n (γ̂² − γ²) ⇝ N(0, σ²)

for some σ > 0. When γ is small but not quite 0, the distribution will be something in-between a Normal and a Gaussian chaos.

Since the limiting distribution varies, we cannot really use it to construct a confidence interval. One way to solve this problem is to use blocking. Instead of using a U-statistic based on all subsets of size 4, we can break the dataset into non-overlapping blocks of size 4. We construct Q = a(Xi, Xj, Xk, Xℓ) a(Yi, Yj, Yk, Yℓ) on each block. Then we define

    g = (1/m) Σ_j Qj

where m = n/4 is the number of blocks. Since this is an average, it will have a limiting Normal distribution. We can then use the Normal approximation or the bootstrap to get confidence intervals. However, g uses less information than γ̂ so we will get larger confidence intervals than necessary.

For τ∗ the situation is better. Note that τ̂∗ is a U-statistic of order 4. That is

    τ̂∗ = \binom{n}{4}^{-1} Σ K(Zi, Zj, Zk, Zℓ)

where the sum is over all distinct quadruples and Zi = (Xi, Yi). Also, −1 ≤ K ≤ 1. By Hoeffding’s inequality for U-statistics of order b we have

    P(|τ̂∗ − τ∗| > t) ≤ 2 exp(−2(n/b) t²/r²)

where r is the range of K. In our case, r = 2 and b = 4 so

    P(|τ̂∗ − τ∗| > t) ≤ 2 exp(−n t²/8).

If we set tn = √((8/n) log(2/α)) then Cn = τ̂∗ ± tn is a 1 − α confidence interval. However, it may not be the shortest possible confidence interval. Finding the shortest valid confidence interval is an open question.
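To make the interval concrete, here is a small sketch (Python; the O(n⁴) sum over quadruples is only feasible for tiny n, so this is purely illustrative and the helper names are mine):

```python
import numpy as np
from itertools import combinations

def a_kernel(z1, z2, z3, z4):
    # a(z1,z2,z3,z4) = sign(|z1-z2| + |z3-z4| - |z1-z3| - |z2-z4|)
    return np.sign(abs(z1 - z2) + abs(z3 - z4) - abs(z1 - z3) - abs(z2 - z4))

def tau_star(x, y):
    """U-statistic estimate of tau*: average over all distinct quadruples."""
    quads = combinations(range(len(x)), 4)
    vals = [a_kernel(*x[list(q)]) * a_kernel(*y[list(q)]) for q in quads]
    return np.mean(vals)

def tau_star_ci(x, y, alpha=0.05):
    """Hoeffding-based 1 - alpha confidence interval for tau*."""
    n = len(x)
    t_n = np.sqrt((8.0 / n) * np.log(2.0 / alpha))
    est = tau_star(x, y)
    return est - t_n, est + t_n

rng = np.random.default_rng(2)
x = rng.normal(size=40)
y = 0.8 * x + rng.normal(size=40)
print(tau_star_ci(x, y))
```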

Example. Figure 2 shows some Pearson graphs. These are: two Markov chains, a hub,
four clusters, and a band. Technically, these are best discovered using conditional
independence graphs (discussed later). But correlation graphs are easy to estimate and
often reveal the salient structure.

Figure 3 shows a graph from highly non-Normal data. The data have the structure of two
Markov chains. I used the distance correlation on all pairs with permutation tests. Nice!

High Dimensional Bootstrap For Pearson Correlations. We can also get simultaneous confidence intervals for many Pearson correlations. This is especially important if we want to put an edge when |ρ(j, k)| ≥ ε. If we have a confidence interval C then we can put an edge whenever [−ε, ε] ∩ C = ∅.

The easiest way to get simultaneous confidence intervals is to use the bootstrap. Let R be the d × d matrix of true correlations and let R̂ be the d × d matrix of sample correlations. (Actually, it is probably better to use the Fisher transformed correlations.) Let X1∗, . . . , Xn∗ denote a bootstrap sample and let R̂∗ be the d × d matrix of correlations from the bootstrap sample. After taking B bootstrap samples we have R̂∗_1, . . . , R̂∗_B. Let δj = √n max_{s,t} |R̂∗_j(s, t) − R̂(s, t)| and define

    F̂n(w) = (1/B) Σ_{j=1}^B I(δj ≤ w)

which approximates

    Fn(w) = P( √n max_{s,t} |R̂(s, t) − R(s, t)| ≤ w ).

Let wα = F̂n^{-1}(1 − α). Finally, we set

    Cst = [ R̂(s, t) − wα/√n , R̂(s, t) + wα/√n ].
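A minimal sketch of this simultaneous bootstrap (Python/numpy; the function name and the choice to bootstrap the raw correlations rather than their Fisher transforms are mine):

```python
import numpy as np

def simultaneous_corr_cis(X, B=1000, alpha=0.05, rng=None):
    """Simultaneous 1 - alpha confidence intervals for all pairwise correlations."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    R_hat = np.corrcoef(X, rowvar=False)
    delta = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap sample
        R_star = np.corrcoef(X[idx], rowvar=False)
        delta[b] = np.sqrt(n) * np.max(np.abs(R_star - R_hat))
    w = np.quantile(delta, 1 - alpha)               # w_alpha
    return R_hat - w / np.sqrt(n), R_hat + w / np.sqrt(n)

# epsilon-rule: keep edge (j, k) when [-eps, eps] does not intersect C_jk
rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(3), [[1, .6, 0], [.6, 1, 0], [0, 0, 1]], size=400)
lo, hi = simultaneous_corr_cis(X)
eps = 0.1
E = (lo > eps) | (hi < -eps)
np.fill_diagonal(E, False)
print(E)
```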

Figure 2: Pearson correlation graphs. Top left: two Markov chains. Top right: a hub. Bottom left: 4 clusters. Bottom right: banded.

Figure 3: Graph based on nonparametric distance correlation. Two Markov chains.

Theorem 3 Suppose that d = o(e^{n^{1/6}}). Then

    P(R(s, t) ∈ Cst for all s, t) → 1 − α

as n → ∞.

2 Partial Correlation Graphs

Let X, Y ∈ R and Z be a random vector. The partial correlation between X and Y , given
Z, is a measure of association between X and Y after removing the effect of Z.

Specifically, ρ(X, Y |Z) is the correlation between εX and εY where

    εX = X − ΠZ X,    εY = Y − ΠZ Y.

Here, ΠZ X is the projection of X onto the linear space spanned by Z. That is, ΠZ X = β^T Z where β minimizes E[X − β^T Z]². In other words, ΠZ X is the linear regression of X on Z. Similarly for ΠZ Y. We’ll give an explicit formula for the partial correlation shortly.

Now let’s go back to graphs. Let X = (X(1), . . . , X(d)) and let ρjk denote the partial
correlation between X(j) and X(k) given all the other variables. Let R = {ρjk } be the d × d
matrix of partial correlations.

Lemma 4 The matrix R is given by R(j, k) = −Ωjk/√(Ωjj Ωkk) where Ω = Σ^{-1} and Σ is the covariance matrix of X.

The partial correlation graph G has an edge between j and k when ρjk ≠ 0.

In the low-dimensional setting, we can estimate R as follows. Let Sn be the sample covariance. Let Ω̂ = Sn^{-1} and define R̂(j, k) = −Ω̂jk/√(Ω̂jj Ω̂kk). The easiest way to construct the graph is to get simultaneous confidence intervals Cjk using the bootstrap. Then we put an edge if 0 ∉ Cjk.

There is also a Normal approximation similar to the one for correlations. Define

    Zjk = (1/2) log[(1 + rjk)/(1 − rjk)]

where rjk = R̂(j, k). Then

    Zjk ≈ N(θjk, 1/(n − g − 3))

where g = d − 2 and

    θjk = (1/2) log[(1 + ρjk)/(1 − ρjk)].

We reject H0 if |Zjk| > z_{α/(2m)}/√(n − g − 3).
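A minimal sketch of the low-dimensional recipe (Python/numpy; the function names are mine, and the Fisher-type z test above is used in place of the bootstrap for brevity):

```python
import numpy as np
from scipy.stats import norm

def partial_corr_matrix(X):
    """Partial correlations R(j, k) = -Omega_jk / sqrt(Omega_jj Omega_kk)."""
    Omega = np.linalg.inv(np.cov(X, rowvar=False))
    s = np.sqrt(np.diag(Omega))
    return -Omega / np.outer(s, s)

def partial_corr_graph(X, alpha=0.05):
    """Edges where the Fisher-type z test rejects rho_jk = 0 (Bonferroni corrected)."""
    n, d = X.shape
    R = partial_corr_matrix(X)
    np.fill_diagonal(R, 0.0)                      # ignore the diagonal
    g, m = d - 2, d * (d - 1) // 2
    Z = 0.5 * np.log((1 + R) / (1 - R))
    E = np.abs(Z) > norm.ppf(1 - alpha / (2 * m)) / np.sqrt(n - g - 3)
    np.fill_diagonal(E, False)
    return E

rng = np.random.default_rng(4)
# chain X1 -> X2 -> X3: the partial correlation between X1 and X3 given X2 is zero
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(size=500)
x3 = x2 + rng.normal(size=500)
print(partial_corr_graph(np.column_stack([x1, x2, x3])))
```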

In high dimensions, this won’t work since Sn is not invertible. In fact,

    Var(R̂(j, k)) ≈ 1/(n − d)

which shows that we cannot reliably estimate the partial correlation when d is large. You can do three things:

1. Compute a correlation graph instead. This is easy, works well, and often reveals structure similar to that in the partial correlation graph.
2. Shrinkage: let Ω̂ = [(1 − ε)Sn + εD]^{-1} where 0 ≤ ε ≤ 1 and D is a diagonal matrix with Djj = Sjj. Then we use the bootstrap to test the entries of the matrix. Based on calculations in Schafer and Strimmer (2005) and Ledoit and Wolf (2004), a good choice of ε is

       ε̂ = Σ_{j≠k} Var̂(sjk) / Σ_{j≠k} s²jk

   where

       Var̂(sjk) = n/((n − 1)³) Σ_{i=1}^n (wijk − w̄jk)²,

   wijk = (Xi(j) − X̄(j))(Xi(k) − X̄(k)) and w̄jk = n^{-1} Σ_i wijk. This choice is based on minimizing the estimated risk. (A code sketch of this estimator is given just after this list.)
3. Use the graphical lasso (described below). Warning! The reliability of the graphical
lasso depends on lots of non-trivial, uncheckable assumptions.
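A minimal sketch of the shrinkage estimator from item 2 above (Python/numpy; this follows the Schafer-Strimmer formula as written above, with helper names of my choosing):

```python
import numpy as np

def shrinkage_precision(X):
    """Estimate Omega = [(1-eps) S_n + eps D]^{-1} with the Schafer-Strimmer eps."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # w_ijk = (X_i(j) - Xbar(j)) (X_i(k) - Xbar(k)), stored as an (n, d, d) array
    W = Xc[:, :, None] * Xc[:, None, :]
    var_s = n / (n - 1) ** 3 * ((W - W.mean(axis=0)) ** 2).sum(axis=0)
    off = ~np.eye(d, dtype=bool)
    eps = var_s[off].sum() / (S[off] ** 2).sum()
    eps = min(max(eps, 0.0), 1.0)                 # clip to [0, 1]
    D = np.diag(np.diag(S))
    return np.linalg.inv((1 - eps) * S + eps * D), eps

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 80))                     # d > n, so S_n itself is singular
Omega_hat, eps_hat = shrinkage_precision(X)
print(eps_hat, Omega_hat.shape)
```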

Figure 4: Graphical Lasso for a hub graph. [Panels: sparsity level versus the regularization parameter, and the estimated graphs at λ = 0.266, λ = 0.165, and λ = 0.13.]

For the graphical lasso we proceed as follows. We assume that Xi ∼ N(µ, Σ). Then we estimate Σ (and hence Ω = Σ^{-1}) using the penalized log-likelihood,

    Ω̂ = argmax_{Ω ≻ 0} [ ℓ(Ω) − λ Σ_{j≠k} |ωjk| ]

where the log-likelihood (after maximizing over µ) is

    ℓ(Ω) = (n/2) log |Ω| − (n/2) tr(Ω Sn) − (nd/2) log(2π).    (6)
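A minimal sketch of fitting this penalized likelihood (Python; scikit-learn's GraphicalLasso is used here as a stand-in for the glasso, and the value of the penalty alpha is arbitrary):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(6)
# hub structure: every variable is a noisy copy of a common hub variable
hub = rng.normal(size=300)
X = np.column_stack([hub] + [hub + rng.normal(size=300) for _ in range(9)])

model = GraphicalLasso(alpha=0.2).fit(X)       # alpha plays the role of lambda
Omega_hat = model.precision_
E = np.abs(Omega_hat) > 1e-8                   # nonzero entries define the edges
np.fill_diagonal(E, False)
print(E.sum() // 2, "edges")
```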

Node-wise regression. A related but different approach is due to Meinshausen and Buhlmann (2006). The idea is to regress each variable on all the others using the lasso.

Example: Figure 4 shows a hub graph. The graph was estimated by the Meinshausen and Buhlmann (2006) method using the R package huge.
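For completeness, here is a sketch of the node-wise regression idea itself (Python; scikit-learn's LassoCV stands in for the lasso fits performed by huge, and the "OR" rule for combining the estimated neighborhoods is one common convention):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def nodewise_lasso_graph(X):
    """Meinshausen-Buhlmann style neighborhood selection with an OR rule."""
    n, d = X.shape
    E = np.zeros((d, d), dtype=bool)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        coef = LassoCV(cv=5).fit(X[:, others], X[:, j]).coef_
        for c, k in zip(coef, others):
            if abs(c) > 1e-8:
                E[j, k] = E[k, j] = True   # OR rule: an edge if either regression selects it
    return E

rng = np.random.default_rng(7)
hub = rng.normal(size=300)
X = np.column_stack([hub] + [hub + rng.normal(size=300) for _ in range(9)])
print(nodewise_lasso_graph(X).sum() // 2, "edges")
```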

How To Choose λ. If we use the graphical lasso, how do we choose λ? One idea is to use cross-validation based on the Normal log-likelihood. In this case we fit the model on part of the data and evaluate the log-likelihood on the held-out data. This is not very reliable since it depends heavily on the Normality assumption. Currently, I do not think there is a rigorously justified, robust method for choosing λ. Perhaps the best strategy is to plot the graph for many values of λ.

More Robust Approach. Here is a more robust method. For each pair of variables, regress them on all the other variables (using your favorite regression method). Now compute the Kendall correlation on the residuals. This seems like a good idea but I have not seen anyone try it.
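A sketch of that recipe (Python; ordinary least squares is used here as one arbitrary choice of "favorite regression method", and scipy's kendalltau supplies the test):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LinearRegression

def residual_kendall(X, j, k):
    """Kendall correlation between residuals of X_j and X_k regressed on the rest."""
    others = [m for m in range(X.shape[1]) if m not in (j, k)]
    Z = X[:, others]
    res_j = X[:, j] - LinearRegression().fit(Z, X[:, j]).predict(Z)
    res_k = X[:, k] - LinearRegression().fit(Z, X[:, k]).predict(Z)
    return kendalltau(res_j, res_k)               # returns (tau, p-value)

rng = np.random.default_rng(8)
z = rng.normal(size=400)
X = np.column_stack([z + rng.normal(size=400), z + rng.normal(size=400), z])
print(residual_kendall(X, 0, 1))                  # conditionally independent given X_2
```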

Nonparametric Partial Correlation Graphs. There are various ways to create a nonparametric partial correlation. Let us write

    X = g(Z) + εX
    Y = h(Z) + εY.

Thus, εX = X − g(Z) and εY = Y − h(Z) where g(z) = E[X|Z = z] and h(z) = E[Y |Z = z]. Now define

    ρ(X, Y |Z) = ρ(εX, εY).

We can estimate ρ by using nonparametric regression to estimate g(z) and h(z). Then we take the correlation between the residuals ε̂X,i = Xi − ĝ(Zi) and ε̂Y,i = Yi − ĥ(Zi). When Z is high-dimensional, we can use SpAM to estimate g and h.

3 Conditional Independence Graphs

The strongest type of undirected graph is a conditional independence graph. In this case, we
omit the edge between j and k if X(j) is independent of X(k) given the rest of the variables.
We write this as
    X(j) ⊥⊥ X(k) | rest.    (7)

Conditional independence graphs are the most informative undirected graphs but they are
also the hardest to estimate.

3.1 Gaussian

In the special case of Normality, they are equivalent to partial correlation graphs.

Theorem 5 Suppose that X = (X(1), . . . , X(d)) ∼ N (µ, Σ). Let Ω = Σ−1 . Then X(j) is
independent of X(k) given the rest, if and only if Ωjk = 0.

So, in the Normal case, we are back to partial correlations.

3.2 Multinomials and Log-Linear Models

When all the variables are discrete, the joint distribution is multinomial. It is convenient to
reparameterize the multinomial in a form known as a log-linear model.

Let’s start with a simple example. Suppose X = (X(1), X(2)) and that each variable is
binary. Let
p(x1 , x2 ) = P(X1 = x1 , X2 = x2 ).
So, for example, p(0, 1) = P(X1 = 0, X2 = 1). There are four unknown parameters:
p(0, 0), p(0, 1), p(1, 0), p(1, 1). Actually, these have to add up to 1, so there are really only
three free parameters.

We can now write

    log p(x1, x2) = β0 + β1 x1 + β2 x2 + β12 x1 x2.

This is the log-linear representation of the multinomial. The parameters β = (β0, β1, β2, β12) are functions of p(0, 0), p(0, 1), p(1, 0), p(1, 1). Conversely, p(0, 0), p(0, 1), p(1, 0), p(1, 1) are functions of β = (β0, β1, β2, β12). In fact we can solve and get:

    β0 = log p(0, 0),    β1 = log[p(1, 0)/p(0, 0)],    β2 = log[p(0, 1)/p(0, 0)],    β12 = log[p(1, 1)p(0, 0)/(p(0, 1)p(1, 0))].

So why should we bother writing the model this way? The answer is:

Lemma 6 In the above model, X1 ⊥⊥ X2 if and only if β12 = 0.

The log-linear representation converts statements about independence and conditional independence into statements about parameters being 0.

Now suppose that X = (X(1), . . . , X(d)). Let’s continue to assume that each variable is binary. The log-linear representation is:

    log p(x1, . . . , xd) = β0 + Σ_j βj xj + Σ_{j<k} βjk xj xk + · · · + β12···d x1 · · · xd.

Theorem 7 We have that X(j) ⊥⊥ X(k) | rest if and only if βA = 0 for every A containing both j and k.

Here is an example. Suppose that d = 3 and suppose that

log p(x1 , x2 , x3 ) = β0 + β1 x1 + β2 x2 + β3 x3 + β12 x1 x2 + β13 x1 x3 .

In this model, β23 = β123 = 0. We conclude that X(2) ⊥⊥ X(3) | X(1). Hence, we can omit the edge between X(2) and X(3).

Log-linear models thus make a nice connection between conditional independence graphs
and parameters. There is a simple one-line command in R for fitting log-linear models. The
function gives the parameter estimates as well as tests that each parameter is 0.
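(In R, base functions such as loglin, or loglm in the MASS package, fit these models.) A Python sketch of the same idea treats the cell counts of the contingency table as Poisson and fits a GLM; pandas and statsmodels are assumed, and the simulated data are arbitrary:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
# simulate binary (x1, x2, x3) with x2 and x3 conditionally independent given x1
x1 = rng.binomial(1, 0.5, size=2000)
x2 = rng.binomial(1, 0.3 + 0.4 * x1)
x3 = rng.binomial(1, 0.2 + 0.5 * x1)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
counts = df.value_counts().rename("n").reset_index()

# saturated log-linear model: all main effects and interactions of the 0/1 variables
fit = smf.glm("n ~ x1 * x2 * x3", data=counts, family=sm.families.Poisson()).fit()
print(fit.params)                                  # beta_23 and beta_123 near zero
print(fit.pvalues[["x2:x3", "x1:x2:x3"]])          # tests of those parameters
```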

When the variables are not binary, the model is a bit more complicated. Each variable is
now represented by a vector of parameters rather than one parameter. But conceptually, it
is the same. Suppose Xj ∈ {0, 1, . . . , m − 1}, for j ∈ V , with V = {1, . . . , d}; thus each of
the d variables takes one of m possible values.

Definition 8 Let X = (X1, . . . , Xd) be a discrete random vector with probability function p(x) = P(X = x) = P(X1 = x1, . . . , Xd = xd) where x = (x1, . . . , xd). The log-linear representation of p(x) is

    log p(x) = Σ_{A⊂V} ψA(xA)    (8)

with the constraints that ψ∅ is a constant, and if j ∈ A and xj = 0 then ψA(xA) = 0.

The formula in (8) is called the log-linear expansion of p(x). Each ψA(xA) may depend on some unknown parameters θA. Note that the total number of parameters satisfies Σ_{j=0}^d \binom{d}{j} (m − 1)^j = m^d; however, one of the parameters is the normalizing constant, and is determined by the constraint that the sum of the probabilities is one. Thus, there are m^d − 1 free parameters, and this is a minimal exponential parameterization of the multinomial. Let θ = (θA : A ⊂ V ) be the set of all these parameters. We will write p(x) = p(x; θ) when we want to emphasize the dependence on the unknown parameters θ.

The next theorem provides an easy way to read out conditional independence in a log-linear
model.

Theorem 9 Let (XA , XB , XC ) be a partition of X = (X1 , . . . , Xd ). Then XB ⊥⊥ XC | XA if


and only if all the ψ-terms in the log-linear expansion that have at least one coordinate in B
and one coordinate in C are zero.

Proof. From the definition of conditional independence, we know that XB ⊥⊥ XC | XA if


and only if p(xA , xB , xC ) = f (xA , xB )g(xA , xC ) for some functions f and g.

Suppose that ψt is 0 whenever t has coordinates in both B and C. Hence, ψt can be nonzero only if t ⊂ A ∪ B or t ⊂ A ∪ C. Therefore

    log p(x) = Σ_{t⊂A∪B} ψt(xt) + Σ_{t⊂A∪C} ψt(xt) − Σ_{t⊂A} ψt(xt).    (9)

Exponentiating, we see that the joint density is of the form f(xA, xB) g(xA, xC). Therefore XB ⊥⊥ XC | XA. The reverse follows by reversing the argument. □

A graphical log-linear model with respect to a graph G is a log-linear model for which the parameters ψA satisfy ψA(xA) ≠ 0 if and only if A is a clique of G. Thus, a graphical log-linear model has potential functions on each clique, both maximal and non-maximal, with the restriction that ψA(xA) = 0 in case xj = 0 for any j ∈ A. In a hierarchical log-linear model, if ψA(xA) = 0 then ψB(xB) = 0 whenever A ⊂ B. Thus, the parameters in a hierarchical model are nested, in the sense that if a parameter is identically zero for some subset of variables, the parameter for supersets of those variables must also be zero. Every graphical log-linear model is hierarchical, but a hierarchical model need not be graphical. This relationship is shown in Figure 5 and is characterized by the next lemma.

Lemma 10 A graphical log-linear model is hierarchical but the reverse need not be true.

Proof. Suppose, for a contradiction, that there exists a model that is graphical but not hierarchical. Then there must exist two sets A and B, such that A ⊂ B with ψA(xA) = 0 and ψB(xB) ≠ 0. Since the model is graphical, ψB(xB) ≠ 0 implies that B is a clique. We then know that A must also be a clique since A ⊂ B, which implies that ψA(xA) ≠ 0. A contradiction.

To see that a hierarchical model does not have to be graphical, consider the following example. Let

    log p(x) = ψ∅ + Σ_{i=1}^3 ψi(xi) + Σ_{1≤j<k≤3} ψjk(xjk).    (10)

This model is hierarchical but not graphical. The graph corresponding to this model is a complete graph with three nodes X1, X2, X3. It is not graphical since ψ123(x) = 0 even though {X1, X2, X3} is a clique of this complete graph. □

[Figure: nested sets, with Graphical ⊂ Hierarchical ⊂ Log-linear = Multinomial.]

Figure 5: Every graphical log-linear model is hierarchical but the reverse may not be true.

3.3 The Nonparametric Case

In real life, nothing has a Normal distribution. What should we do? We could just use a
correlation graph or partial correlation graph. That’s what I recommend. But if you really
want a nonparametric conditional independence graph, there are some possible approaches.

Conditional cdf Method. Let X and Y be real and let Z be a random vector. Define

U = F (X|Z), V = G(Y |Z)

where
F (x|z) = P(X ≤ x|Z = z), G(y|z) = P(Y ≤ y|Z = z).

Lemma 11 If X ⊥⊥ Y | Z then U ⊥⊥ V.

If we knew F and G, we could compute Ui = F (Xi |Zi ) and Vi = G(Yi |Zi ) and then test for
independence between U and V .

In practice, we estimate Ui and Vi using smoothing:

    Ûi = F̂(Xi|Zi),    V̂i = Ĝ(Yi|Zi)

where

    F̂(x|z) = Σ_{s=1}^n K(||z − Zs||/h) I(Xs ≤ x) / Σ_{s=1}^n K(||z − Zs||/h)

    Ĝ(y|z) = Σ_{s=1}^n K(||z − Zs||/b) I(Ys ≤ y) / Σ_{s=1}^n K(||z − Zs||/b).
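A sketch of these estimators for a single pair (Python/numpy; the Gaussian kernel, the bandwidths, and the final Kendall test on (Ûi, V̂i) are all choices made here for illustration):

```python
import numpy as np
from scipy.stats import kendalltau

def cond_cdf(values, Z, h):
    """U_i = Fhat(values_i | Z_i) using a Gaussian kernel in z."""
    Z = Z.reshape(len(Z), -1)
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)   # ||Z_i - Z_s||
    K = np.exp(-0.5 * (D / h) ** 2)
    ind = values[None, :] <= values[:, None]                    # I(X_s <= X_i)
    return (K * ind).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(10)
n = 400
Z = rng.normal(size=n)
X = Z**2 + 0.3 * rng.normal(size=n)
Y = np.sin(Z) + 0.3 * rng.normal(size=n)       # X and Y independent given Z
U = cond_cdf(X, Z, h=0.3)
V = cond_cdf(Y, Z, h=0.3)
print(kendalltau(U, V))                        # should not reject independence
```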

We have the following result from Bergsma (2011).

Theorem 12 (Bergsma) Let θ be the Pearson or Kendall measure of association between U and V. Let θ̃ be the sample version based on (Ui, Vi), i = 1, . . . , n. Let θ̂ be the sample version based on (Ûi, V̂i), i = 1, . . . , n. Suppose that

    n^{1/2}(θ̃ − θ) = OP(1),    n^{β1}(F̂(x|z) − F(x|z)) = OP(1),    n^{β2}(Ĝ(y|z) − G(y|z)) = OP(1).

Then

    √n (θ̂ − θ) = √n (θ̃ − θ) + OP(n^{−γ})

where γ = min{β1, β2}.

This means that, asymptotically, we can treat (Ûi, V̂i) as if they were (Ui, Vi). Of course, for graphs, the whole procedure needs to be repeated for each pair of variables.

There are some caveats. First, we are essentially doing high dimensional regression. In high
dimensions, the convergence will be very slow. Second, we have to choose the bandwidths h
and b. It is not obvious how to do this in practice.

Challenge: Can you think of a way to do sparse estimation of F (x|z) and F (y|z)?

Nonparanormal. Another approach is to use a Gaussian copula, also known as a nonparanormal (Liu, Lafferty and Wasserman 2009). Recall that, in high dimensional nonparametric regression, we replaced the linear model Y = Σ_j βj Xj + ε with the sparse additive model:

    Y = Σ_j fj(Xj) + ε    where most ||fj|| = 0.

We can take a similar strategy for graphs.

    Assumptions      Dimension   Regression               Graphical Models
    parametric       low         linear model             multivariate normal
                     high        lasso                    graphical lasso
    nonparametric    low         additive model           nonparanormal
                     high        sparse additive model    ℓ1-regularized nonparanormal

One idea is to do node-wise regression using SpAM (see Voorman, Shojaie and Witten 2007).
An alternative is as follows. Let f (X) = (f1 (X1 ), . . . , fp (Xp )). Assume that f (X) ∼ N (µ, Σ).
Write X ∼ NPN(µ, Σ, f ). If each fj is monotone then this is just a Gaussian copula, that is,
 
F (x1 , . . . , xp ) = Φµ,Σ Φ−1 (F1 (x1 )), . . . , Φ−1 (Fp (xp )) .

Lemma 13 Xj ⊥⊥ Xk | rest iff Σ^{-1}_{jk} = 0, where Σ = cov(f(X)), f(x) = (f1(x1), . . . , fd(xd)) and fj(xj) = Φ^{-1}(Fj(xj)).

The marginal means and variances µj and σj are not identifiable but this does not affect the graph G.

[Figure: three examples of nonparanormal densities.]

We can estimate G using a two stage procedure:

1. Estimate each Zj = fj(xj) = Φ^{-1}(Fj(xj)).

2. Apply the glasso to the Zj’s.

Let f̂j(xj) = Φ^{-1}(F̂j(xj)). The usual empirical F̂j(xj) will not work if d increases with n. We use a Winsorized version:

    F̃j(x) = δn         if F̂j(x) < δn
             F̂j(x)      if δn ≤ F̂j(x) ≤ 1 − δn
             1 − δn     if F̂j(x) > 1 − δn,

where

    δn ≡ 1 / (4 n^{1/4} √(π log n)).

This choice of δn provides the right bias-variance balance so that we can achieve the desired rate of convergence in our estimate of Ω and the associated undirected graph G. Now compute the sample covariance Sn of the normalized variables Zj = f̂j(Xj) = Φ^{-1}(F̃j(Xj)). Finally, apply the glasso to Sn. Let Sn∗ be the covariance using the true fj’s.
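A sketch of the two-stage estimator (Python; scipy's norm.ppf plays the role of Φ^{-1} and scikit-learn's GraphicalLasso again stands in for the glasso; the true precision matrix below is made up for illustration):

```python
import numpy as np
from scipy.stats import norm
from sklearn.covariance import GraphicalLasso

def npn_transform(X):
    """Winsorized empirical CDF transform Z_j = Phi^{-1}(Ftilde_j(X_j))."""
    n, d = X.shape
    delta_n = 1.0 / (4 * n**0.25 * np.sqrt(np.pi * np.log(n)))
    # empirical CDF evaluated at the observations (ranks / n), then truncated
    F = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / n
    F = np.clip(F, delta_n, 1 - delta_n)
    return norm.ppf(F)

rng = np.random.default_rng(11)
Omega_true = np.array([[1.0, 0.4, 0.0],
                       [0.4, 1.0, 0.4],
                       [0.0, 0.4, 1.0]])
W = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega_true), size=500)
X = np.exp(W)                          # monotone, non-Normal marginal transformations
Z = npn_transform(X)
Omega_hat = GraphicalLasso(alpha=0.05).fit(Z).precision_
print(np.round(Omega_hat, 2))          # the (1,3) entry should be near zero
```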

Suppose that d ≤ n^ξ. For large n,

    P( max_{jk} |Sn(j, k) − Sn∗(j, k)| > ε ) ≤ c1 d/(nε²)^{2ξ} + c1 d/(nε²)^{c5 ξ − 1} + c3 exp( −c4 n^{1/2} ε² / (log d log² n) )

and hence

    max_{jk} |Sn(j, k) − Sn∗(j, k)| = OP( √( log d log² n / n^{1/2} ) ).

Suppose (unrealistically) that X^(i) ∼ NPN(µ0, Σ0, f0), and let Ω0 = Σ0^{-1}. If

    λn ≍ √( log d log² n / n^{1/2} )

then

    ||Ω̂n − Ω0||_F = OP( √( (s + d) log d log² n / n^{1/2} ) )

and

    ||Ω̂n − Ω0||_2 = OP( √( s log d log² n / n^{1/2} ) )

where s is the sparsity level. Under extra conditions we get sparsistency:

    P( sign(Ω̂n(j, k)) = sign(Ω0(j, k)) for all j, k ) → 1.

Now suppose (more realistically) that P is not NPN. Let R(f̂, Ω̂) denote the risk (expected log-likelihood). Let d ≤ e^{n^ξ} for ξ < 1 and let

    Mn = { f : fj monotone, ||fj||∞ ≤ C √(log n) },
    Cn = { Ω : ||Ω^{-1}||_1 ≤ Ln }

with Ln = o(n^{(1−ξ)/2}/ log n). Then

    R(f̂, Ω̂) − inf_{f,Ω} R(f, Ω) = oP(1).

[Figure: glasso paths (top row) and nonparanormal paths (bottom row) for two non-Normal designs and one Normal design; n = 500, p = 40.]

Figure 6: ROC curves (1 − FP versus 1 − FN) for the nonparanormal and the glasso under a CDF transform, a power transform, and no transform, for sample size n = 200.

The Skeptic is a more robust version (Spearman/Kendall Estimators Pre-empt Transformations to Infer Correlation). Set Ŝjk = sin( (π/2) τ̂jk ) where

    τ̂jk = 2/(n(n − 1)) Σ_{1≤s<t≤n} sign( (Xs(j) − Xt(j))(Xs(k) − Xt(k)) ).

Then (with d ≥ n)

    P( max_{jk} |Ŝjk − Σjk| > 2.45 π √(log d / n) ) ≤ 1/d.

As in Yuan (2010), let

    M = { Ω : Ω ≻ 0, ||Ω||_1 ≤ κ, 1/c ≤ λmin(Ω) ≤ λmax(Ω) < c, deg(Ω) ≤ M }.

Then, for all 1 ≤ q ≤ ∞,

    sup_{Ω∈M} ||Ω̂ − Ω||_q = OP( M √(log d / n) ).

From Yuan, this implies that the Skeptic is minimax rate optimal. Now plug Ŝ into the glasso. See Liu, Han, Yuan, Lafferty and Wasserman arXiv:1202.2169 for numerical experiments and theoretical results.
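A sketch of the Skeptic plug-in (Python; scipy's kendalltau computes τ̂jk, and sklearn's graphical_lasso function, which accepts a covariance/correlation matrix directly, is assumed here as the glasso stand-in):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.covariance import graphical_lasso

def skeptic_correlation(X):
    """Rank-based correlation estimate S_jk = sin(pi/2 * tau_jk)."""
    n, d = X.shape
    S = np.eye(d)
    for j in range(d):
        for k in range(j + 1, d):
            tau, _ = kendalltau(X[:, j], X[:, k])
            S[j, k] = S[k, j] = np.sin(np.pi / 2 * tau)
    return S

rng = np.random.default_rng(12)
Omega_true = np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.4], [0.0, 0.4, 1.0]])
W = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega_true), size=800)
X = W**3                                           # monotone, heavy-tailed distortion
S_hat = skeptic_correlation(X)
_, Omega_hat = graphical_lasso(S_hat, alpha=0.05)  # plug S_hat into the glasso
print(np.round(Omega_hat, 2))                      # (1,3) entry should be near zero
```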

Forests. Yet another approach is based on forests. A tractable family of graphs are forests.
A graph F is a forest if it contains no cycles. If F is a d-node undirected forest with vertex
set VF = {1, . . . , d} and edge set EF ⊂ {1, . . . , d} × {1, . . . , d}, the number of edges satisfies
|EF | < d. Suppose that P is Markov to F and has density p. Then p can be written as
    p(x) = Π_{(i,j)∈EF} [ p(xi, xj) / (p(xi) p(xj)) ] Π_{k∈VF} p(xk),    (11)

where each p(xi, xj) is a bivariate density, and each p(xk) is a univariate density. Using (11), we have

    −E[log p(X)]                                                                   (12)
      = − ∫ p(x) [ Σ_{(i,j)∈EF} log( p(xi, xj)/(p(xi)p(xj)) ) + Σ_{k∈VF} log p(xk) ] dx
      = − Σ_{(i,j)∈EF} ∫ p(xi, xj) log( p(xi, xj)/(p(xi)p(xj)) ) dxi dxj − Σ_{k∈VF} ∫ p(xk) log p(xk) dxk
      = − Σ_{(i,j)∈EF} I(Xi; Xj) + Σ_{k∈VF} H(Xk),                                 (13)

where

    I(Xi; Xj) ≡ ∫ p(xi, xj) log[ p(xi, xj)/(p(xi) p(xj)) ] dxi dxj    (14)

is the mutual information between the pair of variables Xi, Xj and

    H(Xk) ≡ − ∫ p(xk) log p(xk) dxk    (15)

is the entropy.

The optimal forest F∗ can be found by minimizing the right hand side of (13). Since the entropy term H(X) = Σ_k H(Xk) is constant across all forests, this can be recast as
the problem of finding the maximum weight spanning forest for a weighted graph, where
the weight w(i, j) of the edge connecting nodes i and j is I(Xi ; Xj ). Kruskal’s algorithm
(Kruskal 1956) is a greedy algorithm that is guaranteed to find a maximum weight spanning
tree of a weighted graph. In the setting of density estimation, this procedure was proposed
by Chow and Liu (1968) as a way of constructing a tree approximation to a distribution.
At each stage the algorithm adds an edge connecting that pair of variables with maximum
mutual information among all pairs not yet visited by the algorithm, if doing so does not
form a cycle. When stopped early, after k < d edges have been added, it yields the best
k-edge weighted forest.

Of course, the above procedure is not practical since the true density p(x) is unknown. In
applications, we parameterize bivariate and univariate distributions to be pθij (xi , xj ) and

Chow-Liu Algorithm for Learning Forest Graphs

Initialize E^(0) = ∅ and the desired forest size K < d.

Calculate the mutual information matrix M̂ = [ Î_n(Xi, Xj) ] according to (16).

For k = 1, . . . , K:

  (a) (i^(k), j^(k)) ← argmax_{(i,j)} M̂(i, j) such that E^(k−1) ∪ {(i^(k), j^(k))} does not contain a cycle.

  (b) E^(k) ← E^(k−1) ∪ {(i^(k), j^(k))}.

Output the obtained edge set E^(K).

pθk(xk). We replace the population mutual information I(Xi; Xj) in (13) by the plug-in estimate Î_n(Xi, Xj), defined as

    Î_n(Xi, Xj) = ∫ p_{θ̂ij}(xi, xj) log[ p_{θ̂ij}(xi, xj) / (p_{θ̂i}(xi) p_{θ̂j}(xj)) ] dxi dxj    (16)

where θ̂ij and θ̂k are maximum likelihood estimates. Given this estimated mutual information matrix M̂ = [ Î_n(Xi, Xj) ], we can apply Kruskal’s algorithm (equivalently, the Chow-Liu algorithm) to find the best forest structure F̂. The detailed algorithm is given in the box above.

Example 14 (Learning a Gaussian maximum weight spanning tree) For Gaussian data X ∼ N(µ, Σ), the mutual information between two variables is

    I(Xi; Xj) = −(1/2) log(1 − ρ²ij),    (17)

where ρij is the correlation between Xi and Xj. To obtain an empirical estimator, we simply plug in the sample correlation ρ̂ij. Once the mutual information matrix is calculated, we can apply the Chow-Liu algorithm to get the maximum weight spanning tree.
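A sketch of Example 14 (Python; scipy's minimum_spanning_tree on negated weights plays the role of Kruskal's algorithm for the maximum weight spanning tree, and the chain data are made up):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_gaussian(X):
    """Maximum weight spanning tree with Gaussian mutual information weights."""
    R = np.corrcoef(X, rowvar=False)
    MI = -0.5 * np.log(1 - R**2 + 1e-12)   # jitter avoids log(0) when rho = +/-1
    np.fill_diagonal(MI, 0.0)
    # a maximum weight spanning tree is a minimum spanning tree of the negated weights
    T = minimum_spanning_tree(-MI).toarray()
    return [(int(i), int(j)) for i, j in zip(*np.nonzero(T))]

rng = np.random.default_rng(13)
# Markov chain X1 -> X2 -> X3 -> X4: the true tree is the chain itself
x = [rng.normal(size=1000)]
for _ in range(3):
    x.append(0.8 * x[-1] + rng.normal(size=1000))
print(chow_liu_gaussian(np.column_stack(x)))   # expect edges (0,1), (1,2), (2,3)
```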

Example 15 (Graphs for Equities Data) Daily closing prices were obtained for 452 stocks that were consistently in the S&P 500 index between January 1, 2003 and January 1, 2011. This gives altogether 2,015 data points; each data point corresponds to the vector of closing prices on a trading day. With St,j denoting the closing price of stock j on day t, we consider the variables Xtj = log(St,j/St−1,j) and build graphs over the indices j. We simply treat the instances Xt as independent replicates, even though they form a time series. We truncate every stock so that its data points are within six times the
mean absolute deviation from the sample average. In Figure 7(a) we show boxplots for 10
randomly chosen stocks. It can be seen that the data contains outliers even after truncation;
the reasons for these outliers include splits in a stock, which increase the number of shares.
In Figure 7(b) we show the boxplots of the data after the nonparanormal transformation (the
details of nonparanormal transformation will be explained in the nonparametric graphical
model chapter). In this analysis, we use the subset of the data between January 1, 2003 to
January 1, 2008, before the onset of the “financial crisis.” There are altogether n = 1, 257
data points and d = 452 dimensions.
Figure 7: Boxplots of Xt = log(St/St−1) for 10 stocks (AAPL, BAC, AMAT, CERN, CLX, MSFT, IBM, JPM, UPS, YHOO). (a) Original data; (b) after the nonparanormal transformation. As can be seen, the original data has many outliers, which is addressed by the nonparanormal transformation on the re-scaled data (right).

The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sec-
tors, including Consumer Discretionary (70 stocks), Energy (37 stocks), Financials
(74 stocks), Consumer Staples (35 stocks), Telecommunications Services (6 stocks),
Health Care (46 stocks), Industrials (59 stocks), Information Technology (64 stocks),
Materials (29 stocks), and Utilities (32 stocks). It is expected that stocks from the same
GICS sectors should tend to be clustered together, since stocks from the same GICS sector
tend to interact more with each other. In the graphs shown below, the nodes are colored
according to the GICS sector of the corresponding stock.

With the Gaussian assumption, we directly apply the Chow-Liu algorithm to obtain a full spanning tree of d − 1 = 451 edges. The resulting graph is shown in Figure 8. We see that the stocks from the same GICS sector are clustered very well.

To get a nonparametric version, we can just use a nonparametric estimate of the mutual information. But for that matter, we might as well put in any measure of association such as

Figure 8: Tree graph learned from S&P 500 stock data from Jan. 1, 2003 to Jan. 1, 2008.
The graph is estimated using the Chow-Liu algorithm under the Gaussian model. The nodes
are colored according to their GICS sector categories.

distance correlation. In that case, we see that a forest is just a correlation graph without
cycles.

4 A Deeper Look At Conditional Independence Graphs

In this section, we will take a closer look at conditional independence graphs.

Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A,
B, and C be subsets of vertices. We say that C separates A and B if every path from
a node in A to a node in B passes through a node in C. Now consider a random vector
X = (X(1), . . . , X(d)) where Xj corresponds to node j in the graph. If A ⊂ {1, . . . , d} then
we write XA = (X(j) : j ∈ A).

4.1 Markov Properties

A probability distribution P for a random vector X = (X(1), . . . , X(d)) may satisfy a range
of different Markov properties with respect to a graph G = (V, E):

Definition 16 (Global Markov Property) A probability distribution P for a random vector


X = (X(1), . . . , X(d)) satisfies the global Markov property with respect to a graph G if for
any disjoint vertex subsets A, B, and C such that C separates A and B, the random variables
XA are conditionally independent of XB given XC .

The set of distributions that is globally Markov with respect to G is denoted by P(G).

Definition 17 (Local Markov Property) A probability distribution P for a random vector


X = (X(1), . . . , X(d)) satisfies the local Markov property with respect to a graph G if the
conditional distribution of a variable given its neighbors is independent of the remaining
nodes. That is, let N (s) = {t ∈ V | (s, t) ∈ E} denote the set of neighbors of a node s ∈ V .
Then the local Markov property is that
    p(xs | xt, t ≠ s) = p(xs | xt, t ∈ N(s))    (18)
for each node s.

Definition 18 (Pairwise Markov Property) A probability distribution P for a random vector


X = (X(1), . . . , X(d)) satisfies the pairwise Markov property with respect to a graph G if
for any pair of non-adjacent nodes s, t ∈ V , we have
Xs ⊥⊥ Xt | XV \{s,t} . (19)

Consider for example the graph in Figure 9. Here the set C separates A and B. Thus, a
distribution that satisfies the global Markov property for this graph must have the property
that the random variables in A are conditionally independent of the random variables in
B given the random variables C. This is seen to generalize the usual Markov property for
simple chains, where XA −→ XC −→ XB forms a Markov chain in case XA and XB are
independent given XC . A distribution that satisfies the global Markov property is said to
be a Markov random field or Markov network with respect to the graph. The local Markov
property is depicted in Figure 10.

Figure 9: An undirected graph. C = {3, 7} separates A = {1, 2} and B = {4, 8}.

Figure 10: The local Markov property: Conditioned on its four neighbors X2, X3, X4, and X5, node X1 is independent of the remaining nodes in the graph.

From the definitions, the relationships of different Markov properties can be characterized
as:

Proposition 19 For any undirected graph G and any distribution P , we have
global Markov property =⇒ local Markov property =⇒ pairwise Markov property.

Proof. The global Markov property implies the local Markov property because for each
node s ∈ V , its neighborhood N (s) separates {s} and V \ {N (s) ∪ {s}}. Assume next
that the local Markov property holds. Any t that is not adjacent to s is an element of V \ {N(s) ∪ {s}}. Therefore
N (s) ∪ [(V \ {N (s) ∪ {s}}) \ {t}] = V \ {s, t}, (20)
and it follows from the local Markov property that
Xs ⊥⊥ XV \{N (s)∪{s}} | XV \{s,t} . (21)
This implies Xs ⊥⊥ Xt | XV \{s,t} , which is the pairwise Markov property. 

The next theorem, due to Pearl (1986), provides a sufficient condition for equivalence.

Theorem 20 Suppose that, for all disjoint subsets A, B, C, D ⊂ V ,


if XA ⊥⊥ XB | XC∪D and XA ⊥⊥ XC | XB∪D , then XA ⊥⊥ XB∪C | XD , (22)
then the global, local, and pairwise Markov properties are equivalent.

Proof. It is enough to show that the pairwise Markov property implies the global Markov
property under the given condition. Let S, A, B ⊂ V with S separating A from B in the
graph G. Without loss of generality both A and B are assumed to be non-empty. The
proof can be carried out using backward induction on the number of nodes in S, denoted by m = |S|. Let d = |V|. For the base case, if m = d − 2 then both A and B consist of a single vertex and the result follows from the pairwise Markov property.

Now assume that m < d − 1 and separation implies conditional independence for all sepa-
rating sets S with more than m nodes. We proceed in two cases: (i) A ∪ B ∪ S = V and (ii)
A∪B∪S ⊂V.

For case (i), we know that at least one of A and B must have more than one element.
Without loss of generality, we assume A has more than one element. If s ∈ A, then S ∪ {s}
separates A \ {s} from B and also S ∪ (A \ {s}) separates s from B. Thus by the induction
hypothesis
    XA\{s} ⊥⊥ XB | XS∪{s}    and    Xs ⊥⊥ XB | XS∪(A\{s}).    (23)
Now the condition (22) implies XA ⊥⊥ XB | XS . For case (ii), we could choose s ∈ V \ (A ∪
B ∪ S). Then S ∪ {s} separates A and B, implying A ⊥⊥ B | S ∪ {s}. We then proceed in
two cases, either A ∪ S separates B from s or B ∪ S separates A from s. For both cases, the
condition (22) implies that A ⊥⊥ B | S. 

The next proposition provides a stronger condition that implies (22).

Proposition 21 Let X = (X1 , . . . , Xd ) be a random vector with distribution P and joint
density p(x). If the joint density p(x) is positive and continuous with respect to a product
measure, then condition (22) holds.

Proof. Without loss of generality, it suffices to assume that d = 3. We want to show that
if X1 ⊥⊥ X2 | X3 and X1 ⊥⊥ X3 | X2 then X1 ⊥⊥ {X2 , X3 }. (24)
Since the density is positive and X1 ⊥⊥ X2 | X3 and X1 ⊥⊥ X3 | X2 , we know that there must
exist some positive functions f13 , f23 , g12 , g23 such that the joint density takes the following
factorization:
p(x1 , x2 , x3 ) = f13 (x1 , x3 )f23 (x2 , x3 ) = g12 (x1 , x2 )g23 (x2 , x3 ). (25)
Since the density is continuous and positive, we have
f13 (x1 , x3 )f23 (x2 , x3 )
g12 (x1 , x2 ) = . (26)
g23 (x2 , x3 )
For each fixed X3 = x03 , we see that g12 (x1 , x2 ) = h(x1 )`(x2 ) where h(x1 ) = f13 (x1 , x03 ) and
`(x2 ) = f23 (x2 , x03 )/g23 (x2 , x03 ). This implies that
p(x1 , x2 , x3 ) = h(x1 )`(x2 )g23 (x2 , x3 ) (27)
and hence X1 ⊥⊥ {X2 , X3 } as desired. 

From Proposition 21, we see that for distributions with positive continuous densities, the
global, local, and pairwise Markov properties are all equivalent. If a distribution P satisfies the global Markov property with respect to a graph G, we say that P is Markov to G.

4.2 Clique Decomposition

Unlike a directed graph, which encodes a factorization of the joint probability distribution in terms of conditional probability distributions, an undirected graph encodes a factorization of the joint probability distribution in terms of clique potentials. Recall that a clique in
a graph is a fully connected subset of vertices. Thus, every pair of nodes in a clique is
connected by an edge. A clique is a maximal clique if it is not contained in any larger clique.
Consider, for example, the graph shown in the right plot of Figure 11. The pairs {X4 , X5 }
and {X1 , X3 } form cliques; {X4 , X5 } is a maximal clique, while {X1 , X3 } is not maximal
since it is contained in a larger clique {X1 , X2 , X3 }.

A set of clique potentials {ψC (xC ) ≥ 0}C∈C determines a probability distribution that factors
with respect to the graph by normalizing:
    p(x1, . . . , x|V|) = (1/Z) Π_{C∈C} ψC(xC).    (28)

Figure 11: A directed graph encodes a factorization of the joint probability distribution in terms of conditional probability distributions. An undirected graph encodes a factorization of the joint probability distribution in terms of clique potentials.

The normalizing constant or partition function Z sums (or integrates) over all settings of the random variables:

    Z = ∫ Π_{C∈C} ψC(xC) dx1 . . . dx|V|.    (29)

Thus, the family of distributions represented by the undirected graph in Figure 11 can be
written as
p(x1 , x2 , x3 , x4 , x5 ) = ψ1,2,3 (x1 , x2 , x3 ) ψ1,4 (x1 , x4 ) ψ4,5 (x4 , x5 ). (30)
In contrast, the family of distributions represented by the directed graph in Figure 11 can
be factored into conditional distributions according to

p(x1 , x2 , x3 , x4 , x5 ) = p(x5 ) p(x4 | x5 ) p(x1 | x4 ) p(x3 | x1 ) p(x2 | x1 , x3 ). (31)

Theorem 22 For any undirected graph G = (V, E), a distribution P that factors with respect
to the graph also satisfies the global Markov property on the graph.

Proof. Let A, B, S ⊂ V such that S separates A and B. We want to show XA ⊥⊥ XB | XS. For a subset D ⊂ V, we denote by GD the subgraph induced by the vertex set D. We define Ã to be the connectivity components of GV\S which contain A, and B̃ = V \ (Ã ∪ S). Since A and B are separated by S, they must belong to different connectivity components of GV\S and any clique of G must be a subset of either Ã ∪ S or B̃ ∪ S. Let CA be the set of cliques contained in Ã ∪ S; the joint density p(x) then takes the following factorization:

    p(x) = Π_{C∈C} ψC(xC) = Π_{C∈CA} ψC(xC) Π_{C∈C\CA} ψC(xC).    (32)

This implies that Ã ⊥⊥ B̃ | S and thus A ⊥⊥ B | S. □

It is worth remembering that while we think of the set of maximal cliques as given in a
list, the problem of enumerating the set of maximal cliques in a graph is NP-hard, and the
problem of determining the largest maximal clique is NP-complete. However, many graphs
of interest in statistical analysis are sparse, with the number of cliques of size O(|V |).

Theorem 22 shows that factoring with respect to a graph implies the global Markov property. The next question is: under what conditions do the Markov properties imply factoring with respect to a graph? In fact, in the case where P has a positive and continuous density we can show that the pairwise Markov property implies factoring with respect to a graph, so that all Markov properties are equivalent. This result has been discovered by many authors but is usually attributed to Hammersley and Clifford, due to one of their unpublished manuscripts in 1971. They proved the result in the discrete case. The following result is usually referred to as the Hammersley-Clifford theorem; a proof appears in Besag (1974). The extension to the continuous case is left as an exercise.

Theorem 23 (Hammersley-Clifford-Besag) Suppose that G = (V, E) is a graph and


Xi , i ∈ V are random variables that take on a finite number of values. If p(x) > 0 is strictly
positive and satisfies the local Markov property with respect to G, then it factors with respect
to G.

Proof. Let d = |V |. By re-indexing the values of Xi , we may assume without loss of


generality that each Xi takes on the value 0 with positive probability, and P(0, 0, . . . , 0) > 0.
Let X0\i denote the vector X0\i = (X1 , X2 , . . . , Xi−1 , 0, Xi+1 , . . . , Xd ) obtained by setting
Xi = 0, and let X\i = (X1 , X2 , . . . , Xi−1 , Xi+1 , . . . , Xd ) denote the vector of all components
except Xi . Then

    P(x)/P(x0\i) = P(xi | x\i)/P(0 | x\i).    (33)

Now, let

    Q(x) = log[ P(x)/P(0) ].    (34)

Then for any i ∈ {1, 2, . . . , d} we have that

    Q(x) = log[ P(x)/P(0) ]    (35)
         = log[ P(0, . . . , xi, 0, . . . , 0)/P(0) ] + log[ P(x)/P(0, . . . , xi, 0, . . . , 0) ]    (36)
         = (1/d) Σ_{i=1}^d { log[ P(0, . . . , xi, 0, . . . , 0)/P(0) ] + log[ P(x)/P(0, . . . , xi, 0, . . . , 0) ] }.    (37)

Recursively, we obtain

    Q(x) = Σ_i φi(xi) + Σ_{i<j} φij(xi, xj) + Σ_{i<j<k} φijk(xi, xj, xk) + · · · + φ12···d(x)

for functions φA that satisfy φA(xA) = 0 if i ∈ A and xi = 0. Consider node i = 1; we have

    Q(x) − Q(x0\1) = log[ P(x1 | x\1)/P(0 | x\1) ]    (38)
                   = φ1(x1) + Σ_{i>1} φ1i(x1, xi) + Σ_{j>i>1} φ1ij(x1, xi, xj) + · · · + φ12···d(x)

which depends only on x1 and the neighbors of node 1 in the graph. Thus, from the local Markov property, if k is not a neighbor of node 1, then the above expression does not depend on xk. In particular, φ1k(x1, xk) = 0, and more generally all φA(xA) with 1 ∈ A and k ∈ A are identically zero. Similarly, if i, j are not neighbors in the graph, then φA(xA) = 0 for any A containing i and j. Thus, φA ≠ 0 only holds for the subsets A that form cliques in the graph. Since it is obvious that exp(φA(x)) > 0, we finish the proof. □

Since factoring with respect to the graph implies the global Markov property, we may sum-
marize this result as follows:

For positive distributions: global Markov ⇔ local Markov ⇔ factored

For strictly positive distributions, the global Markov property, the local Markov
property, and factoring with respect to the graph are equivalent.

Thus we can write:

    p(x) = (1/Z) Π_{C∈C} ψC(xC) = (1/Z) exp( Σ_{C∈C} log ψC(xC) )

where C is the set of all (maximal) cliques in G and Z is the normalization constant. This
is called the Gibbs representation.

4.3 Directed vs. Undirected Graphs

Directed graphical models are naturally viewed as generative; the graph specifies a straight-
forward (in principle) procedure for sampling from the underlying distribution. For instance,

a sample from a distribution represented by the DAG in the left plot of Figure 12 can be generated as follows:

    X1 ∼ P(X1)    (39)
    X2 ∼ P(X2)    (40)
    X3 ∼ P(X3)    (41)
    X5 ∼ P(X5)    (42)
    X4 | X1, X2 ∼ P(X4 | X1, X2)    (43)
    X6 | X3, X4, X5 ∼ P(X6 | X3, X4, X5).    (44)

As long as each of the conditional probability distributions can be efficiently sampled, the full model can be efficiently sampled as well. In contrast, there is no straightforward way to sample from a distribution from the family specified by an undirected graph. Instead one needs something like MCMC.
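A sketch of this ancestral sampling scheme for the DAG of Figure 12 (Python; the Gaussian conditional distributions below are arbitrary choices, made up purely to show the mechanics):

```python
import numpy as np

def sample_dag(n, rng=None):
    """Ancestral sampling: sample each node after its parents (Figure 12 DAG)."""
    rng = np.random.default_rng() if rng is None else rng
    x1 = rng.normal(size=n)                                     # X1 ~ P(X1)
    x2 = rng.normal(size=n)                                     # X2 ~ P(X2)
    x3 = rng.normal(size=n)                                     # X3 ~ P(X3)
    x5 = rng.normal(size=n)                                     # X5 ~ P(X5)
    x4 = 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)               # X4 | X1, X2
    x6 = 0.3 * x3 + 0.3 * x4 + 0.3 * x5 + rng.normal(size=n)    # X6 | X3, X4, X5
    return np.column_stack([x1, x2, x3, x4, x5, x6])

X = sample_dag(10000)
print(np.round(np.corrcoef(X, rowvar=False), 2))
```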

Figure 12: A DAG and its corresponding moral graph. A probability distribution that factors according to a DAG obeys the global Markov property on the undirected moral graph.

Generally, edges must be added to the skeleton of a DAG in order for the distribution to
satisfy the global Markov property on the graph. Consider the example in Figure 12. Here
the directed model has a distribution

p(x1 ) p(x2 ) p(x3 ) p(x5 ) p(x4 | x1 , x2 ) p(x6 | x3 , x4 , x5 ). (45)

The corresponding undirected graphical model has two maximal cliques, and factors as

ψ1,2,4 (x1 , x2 , x4 ) ψ3,4,5,6 (x3 , x4 , x5 , x6 ). (46)

More generally, let P be a probability distribution that is Markov to a DAG G. We define
the moralized graph of G as the following:

Definition 24 (Moral graph) The moral graph M of a DAG G is an undirected graph that
contains an undirected edge between two nodes Xi and Xj if (i) there is a directed edge
between Xi and Xj in G, or (ii) Xi and Xj are both parents of the same node.

Theorem 25 If a probability distribution factors with respect to a DAG G, then it obeys the global Markov property with respect to the undirected moral graph of G.

Proof. Directly follows from the definition of Bayesian networks and Theorem 22. 

Figure 13: These three graphs encode distributions with identical independence relations. Conditioned on variable XC, the variables XA and XB are independent; thus C separates A and B in the undirected graph.

Example 26 (Basic Directed and Undirected Graphs) To illustrate some basic cases, con-
sider the graphs in Figure 13. Each of the top three graphs encodes the same family of
probability distributions. In the two directed graphs, by d-separation the variables XA and
XB are independent conditioned on the variable XC . In the corresponding undirected graph,
which simply removes the arrows, node C separates A and B.

The two graphs in Figure 14 provide an example of a directed graph which encodes a set
of conditional independence relationships that can not be perfectly represented by the cor-
responding moral graph. In this case, for the directed graph the node C is a collider, and
deriving an equivalent undirected graph requires joining the parents by an edge. In the corre-
sponding undirected graph, A and B are not separated by C. However, in the directed graph,
XA and XB are marginally independent; this independence relationship is lost in the moral graph. Conversely, Figure 15 provides an undirected graph over four variables. There
is no directed graph over four variables that implies the same set of conditional independence
properties.


Figure 14: A directed graph whose conditional independence properties can not be perfectly
expressed by its undirected moral graph. In the directed graph, the node C is a collider;
therefore, XA and XB are not independent conditioned on XC . In the corresponding moral
graph, A and B are not separated by C. However, in the directed graph, we have the
independence relationship XA ⊥⊥ XB , which is missing in the moral graph.


Figure 15: This undirected graph encodes a family of distributions that cannot be represented
by a directed graph on the same set of nodes.

Figure 16: The top graph is a directed graph representing a hidden Markov model. The
shaded nodes are observed, but the unshaded nodes, representing states in a latent Markov
chain, are unobserved. Replacing the directed edges by undirected edges (bottom) does not
change the independence relations.

The upper plot in Figure 16 shows the directed graph underlying a hidden Markov model.
There are no colliders in this graph, and therefore the undirected skeleton represents an
equivalent set of independence relations. Thus, hidden Markov models are equivalent to
hidden Markov fields with an underlying tree graph.

4.4 Faithfulness

The set of all distributions that are Markov to a graph G is denoted by P(G). To understand
P(G) more clearly, we introduce some more notation. Given a distribution P let I(P ) denote
all conditional independence statements that are true for P . For example, if P has density
p and p(x1 , x2 , x3 ) = p(x1 )p(x2 )p(x3 |x1 , x2 ) then I(P ) = {X1 ⊥⊥ X2 }. On the other hand, if
p(x1 , x2 , x3 ) = p(x1 )p(x2 , x3 ) then
n o
I(P ) = X1 ⊥⊥ X2 , X1 ⊥⊥ X3 , X1 ⊥⊥ X2 |X3 , X1 ⊥⊥ X3 |X2 .

Similarly, given a graph G let I(G) denote all independence statements implied by the graph.
For example, if G is the graph in Figure 15, then

I(G) = { X1 ⊥⊥ X4 | {X2 , X3 },  X2 ⊥⊥ X3 | {X1 , X4 } }.

By definition, we can write P(G) as

P(G) = { P : I(G) ⊆ I(P ) }. (47)

This result often leads to confusion since you might have expected that P(G) would be equal
to {P : I(G) = I(P )}. But this is incorrect. For example, consider the undirected graph
X1 — X2 . In this case, I(G) = ∅ and P(G) consists of all distributions p(x1 , x2 ). Since
P(G) consists of all distributions, it also includes distributions of the form p0 (x1 , x2 ) =
p0 (x1 )p0 (x2 ). For such a distribution we have I(P0 ) = {X1 ⊥⊥ X2 }. Hence, I(G) is a strict
subset of I(P0 ).

In fact, you can think of I(G) as the set of independence statements that are common to all
P ∈ P(G). In other words,

I(G) = ⋂ { I(P ) : P ∈ P(G) }. (48)

Every P ∈ P(G) has the independence properties in I(G). But some P ’s in P(G) might
have extra independence properties.

We say that P is faithful to G if I(P ) = I(G). We define

F(G) = { P : I(G) = I(P ) } (49)

and we note that F(G) ⊂ P(G). A distribution P that is in P(G) but is not in F(G) is said
to be unfaithful with respect to G. The independence relations expressed by G are correct
for such a P ; it is just that P has extra independence relations not expressed by G. Every
distribution is Markov to some graph; for example, any distribution is Markov to the complete
graph. But there exist distributions P that are not faithful to any graph. This means that
there will be some independence relations of P that cannot be expressed using a graph.

Example 27 The directed graph in Figure 14 implies that XA ⊥⊥ XB but that XA and XB
are not independent given XC . There is no undirected graph G for (XA , XB , XC ) such that
I(G) contains XA ⊥⊥ XB but excludes XA ⊥⊥ XB | XC . The only way to represent P is to
use the complete graph. Then P is Markov to G since I(G) = ∅ ⊂ I(P ) = {XA ⊥⊥ XB }
but P is unfaithful to G since it has an independence relation not represented by G, namely,
{XA ⊥⊥ XB }.

Example 28 (Unfaithful Gaussian distribution) Let ξ1 , ξ2 , ξ3 ∼ N (0, 1) be independent and
define

X1 = ξ1 (50)
X2 = aX1 + ξ2 (51)
X3 = bX2 + cX1 + ξ3 (52)

where a, b, c are nonzero. See Figure 17. Now suppose that c = −b(a^2 + 1)/a. Then
Cov(X2 , X3 ) = E(X2 X3 ) − E(X2 )E(X3 ) (53)
= E(X2 X3 ) (54)
= E[(aX1 + ξ2 )(bX2 + cX1 + ξ3 )] (55)
= E[(aξ1 + ξ2 )(b(aξ1 + ξ2 ) + cX1 + ξ3 )] (56)
= (a^2 b + ac) E(ξ1^2 ) + b E(ξ2^2 ) (57)
= a^2 b + ac + b = 0. (58)
Thus Cov(X2 , X3 ) = 0 and, since (X2 , X3 ) are jointly Gaussian, X2 ⊥⊥ X3 . We would like
to drop the edge between X2 and X3 . But this would imply that X2 ⊥⊥ X3 | X1 , which is
not true.
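
A quick simulation makes the cancellation in Example 28 concrete. The following sketch (not part of the notes; it assumes numpy, and the choices a = b = 1, the seed, and the sample size are arbitrary) checks that the sample correlation of (X2, X3) is near zero while the partial correlation given X1 is not.

import numpy as np

rng = np.random.default_rng(0)
n, a, b = 200_000, 1.0, 1.0
c = -b * (a**2 + 1) / a              # the cancellation from Example 28

xi1, xi2, xi3 = rng.standard_normal((3, n))
X1 = xi1
X2 = a * X1 + xi2
X3 = b * X2 + c * X1 + xi3

# Marginal correlation of (X2, X3): approximately zero by construction.
print(np.corrcoef(X2, X3)[0, 1])

# Partial correlation given X1: regress X1 out of each variable, then
# correlate the residuals.  This is clearly nonzero, so X2 and X3 are
# dependent given X1.
r2 = X2 - np.polyval(np.polyfit(X1, X2, 1), X1)
r3 = X3 - np.polyval(np.polyfit(X1, X3, 1), X1)
print(np.corrcoef(r2, r3)[0, 1])

With a = b = 1 the second number is about 0.7, so dropping the edge between X2 and X3 would wrongly assert X2 ⊥⊥ X3 | X1 .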

Generally, the set of unfaithful distributions P(G) \ F(G) is a small set. In fact, it has
Lebesgue measure zero if we restrict ourselves to nice parametric families. However, these
unfaithful distributions are scattered throughout P(G) in a complex way; see Figure 18.


Figure 17: An unfaithful Gaussian distribution. The DAG has edges X1 → X2 , X2 → X3 ,
and X1 → X3 , with coefficients a, b, and c respectively.

Figure 18: The blob represents the set P(G) of distributions that are Markov to some graph
G. The lines are a stylized representation of the members of P(G) that are not faithful to G.
Hence the lines represent the set P(G) \ F(G). These distributions have extra independence
relations not captured by the graph G. The set P(G) \ F(G) is small but these distributions
are scattered throughout P(G).


Exercises
1. Prove Equation (48).
2. Prove Theorem 22.
3. Prove Theorem 25.
4. State and prove the continuous version of Hammersley-Clifford theorem.
5. Prove that the undirected graph (the square) of Figure 15 represents a family of probability distributions that cannot be represented by a directed graph on the same set of vertices.
6. Consider random variables (X1 , X2 , X3 , X4 ). Suppose the log-density is

log p(x) = ψΦ + ψ12 (x1 , x2 ) + ψ13 (x1 , x3 ) + ψ24 (x2 , x4 ) + ψ34 (x3 , x4 ). (59)

(a) Draw the graph G for these variables.


(b) Write down all independence and conditional independence relationships implied
by the graph.
(c) Is this model graphical? Is it hierarchical?
7. Suppose that the probability function p(x1 , x2 , x3 ) is proportional to the following values:

P(0, 0, 0) = 2, P(0, 0, 1) = 8, P(0, 1, 0) = 4, P(0, 1, 1) = 16, (60)


P(1, 0, 0) = 16, P(1, 0, 1) = 128, P(1, 1, 0) = 32, P(1, 1, 1) = 256. (61)

Find the ψ-terms for the log-linear expansion. Comment on the model.
8. Let X1 , . . . , X4 be binary. Draw the independence graphs corresponding to the following log-linear models (where α > 0). Also, identify whether each is graphical and/or hierarchical (or neither).
(a) log p(x) = α + 11x1 + 2x2 + 1.5x3 + 17x4
(b) log p(x) = α + 11x1 + 2x2 + 1.5x3 + 17x4 + 12x2 x3 + 78x2 x4 + 3x3 x4 + 32x2 x3 x4
(c) log p(x) = α + 11x1 + 2x2 + 1.5x3 + 17x4 + 12x2 x3 + 3x3 x4 + x1 x4 + 2x1 x2
(d) log p(x) = α + 5055x1 x2 x3 x4 .
9. This problem is based on the following graph:

[The graph is the five-cycle X1 — X2 — X3 — X4 — X5 — X1 .]

(a) Can you construct a DAG that has the same conditional independencies of the
form p(xi | xj , j ≠ i) = p(xi | xi1 , xi2 ) as those implied by the above graph?
(b) For each of the following families of distributions, list any independence relations,
if any, that are implied in addition to the independence relations for a general
distribution that is Markov to the above graph.
(i) p(x) = (1/Z) ψ12 (x1 , x2 ) ψ23 (x2 , x3 ) ψ34 (x3 , x4 ) ψ45 (x4 , x5 ) ψ51 (x5 , x1 )
(ii) p(x) = (1/Z) ψ12 (x1 , x2 ) ψ23 (x2 , x3 ) ψ34 (x3 , x4 ) ψ45 (x4 , x5 )
(iii) p(x) = (1/Z) ψ12 (x1 , x2 ) ψ34 (x3 , x4 ) ψ5 (x5 )
(c) Now suppose that each Xi is binary, so that Xi ∈ {0, 1}, and consider the following
family of distributions:
p(x) ∝ ψθ (x1 , x2 ) ψθ (x2 , x3 ) ψθ (x3 , x4 ) ψθ (x4 , x5 ) ψθ (x5 , x1 )
where ψθ (x, y) = θ^{xy} with θ > 0. For this family of models, calculate the conditional probability P (X1 = 1 | X3 = 1, X4 = 1) as a function of θ.
10. Let X1 , . . . , X6 be random variables with joint distribution of the form
p(x) ∝ f (x1 , x2 , x3 ) g(x3 , x4 ) g(x1 , x5 ) g(x4 , x5 ) f (x1 , x5 , x6 )
where f : R3 −→ R+ and g : R2 −→ R+ are arbitrary non-negative functions.
(a) What is the graph with the fewest edges that represents the independence relations
for this family of distributions?
(b) For each of the following sets of variables C ⊂ {X1 , . . . , X6 }, give non-empty sets
of variables A and B such that A ⊥⊥ B | C.
(i) C = {X1 , X4 }
(ii) C = {X1 , X3 }
(iii) C = {X1 , X3 , X4 }
(c) Now suppose that Xi ∈ {0, 1} are binary and that
f (x, y, z) = α^{xyz} ,   g(x, y) = α^{xy}
Find the conditional probability P (X4 = 1 | X1 = 1, X3 = 1).

11. Let X = (X1 , X2 , X3 , X4 , X5 ) be a random vector distributed as X ∼ N (0, Σ) where
the covariance matrix Σ is given by
   
Σ = (1/15) ×
[  9  −3  −3  −3  −3 ]
[ −3   6   1   1   1 ]
[ −3   1   6   1   1 ]
[ −3   1   1   6   1 ]
[ −3   1   1   1   6 ]

with inverse

Σ^{−1} =
[ 3  1  1  1  1 ]
[ 1  3  0  0  0 ]
[ 1  0  3  0  0 ]
[ 1  0  0  3  0 ]
[ 1  0  0  0  3 ].

(a) What is the graph for X, viewed as an undirected graphical model?


(b) Which of the following independence statements are true?
i. X2 ⊥⊥ X3 | X1
ii. X3 ⊥⊥ X4
iii. X1 ⊥⊥ X3 | X2
iv. X1 ⊥⊥ X5
(c) List the local Markov properties for this graphical model.
(d) Find the conditional density p(x2 | X1 = −3).
12. Consider a chain graph X1 —X2 —X3 —X4 —X5 , where all variables are binary.
One log-linear model that is consistent with this graph is

log p(x) = β0 + 5 x1 x2 + x2 x3 + x3 x4 + x4 x5 .

Simulate n = 100 random vectors from this distribution (one way to set up the simulation is sketched after the exercises). Fit the model

log p(x) = β0 + ∑_j βj xj + ∑_{k<ℓ} βkℓ xk xℓ

using maximum likelihood. Report your estimators. Use forward model selection with
BIC to choose a submodel. Compare the selected model to the true model.
13. Let X = (X(1), . . . , X(d))T with X ∼ N (0, Σ) and d = 10. Let Ω = Σ−1 and suppose
that Ω(i, i) = 1, Ω(i, i − 1) = .5, Ω(i − 1, i) = .5 and Ω(i, j) = 0 otherwise. Simulate 50
random vectors and use the glasso method to estimate the covariance matrix. Compare
your estimated graph to the true graph. (A possible starting point for the simulation
and the glasso fit is sketched after the exercises.)
14. Let X = (X1 , X2 , X3 , X4 ) be a random vector satisfying

f (X) ∼ N (0, Σ)

where f (x) = (x1^3 , x2^3 , x3^3 , x4^3 ) and the covariance matrix Σ is given by

Σ = (1/6) ×
[  4  −2  −2  −2 ]
[ −2   4   1   1 ]
[ −2   1   4   1 ]
[ −2   1   1   4 ]

with inverse

Σ^{−1} =
[ 3  1  1  1 ]
[ 1  2  0  0 ]
[ 1  0  2  0 ]
[ 1  0  0  2 ].

(a) Give an expression for the density of X.
(b) What is the graph for X, viewed as a graphical model?
(c) List the local Markov properties for this graphical model.
15. Let X = (X(1), . . . , X(d)) where each Xj ∈ {0, 1}. Consider the log-linear model
log p(x) = β0 + ∑_{j=1}^d βj xj + ∑_{j<k} βjk xj xk + · · · + ∑_{j<k<ℓ} βjkℓ xj xk xℓ + · · ·

Suppose that βA = 0 whenever {1, 2} ⊂ A. Show that X1 ⊥⊥ X2 | X3 , . . . , Xd .


16. Let X = (X1 , . . . , Xd ) ∈ Rd be a random vector, with bivariate marginal densities
pij (xi , xj ) > 0 and univariate marginal densities pi (xi ). Let G = (V, E) be a tree graph
on {1, . . . , d}, so that G does not contain any cycles. Consider the family of functions
fm (x1 , . . . , xd ) = ∏_{i=1}^d θi (xi )^{mi} ∏_{(i,j)∈E} θij (xi , xj ),

where mi ∈ Z are integers. Find a set of integers m1 , . . . , md ∈ Z for which the function
fm is a probability density; i.e., fm is nonnegative and integrates to one.
17. Given X = (Y, Z) ∈ R6 , let X ∼ N (0, Σ) be a random Gaussian  vector
 where Y =
A B
(Y1 , Y2 ) ∈ R2 and Z = (Z1 , Z2 , Z3 , Z4 ) ∈ R4 , with Σ−1 = Ω = where
BT C
1
 
0 0 2 2
1 1 1  1 2 0 0
   
2 0 1 2 3 4 2
A= B= 1 0 0 2 1  .
C= 
0 2 −1 2
− 13 1
4 2
0 0 12 2

(a) Draw the undirected graph of X.


(b) Draw the undirected graph of Z. Hint: Recall that

[ A    B ]^{−1}   [ A^{−1} + A^{−1} B S^{−1} B^T A^{−1}    −A^{−1} B S^{−1} ]
[ B^T  C ]      = [ −S^{−1} B^T A^{−1}                       S^{−1}         ]

where S = C − B^T A^{−1} B is the Schur complement.


(c) Which of the following independence statements hold?
1. Y1 ⊥⊥ Y2 | Z
2. Z1 ⊥⊥ Z4 | Z2
3. Z1 ⊥⊥ Z4 | Y1
4. Z1 ⊥⊥ Z2 .
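
For the simulation step of Exercise 12, one possible approach (a minimal sketch, not a required method; it assumes numpy and reads the pairwise coefficients exactly as printed in the exercise) is exact sampling by enumerating all 2^5 configurations:

import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# All 2^5 binary configurations and their unnormalized log-probabilities.
configs = np.array(list(product([0, 1], repeat=5)))

def log_potential(x):
    x1, x2, x3, x4, x5 = x
    return 5 * x1 * x2 + x2 * x3 + x3 * x4 + x4 * x5

logits = np.array([log_potential(x) for x in configs])
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # beta0 is absorbed by the normalization

# Draw n = 100 vectors by sampling configuration indices.
idx = rng.choice(len(configs), size=100, p=probs)
sample = configs[idx]
print(sample[:5])

The maximum likelihood fit and the BIC-based forward selection can then be carried out with any standard log-linear modeling routine.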
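
For Exercise 13, a possible starting point is sketched below (it uses scikit-learn's GraphicalLasso, but any glasso implementation will do; the threshold 1e-4 and the regularization level alpha = 0.1 are arbitrary choices):

import numpy as np
from sklearn.covariance import GraphicalLasso

d, n = 10, 50
# Tridiagonal precision matrix: Omega(i,i) = 1, Omega(i,i-1) = Omega(i-1,i) = 0.5.
Omega = np.eye(d) + 0.5 * (np.eye(d, k=1) + np.eye(d, k=-1))
Sigma = np.linalg.inv(Omega)

rng = np.random.default_rng(1)
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

fit = GraphicalLasso(alpha=0.1).fit(X)
est_edges = (np.abs(fit.precision_) > 1e-4) & ~np.eye(d, dtype=bool)
true_edges = (Omega != 0) & ~np.eye(d, dtype=bool)
print("true edges:               ", int(true_edges.sum()) // 2)
print("estimated edges:          ", int(est_edges.sum()) // 2)
print("correctly recovered edges:", int((est_edges & true_edges).sum()) // 2)

Varying alpha, or choosing it by cross-validation, changes how sparse the estimated graph is; comparing est_edges with true_edges gives the requested comparison with the true graph.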

