Graphical Models
10/36-702
Graphical models are a way of representing the relationships between features (variables).
There are two main brands: directed and undirected. We shall focus on undirected graphical
models. See Figure 1 for an example of an undirected graph.
We write
X ⊥⊥ Y | Z
to mean that X and Y are independent given Z. In other words, p(x, y | z) = p(x | z) p(y | z).
In a marginal correlation graph (or association graph) we put an edge between Vj and Vk if |ρ(j, k)| ≥ ε, where ρ(j, k) is some measure of association and ε ≥ 0. Often we use ε = 0, in which case there is an edge iff ρ(j, k) ≠ 0. We also write ρ(Xj, Xk) to mean the same as ρ(j, k).
If Xj and Xk are independent then ρ(j, k) = 0. In general, the reverse may not be true. We will say that ρ is strong if ρ(j, k) = 0 implies that Xj ⊥⊥ Xk.
We would like ρ to have several properties: it should be easy to compute, robust to outliers, and there should be some way to calculate a confidence interval for the parameter. Here is a summary of the association measures we will consider:
Figure 1: A Protein network. From: Maslov and Sneppen (2002). Specificity and Stability
in Topology of Protein Networks. Science, 296, 910-913.
Pearson Correlation. A common choice of ρ is the Pearson correlation. For two variables X and Y the Pearson correlation is
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \qquad (1)$$
The sample estimate is
$$r(X, Y) = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{s_X\, s_Y}.$$
When dealing with d features X(1), . . . , X(d), we write ρ(j, k) ≡ ρ(X(j), X(k)). The sample correlation is denoted by rjk.
To test H0 : ρjk = 0 we can use an asymptotic test or an exact permutation test. The asymptotic test works like this. Define
$$Z_{jk} = \frac{1}{2} \log\left(\frac{1 + r_{jk}}{1 - r_{jk}}\right).$$
Fisher proved that
$$Z_{jk} \approx N\left(\theta_{jk}, \frac{1}{n-3}\right)$$
where
$$\theta_{jk} = \frac{1}{2} \log\left(\frac{1 + \rho_{jk}}{1 - \rho_{jk}}\right).$$
We reject H0 if |Zjk| > zα/2 /√(n − 3). In fact, to control for multiple testing, we should reject when |Zjk| > zα/(2m) /√(n − 3) where m = d(d − 1)/2. A confidence interval for θjk is Zjk ± zα/2 /√(n − 3); inverting the Fisher transform gives the confidence interval Cn = [a, b] for ρjk, where a = tanh(Zjk − zα/2 /√(n − 3)) and b = tanh(Zjk + zα/2 /√(n − 3)). A simultaneous confidence set for all the correlations can be obtained using the high dimensional bootstrap, which we describe later.
An exact test can be obtained by using a permutation test. Permute one of the variables and recompute the correlation. Repeat B times to get r^{(1)}_{jk}, . . . , r^{(B)}_{jk}. The p-value is
$$p = \frac{1}{B} \sum_{s} I\big( |r^{(s)}_{jk}| \ge |r_{jk}| \big).$$
Reject if p ≤ α/m.
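For concreteness, here is a minimal R sketch of both tests for a single pair of variables. The function names, the simulated data, and the number of permutations B are illustrative choices, not part of the notes.

```r
# Asymptotic (Fisher) test with confidence interval, and exact permutation test,
# for H0: rho = 0 between two variables x and y.
fisher.cor.test <- function(x, y, alpha = 0.05) {
  n <- length(x)
  r <- cor(x, y)
  z <- 0.5 * log((1 + r) / (1 - r))                  # Fisher transform
  crit <- qnorm(1 - alpha / 2) / sqrt(n - 3)
  list(r = r, reject = abs(z) > crit, ci = tanh(z + c(-1, 1) * crit))
}

perm.cor.test <- function(x, y, B = 1000) {
  r <- cor(x, y)
  rstar <- replicate(B, cor(sample(x), y))           # permute one variable
  mean(abs(rstar) >= abs(r))                         # permutation p-value
}

set.seed(1)
x <- rnorm(100); y <- 0.5 * x + rnorm(100)
fisher.cor.test(x, y)
perm.cor.test(x, y)
```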
Kendall's τ. The Pearson correlation is not very robust to outliers. A more robust measure of association is Kendall's tau, defined by
$$\tau(X, Y) = E\Big[\mathrm{sign}\big((X_1 - X_2)(Y_1 - Y_2)\big)\Big]$$
where (X1, Y1) and (X2, Y2) are independent draws from the distribution of (X, Y). τ can be estimated by
$$\widehat{\tau}(X, Y) = \binom{n}{2}^{-1} \sum_{s < t} \mathrm{sign}\big[(X_s - X_t)(Y_s - Y_t)\big].$$
Distance Covariance. Another measure of association is the distance covariance (Szekely, Rizzo and Bakirov 2007), defined by
$$\gamma^2(X, Y) = \mathrm{Cov}(\|X - X'\|, \|Y - Y'\|) - 2\,\mathrm{Cov}(\|X - X'\|, \|Y - Y''\|), \qquad (2)$$
where (X', Y') and (X'', Y'') are independent copies of (X, Y). The distance correlation is
$$\rho^2(X, Y) = \frac{\gamma^2(X, Y)}{\sqrt{\gamma^2(X, X)\,\gamma^2(Y, Y)}}.$$
The distance covariance can also be written as a weighted integral of |φ_{X,Y}(s, t) − φ_X(s) φ_Y(t)|², with weights involving constants c1, c2, where φ denotes the characteristic function. Another expression (Lyons 2013) for γ is
$$\gamma^2(X, Y) = E[\delta(X, X')\, \delta(Y, Y')]$$
where
$$\delta(X, X') = d(X, X') - \int d(X, u)\, dP(u) - \int d(X', u)\, dP(u) + \int\!\!\int d(u, v)\, dP(u)\, dP(v).$$
An estimate of γ is
$$\widehat{\gamma}^2(X, Y) = \frac{1}{n^2} \sum_{j,k} A_{jk} B_{jk}$$
where
$$A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \qquad B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}.$$
Here, a_{jk} = ||Xj − Xk||, b_{jk} = ||Yj − Yk||, and ā_{j·}, ā_{·k}, ā_{··} are the row, column and grand means of the matrix {a_{jk}} (and similarly for b). The limiting distribution of γ̂²(X, Y) is complicated, but we can easily test H0 : γ(X, Y) = 0 using a permutation test.
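Here is a rough R sketch of the empirical distance covariance above together with a permutation test; the function names and the number of permutations are illustrative. (The R package energy provides an implementation of these quantities as well.)

```r
# Empirical distance covariance: (1/n^2) sum_jk A_jk B_jk with double-centered
# distance matrices, plus a permutation test of H0: gamma = 0.
dcov2 <- function(x, y) {
  a <- as.matrix(dist(x)); b <- as.matrix(dist(y))
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
  B <- sweep(sweep(b, 1, rowMeans(b)), 2, colMeans(b)) + mean(b)
  mean(A * B)
}

perm.test.dcov <- function(x, y, B = 500) {
  obs <- dcov2(x, y)
  stars <- replicate(B, dcov2(x, y[sample(length(y))]))  # permute y
  mean(stars >= obs)                                      # permutation p-value
}

set.seed(2)
x <- rnorm(200); y <- x^2 + 0.5 * rnorm(200)   # dependent but nearly uncorrelated
perm.test.dcov(x, y)
```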
τ*. Bergsma and Dassios proposed a modification of Kendall's τ, defined by
$$\tau^*(X, Y) = E\big[ a(X_1, X_2, X_3, X_4)\, a(Y_1, Y_2, Y_3, Y_4) \big]$$
where (X1, Y1), . . . , (X4, Y4) are independent copies of (X, Y) and
$$a(z_1, z_2, z_3, z_4) = \mathrm{sign}\big( |z_1 - z_2| + |z_3 - z_4| - |z_1 - z_3| - |z_2 - z_4| \big).$$
An estimate of τ* is
$$\widehat{\tau}^* = \binom{n}{4}^{-1} \sum a(X_i, X_j, X_k, X_\ell)\, a(Y_i, Y_j, Y_k, Y_\ell), \qquad (5)$$
where the sum is over distinct quadruples. The τ* parameter can also be given an interpretation in terms of concordant and discordant quadruples of points. Then
$$\tau^* = \frac{2 P(\text{concordant}) - P(\text{discordant})}{3}.$$
This statistic is related to the distance covariance. To test H0 : X ⊥⊥ Y we use a permutation test. Recently Dhar, Dassios and Bergsma (2016) showed that τ̂* has good power and is quite robust.
Confidence Intervals. Constructing confidence intervals for γ is not easy. The problem is that the statistics have different limiting distributions depending on whether the null H0 : X ⊥⊥ Y is true or not. For example, if H0 is true then
$$\sqrt{n}(\widehat{\gamma} - \gamma) \rightsquigarrow \sum_{j=1}^{\infty} \lambda_j [(Z_j + a_j)^2 - 1]$$
for constants λj, aj, where Z1, Z2, . . . are independent standard Normals, while if H0 fails the limit is Normal N(0, σ²) for some σ > 0. When γ is small but not quite 0, the distribution will be something in-between a Normal and a Gaussian chaos.
Since the limiting distribution varies, we cannot really use it to construct a confidence interval. One way to solve this problem is to use blocking. Instead of using a U-statistic based on all subsets of size 4, we can break the dataset into non-overlapping blocks of size 4. We compute Q = a(Xi, Xj, Xk, Xℓ) a(Yi, Yj, Yk, Yℓ) on each block, giving values Q1, . . . , Qm. Then we define
$$g = \frac{1}{m} \sum_j Q_j$$
where m = n/4 is the number of blocks. Since this is an average, it will have a limiting Normal distribution. We can then use the Normal approximation or the bootstrap to get confidence intervals. However, g uses less information than γ̂, so we will get larger confidence intervals than necessary.
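Here is a minimal R sketch of the blocking construction, using the sign kernel a(·) defined above and consecutive blocks of size 4; the function names and the Normal-approximation interval are illustrative choices.

```r
# Blocking: split the data into m = n/4 non-overlapping blocks, compute Q on
# each block with the sign kernel a(.), and use a Normal approximation.
a.kernel <- function(z) sign(abs(z[1] - z[2]) + abs(z[3] - z[4]) -
                             abs(z[1] - z[3]) - abs(z[2] - z[4]))

block.ci <- function(x, y, alpha = 0.05) {
  n <- 4 * floor(length(x) / 4)
  idx <- matrix(seq_len(n), ncol = 4, byrow = TRUE)     # consecutive blocks
  Q <- apply(idx, 1, function(i) a.kernel(x[i]) * a.kernel(y[i]))
  m <- nrow(idx)
  mean(Q) + c(-1, 1) * qnorm(1 - alpha / 2) * sd(Q) / sqrt(m)
}
```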
For τ* the situation is better. Note that τ̂* is a U-statistic of order 4. That is,
$$\widehat{\tau}^* = \binom{n}{4}^{-1} \sum K(Z_i, Z_j, Z_k, Z_\ell)$$
where the sum is over all distinct quadruples and Zi = (Xi, Yi). Also, −1 ≤ K ≤ 1. By Hoeffding's inequality for U-statistics of order b we have
$$P(|\widehat{\tau}^* - \tau^*| > t) \le 2 e^{-2(n/b) t^2 / r^2}$$
where r is the range of K. In our case, r = 2 and b = 4, so
$$P(|\widehat{\tau}^* - \tau^*| > t) \le 2 e^{-n t^2 / 8}.$$
If we set t_n = \sqrt{(8/n) \log(2/\alpha)}, then Cn = τ̂* ± t_n is a 1 − α confidence interval. However, it may not be the shortest possible confidence interval. Finding the shortest valid confidence interval is an open question.
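A rough R sketch of this Hoeffding-type interval follows, reusing the a.kernel function from the previous sketch. To keep it short, the order-4 U-statistic is approximated by averaging over randomly sampled quadruples rather than all of them, which is an illustrative shortcut rather than part of the notes.

```r
# Approximate tau* by subsampling quadruples, then form the Hoeffding interval
# tauhat* +/- sqrt((8/n) log(2/alpha)) derived above.
tau.star.hat <- function(x, y, n.quad = 5000) {
  n <- length(x)
  quads <- replicate(n.quad, sample(n, 4))
  mean(apply(quads, 2, function(i) a.kernel(x[i]) * a.kernel(y[i])))
}

tau.star.ci <- function(x, y, alpha = 0.05) {
  tn <- sqrt((8 / length(x)) * log(2 / alpha))
  tau.star.hat(x, y) + c(-1, 1) * tn
}
```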
Example. Figure 2 shows some Pearson correlation graphs: two Markov chains, a hub, four clusters, and a band. Strictly speaking, these structures are best discovered using conditional independence graphs (discussed later), but correlation graphs are easy to estimate and often reveal the salient structure.
Figure 3 shows a graph from highly non-Normal data. The data have the structure of two
Markov chains. I used the distance correlation on all pairs with permutation tests. Nice!
High Dimensional Bootstrap For Pearson Correlations. We can also get simultaneous confidence intervals for many Pearson correlations. This is especially important if we want to put an edge when |ρ(j, k)| ≥ ε. If we have a confidence interval C then we can put an edge whenever [−ε, ε] ∩ C = ∅.
The easiest way to get simultaneous confidence intervals is to use the bootstrap. Let R be the d × d matrix of true correlations and let R̂ be the d × d matrix of sample correlations. (Actually, it is probably better to use the Fisher transformed correlations.) Let X*₁, . . . , X*ₙ denote a bootstrap sample and let R̂* be the d × d matrix of correlations from the bootstrap sample. After taking B bootstrap samples we have R̂*₁, . . . , R̂*_B. Let δj = √n max_{s,t} |R̂*_j(s, t) − R̂(s, t)| and define
$$\widehat{F}_n(w) = \frac{1}{B} \sum_{j=1}^B I(\delta_j \le w),$$
which approximates
$$F_n(w) = P\Big( \sqrt{n}\, \max_{s,t} |\widehat{R}(s,t) - R(s,t)| \le w \Big).$$
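The following R sketch carries out this bootstrap for a data matrix X; the function name, B, and α are illustrative choices.

```r
# Bootstrap simultaneous confidence intervals for all pairwise correlations.
# X is an n x d data matrix.
boot.cor.ci <- function(X, B = 500, alpha = 0.05) {
  n <- nrow(X)
  Rhat <- cor(X)
  delta <- replicate(B, {
    Xstar <- X[sample(n, replace = TRUE), ]
    sqrt(n) * max(abs(cor(Xstar) - Rhat))        # delta_j
  })
  t.alpha <- quantile(delta, 1 - alpha)
  list(lower = Rhat - t.alpha / sqrt(n), upper = Rhat + t.alpha / sqrt(n))
}
```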
Figure 2: Pearson correlation graphs. Top left: two Markov-chains. Top right: a Hub.
Bottom left: 4 clusters. Bottom right: banded.
Let t_α = F̂_n^{-1}(1 − α) and define C_{st} = [R̂(s,t) − t_α/√n, R̂(s,t) + t_α/√n].

Theorem 3 Suppose that d = o(e^{n^{1/6}}). Then
P(R(s, t) ∈ C_{st} for all s, t) → 1 − α
as n → ∞.
Let X, Y ∈ R and Z be a random vector. The partial correlation between X and Y , given
Z, is a measure of association between X and Y after removing the effect of Z.
Now let’s go back to graphs. Let X = (X(1), . . . , X(d)) and let ρjk denote the partial
correlation between X(j) and X(k) given all the other variables. Let R = {ρjk } be the d × d
matrix of partial correlations.
Lemma 4 The matrix R is given by R(j, k) = −Ω_{jk}/√(Ω_{jj} Ω_{kk}), where Ω = Σ^{-1} and Σ is the covariance matrix of X.
The partial correlation graph G has an edge between j and k when ρjk ≠ 0.
In the low-dimensional setting, we can estimate R as follows. Let Sn be the sample covariance and let Ω̂ = Sn^{-1}. Define
$$\widehat{R}(j,k) = -\frac{\widehat{\Omega}_{jk}}{\sqrt{\widehat{\Omega}_{jj}\,\widehat{\Omega}_{kk}}}.$$
The easiest way to construct the graph is to get simultaneous confidence intervals Cjk using the bootstrap. Then we put an edge if 0 ∉ Cjk.
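Here is a minimal R sketch of this low-dimensional recipe; the bootstrap intervals mirror the Pearson-correlation bootstrap above, and the function names and tuning values are illustrative.

```r
# Partial correlation matrix from the inverse sample covariance (requires n > d).
partial.cor <- function(X) {
  Omega <- solve(cov(X))
  -Omega / sqrt(outer(diag(Omega), diag(Omega)))
}

# Bootstrap simultaneous intervals; put an edge (j, k) when 0 is outside C_jk.
partial.cor.graph <- function(X, B = 500, alpha = 0.05) {
  n <- nrow(X)
  Rhat <- partial.cor(X)
  delta <- replicate(B, {
    Xstar <- X[sample(n, replace = TRUE), ]
    sqrt(n) * max(abs(partial.cor(Xstar) - Rhat))
  })
  t.alpha <- quantile(delta, 1 - alpha)
  A <- abs(Rhat) > t.alpha / sqrt(n)    # adjacency matrix (TRUE = edge)
  diag(A) <- FALSE
  A
}
```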
It can be shown that
$$\mathrm{Var}(\widehat{R}(j,k)) \approx \frac{1}{n - d},$$
which shows that we cannot reliably estimate the partial correlation when d is large. You can do three things:
1. Compute a correlation graph instead. This is easy, works well, and often reveals structure similar to that in the partial correlation graph.
2. Shrinkage: let Ω̂ = [(1 − ε) Sn + ε D]^{-1}, where 0 ≤ ε ≤ 1 and D is a diagonal matrix with Djj = Sjj. Then we use the bootstrap to test the entries of the matrix. Based on calculations in Schafer and Strimmer (2005) and Ledoit and Wolf (2004), a good choice of ε is
$$\varepsilon = \frac{\sum_{j \ne k} \widehat{\mathrm{Var}}(s_{jk})}{\sum_{j \ne k} s_{jk}^2}$$
where
$$\widehat{\mathrm{Var}}(s_{jk}) = \frac{n}{(n-1)^3} \sum_{i=1}^n (w_{ijk} - \overline{w}_{jk})^2,$$
w_{ijk} = (Xi(j) − X̄(j))(Xi(k) − X̄(k)) and w̄_{jk} = n^{-1} Σ_i w_{ijk}. This choice is based on minimizing the estimated risk (see the sketch after this list).
3. Use the graphical lasso (described below). Warning! The reliability of the graphical
lasso depends on lots of non-trivial, uncheckable assumptions.
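As referenced in item 2 above, here is a rough R sketch of the shrinkage choice of ε and the shrunken inverse; the function name is illustrative and the notation follows the formulas in the list.

```r
# Shrinkage estimate of Omega = [(1 - eps) Sn + eps D]^{-1} with the
# Schafer-Strimmer / Ledoit-Wolf style choice of eps.
shrink.omega <- function(X) {
  n <- nrow(X); d <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  S <- cov(X)
  D <- diag(diag(S))
  var.s <- matrix(0, d, d)
  for (j in 1:d) for (k in 1:d) {
    w <- Xc[, j] * Xc[, k]                        # w_ijk
    var.s[j, k] <- n / (n - 1)^3 * sum((w - mean(w))^2)
  }
  off <- row(S) != col(S)
  eps <- sum(var.s[off]) / sum(S[off]^2)
  eps <- min(max(eps, 0), 1)                      # keep eps in [0, 1]
  solve((1 - eps) * S + eps * D)                  # shrunken Omega-hat
}
```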
[Figure: sparsity level versus regularization parameter for the graphical lasso, with estimated graphs shown at λ = 0.266, λ = 0.165, and λ = 0.13.]
For the graphical lasso we proceed as follows. We assume that Xi ∼ N(μ, Σ). Then we estimate Σ (and hence Ω = Σ^{-1}) using the penalized log-likelihood
$$\widehat{\Omega} = \arg\max_{\Omega \succ 0} \Big[ \ell(\Omega) - \lambda \sum_{j \ne k} |\omega_{jk}| \Big]$$
where
$$\ell(\Omega) = \frac{n}{2} \log |\Omega| - \frac{n}{2}\, \mathrm{tr}(\Omega S_n) - \frac{nd}{2} \log(2\pi). \qquad (6)$$
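For illustration, here is how one might fit this with the R package glasso (assuming it is installed); the toy data and the value of the penalty are arbitrary choices.

```r
library(glasso)    # implements the penalized Gaussian likelihood above

set.seed(3)
X <- matrix(rnorm(200 * 10), 200, 10)       # toy data
S <- cov(X)
fit <- glasso(S, rho = 0.1)                 # rho plays the role of lambda
Omega.hat <- fit$wi                         # estimated precision matrix
graph <- abs(Omega.hat) > 1e-8              # adjacency: nonzero off-diagonals
diag(graph) <- FALSE
```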
Node-wise regression. A related but different approach is due to Meinshausen and Buhlmann (2006). The idea is to regress each variable on all the others using the lasso.
Example: Figure 4 shows a hub graph. The graph was estimated by the Meinshausen-Buhlmann (2006) method using the R package huge.
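A brief sketch of this with the huge package (assuming it is installed); the simulated hub data and the selection criterion are illustrative choices.

```r
library(huge)    # node-wise (Meinshausen-Buhlmann) estimation

X <- huge.generator(n = 200, d = 40, graph = "hub")$data   # simulated hub data
out <- huge(X, method = "mb")              # lasso regression of each node on the rest
sel <- huge.select(out, criterion = "ric") # pick a regularization level
sel$refit                                  # estimated adjacency matrix
```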
How To Choose λ. If we use the graphical lasso, how do we choose λ? One idea is to use cross-validation based on the Normal log-likelihood. In this case we fit the model on part of the data and evaluate the log-likelihood on the held-out data. This is not very reliable since it depends heavily on the Normality assumption. Currently, I do not think there is a rigorously justified, robust method for choosing λ. Perhaps the best strategy is to plot the graph for many values of λ.
More Robust Approach. Here is a more robust method. For each pair of variables, regress them on all the other variables (using your favorite regression method). Now compute the Kendall correlation on the residuals. This seems like a good idea but I have not seen anyone try it.
Nonparametric Partial Correlation Graphs. There are various ways to create a nonparametric partial correlation. Let us write
X = g(Z) + ε_X,    Y = h(Z) + ε_Y.
Thus, ε_X = X − g(Z) and ε_Y = Y − h(Z), where g(z) = E[X | Z = z] and h(z) = E[Y | Z = z]. Now define
ρ(X, Y | Z) = ρ(ε_X, ε_Y).
We can estimate ρ by using nonparametric regression to estimate g(z) and h(z). Then we take the correlation between the residuals ε̂_{X,i} = Xi − ĝ(Zi) and ε̂_{Y,i} = Yi − ĥ(Zi). When Z is high-dimensional, we can use SpAM to estimate g and h.
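A minimal R sketch of the residual idea for a one-dimensional Z, using loess as a stand-in smoother (for higher-dimensional Z an additive model such as SpAM could be substituted, as the notes suggest); the Kendall correlation on the residuals follows the robust suggestion above.

```r
# Nonparametric partial correlation of x and y given a scalar z, via residuals
# from loess fits of each variable on z.
npn.partial.cor <- function(x, y, z) {
  rx <- residuals(loess(x ~ z))
  ry <- residuals(loess(y ~ z))
  cor(rx, ry, method = "kendall")   # Kendall correlation on the residuals
}
```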
The strongest type of undirected graph is a conditional independence graph. In this case, we
omit the edge between j and k if X(j) is independent of X(k) given the rest of the variables.
We write this as
X(j) ⊥⊥ X(k) | rest.    (7)
Conditional independence graphs are the most informative undirected graphs but they are
also the hardest to estimate.
3.1 Gaussian
In the special case of Normality, they are equivalent to partial correlation graphs.
Theorem 5 Suppose that X = (X(1), . . . , X(d)) ∼ N (µ, Σ). Let Ω = Σ−1 . Then X(j) is
independent of X(k) given the rest, if and only if Ωjk = 0.
When all the variables are discrete, the joint distribution is multinomial. It is convenient to
reparameterize the multinomial in a form known as a log-linear model.
Let’s start with a simple example. Suppose X = (X(1), X(2)) and that each variable is
binary. Let
p(x1 , x2 ) = P(X1 = x1 , X2 = x2 ).
So, for example, p(0, 1) = P(X1 = 0, X2 = 1). There are four unknown parameters:
p(0, 0), p(0, 1), p(1, 0), p(1, 1). Actually, these have to add up to 1, so there are really only
three free parameters.
We can reparameterize the model as log p(x1, x2) = β0 + β1 x1 + β2 x2 + β12 x1 x2. So why should we bother writing the model this way? The answer is that the parameters encode independence relations; for example, X1 ⊥⊥ X2 if and only if β12 = 0.
Now suppose that X = (X(1), . . . , X(d)). Let's continue to assume that each variable is binary. The log-linear representation is
$$\log p(x_1, \ldots, x_d) = \beta_0 + \sum_j \beta_j x_j + \sum_{j<k} \beta_{jk} x_j x_k + \cdots + \beta_{12\cdots d}\, x_1 \cdots x_d.$$
Suppose, for example, that d = 3 and that β23 = β123 = 0. We conclude that X(2) ⊥⊥ X(3) | X(1). Hence, we can omit the edge between X(2) and X(3).
Log-linear models thus make a nice connection between conditional independence graphs
and parameters. There is a simple one-line command in R for fitting log-linear models. The
function gives the parameter estimates as well as tests that each parameter is 0.
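As an illustration, log-linear models for contingency tables can be fit in R with the base function loglin, or with MASS::loglm which wraps it in a formula interface; the made-up data, variable names, and chosen model below are illustrative.

```r
# Fit a log-linear model to a 2x2x2 table with all two-way interactions but no
# three-way term, and inspect the parameter estimates and goodness-of-fit tests.
set.seed(4)
x <- data.frame(x1 = rbinom(500, 1, 0.5),
                x2 = rbinom(500, 1, 0.5),
                x3 = rbinom(500, 1, 0.5))
tab <- table(x)

library(MASS)
fit <- loglm(~ x1 * x2 + x2 * x3 + x1 * x3, data = tab)
fit          # likelihood ratio and Pearson tests for the omitted term
coef(fit)    # parameter estimates
```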
When the variables are not binary, the model is a bit more complicated. Each variable is
now represented by a vector of parameters rather than one parameter. But conceptually, it
is the same. Suppose Xj ∈ {0, 1, . . . , m − 1}, for j ∈ V , with V = {1, . . . , d}; thus each of
the d variables takes one of m possible values.
The joint probability can be expanded as
$$\log p(x) = \sum_{A \subseteq V} \psi_A(x_A), \qquad (8)$$
where the sum is over subsets A of V, ψ_∅ is a constant, and each ψ_A depends on x only through x_A. The formula in (8) is called the log-linear expansion of p(x). Each ψ_A(x_A) may depend on some unknown parameters θ_A. Note that the total number of parameters satisfies
$$\sum_{j=0}^{d} \binom{d}{j} (m-1)^j = m^d;$$
however, one of the parameters is the normalizing constant and is determined by the constraint that the sum of the probabilities is one. Thus, there are m^d − 1 free parameters, and this is a minimal exponential parameterization of the multinomial. Let θ = (θ_A : A ⊂ V) be the set of all these parameters. We will write p(x) = p(x; θ) when we want to emphasize the dependence on the unknown parameters θ.
The next theorem provides an easy way to read out conditional independence in a log-linear
model.
Exponentiating, we see that the joint density is of the form f (xA , xB )g(xA , xC ). Therefore
XB ⊥⊥ XC | XA . The reverse follows by reversing the argument.
A graphical log-linear model with respect to a graph G is a log-linear model for which the parameters ψA satisfy ψA(xA) ≠ 0 if and only if A is a clique of G. Thus, a graphical log-linear model has potential functions on each clique, both maximal and non-maximal, with the restriction that ψA(xA) = 0 in case xj = 0 for any j ∈ A. In a hierarchical log-linear model, if ψA(xA) = 0 then ψB(xB) = 0 whenever A ⊂ B. Thus, the parameters in a hierarchical model are nested, in the sense that if a parameter is identically zero for some subset of variables, the parameter for supersets of those variables must also be zero. Every graphical log-linear model is hierarchical, but a hierarchical model need not be graphical; this relationship is shown in Figure 5 and is characterized by the next lemma.
Lemma 10 A graphical log-linear model is hierarchical, but the reverse need not be true.
Proof. Suppose, for a contradiction, that a model is graphical but not hierarchical. Then there exist two sets A ⊂ B with ψA(xA) = 0 and ψB(xB) ≠ 0. Since the model is graphical, ψB(xB) ≠ 0 implies that B is a clique. Then A must also be a clique, since A ⊂ B, which implies that ψA(xA) ≠ 0: a contradiction.
To see that a hierarchical model does not have to be graphical, consider the following example. Let
$$\log p(x) = \psi_\emptyset + \sum_{i=1}^3 \psi_i(x_i) + \sum_{1 \le j < k \le 3} \psi_{jk}(x_j, x_k). \qquad (10)$$
This model is hierarchical but not graphical. The graph corresponding to this model is the complete graph on the three nodes X1, X2, X3. The model is not graphical since ψ123(x) ≡ 0, even though {1, 2, 3} is a clique of the complete graph.
Figure 5: The nested model classes: graphical ⊂ hierarchical ⊂ log-linear (= multinomial). Every graphical log-linear model is hierarchical but the reverse may not be true.
3.3 The Nonparametric Case
In real life, nothing has a Normal distribution. What should we do? We could just use a
correlation graph or partial correlation graph. That’s what I recommend. But if you really
want a nonparametric conditional independence graph, there are some possible approaches.
Conditional cdf Method. Let X and Y be real and let Z be a random vector. Define U = F(X|Z) and V = G(Y|Z), where
F(x|z) = P(X ≤ x | Z = z),    G(y|z) = P(Y ≤ y | Z = z).

Lemma 11 If X ⊥⊥ Y | Z then U ⊥⊥ V.
If we knew F and G, we could compute Ui = F (Xi |Zi ) and Vi = G(Yi |Zi ) and then test for
independence between U and V .
Since F and G are unknown, we use Ûi = F̂(Xi|Zi) and V̂i = Ĝ(Yi|Zi), where
$$\widehat{F}(x|z) = \frac{\sum_{s=1}^n K(\|z - Z_s\|/h)\, I(X_s \le x)}{\sum_{s=1}^n K(\|z - Z_s\|/h)}, \qquad \widehat{G}(y|z) = \frac{\sum_{s=1}^n K(\|z - Z_s\|/b)\, I(Y_s \le y)}{\sum_{s=1}^n K(\|z - Z_s\|/b)}.$$
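Here is a small R sketch of such a kernel conditional cdf estimate; the Gaussian kernel and the function name are illustrative, and the bandwidth h must be supplied by the user.

```r
# Kernel estimate of F(x0 | z0); x is a vector, z an n x q covariate matrix.
cond.cdf <- function(x0, z0, x, z, h) {
  d2 <- rowSums((z - matrix(z0, nrow(z), ncol(z), byrow = TRUE))^2)
  w <- dnorm(sqrt(d2) / h)                 # Gaussian kernel weights
  sum(w * (x <= x0)) / sum(w)
}
# Estimated U_i = Fhat(X_i | Z_i):
# U.hat <- sapply(seq_along(x), function(i) cond.cdf(x[i], z[i, ], x, z, h))
```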
Let θ be a measure of dependence between U and V, let θ̃ be its estimate based on the true (Ui, Vi), and let θ̂ be the same estimate based on (Ûi, V̂i). Suppose that
$$n^{1/2}(\widetilde\theta - \theta) = O_P(1), \quad n^{\beta_1}(\widehat{F}(x|z) - F(x|z)) = O_P(1), \quad n^{\beta_2}(\widehat{G}(y|z) - G(y|z)) = O_P(1).$$
Then
$$\sqrt{n}(\widehat\theta - \theta) = \sqrt{n}(\widetilde\theta - \theta) + O_P(n^{-\gamma})$$
where γ = min{β1, β2}. This means that, asymptotically, we can treat (Ûi, V̂i) as if they were (Ui, Vi). Of course, for graphs, the whole procedure needs to be repeated for each pair of variables.
There are some caveats. First, we are essentially doing high dimensional regression. In high
dimensions, the convergence will be very slow. Second, we have to choose the bandwidths h
and b. It is not obvious how to do this in practice.
Challenge: Can you think of a way to do sparse estimation of F(x|z) and G(y|z)?
One idea is to do node-wise regression using SpAM (see Voorman, Shojaie and Witten 2007).
An alternative is as follows. Let f(X) = (f1(X1), . . . , fp(Xp)). Assume that f(X) ∼ N(μ, Σ) and write X ∼ NPN(μ, Σ, f). If each fj is monotone then this is just a Gaussian copula, that is,
$$F(x_1, \ldots, x_p) = \Phi_{\mu,\Sigma}\big(\Phi^{-1}(F_1(x_1)), \ldots, \Phi^{-1}(F_p(x_p))\big).$$
The marginal means and variances µj and σj are not identifiable but this does not affect the
graph G.
[Figure: three examples of nonparanormals.]
Let f̂j(xj) = Φ^{-1}(F̃j(xj)). The usual empirical cdf F̂j(xj) will not work if d increases with n. We use a Winsorized version:
$$\widetilde{F}_j(x) = \begin{cases} \delta_n & \text{if } \widehat{F}_j(x) < \delta_n \\ \widehat{F}_j(x) & \text{if } \delta_n \le \widehat{F}_j(x) \le 1 - \delta_n \\ 1 - \delta_n & \text{if } \widehat{F}_j(x) > 1 - \delta_n, \end{cases}$$
where
$$\delta_n \equiv \frac{1}{4 n^{1/4} \sqrt{\pi \log n}}.$$
This choice of δn provides the right bias-variance balance so that we can achieve the desired rate of convergence in our estimate of Ω and the associated undirected graph G. Now compute the sample covariance Sn of the normalized variables Zj = f̂j(Xj) = Φ^{-1}(F̃j(Xj)). Finally, apply the glasso to Sn. Let Sn* be the covariance using the true fj's.
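A rough R sketch of the Winsorized transform followed by the glasso; the rank-based empirical cdf and the choice rho = 0.1 are illustrative, and the glasso package from earlier is assumed to be available. The huge package offers a similar transformation through its huge.npn function.

```r
# Winsorized nonparanormal transform, then glasso on the transformed covariance.
npn.transform <- function(X) {
  n <- nrow(X)
  delta <- 1 / (4 * n^(1/4) * sqrt(pi * log(n)))
  apply(X, 2, function(x) {
    Fhat <- rank(x) / n                        # empirical cdf at the data points
    qnorm(pmin(pmax(Fhat, delta), 1 - delta))  # Winsorize, then apply Phi^{-1}
  })
}

Z <- npn.transform(X)                  # X: n x d data matrix
Sn <- cov(Z)
fit <- glasso::glasso(Sn, rho = 0.1)   # rho is again a tuning choice
```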
Suppose that d ≤ n^ξ. For large n, one can show (via an exponential tail inequality involving constants c1, . . . , c5) that
$$\max_{jk} |S_n(j,k) - S_n^*(j,k)| = O_P\!\left( \sqrt{ \frac{\log d\, \log^2 n}{n^{1/2}} } \right).$$
[Figure: comparison of the nonparanormal and the glasso (vertical axis 1 − FP) under a CDF transform, a power transform, and no transform; n = 500, p = 40.]
Rank-based estimates. A different approach is to estimate the correlation matrix directly from rank correlations. Define
$$\widehat{\tau}_{jk} = \frac{2}{n(n-1)} \sum_{1 \le s < t \le n} \mathrm{sign}\Big( (X_s(j) - X_t(j))\,(X_s(k) - X_t(k)) \Big)$$
and set Ŝ_{jk} = sin(π τ̂_{jk}/2), with Ŝ_{jj} = 1. Then (with d ≥ n)
$$P\left( \max_{jk} |\widehat{S}_{jk} - \Sigma_{jk}| > 2.45\,\pi \sqrt{\frac{\log d}{n}} \right) \le \frac{1}{d}.$$
As in Yuan (2010), let
$$\mathcal{M} = \left\{ \Omega : \Omega \succ 0,\; \|\Omega\|_1 \le \kappa,\; \frac{1}{c} \le \lambda_{\min}(\Omega) \le \lambda_{\max}(\Omega) < c,\; \mathrm{deg}(\Omega) \le M \right\}.$$
From Yuan, this implies that the Skeptic is minimax rate optimal. Now plug Ŝ into the glasso. See Liu, Han, Yuan, Lafferty and Wasserman (arXiv:1202.2169) for numerical experiments and theoretical results.
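A short R sketch of this rank-based plug-in; the tuning value rho = 0.1 is an illustrative choice and the glasso package is assumed as before.

```r
# Pairwise Kendall's tau, transformed elementwise by sin(pi * tau / 2),
# then fed to the glasso.
tau.mat <- cor(X, method = "kendall")     # X: n x d data matrix
S.hat <- sin(pi * tau.mat / 2)
diag(S.hat) <- 1
fit <- glasso::glasso(S.hat, rho = 0.1)
```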
Forests. Yet another approach is based on forests, which are a tractable family of graphs. A graph F is a forest if it contains no cycles. If F is a d-node undirected forest with vertex
set VF = {1, . . . , d} and edge set EF ⊂ {1, . . . , d} × {1, . . . , d}, the number of edges satisfies
|EF | < d. Suppose that P is Markov to F and has density p. Then p can be written as
$$p(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \prod_{k \in V_F} p(x_k), \qquad (11)$$
where each p(xi, xj) is a bivariate density and each p(xk) is a univariate density. Using (11), the joint entropy can be written as
$$H(X) = -\int p(x) \log p(x)\, dx = - \sum_{(i,j) \in E_F} \int p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\, dx_i\, dx_j - \sum_{k \in V_F} \int p(x_k) \log p(x_k)\, dx_k$$
$$= - \sum_{(i,j) \in E_F} I(X_i; X_j) + \sum_{k \in V_F} H(X_k), \qquad (13)$$
where
$$I(X_i; X_j) \equiv \int p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\, dx_i\, dx_j \qquad (14)$$
is the mutual information between the pair of variables Xi, Xj and
$$H(X_k) \equiv -\int p(x_k) \log p(x_k)\, dx_k \qquad (15)$$
is the entropy.
The optimal forest F* can be found by minimizing the right hand side of (13). Since the entropy term Σ_k H(X_k) is constant across all forests, this can be recast as
the problem of finding the maximum weight spanning forest for a weighted graph, where
the weight w(i, j) of the edge connecting nodes i and j is I(Xi ; Xj ). Kruskal’s algorithm
(Kruskal 1956) is a greedy algorithm that is guaranteed to find a maximum weight spanning
tree of a weighted graph. In the setting of density estimation, this procedure was proposed
by Chow and Liu (1968) as a way of constructing a tree approximation to a distribution.
At each stage the algorithm adds an edge connecting that pair of variables with maximum
mutual information among all pairs not yet visited by the algorithm, if doing so does not
form a cycle. When stopped early, after k < d edges have been added, it yields the best
k-edge weighted forest.
Of course, the above procedure is not practical since the true density p(x) is unknown. In applications, we parameterize the bivariate and univariate densities as p_{θ_{ij}}(x_i, x_j) and p_{θ_k}(x_k), and we replace the population mutual information I(Xi; Xj) in (13) by the plug-in estimate Î_n(Xi, Xj), defined as
$$\widehat{I}_n(X_i, X_j) = \int p_{\widehat\theta_{ij}}(x_i, x_j) \log \frac{p_{\widehat\theta_{ij}}(x_i, x_j)}{p_{\widehat\theta_i}(x_i)\, p_{\widehat\theta_j}(x_j)}\, dx_i\, dx_j. \qquad (16)$$
The Chow-Liu algorithm for learning forest graphs then runs Kruskal's algorithm on the complete weighted graph with weights Î_n(Xi, Xj), stopping after k edges to obtain a k-edge forest.
Example 14 (Learning a Gaussian maximum weight spanning tree) For Gaussian data X ∼ N(μ, Σ), the mutual information between two variables is
$$I(X_i; X_j) = -\frac{1}{2} \log\left(1 - \rho_{ij}^2\right), \qquad (17)$$
where ρij is the correlation between Xi and Xj. To obtain an empirical estimator, we simply plug in the sample correlation ρ̂ij. Once the mutual information matrix is calculated, we can apply the Chow-Liu algorithm to get the maximum weight spanning tree.
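Here is a short self-contained R sketch of this Gaussian Chow-Liu procedure with a greedy Kruskal step; it is an illustration under the Gaussian formula (17), not the implementation used for the figures below.

```r
# Chow-Liu spanning tree for Gaussian data: weights are the estimated mutual
# informations -0.5 * log(1 - rho^2); edges are added greedily (Kruskal) as
# long as they do not create a cycle.
chow.liu.tree <- function(X) {
  d <- ncol(X)
  MI <- -0.5 * log(1 - cor(X)^2); diag(MI) <- 0
  edges <- which(upper.tri(MI), arr.ind = TRUE)
  edges <- edges[order(MI[edges], decreasing = TRUE), , drop = FALSE]
  comp <- 1:d                               # component labels (union-find)
  tree <- matrix(0, d, d)
  for (e in seq_len(nrow(edges))) {
    i <- edges[e, 1]; j <- edges[e, 2]
    if (comp[i] != comp[j]) {               # no cycle is created
      tree[i, j] <- tree[j, i] <- 1
      comp[comp == comp[j]] <- comp[i]      # merge components
    }
  }
  tree                                       # adjacency matrix of the tree
}
```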
Example 15 (Graphs for Equities Data) Daily closing prices were obtained for 452 stocks that were consistently in the S&P 500 index from January 1, 2003 through January 1, 2011. This gives altogether 2,015 data points; each data point corresponds to the vector of closing prices on a trading day. With St,j denoting the closing price of stock j on day t, we consider the variables Xtj = log(St,j / St−1,j) and build graphs over the indices j. We simply treat the instances Xt as independent replicates, even though they form a time series. We truncate every stock so that its data points are within six times the mean absolute deviation from the sample average. In Figure 7(a) we show boxplots for 10 randomly chosen stocks. It can be seen that the data contain outliers even after truncation; the reasons for these outliers include stock splits, which increase the number of shares. In Figure 7(b) we show boxplots of the data after the nonparanormal transformation (the details of the nonparanormal transformation are explained in the nonparametric graphical model chapter). In this analysis, we use the subset of the data between January 1, 2003 and January 1, 2008, before the onset of the "financial crisis." There are altogether n = 1,257 data points and d = 452 dimensions.
Figure 7: Boxplots of Xt = log(St /St−1 ) for 10 stocks. As can be seen, the original data
has many outliers, which is addressed by the nonparanormal transformation on the re-scaled
data (right).
The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sec-
tors, including Consumer Discretionary (70 stocks), Energy (37 stocks), Financials
(74 stocks), Consumer Staples (35 stocks), Telecommunications Services (6 stocks),
Health Care (46 stocks), Industrials (59 stocks), Information Technology (64 stocks),
Materials (29 stocks), and Utilities (32 stocks). It is expected that stocks from the same
GICS sectors should tend to be clustered together, since stocks from the same GICS sector
tend to interact more with each other. In the graphs shown below, the nodes are colored
according to the GICS sector of the corresponding stock.
With the Gaussian assumption, we directly apply the Chow-Liu algorithm to obtain a full spanning tree of d − 1 = 451 edges. The resulting graph is shown in Figure 8. We see that the stocks from the same GICS sector are clustered very well.
To get a nonparametric version, we can simply plug in a nonparametric estimate of the mutual information. But for that matter, we might as well put in any measure of association, such as distance correlation.
Figure 8: Tree graph learned from S&P 500 stock data from Jan. 1, 2003 to Jan. 1, 2008.
The graph is estimated using the Chow-Liu algorithm under the Gaussian model. The nodes
are colored according to their GICS sector categories.
In that case, we see that a forest is just a correlation graph without cycles.
Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A,
B, and C be subsets of vertices. We say that C separates A and B if every path from
a node in A to a node in B passes through a node in C. Now consider a random vector
X = (X(1), . . . , X(d)) where Xj corresponds to node j in the graph. If A ⊂ {1, . . . , d} then
we write XA = (X(j) : j ∈ A).
A probability distribution P for a random vector X = (X(1), . . . , X(d)) may satisfy a range of different Markov properties with respect to a graph G = (V, E):
1. (Pairwise Markov property) For every pair of non-adjacent nodes s and t, Xs ⊥⊥ Xt | X_{V\{s,t}}.
2. (Local Markov property) For every node s, Xs ⊥⊥ X_{V\{N(s)∪{s}}} | X_{N(s)}, where N(s) denotes the neighbors of s.
3. (Global Markov property) For all disjoint subsets A, B, C ⊂ V such that C separates A and B, XA ⊥⊥ XB | XC.
The set of distributions that is globally Markov with respect to G is denoted by P(G).
Consider for example the graph in Figure 9. Here the set C separates A and B. Thus, a
distribution that satisfies the global Markov property for this graph must have the property
that the random variables in A are conditionally independent of the random variables in
B given the random variables C. This is seen to generalize the usual Markov property for
simple chains, where XA −→ XC −→ XB forms a Markov chain in case XA and XB are
independent given XC . A distribution that satisfies the global Markov property is said to
be a Markov random field or Markov network with respect to the graph. The local Markov
property is depicted in Figure 10.
[Figure 9: a graph on nodes 1-8 in which the set C separates the sets A and B.]
Figure 10: The local Markov property: Conditioned on its four neighbors X2 , X3 , X4 , and
X5 , node X1 is independent of the remaining nodes in the graph.
From the definitions, the relationships of different Markov properties can be characterized
as:
Proposition 19 For any undirected graph G and any distribution P , we have
global Markov property =⇒ local Markov property =⇒ pairwise Markov property.
Proof. The global Markov property implies the local Markov property because for each
node s ∈ V , its neighborhood N (s) separates {s} and V \ {N (s) ∪ {s}}. Assume next
that the local Markov property holds. Any t that is not adjacent to s is an element of V \ {N(s) ∪ {s}}. Therefore
N (s) ∪ [(V \ {N (s) ∪ {s}}) \ {t}] = V \ {s, t}, (20)
and it follows from the local Markov property that
Xs ⊥⊥ XV \{N (s)∪{s}} | XV \{s,t} . (21)
This implies Xs ⊥⊥ Xt | XV \{s,t} , which is the pairwise Markov property.
The next theorem, due to Pearl (1986), provides a sufficient condition for equivalence.

Theorem 20 Suppose that, for all disjoint subsets A, B, C, D ⊂ V,
if XA ⊥⊥ XB | X_{C∪D} and XA ⊥⊥ XC | X_{B∪D}, then XA ⊥⊥ X_{B∪C} | XD.    (22)
Then the pairwise Markov property implies the global Markov property, and hence all three Markov properties are equivalent.
Proof. It is enough to show that the pairwise Markov property implies the global Markov
property under the given condition. Let S, A, B ⊂ V with S separating A from B in the
graph G. Without loss of generality both A and B are assumed to be non-empty. The
proof can be carried out using backward induction on the number of nodes in S, denoted by
m = |S|. Let d = |V|. For the base case, if m = d − 2 then both A and B consist of a single vertex and the result follows from the pairwise Markov property.
Now assume that m < d − 2 and that separation implies conditional independence for all separating sets S with more than m nodes. We proceed in two cases: (i) A ∪ B ∪ S = V and (ii)
A∪B∪S ⊂V.
For case (i), we know that at least one of A and B must have more than one element.
Without loss of generality, we assume A has more than one element. If s ∈ A, then S ∪ {s}
separates A \ {s} from B and also S ∪ (A \ {s}) separates s from B. Thus by the induction
hypothesis
XA\{s} ⊥⊥ XB | XS∪{s}    and    Xs ⊥⊥ XB | XS∪(A\{s}).    (23)
Now the condition (22) implies XA ⊥⊥ XB | XS . For case (ii), we could choose s ∈ V \ (A ∪
B ∪ S). Then S ∪ {s} separates A and B, implying A ⊥⊥ B | S ∪ {s}. We then proceed in
two cases, either A ∪ S separates B from s or B ∪ S separates A from s. For both cases, the
condition (22) implies that A ⊥⊥ B | S.
Proposition 21 Let X = (X1 , . . . , Xd ) be a random vector with distribution P and joint
density p(x). If the joint density p(x) is positive and continuous with respect to a product
measure, then condition (22) holds.
Proof. Without loss of generality, it suffices to assume that d = 3. We want to show that
if X1 ⊥⊥ X2 | X3 and X1 ⊥⊥ X3 | X2 then X1 ⊥⊥ {X2 , X3 }. (24)
Since the density is positive and X1 ⊥⊥ X2 | X3 and X1 ⊥⊥ X3 | X2 , we know that there must
exist some positive functions f13 , f23 , g12 , g23 such that the joint density takes the following
factorization:
p(x1 , x2 , x3 ) = f13 (x1 , x3 )f23 (x2 , x3 ) = g12 (x1 , x2 )g23 (x2 , x3 ). (25)
Since the density is continuous and positive, we have
$$g_{12}(x_1, x_2) = \frac{f_{13}(x_1, x_3)\, f_{23}(x_2, x_3)}{g_{23}(x_2, x_3)}. \qquad (26)$$
For each fixed X3 = x03 , we see that g12 (x1 , x2 ) = h(x1 )`(x2 ) where h(x1 ) = f13 (x1 , x03 ) and
`(x2 ) = f23 (x2 , x03 )/g23 (x2 , x03 ). This implies that
p(x1 , x2 , x3 ) = h(x1 )`(x2 )g23 (x2 , x3 ) (27)
and hence X1 ⊥⊥ {X2 , X3 } as desired.
From Proposition 21, we see that for distributions with positive continuous densities, the global, local, and pairwise Markov properties are all equivalent. If a distribution P satisfies the global Markov property with respect to a graph G, we say that P is Markov to G.
Unlike a directed graph, which encodes a factorization of the joint probability distribution in terms of conditional probability distributions, an undirected graph encodes a factorization of the joint probability distribution in terms of clique potentials. Recall that a clique in a graph is a fully connected subset of vertices; thus, every pair of nodes in a clique is connected by an edge. A clique is a maximal clique if it is not contained in any larger clique.
Consider, for example, the graph shown in the right plot of Figure 11. The pairs {X4 , X5 }
and {X1 , X3 } form cliques; {X4 , X5 } is a maximal clique, while {X1 , X3 } is not maximal
since it is contained in a larger clique {X1 , X2 , X3 }.
A set of clique potentials {ψ_C(x_C) ≥ 0}_{C∈C} determines a probability distribution that factors with respect to the graph by normalizing:
$$p(x_1, \ldots, x_{|V|}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C). \qquad (28)$$
Figure 11: A directed graph encodes a factorization of the joint probability distribution in
terms of conditional probability distributions. An undirected graph encodes a factorization
of the joint probability distribution in terms of clique potentials.
The normalizing constant or partition function Z sums (or integrates) over all settings of the random variables:
$$Z = \int \prod_{C \in \mathcal{C}} \psi_C(x_C)\, dx_1 \cdots dx_{|V|}. \qquad (29)$$
Thus, the family of distributions represented by the undirected graph in Figure 11 can be
written as
p(x1 , x2 , x3 , x4 , x5 ) = ψ1,2,3 (x1 , x2 , x3 ) ψ1,4 (x1 , x4 ) ψ4,5 (x4 , x5 ). (30)
In contrast, the family of distributions represented by the directed graph in Figure 11 factors into conditional distributions, one for each node given its parents.
Theorem 22 For any undirected graph G = (V, E), a distribution P that factors with respect
to the graph also satisfies the global Markov property on the graph.
This implies that Ã ⊥⊥ B̃ | S and thus A ⊥⊥ B | S.
It is worth remembering that while we think of the set of maximal cliques as given in a
list, the problem of enumerating the set of maximal cliques in a graph is NP-hard, and the
problem of determining the largest maximal clique is NP-complete. However, many graphs
of interest in statistical analysis are sparse, with the number of cliques of size O(|V |).
Theorem 22 shows that factoring with respect to a graph implies the global Markov property. The next question is: under what conditions do the Markov properties imply factoring with respect to a graph? In fact, in the case where P has a positive and continuous density, the pairwise Markov property implies factoring with respect to the graph, so that all the Markov properties are equivalent. The result has been discovered by many authors but is usually attributed to Hammersley and Clifford, due to an unpublished manuscript of theirs from 1971, which treated the discrete case. The following result is usually referred to as the Hammersley-Clifford theorem; a proof appears in Besag (1974). The extension to the continuous case is left as an exercise.
Now, let
$$Q(x) = \log \frac{P(x)}{P(0)}. \qquad (34)$$
Recursively, we obtain
$$Q(x) = \sum_i \phi_i(x_i) + \sum_{i<j} \phi_{ij}(x_i, x_j) + \sum_{i<j<k} \phi_{ijk}(x_i, x_j, x_k) + \cdots + \phi_{12\cdots d}(x).$$
Writing x_{(1)} = (0, x_2, . . . , x_d), we have
$$Q(x) - Q(x_{(1)}) = \log \frac{P(x_1 \mid x_{\setminus 1})}{P(0 \mid x_{\setminus 1})} \qquad (38)$$
$$= \phi_1(x_1) + \sum_{i>1} \phi_{1i}(x_1, x_i) + \sum_{j>i>1} \phi_{1ij}(x_1, x_i, x_j) + \cdots + \phi_{12\cdots d}(x).$$
By the local Markov property, the left-hand side of (38) depends only on x_1 and the neighbors of node 1 in the graph. Thus, if k is not a neighbor of node 1, then the above expression does not depend on x_k.
In particular, φ1k (x1 , xk ) = 0, and more generally all φA (xA ) with 1 ∈ A and k ∈ A are
identically zero. Similarly, if i, j are not neighbors in the graph, then φA (xA ) = 0 for any
A containing i and j. Thus, φA 6= 0 only holds for the subsets A that form cliques in the
graph. Since it is obvious that exp(φA (x)) > 0, we finish the proof.
Since factoring with respect to the graph implies the global Markov property, we may sum-
marize this result as follows:
For strictly positive distributions, the global Markov property, the local Markov
property, and factoring with respect to the graph are equivalent.
That is, any strictly positive distribution that is Markov to G can be written as
$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),$$
where C is the set of all (maximal) cliques in G and Z is the normalization constant. This is called the Gibbs representation.
Directed graphical models are naturally viewed as generative; the graph specifies a straightforward (in principle) procedure for sampling from the underlying distribution. For instance, a sample from a distribution represented by the DAG in the left plot of Figure 12 can be drawn as follows:
X1 ∼ P (X1 ) (39)
X2 ∼ P (X2 ) (40)
X3 ∼ P (X3 ) (41)
X5 ∼ P (X5 ) (42)
X4 | X1 , X2 ∼ P (X4 | X1 , X2 ) (43)
X6 | X3 , X4 , X5 ∼ P (X6 | X3 , X4 , X5 ). (44)
As long as each of the conditional probability distributions can be efficiently sampled, the full model can be efficiently sampled as well. In contrast, there is no straightforward way to sample from a distribution in the family specified by an undirected graph. Instead one needs something like MCMC.
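As a small illustration of ancestral sampling from the DAG of Figure 12, here is an R sketch in which made-up Gaussian conditionals stand in for the factors P(· | ·) in (39)-(44); the particular conditional means are arbitrary assumptions.

```r
# Ancestral sampling: sample roots first, then each node given its parents.
r.dag <- function() {
  x1 <- rnorm(1); x2 <- rnorm(1); x3 <- rnorm(1); x5 <- rnorm(1)
  x4 <- rnorm(1, mean = x1 + x2)           # stand-in for P(X4 | X1, X2)
  x6 <- rnorm(1, mean = x3 + x4 + x5)      # stand-in for P(X6 | X3, X4, X5)
  c(x1, x2, x3, x4, x5, x6)
}
samples <- t(replicate(1000, r.dag()))
```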
Figure 12: A DAG and its corresponding moral graph. A probability distribution that factors
according to a DAG obeys the global Markov property on the undirected moral graph.
Generally, edges must be added to the skeleton of a DAG in order for the distribution to satisfy the global Markov property on the graph. Consider the example in Figure 12. Here the directed model has a distribution of the form
p(x) = p(x1) p(x2) p(x3) p(x5) p(x4 | x1, x2) p(x6 | x3, x4, x5).
The corresponding undirected graphical model has two maximal cliques, {X1, X2, X4} and {X3, X4, X5, X6}, and factors as
p(x) ∝ ψ_{124}(x1, x2, x4) ψ_{3456}(x3, x4, x5, x6).
More generally, let P be a probability distribution that is Markov to a DAG G. We define
the moralized graph of G as the following:
Definition 24 (Moral graph) The moral graph M of a DAG G is an undirected graph that
contains an undirected edge between two nodes Xi and Xj if (i) there is a directed edge
between Xi and Xj in G, or (ii) Xi and Xj are both parents of the same node.
Theorem 25 If P is Markov to a DAG G, then P satisfies the global Markov property with respect to the moral graph M of G.
Proof. This follows directly from the definition of Bayesian networks and Theorem 22.
Figure 13: These three graphs encode distributions with identical independence relations.
Conditioned on variable XC , the variables XA and XB are independent; thus C separates A
and B in the undirected graph.
Example 26 (Basic Directed and Undirected Graphs) To illustrate some basic cases, con-
sider the graphs in Figure 13. Each of the top three graphs encodes the same family of
probability distributions. In the two directed graphs, by d-separation the variables XA and
XB are independent conditioned on the variable XC . In the corresponding undirected graph,
which simply removes the arrows, node C separates A and B.
The two graphs in Figure 14 provide an example of a directed graph which encodes a set
of conditional independence relationships that can not be perfectly represented by the cor-
responding moral graph. In this case, for the directed graph the node C is a collider, and
deriving an equivalent undirected graph requires joining the parents by an edge. In the corre-
sponding undirected graph, A and B are not separated by C. However, in the directed graph,
XA and XB are marginally independent, such an independence relationship is lost in the
moral graph. Conversely, Figure 15 provides an undirected graph over four variables. There
is no directed graph over four variables that implies the same set of conditional independence
properties.
Figure 14: A directed graph whose conditional independence properties can not be perfectly
expressed by its undirected moral graph. In the directed graph, the node C is a collider;
therefore, XA and XB are not independent conditioned on XC . In the corresponding moral
graph, A and B are not separated by C. However, in the directed graph, we have the
independence relationship XA ⊥⊥ XB , which is missing in the moral graph.
Figure 15: This undirected graph encodes a family of distributions that cannot be represented
by a directed graph on the same set of nodes.
Figure 16: The top graph is a directed graph representing a hidden Markov model. The shaded nodes are observed, but the unshaded nodes, representing states in a latent Markov chain, are unobserved. Replacing the directed edges by undirected edges (bottom) does not change the independence relations.
The upper plot in Figure 16 shows the directed graph underlying a hidden Markov model.
There are no colliders in this graph, and therefore the undirected skeleton represents an
equivalent set of independence relations. Thus, hidden Markov models are equivalent to
hidden Markov fields with an underlying tree graph.
4.4 Faithfulness
The set of all distributions that are Markov to a graph G is denoted by P(G). To understand
P(G) more clearly, we introduce some more notation. Given a distribution P let I(P ) denote
all conditional independence statements that are true for P . For example, if P has density
p and p(x1 , x2 , x3 ) = p(x1 )p(x2 )p(x3 |x1 , x2 ) then I(P ) = {X1 ⊥⊥ X2 }. On the other hand, if
p(x1 , x2 , x3 ) = p(x1 )p(x2 , x3 ) then
I(P) = { X1 ⊥⊥ X2, X1 ⊥⊥ X3, X1 ⊥⊥ X2 | X3, X1 ⊥⊥ X3 | X2 }.
Similarly, given a graph G let I(G) denote all independence statements implied by the graph.
For example, if G is the graph in Figure 15, then
I(G) = { X1 ⊥⊥ X4 | {X2, X3}, X2 ⊥⊥ X3 | {X1, X4} }.
This result often leads to confusion since you might have expected that P(G) would be equal to {P : I(G) = I(P)}. But this is incorrect. For example, consider the undirected graph X1 − X2. In this case, I(G) = ∅ and P(G) consists of all distributions p(x1, x2). Since
P(G) consists of all distributions, it also includes distributions of the form p0 (x1 , x2 ) =
p0 (x1 )p0 (x2 ). For such a distribution we have I(P0 ) = {X1 ⊥⊥ X2 }. Hence, I(G) is a strict
subset of I(P0 ).
In fact, you can think of I(G) as the set of independence statements that are common to all P ∈ P(G). In other words,
$$I(G) = \bigcap \Big\{ I(P) : P \in \mathcal{P}(G) \Big\}. \qquad (48)$$
Every P ∈ P(G) has the independence properties in I(G). But some P ’s in P(G) might
have extra independence properties.
Define F(G) = {P : I(P) = I(G)}, the set of distributions that are faithful to G, and note that F(G) ⊂ P(G). A distribution P that is in P(G) but is not in F(G) is said to be unfaithful with respect to G. The independence relations expressed by G are correct for such a P; it's just that P has extra independence relations not expressed by G. Every distribution is Markov to some graph; for example, any distribution is Markov to the complete graph. But there exist distributions P that are not faithful to any graph. This means that there will be some independence relations of P that cannot be expressed using a graph.
Example 27 The directed graph in Figure 14 implies that XA ⊥⊥ XB but that XA and XB
are not independent given XC . There is no undirected graph G for (XA , XB , XC ) such that
I(G) contains XA ⊥⊥ XB but excludes XA ⊥⊥ XB | XC . The only way to represent P is to
use the complete graph. Then P is Markov to G since I(G) = ∅ ⊂ I(P ) = {XA ⊥⊥ XB }
but P is unfaithful to G since it has an independence relation not represented by G, namely,
{XA ⊥⊥ XB }.
Generally, the set of unfaithful distributions P(G) \ F(G) is a small set. In fact, it has
Lebesgue measure zero if we restrict ourselves to nice parametric families. However, these
unfaithful distributions are scattered throughout P(G) in a complex way; see Figure 18.
Figure 18: The blob represents the set P(G) of distributions that are Markov to some graph
G. The lines are a stylized representation of the members of P(G) that are not faithful to G.
Hence the lines represent the set P(G) \ F(G). These distributions have extra independence
relations not captured by the graph G. The set P(G) \ F(G) is small but these distributions
are scattered throughout P(G).
References

Gretton, Song, Fukumizu, Scholkopf, Smola (2008). A Kernel Statistical Test of Independence. NIPS.

Lyons, Russell (2013). Distance covariance in metric spaces. The Annals of Probability, 41, 3284-3305.

Szekely, Gabor J., Maria L. Rizzo, and Nail K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35, 2769-2794.
Exercises
1. Prove Equation (48).
2. Prove Theorem 22.
3. Prove Theorem 25.
4. State and prove the continuous version of Hammersley-Clifford theorem.
5. Prove that the undirected graph (the square) of Figure 15 represents a family of prob-
ability distributions that cannot be represented by a directed graph on the same set of
vertices.
6. Consider random variables (X1 , X2 , X3 , X4 ). Suppose the log-density is
log p(x) = ψΦ + ψ12 (x1 , x2 ) + ψ13 (x1 , x3 ) + ψ24 (x2 , x4 ) + ψ34 (x3 , x4 ). (59)
Find the ψ-terms for the log-linear expansion. Comment on the model.
8. Let X1 , . . . , X4 be binary. Draw the independence graphs corresponding to the follow-
ing log-linear models (where α > 0). Also, identify whether each is graphical and/or
hierarchical (or neither).
(a) log p(x) = α + 11x1 + 2x2 + 1.5x3 + 17x4
(b) log p(x) = α + 11x1 + 2x2 + 1.5x3 + 17x4 + 12x2 x3 + 78x2 x4 + 3x3 x4 + 32x2 x3 x4
(c) log p(x) = α + 11x1 + 2x2 + 1.5x3 + 17x4 + 12x2 x3 + 3x3 x4 + x1 x4 + 2x1 x2
(d) log p(x) = α + 5055x1 x2 x3 x4 .
9. This problem is based on the following graph:
X1
X2 X5
X3 X4
(a) Can you construct a DAG that has the same conditional independencies of the
form p(xi | xj , j 6= i) = p(xi | xi1 , xi2 ) as those implied by the above graph?
(b) For each of the following families of distributions, list any independence relations,
if any, that are implied in addition to the independence relations for a general
distribution that is Markov to the above graph.
(i) p(x) = (1/Z) ψ12(x1, x2) ψ23(x2, x3) ψ34(x3, x4) ψ45(x4, x5) ψ51(x5, x1)
(ii) p(x) = (1/Z) ψ12(x1, x2) ψ23(x2, x3) ψ34(x3, x4) ψ45(x4, x5)
(iii) p(x) = (1/Z) ψ12(x1, x2) ψ34(x3, x4) ψ5(x5)
(c) Now suppose that each Xi is binary, so that Xi ∈ {0, 1}, and consider the following
family of distributions:
p(x) ∝ ψθ (x1 , x2 ) ψθ (x2 , x3 ) ψθ (x3 , x4 ) ψθ (x4 , x5 ) ψθ (x5 , x1 )
where ψθ(x, y) = θ^{xy} with θ > 0. For this family of models, calculate the condi-
tional probability P (X1 = 1 | X3 = 1, X4 = 1) as a function of θ.
10. Let X1 , . . . , X6 be random variables with joint distribution of the form
p(x) ∝ f (x1 , x2 , x3 ) g(x3 , x4 ) g(x1 , x5 ) g(x4 , x5 ) f (x1 , x5 , x6 )
where f : R3 −→ R+ and g : R2 −→ R+ are arbitrary non-negative functions.
(a) What is the graph with the fewest edges that represents the independence relations
for this family of distributions?
(b) For each of the following sets of variables C ⊂ {X1 , . . . , X6 }, give non-empty sets
of variables A and B such that A ⊥⊥ B | C.
(i) C = {X1 , X4 }
(ii) C = {X1 , X3 }
(iii) C = {X1 , X3 , X4 }
(c) Now suppose that Xi ∈ {0, 1} are binary and that
f(x, y, z) = α^{xyz},    g(x, y) = α^{xy}.
Find the conditional probability P (X4 = 1 | X1 = 1, X3 = 1).
11. Let X = (X1 , X2 , X3 , X4 , X5 ) be a random vector distributed as X ∼ N (0, Σ) where
the covariance matrix Σ is given by
$$\Sigma = \frac{1}{15}\begin{pmatrix} 9 & -3 & -3 & -3 & -3 \\ -3 & 6 & 1 & 1 & 1 \\ -3 & 1 & 6 & 1 & 1 \\ -3 & 1 & 1 & 6 & 1 \\ -3 & 1 & 1 & 1 & 6 \end{pmatrix}, \quad \text{with inverse} \quad \Sigma^{-1} = \begin{pmatrix} 3 & 1 & 1 & 1 & 1 \\ 1 & 3 & 0 & 0 & 0 \\ 1 & 0 & 3 & 0 & 0 \\ 1 & 0 & 0 & 3 & 0 \\ 1 & 0 & 0 & 0 & 3 \end{pmatrix}.$$
Simulate n = 100 random vectors from this distribution. Fit the model
$$\log p(x) = \beta_0 + \sum_j \beta_j x_j + \sum_{k < \ell} \beta_{k\ell}\, x_k x_\ell$$
using maximum likelihood. Report your estimators. Use forward model selection with
BIC to choose a submodel. Compare the selected model to the true model.
13. Let X ∼ N (0, Σ) where X = (X(1), . . . , X(d))T , d = 10. Let Ω = Σ−1 and suppose
that Ω(i, i) = 1, Ω(i, i − 1) = .5, Ω(i − 1, i) = .5 and Ω(i, j) = 0 otherwise. Simulate 50
random vectors and use the glasso method to estimate the covariance matrix. Compare
your estimated graph to the true graph.
14. Let X = (X1 , X2 , X3 , X4 ) be a random vector satisfying
f (X) ∼ N (0, Σ)
where f(x) = (x_1^3, x_2^3, x_3^3, x_4^3) and the covariance matrix Σ is given by
$$\Sigma = \frac{1}{6}\begin{pmatrix} 4 & -2 & -2 & -2 \\ -2 & 4 & 1 & 1 \\ -2 & 1 & 4 & 1 \\ -2 & 1 & 1 & 4 \end{pmatrix}, \quad \text{with inverse} \quad \Sigma^{-1} = \begin{pmatrix} 3 & 1 & 1 & 1 \\ 1 & 2 & 0 & 0 \\ 1 & 0 & 2 & 0 \\ 1 & 0 & 0 & 2 \end{pmatrix}.$$
(a) Give an expression for the density of X.
(b) What is the graph for X, viewed as a graphical model?
(c) List the local Markov properties for this graphical model.
15. Let X = (X(1), . . . , X(d)) where each Xj ∈ {0, 1}. Consider the log-linear model
$$\log p(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j + \sum_{j<k} \beta_{jk} x_j x_k + \cdots + \sum_{j<k<\ell} \beta_{jk\ell}\, x_j x_k x_\ell + \cdots$$
where mi ∈ Z are integers. Find a set of integers m1 , . . . , md ∈ Z for which the function
fm is a probability density; i.e., fm is nonnegative and integrates to one.
17. Given X = (Y, Z) ∈ R^6, let X ∼ N(0, Σ) be a random Gaussian vector, where Y = (Y1, Y2) ∈ R^2 and Z = (Z1, Z2, Z3, Z4) ∈ R^4, with
$$\Sigma^{-1} = \Omega = \begin{pmatrix} A & B \\ B^T & C \end{pmatrix}$$
for a 2 × 2 block A, a 2 × 4 block B, and a 4 × 4 block C.