Robust Graph Dictionary Learning
ABSTRACT
Traditional Dictionary Learning (DL) aims to approximate data vectors as sparse
linear combinations of basis elements (atoms) and is widely used in machine
learning, computer vision, and signal processing. To extend DL to graphs,
Vincent-Cuaz et al. (2021) propose a method, called GDL, which describes the
topology of each graph with a pairwise relation matrix (PRM) and compares
PRMs via the Gromov-Wasserstein Discrepancy (GWD). However, GWD is sensitive
to structural noise in graphs, and this lack of robustness often excludes GDL
from a variety of real-world applications. This paper proposes an improved
graph dictionary learning algorithm based on a robust Gromov-Wasserstein dis-
crepancy (RGWD) which has theoretically sound properties and an efficient nu-
merical scheme. Based on such a discrepancy, our dictionary learning algorithm
can learn atoms from noisy graph data. Experimental results demonstrate that our
algorithm achieves good performance on both simulated and real-world datasets.
1 INTRODUCTION
Dictionary learning (DL) seeks to learn a set of basis elements (atoms) from data and approximates
data samples by sparse linear combinations of these basis elements (Mallat, 1999; Mairal et al.,
2009; Tošić and Frossard, 2011), which has numerous machine learning applications including di-
mensionality reduction (Feng et al., 2013; Wei et al., 2018), classification (Raina et al., 2007; Mairal
et al., 2008), and clustering (Ramirez et al., 2010; Sprechmann and Sapiro, 2010), to name a few.
Although DL has received significant attention, it mostly focuses on vectorized data of the same
dimension and is not amenable to graph data (Xu, 2020; Vincent-Cuaz et al., 2021; 2022). Many
exciting machine learning tasks use graphs to capture complex structures (Backstrom and Leskovec,
2011; Sadreazami et al., 2017; Naderializadeh et al., 2020; Jin et al., 2017; Agrawal et al., 2018).
DL for graphs is more challenging due to the lack of effective means to compare graphs. Specifi-
cally, evaluating the similarity between an observed graph and its approximation is difficult, since
they often have different numbers of nodes and the node correspondence across graphs is often
unknown (Xu, 2020; Vincent-Cuaz et al., 2021).
The seminal work of Vincent-Cuaz et al. (2021) proposes a DL method for graphs based on
the Gromov-Wasserstein Discrepancy (GWD), a variant of the Gromov-Wasserstein distance, which
compares probability distributions supported on different metric spaces using pairwise distances
(Mémoli, 2011). By expressing each graph as a probability
measure and capturing the graph topology with a pairwise relation matrix (PRM), comparing graphs
can be naturally formulated as computing the GWD, since both the node correspondence and the
discrepancy of the compared graphs are calculated (Peyré et al., 2016; Xu et al., 2019b; Chowdhury
and Mémoli, 2019). However, observed graphs often contain structural noise including spurious or
missing edges, which leads to the differences between the obtained PRMs and the true ones (Donnat
et al., 2018; Xu et al., 2019b). Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz
et al., 2022; Tran et al., 2022), the inaccuracies of PRMs may severely affect GWD and the effec-
tiveness of DL in real-world applications.
Contributions. To handle the inaccuracies of PRMs, this paper first proposes a novel robust
Gromov-Wasserstein discrepancy (RGWD) which adopts a minimax formulation. We prove that the
inner maximization problem has a closed-form solution and derive an efficient numerical scheme
to approximate RGWD. Under suitable assumptions, such a numerical scheme is guaranteed to find
a δ-stationary solution within O(1/δ²) iterations. We further prove that RGWD is lower bounded
and the lower bound is achieved if and only if two graphs are isomorphic. Therefore, RGWD can
be employed to compare graphs. RGWD also satisfies the triangle inequality, which is of independent
interest and allows numerous potential applications. A robust graph dictionary learning (RGDL)
algorithm is thereby developed to learn atoms from noisy graph data, which assesses the quality of
approximated graphs via RGWD. Numerical experiments on both synthetic and real-world datasets
demonstrate that RGDL achieves good performance.
The rest of the paper is organized as follows. In Sec. 2, a comprehensive review of the background
is given. Sec. 3 presents RGWD and the numerical approximation scheme for RGWD. RGDL is
delineated in Sec. 4. Empirical results are demonstrated in Sec. 5. We finally discuss related work
in Sec. 6.
2 PRELIMINARIES
2.1 OPTIMAL TRANSPORT
We first present the notation used throughout this paper and then review the definition of the
Gromov-Wasserstein distance that originates from the optimal transport theory (Villani, 2008; Peyre
and Cuturi, 2018).
Notation. We use bold lowercase symbols (e.g. x), bold uppercase letters (e.g. A), uppercase
calligraphic fonts (e.g. X ), and Greek letters (e.g. α) to denote vectors, matrices, spaces (sets), and
measures, respectively. 1_d ∈ R^d is the d-dimensional all-ones vector. Δ_d is the probability simplex
with d bins, namely the set of probability vectors Δ_d = {a ∈ R^d_+ | Σ_{i=1}^d a_i = 1}. A[i, :] and A[:, j]
are the i-th row and the j-th column of matrix A, respectively. Given a matrix A, we denote by ‖A‖_F
its Frobenius norm, by ‖A‖_0 its number of non-zero elements, and by ‖A‖_∞ its element-wise ℓ_∞
norm (i.e., ‖A‖_∞ = max_{ij} |A_{ij}|). The cardinality of a set A is denoted by |A|. The bracketed
notation ⟦n⟧ is shorthand for the integer set {1, 2, . . . , n}. A discrete measure α can be denoted by
α = Σ_{i=1}^m p_i δ_{x_i}, where δ_x is the Dirac at position x, i.e., a unit of mass infinitely concentrated at x.
Gromov-Wasserstein distance. Given two discrete measures α = Σ_{i=1}^m p_i δ_{x_i} supported on a metric space (X, d_X)
and β = Σ_{i'=1}^n q_{i'} δ_{y_{i'}} supported on a metric space (Y, d_Y), the 2-GW distance is defined as
$$ GW_2(\alpha, \beta) = \Big( \min_{T \in \Pi(p, q)} \sum_{i,j=1}^{m} \sum_{i',j'=1}^{n} D_{ii'jj'}^2\, T_{ii'} T_{jj'} \Big)^{\frac{1}{2}}, $$
where the feasible domain of the transport plan T = [T_{ii'}] is given by the set
$$ \Pi(p, q) = \big\{ T \in \mathbb{R}_+^{m \times n} \;\big|\; T \mathbf{1}_n = p,\; T^{\top} \mathbf{1}_m = q \big\}, $$
and D_{ii'jj'} calculates the difference between pairwise distances, i.e., D_{ii'jj'} = |d_X(x_i, x_j) −
d_Y(y_{i'}, y_{j'})| with x_1, x_2, . . ., x_m ∈ X and y_1, y_2, . . ., y_n ∈ Y.
In this subsection, we formalize the idea of comparing graphs with GWD, which addresses the
challenges that graphs often have different numbers of nodes and that the node correspondence is
unknown (Xu et al., 2019b; Xu, 2020; Vincent-Cuaz et al., 2021).
Pairwise relation and graph representation. Given a graph G with n nodes, assigning each node
an index i ∈ JnK, G can be expressed as a tuple (C, p), where C ∈ Rn×n is a matrix encoding
the pairwise relations (e.g. adjacency, shortest-path, Laplacian, or heat kernel) and p ∈ ∆n is a
probability vector modeling the relative importance of nodes within the graph (Peyré et al., 2016;
Xu et al., 2019b; Titouan et al., 2019; Vincent-Cuaz et al., 2022).
Gromov-Wasserstein Discrepancy GWD can be derived from the 2-GW distance by replacing
the metrics with pairwise relations (Xu et al., 2019b; Vincent-Cuaz et al., 2022). More specifically,
given the observed source graph G s and the target graph G t that can be expressed as (Cs , ps ) and
(Ct , pt ) respectively, GWD is defined as
$$ \mathrm{GWD}\big((C^s, p^s), (C^t, p^t)\big) = \Big( \min_{T \in \Pi(p^s, p^t)} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'})^2\, T_{ii'} T_{jj'} \Big)^{\frac{1}{2}}, $$
where ns and nt are the numbers of nodes of G s and G t respectively. GWD computes both a soft
assignment matrix between the nodes of the two graphs and a notion of discrepancy between them.
For conciseness, we abbreviate GWD((C^s, p^s), (C^t, p^t)) to GWD(C^s, C^t) in the sequel.
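To make the definition concrete, the following is a minimal sketch of computing GWD between two toy graphs with the POT library; the adjacency-matrix PRMs, the uniform node weights, and the `ot.gromov.gromov_wasserstein` calling convention of recent POT releases are assumptions rather than details taken from the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Two toy graphs described by symmetric PRMs (here: adjacency matrices).
Cs = np.array([[0., 1., 1.],
               [1., 0., 0.],
               [1., 0., 0.]])
Ct = np.array([[0., 1., 1., 0.],
               [1., 0., 1., 0.],
               [1., 1., 0., 1.],
               [0., 0., 1., 0.]])
ps = np.full(3, 1 / 3)   # node weights p^s (uniform)
pt = np.full(4, 1 / 4)   # node weights p^t (uniform)

# T is the soft node correspondence; log["gw_dist"] is the minimized squared objective.
T, log = ot.gromov.gromov_wasserstein(Cs, Ct, ps, pt, loss_fun="square_loss", log=True)
print("GWD:", np.sqrt(log["gw_dist"]))
print("row marginals of T:", T.sum(axis=1))   # should match p^s
```

The coupling T returned alongside the discrepancy is exactly the soft assignment between nodes mentioned above.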
Traditional DL approximates data vectors as sparse linear combinations of basis elements (atoms)
(Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011; Jiang et al., 2015), and is usually
formulated as
$$ \min_{D \in \mathcal{C},\, W} \sum_{k=1}^{K} \Big\| X[:, k] - \sum_{m=1}^{M} w^k_m D[:, m] \Big\|_2^2 + \lambda \Omega(w^k), \qquad (1) $$
where X ∈ Rd×K is the data matrix whose columns represent samples, the matrix D ∈ Rd×M
contains M atoms to learn and is constrained to the following set
C = {D ∈ R^{d×M} | ∀m ∈ ⟦M⟧, ‖D[:, m]‖_2 ≤ 1},
W ∈ R^{M×K} is the new representation of the data whose k-th column w^k = [w^k_m]_{m∈⟦M⟧} stores the
embedding of the k-th sample, and λΩ(w^k) promotes the sparsity of w^k. Such a formulation only
applies to vectorized data.
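For contrast with the graph setting discussed next, here is a small sketch of classical vectorized dictionary learning in the spirit of Eq. (1) using scikit-learn; note that scikit-learn stores samples as rows (the transpose of the convention above), and the data, atom count, and sparsity weight below are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))           # 200 samples of dimension d = 20 (rows = samples)

dl = DictionaryLearning(n_components=8,      # M = 8 atoms
                        alpha=0.1,           # sparsity weight, playing the role of lambda
                        transform_algorithm="lasso_lars",
                        transform_alpha=0.1,
                        random_state=0)
W = dl.fit_transform(X)                      # sparse codes, shape (200, 8)
D = dl.components_                           # atoms, shape (8, 20)
print("reconstruction error:", np.linalg.norm(X - W @ D))
```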
Recently, Xu (2020) proposes to approximate graphs via the highly non-linear GW barycenter. Specif-
ically, given a dataset of K graphs with PRMs {C^k}_{k∈⟦K⟧} such that C^k ∈ R^{n^k×n^k}, the basis
elements {C̄_m}_{m∈⟦M⟧} are learned by solving
$$ \min_{\{\bar{C}_m\}_{m=1}^{M},\, \{w^k\}_{k=1}^{K}} \sum_{k=1}^{K} \mathrm{GWD}^2\Big( C^k,\; \mathcal{B}\big(w^k, \{\bar{C}_m\}_{m=1}^{M}\big) \Big), $$
where w^k ∈ Δ_M is referred to as the embedding of the k-th graph G^k, and the GW barycenter
B(w^k, {C̄_m}_{m∈⟦M⟧}) is itself the solution of an inner optimization problem, which makes the
approximation highly non-linear and computationally demanding.
DL for graphs. To overcome the above computational issues, Vincent-Cuaz et al. (2021) propose
GDL, which approximates each graph as a weighted sum of PRMs and is formulated as
$$ \min_{\{\bar{C}_m\}_{m=1}^{M},\, \{w^k\}_{k=1}^{K}} \sum_{k=1}^{K} \mathrm{GWD}^2\Big( C^k,\; \sum_{m=1}^{M} w^k_m \bar{C}_m \Big) + \lambda \Omega(w^k), $$
where each atom C̄_m is an n_a × n_a matrix. In contrast to the ℓ_2 loss in Eq. (1), GWD is used to assess
the quality of the linear representation Σ_{m=1}^M w^k_m C̄_m for k ∈ ⟦K⟧. However, in real-world applications
the observed graphs often contain noisy edges or miss some edges (Clauset et al., 2008; Xu et al., 2019b;
Shi et al., 2019; Piccioli et al., 2022), which leads to inaccuracies in the PRMs C^k, that is, deviations
between C^k and the true PRM C^{k*}. Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz
et al., 2022; Tran et al., 2022), the quality of the learned dictionary may be severely affected.
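To illustrate how a graph is compared with its dictionary approximation in GDL, here is a sketch of evaluating one reconstruction term with POT; the random atoms, the embedding w, the uniform node weights, and the noisy PRM are illustrative placeholders, not the released GDL code.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
n_k, n_a, M = 12, 6, 3

# Illustrative dictionary of symmetric atoms and an embedding w on the simplex.
atoms = []
for _ in range(M):
    A = rng.random((n_a, n_a))
    atoms.append((A + A.T) / 2)
w = np.array([0.6, 0.3, 0.1])

A = rng.random((n_k, n_k))
C_k = (A + A.T) / 2                                      # observed (possibly noisy) PRM
C_rec = sum(w_m * atom for w_m, atom in zip(w, atoms))   # linear model sum_m w_m * atom_m

p_k = np.full(n_k, 1 / n_k)
p_bar = np.full(n_a, 1 / n_a)
loss = ot.gromov.gromov_wasserstein2(C_k, C_rec, p_k, p_bar, loss_fun="square_loss")
print("GWD^2 reconstruction loss:", loss)
```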
Definition 1 Given the observed source graph G^s and the target graph G^t, expressed as
(C^s, p^s) and (C^t, p^t) respectively, and a noise level ε ≥ 0, RGWD is defined by the solution to the following problem:
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) = \Big( \min_{T \in \Pi(p^s, p^t)}\; \max_{E \in \mathcal{U}_\varepsilon} f(T, E; C^s, C^t) \Big)^{\frac{1}{2}}, \quad \text{with } f(T, E; C^s, C^t) = \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'}, $$
and the perturbation E lies in the bounded set U_ε = {E | E = E^⊤ and ‖E‖_∞ ≤ ε}.

RGWD requires the sought transport plan to have low transportation cost for all perturbations E in
the set U_ε. For succinctness, we omit C^s and C^t in f(T, E; C^s, C^t) in the following.
The properties of RGWD are presented as follows. Firstly, although RGWD involves a non-convex
non-concave minimax optimization problem, the inner maximization problem has a closed-form
solution, which allows an efficient numerical scheme for RGWD. Secondly, RGWD has a lower
bound that is achieved if and only if the expressions of compared graphs are identical up to a per-
mutation, which implies RGWD can be employed to evaluate the similarity between one observed
graph and its approximation in DL. Thirdly, RGWD satisfies the triangle inequality, which allows
numerous potential applications including clustering (Elkan, 2003; HajKacem et al., 2019), metric
learning (Pitis et al., 2019), and Bayesian learning (Moore, 2000; Xiao et al., 2019). Finally, arbi-
trarily changing the node orders does not affect the value of RGWD. More formally, we state the
properties in the following theorem.
Theorem 1 Given two observed graphs G^s = (C^s, p^s) and G^t = (C^t, p^t) with n^s and n^t nodes
respectively, RGWD satisfies:

1. For any fixed T ∈ Π(p^s, p^t), the inner maximization max_{E∈U_ε} f(T, E) admits the closed-form solution E*(T) with entries
$$ E^*_{i'j'}(T) = \begin{cases} \varepsilon, & \text{if } \sum_{i,j=1}^{n^s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\varepsilon, & \text{otherwise.} \end{cases} $$

2. RGWD((C^s, p^s), (C^t, p^t), ε) ≥ ε, where the equality holds if and only if there exists a bijection π* : ⟦n^s⟧ → ⟦n^t⟧ such that
p^s_i = p^t_{π*(i)} for all i ∈ ⟦n^s⟧ and C^s_{ij} = C^t_{π*(i)π*(j)} for all i, j ∈ ⟦n^s⟧.

3. RGWD satisfies the triangle inequality.
4. RGWD is invariant to permutations of the node order, i.e., for all permutation matrices Q^s
and Q^t,
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) = \mathrm{RGWD}\big((Q^{s\top} C^s Q^s, Q^{s\top} p^s), (Q^{t\top} C^t Q^t, Q^{t\top} p^t), \varepsilon\big). $$
As is implied by Theorem 1, RGWD does not define a distance between metric-measure spaces.
Firstly, the identity axiom is not satisfied. Secondly, the symmetry generally does not hold either,
which we exemplify below.
We derive a gradient-based numerical scheme for RGWD by exploiting the property that the
inner maximization problem has a closed-form solution; the procedure is summarized in Algorithm 1. In
each iteration, the perturbation E_τ that solves the inner problem for the current T_τ is calculated in closed form. Then, the transport
plan is updated using projected gradient descent with the gradient
$$ \nabla_T f(T_\tau, E_\tau) = 2\, (C^s \odot C^s)\, T_\tau \mathbf{1}_{n^t} \mathbf{1}_{n^t}^\top + 2\, \mathbf{1}_{n^s} \mathbf{1}_{n^s}^\top T_\tau \big( (C^t + E_\tau) \odot (C^t + E_\tau) \big) - 4\, C^s T_\tau (C^t + E_\tau). $$
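The following sketch puts Algorithm 1 together under simplifying assumptions: small graphs, symmetric PRMs, a fixed step size, and SciPy's SLSQP solver standing in for the augmented Lagrangian projection of Appendix B. It illustrates the structure of the iteration and is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize


def project_onto_couplings(H, p, q):
    """Euclidean projection of H onto {T >= 0, T 1 = p, T^T 1 = q} (small-scale stand-in)."""
    m, n = H.shape
    cons = [{"type": "eq", "fun": lambda t: t.reshape(m, n).sum(axis=1) - p},
            {"type": "eq", "fun": lambda t: t.reshape(m, n).sum(axis=0) - q}]
    res = minimize(lambda t: 0.5 * np.sum((t - H.ravel()) ** 2),
                   x0=np.outer(p, q).ravel(),
                   jac=lambda t: t - H.ravel(),
                   bounds=[(0.0, None)] * (m * n),
                   constraints=cons, method="SLSQP")
    return res.x.reshape(m, n)


def rgwd_pgd(Cs, Ct, p, q, eps, n_iter=100, eta=0.01):
    """Approximate RGWD((Cs, p), (Ct, q), eps) with projected gradient descent."""
    T = np.outer(p, q)                                   # feasible initialization
    for _ in range(n_iter):
        # Closed-form inner maximization over E (Theorem 1).
        col = T.sum(axis=0)
        G = T.T @ Cs @ T - Ct * np.outer(col, col)
        E = np.where(G <= 0, eps, -eps)
        M = Ct + E
        # Gradient of f(T, E) with respect to T (Cs and M symmetric).
        row = T.sum(axis=1)
        grad = (2.0 * np.outer((Cs ** 2) @ row, np.ones_like(q))
                + 2.0 * np.outer(np.ones_like(p), (M ** 2) @ col)
                - 4.0 * Cs @ T @ M)
        T = project_onto_couplings(T - eta * grad, p, q)  # projected gradient step
    # Evaluate f(T, E) at the final iterate and return its square root.
    col = T.sum(axis=0)
    G = T.T @ Cs @ T - Ct * np.outer(col, col)
    M = Ct + np.where(G <= 0, eps, -eps)
    row = T.sum(axis=1)
    val = row @ (Cs ** 2) @ row - 2.0 * np.sum(Cs * (T @ M @ T.T)) + col @ (M ** 2) @ col
    return np.sqrt(max(val, 0.0)), T
```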
To present the convergence guarantee of Algorithm 1, we introduce the notion of the Moreau
envelope. The stationarity of any function h(x) can be quantified by the norm of the gradient of its
Moreau envelope h_λ(x) = min_{x'} h(x') + (1/(2λ)) ‖x − x'‖². The following theorem gives the
convergence rate of Algorithm 1; the proof is deferred to the appendix.
Theorem 2 Define φ(·) = max_{E∈U_ε} f(·, E). The output T̂ of Algorithm 1 with step-size η = γ/√(N+1)
satisfies
$$ \mathbb{E}\big[ \|\nabla \varphi_{1/2l}(\hat{T})\|^2 \big] \le 2\, \frac{\varphi_{1/2l}(T_0) - \min_{T \in \Pi(p^s, p^t)} \varphi(T) + l L^2 \gamma^2}{\gamma \sqrt{N+1}}, $$
where l = √2 max{10n³U₁² + 6n³U₁ε + 4nU₁U₂ + 4n³ε², 6n²U₁U₂ + 2U₂² + 4n²εU₂} and L =
√2 max{(4U₁+2ε)U₂²n³, 2(2U₁+ε)²U₂n³} with n = max{n^s, n^t}, U₁ = max{‖C^s‖_∞, ‖C^t‖_∞},
and U₂ = max{‖p^t‖₂, max_{T'∈Π(p^s,p^t)} ‖T'‖_F}.
When U₁ and ε are of the order O(1/n³), both l and L are of the order O(1), and Theorem 2 states that
a δ-stationary solution can be obtained within O(1/δ²) iterations. Note that we can multiply C^s, C^t,
and ε by the same number without affecting the resulting transport plan.
Solving wk . We now formulate the problem of obtaining the embedding of the k th graph G k when
the dictionary is fixed and the PRM is inaccurate. Given a dictionary {C̄_m}_{m∈⟦M⟧} where each C̄_m ∈
R^{n_a×n_a}, the embedding of G^k expressed by (C^k, p^k) is calculated by solving
$$ \min_{w^k \in \Delta_M} \mathrm{RGWD}^2\Big( \Big( \sum_{m=1}^{M} w^k_m \bar{C}_m,\; \bar{p} \Big),\; (C^k, p^k),\; \varepsilon \Big) - \lambda \|w^k\|^2, \qquad (4) $$
where λ ≥ 0 induces a negative quadratic regularization promoting sparsity on the simplex (Li
et al., 2020; Vincent-Cuaz et al., 2021). When w^k is fixed, T^k and E^k can be updated by
Algorithm 1, whose convergence is guaranteed by Theorem 2. For fixed T^k and E^k, the problem of
updating w^k is a non-convex problem that can be tackled by a conditional gradient algorithm. Note
that for non-convex problems, the conditional gradient algorithm is proved to converge to a local
stationary point (Lacoste-Julien, 2016). Such a procedure is described from Line 5 to Line 11 in
Algorithm 2, which we observe converges within tens of iterations empirically.
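The conditional gradient (Frank-Wolfe) step on the simplex can be sketched as follows; the quadratic surrogate objective stands in for the RGWD² term of Eq. (4), and the step-size schedule and iteration count are illustrative.

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, w0, n_iter=100):
    """Minimize a smooth objective over the probability simplex with Frank-Wolfe."""
    w = w0.copy()
    for it in range(n_iter):
        g = grad_fn(w)
        s = np.zeros_like(w)
        s[np.argmin(g)] = 1.0            # linear minimization oracle on the simplex: a vertex
        gamma = 2.0 / (it + 2.0)         # standard diminishing step size
        w = (1 - gamma) * w + gamma * s  # convex combination stays on the simplex
    return w

# Example with a quadratic surrogate g(w) = 0.5 w^T Q w - b^T w (illustrative only).
rng = np.random.default_rng(0)
Q = rng.random((3, 3)); Q = Q @ Q.T
b = rng.random(3)
w = frank_wolfe_simplex(lambda w: Q @ w - b, np.full(3, 1 / 3))
print(w, w.sum())
```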
Stochastic updates. To enhance computational efficiency, each atom is updated with stochastic
estimates of the gradient. At each stochastic update, b embedding learning problems are solved
independently for the current dictionary using the procedure stated above, where b is the size of the
sampled mini-batch. Each atom is then updated using the stochastic gradient given in Eq. (3). Note
that the symmetry of each atom is preserved as long as the initialized atom is symmetric, since the
stochastic gradients are symmetric.
5 EXPERIMENTS
This section provides empirical evidence that RGDL performs well in the unsupervised graph clus-
tering task on both synthetic and real-world datasets. The heat kernel matrix is employed for the
PRM since it captures both global and local topology and achieves good performance in many tasks
(Donnat et al., 2018; Tsitsulin et al., 2018; Chowdhury and Needham, 2021).
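As a concrete example of this choice of PRM, the following sketch builds the heat kernel matrix of a graph with NetworkX and SciPy; the diffusion time t and the Erdős-Rényi test graph are illustrative assumptions.

```python
import networkx as nx
import numpy as np
from scipy.linalg import expm

def heat_kernel_prm(G, t=1.0):
    """Heat kernel matrix exp(-t L), with L the combinatorial Laplacian of G."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return expm(-t * L)

G = nx.erdos_renyi_graph(10, 0.3, seed=0)
C = heat_kernel_prm(G, t=1.0)
print(C.shape, np.allclose(C, C.T))   # a symmetric PRM capturing multi-scale structure
```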
We first test RGDL in the graph clustering task on datasets simulated according to the well-studied
Stochastic Block Model (SBM) (Holland et al., 1983; Wang and Wong, 1987). RGDL is compared
against the following state-of-the-art graph clustering methods: (i) GDL (Vincent-Cuaz et al., 2021)
learns graph dictionaries via GWD; (ii) Gromov-Wasserstein Factorization (GWF) (Xu, 2020) that
approximates graphs via GW barycenters; (iii) Spectral Clustering (SC) of Shi and Malik (2000);
Stella and Shi (2003) applied to the matrix with each entry storing the GWD between two graphs.
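For illustration, baseline (iii) can be sketched as follows: pairwise GWDs between graphs, converted to affinities and fed to spectral clustering. The random PRMs, the kernel bandwidth, and the discrepancy-to-affinity conversion are assumptions, not the exact protocol of the cited works.

```python
import numpy as np
import ot
from sklearn.cluster import SpectralClustering

def pairwise_gwd(prms):
    """Matrix of pairwise GWDs between graphs given by their PRMs (uniform node weights)."""
    K = len(prms)
    D = np.zeros((K, K))
    for a in range(K):
        for b in range(a + 1, K):
            pa = np.full(prms[a].shape[0], 1 / prms[a].shape[0])
            pb = np.full(prms[b].shape[0], 1 / prms[b].shape[0])
            gw2 = ot.gromov.gromov_wasserstein2(prms[a], prms[b], pa, pb, loss_fun="square_loss")
            D[a, b] = D[b, a] = np.sqrt(max(gw2, 0.0))
    return D

rng = np.random.default_rng(0)
prms = []
for _ in range(8):
    n = int(rng.integers(6, 10))
    A = rng.random((n, n))
    prms.append((A + A.T) / 2)

D = pairwise_gwd(prms)
affinity = np.exp(-D / (D.mean() + 1e-12))   # turn discrepancies into similarities
labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0).fit_predict(affinity)
print(labels)
```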
Table 1: Average (stdev) ARI scores for the first scenario of synthetic datasets.
balanced unbalanced
σ 0.05 0.10 0.15 0.05 0.10 0.15
GDL 0.119(0.017) 0.031(0.012) 0.016(0.006) 0.049(0.019) 0.018(0.004) 0.018(0.001)
GWF 0.071(0.007) 0.034(0.003) 0.008(0.001) 0.052(0.020) 0.014(0.001) 0.015(0.001)
SC 0.057(0.002) 0.033(0.002) 0.010(0.001) 0.054(0.024) 0.015(0.004) 0.010(0.001)
RGDL(ε=0.01) 0.316(0.005) 0.161(0.005) 0.052(0.002) 0.246(0.013) 0.039(0.009) 0.024(0.001)
RGDL(ε=0.1) 0.853(0.003) 0.756(0.018) 0.439(0.015) 0.765(0.022) 0.694(0.046) 0.499(0.016)
RGDL(ε=0.2) 0.975(0.025) 0.879(0.023) 0.736(0.020) 0.866(0.023) 0.815(0.028) 0.770(0.016)
RGDL(ε=0.3) 0.975(0.025) 0.879(0.023) 0.869(0.013) 0.943(0.001) 0.916(0.027) 0.848(0.061)
RGDL(ε=10) 0.975(0.025) 0.950(0.000) 0.950(0.000) 0.943(0.001) 0.943(0.001) 0.943(0.001)
RGDL(ε=30) 0.781(0.046) 0.779(0.070) 0.728(0.085) 0.723(0.067) 0.698(0.057) 0.666(0.040)
Table 2: Average (stdev) ARI scores for the second scenario of synthetic datasets.
balanced unbalanced
ρ 0.00 0.05 0.10 0.00 0.05 0.10
GDL 0.260(0.020) 0.187(0.013) 0.024(0.004) 0.152(0.008) 0.018(0.001) 0.005(0.001)
GWF 0.182(0.006) 0.086(0.004) 0.027(0.005) 0.020(0.005) 0.016(0.002) 0.010(0.002)
SC 0.204(0.002) 0.129(0.017) 0.016(0.005) 0.129(0.008) 0.013(0.001) 0.011(0.005)
RGDL(ε=0.01) 0.451(0.014) 0.449(0.016) 0.449(0.016) 0.401(0.002) 0.401(0.002) 0.399(0.000)
RGDL(ε=0.1) 1.000(0.000) 1.000(0.000) 0.975(0.025) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=0.2) 1.000(0.000) 1.000(0.000) 0.975(0.025) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=0.3) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=10) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=30) 0.896(0.070) 0.888(0.080) 0.857(0.044) 0.864(0.057) 0.827(0.043) 0.816(0.074)
Dataset generation. We consider two scenarios of inaccuracies. In the first scenario (S1), Gaus-
sian noise is added into the heat kernel matrix of each graph. More specifically, denoting the
heat kernel matrix of the k th graph as Ck∗ for k ∈ JKK, the PRM available to DL methods is
Ck = Ck∗ + Z + Z> where each entry Zij of Z is sampled from the Gaussian distribution N (0, σ).
In the second scenario (S2), we randomly add ρ|E| edges into the graph and then randomly remove
ρ|E| edges while keeping the graph connected, where E is the edge set of the graph. The heat kernel
matrix is then constructed for the modified graph. These two scenarios allow us to study the
performance of RGDL under different scales of inaccuracies. In both S1 and S2, we generate two
datasets, both of which involve three generative structures (also used to label graphs): dense (only
one community), two communities, and three communities. We fix p = 0.1 as the probability of
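A sketch of the data-generation pipeline for scenario S1 is given below: SBM graphs, heat-kernel PRMs, and the perturbation C^k = C^{k*} + Z + Z^⊤ with Gaussian Z. The block sizes, intra- and inter-community probabilities, and diffusion time are illustrative and not necessarily the exact values used in the paper.

```python
import networkx as nx
import numpy as np
from scipy.linalg import expm

def sbm_heat_kernel_s1(sizes, p_in, p_out, sigma, t=1.0, seed=0):
    """Generate an SBM graph, its heat-kernel PRM, and an S1-perturbed observation."""
    rng = np.random.default_rng(seed)
    probs = np.full((len(sizes), len(sizes)), p_out)
    np.fill_diagonal(probs, p_in)
    G = nx.stochastic_block_model(sizes, probs.tolist(), seed=seed)
    L = nx.laplacian_matrix(G).toarray().astype(float)
    C_true = expm(-t * L)                               # noiseless heat-kernel PRM C^{k*}
    Z = rng.normal(0.0, sigma, size=C_true.shape)
    return C_true, C_true + Z + Z.T                     # observed PRM in scenario S1

C_true, C_obs = sbm_heat_kernel_s1(sizes=[10, 10], p_in=0.5, p_out=0.1, sigma=0.10)
print(np.linalg.norm(C_obs - C_true))
```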
Evaluating the performance. The learned embeddings of the graphs are used as input for a k-
means algorithm to cluster graphs. We use the well-known Adjusted Rand Index (ARI) (Hubert and
Arabie, 1985; Steinley, 2004) to evaluate the quality of the clustering by comparing it with the graph
labels. RGDL with varied ε is compared against GDL, GWF, and SC. RGDL, GDL, and GWF use
three atoms, each a 6×6 matrix. We run each method 5 times and report the average ARI
scores and the standard deviations. Experimental results reported in Table 1 and Table 2 demonstrate
RGDL outperforms baselines significantly.
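The evaluation step can be sketched as follows with scikit-learn; the embeddings and labels below are random placeholders for those produced by the dictionary learning methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
W = rng.random((60, 3))                 # K x M matrix of graph embeddings (placeholder)
labels_true = np.repeat([0, 1, 2], 20)  # ground-truth generative structures

labels_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(W)
print("ARI:", adjusted_rand_score(labels_true, labels_pred))
```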
Influence of ε. RGDL with moderate ε values outperforms the baseline methods by a large margin
and is more robust to the noise. Even when ε is relatively small (ε=0.01), RGDL achieves better
performance than the baselines. Increasing ε within a suitable range boosts ARI, and RGDL is not
sensitive to the choice of ε. If ε becomes too large, the performance of RGDL slowly decreases. In
practice, when a small quantity of labeled data is available, ε can be chosen according to the
performance on this small subset of data.
Figure 1: ARI scores vs. time on MUTAG (left), BZR (middle), and Peking 1 (right) datasets.
We further use RGDL to cluster real-world graphs. We consider widely utilized benchmark datasets
including MUTAG (Debnath et al., 1991), BZR (Sutherland et al., 2003), and Peking 1 (Pan et al.,
2016). The labels of the graphs are employed as the ground truth to evaluate the estimated clustering
results. For each dataset, the size of the atoms is set as the median of the numbers of graph nodes
following Vincent-Cuaz et al. (2021). The number of atoms M is set as M = β(# classes) where
β is chosen from {2, 3, 4, 5}. RGDL is run with different values of ε. Specifically, ε is chosen from
{U, 10⁻¹U, 10⁻²U, 10⁻³U} where U = max_{k∈⟦K⟧} ‖C^k‖_∞.
Results. The experimental results on real-world graphs are reported in Figure 1. RGDL with ε =
10⁻¹U or ε = 10⁻²U outperforms the baselines on all datasets, which implies that the observed graphs
contain structural noise and that [10⁻²U, 10⁻¹U] is often a suitable range for ε. The time required
for RGDL to converge is comparable to that of state-of-the-art methods. Since GDL, GWF,
and RGDL all output graph embeddings, we further visualize the embeddings generated by each of them
using PCA. As shown in Figure 2, the embeddings of the two types of graphs produced by RGDL
are less likely to be mixed together, which explains why RGDL achieves higher ARI values.
Figure 2: PCA-based visualization of embeddings produced by GDL (left), GWF (middle), and
RGDL (right), respectively, for the graphs in the MUTAG dataset. The color of each point indicates the
type of the corresponding graph. RGDL achieves the best clustering results.
6 RELATED WORK
Unbalanced OT. Enhancing the robustness of the optimal transport plan has received wide atten-
tion recently (Balaji et al., 2020; Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022; Chapel
et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). Originally, robust variants of classical
OT are proposed to compare distributions supported on the same metric space (Balaji et al., 2020;
Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022), which model the noise as outlier supports
and reduce the influence of outlier supports by allowing mass destruction and creation. Following
the same spirit, variants of the GW distance that also relax the marginal constraints are proposed
(Chapel et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). However, these methods do
not take the inaccuracies of the pairwise distances/similarities into account. The proposed RGWD
aims to handle such cases.
Graph representation learning and graph comparison. Comparing graphs often requires learn-
ing meaningful graph representations. Some methods manually design representations that are in-
variant under graph isomorphism (Bagrow and Bollt, 2019; Tam and Dunson, 2022). Such rep-
resentations are often sophisticated and require domain knowledge. Graph neural network-based
methods learn the representations of graphs in an end-to-end manner (Scarselli et al., 2008; Zhang
et al., 2018; Lee et al., 2018; Errica et al., 2019), which however requires a large amount of labeled
data. Another family of methods that uses graph representations implicitly is referred to as graph
kernels (Shervashidze et al., 2009; Vishwanathan et al., 2010). GWD and its variants based methods
can estimate the node correspondence and provide an interpretable discrepancy between compared
graphs (Xu et al., 2019b; Titouan et al., 2019; Barbe et al., 2020; Chapel et al., 2020). In this paper,
we propose a novel graph dictionary learning method based on a robust variant of GWD to learn
representations of graphs which are useful in downstream tasks.
Non-linear combination of atoms. Classic DL methods are linear in the sense that they attempt to
approximate each vectorized datum by a linear combination of a few basis elements. Recently, non-
linear operations have also been considered. To exploit the non-linear nature of data, autoencoder-
based methods encode data into low-dimensional vectors with one neural network and decode them
with another (Hinton and Salakhutdinov, 2006; Hu and Tan, 2018). Another family
of methods replaces linear combinations by geodesic interpolations (Boissard et al., 2011; Bigot
et al., 2013; Seguy and Cuturi, 2015; Schmitz et al., 2018). More closely related to our work,
Xu (2020) proposes to approximate graphs via the GW barycenter of graph atoms, which however
involves a complicated and computationally demanding optimization problem.
7 CONCLUSION
In this paper, we propose a novel graph dictionary learning algorithm that is robust to the structural
noise of graphs. We first propose a robust variant of GWD, referred to as RGWD, which involves
a minimax optimization problem. Exploiting the fact that the inner maximization problem has a
closed-form solution, an efficient numerical scheme is derived. Based on RGWD, a robust dictio-
nary learning algorithm for graphs called RGDL is derived to learn atoms from noisy graph data.
Numerical results on both simulated and real-world datasets demonstrate that RGDL achieves good
performance in the presence of structural noise.
REFERENCES
Monica Agrawal, Marinka Zitnik, and Jure Leskovec. Large-scale analysis of disease pathways in
the human interactome. In PACIFIC SYMPOSIUM ON BIOCOMPUTING 2018: Proceedings of
the Pacific Symposium, pages 111–122. World Scientific, 2018.
Lars Backstrom and Jure Leskovec. Supervised random walks: predicting and recommending links
in social networks. In Proceedings of the fourth ACM international conference on Web search
and data mining, pages 635–644, 2011.
James P Bagrow and Erik M Bollt. An information-theoretic, all-scales approach to comparing
networks. Applied Network Science, 4(1):1–15, 2019.
Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in
generative modeling and domain adaptation. Advances in Neural Information Processing Systems,
33:12934–12944, 2020.
Amélie Barbe, Marc Sebban, Paulo Gonçalves, Pierre Borgnat, and Rémi Gribonval. Graph diffu-
sion wasserstein distances. In ECML PKDD, 2020.
Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic pca in the wasserstein
space. arXiv preprint arXiv:1307.7721, 2013.
Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution’s template estimate
with wasserstein metrics. arXiv preprint arXiv:1111.5927, 2011.
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and
Trends® in Machine Learning, 8(3-4):231–357, 2015.
Laetitia Chapel, Gilles Gasso, et al. Partial optimal transport with applications on positive-unlabeled
learning. In NeurIPS, 2020.
Samir Chowdhury and Facundo Mémoli. The gromov–wasserstein distance between networks and
stable network invariants. Information and Inference: A Journal of the IMA, 8(4):757–787, 2019.
Samir Chowdhury and Tom Needham. Generalized spectral clustering via gromov-wasserstein
learning. In International Conference on Artificial Intelligence and Statistics, pages 712–720.
PMLR, 2021.
Aaron Clauset, Cristopher Moore, and M E J Newman. Hierarchical structure and the prediction of
missing links in networks. Nature, 453(7191):98–101, 2008.
Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Cor-
win Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro com-
pounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal
chemistry, 34(2):786–797, 1991.
Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embed-
dings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD international conference
on knowledge discovery & data mining, pages 1320–1329, 2018.
Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu
Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in
neural information processing systems, 32, 2019.
Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport:
Complexity by accelerated gradient descent is better than by sinkhorn’s algorithm. In Interna-
tional conference on machine learning, pages 1367–1376. PMLR, 2018.
Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the 20th
international conference on Machine Learning (ICML-03), pages 147–153, 2003.
Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neu-
ral networks for graph classification. In International Conference on Learning Representations,
2019.
Zhizhao Feng, Meng Yang, Lei Zhang, Yan Liu, and David Zhang. Joint discriminative dimensional-
ity reduction and dictionary learning for face recognition. Pattern Recognition, 46(8):2134–2143,
2013.
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’Cir, and Nadia Essoussi. Overview of
scalable partitional methods for big data clustering. In Clustering Methods for Big Data Analytics,
pages 1–23. Springer, 2019.
Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural
networks. science, 313(5786):504–507, 2006.
Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First
steps. Social networks, 5(2):109–137, 1983.
Junlin Hu and Yap-Peng Tan. Nonlinear dictionary learning with application to image classification.
Pattern Recognition, 75:282–291, 2018.
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218,
1985.
Wenhao Jiang, Feiping Nie, and Heng Huang. Robust dictionary learning with capped l1-norm. In
Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction
outcomes with weisfeiler-lehman network. Advances in neural information processing systems,
30, 2017.
Simon Lacoste-Julien. Convergence rate of frank-wolfe for non-convex objectives. arXiv preprint
arXiv:1607.00345, 2016.
Khang Le, Huy Nguyen, Quang M Nguyen, Tung Pham, Hung Bui, and Nhat Ho. On robust
optimal transport: Computational complexity and barycenter computation. Advances in Neural
Information Processing Systems, 34:21947–21959, 2021.
John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification using structural attention.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pages 1666–1674, 2018.
Ping Li, Syama Sundar Rangapuram, and Martin Slawski. Methods for sparse and low-rank recovery
under simplex constraints. Statistica Sinica, 30(2):557–577, 2020.
Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis Bach. Supervised
dictionary learning. Advances in neural information processing systems, 21, 2008.
Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for
sparse coding. In Proceedings of the 26th annual international conference on machine learning,
pages 689–696, 2009.
Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foun-
dations of computational mathematics, 11(4):417–487, 2011.
Andrew W Moore. The anchors hierarchy: using the triangle inequality to survive high dimensional
data. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages
397–405, 2000.
Debarghya Mukherjee, Aritra Guha, Justin M Solomon, Yuekai Sun, and Mikhail Yurochkin.
Outlier-robust optimal transport. In International Conference on Machine Learning, pages 7850–
7860. PMLR, 2021.
Navid Naderializadeh, Mark Eisen, and Alejandro Ribeiro. Wireless power control via counterfac-
tual optimization of graph neural networks. In 2020 IEEE 21st International Workshop on Signal
Processing Advances in Wireless Communications (SPAWC), pages 1–5. IEEE, 2020.
Sloan Nietert, Ziv Goldfeld, and Rachel Cummings. Outlier-robust optimal transport: Duality, struc-
ture, and statistical analysis. In International Conference on Artificial Intelligence and Statistics,
pages 11691–11719. PMLR, 2022.
Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long, and Chengqi Zhang. Task sensitive feature
exploration and learning for multitask graph classification. IEEE transactions on cybernetics, 47
(3):744–758, 2016.
Gabriel Peyre and Marco Cuturi. Computational optimal transport. arXiv: Machine Learning, 2018.
Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and
distance matrices. In ICML, 2016.
Giovanni Piccioli, Guilhem Semerjian, Gabriele Sicuro, and Lenka Zdeborová. Aligning random
graphs with a sub-tree similarity message-passing algorithm. Journal of Statistical Mechanics:
Theory and Experiment, 2022(6):063401, 2022.
Silviu Pitis, Harris Chan, Kiarash Jamali, and Jimmy Ba. An inductive bias for distances: Neural
nets that respect the triangle inequality. In International Conference on Learning Representations,
2019.
Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning:
transfer learning from unlabeled data. In Proceedings of the 24th international conference on
Machine learning, pages 759–766, 2007.
Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictio-
nary learning with structured incoherence and shared features. In 2010 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pages 3501–3508. IEEE, 2010.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
Morgan A Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi,
Gabriel Peyré, and Jean-Luc Starck. Wasserstein dictionary learning: Optimal transport-based
unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643–678,
2018.
Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the
optimal transport metric. Advances in Neural Information Processing Systems, 28, 2015.
Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré. The unbalanced gromov wasserstein
distance: Conic formulation and relaxation. Advances in Neural Information Processing Systems,
34:8766–8779, 2021.
Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Ef-
ficient graphlet kernels for large graph comparison. In Artificial intelligence and statistics, pages
488–495. PMLR, 2009.
Chuan Shi, Binbin Hu, Wayne Xin Zhao, and Philip S Yu. Heterogeneous information network
embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31
(2):357–370, 2019.
Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on
pattern analysis and machine intelligence, 22(8):888–905, 2000.
Pablo Sprechmann and Guillermo Sapiro. Dictionary learning and sparse coding for unsupervised
clustering. In 2010 IEEE international conference on acoustics, speech and signal processing,
pages 2042–2045. IEEE, 2010.
Douglas Steinley. Properties of the hubert-arabie adjusted rand index. Psychological methods, 9(3):
386, 2004.
Stella X. Yu and Jianbo Shi. Multiclass spectral clustering. In Computer Vision, IEEE International
Conference on, volume 2, pages 313–313. IEEE Computer Society, 2003.
Jeffrey J Sutherland, Lee A O’brien, and Donald F Weaver. Spline-fitting with a genetic algorithm:
A method for developing classification structure- activity relationships. Journal of chemical in-
formation and computer sciences, 43(6):1906–1915, 2003.
Edric Tam and David Dunson. Multiscale graph comparison via the embedded laplacian distance.
arXiv preprint arXiv:2201.12064, 2022.
Vayer Titouan, Nicolas Courty, Romain Tavenard, Chapel Laetitia, and Rémi Flamary. Optimal
transport for structured data with application on graphs. In ICML, 2019.
Ivana Tošić and Pascal Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2):
27–38, 2011.
Quang Huy Tran, Hicham Janati, Nicolas Courty, Rémi Flamary, Ievgen Redko, Pinar Demetci, and
Ritambhara Singh. Unbalanced co-optimal transport. arXiv preprint arXiv:2205.14923, 2022.
Anton Tsitsulin, Davide Mottin, Panagiotis Karras, Alexander Bronstein, and Emmanuel Müller.
Netlsd: hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pages 2347–2356, 2018.
Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media,
2008.
Cédric Vincent-Cuaz, Titouan Vayer, Rémi Flamary, Marco Corneli, and Nicolas Courty. Online
graph dictionary learning. In International Conference on Machine Learning, pages 10564–
10574. PMLR, 2021.
Cédric Vincent-Cuaz, Rémi Flamary, Marco Corneli, Titouan Vayer, and Nicolas Courty. Semi-
relaxed gromov-wasserstein divergence and applications on graphs. In International Confer-
ence on Learning Representations, 2022. URL https://openreview.net/forum?id=
RShaMexjc-x.
S Vishwanathan, N Schraudolph, R Kondor, and KM Borgwardt. Graph kernels. The Journal of
Machine Learning Research, 2010.
Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. Journal of the
American Statistical Association, 82(397):8–19, 1987.
Xian Wei, Hao Shen, Yuanxiang Li, Xuan Tang, Fengxiang Wang, Martin Kleinsteuber, and Yi Lu
Murphey. Reconstructible nonlinear dimensionality reduction via joint dictionary learning. IEEE
transactions on neural networks and learning systems, 30(1):175–189, 2018.
Teng Xiao, Jiaxin Ren, Zaiqiao Meng, Huan Sun, and Shangsong Liang. Dynamic bayesian metric
learning for personalized product search. In Proceedings of the 28th ACM International Confer-
ence on Information and Knowledge Management, pages 1693–1702, 2019.
Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable gromov-wasserstein learning for graph
partitioning and matching. In NeurIPS, 2019a.
Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin Duke. Gromov-wasserstein learning
for graph matching and node embedding. In ICML, 2019b.
Hongtengl Xu. Gromov-wasserstein factorization models for graph clustering. In Proceedings of
the AAAI conference on artificial intelligence, volume 34, pages 6478–6485, 2020.
Yangyang Xu. Iteration complexity of inexact augmented lagrangian methods for constrained con-
vex programming. Mathematical Programming, 185(1):199–244, 2021.
Pinar Yanardag and S Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, New
York, 2015.
Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning
architecture for graph classification. In Proceedings of the AAAI conference on artificial intelli-
gence, volume 32, 2018.
Tong Zhang, Yun Wang, Zhen Cui, Chuanwei Zhou, Baoliang Cui, Haikuan Huang, and Jian Yang.
Deep wasserstein graph discriminant learning for graph classification. In AAAI, pages 10914–
10922, 2021.
Appendix
The appendix is organized as follows. We first provide omitted proofs in the main paper in Sec.
A. Then, algorithmic details are presented in Sec. B. Finally, Sec. C gives additional experimental
results.
A OMITTED PROOFS
Theorem 1 Given two observed graphs G^s = (C^s, p^s) and G^t = (C^t, p^t) with n^s and n^t nodes
respectively, RGWD satisfies:

1. For any fixed T ∈ Π(p^s, p^t), the inner maximization max_{E∈U_ε} f(T, E) admits the closed-form solution E*(T) with entries
$$ E^*_{i'j'}(T) = \begin{cases} \varepsilon, & \text{if } \sum_{i,j=1}^{n^s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\varepsilon, & \text{otherwise.} \end{cases} $$

2. RGWD((C^s, p^s), (C^t, p^t), ε) ≥ ε, where the equality holds if and only if there exists a bijection π* : ⟦n^s⟧ → ⟦n^t⟧ such that
p^s_i = p^t_{π*(i)} for all i ∈ ⟦n^s⟧ and C^s_{ij} = C^t_{π*(i)π*(j)} for all i, j ∈ ⟦n^s⟧.

3. RGWD satisfies the triangle inequality.

4. RGWD is invariant to permutations of the node order, i.e., for all permutation matrices Q^s
and Q^t,
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) = \mathrm{RGWD}\big((Q^{s\top} C^s Q^s, Q^{s\top} p^s), (Q^{t\top} C^t Q^t, Q^{t\top} p^t), \varepsilon\big). $$
Note that when there exists a bijection π* : ⟦n^s⟧ → ⟦n^t⟧ such that p^s_i = p^t_{π*(i)} for all i ∈ ⟦n^s⟧ and
C^s_{ij} = C^t_{π*(i)π*(j)} for all i, j ∈ ⟦n^s⟧, choosing the transport plan T̂ = [T̂_{ii'}] where
$$ \hat{T}_{ii'} = \begin{cases} p^s_i, & \text{if } i' = \pi^*(i), \\ 0, & \text{otherwise,} \end{cases} $$
we have, for all i', j' ∈ ⟦n^t⟧,
$$ \sum_{i,j=1}^{n^s} \hat{T}_{ii'} \hat{T}_{jj'} \big( C^s_{ij} - C^t_{i'j'} \big) = \hat{T}_{\pi^{*-1}(i')i'}\, \hat{T}_{\pi^{*-1}(j')j'} \big( C^s_{\pi^{*-1}(i')\pi^{*-1}(j')} - C^t_{i'j'} \big) = 0, $$
which implies that E_{i'j'}(T̂) = ε for all i', j' ∈ ⟦n^t⟧. We then have
$$ \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, \hat{T}_{ii'} \hat{T}_{jj'} = \sum_{i,j=1}^{n^s} (C^s_{ij} - C^t_{\pi^*(i)\pi^*(j)} - \varepsilon)^2\, \hat{T}_{i\pi^*(i)} \hat{T}_{j\pi^*(j)} = \varepsilon^2. $$
Therefore, in such a case, RGWD((C^s, p^s), (C^t, p^t), ε) = ε. On the other hand, when such a
bijection does not exist,
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) \ge \sqrt{\varepsilon^2 + \min_{T \in \Pi(p^s, p^t)} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'})^2\, T_{ii'} T_{jj'}} \;>\; \varepsilon, $$
where the strict inequality is due to the fact that Σ_{i,j=1}^{n^s} Σ_{i',j'=1}^{n^t} (C^s_{ij} − C^t_{i'j'})² T_{ii'} T_{jj'} > 0.
(iii) Thirdly, we prove the triangle inequality. Given tuples (C¹, p¹), (C², p²), and (C³, p³)
with node numbers n₁, n₂, and n₃ respectively, let (T*¹², E*¹²), (T*²³, E*²³), and
(T*¹³, E*¹³) be the solutions of RGWD((C¹, p¹), (C², p²), ε), RGWD((C², p²), (C³, p³), ε),
and RGWD((C¹, p¹), (C³, p³), ε). Define T¹³ = [T¹³_{i₁i₃}] where
$$ T^{13}_{i_1 i_3} = \sum_{i_2=1}^{n_2} \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}. $$
Then we have
$$ \begin{aligned}
& \mathrm{RGWD}\big((C^1, p^1), (C^3, p^3), \varepsilon\big) \\
&\le \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, T^{13}_{i_1 i_3} T^{13}_{j_1 j_3} } \\
&= \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2 \sum_{i_2=1}^{n_2} \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}} \sum_{j_2=1}^{n_2} \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&= \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^2_{i_2 j_2} + C^2_{i_2 j_2} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}\, \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&\le \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^2_{i_2 j_2} \big)^2\, \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}\, \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&\quad + \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^2_{i_2 j_2} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}\, \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&= \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \big( C^1_{i_1 j_1} - C^2_{i_2 j_2} \big)^2\, T^{*12}_{i_1 i_2} T^{*12}_{j_1 j_2} } + \sqrt{ \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^2_{i_2 j_2} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, T^{*23}_{i_2 i_3} T^{*23}_{j_2 j_3} },
\end{aligned} $$
where the second inequality follows from the Minkowski inequality and the last equality uses the
marginal constraints Σ_{i_3} T^{*23}_{i_2 i_3} = p²_{i_2} and Σ_{i_1} T^{*12}_{i_1 i_2} = p²_{i_2}.
(iv) Finally, we prove the invariance to permutations of the node order. Denote the solution to the
objective of RGWD by T* = [T*_{ii'}] and E* = [E*_{i'j'}], which implies
$$ E^*_{i'j'} = \begin{cases} \varepsilon, & \text{if } \sum_{i,j=1}^{n^s} T^*_{ii'} T^*_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\varepsilon, & \text{otherwise,} \end{cases} $$
and
$$ \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E^*_{i'j'})^2\, T^*_{ii'} T^*_{jj'} \;\le\; \max_{E \in \mathcal{U}_\varepsilon} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'} $$
for all T ∈ Π(p^s, p^t). The two permutation operations can be equivalently denoted as two bijections
π^s and π^t. Denote
$$ \begin{aligned}
\tilde{C}^s &= [\tilde{C}^s_{ij}] \text{ where } \tilde{C}^s_{ij} = C^s_{\pi^{s-1}(i)\pi^{s-1}(j)}, \\
\tilde{C}^t &= [\tilde{C}^t_{i'j'}] \text{ where } \tilde{C}^t_{i'j'} = C^t_{\pi^{t-1}(i')\pi^{t-1}(j')}, \\
\tilde{E}^* &= [\tilde{E}^*_{i'j'}] \text{ where } \tilde{E}^*_{i'j'} = E^*_{\pi^{t-1}(i')\pi^{t-1}(j')}, \\
\tilde{T}^* &= [\tilde{T}^*_{ii'}] \text{ where } \tilde{T}^*_{ii'} = T^*_{\pi^{s-1}(i)\pi^{t-1}(i')}.
\end{aligned} $$
We first prove that Ẽ* solves the inner maximization problem for T̃*. For all i', j' ∈ ⟦n^t⟧, when
Σ_{ij} T̃*_{ii'} T̃*_{jj'} (C̃^s_{ij} − C̃^t_{i'j'}) ≤ 0, we have Σ_{ij} T*_{iπ^{t-1}(i')} T*_{jπ^{t-1}(j')} (C^s_{ij} − C^t_{π^{t-1}(i')π^{t-1}(j')}) ≤ 0,
which is consistent with Ẽ*_{i'j'} = ε. The case when Σ_{ij} T̃*_{ii'} T̃*_{jj'} (C̃^s_{ij} − C̃^t_{i'j'}) > 0 is similar. Since
$$ \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E^*_{i'j'})^2\, T^*_{ii'} T^*_{jj'} = \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (\tilde{C}^s_{ij} - \tilde{C}^t_{i'j'} - \tilde{E}^*_{i'j'})^2\, \tilde{T}^*_{ii'} \tilde{T}^*_{jj'} $$
and
$$ \max_{E \in \mathcal{U}_\varepsilon} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'} = \max_{E \in \mathcal{U}_\varepsilon} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (\tilde{C}^s_{ij} - \tilde{C}^t_{i'j'} - E_{i'j'})^2\, \tilde{T}_{ii'} \tilde{T}_{jj'}, $$
where T̃_{ii'} = T_{π^{s-1}(i)π^{t-1}(i')}, T̃* and Ẽ* solve the optimization problem of
RGWD((Q^{s⊤} C^s Q^s, Q^{s⊤} p^s), (Q^{t⊤} C^t Q^t, Q^{t⊤} p^t), ε).
Proof: (i) We first prove that f(·) is L-Lipschitz. For all T, T' ∈ Π(p^s, p^t) and E, E' ∈ U_ε,
$$ \begin{aligned}
& f(T, E) - f(T', E') \\
&\le \Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2 T_{ii'} T_{jj'} - \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T_{jj'} \Big| \\
&\quad + \Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T_{jj'} - \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T'_{jj'} \Big| \\
&\quad + \Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T'_{jj'} - \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T'_{ii'} T'_{jj'} \Big|.
\end{aligned} $$
The second term on the right-hand side can be bounded as
$$ \begin{aligned}
\Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2\, T_{ii'} \big( T_{jj'} - T'_{jj'} \big) \Big|
&\le \sum_{iji'j'} |C^s_{ij} - C^t_{i'j'} - E'_{i'j'}|^2\, T_{ii'}\, \big| T_{jj'} - T'_{jj'} \big| \\
&\le (2U_1 + \varepsilon)^2 U_2 \sum_{iji'j'} \big| T_{jj'} - T'_{jj'} \big| \\
&\le (2U_1 + \varepsilon)^2 U_2\, n^2 \sum_{jj'} \big| T_{jj'} - T'_{jj'} \big| \\
&\le (2U_1 + \varepsilon)^2 U_2\, n^3\, \| T - T' \|_F,
\end{aligned} $$
and the third term can be bounded analogously, while the first term admits a similar bound in ‖E − E'‖_F.
Combining the three bounds yields
$$ f(T, E) - f(T', E') \le L \sqrt{ \|E - E'\|_F^2 + \|T - T'\|_F^2 }. $$
(ii) Now we prove that f(·) is l-smooth, which requires finding a constant l satisfying
$$ \left\| \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T,E)\big) \\ \mathrm{vec}\big(\nabla_E f(T,E)\big) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T',E')\big) \\ \mathrm{vec}\big(\nabla_E f(T',E')\big) \end{bmatrix} \right\|_2 \le l \left\| \begin{bmatrix} \mathrm{vec}(T) \\ \mathrm{vec}(E) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}(T') \\ \mathrm{vec}(E') \end{bmatrix} \right\|_2, $$
where vec(X) denotes the vectorization of matrix X and stacking two vectors denotes their concatenation. Since the left-hand side satisfies
$$ \left\| \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T,E)\big) \\ \mathrm{vec}\big(\nabla_E f(T,E)\big) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T',E')\big) \\ \mathrm{vec}\big(\nabla_E f(T',E')\big) \end{bmatrix} \right\|_2 = \sqrt{ \|\nabla_T f(T,E) - \nabla_T f(T',E')\|_F^2 + \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F^2 } \le \|\nabla_T f(T,E) - \nabla_T f(T',E')\|_F + \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F, $$
and the right-hand side satisfies
$$ l \left\| \begin{bmatrix} \mathrm{vec}(T) \\ \mathrm{vec}(E) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}(T') \\ \mathrm{vec}(E') \end{bmatrix} \right\|_2 = l \sqrt{ \|E - E'\|_F^2 + \|T - T'\|_F^2 } \ge \frac{l}{\sqrt{2}} \|E - E'\|_F + \frac{l}{\sqrt{2}} \|T - T'\|_F, $$
Since ∇_E f(T, E) = 2(E + C^t) ⊙ (p^t p^{t⊤}) − 2 T^⊤ C^s T, the difference ‖∇_E f(T, E) − ∇_E f(T', E')‖_F can be bounded
as follows:
$$ \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F \le \|2(E - E')\|_F\, \|p^t\|_F^2 + 4 \|T\|_F\, \|C^s\|_F\, \|T - T'\|_F \le 2 U_2^2 \|E - E'\|_F + 4 n U_1 U_2 \|T - T'\|_F. $$
Combining the above four relations, we have
$$ \|\nabla_T f(T,E) - \nabla_T f(T',E')\|_F + \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F \le \max\{ 10 n^3 U_1^2 + 6 n^3 U_1 \varepsilon + 4 n U_1 U_2 + 4 n^3 \varepsilon^2,\; 6 n^2 U_1 U_2 + 2 U_2^2 + 4 n^2 \varepsilon U_2 \}\, \big( \|E - E'\|_F + \|T - T'\|_F \big), $$
which gives the claimed value of l.
Theorem 2 Define φ(·) = max_{E∈U_ε} f(·, E). The output T̂ of Algorithm 1 with step-size η = γ/√(N+1)
satisfies
$$ \mathbb{E}\big[ \|\nabla \varphi_{1/2l}(\hat{T})\|^2 \big] \le 2\, \frac{\varphi_{1/2l}(T_0) - \min_{T \in \Pi(p^s, p^t)} \varphi(T) + l L^2 \gamma^2}{\gamma \sqrt{N+1}}, $$
where l = √2 max{10n³U₁² + 6n³U₁ε + 4nU₁U₂ + 4n³ε², 6n²U₁U₂ + 2U₂² + 4n²εU₂} and L =
√2 max{(4U₁+2ε)U₂²n³, 2(2U₁+ε)²U₂n³} with n = max{n^s, n^t}, U₁ = max{‖C^s‖_∞, ‖C^t‖_∞},
and U₂ = max{‖p^t‖₂, max_{T'∈Π(p^s,p^t)} ‖T'‖_F}.
Proof: By the smoothness of f(·), for any T̃ ∈ Π(p^s, p^t) and T_τ from Algorithm 1, we have
$$ \varphi(\tilde{T}) \ge f(\tilde{T}, E_\tau) \ge f(T_\tau, E_\tau) + \langle \nabla_T f(T_\tau, E_\tau), \tilde{T} - T_\tau \rangle - \frac{l}{2} \|\tilde{T} - T_\tau\|_F^2 = \varphi(T_\tau) + \langle \nabla_T f(T_\tau, E_\tau), \tilde{T} - T_\tau \rangle - \frac{l}{2} \|\tilde{T} - T_\tau\|_F^2. \qquad (6) $$
Let T̂_τ = argmin_{T∈Π(p^s,p^t)} φ(T) + l ‖T − T_τ‖²_F. We have
$$ \cdots \ge l \|T_\tau - \hat{T}_\tau\|_F^2 = \frac{1}{4l} \|\nabla \varphi_{1/2l}(T_\tau)\|_F^2. $$
Plugging this into (7) and combining Lemma 3 proves the result.
B ALGORITHMIC DETAILS
The Projected Gradient Descent (PGD) consists of the following three steps in each iteration τ.

Find E_τ that maximizes f(T_τ, E). By Theorem 1, we need to calculate an auxiliary matrix
$$ G = T_\tau^\top C^s T_\tau - C^t \odot \big( T_\tau^\top \mathbf{1}_{n^s} \mathbf{1}_{n^s}^\top T_\tau \big), $$
whose (i', j')-th entry equals Σ_{i,j} [T_τ]_{ii'} [T_τ]_{jj'} (C^s_{ij} − C^t_{i'j'}); E_τ is then set to ε wherever G is non-positive and to −ε elsewhere.

Gradient step. Compute H_τ = T_τ − η ∇_T f(T_τ, E_τ), where the gradient of f with respect to T is evaluated at the current iterate.
Projection into the feasible domain. This requires solving the following problem
$$ \min_{T \ge 0} \frac{1}{2} \|T - H_\tau\|_F^2, \quad \text{s.t. } T \mathbf{1}_n = p, \; T^\top \mathbf{1}_m = q. $$
This optimization problem has a strongly convex objective and linear constraints, and hence can be
solved efficiently via the Augmented Lagrangian Method with computational complexity O(n² |log ρ| / ρ^{1/2}) (Xu,
2021), where ρ measures the optimality, that is, the violation of the two linear constraints. When
ρ = O(1/n²), this step also has cubic cost if we ignore the log term.

Therefore, the overall complexity of PGD for obtaining a δ-stationary solution is O(n³/δ² + n² |log ρ| / (ρ^{1/2} δ²)).
C ADDITIONAL EXPERIMENTS
C.1 ADDITIONAL EXPERIMENTAL RESULTS OF GRAPH CLUSTERING
Sensitivity analysis of λ. As discussed in Li et al. (2020); Vincent-Cuaz et al. (2021), the negative
quadratic term can promote the sparsity of graph embeddings. We further conduct a sensitivity
analysis of λ by varying its value in {0, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. As shown in Table 3,
λ ∈ [10⁻⁴, 10⁻²] often yields good performance. The experiments in the main paper are run with
λ = 10⁻³.
Table 3: ARI scores of RGDL with varied λ’s.
The learned embeddings of graphs can also be used in the graph classification task. RGDL is thus
compared against GDL (Vincent-Cuaz et al., 2021), GWF (Xu, 2020), and other state-of-the-art
graph classification methods including WGDL (Zhang et al., 2021) and GNTK (Du et al., 2019)
on the benchmark datasets MUTAG (Debnath et al., 1991), IMDB-B, and IMDB-M (Yanardag and
Vishwanathan, 2015). RGDL, GDL, and GWF use 3-NN as the classifier due to its simplicity. We
perform a 10-fold nested cross validation (using 9 folds for training, 1 for testing, and reporting the
average accuracy of this experiment repeated 10 times) by keeping the same folds across methods.
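A sketch of this classification protocol with scikit-learn is given below; the embeddings and labels are random placeholders, and the plain cross-validation loop is a simplification of the nested procedure described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
W = rng.random((188, 6))                 # embeddings of the K graphs (placeholder)
y = rng.integers(0, 2, size=188)         # binary labels (placeholder)

clf = KNeighborsClassifier(n_neighbors=3)            # 3-NN classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, W, y, cv=cv)
print("mean accuracy:", scores.mean())
```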
The results are reported in Table 4 and RGDL outperforms or matches state-of-the-art methods.
RGDL outperforms GDL and GWF significantly, which indicates the necessity of taking into ac-
count the structural noise of observed graphs. Although WGDL and GNTK achieve similar performance,
they are more computationally and memory demanding due to their use of graph neural networks.
21