Robust Graph Dictionary Learning
ABSTRACT
Traditional Dictionary Learning (DL) aims to approximate data vectors as sparse
linear combinations of basis elements (atoms) and is widely used in machine
learning, computer vision, and signal processing. To extend DL to graphs,
Vincent-Cuaz et al. (2021) propose a method, called GDL, which describes the
topology of each graph with a pairwise relation matrix (PRM) and compares
PRMs via the Gromov-Wasserstein Discrepancy (GWD). However, GWD is sensitive
to structural noise in graphs, and this lack of robustness often excludes GDL
from a variety of real-world applications. This paper proposes an improved
graph dictionary learning algorithm based on a robust Gromov-Wasserstein dis-
crepancy (RGWD) which has theoretically sound properties and an efficient nu-
merical scheme. Based on such a discrepancy, our dictionary learning algorithm
can learn atoms from noisy graph data. Experimental results demonstrate that our
algorithm achieves good performance on both simulated and real-world datasets.
1 INTRODUCTION
Dictionary learning (DL) seeks to learn a set of basis elements (atoms) from data and approximates
data samples by sparse linear combinations of these basis elements (Mallat, 1999; Mairal et al.,
2009; Tošić and Frossard, 2011), which has numerous machine learning applications including di-
mensionality reduction (Feng et al., 2013; Wei et al., 2018), classification (Raina et al., 2007; Mairal
et al., 2008), and clustering (Ramirez et al., 2010; Sprechmann and Sapiro, 2010), to name a few.
Although DL has received significant attention, it mostly focuses on vectorized data of the same
dimension and is not amenable to graph data (Xu, 2020; Vincent-Cuaz et al., 2021; 2022). Many
exciting machine learning tasks use graphs to capture complex structures (Backstrom and Leskovec,
2011; Sadreazami et al., 2017; Naderializadeh et al., 2020; Jin et al., 2017; Agrawal et al., 2018).
DL for graphs is more challenging due to the lack of effective means to compare graphs. Specifi-
cally, evaluating the similarity between an observed graph and its approximation is difficult, since
they often have different numbers of nodes and the node correspondence across graphs is often
unknown (Xu, 2020; Vincent-Cuaz et al., 2021).
The seminal work of Vincent-Cuaz et al. (2021) proposes a DL method for graphs based on
the Gromov-Wasserstein Discrepancy (GWD), a variant of the Gromov-Wasserstein distance, which
compares probability distributions supported on different metric spaces using pairwise distances
(Mémoli, 2011). By expressing each graph as a probability
measure and capturing the graph topology with a pairwise relation matrix (PRM), comparing graphs
can be naturally formulated as computing the GWD, since both the node correspondence and the
discrepancy of the compared graphs are calculated (Peyré et al., 2016; Xu et al., 2019b; Chowdhury
and Mémoli, 2019). However, observed graphs often contain structural noise including spurious or
missing edges, which leads to the differences between the obtained PRMs and the true ones (Donnat
et al., 2018; Xu et al., 2019b). Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz
et al., 2022; Tran et al., 2022), the inaccuracies of PRMs may severely affect GWD and the effec-
tiveness of DL in real-world applications.
Contributions. To handle the inaccuracies of PRMs, this paper first proposes a novel robust
Gromov-Wasserstein discrepancy (RGWD) which adopts a minimax formulation. We prove that the
inner maximization problem has a closed-form solution and derive an efficient numerical scheme
to approximate RGWD. Under suitable assumptions, such a numerical scheme is guaranteed to find
a δ-stationary solution within O(1/δ²) iterations. We further prove that RGWD is lower bounded
and the lower bound is achieved if and only if two graphs are isomorphic. Therefore, RGWD can
be employed to compare graphs. RGWD also satisfies the triangle inequality, which is of independent
interest and allows numerous potential applications. A robust graph dictionary learning (RGDL)
algorithm is thereby developed to learn atoms from noisy graph data, which assesses the quality of
approximated graphs via RGWD. Numerical experiments on both synthetic and real-world datasets
demonstrate that RGDL achieves good performance.
The rest of the paper is organized as follows. In Sec. 2, a comprehensive review of the background
is given. Sec. 3 presents RGWD and the numerical approximation scheme for RGWD. RGDL is
delineated in Sec. 4. Empirical results are demonstrated in Sec. 5. We finally discuss related work
in Sec. 6.
2 PRELIMINARIES
2.1 OPTIMAL TRANSPORT
We first present the notation used throughout this paper and then review the definition of the
Gromov-Wasserstein distance that originates from the optimal transport theory (Villani, 2008; Peyre
and Cuturi, 2018).
Notation. We use bold lowercase symbols (e.g. x), bold uppercase letters (e.g. A), uppercase
calligraphic fonts (e.g. X ), and Greek letters (e.g. α) to denote vectors, matrices, spaces (sets), and
measures, respectively. 1_d ∈ R^d is the d-dimensional all-ones vector. Δ_d is the probability simplex
with d bins, namely the set of probability vectors Δ_d = {a ∈ R^d_+ | Σ_{i=1}^d a_i = 1}. A[i, :] and A[:, j]
are the i-th row and the j-th column of matrix A, respectively. Given a matrix A, we denote by ‖A‖_F
its Frobenius norm, by ‖A‖_0 its number of non-zero elements, and by ‖A‖_∞ its element-wise ℓ_∞
norm (i.e., ‖A‖_∞ = max_{ij} |A_{ij}|). The cardinality of a set A is denoted by |A|. The bracketed
notation ⟦n⟧ is shorthand for the integer set {1, 2, . . . , n}. A discrete measure α can be denoted by
α = Σ_{i=1}^m p_i δ_{x_i}, where δ_x is the Dirac at position x, i.e., a unit of mass infinitely concentrated at x.
Gromov-Wasserstein distance. Given two discrete measures α = Σ_{i=1}^m p_i δ_{x_i} supported on a metric space (X, d_X)
and β = Σ_{i'=1}^n q_{i'} δ_{y_{i'}} supported on a metric space (Y, d_Y), the 2-GW distance is defined as
$$ GW_2(\alpha, \beta) = \Big( \min_{T \in \Pi(p, q)} \sum_{i,j=1}^{m} \sum_{i',j'=1}^{n} D_{ii'jj'}^2\, T_{ii'} T_{jj'} \Big)^{\frac{1}{2}}, $$
where the feasible domain of the transport plan T = [T_{ii'}] is given by the set
$$ \Pi(p, q) = \big\{ T \in \mathbb{R}_+^{m \times n} \;\big|\; T \mathbf{1}_n = p,\; T^{\top} \mathbf{1}_m = q \big\}, $$
and D_{ii'jj'} calculates the difference between pairwise distances, i.e., D_{ii'jj'} = |d_X(x_i, x_j) −
d_Y(y_{i'}, y_{j'})| with x_1, x_2, . . ., x_m ∈ X and y_1, y_2, . . ., y_n ∈ Y.
In this subsection, we formalize the idea of comparing graphs with GWD, which addresses the
challenges that graphs often have different numbers of nodes and that the node correspondence is
unknown (Xu et al., 2019b; Xu, 2020; Vincent-Cuaz et al., 2021).
Pairwise relation and graph representation. Given a graph G with n nodes, assigning each node
an index i ∈ JnK, G can be expressed as a tuple (C, p), where C ∈ Rn×n is a matrix encoding
the pairwise relations (e.g. adjacency, shortest-path, Laplacian, or heat kernel) and p ∈ ∆n is a
probability vector modeling the relative importance of nodes within the graph (Peyré et al., 2016;
Xu et al., 2019b; Titouan et al., 2019; Vincent-Cuaz et al., 2022).
Gromov-Wasserstein Discrepancy GWD can be derived from the 2-GW distance by replacing
the metrics with pairwise relations (Xu et al., 2019b; Vincent-Cuaz et al., 2022). More specifically,
given the observed source graph G s and the target graph G t that can be expressed as (Cs , ps ) and
(Ct , pt ) respectively, GWD is defined as
$$ \mathrm{GWD}\big((C^s, p^s), (C^t, p^t)\big) = \Big( \min_{T \in \Pi(p^s, p^t)} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'})^2\, T_{ii'} T_{jj'} \Big)^{\frac{1}{2}}, $$
where ns and nt are the numbers of nodes of G s and G t respectively. GWD computes both a soft
assignment matrix between the nodes of the two graphs and a notion of discrepancy between them.
For conciseness, we abbreviate GWD((C^s, p^s), (C^t, p^t)) to GWD(C^s, C^t) in the sequel.
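To make the definition concrete, the following is a minimal sketch of computing GWD between two toy graphs with the POT library; the adjacency-matrix PRMs, the uniform node weights, and the `ot.gromov.gromov_wasserstein` calling convention of recent POT releases are assumptions rather than details taken from the paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Two toy graphs described by symmetric PRMs (here: adjacency matrices).
Cs = np.array([[0., 1., 1.],
               [1., 0., 0.],
               [1., 0., 0.]])
Ct = np.array([[0., 1., 1., 0.],
               [1., 0., 1., 0.],
               [1., 1., 0., 1.],
               [0., 0., 1., 0.]])
ps = np.full(3, 1 / 3)   # node weights p^s (uniform)
pt = np.full(4, 1 / 4)   # node weights p^t (uniform)

# T is the soft node correspondence; log["gw_dist"] is the minimized squared objective.
T, log = ot.gromov.gromov_wasserstein(Cs, Ct, ps, pt, loss_fun="square_loss", log=True)
print("GWD:", np.sqrt(log["gw_dist"]))
print("row marginals of T:", T.sum(axis=1))   # should match p^s
```

The coupling T returned alongside the discrepancy is exactly the soft assignment between nodes mentioned above.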
Traditional DL approximates data vectors as sparse linear combinations of basis elements (atoms)
(Mallat, 1999; Mairal et al., 2009; Tošić and Frossard, 2011; Jiang et al., 2015), and is usually
formulated as
$$ \min_{D \in \mathcal{C},\, W} \sum_{k=1}^{K} \Big\| X[:, k] - \sum_{m=1}^{M} w^k_m D[:, m] \Big\|_2^2 + \lambda \Omega(w^k), \qquad (1) $$
where X ∈ Rd×K is the data matrix whose columns represent samples, the matrix D ∈ Rd×M
contains M atoms to learn and is constrained to the following set
C = {D ∈ R^{d×M} | ∀m ∈ ⟦M⟧, ‖D[:, m]‖_2 ≤ 1},
W ∈ R^{M×K} is the new representation of the data whose k-th column w^k = [w^k_m]_{m∈⟦M⟧} stores the
embedding of the k-th sample, and λΩ(w^k) promotes the sparsity of w^k. Such a formulation only
applies to vectorized data.
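For contrast with the graph setting discussed next, here is a small sketch of classical vectorized dictionary learning in the spirit of Eq. (1) using scikit-learn; note that scikit-learn stores samples as rows (the transpose of the convention above), and the data, atom count, and sparsity weight below are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))           # 200 samples of dimension d = 20 (rows = samples)

dl = DictionaryLearning(n_components=8,      # M = 8 atoms
                        alpha=0.1,           # sparsity weight, playing the role of lambda
                        transform_algorithm="lasso_lars",
                        transform_alpha=0.1,
                        random_state=0)
W = dl.fit_transform(X)                      # sparse codes, shape (200, 8)
D = dl.components_                           # atoms, shape (8, 20)
print("reconstruction error:", np.linalg.norm(X - W @ D))
```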
Recently, Xu (2020) proposes to approximate graphs via the highly non-linear GW barycenter. Specif-
ically, given a dataset of K graphs with PRMs {C^k}_{k∈⟦K⟧} such that C^k ∈ R^{n^k×n^k}, the basis
elements {C̄_m}_{m∈⟦M⟧} are learned by solving
$$ \min_{\{\bar{C}_m\}_{m=1}^{M},\, \{w^k\}_{k=1}^{K}} \sum_{k=1}^{K} \mathrm{GWD}^2\Big( C^k,\; \mathcal{B}\big(w^k, \{\bar{C}_m\}_{m=1}^{M}\big) \Big), $$
where w^k ∈ Δ_M is referred to as the embedding of the k-th graph G^k, and the GW barycenter
B(w^k, {C̄_m}_{m∈⟦M⟧}) is itself the solution of an inner optimization problem, which makes the
approximation highly non-linear and computationally demanding.
DL for graphs. To overcome the above computational issues, Vincent-Cuaz et al. (2021) propose
GDL, which approximates each graph as a weighted sum of PRMs and is formulated as
$$ \min_{\{\bar{C}_m\}_{m=1}^{M},\, \{w^k\}_{k=1}^{K}} \sum_{k=1}^{K} \mathrm{GWD}^2\Big( C^k,\; \sum_{m=1}^{M} w^k_m \bar{C}_m \Big) + \lambda \Omega(w^k), $$
where each atom C̄_m is an n_a × n_a matrix. In contrast to the ℓ_2 loss in Eq. (1), GWD is used to assess
the quality of the linear representation Σ_{m=1}^M w^k_m C̄_m for k ∈ ⟦K⟧. However, in real-world applications
the observed graphs often contain noisy edges or miss some edges (Clauset et al., 2008; Xu et al., 2019b;
Shi et al., 2019; Piccioli et al., 2022), which leads to inaccuracies in the PRMs C^k, that is, deviations
between C^k and the true PRM C^{k*}. Since GWD lacks robustness (Séjourné et al., 2021; Vincent-Cuaz
et al., 2022; Tran et al., 2022), the quality of the learned dictionary may be severely affected.
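To illustrate how a graph is compared with its dictionary approximation in GDL, here is a sketch of evaluating one reconstruction term with POT; the random atoms, the embedding w, the uniform node weights, and the noisy PRM are illustrative placeholders, not the released GDL code.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
n_k, n_a, M = 12, 6, 3

# Illustrative dictionary of symmetric atoms and an embedding w on the simplex.
atoms = []
for _ in range(M):
    A = rng.random((n_a, n_a))
    atoms.append((A + A.T) / 2)
w = np.array([0.6, 0.3, 0.1])

A = rng.random((n_k, n_k))
C_k = (A + A.T) / 2                                      # observed (possibly noisy) PRM
C_rec = sum(w_m * atom for w_m, atom in zip(w, atoms))   # linear model sum_m w_m * atom_m

p_k = np.full(n_k, 1 / n_k)
p_bar = np.full(n_a, 1 / n_a)
loss = ot.gromov.gromov_wasserstein2(C_k, C_rec, p_k, p_bar, loss_fun="square_loss")
print("GWD^2 reconstruction loss:", loss)
```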
Definition 1 Given the observed source graph G^s and the target graph G^t, expressed as
(C^s, p^s) and (C^t, p^t) respectively, and a noise level ε ≥ 0, RGWD is defined by the solution to the following problem:
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) = \Big( \min_{T \in \Pi(p^s, p^t)}\; \max_{E \in \mathcal{U}_\varepsilon} f(T, E; C^s, C^t) \Big)^{\frac{1}{2}}, \quad \text{with } f(T, E; C^s, C^t) = \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'}, $$
and the perturbation E lies in the bounded set U_ε = {E | E = E^⊤ and ‖E‖_∞ ≤ ε}.

RGWD requires the sought transport plan to have low transportation cost for all perturbations E in
the set U_ε. For succinctness, we omit C^s and C^t in f(T, E; C^s, C^t) in the following.
The properties of RGWD are presented as follows. Firstly, although RGWD involves a non-convex
non-concave minimax optimization problem, the inner maximization problem has a closed-form
solution, which allows an efficient numerical scheme for RGWD. Secondly, RGWD has a lower
bound that is achieved if and only if the expressions of compared graphs are identical up to a per-
mutation, which implies RGWD can be employed to evaluate the similarity between one observed
graph and its approximation in DL. Thirdly, RGWD satisfies the triangle inequality, which allows
numerous potential applications including clustering (Elkan, 2003; HajKacem et al., 2019), metric
learning (Pitis et al., 2019), and Bayesian learning (Moore, 2000; Xiao et al., 2019). Finally, arbi-
trarily changing the node orders does not affect the value of RGWD. More formally, we state the
properties in the following theorem.
Theorem 1 Given two observed graphs G^s = (C^s, p^s) and G^t = (C^t, p^t) with n^s and n^t nodes
respectively, RGWD satisfies:

1. For any fixed T ∈ Π(p^s, p^t), the inner maximization max_{E∈U_ε} f(T, E) admits the closed-form solution E*(T) with entries
$$ E^*_{i'j'}(T) = \begin{cases} \varepsilon, & \text{if } \sum_{i,j=1}^{n^s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\varepsilon, & \text{otherwise.} \end{cases} $$

2. RGWD((C^s, p^s), (C^t, p^t), ε) ≥ ε, where the equality holds if and only if there exists a bijection π* : ⟦n^s⟧ → ⟦n^t⟧ such that
p^s_i = p^t_{π*(i)} for all i ∈ ⟦n^s⟧ and C^s_{ij} = C^t_{π*(i)π*(j)} for all i, j ∈ ⟦n^s⟧.

3. RGWD satisfies the triangle inequality.
4. RGWD is invariant to permutations of the node order, i.e., for all permutation matrices Q^s
and Q^t,
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) = \mathrm{RGWD}\big((Q^{s\top} C^s Q^s, Q^{s\top} p^s), (Q^{t\top} C^t Q^t, Q^{t\top} p^t), \varepsilon\big). $$
As is implied by Theorem 1, RGWD does not define a distance between metric-measure spaces.
Firstly, the identity axiom is not satisfied. Secondly, the symmetry generally does not hold either,
which we exemplify below.
We derive a gradient-based numerical scheme for RGWD by exploiting the property that the
inner maximization problem has a closed-form solution; the procedure is summarized in Algorithm 1. In
each iteration, the perturbation E_τ that solves the inner problem for the current T_τ is calculated in closed form. Then, the transport
plan is updated using projected gradient descent with the gradient
$$ \nabla_T f(T_\tau, E_\tau) = 2\, (C^s \odot C^s)\, T_\tau \mathbf{1}_{n^t} \mathbf{1}_{n^t}^\top + 2\, \mathbf{1}_{n^s} \mathbf{1}_{n^s}^\top T_\tau \big( (C^t + E_\tau) \odot (C^t + E_\tau) \big) - 4\, C^s T_\tau (C^t + E_\tau). $$
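The following sketch puts Algorithm 1 together under simplifying assumptions: small graphs, symmetric PRMs, a fixed step size, and SciPy's SLSQP solver standing in for the augmented Lagrangian projection of Appendix B. It illustrates the structure of the iteration and is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize


def project_onto_couplings(H, p, q):
    """Euclidean projection of H onto {T >= 0, T 1 = p, T^T 1 = q} (small-scale stand-in)."""
    m, n = H.shape
    cons = [{"type": "eq", "fun": lambda t: t.reshape(m, n).sum(axis=1) - p},
            {"type": "eq", "fun": lambda t: t.reshape(m, n).sum(axis=0) - q}]
    res = minimize(lambda t: 0.5 * np.sum((t - H.ravel()) ** 2),
                   x0=np.outer(p, q).ravel(),
                   jac=lambda t: t - H.ravel(),
                   bounds=[(0.0, None)] * (m * n),
                   constraints=cons, method="SLSQP")
    return res.x.reshape(m, n)


def rgwd_pgd(Cs, Ct, p, q, eps, n_iter=100, eta=0.01):
    """Approximate RGWD((Cs, p), (Ct, q), eps) with projected gradient descent."""
    T = np.outer(p, q)                                   # feasible initialization
    for _ in range(n_iter):
        # Closed-form inner maximization over E (Theorem 1).
        col = T.sum(axis=0)
        G = T.T @ Cs @ T - Ct * np.outer(col, col)
        E = np.where(G <= 0, eps, -eps)
        M = Ct + E
        # Gradient of f(T, E) with respect to T (Cs and M symmetric).
        row = T.sum(axis=1)
        grad = (2.0 * np.outer((Cs ** 2) @ row, np.ones_like(q))
                + 2.0 * np.outer(np.ones_like(p), (M ** 2) @ col)
                - 4.0 * Cs @ T @ M)
        T = project_onto_couplings(T - eta * grad, p, q)  # projected gradient step
    # Evaluate f(T, E) at the final iterate and return its square root.
    col = T.sum(axis=0)
    G = T.T @ Cs @ T - Ct * np.outer(col, col)
    M = Ct + np.where(G <= 0, eps, -eps)
    row = T.sum(axis=1)
    val = row @ (Cs ** 2) @ row - 2.0 * np.sum(Cs * (T @ M @ T.T)) + col @ (M ** 2) @ col
    return np.sqrt(max(val, 0.0)), T
```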
To present the convergence guarantee of Algorithm 1, we introduce the notion of the Moreau
envelope. The stationarity of any function h(x) can be quantified by the norm of the gradient of its
Moreau envelope h_λ(x) = min_{x'} h(x') + (1/(2λ)) ‖x − x'‖². The following theorem gives the
convergence rate of Algorithm 1; the proof is deferred to the appendix.
Theorem 2 Define φ(·) = max_{E∈U_ε} f(·, E). The output T̂ of Algorithm 1 with step-size η = γ/√(N+1)
satisfies
$$ \mathbb{E}\big[ \|\nabla \varphi_{1/2l}(\hat{T})\|^2 \big] \le 2\, \frac{\varphi_{1/2l}(T_0) - \min_{T \in \Pi(p^s, p^t)} \varphi(T) + l L^2 \gamma^2}{\gamma \sqrt{N+1}}, $$
where l = √2 max{10n³U₁² + 6n³U₁ε + 4nU₁U₂ + 4n³ε², 6n²U₁U₂ + 2U₂² + 4n²εU₂} and L =
√2 max{(4U₁+2ε)U₂²n³, 2(2U₁+ε)²U₂n³} with n = max{n^s, n^t}, U₁ = max{‖C^s‖_∞, ‖C^t‖_∞},
and U₂ = max{‖p^t‖₂, max_{T'∈Π(p^s,p^t)} ‖T'‖_F}.
When U₁ and ε are of the order O(1/n³), both l and L are of the order O(1), and Theorem 2 states that
a δ-stationary solution can be obtained within O(1/δ²) iterations. Note that we can multiply C^s, C^t,
and ε by the same number without affecting the resulting transport plan.
Solving wk . We now formulate the problem of obtaining the embedding of the k th graph G k when
the dictionary is fixed and the PRM is inaccurate. Given a dictionary {C̄_m}_{m∈⟦M⟧} where each C̄_m ∈
R^{n_a×n_a}, the embedding of G^k expressed by (C^k, p^k) is calculated by solving
$$ \min_{w^k \in \Delta_M} \mathrm{RGWD}^2\Big( \Big( \sum_{m=1}^{M} w^k_m \bar{C}_m,\; \bar{p} \Big),\; (C^k, p^k),\; \varepsilon \Big) - \lambda \|w^k\|^2, \qquad (4) $$
where λ ≥ 0 induces a negative quadratic regularization promoting sparsity on the simplex (Li
et al., 2020; Vincent-Cuaz et al., 2021). When w^k is fixed, T^k and E^k can be updated by
Algorithm 1, whose convergence is guaranteed by Theorem 2. For fixed T^k and E^k, the problem of
updating w^k is a non-convex problem that can be tackled by a conditional gradient algorithm. Note
that for non-convex problems, the conditional gradient algorithm is proved to converge to a local
stationary point (Lacoste-Julien, 2016). Such a procedure is described from Line 5 to Line 11 in
Algorithm 2, which we observe converges within tens of iterations empirically.
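The conditional gradient (Frank-Wolfe) step on the simplex can be sketched as follows; the quadratic surrogate objective stands in for the RGWD² term of Eq. (4), and the step-size schedule and iteration count are illustrative.

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, w0, n_iter=100):
    """Minimize a smooth objective over the probability simplex with Frank-Wolfe."""
    w = w0.copy()
    for it in range(n_iter):
        g = grad_fn(w)
        s = np.zeros_like(w)
        s[np.argmin(g)] = 1.0            # linear minimization oracle on the simplex: a vertex
        gamma = 2.0 / (it + 2.0)         # standard diminishing step size
        w = (1 - gamma) * w + gamma * s  # convex combination stays on the simplex
    return w

# Example with a quadratic surrogate g(w) = 0.5 w^T Q w - b^T w (illustrative only).
rng = np.random.default_rng(0)
Q = rng.random((3, 3)); Q = Q @ Q.T
b = rng.random(3)
w = frank_wolfe_simplex(lambda w: Q @ w - b, np.full(3, 1 / 3))
print(w, w.sum())
```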
Stochastic updates. To enhance computational efficiency, each atom is updated with stochastic
estimates of the gradient. At each stochastic update, b embedding learning problems are solved
independently for the current dictionary using the procedure stated above, where b is the size of the
sampled mini-batch. Each atom is then updated using the stochastic gradient given in Eq. (3). Note
that the symmetry of each atom is preserved as long as the initialized atom is symmetric, since the
stochastic gradients are symmetric.
5 EXPERIMENTS
This section provides empirical evidence that RGDL performs well in the unsupervised graph clus-
tering task on both synthetic and real-world datasets. The heat kernel matrix is employed for the
PRM since it captures both global and local topology and achieves good performance in many tasks
(Donnat et al., 2018; Tsitsulin et al., 2018; Chowdhury and Needham, 2021).
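As a concrete example of this choice of PRM, the following sketch builds the heat kernel matrix of a graph with NetworkX and SciPy; the diffusion time t and the Erdős-Rényi test graph are illustrative assumptions.

```python
import networkx as nx
import numpy as np
from scipy.linalg import expm

def heat_kernel_prm(G, t=1.0):
    """Heat kernel matrix exp(-t L), with L the combinatorial Laplacian of G."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return expm(-t * L)

G = nx.erdos_renyi_graph(10, 0.3, seed=0)
C = heat_kernel_prm(G, t=1.0)
print(C.shape, np.allclose(C, C.T))   # a symmetric PRM capturing multi-scale structure
```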
We first test RGDL in the graph clustering task on datasets simulated according to the well-studied
Stochastic Block Model (SBM) (Holland et al., 1983; Wang and Wong, 1987). RGDL is compared
against the following state-of-the-art graph clustering methods: (i) GDL (Vincent-Cuaz et al., 2021)
learns graph dictionaries via GWD; (ii) Gromov-Wasserstein Factorization (GWF) (Xu, 2020) that
approximates graphs via GW barycenters; (iii) Spectral Clustering (SC) of Shi and Malik (2000);
Stella and Shi (2003) applied to the matrix with each entry storing the GWD between two graphs.
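For illustration, baseline (iii) can be sketched as follows: pairwise GWDs between graphs, converted to affinities and fed to spectral clustering. The random PRMs, the kernel bandwidth, and the discrepancy-to-affinity conversion are assumptions, not the exact protocol of the cited works.

```python
import numpy as np
import ot
from sklearn.cluster import SpectralClustering

def pairwise_gwd(prms):
    """Matrix of pairwise GWDs between graphs given by their PRMs (uniform node weights)."""
    K = len(prms)
    D = np.zeros((K, K))
    for a in range(K):
        for b in range(a + 1, K):
            pa = np.full(prms[a].shape[0], 1 / prms[a].shape[0])
            pb = np.full(prms[b].shape[0], 1 / prms[b].shape[0])
            gw2 = ot.gromov.gromov_wasserstein2(prms[a], prms[b], pa, pb, loss_fun="square_loss")
            D[a, b] = D[b, a] = np.sqrt(max(gw2, 0.0))
    return D

rng = np.random.default_rng(0)
prms = []
for _ in range(8):
    n = int(rng.integers(6, 10))
    A = rng.random((n, n))
    prms.append((A + A.T) / 2)

D = pairwise_gwd(prms)
affinity = np.exp(-D / (D.mean() + 1e-12))   # turn discrepancies into similarities
labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0).fit_predict(affinity)
print(labels)
```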
Table 1: Average (stdev) ARI scores for the first scenario of synthetic datasets.
balanced unbalanced
σ 0.05 0.10 0.15 0.05 0.10 0.15
GDL 0.119(0.017) 0.031(0.012) 0.016(0.006) 0.049(0.019) 0.018(0.004) 0.018(0.001)
GWF 0.071(0.007) 0.034(0.003) 0.008(0.001) 0.052(0.020) 0.014(0.001) 0.015(0.001)
SC 0.057(0.002) 0.033(0.002) 0.010(0.001) 0.054(0.024) 0.015(0.004) 0.010(0.001)
RGDL(ε=0.01) 0.316(0.005) 0.161(0.005) 0.052(0.002) 0.246(0.013) 0.039(0.009) 0.024(0.001)
RGDL(ε=0.1) 0.853(0.003) 0.756(0.018) 0.439(0.015) 0.765(0.022) 0.694(0.046) 0.499(0.016)
RGDL(ε=0.2) 0.975(0.025) 0.879(0.023) 0.736(0.020) 0.866(0.023) 0.815(0.028) 0.770(0.016)
RGDL(ε=0.3) 0.975(0.025) 0.879(0.023) 0.869(0.013) 0.943(0.001) 0.916(0.027) 0.848(0.061)
RGDL(ε=10) 0.975(0.025) 0.950(0.000) 0.950(0.000) 0.943(0.001) 0.943(0.001) 0.943(0.001)
RGDL(ε=30) 0.781(0.046) 0.779(0.070) 0.728(0.085) 0.723(0.067) 0.698(0.057) 0.666(0.040)
Table 2: Average (stdev) ARI scores for the second scenario of synthetic datasets.
balanced unbalanced
ρ 0.00 0.05 0.10 0.00 0.05 0.10
GDL 0.260(0.020) 0.187(0.013) 0.024(0.004) 0.152(0.008) 0.018(0.001) 0.005(0.001)
GWF 0.182(0.006) 0.086(0.004) 0.027(0.005) 0.020(0.005) 0.016(0.002) 0.010(0.002)
SC 0.204(0.002) 0.129(0.017) 0.016(0.005) 0.129(0.008) 0.013(0.001) 0.011(0.005)
RGDL(ε=0.01) 0.451(0.014) 0.449(0.016) 0.449(0.016) 0.401(0.002) 0.401(0.002) 0.399(0.000)
RGDL(ε=0.1) 1.000(0.000) 1.000(0.000) 0.975(0.025) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=0.2) 1.000(0.000) 1.000(0.000) 0.975(0.025) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=0.3) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=10) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000) 1.000(0.000)
RGDL(ε=30) 0.896(0.070) 0.888(0.080) 0.857(0.044) 0.864(0.057) 0.827(0.043) 0.816(0.074)
Dataset generation. We consider two scenarios of inaccuracies. In the first scenario (S1), Gaus-
sian noise is added into the heat kernel matrix of each graph. More specifically, denoting the
heat kernel matrix of the k th graph as Ck∗ for k ∈ JKK, the PRM available to DL methods is
Ck = Ck∗ + Z + Z> where each entry Zij of Z is sampled from the Gaussian distribution N (0, σ).
In the second scenario (S2), we randomly add ρ|E| edges into the graph and then randomly remove
ρ|E| edges while keeping the graph connected, where E is the edge set of the graph. The heat kernel
matrix is then constructed for the modified graph. These two scenarios allow us to study the
performance of RGDL under different scales of inaccuracies. In both S1 and S2, we generate two
datasets, both of which involve three generative structures (also used to label graphs): dense (only
one community), two communities, and three communities. We fix p = 0.1 as the probability of
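A sketch of the data-generation pipeline for scenario S1 is given below: SBM graphs, heat-kernel PRMs, and the perturbation C^k = C^{k*} + Z + Z^⊤ with Gaussian Z. The block sizes, intra- and inter-community probabilities, and diffusion time are illustrative and not necessarily the exact values used in the paper.

```python
import networkx as nx
import numpy as np
from scipy.linalg import expm

def sbm_heat_kernel_s1(sizes, p_in, p_out, sigma, t=1.0, seed=0):
    """Generate an SBM graph, its heat-kernel PRM, and an S1-perturbed observation."""
    rng = np.random.default_rng(seed)
    probs = np.full((len(sizes), len(sizes)), p_out)
    np.fill_diagonal(probs, p_in)
    G = nx.stochastic_block_model(sizes, probs.tolist(), seed=seed)
    L = nx.laplacian_matrix(G).toarray().astype(float)
    C_true = expm(-t * L)                               # noiseless heat-kernel PRM C^{k*}
    Z = rng.normal(0.0, sigma, size=C_true.shape)
    return C_true, C_true + Z + Z.T                     # observed PRM in scenario S1

C_true, C_obs = sbm_heat_kernel_s1(sizes=[10, 10], p_in=0.5, p_out=0.1, sigma=0.10)
print(np.linalg.norm(C_obs - C_true))
```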
Evaluating the performance. The learned embeddings of the graphs are used as input for a k-
means algorithm to cluster graphs. We use the well-known Adjusted Rand Index (ARI) (Hubert and
Arabie, 1985; Steinley, 2004) to evaluate the quality of the clustering by comparing it with the graph
labels. RGDL with varied ε is compared against GDL, GWF, and SC. RGDL, GDL, and GWF use
three atoms, each a 6×6 matrix. We run each method 5 times and report the average ARI
scores and the standard deviations. Experimental results reported in Table 1 and Table 2 demonstrate
RGDL outperforms baselines significantly.
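The evaluation step can be sketched as follows with scikit-learn; the embeddings and labels below are random placeholders for those produced by the dictionary learning methods.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
W = rng.random((60, 3))                 # K x M matrix of graph embeddings (placeholder)
labels_true = np.repeat([0, 1, 2], 20)  # ground-truth generative structures

labels_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(W)
print("ARI:", adjusted_rand_score(labels_true, labels_pred))
```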
Influence of ε. RGDL with moderate ε values outperforms the baseline methods by a large margin
and is more robust to the noise. Even when ε is relatively small (ε=0.01), RGDL achieves better
performance than the baselines. Increasing ε within a suitable range boosts ARI, and RGDL is not
sensitive to the choice of ε. If ε becomes too large, the performance of RGDL slowly decreases. In
practice, when a small quantity of labeled data is available, ε can be chosen according to the
performance on this small subset of data.
Figure 1: ARI scores vs. time on MUTAG (left), BZR (middle), and Peking 1 (right) datasets.
We further use RGDL to cluster real-world graphs. We consider widely utilized benchmark datasets
including MUTAG (Debnath et al., 1991), BZR (Sutherland et al., 2003), and Peking 1 (Pan et al.,
2016). The labels of the graphs are employed as the ground truth to evaluate the estimated clustering
results. For each dataset, the size of the atoms is set as the median of the numbers of graph nodes
following Vincent-Cuaz et al. (2021). The number of atoms M is set as M = β(# classes) where
β is chosen from {2, 3, 4, 5}. RGDL is run with different values of ε. Specifically, ε is chosen from
{U, 10⁻¹U, 10⁻²U, 10⁻³U} where U = max_{k∈⟦K⟧} ‖C^k‖_∞.
Results. The experimental results on real-world graphs are reported in Figure 1. RGDL with ε =
10⁻¹U or ε = 10⁻²U outperforms the baselines on all datasets, which implies that the observed graphs
contain structural noise and that [10⁻²U, 10⁻¹U] is often a suitable range for ε. The time required
for RGDL to converge is comparable to that of state-of-the-art methods. Since GDL, GWF,
and RGDL all output graph embeddings, we further visualize the embeddings generated by each of them
using PCA. As shown in Figure 2, the embeddings of the two types of graphs produced by RGDL
are less likely to be mixed together, which explains why RGDL achieves higher ARI values.
Figure 2: PCA-based visualization of embeddings produced by GDL (left), GWF (middle), and
RGDL (right), respectively, for the graphs in the MUTAG dataset. The color of each point indicates the
type of the corresponding graph. RGDL achieves the best clustering results.
6 RELATED WORK
Unbalanced OT. Enhancing the robustness of the optimal transport plan has received wide atten-
tion recently (Balaji et al., 2020; Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022; Chapel
et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). Originally, robust variants of classical
OT are proposed to compare distributions supported on the same metric space (Balaji et al., 2020;
Mukherjee et al., 2021; Le et al., 2021; Nietert et al., 2022), which model the noise as outlier supports
and reduce the influence of outlier supports by allowing mass destruction and creation. Following
the same spirit, variants of the GW distance that also relax the marginal constraints are proposed
(Chapel et al., 2020; Séjourné et al., 2021; Vincent-Cuaz et al., 2022). However, these methods do
not take the inaccuracies of the pairwise distances/similarities into account. The proposed RGWD
aims to handle such cases.
Graph representation learning and graph comparison. Comparing graphs often requires learn-
ing meaningful graph representations. Some methods manually design representations that are in-
variant under graph isomorphism (Bagrow and Bollt, 2019; Tam and Dunson, 2022). Such rep-
resentations are often sophisticated and require domain knowledge. Graph neural network-based
methods learn the representations of graphs in an end-to-end manner (Scarselli et al., 2008; Zhang
et al., 2018; Lee et al., 2018; Errica et al., 2019), which however requires a large amount of labeled
data. Another family of methods that uses graph representations implicitly is referred to as graph
kernels (Shervashidze et al., 2009; Vishwanathan et al., 2010). GWD and its variants based methods
can estimate the node correspondence and provide an interpretable discrepancy between compared
graphs (Xu et al., 2019b; Titouan et al., 2019; Barbe et al., 2020; Chapel et al., 2020). In this paper,
we propose a novel graph dictionary learning method based on a robust variant of GWD to learn
representations of graphs which are useful in downstream tasks.
Non-linear combination of atoms. Classic DL methods are linear in the sense that they attempt to
approximate each vectorized datum by a linear combination of a few basis elements. Recently, non-
linear operations have also been considered. To exploit the non-linear nature of data, autoencoder-
based methods encode data into low-dimensional vectors with one neural network and decode them
with another (Hinton and Salakhutdinov, 2006; Hu and Tan, 2018). Another family
of methods replaces linear combinations by geodesic interpolations (Boissard et al., 2011; Bigot
et al., 2013; Seguy and Cuturi, 2015; Schmitz et al., 2018). More closely related to our work,
Xu (2020) proposes to approximate graphs via the GW barycenter of graph atoms, which however
involves a complicated and computationally demanding optimization problem.
7 CONCLUSION
In this paper, we propose a novel graph dictionary learning algorithm that is robust to the structural
noise of graphs. We first propose a robust variant of GWD, referred to as RGWD, which involves
a minimax optimization problem. Exploiting the fact that the inner maximization problem has a
closed-form solution, an efficient numerical scheme is derived. Based on RGWD, a robust dictio-
nary learning algorithm for graphs called RGDL is derived to learn atoms from noisy graph data.
Numerical results on both simulated and real-world datasets demonstrate that RGDL achieves good
performance in the presence of structural noise.
REFERENCES
Monica Agrawal, Marinka Zitnik, and Jure Leskovec. Large-scale analysis of disease pathways in
the human interactome. In PACIFIC SYMPOSIUM ON BIOCOMPUTING 2018: Proceedings of
the Pacific Symposium, pages 111–122. World Scientific, 2018.
Lars Backstrom and Jure Leskovec. Supervised random walks: predicting and recommending links
in social networks. In Proceedings of the fourth ACM international conference on Web search
and data mining, pages 635–644, 2011.
James P Bagrow and Erik M Bollt. An information-theoretic, all-scales approach to comparing
networks. Applied Network Science, 4(1):1–15, 2019.
Yogesh Balaji, Rama Chellappa, and Soheil Feizi. Robust optimal transport with applications in
generative modeling and domain adaptation. Advances in Neural Information Processing Systems,
33:12934–12944, 2020.
Amélie Barbe, Marc Sebban, Paulo Gonçalves, Pierre Borgnat, and Rémi Gribonval. Graph diffu-
sion wasserstein distances. In ECML PKDD, 2020.
Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic pca in the wasserstein
space. arXiv preprint arXiv:1307.7721, 2013.
Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution’s template estimate
with wasserstein metrics. arXiv preprint arXiv:1111.5927, 2011.
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and
Trends® in Machine Learning, 8(3-4):231–357, 2015.
Laetitia Chapel, Gilles Gasso, et al. Partial optimal transport with applications on positive-unlabeled
learning. In NeurIPS, 2020.
Samir Chowdhury and Facundo Mémoli. The gromov–wasserstein distance between networks and
stable network invariants. Information and Inference: A Journal of the IMA, 8(4):757–787, 2019.
Samir Chowdhury and Tom Needham. Generalized spectral clustering via gromov-wasserstein
learning. In International Conference on Artificial Intelligence and Statistics, pages 712–720.
PMLR, 2021.
Aaron Clauset, Cristopher Moore, and M E J Newman. Hierarchical structure and the prediction of
missing links in networks. Nature, 453(7191):98–101, 2008.
Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Cor-
win Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro com-
pounds. correlation with molecular orbital energies and hydrophobicity. Journal of medicinal
chemistry, 34(2):786–797, 1991.
Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. Learning structural node embed-
dings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD international conference
on knowledge discovery & data mining, pages 1320–1329, 2018.
Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu
Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in
neural information processing systems, 32, 2019.
Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport:
Complexity by accelerated gradient descent is better than by sinkhorn’s algorithm. In Interna-
tional conference on machine learning, pages 1367–1376. PMLR, 2018.
Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the 20th
international conference on Machine Learning (ICML-03), pages 147–153, 2003.
Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neu-
ral networks for graph classification. In International Conference on Learning Representations,
2019.
Zhizhao Feng, Meng Yang, Lei Zhang, Yan Liu, and David Zhang. Joint discriminative dimensional-
ity reduction and dictionary learning for face recognition. Pattern Recognition, 46(8):2134–2143,
2013.
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’Cir, and Nadia Essoussi. Overview of
scalable partitional methods for big data clustering. In Clustering Methods for Big Data Analytics,
pages 1–23. Springer, 2019.
Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural
networks. science, 313(5786):504–507, 2006.
Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First
steps. Social networks, 5(2):109–137, 1983.
Junlin Hu and Yap-Peng Tan. Nonlinear dictionary learning with application to image classification.
Pattern Recognition, 75:282–291, 2018.
Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2(1):193–218,
1985.
Wenhao Jiang, Feiping Nie, and Heng Huang. Robust dictionary learning with capped l1-norm. In
Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction
outcomes with weisfeiler-lehman network. Advances in neural information processing systems,
30, 2017.
Simon Lacoste-Julien. Convergence rate of frank-wolfe for non-convex objectives. arXiv preprint
arXiv:1607.00345, 2016.
Khang Le, Huy Nguyen, Quang M Nguyen, Tung Pham, Hung Bui, and Nhat Ho. On robust
optimal transport: Computational complexity and barycenter computation. Advances in Neural
Information Processing Systems, 34:21947–21959, 2021.
John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification using structural attention.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pages 1666–1674, 2018.
Ping Li, Syama Sundar Rangapuram, and Martin Slawski. Methods for sparse and low-rank recovery
under simplex constraints. Statistica Sinica, 30(2):557–577, 2020.
Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis Bach. Supervised
dictionary learning. Advances in neural information processing systems, 21, 2008.
Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for
sparse coding. In Proceedings of the 26th annual international conference on machine learning,
pages 689–696, 2009.
Stéphane Mallat. A wavelet tour of signal processing. Elsevier, 1999.
Facundo Mémoli. Gromov–wasserstein distances and the metric approach to object matching. Foun-
dations of computational mathematics, 11(4):417–487, 2011.
Andrew W Moore. The anchors hierarchy: using the triangle inequality to survive high dimensional
data. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages
397–405, 2000.
Debarghya Mukherjee, Aritra Guha, Justin M Solomon, Yuekai Sun, and Mikhail Yurochkin.
Outlier-robust optimal transport. In International Conference on Machine Learning, pages 7850–
7860. PMLR, 2021.
Navid Naderializadeh, Mark Eisen, and Alejandro Ribeiro. Wireless power control via counterfac-
tual optimization of graph neural networks. In 2020 IEEE 21st International Workshop on Signal
Processing Advances in Wireless Communications (SPAWC), pages 1–5. IEEE, 2020.
Sloan Nietert, Ziv Goldfeld, and Rachel Cummings. Outlier-robust optimal transport: Duality, struc-
ture, and statistical analysis. In International Conference on Artificial Intelligence and Statistics,
pages 11691–11719. PMLR, 2022.
Shirui Pan, Jia Wu, Xingquan Zhu, Guodong Long, and Chengqi Zhang. Task sensitive feature
exploration and learning for multitask graph classification. IEEE transactions on cybernetics, 47
(3):744–758, 2016.
Gabriel Peyre and Marco Cuturi. Computational optimal transport. arXiv: Machine Learning, 2018.
Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-wasserstein averaging of kernel and
distance matrices. In ICML, 2016.
Giovanni Piccioli, Guilhem Semerjian, Gabriele Sicuro, and Lenka Zdeborová. Aligning random
graphs with a sub-tree similarity message-passing algorithm. Journal of Statistical Mechanics:
Theory and Experiment, 2022(6):063401, 2022.
Silviu Pitis, Harris Chan, Kiarash Jamali, and Jimmy Ba. An inductive bias for distances: Neural
nets that respect the triangle inequality. In International Conference on Learning Representations,
2019.
Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning:
transfer learning from unlabeled data. In Proceedings of the 24th international conference on
Machine learning, pages 759–766, 2007.
Ignacio Ramirez, Pablo Sprechmann, and Guillermo Sapiro. Classification and clustering via dictio-
nary learning with structured incoherence and shared features. In 2010 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pages 3501–3508. IEEE, 2010.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
Morgan A Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi,
Gabriel Peyré, and Jean-Luc Starck. Wasserstein dictionary learning: Optimal transport-based
unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643–678,
2018.
Vivien Seguy and Marco Cuturi. Principal geodesic analysis for probability measures under the
optimal transport metric. Advances in Neural Information Processing Systems, 28, 2015.
Thibault Séjourné, François-Xavier Vialard, and Gabriel Peyré. The unbalanced gromov wasserstein
distance: Conic formulation and relaxation. Advances in Neural Information Processing Systems,
34:8766–8779, 2021.
Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Ef-
ficient graphlet kernels for large graph comparison. In Artificial intelligence and statistics, pages
488–495. PMLR, 2009.
Chuan Shi, Binbin Hu, Wayne Xin Zhao, and Philip S Yu. Heterogeneous information network
embedding for recommendation. IEEE Transactions on Knowledge and Data Engineering, 31
(2):357–370, 2019.
Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on
pattern analysis and machine intelligence, 22(8):888–905, 2000.
Pablo Sprechmann and Guillermo Sapiro. Dictionary learning and sparse coding for unsupervised
clustering. In 2010 IEEE international conference on acoustics, speech and signal processing,
pages 2042–2045. IEEE, 2010.
Douglas Steinley. Properties of the hubert-arabie adjusted rand index. Psychological methods, 9(3):
386, 2004.
Stella X. Yu and Jianbo Shi. Multiclass spectral clustering. In Computer Vision, IEEE International
Conference on, volume 2, pages 313–313. IEEE Computer Society, 2003.
Jeffrey J Sutherland, Lee A O’brien, and Donald F Weaver. Spline-fitting with a genetic algorithm:
A method for developing classification structure- activity relationships. Journal of chemical in-
formation and computer sciences, 43(6):1906–1915, 2003.
Edric Tam and David Dunson. Multiscale graph comparison via the embedded laplacian distance.
arXiv preprint arXiv:2201.12064, 2022.
Vayer Titouan, Nicolas Courty, Romain Tavenard, Chapel Laetitia, and Rémi Flamary. Optimal
transport for structured data with application on graphs. In ICML, 2019.
Ivana Tošić and Pascal Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2):
27–38, 2011.
Quang Huy Tran, Hicham Janati, Nicolas Courty, Rémi Flamary, Ievgen Redko, Pinar Demetci, and
Ritambhara Singh. Unbalanced co-optimal transport. arXiv preprint arXiv:2205.14923, 2022.
Anton Tsitsulin, Davide Mottin, Panagiotis Karras, Alexander Bronstein, and Emmanuel Müller.
Netlsd: hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pages 2347–2356, 2018.
Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media,
2008.
Cédric Vincent-Cuaz, Titouan Vayer, Rémi Flamary, Marco Corneli, and Nicolas Courty. Online
graph dictionary learning. In International Conference on Machine Learning, pages 10564–
10574. PMLR, 2021.
Cédric Vincent-Cuaz, Rémi Flamary, Marco Corneli, Titouan Vayer, and Nicolas Courty. Semi-
relaxed gromov-wasserstein divergence and applications on graphs. In International Confer-
ence on Learning Representations, 2022. URL https://openreview.net/forum?id=
RShaMexjc-x.
S Vishwanathan, N Schraudolph, R Kondor, and KM Borgwardt. Graph kernels. The Journal of
Machine Learning Research, 2010.
Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. Journal of the
American Statistical Association, 82(397):8–19, 1987.
Xian Wei, Hao Shen, Yuanxiang Li, Xuan Tang, Fengxiang Wang, Martin Kleinsteuber, and Yi Lu
Murphey. Reconstructible nonlinear dimensionality reduction via joint dictionary learning. IEEE
transactions on neural networks and learning systems, 30(1):175–189, 2018.
Teng Xiao, Jiaxin Ren, Zaiqiao Meng, Huan Sun, and Shangsong Liang. Dynamic bayesian metric
learning for personalized product search. In Proceedings of the 28th ACM International Confer-
ence on Information and Knowledge Management, pages 1693–1702, 2019.
Hongteng Xu, Dixin Luo, and Lawrence Carin. Scalable gromov-wasserstein learning for graph
partitioning and matching. In NeurIPS, 2019a.
Hongteng Xu, Dixin Luo, Hongyuan Zha, and Lawrence Carin Duke. Gromov-wasserstein learning
for graph matching and node embedding. In ICML, 2019b.
Hongtengl Xu. Gromov-wasserstein factorization models for graph clustering. In Proceedings of
the AAAI conference on artificial intelligence, volume 34, pages 6478–6485, 2020.
Yangyang Xu. Iteration complexity of inexact augmented lagrangian methods for constrained con-
vex programming. Mathematical Programming, 185(1):199–244, 2021.
Pinar Yanardag and S Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, New
York, 2015.
Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning
architecture for graph classification. In Proceedings of the AAAI conference on artificial intelli-
gence, volume 32, 2018.
Tong Zhang, Yun Wang, Zhen Cui, Chuanwei Zhou, Baoliang Cui, Haikuan Huang, and Jian Yang.
Deep wasserstein graph discriminant learning for graph classification. In AAAI, pages 10914–
10922, 2021.
Appendix
The appendix is organized as follows. We first provide omitted proofs in the main paper in Sec.
A. Then, algorithmic details are presented in Sec. B. Finally, Sec. C gives additional experimental
results.
A OMITTED PROOFS
Theorem 1 Given two observed graphs G^s = (C^s, p^s) and G^t = (C^t, p^t) with n^s and n^t nodes
respectively, RGWD satisfies:

1. For any fixed T ∈ Π(p^s, p^t), the inner maximization max_{E∈U_ε} f(T, E) admits the closed-form solution E*(T) with entries
$$ E^*_{i'j'}(T) = \begin{cases} \varepsilon, & \text{if } \sum_{i,j=1}^{n^s} T_{ii'} T_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\varepsilon, & \text{otherwise.} \end{cases} $$

2. RGWD((C^s, p^s), (C^t, p^t), ε) ≥ ε, where the equality holds if and only if there exists a bijection π* : ⟦n^s⟧ → ⟦n^t⟧ such that
p^s_i = p^t_{π*(i)} for all i ∈ ⟦n^s⟧ and C^s_{ij} = C^t_{π*(i)π*(j)} for all i, j ∈ ⟦n^s⟧.

3. RGWD satisfies the triangle inequality.

4. RGWD is invariant to permutations of the node order, i.e., for all permutation matrices Q^s
and Q^t,
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) = \mathrm{RGWD}\big((Q^{s\top} C^s Q^s, Q^{s\top} p^s), (Q^{t\top} C^t Q^t, Q^{t\top} p^t), \varepsilon\big). $$
Note that when there exists a bijection π* : ⟦n^s⟧ → ⟦n^t⟧ such that p^s_i = p^t_{π*(i)} for all i ∈ ⟦n^s⟧ and
C^s_{ij} = C^t_{π*(i)π*(j)} for all i, j ∈ ⟦n^s⟧, choosing the transport plan T̂ = [T̂_{ii'}] where
$$ \hat{T}_{ii'} = \begin{cases} p^s_i, & \text{if } i' = \pi^*(i), \\ 0, & \text{otherwise,} \end{cases} $$
we have, for all i', j' ∈ ⟦n^t⟧,
$$ \sum_{i,j=1}^{n^s} \hat{T}_{ii'} \hat{T}_{jj'} \big( C^s_{ij} - C^t_{i'j'} \big) = \hat{T}_{\pi^{*-1}(i')i'}\, \hat{T}_{\pi^{*-1}(j')j'} \big( C^s_{\pi^{*-1}(i')\pi^{*-1}(j')} - C^t_{i'j'} \big) = 0, $$
which implies that E_{i'j'}(T̂) = ε for all i', j' ∈ ⟦n^t⟧. We then have
$$ \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, \hat{T}_{ii'} \hat{T}_{jj'} = \sum_{i,j=1}^{n^s} (C^s_{ij} - C^t_{\pi^*(i)\pi^*(j)} - \varepsilon)^2\, \hat{T}_{i\pi^*(i)} \hat{T}_{j\pi^*(j)} = \varepsilon^2. $$
Therefore, in such a case, RGWD((C^s, p^s), (C^t, p^t), ε) = ε. On the other hand, when such a
bijection does not exist,
$$ \mathrm{RGWD}\big((C^s, p^s), (C^t, p^t), \varepsilon\big) \ge \sqrt{\varepsilon^2 + \min_{T \in \Pi(p^s, p^t)} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'})^2\, T_{ii'} T_{jj'}} \;>\; \varepsilon, $$
where the strict inequality is due to the fact that Σ_{i,j=1}^{n^s} Σ_{i',j'=1}^{n^t} (C^s_{ij} − C^t_{i'j'})² T_{ii'} T_{jj'} > 0.
(iii) Thirdly, we prove the triangle inequality. Given tuples (C¹, p¹), (C², p²), and (C³, p³)
with node numbers n₁, n₂, and n₃ respectively, let (T*¹², E*¹²), (T*²³, E*²³), and
(T*¹³, E*¹³) be the solutions of RGWD((C¹, p¹), (C², p²), ε), RGWD((C², p²), (C³, p³), ε),
and RGWD((C¹, p¹), (C³, p³), ε). Define T¹³ = [T¹³_{i₁i₃}] where
$$ T^{13}_{i_1 i_3} = \sum_{i_2=1}^{n_2} \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}. $$
Then we have
$$ \begin{aligned}
& \mathrm{RGWD}\big((C^1, p^1), (C^3, p^3), \varepsilon\big) \\
&\le \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, T^{13}_{i_1 i_3} T^{13}_{j_1 j_3} } \\
&= \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2 \sum_{i_2=1}^{n_2} \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}} \sum_{j_2=1}^{n_2} \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&= \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^2_{i_2 j_2} + C^2_{i_2 j_2} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}\, \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&\le \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^1_{i_1 j_1} - C^2_{i_2 j_2} \big)^2\, \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}\, \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&\quad + \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^2_{i_2 j_2} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, \frac{T^{*12}_{i_1 i_2} T^{*23}_{i_2 i_3}}{p^2_{i_2}}\, \frac{T^{*12}_{j_1 j_2} T^{*23}_{j_2 j_3}}{p^2_{j_2}} } \\
&= \sqrt{ \sum_{i_1,j_1=1}^{n_1} \sum_{i_2,j_2=1}^{n_2} \big( C^1_{i_1 j_1} - C^2_{i_2 j_2} \big)^2\, T^{*12}_{i_1 i_2} T^{*12}_{j_1 j_2} } + \sqrt{ \sum_{i_2,j_2=1}^{n_2} \sum_{i_3,j_3=1}^{n_3} \big( C^2_{i_2 j_2} - C^3_{i_3 j_3} - E^{*13}_{i_3 j_3} \big)^2\, T^{*23}_{i_2 i_3} T^{*23}_{j_2 j_3} },
\end{aligned} $$
where the second inequality follows from the Minkowski inequality and the last equality uses the
marginal constraints Σ_{i_3} T^{*23}_{i_2 i_3} = p²_{i_2} and Σ_{i_1} T^{*12}_{i_1 i_2} = p²_{i_2}.
(iv) Finally, we prove the invariance to permutations of the node order. Denote the solution to the
objective of RGWD by T* = [T*_{ii'}] and E* = [E*_{i'j'}], which implies
$$ E^*_{i'j'} = \begin{cases} \varepsilon, & \text{if } \sum_{i,j=1}^{n^s} T^*_{ii'} T^*_{jj'} (C^s_{ij} - C^t_{i'j'}) \le 0, \\ -\varepsilon, & \text{otherwise,} \end{cases} $$
and
$$ \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E^*_{i'j'})^2\, T^*_{ii'} T^*_{jj'} \;\le\; \max_{E \in \mathcal{U}_\varepsilon} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'} $$
for all T ∈ Π(p^s, p^t). The two permutation operations can be equivalently denoted as two bijections
π^s and π^t. Denote
$$ \begin{aligned}
\tilde{C}^s &= [\tilde{C}^s_{ij}] \text{ where } \tilde{C}^s_{ij} = C^s_{\pi^{s-1}(i)\pi^{s-1}(j)}, \\
\tilde{C}^t &= [\tilde{C}^t_{i'j'}] \text{ where } \tilde{C}^t_{i'j'} = C^t_{\pi^{t-1}(i')\pi^{t-1}(j')}, \\
\tilde{E}^* &= [\tilde{E}^*_{i'j'}] \text{ where } \tilde{E}^*_{i'j'} = E^*_{\pi^{t-1}(i')\pi^{t-1}(j')}, \\
\tilde{T}^* &= [\tilde{T}^*_{ii'}] \text{ where } \tilde{T}^*_{ii'} = T^*_{\pi^{s-1}(i)\pi^{t-1}(i')}.
\end{aligned} $$
We first prove that Ẽ* solves the inner maximization problem for T̃*. For all i', j' ∈ ⟦n^t⟧, when
Σ_{ij} T̃*_{ii'} T̃*_{jj'} (C̃^s_{ij} − C̃^t_{i'j'}) ≤ 0, we have Σ_{ij} T*_{iπ^{t-1}(i')} T*_{jπ^{t-1}(j')} (C^s_{ij} − C^t_{π^{t-1}(i')π^{t-1}(j')}) ≤ 0,
which is consistent with Ẽ*_{i'j'} = ε. The case when Σ_{ij} T̃*_{ii'} T̃*_{jj'} (C̃^s_{ij} − C̃^t_{i'j'}) > 0 is similar. Since
$$ \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E^*_{i'j'})^2\, T^*_{ii'} T^*_{jj'} = \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (\tilde{C}^s_{ij} - \tilde{C}^t_{i'j'} - \tilde{E}^*_{i'j'})^2\, \tilde{T}^*_{ii'} \tilde{T}^*_{jj'} $$
and
$$ \max_{E \in \mathcal{U}_\varepsilon} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2\, T_{ii'} T_{jj'} = \max_{E \in \mathcal{U}_\varepsilon} \sum_{i,j=1}^{n^s} \sum_{i',j'=1}^{n^t} (\tilde{C}^s_{ij} - \tilde{C}^t_{i'j'} - E_{i'j'})^2\, \tilde{T}_{ii'} \tilde{T}_{jj'}, $$
where T̃_{ii'} = T_{π^{s-1}(i)π^{t-1}(i')}, T̃* and Ẽ* solve the optimization problem of
RGWD((Q^{s⊤} C^s Q^s, Q^{s⊤} p^s), (Q^{t⊤} C^t Q^t, Q^{t⊤} p^t), ε).
Proof: (i) We first prove that f(·) is L-Lipschitz. For all T, T' ∈ Π(p^s, p^t) and E, E' ∈ U_ε,
$$ \begin{aligned}
& f(T, E) - f(T', E') \\
&\le \Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E_{i'j'})^2 T_{ii'} T_{jj'} - \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T_{jj'} \Big| \\
&\quad + \Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T_{jj'} - \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T'_{jj'} \Big| \\
&\quad + \Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T_{ii'} T'_{jj'} - \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2 T'_{ii'} T'_{jj'} \Big|.
\end{aligned} $$
The second term on the right-hand side can be bounded as
$$ \begin{aligned}
\Big| \sum_{iji'j'} (C^s_{ij} - C^t_{i'j'} - E'_{i'j'})^2\, T_{ii'} \big( T_{jj'} - T'_{jj'} \big) \Big|
&\le \sum_{iji'j'} |C^s_{ij} - C^t_{i'j'} - E'_{i'j'}|^2\, T_{ii'}\, \big| T_{jj'} - T'_{jj'} \big| \\
&\le (2U_1 + \varepsilon)^2 U_2 \sum_{iji'j'} \big| T_{jj'} - T'_{jj'} \big| \\
&\le (2U_1 + \varepsilon)^2 U_2\, n^2 \sum_{jj'} \big| T_{jj'} - T'_{jj'} \big| \\
&\le (2U_1 + \varepsilon)^2 U_2\, n^3\, \| T - T' \|_F,
\end{aligned} $$
and the third term can be bounded analogously, while the first term admits a similar bound in ‖E − E'‖_F.
Combining the three bounds yields
$$ f(T, E) - f(T', E') \le L \sqrt{ \|E - E'\|_F^2 + \|T - T'\|_F^2 }. $$
(ii) Now we prove that f(·) is l-smooth, which requires finding a constant l satisfying
$$ \left\| \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T,E)\big) \\ \mathrm{vec}\big(\nabla_E f(T,E)\big) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T',E')\big) \\ \mathrm{vec}\big(\nabla_E f(T',E')\big) \end{bmatrix} \right\|_2 \le l \left\| \begin{bmatrix} \mathrm{vec}(T) \\ \mathrm{vec}(E) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}(T') \\ \mathrm{vec}(E') \end{bmatrix} \right\|_2, $$
where vec(X) denotes the vectorization of matrix X and stacking two vectors denotes their concatenation. Since the left-hand side satisfies
$$ \left\| \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T,E)\big) \\ \mathrm{vec}\big(\nabla_E f(T,E)\big) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}\big(\nabla_T f(T',E')\big) \\ \mathrm{vec}\big(\nabla_E f(T',E')\big) \end{bmatrix} \right\|_2 = \sqrt{ \|\nabla_T f(T,E) - \nabla_T f(T',E')\|_F^2 + \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F^2 } \le \|\nabla_T f(T,E) - \nabla_T f(T',E')\|_F + \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F, $$
and the right-hand side satisfies
$$ l \left\| \begin{bmatrix} \mathrm{vec}(T) \\ \mathrm{vec}(E) \end{bmatrix} - \begin{bmatrix} \mathrm{vec}(T') \\ \mathrm{vec}(E') \end{bmatrix} \right\|_2 = l \sqrt{ \|E - E'\|_F^2 + \|T - T'\|_F^2 } \ge \frac{l}{\sqrt{2}} \|E - E'\|_F + \frac{l}{\sqrt{2}} \|T - T'\|_F, $$
Since ∇_E f(T, E) = 2(E + C^t) ⊙ (p^t p^{t⊤}) − 2 T^⊤ C^s T, the difference ‖∇_E f(T, E) − ∇_E f(T', E')‖_F can be bounded
as follows:
$$ \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F \le \|2(E - E')\|_F\, \|p^t\|_F^2 + 4 \|T\|_F\, \|C^s\|_F\, \|T - T'\|_F \le 2 U_2^2 \|E - E'\|_F + 4 n U_1 U_2 \|T - T'\|_F. $$
Combining the above four relations, we have
$$ \|\nabla_T f(T,E) - \nabla_T f(T',E')\|_F + \|\nabla_E f(T,E) - \nabla_E f(T',E')\|_F \le \max\{ 10 n^3 U_1^2 + 6 n^3 U_1 \varepsilon + 4 n U_1 U_2 + 4 n^3 \varepsilon^2,\; 6 n^2 U_1 U_2 + 2 U_2^2 + 4 n^2 \varepsilon U_2 \}\, \big( \|E - E'\|_F + \|T - T'\|_F \big), $$
which gives the claimed value of l.
Theorem 2 Define φ(·) = max_{E∈U_ε} f(·, E). The output T̂ of Algorithm 1 with step-size η = γ/√(N+1)
satisfies
$$ \mathbb{E}\big[ \|\nabla \varphi_{1/2l}(\hat{T})\|^2 \big] \le 2\, \frac{\varphi_{1/2l}(T_0) - \min_{T \in \Pi(p^s, p^t)} \varphi(T) + l L^2 \gamma^2}{\gamma \sqrt{N+1}}, $$
where l = √2 max{10n³U₁² + 6n³U₁ε + 4nU₁U₂ + 4n³ε², 6n²U₁U₂ + 2U₂² + 4n²εU₂} and L =
√2 max{(4U₁+2ε)U₂²n³, 2(2U₁+ε)²U₂n³} with n = max{n^s, n^t}, U₁ = max{‖C^s‖_∞, ‖C^t‖_∞},
and U₂ = max{‖p^t‖₂, max_{T'∈Π(p^s,p^t)} ‖T'‖_F}.
Proof: By the smoothness of f(·), for any T̃ ∈ Π(p^s, p^t) and T_τ from Algorithm 1, we have
$$ \varphi(\tilde{T}) \ge f(\tilde{T}, E_\tau) \ge f(T_\tau, E_\tau) + \langle \nabla_T f(T_\tau, E_\tau), \tilde{T} - T_\tau \rangle - \frac{l}{2} \|\tilde{T} - T_\tau\|_F^2 = \varphi(T_\tau) + \langle \nabla_T f(T_\tau, E_\tau), \tilde{T} - T_\tau \rangle - \frac{l}{2} \|\tilde{T} - T_\tau\|_F^2. \qquad (6) $$
Let T̂_τ = argmin_{T∈Π(p^s,p^t)} φ(T) + l ‖T − T_τ‖²_F. We have
$$ \cdots \ge l \|T_\tau - \hat{T}_\tau\|_F^2 = \frac{1}{4l} \|\nabla \varphi_{1/2l}(T_\tau)\|_F^2. $$
Plugging this into (7) and combining Lemma 3 proves the result.
B ALGORITHMIC DETAILS
The Projected Gradient Descent (PGD) consists of the following three steps in each iteration τ.

Find E_τ that maximizes f(T_τ, E). By Theorem 1, we need to calculate an auxiliary matrix
$$ G = T_\tau^\top C^s T_\tau - C^t \odot \big( T_\tau^\top \mathbf{1}_{n^s} \mathbf{1}_{n^s}^\top T_\tau \big), $$
whose (i', j')-th entry equals Σ_{i,j} [T_τ]_{ii'} [T_τ]_{jj'} (C^s_{ij} − C^t_{i'j'}); E_τ is then set to ε wherever G is non-positive and to −ε elsewhere.

Gradient step. Compute H_τ = T_τ − η ∇_T f(T_τ, E_τ), where the gradient of f with respect to T is evaluated at the current iterate.
Projection into the feasible domain. This requires solving the following problem
$$ \min_{T \ge 0} \frac{1}{2} \|T - H_\tau\|_F^2, \quad \text{s.t. } T \mathbf{1}_n = p, \; T^\top \mathbf{1}_m = q. $$
This optimization problem has a strongly convex objective and linear constraints, and hence can be
solved efficiently via the Augmented Lagrangian Method with computational complexity O(n² |log ρ| / ρ^{1/2}) (Xu,
2021), where ρ measures the optimality, that is, the violation of the two linear constraints. When
ρ = O(1/n²), this step also has cubic cost if we ignore the log term.

Therefore, the overall complexity of PGD for obtaining a δ-stationary solution is O(n³/δ² + n² |log ρ| / (ρ^{1/2} δ²)).
C ADDITIONAL EXPERIMENTS
C.1 ADDITIONAL EXPERIMENTAL RESULTS OF GRAPH CLUSTERING
Sensitivity analysis of λ. As discussed in Li et al. (2020); Vincent-Cuaz et al. (2021), the negative
quadratic term can promote the sparsity of graph embeddings. We further conduct a sensitivity
analysis of λ by varying its value in {0, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. As shown in Table 3,
λ ∈ [10⁻⁴, 10⁻²] often yields good performance. The experiments in the main paper are run with
λ = 10⁻³.
Table 3: ARI scores of RGDL with varied λ’s.
The learned embeddings of graphs can also be used in the graph classification task. RGDL is thus
compared against GDL (Vincent-Cuaz et al., 2021), GWF (Xu, 2020), and other state-of-the-art
graph classification methods including WGDL (Zhang et al., 2021) and GNTK (Du et al., 2019)
on the benchmark datasets MUTAG (Debnath et al., 1991), IMDB-B, and IMDB-M (Yanardag and
Vishwanathan, 2015). RGDL, GDL, and GWF use 3-NN as the classifier due to its simplicity. We
perform a 10-fold nested cross validation (using 9 folds for training, 1 for testing, and reporting the
average accuracy of this experiment repeated 10 times) by keeping the same folds across methods.
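A sketch of this classification protocol with scikit-learn is given below; the embeddings and labels are random placeholders, and the plain cross-validation loop is a simplification of the nested procedure described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
W = rng.random((188, 6))                 # embeddings of the K graphs (placeholder)
y = rng.integers(0, 2, size=188)         # binary labels (placeholder)

clf = KNeighborsClassifier(n_neighbors=3)            # 3-NN classifier
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, W, y, cv=cv)
print("mean accuracy:", scores.mean())
```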
The results are reported in Table 4 and RGDL outperforms or matches state-of-the-art methods.
RGDL outperforms GDL and GWF significantly, which indicates the necessity of taking into ac-
count the structural noise of observed graphs. Although WGDL and GNTK achieve similar performance,
they are more computationally and memory demanding due to their use of graph neural networks.
21