
POLITECNICO DI TORINO

Master of Science in Mathematical Engineering

Master Thesis

Graph neural networks for classification: models and applications

Supervisors:
Prof. Pietro Liò
Prof. Elisa Ficarra

Candidate:
Chiara Sopegno

March 2020
To those who have supported me over these years
Abstract

Graph neural networks have emerged in the past years as very promising
methods for the analysis of graph-structured data. Useful insights can in
fact be extracted by allowing learning models to take into account relation-
ships between entities in a graph. The main methods used in the context
of graph neural networks are here described and compared, with a focus on
the extension of convolutional layers to graph structures. Afterwards, an
analysis of how attention mechanisms integrate with graph neural networks
is introduced. In this context a new architecture is proposed, with a self-
attentive mechanism that allows a graph neural network to attend over its
own input in the context of graph classification. An application of these
methods to biomedical data is finally presented, with an example in the field
of Parkinson’s disease classification.
Contents

1 Introduction

2 Machine learning background
  2.1 Convolutional Neural Networks
  2.2 Attention Mechanism

3 Methods
  3.1 Graph Neural Networks
      3.1.1 Graph Convolutional Layers
      3.1.2 Graph Pooling Layers
  3.2 Attention Mechanism in Graph Neural Networks
  3.3 Self-Attentive Graph Classifier

4 Model application
  4.1 Parkinson's disease
      4.1.1 DNA Methylation
      4.1.2 SPECT Images
  4.2 Graph neural networks for Parkinson's disease classification
  4.3 Dataset overview
  4.4 Data preprocessing
      4.4.1 DNA methylation data
      4.4.2 SPECT imaging data
  4.5 Experiments and results

5 Conclusions and future works

Bibliography
Chapter 1

Introduction

The field of deep learning has become very popular in the past years, leading to the creation of neural network models capable of reaching state-of-the-art performance in many machine learning tasks. The majority of deep learning methods analyse data that naturally lie in Euclidean spaces. However, in many learning tasks, the input data inherently contain relationships between their entities. This is for example the case of users in social networks, neurons in brain connectomes or interacting particles in physical systems, all of which constitute systems of naturally interconnected items. By disregarding the connectivity between entities in such networks, relevant information can be lost during the automated learning process. For this reason, new models have been introduced to take into account the underlying structure of this kind of data. The architectures of these models, called graph neural networks, are built directly on graph structures, where nodes represent the analyzed items and edges encode information about the relationships between them.
Graph neural networks were first introduced by [41] as an extension of recursive neural networks, but recently the field has expanded with several models extending the concepts of convolutional neural networks to graph-structured data. Useful insights can in fact be extracted by allowing learning models to take into account relationships between the input items and to capture this information through a message-passing framework between the nodes of the graphs.
At the same time, the mechanism of attention in deep learning models has recently started to gain increasing popularity. Initially introduced in the context of neural machine translation, it is now being integrated into other fields such as computer vision. The basic idea behind these mechanisms is to allow the model to focus on the most important parts of the input during training, while disregarding underlying noise in the signal. An important advantage of such mechanisms lies in the interpretability of the resulting models. It would therefore be desirable to introduce the characteristics of attentive mechanisms also in the context of signals defined on non-Euclidean domains.

The contribution of this work is three-fold: firstly, an analysis of the main methods introduced in the context of graph neural networks is presented, with a systematic categorization of convolutional, pooling and attention mechanisms; secondly, a new architecture is proposed, with a self-attentive mechanism that allows a graph neural network to attend over its own input in the context of graph classification; finally, an application of these methods to biomedical data is presented, with examples in the field of Parkinson's disease classification.
The second chapter gives an overview of the main methods that have been developed in the field of deep learning in the past years. A first section describes convolutional neural networks in a schematic way, presenting their main concepts and structures. The second section introduces attention mechanisms and the contributions they can bring in many different contexts, ranging from natural language processing to computer vision.
The third chapter gives a complete overview of state-of-the-art methods in the field of graph neural networks. Extensions of convolutional and pooling layers to graph-structured data are explained in detail, and the advantages and disadvantages of each method are analyzed. The emerging field of attentive mechanisms in graph domains is then explored, analyzing both node-focused and graph-focused structures. Finally, a new framework for the application of self-attentive strategies when classifying an entire graph is proposed.
The fourth chapter focuses on the application of these methods to data concerning patients with Parkinson's disease, especially considering the problem of classifying data coming from the imaging and genomic fields. After a general introduction to the disease and to the data modalities taken into consideration, which include DNA methylation data and SPECT images of the brain, a description of the dataset coming from the PPMI (Parkinson's Progression Markers Initiative) is given. The disease classification problem is then re-framed as a graph classification instance and the application of the described models is analyzed.
Chapter 2

Machine learning background

2.1 Convolutional Neural Networks


Artificial neural networks are machine learning techniques that allow the creation of models mapping variables between an input space and an output space. Convolutional neural networks (CNNs) are a specific kind of neural network, commonly used in the fields of image and video analysis, as well as in natural language processing. The basic structure of a CNN consists of a sequence of layers of different types [17]:

• Convolutional layer: the characteristic layer of a CNN. It takes as input a tensor of numbers (e.g. pixel values of an image) and convolves it with one or more smaller tensors of parameters, called filters. The operation consists of sliding the parameter tensor along the input: in every position, the element-wise multiplication between the filter and the corresponding part of the input is computed, and the resulting values are added together. The final values are then inserted in order in the output tensor, also called feature map. Intuitively, for the case of images, the filters act as feature extractors, detecting edges, shapes, lines and textures, according to the values of their parameters.

• Activation layer: the layer that introduces non-linearity inside the neural network model, applying a non-linear function to the feature map. The most commonly used are the ReLU, hyperbolic tangent and sigmoid functions.

• Pooling layer: the layer that reduces the size of the feature map, retaining only the most important information in a spatial neighbourhood of its input. The most common pooling strategies usually apply simple mathematical functions, such as summing the elements in the selected patch or taking the average between them.

• Dense layer: a standard feed-forward neural network layer, in which every input element is connected with every output.

• Output layer: the last layer of a CNN architecture. In the context of multi-class recognition, it has as many outputs as the number of classes. It usually applies the SoftMax function in order to return class probabilities.
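The layer types listed above can be stacked to form a complete network. The following is a minimal sketch in PyTorch; the input size (1×28×28 images), channel counts and number of classes are arbitrary illustrative choices, not values taken from this thesis.

```python
# Minimal sketch of a CNN built from the layer types listed above (PyTorch).
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer: 16 filters
    nn.ReLU(),                                   # activation layer
    nn.MaxPool2d(2),                             # pooling layer: halves spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # dense/output layer: 10 classes
    nn.Softmax(dim=1),                           # returns class probabilities
)

x = torch.randn(4, 1, 28, 28)                    # a batch of 4 grey-scale images
print(cnn(x).shape)                              # torch.Size([4, 10])
```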

Formally, in the context of classification, the structure of the main body of


a CNN can be summarized in the following framework, introduced in [6]. If
we consider a compact Euclidean domain Ω ⊂ R^d and a set of functions in L^2(Ω), the objective is to find an approximation of a function y : L^2(Ω) → Y, where the target space Y is discrete and |Y| is equal to the number of classes. In this context, the main body of a CNN can be described as follows:

U_\Theta(f) = (C_{\Gamma(K)} \circ \dots \circ P \circ \dots \circ C_{\Gamma(2)} \circ C_{\Gamma(1)})(f)    (2.1)

where f(x) = (f_1(x), ..., f_p(x)) is the p-dimensional input to the network, with f ∈ L^2(Ω), C and P denote convolutional and pooling layers respectively, and Θ = (Γ(1), ..., Γ(K)) is the hyper-vector of network parameters.
The generic convolutional layer g = C_Γ(f), with compactly supported filters Γ = (γ_{l,l'}), can be described by the following equation:

g_l(x) = \xi\left( \sum_{l'=1}^{p} (f_{l'} \star \gamma_{l,l'})(x) \right), \quad l = 1, \dots, q    (2.2)

where x ∈ Ω are the spatial coordinates, ξ is a non-linearity and

f \star \gamma = \int f(x - x') \, \gamma(x') \, dx'    (2.3)

denotes the standard convolution. It has to be noted that in standard CNNs the filters are not actually reversed, so what is applied is a cross-correlation rather than a convolution. However, the two operations coincide when the filters are one the reverse of the other and, since in CNNs the filters are learnt during training, there is no practical difference between them. Following (2.2), several feature maps are produced, resulting in a q-dimensional output g(x) = (g_1(x), ..., g_q(x)) over the domain Ω.
As for the pooling layer g = P(f), it can instead be generally described by:

g_l(x) = h(\{f_l(x') : x' \in \mathcal{N}(x)\}), \quad l = 1, \dots, q    (2.4)

where h is a permutation-invariant function applied to a neighbourhood N(x) of x. Examples of permutation-invariant functions are the L_p norms, which result in average-pooling for p = 1 and max-pooling for p = ∞.
It is important to note that (2.1) describes the main structure of a CNN; however, the convolutional layers are sometimes followed by normalization layers that improve the performance. Moreover, as previously said, the result given by (2.1) is then fed into one or more final fully-connected layers in order to perform the final classification.

Moving on to the learning process, the training of a convolutional neural network is characterized by different steps. The first step is the initialization of the network parameters, usually with small random values. Subsequently an iterative process starts, which involves the following steps:

• Forward propagation: an input sample (or a batch of samples) is shown to the network, which computes the intermediate variables layer by layer and, in the last layer, outputs the probabilities of class membership ŷ^(i).

• Loss evaluation: given the i-th sample, the probabilities of class membership ŷ^(i) are compared with the one-hot encoding y^(i) of the true label, in order to compute the loss function. A common example is the cross-entropy loss function:

L = L(\hat{y}^{(i)}, y^{(i)}) = -\left( y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right)    (2.5)

• Back-propagation of the error: the derivatives of the loss L with respect to the network parameters are computed by recursively applying the chain rule. This process provides the local gradient of L at the current parameter configuration w̄:

\nabla L(\bar{w}) = \left( \frac{\partial L(\bar{w})}{\partial w_1}, \dots, \frac{\partial L(\bar{w})}{\partial w_m} \right)    (2.6)

• Parameter update: after back-propagating the error, the weights are updated following a gradient descent formula:

w_{t+1} = w_t - \eta \, \nabla L(w_t)    (2.7)

where η is a hyper-parameter controlling the length of the step.

The training process usually involves many iterations, with the goal of finding a configuration in parameter space that minimizes the loss function. Each complete presentation of the dataset to the network is called an epoch. In most cases it has been observed that presenting the input samples to the network in batches helps to stabilize the training procedure and improves overall results.
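A minimal sketch of the iterative process described above (forward propagation, loss evaluation, back-propagation and parameter update), written in PyTorch; the model, data and hyper-parameters are placeholders chosen only for illustration.

```python
# Sketch of one training epoch: forward pass, loss, back-propagation, SGD update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
criterion = nn.CrossEntropyLoss()                        # cross-entropy loss (eq. 2.5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # eta = 0.1 in eq. (2.7)

X = torch.randn(128, 20)                                 # toy inputs
y = torch.randint(0, 3, (128,))                          # toy labels

for xb, yb in zip(X.split(32), y.split(32)):             # mini-batches of 32 samples
    optimizer.zero_grad()
    logits = model(xb)                                   # forward propagation
    loss = criterion(logits, yb)                         # loss evaluation
    loss.backward()                                      # back-propagation (eq. 2.6)
    optimizer.step()                                     # parameter update (eq. 2.7)
```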

The success of CNNs is primarily due to their ability to exploit the stationarity, locality and intrinsic hierarchical structure of the data [6].
In the visual domain, stationarity refers to the fact that many of the properties of an image are shift-invariant. Therefore, the use of filters that are convolved across the whole input makes it possible to recognize identical features independently of their spatial location. This also allows weights to be shared across neurons, thus diminishing the number of parameters and reducing the risk of overfitting.
Secondly, in images, small sets of adjacent pixels are able on their own to encode significant information for the downstream task. This allows the use of filters whose support is much smaller than the size of the input, further diminishing the number of required parameters.
Moreover, by using a multi-layer structure that alternates pooling and convolutional operators, CNNs are able to learn multi-scale latent features in a hierarchical way. This last aspect works particularly well in capturing the intrinsic hierarchical structure of data like images, allowing the detection of different aspects of the image at different layers of the CNN: the first layers will detect edges and simpler shapes, while the last ones will be able to detect entire objects and perform classification.

2.2 Attention Mechanism


The idea behind any attention mechanism is to help a learning model understand which elements of the input are more relevant for the task at hand. This is generally addressed by learning an attention score for each element of the input and using these scores to modulate the flow of information from the input signal.
This idea was first introduced in the context of neural machine translation by [3], to cope with the loss of information in encoder-decoder networks when the input is a long sentence. In fact, while previous methods tried to encode the whole input sentence into a fixed-length vector, the method proposed in [3] uses a sequence of different latent vectors, which are then fed to the attention mechanism in the decoder step. In this way the decoder has the ability to choose adaptively which of the learnt embeddings are most significant for constructing the correct translation.
More precisely, if we call {h_1, ..., h_T}, with h_j ∈ R^d, the encoded embeddings, the implementation of this mechanism consists of computing a context vector c_i as the weighted sum of the embeddings:

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j    (2.8)

where the attention coefficients α_{ij} are computed based on an alignment model between the input and the output sentence. Each α_{ij} should in fact measure how well the latent embedding h_j is aligned with an encoding of the output sentence up until position i, denoted by s_{i-1} ∈ R^d. The attention coefficients can be computed by:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}    (2.9)

where the inputs e_{ij} to the softmax function are the alignment coefficients [3]:

e_{ij} = a(s_{i-1}, h_j)    (2.10)

The alignment model a can be parametrized by a feed-forward neural network taking as input a combination of s_{i-1} and h_j, in what is called additive attention. More recently, some works have proposed a multiplicative attention method, which uses the dot product between the two inputs as alignment model [46]. It should be noted that the alignment model can be trained together with the whole neural network, thus allowing great flexibility in the model.
Interestingly, in the context of neural machine translation, if we consider the coefficient α_{ij} to be the probability that s_{i-1} is aligned with the embedding h_j, then the context vector c_i can be interpreted as an expected value of the embeddings over the possible alignments [3].
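As a concrete illustration, the following sketch computes the context vector of equations (2.8)-(2.10) with a small additive alignment model of the form v^T tanh(W1 s_{i-1} + W2 h_j); the weights are random placeholders rather than trained parameters.

```python
# Sketch of additive attention (eqs. 2.8-2.10) with random, untrained weights.
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
h = rng.normal(size=(T, d))          # encoder embeddings h_1..h_T
s_prev = rng.normal(size=d)          # decoder state s_{i-1}

# additive alignment model a(s_{i-1}, h_j) = v^T tanh(W1 s_{i-1} + W2 h_j)
W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
e = np.array([v @ np.tanh(W1 @ s_prev + W2 @ h_j) for h_j in h])  # eq. (2.10)

alpha = np.exp(e) / np.exp(e).sum()  # softmax normalization, eq. (2.9)
c = alpha @ h                        # context vector, eq. (2.8)
print(alpha.round(3), c.shape)
```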

Equation (2.10) applies the alignment model to vectors that are computed from the input and output sentences by using recurrent neural networks. However, it is also possible to apply a self-attention mechanism by computing the alignment model directly between different positions of the input and output sequences, as proposed in the Transformer model by [46]. They propose in fact to separate the output sequence into a set of queries, represented by a matrix Q ∈ R^{q×d}, and the input one into a set of keys, encoded in K ∈ R^{k×d}, where q is the number of queries, k the number of keys and d the shared dimension of queries and keys. The attention scores are computed through the alignment model:

A = \mathrm{SoftMax}\left( \frac{Q K^T}{\sqrt{d}} \right)    (2.11)

in a scaled multiplicative attention formulation. If we then consider the values V ∈ R^{k×v}, we can use the attention scores to modulate their information through C = AV, where C is the matrix containing the context vectors. The values are often computed through a transformation of the keys themselves. This formulation avoids the costly computation of embeddings through RNNs and yields very good experimental results.
Another improvement in attention models proposed by [46] is what is called multi-head attention: instead of computing only one alignment model over the queries and keys, it is possible to compute several of them, taking as inputs learned linear projections of those same queries and keys. The attention mechanisms are then computed in parallel and the learnt context vectors can be concatenated to form a unique output of the multi-head attention layer. Through this formulation it is possible to increase the expressive capacity of the model [46].
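A compact sketch of the scaled multiplicative formulation of equation (2.11) follows, with random matrices standing in for learned projections of queries, keys and values.

```python
# Sketch of scaled dot-product attention, eq. (2.11): A = SoftMax(QK^T / sqrt(d)), C = AV.
import numpy as np

rng = np.random.default_rng(1)
q, k, d, v = 4, 6, 8, 8              # number of queries/keys, key and value dims
Q = rng.normal(size=(q, d))
K = rng.normal(size=(k, d))
V = rng.normal(size=(k, v))

scores = Q @ K.T / np.sqrt(d)                                    # alignment scores
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise SoftMax
C = A @ V                                                        # context vectors, one per query
print(A.shape, C.shape)                                          # (4, 6) (4, 8)
```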

Despite being introduced in the context of neural machine translation, attentive models have also been exploited in the context of computer vision. It is in fact possible to use an attention mechanism to adaptively select a portion of an image, in order to disregard the non-relevant parts and diminish both the noise and the number of parameters required. A first attempt was made by constructing an RNN model which processed the input by sequentially fixating on different parts of the image and then incrementally combined the information by giving different weights to embeddings coming from different regions of the image [35].
More recent approaches instead integrate attention mechanisms into convolutional neural networks through the use of learnable attention maps. Traditionally, attention maps have been extracted from intermediate layers of CNNs after these were fully trained. The objective was to understand the relative importance that the features in a layer had in the creation of the following feature maps. However, using attention in such a post-hoc formulation does not let it participate in, and improve, the learning process.

A method that integrates attention into training is proposed by [22], where attention scores are learnt during training and used to create a convex combination of the feature vectors in a specific layer. Their method for computing attention scores re-purposes the global image representation, computed by the network itself in its last layers, as a query that attends over the feature vectors of intermediate layers, considered as keys. The basic idea of this mechanism is summarized in figure 2.1.
We can denote the global feature vector by g ∈ R^d and by L^s = {l_1^s, ..., l_n^s}, with l_i^s ∈ R^d, the set of local feature vectors extracted at a given layer s ∈ S, where n is the total number of spatial locations at layer s and S is a subset of layers of interest. To compute the attention map at layer s, we first need to compute each alignment score e_i^s ∈ R between g and each one of the feature vectors l_i^s through e_i^s = a(g, l_i^s), where the alignment model a can be either multiplicative or additive. The alignment scores are then normalized following:

\alpha_i^s = \frac{\exp(e_i^s)}{\sum_{j=1}^{n} \exp(e_j^s)}    (2.12)

in a layer-wise formulation. Equation (2.12) generates the values in the attention map A = \{\alpha_1^s, \dots, \alpha_n^s\}, which will be used to filter the importance of each local feature vector in layer s. For each layer of interest, a vector representation is produced by:

g_a^s = \sum_i \alpha_i^s \, l_i^s    (2.13)
Figure 2.1: Overview of the attention mechanism proposed by [22]

The concatenation of the vectors g_a^s, for s ∈ S, then substitutes the vector g as the global representation of the image.
This method allows for an integration of global and local features, attending one over the others and thus creating a multi-scale attention framework. The attention scores represent how well a region of the intermediate feature map relates to the global representation of the image itself, enabling the learning process to amplify the influence of the most significant regions and discard the information coming from noisy regions. Moreover, the use of the attention mechanism at different layers allows the model to focus on different spatial resolutions.
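The following sketch reproduces the layer-wise mechanism of equations (2.12)-(2.13) for a single layer s, using a multiplicative alignment and random features in place of real CNN activations.

```python
# Sketch of the global-query attention of eqs. (2.12)-(2.13) at one layer s.
# Features are random placeholders; a real model would use CNN activations.
import numpy as np

rng = np.random.default_rng(2)
n, d = 49, 32                         # n spatial locations, d-dimensional features
g = rng.normal(size=d)                # global feature vector (query)
L_s = rng.normal(size=(n, d))         # local feature vectors l_i^s (keys/values)

e = L_s @ g                           # multiplicative alignment e_i^s = <g, l_i^s>
alpha = np.exp(e) / np.exp(e).sum()   # eq. (2.12), layer-wise softmax
g_a = alpha @ L_s                     # eq. (2.13), attentive layer representation
print(g_a.shape)                      # (32,)
```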
Chapter 3

Methods

3.1 Graph Neural Networks

Convolutional neural networks represent the state-of-the-art methods for data defined on grid-like Euclidean structures, like images and videos. However, many real-world scenarios are represented using data that are naturally defined in non-Euclidean spaces, such as graphs and manifolds [6]. Some examples of graph-structured data can be found in social networks, transportation networks, brain connectomes and other kinds of biological networks. Generalizing convolutional neural networks to graphs is nevertheless not easy, since convolutional and pooling operators are defined on regular grids.
The first approaches to applying deep learning to these kinds of structured data required a pre-processing phase in which the topological structure of the input was encoded in simpler structures (e.g. vectors of reals). For these methods, however, the quality of the results was deeply linked to the quality of the encoding and, in the absence of a deep understanding of the data domain, a considerable amount of the topological information could be lost in the process. Moreover, the hand-engineering of features is simply not feasible when dealing with high-dimensional data.
Graph neural networks were first presented in [18] and [41] as an extension of recursive neural networks. Recursive neural networks are in fact a type of neural network whose inner structure is a directed acyclic graph, so their basic scheme can be extended to more general kinds of graph structures. The method was based on the repeated application of contraction maps, until the states of the nodes reached a stable fixed point. Subsequently, the state vectors of the nodes were fed into a feed-forward neural network to perform classification [41]. More recently, graph neural networks have instead been re-framed as an extension of CNNs to graph domains, by defining convolution and pooling operations on graph structures.

Formally, a graph can be defined as G = (V, E, W), where V is a finite set of n nodes, E is the set of edges between nodes and W ∈ R_+^{n×n} is a weighted adjacency matrix. Edges are described as ordered pairs of nodes (i, j) ∈ V × V, while the weight matrix is a non-negative matrix whose elements satisfy the conditions:

W_{ij} = 0 \ \text{if } (i, j) \notin E, \qquad W_{ij} > 0 \ \text{if } (i, j) \in E    (3.1)

The set of nodes represents the units involved in the graph, for example people in social networks or interacting particles in a physical system. Edges represent instead specific relationships between these entities, though they may have different interpretations depending on the context. The optional weight W_{ij} associated to an edge (i, j) usually encodes the strength of that specific relationship.
The works proposed in the field of deep learning applied to graphs can generally be divided into two categories [41]:

• node-focused applications, where the final task is associated with individual properties of each node (examples include semi-supervised node classification, link prediction and node clustering)

• graph-focused applications, where the model to be constructed depends on the whole graph structure (examples include graph classification, graph generation and estimation of properties of graphs)

Since images can be considered as graphs with a fixed Euclidean grid, we can re-frame tasks like image classification as examples of graph classification, while object localization and image segmentation are examples of node-focused applications, where the nodes are the pixels of the image.
In the context of this thesis we will focus on the second class of problems, the one regarding graph-focused applications and specifically graph classification.

In [7], graph neural networks were introduced in a CNN-like structure for the first time. First of all, this involves giving a definition of what it means to apply convolutions to graph-structured data. The propagation of information across elements is in fact at the basis of any convolutional neural network. Then, even if not strictly necessary, an extension of the idea of a pooling layer should also be given, in order to allow the network to exploit a possible hierarchical structure of the data. Batch normalization and activation layers can instead be trivially extended from the ones used in standard convolutional neural networks. Moreover, the training can be performed using the back-propagation algorithm with task-specific losses.

3.1.1 Graph Convolutional Layers


The first attempt to generalize convolutions to graph structures takes inspiration from harmonic analysis on graphs [7], and it requires the introduction of some concepts from spectral graph theory in order to be described.
Given a graph structure G = (V, E, W) with n nodes, we can consider the Hilbert spaces L^2(V) and L^2(E) defined by the following inner products:

\langle f, g \rangle_{L^2(V)} = \sum_{i \in V} f_i g_i    (3.2)

\langle F, G \rangle_{L^2(E)} = \sum_{e \in E} w_e F_e G_e    (3.3)

with f, g : V → R and F, G : E → R. By considering two functions f ∈ L^2(V) and F ∈ L^2(E), we can then define the graph gradient as the operator ∆ : L^2(V) → L^2(E) such that:

(\Delta f)_{(i,j)} = f_i - f_j    (3.4)

and the graph divergence div : L^2(E) → L^2(V) expressed by:

(\mathrm{div}\, F)_i = \sum_{j : (i,j) \in E} w_{ij} F_{(i,j)}    (3.5)

It follows that the graph Laplacian ∇ = −div ∆ is an operator ∇ : L^2(V) → L^2(V) such that:

(\nabla f)_i = \sum_{j : (i,j) \in E} w_{ij} (f_i - f_j)    (3.6)

If we apply the graph Laplacian operator to a signal x = (x_1, ..., x_n)^T ∈ R^n such that x ∈ L^2(V), we see that (3.6) can be rewritten in the following matrix formulation:

\nabla x = (D - W) x    (3.7)
where D ∈ R^{n×n} is the diagonal degree matrix with D_{ii} = \sum_j w_{ij}. We can then define the combinatorial Laplacian matrix as L = D − W, which is an important operator in graph spectral analysis. Alternatively, the normalized definition of the Laplacian matrix, L = I − D^{-1/2} W D^{-1/2}, can be used. Both definitions of the graph Laplacian are generalizations of the Laplacian defined in Euclidean spaces.
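A short sketch computing both Laplacians from a weighted adjacency matrix follows; the small 4-node graph is a hand-made example used only for illustration.

```python
# Sketch: combinatorial (L = D - W) and normalized (I - D^-1/2 W D^-1/2) Laplacians.
import numpy as np

# small weighted adjacency matrix of a 4-node undirected graph (illustrative)
W = np.array([[0, 1, 0, 2],
              [1, 0, 3, 0],
              [0, 3, 0, 1],
              [2, 0, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))                          # degree matrix, D_ii = sum_j w_ij
L = D - W                                           # combinatorial Laplacian
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L_norm = np.eye(4) - D_inv_sqrt @ W @ D_inv_sqrt    # normalized Laplacian

eigvals = np.linalg.eigvalsh(L)                     # graph frequencies (non-negative)
print(eigvals.round(3))
```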

Given a signal x ∈ R^n over the nodes of the graph, we can define the smoothness functional as:

\|\nabla x\|_W^2 = \sum_i \|\nabla x\|_W^2(i) = \sum_i \sum_j w_{ij} \, [x(i) - x(j)]^2    (3.8)

where the notation \|\nabla x\|_W^2(i) denotes the smoothness functional at node i. Given this definition, we have that

u_0 = \arg\min_{x \in \mathbb{R}^n, \, \|x\|=1} \|\nabla x\|_W^2 = (1/\sqrt{n}) \, 1_n

is the smoothest signal on the graph structure (it is in fact constant over the nodes), while the other eigenvectors of L are in general given by:

u_i = \arg\min_{x \in \mathbb{R}^n, \, \|x\|=1, \, x \perp \{u_0, \dots, u_{i-1}\}} \|\nabla x\|_W^2

Since L is a symmetric positive semidefinite matrix, it is possible to determine the smoothness of the signal x by reading its coefficients in the basis \{u_i\}_{i=0}^{n-1} of orthonormal eigenvectors. These vectors are also called the Fourier modes of the graph, and their associated non-negative eigenvalues \{\lambda_i\}_{i=0}^{n-1} identify the frequencies of the graph.
The graph Fourier transform is defined as the projection onto the eigenbasis of the graph Laplacian operator:

\hat{x} = U^T x
while its inverse is given by x = U \hat{x}. The coefficients of a signal x in the eigenvector basis are the equivalent of the Fourier coefficients of signals defined on grids. Moreover, the Fourier basis can also be used to diagonalize the Laplacian itself as follows:

L = U \Lambda U^T, \quad U = [u_0, \dots, u_{n-1}] \in \mathbb{R}^{n \times n}, \quad \Lambda = \mathrm{diag}\{\lambda_0, \dots, \lambda_{n-1}\} \in \mathbb{R}^{n \times n}

Since it is difficult to extend the traditional definition of convolution, which relies on the existence of a translation operator, to the vertex domain of a graph, the idea introduced in [7] is to define the convolution operation in the spectral domain. The spectral convolution of two signals x, y ∈ R^n on a graph G can then be defined as:

x \star y = U \left( (U^T x) \odot (U^T y) \right) = U \, \mathrm{diag}(\hat{y}_1, \dots, \hat{y}_n) \, \hat{x}    (3.9)

where \odot is the Hadamard product. Equation (3.9) represents the equivalent of the convolution theorem defined in Euclidean domains. A signal x can then be filtered following:

x' = g_\theta \star x = U G(\Lambda) U^T x    (3.10)

where G(Λ) is a diagonal matrix of learnable filters in the frequency domain.


If we assume that the input signal at a specific layer k is given by a matrix X^{(k)} ∈ R^{n×f_k}, where f_k is the number of features of each node at that layer, then the columns of the feature matrix at the next layer k+1 can be computed as:

x_j^{(k+1)} = \xi\left( \sum_{i=1}^{f_k} U \, G_{i,j}^{(k)} \, U^T x_i^{(k)} \right), \quad j = 1, \dots, f_{k+1}    (3.11)

where ξ is a non-linear function and each G_{i,j}^{(k)} is a diagonal matrix of filters that can be learnt during the back-propagation process. We observe that the resulting feature matrix X^{(k+1)} will have dimensions n × f_{k+1}. The idea is similar to classical CNNs: the input signals are passed through these learnable filters, which aggregate the information, and subsequently the non-linearity ξ is applied to the output.
However, often only the smoothest part of the signal is useful in the analysis, since it is the one that is less likely to carry signal noise. It can therefore be useful to consider only the first d eigenvectors of the Laplacian and substitute U with U_d = [u_0, ..., u_{d-1}] ∈ R^{n×d}. The cutoff value d depends on the specific application, on the regularity of the graph and on the sample size. It is important to notice that the single high-frequency eigenvectors often do not bring much information on their own, but much more information can be obtained by allowing the network to combine them.
We have seen that equation (3.11) allows us to define a convolution operation in the spectral domain. By defining the convolution on this domain, we are now able to allow weight sharing across different locations of the graph; however, the filters used are not spatially localized. It can be observed that, in the Euclidean grid, spatial localization translates into smoothness in the spectral domain. So, in order to learn filters that are actually localized in the original spatial domain, it is necessary to learn spectral multipliers that give a smooth signal in the Fourier domain. This can be done also in graphs, by selecting only a subset of multipliers and using interpolation kernels to obtain the rest, thus allowing localization of the filters in the spatial domain.
In particular, the smooth filters can be described by:

\mathrm{diag}(G_{i,j}^{(k)}) = B \, \alpha_{i,j}^{(k)}    (3.12)

where B = (b_{l,m}) = (\beta_m(\lambda_l)) is a d × q_k matrix of fixed interpolation kernels, for example cubic splines, and \alpha_{i,j}^{(k)} are the q_k interpolation coefficients.
In order to obtain filters with constant spatial support (i.e. independent of the size of the input n), it is necessary to choose a sampling step γ ∼ n in the spectral domain, thus giving a constant number q_k ∼ n · γ^{-1} of coefficients per filter [7].
Another limitation of convolutions in the Fourier domain is the O(n^2) computational cost, due to the multiplication with the Fourier basis, which has to be performed twice, for the forward and inverse Fourier transforms. Moreover, the filters in (3.11) learnt during the process depend on the basis U (or U_d) used, and therefore on the specific structure of the graph [50].

An improvement of the previous model is introduced by [11]. In order to solve both the efficiency problem and the non-localization of the filters, a solution is the use of polynomial filters:

G_\theta(\Lambda) = \sum_{j=0}^{r-1} \theta_j \Lambda^j    (3.13)

where r is the polynomial order and θ ∈ R^r is a vector of learnable coefficients. It can be proven that such filters are localized in a spatial neighbourhood of radius r − 1 from the central node in which they are applied [11]. This is intuitive, since the Laplacian is an operator that works on a neighbourhood of radius 1, and any j-th power of the diagonalized Laplacian thus works on a j-hop neighbourhood [6]. Moreover, the number of parameters to be learnt depends on the size r of the filters and is thus independent of the input size.
However, by introducing such filters in (3.11), we still incur the computational complexity brought by the multiplication with the Fourier basis. This problem can be solved through the use of a Chebyshev expansion.
The Chebyshev polynomials are defined by the following recursive formula:

T_0(x) = 1, \quad T_1(x) = x, \quad T_k(x) = 2x \, T_{k-1}(x) - T_{k-2}(x)    (3.14)

and they form an orthogonal basis for L^2([-1, 1], dy/\sqrt{1 - y^2}), i.e. the space of square-integrable functions on [−1, 1] with respect to the measure dy/\sqrt{1 - y^2} [20]. A generic filter can thus be parametrized by the truncated Chebyshev expansion:

G_\theta(\Lambda) = \sum_{j=0}^{r-1} \theta_j \, T_j(\tilde{\Lambda})    (3.15)

where the Chebyshev coefficients θ ∈ R^r are again learnable and the diagonalized Laplacian is rescaled through \tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n, so that its elements lie in [−1, 1].
With the filters defined in (3.15), the convolution operation (3.11) on a signal x ∈ R^n can be rewritten as follows:

x' = g_\theta \star x = U G_\theta(\Lambda) U^T x = \sum_{j=0}^{r-1} \theta_j \, U T_j(\tilde{\Lambda}) U^T x = \sum_{j=0}^{r-1} \theta_j \, T_j(\tilde{L}) \, x = \sum_{j=0}^{r-1} \theta_j \, \bar{x}_j    (3.16)

where \tilde{L} = 2L/\lambda_{max} - I_n is the rescaled Laplacian and the values \bar{x}_j = T_j(\tilde{L}) x can be computed using the recursive relation:

\bar{x}_j = 2\tilde{L} \bar{x}_{j-1} - \bar{x}_{j-2}    (3.17)


with \bar{x}_0 = x and \bar{x}_1 = \tilde{L} x. We have thus defined compactly supported filters whose computation requires a complexity of O(|E|), i.e. linear in the number of edges of the graph. This means that for sparse graphs the computation can be performed in linear time with respect to the input size n. We can thus see how avoiding an explicit use of the Fourier transform brings a great computational advantage.
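A sketch of the Chebyshev filtering of equations (3.15)-(3.17), applied to a random signal on a small randomly generated graph; the coefficients θ are random stand-ins for learnt parameters.

```python
# Sketch of Chebyshev filtering (eqs. 3.15-3.17): x' = sum_j theta_j T_j(L_tilde) x.
import numpy as np

rng = np.random.default_rng(3)
n, r = 6, 4                                   # 6 nodes, polynomial order r
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)                        # random symmetric weighted graph
L = np.diag(W.sum(axis=1)) - W                # combinatorial Laplacian

lam_max = np.linalg.eigvalsh(L).max()
L_tilde = 2 * L / lam_max - np.eye(n)         # rescaled Laplacian, spectrum in [-1, 1]

x = rng.normal(size=n)                        # input signal on the nodes
theta = rng.normal(size=r)                    # "learnt" Chebyshev coefficients

x_bar = [x, L_tilde @ x]                      # T_0(L~)x and T_1(L~)x
for j in range(2, r):
    x_bar.append(2 * L_tilde @ x_bar[-1] - x_bar[-2])    # recursion (3.17)

x_filtered = sum(t * xb for t, xb in zip(theta, x_bar))  # eq. (3.16)
print(x_filtered.shape)                       # (6,)
```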

The filters defined in equation (3.15) can be further simplified in a first-order approximation [23]. In fact, starting from equation (3.16), if we consider the expansion only up to r = 2, we obtain a function that is linear with respect to \tilde{L}. Moreover, if we consider the approximation \lambda_{max} \approx 2, we get a filter g_\theta leading to the following convolution:

g_\theta \star x \approx \theta_0 x + \theta_1 (L - I_n) x = \theta_0 x - \theta_1 D^{-1/2} W D^{-1/2} x    (3.18)

with parameters that are shared across the different locations of the graph. Since we have taken a further approximation of the Chebyshev expansion, these filters aggregate the signal only across a 1-hop neighbourhood of the graph. Nevertheless, the consecutive application of different filters allows information to be aggregated from a larger neighbourhood.
If we further introduce the constraint \theta^* = \theta_0 = -\theta_1, we obtain the following filter:

g_\theta \star x = \theta^* (I_n + D^{-1/2} W D^{-1/2}) x    (3.19)

However, since the matrix I_n + D^{-1/2} W D^{-1/2} has eigenvalues in [0, 2], we can encounter problems of vanishing/exploding gradients. For this reason, a renormalization of W is introduced:

g_\theta \star x = \theta^* (\tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}) x    (3.20)


where \tilde{W} = W + I_n and \tilde{D}_{ii} = \sum_j \tilde{W}_{ij}.
If we consider a multi-channel signal X^{(k)} ∈ R^{n×f_k} defined on the graph at layer k of the neural network, we can then compute the signal at the next layer by generalizing the previous formula:

X^{(k+1)} = \xi(\tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} X^{(k)} \Theta)    (3.21)

where \Theta ∈ R^{f_k × f_{k+1}} is the matrix of filter parameters, with f_{k+1} output channels.
It should be noted that, when constructing the new feature matrix with this type of convolution, we give the same importance to the features of the node itself and to those of its neighbours. However, in some applications it might be useful to introduce a trade-off parameter µ in the definition of \tilde{W}:

\tilde{W} = W + \mu I_n

where µ can also be learnt through the gradient descent algorithm.
With this formulation we get single-parameter filters that help to tackle the overfitting problem on neighbourhoods where the distribution of node degrees is wide. This property is crucial when building deeper graph neural networks with more than a few convolutional layers, and also in the case of applications to large-scale networks. It is also important to notice that, although equations (3.15) and (3.20) are derived from spectral convolutions, both of them are in the end computed in the spatial domain, thus allowing for further computational efficiency.
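A sketch of the propagation rule (3.21) with the renormalized adjacency matrix; the graph, features and parameter matrix Θ are random placeholders.

```python
# Sketch of the propagation rule (3.21): X' = ReLU(D~^-1/2 W~ D~^-1/2 X Theta).
import numpy as np

rng = np.random.default_rng(4)
n, f_in, f_out = 5, 8, 4
W = (rng.random((n, n)) < 0.4).astype(float)
W = np.maximum(W, W.T)
np.fill_diagonal(W, 0)                         # random undirected adjacency matrix
X = rng.normal(size=(n, f_in))                 # node feature matrix X^(k)
Theta = rng.normal(size=(f_in, f_out))         # filter parameters (placeholder)

W_tilde = W + np.eye(n)                        # renormalization: W~ = W + I
d_tilde = W_tilde.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
A_hat = D_inv_sqrt @ W_tilde @ D_inv_sqrt      # D~^-1/2 W~ D~^-1/2

X_next = np.maximum(A_hat @ X @ Theta, 0)      # eq. (3.21) with xi = ReLU
print(X_next.shape)                            # (5, 4)
```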

The main problem of the spectral approach defined by (3.11) lies in the dependence of the convolution on the Fourier basis: this dependence implies that a model learnt on a specific domain is not easily transferable to a different one [36]. For this reason, different approaches have been developed in order to allow the convolution operator to deal with differently sized neighbourhoods [51].
One of the methods that overcomes this limitation is introduced in [19]. The basic idea of this method is to learn a function able to generate node embeddings by looking only at the neighbourhood of the single node. The method is called GraphSAGE, which stands for SAmple and aggreGatE, and a schematic representation can be found in figure 3.1. The main difference from previous methods is that it does not learn a different embedding for each node, but it learns an aggregation operator over a subset of the neighbouring nodes:

x_{\mathcal{N}(v)}^{(k+1)} = s(\{x_u^{(k)}, \ \forall u \in \mathcal{N}(v)\})    (3.22)

where s is an aggregation function and N(v) is a fixed-size subset of the neighbours of v. Since there is no natural ordering between the neighbours of a node, the aggregation function s needs to be a permutation-invariant function, operating over an unordered set of vectors. The method also requires that N(v) is sampled again at each iteration, with a uniform distribution over \{u \in V : (u, v) \in E\}.
The hidden representation from (3.22) is then concatenated with the vector of node features at layer k:

x_v^{(k+1)} = \xi(W_k \cdot [x_v^{(k)} \,\Vert\, x_{\mathcal{N}(v)}^{(k+1)}])    (3.23)

where W_k is a matrix of parameters to be learnt at layer k. By concatenating the representation of the node at the previous layer, we create a skip-connection, where the network is allowed to keep only the previous representation of the node, in case that turns out to be the best option for the task.
Figure 3.1: A schematic representation of the GraphSAGE method [19]

In [19], different aggregation functions s are proposed. The simplest idea is to use the mean over the vectors of the neighbouring set. Alternatively, a pooling aggregator can be used, in the form:

s(T) = \max(\{\sigma(W_{pool} \, x_u^{(k)} + b), \ u \in T\}), \quad T \subseteq V

where T is the subset of neighbouring nodes.


This method is particularly useful for graphs in which the nodes have ini-
tially a big amount of features describing their properties, for example social
networks with people’s information or citation data with text attributes, but
it also allows to exploit the structural information of the graph. Moreover,
while previous approaches were useful for the case in which all the graph
structure is previously known (e.g. graph-classification or semi-supervised
node classification), the convolution layer of GraphSAGE allows to extend
the learnt function also in inductive settings, when new nodes are added to
the graphs in following stages, as in the case of a new person joining a social
network.
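A sketch of one GraphSAGE layer (eqs. 3.22-3.23) with the mean aggregator and uniform neighbour sampling; the weight matrix and the small neighbourhood lists are placeholders for illustration.

```python
# Sketch of a GraphSAGE layer (eqs. 3.22-3.23) with mean aggregation.
import numpy as np

rng = np.random.default_rng(5)
n, f_in, f_out, sample_size = 6, 8, 4, 2
X = rng.normal(size=(n, f_in))                       # node features x_v^(k)
neigh = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 5], 4: [2], 5: [3]}
W_k = rng.normal(size=(f_out, 2 * f_in))             # layer parameters (placeholder)

X_next = np.zeros((n, f_out))
for v in range(n):
    # sample a fixed-size subset of neighbours (with replacement if too few)
    nv = rng.choice(neigh[v], size=sample_size, replace=len(neigh[v]) < sample_size)
    agg = X[nv].mean(axis=0)                         # mean aggregator, eq. (3.22)
    concat = np.concatenate([X[v], agg])             # [x_v || x_N(v)]
    X_next[v] = np.maximum(W_k @ concat, 0)          # eq. (3.23) with xi = ReLU
print(X_next.shape)                                  # (6, 4)
```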

3.1.2 Graph Pooling Layers

Extending the idea of pooling layers from CNNs to graph domains is particularly challenging. This is due to two main factors: the lack of a definition of spatial locality and, in the setting of multiple domains, the variable number of nodes in different data samples. In the context of graphs, in fact, there is no clear definition of what a patch is, and it is therefore complicated to decide which nodes should be pooled together.
Despite the challenge, a notion of pooling is often necessary when dealing with graph classification. It was therefore initially introduced in the form of a readout layer where global pooling is performed: all the node embeddings are pooled together and a simple permutation-invariant function is applied to them. This kind of layer is usually inserted after the message-passing and aggregation steps performed by the convolutional layers, and it is usually followed by one or more fully connected layers that perform the classification.

The downside of these methods is that they are intrinsically flat and do not capture any hierarchical structure that might be present in the network. This is the reason why the authors of [49] proposed a pooling layer in which a soft cluster assignment is performed and afterwards the node features of each cluster are aggregated together. This structure creates a smaller graph in the next layer, where the information regarding similar nodes has been aggregated together.
The model is based on learning a cluster assignment matrix through a graph neural network module of convolutional layers. This matrix can be denoted by S^{(k)} ∈ R^{n_k × n_{k+1}}, where n_k is the number of nodes at layer k and n_{k+1} is the number of clusters that we want to generate (which coincides with the number of nodes of the input to the next layer). If we denote by X^{(k)} ∈ R^{n_k × f_k} and W^{(k)} ∈ R^{n_k × n_k} the matrix of node features and the weighted adjacency matrix at layer k, respectively, we can compute the soft assignment as:

S^{(k)} = \mathrm{softmax}\big(\mathrm{GNN}(X^{(k)}, W^{(k)})\big)    (3.24)

where GNN denotes a module of stacked convolutional layers. This clustering module can be trained together with the rest of the neural network in an end-to-end structure. Once the soft assignment is computed, it can be used to aggregate the information of the nodes in each cluster following:

X^{(k+1)} = S^{(k)T} X^{(k)} \in \mathbb{R}^{n_{k+1} \times f_k}    (3.25)

where we can note that the number of node features f_k remains unchanged in the pooling process. Subsequently, the graph structure between the new cluster nodes can be generated through:

W^{(k+1)} = S^{(k)T} W^{(k)} S^{(k)} \in \mathbb{R}^{n_{k+1} \times n_{k+1}}    (3.26)

where the elements of W^{(k+1)} denote the strength of the connections between each pair of clusters [49]. This method, called DiffPool, introduces a graph coarsening procedure that reduces the size of the graph when going deeper into the neural network, and this helps to detect possible hierarchical structures in the graph.
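A sketch of the DiffPool coarsening step (eqs. 3.24-3.26) follows; for brevity, the GNN module that would produce the assignment scores is replaced by a random score matrix.

```python
# Sketch of DiffPool coarsening (eqs. 3.24-3.26). The assignment scores would come
# from a stack of convolutional layers; here they are random placeholders.
import numpy as np

rng = np.random.default_rng(6)
n_k, n_k1, f_k = 8, 3, 5                        # nodes, clusters, features
X = rng.normal(size=(n_k, f_k))                 # node features X^(k)
W = rng.random((n_k, n_k))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)                          # weighted adjacency W^(k)

scores = rng.normal(size=(n_k, n_k1))           # stand-in for GNN(X, W)
S = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax, eq. (3.24)

X_coarse = S.T @ X                              # eq. (3.25): cluster features
W_coarse = S.T @ W @ S                          # eq. (3.26): cluster connectivity
print(X_coarse.shape, W_coarse.shape)           # (3, 5) (3, 3)
```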

While the previous approach works really well for small graphs, it can encounter problems when processing bigger structures. First of all, because of the soft-clustering procedure, the resulting graph will not be sparse, even if the starting one had few connections [50]. Secondly, it requires storing at each step the soft-clustering matrix, which at the beginning of the network can be of the order of O(r|V|^2), where r ∈ (0, 1] is the fraction of graph vertices that is kept. For this reason, in [8] and [16], a new methodology is proposed that retains only ⌈rn⌉ of the existing nodes, instead of creating ⌈rn⌉ clusters from the previous nodes, where n is the number of nodes in input to the layer. This is achieved by simply dropping n − ⌈rn⌉ nodes based on a learnable projection score. If we define as p^{(k)} ∈ R^{f_k} the learnable projection vector and as y^{(k)} ∈ R^n the vector of scores, the method can be formalized as follows:

y^{(k)} = \frac{X^{(k)} p^{(k)}}{\|p^{(k)}\|}, \qquad \mathbf{i} = \tau(y^{(k)}, r)    (3.27)

where τ is a function that selects only the indices of the top ⌈rn⌉ elements of y^{(k)} and i is the vector of the retained indices. The selection step is then performed by:

X^{(k+1)} = \big(X^{(k)} \odot (\tanh(y^{(k)}) \, \mathbf{1}^T)\big)_{\mathbf{i}}, \qquad W^{(k+1)} = (W^{(k)})_{\mathbf{i},\mathbf{i}}    (3.28)

where \odot is the Hadamard product and (·)_i is the indexing operation based on the vector i. The idea behind this method is to relegate the aggregation of information between nodes to the convolutional layers, and to perform only a selection of the most important nodes in the pooling layer. This selection, performed just by slicing the feature matrix and the adjacency matrix, allows sparsity to be maintained in the graph [8], a characteristic that was missing in DiffPool.
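A sketch of the projection-score pooling of equations (3.27)-(3.28); the projection vector p is a random placeholder for a learnt parameter.

```python
# Sketch of top-k node pooling (eqs. 3.27-3.28) with a random projection vector.
import numpy as np

rng = np.random.default_rng(7)
n, f, ratio = 8, 5, 0.5
X = rng.normal(size=(n, f))                       # node features X^(k)
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)                            # weighted adjacency W^(k)
p = rng.normal(size=f)                            # learnable projection vector (placeholder)

y = X @ p / np.linalg.norm(p)                     # projection scores, eq. (3.27)
k = int(np.ceil(ratio * n))
idx = np.argsort(y)[-k:]                          # indices of the top ceil(rn) scores

X_pool = X[idx] * np.tanh(y[idx])[:, None]        # gate and slice features, eq. (3.28)
W_pool = W[np.ix_(idx, idx)]                      # slice the adjacency matrix
print(X_pool.shape, W_pool.shape)                 # (4, 5) (4, 4)
```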
3.2 Attention Mechanism in Graph Neural Networks
In real applications, we often have to deal with graph-structured data that can be both very large and noisy. As previously said in section 2.2, attention mechanisms allow the network to focus on the specific parts of the input that turn out to be most relevant. This benefit can be leveraged with graph-structured data as well, in an attempt to force the model to focus on the most important parts of the graphs and thus improve the signal-to-noise ratio [29]. Moreover, if successfully applied, attention provides a tool for interpreting the results given by the network and discovering the underlying dependencies that have been learnt.
Given a set of nodes Γ_V = {v_0, ..., v_n} ⊆ V, not necessarily containing all the nodes in V, and a target object s, representing a specific entity in the network, we can generally define attention in graphs as a function φ'_s : Γ_V → [0, 1] that maps each node in Γ_V to an attention score and that satisfies the following condition:

\sum_{v_i \in \Gamma_V} \varphi'_s(v_i) = 1    (3.29)

Attention mechanisms in graphs can generally be divided into two classes: attention-based node embeddings and attention-based graph embeddings [29]. In the following paragraphs, we will give a general description of both.
In the case of node embeddings, the target object s is a generic node v_j of the graph and Γ_V is the set of neighbours N(v_j) of node v_j. The attention mechanism is applied to every node of the graph and it defines the relative importance of every node v_i in N(v_j) in creating the embedding of node v_j.

The objective is to learn a generic attentive function φ : V → R^{f_{k+1}}, which is applied to every node of the graph at a specific layer k.
This mechanism is applied, for example, in the GAT convolutional layer by [47], where the embedding of a node is computed by attending over the feature vectors of the nodes in its 1-hop neighbourhood.
If we denote by {x_1^{(k)}, ..., x_n^{(k)}}, with x_i^{(k)} ∈ R^{f_k}, the input feature vectors of the nodes, a linear transformation can be applied to them by a learnable weight matrix W ∈ R^{f_{k+1} × f_k}, and subsequently a masked attentional mechanism can be applied to every node, resulting in the following scores:

e_{ij} = a(W x_i^{(k)}, W x_j^{(k)})    (3.30)

with i ∈ V and j ∈ N(i). The attention function a used is a feed-forward neural network parametrized by a vector a ∈ R^{2 f_{k+1}}. The coefficients in (3.30) are then fed to a softmax function, resulting in the following equation for computing the attention scores [47]:

\alpha_{ij} = \frac{\exp\!\big(\mathrm{LeakyReLU}\big(a^T (W x_i^{(k)} \,\Vert\, W x_j^{(k)})\big)\big)}{\sum_{k' \in \mathcal{N}(i)} \exp\!\big(\mathrm{LeakyReLU}\big(a^T (W x_i^{(k)} \,\Vert\, W x_{k'}^{(k)})\big)\big)}    (3.31)

where ∥ denotes the concatenation of vectors.
These attention scores are then used to compute the output of the layer {x_1^{(k+1)}, ..., x_n^{(k+1)}}, with x_i^{(k+1)} ∈ R^{f_{k+1}}, through:

x_i^{(k+1)} = \xi\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} W x_j^{(k)} \right)    (3.32)

where ξ is an appropriate non-linear function. Alternatively, a multi-head formulation can be employed, following:

x_i^{(k+1)} = \Big\Vert_{p=1}^{m} \, \xi\!\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{p} W^{p} x_j^{(k)} \right)    (3.33)

where m represents the number of attention heads and x_i^{(k+1)} ∈ R^{m f_{k+1}}.
The model described in (3.33) helps to produce more stable results, besides being parallelizable and computationally efficient. Moreover, differently from many of the layers described in 3.1.1, it does not depend on knowing the graph structure upfront, while at the same time taking into consideration all the neighbours of a certain node when computing its embedding.
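A sketch of a single-head GAT layer (eqs. 3.30-3.32), with random untrained weights, a small hand-made adjacency matrix with self-loops used as the attention mask, and the LeakyReLU written out explicitly.

```python
# Sketch of a single-head GAT layer (eqs. 3.30-3.32) with random, untrained weights.
import numpy as np

rng = np.random.default_rng(8)
n, f_in, f_out = 5, 6, 4
X = rng.normal(size=(n, f_in))                       # node features x_i^(k)
A = np.array([[1, 1, 0, 0, 1],                       # adjacency with self-loops,
              [1, 1, 1, 0, 0],                       # used to mask the attention
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1]], dtype=float)
W = rng.normal(size=(f_out, f_in))                   # shared linear transformation
a = rng.normal(size=2 * f_out)                       # attention vector

H = X @ W.T                                          # W x_i for every node
# e_ij = LeakyReLU(a^T [W x_i || W x_j]) for every pair, eq. (3.30)
E = np.array([[a @ np.concatenate([H[i], H[j]]) for j in range(n)] for i in range(n)])
E = np.where(E > 0, E, 0.2 * E)                      # LeakyReLU
E = np.where(A > 0, E, -np.inf)                      # keep only the 1-hop neighbourhood
alpha = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)   # eq. (3.31)

X_next = np.maximum(alpha @ H, 0)                    # eq. (3.32) with xi = ReLU
print(X_next.shape)                                  # (5, 4)
```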

For what concerns instead graph embeddings, the goal is to learn a function φ : G → R^m that maps the whole graph to a low-dimensional vector. In the context of attention mechanisms, this is done by learning an attention function that gives different importance scores to different subregions of the graph. The embedding can then be fed into a standard neural network to perform what is called attention-based graph classification [29].
One of the first works that proposed the use of attention in the context of graph classification is GAM [28], which is based on attention-guided walks over a selected number of informative nodes. In particular, GAM uses an attention mechanism over the neighbourhood of a node in order to decide which step to take at each iteration. This mechanism is trained through a reinforcement learning approach and constitutes a model that is able to learn which are the important parts of the graph. Moreover, in the case of very large graphs, multiple walks can be initialized and performed in parallel.
An attention mechanism has also been introduced in the context of a read-out layer in [33]. In fact, given the node feature vectors after the last convolutional layer, {x_1^{(K)}, ..., x_n^{(K)}}, an embedding of the entire graph G can be computed following the attention model:

h_G = \xi\!\left( \sum_{i \in V} \alpha_i^{(K)} \tanh\!\big(g(x_i^{(K)})\big) \right)    (3.34)
where g is a feed-forward neural network, ξ is a non-linear function and {α_1^{(K)}, ..., α_n^{(K)}} represents the attention scores at layer K. Each score is given by:

\alpha_i^{(K)} = \sigma\big(a(x_i^{(K)})\big)    (3.35)

where σ is the SoftMax function and a is a feed-forward neural network.
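A sketch of the attentive read-out of equations (3.34)-(3.35), in which both feed-forward networks g and a are reduced to single random linear maps for brevity.

```python
# Sketch of the attentive read-out (eqs. 3.34-3.35). The networks g and a are
# reduced to single random linear maps; a trained model would learn them.
import numpy as np

rng = np.random.default_rng(9)
n, f, m = 6, 5, 8                                 # nodes, node features, embedding size
X = rng.normal(size=(n, f))                       # node features after the last conv layer
G_map = rng.normal(size=(f, m))                   # stand-in for the network g
a_map = rng.normal(size=f)                        # stand-in for the network a

scores = X @ a_map                                # a(x_i^(K))
alpha = np.exp(scores) / np.exp(scores).sum()     # sigma = SoftMax, eq. (3.35)
h_G = np.tanh(X @ G_map).T @ alpha                # weighted sum, eq. (3.34), xi = identity
print(h_G.shape)                                  # (8,)
```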

It has to be noticed that some instances of pooling in graphs can be re-defined through an attentive framework [26]. After computing the attention coefficients for the nodes in a layer k, these can be used to select the most important nodes through a thresholding operation of the form:

x_i^{(k+1)} = \begin{cases} \alpha_i^{(k)} x_i^{(k)} & \text{if } \alpha_i^{(k)} > \bar{\alpha} \\ 0 & \text{otherwise} \end{cases}    (3.36)

where α_i^{(k)} is the attention score of node i at layer k.
In this context, [30] proposes to take into account the graph topological structure also when computing the attention scores α ∈ R^n. This can be done by integrating a convolutional structure, such as (3.20), in the formula, leading to:

\alpha^{(k)} = \sigma(\tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2} X^{(k)} p)    (3.37)

where \tilde{D} and \tilde{W} are defined as in (3.20), X^{(k)} is the feature matrix at layer k and p ∈ R^{f_k} is the only vector of learnable parameters in the layer. The attention scores learnt can then be used to modulate the information passed to the next layer, or can be used in (3.36) to create a pooling layer. Alternatively, other convolutional operations can also be used in (3.37).

3.3 Self-Attentive Graph Classifier

The proposed method aims to extend the ideas introduced in [22] to the case of graph inputs. The authors of [22] presented an attentive framework in the context of convolutional neural networks for image classification. The core idea consisted in re-purposing the global representation of the image as a query that could attend over the hidden representations generated at intermediate layers of the network. Since every image classification task can be seen as a specific case of a graph classification problem, the main idea introduced in this work is to extend their concept to the case of graph classification. Therefore, an attentive framework that can be applied to different graph neural network architectures is here proposed.
In the setting of graph classification, the input to the problem is given by a set of labeled graphs S = {(G_i, y_i)}_{i∈I}, where each G_i ∈ G represents a graph and the corresponding y_i ∈ Y is the label representing the class to which the graph belongs. The cardinality of Y represents the number of different classes in the dataset, while G is the set of input graphs. The main objective of the learning process is to find an approximation f̂ of the theoretical function f : G → Y that maps each graph to its corresponding label.
As previously said, a generic input graph G is characterized by its structure (V, E, W) and by a set of node feature vectors, which are summarized in the input feature matrix X^{(0)} ∈ R^{n_0 × f_0}, where n_0 is the number of nodes and f_0 the number of features. The more general setting, which also includes a vector of features for every edge, is here excluded.
The starting feature matrix, together with the graph structure, is fed into the neural network, which processes them through a sequence of convolutional and pooling layers. After each layer k, a new feature matrix X^{(k)} ∈ R^{n_k × f_k} is produced and, while in the case of convolutional layers the graph structure remains the same, after pooling layers the triplet (V^{(k)}, E^{(k)}, W^{(k)}) is also recomputed.
At each layer, the rows of X (k) represent the feature vectors of the nodes
present in that layer: {x1 , . . . , xnk } ∈ Rfk . Before the first layer, these vec-
tors will represent only the information pertaining the related node, but as
the layers go on, the node features will represent the aggregated information
of a portion of neighbouring nodes in the graph.
At the last layer, in order to perform the final graph classification, it is
necessary to produce a global vector g ∈ R^m that aggregates the processed
information of the whole graph. This vector can be obtained through different
kinds of read-out layers. Commonly used strategies include taking the max or
mean values over the feature vectors of all the nodes, or simply summing them
up. This results in a global feature vector whose dimension m equals the
number f_k of node features in the last convolutional layer of the network.
Alternatively, a read-out layer like the one in (3.34) can be exploited.
The proposed method uses the global representation of the input, given by the
vector g, as the query in the attention mechanism. Keys and values are instead
constituted by the hidden representations of the nodes in the intermediate
layers (or by linear transformations of them). The basic idea is to apply this
mechanism at different intermediate layers along the architecture, in order to
gather information from different scales.
Given a chosen layer l, the global vector g is used to attend over the set of
feature vectors {x_1^l, . . . , x_{n_l}^l} ⊂ R^{f_l} learnt at that layer. This
process produces attention scores {α_i^l}_{i=1,...,n_l} through the application
of a general attention function. The scores are computed through the classical
attentive framework:

α_i^l = e_i^l / Σ_{j=1}^{n_l} e_j^l ,      e_i^l = a(g, x_i^l)        (3.38)
where a is a generic function computing the alignment model. This function
returns a value representing how well the feature vector of node i at layer l
relates to the global representation of the graph, thus creating a multi-scale
attention mechanism.
The function a can be a single-layer neural network taking as input the
concatenation of the two vectors:

a(g, x_i^l) = p^T (g ‖ x_i^l)        (3.39)

where p ∈ R^{m+f_l} is the vector of parameters. Alternatively, the additive
mechanism used in [3] can be exploited:

a(g, x_i^l) = p^T (g + W x_i^l)        (3.40)

where, in the case m ≠ f_l, it is first necessary to transform the input
feature vectors through a shared projection into a space of dimension m, with
W ∈ R^{m×f_l}. The same has to be done if a multiplicative attention mechanism
is used, following:

a(g, x_i^l) = g^T (W x_i^l)        (3.41)

in a similarity-based computation that focuses on the alignment between
vectors. All three attention strategies can be used interchangeably inside the
proposed framework.
Once the attention scores are computed, they are employed to build an attentive
representation of the selected layer l, given by:

g_a^l = Σ_{i=1}^{n_l} α_i^l x_i^l        (3.42)

By applying this mechanism at different layers l ∈ L, with L ⊂ {1, ..., K},
we obtain attentive representations {g_a^l}_{l∈L} at different resolutions.
These vectors can be concatenated together, potentially also with the global
representation g, to constitute the final vector which will be fed to a
feed-forward neural network for classification. As an alternative, if all the
attentive vectors have the same size, they can be summed up together.
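As a concrete illustration, the following is a minimal PyTorch sketch of the attentive read-out of equations (3.38), (3.39) and (3.42) for a single graph; module and variable names are illustrative, and batching over several graphs would additionally require a per-graph softmax (for instance through torch_geometric.utils.softmax).

```python
import torch
import torch.nn.functional as F

class LayerAttentionReadout(torch.nn.Module):
    """Sketch of the self-attentive readout: the global vector g attends over
    the node features of one intermediate layer, following (3.38)-(3.39) and
    producing the attentive representation g_a^l of (3.42)."""

    def __init__(self, global_dim, layer_dim):
        super().__init__()
        # the parameter vector p of (3.39), scoring the concatenation (g || x_i)
        self.p = torch.nn.Linear(global_dim + layer_dim, 1, bias=False)

    def forward(self, g, x):
        # g: [m] global graph representation; x: [n_l, f_l] node features at layer l
        g_rep = g.unsqueeze(0).expand(x.size(0), -1)            # repeat g for every node
        e = self.p(torch.cat([g_rep, x], dim=-1)).squeeze(-1)   # alignment scores e_i^l
        alpha = F.softmax(e, dim=0)                             # attention scores (3.38)
        return (alpha.unsqueeze(-1) * x).sum(dim=0)             # g_a^l as in (3.42)

# The representations of the chosen layers can then be combined, e.g.
# g_final = torch.cat([readout_l(g, x_l) for readout_l, x_l in zip(readouts, xs)])
```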
With the described framework, a multi-scale attention mechanism has been
created in the context of graph classification. It provides a way to re-purpose
a representation of the graph itself as an attentional query, and allows for
a cyclic integration between local and global features of the graph structure
during the learning process.
Chapter 4

Model application

4.1 Parkinson’s disease

Parkinson's disease (PD) is one of the most common neurodegenerative disorders
among elderly people. It mostly affects dopamine-producing neurons situated in
a basal ganglia structure of the midbrain called the substantia nigra. The loss
of neurons that produce dopamine causes the degradation of this area and the
dysfunction of the cortico-basal ganglia-thalamocortical circuits, which are
deeply connected to movement [15].
For this reason, the main signs of the disease are movement-related symptoms,
which develop gradually over the years and may include tremors, usually worse
on one side of the body even when both sides are damaged. Other symptoms
include stiffened muscles and slowed movements, along with the loss of
automatic ones, like blinking and smiling, and changes in speech and writing
abilities [27].
The causes of PD are mostly unknown, but it is believed that specific genetic
mutations and environmental factors, such as nutrition or exposure to certain
pesticides, may play an important role. It is also known that the risk is
higher in men than in women and that the disease usually develops in elderly
people [34].
Parkinson's disease cannot be cured, but the symptoms can be controlled and
contained for a fairly long period. Medications that increase the level of
dopamine can be useful for improving movements and diminishing tremors, while
a healthy lifestyle can help to reduce many problems connected to PD, such as
constipation, lack of flexibility and high levels of anxiety [27].

4.1.1 DNA Methylation


Over the past years, many efforts have been made to find genetic signatures
related to Parkinson's disease, and acquired mutations in some specific genes
seem to be involved in the onset of sporadic PD. However, it is believed that
epigenetic factors also play a role in the development of the disease.
Epigenetic mechanisms involve chemical modifications of the structure around
the DNA strands, without actually modifying the genomic sequence [34].
One of the most studied epigenetic modifications is DNA methylation, a process
involved in both gene expression and cell differentiation. The methylation of
DNA is characterized by the transfer of a methyl group to the fifth carbon of
a cytosine residue. It has been observed that this process occurs almost
exclusively on cytosines that are followed by a guanine base in the DNA
sequence, i.e. at the so-called CpG sites [37]. A representation of the process
is summarized in figure 4.1.
Since it has been demonstrated that epigenetic regulation of neural cells is
fundamental for neurogenesis and brain development, it is intuitive that an
alteration in these mechanisms can be involved in the onset of neurodegener-
ative diseases [34]. However, it has been proven that DNA methylation levels
in blood also correlate with the presence of Parkinson's disease [10].

Figure 4.1: A schematic representation of DNA methylation catalyzed by DNA
methyltransferases (Dnmts) [37]


There are different ways through which methylation can affect the behaviour
of a cell, one of which consists in interfering with gene expression. If in fact
the methylated sequence is located in the gene promoter, the methyl group
interferes with the binding of the transcription factor and thus inhibits the
activation of the related gene [34].
Recent studies have analyzed how DNA methylation relates to aging processes
and have led to the definition of an epigenetic clock model. This model allows
the age of different tissues to be estimated just by looking at the methylation
state of a restricted set of CpG sites [21]. These results further support the
idea that there has to be a connection between epigenetic biomarkers and
neurodegenerative diseases.

4.1.2 SPECT Images

Brain imaging is one of the most commonly used techniques for the diagno-
sis of Parkinson’s disease. In particular, dopamine transporter single-photon
emission computed tomography (DaT SPECT) is one of the most effective
methods in detecting deficits inside the nigrostriatal dopamine system [43].
It has in fact been proved that, when the degenerative process is at its most
advanced stage, around 50-60% of dopaminergic neurons in the substantia nigra
are lost and the dopamine content is considerably decreased. This happens in
particular in the striatum, a cluster of neurons in the subcortical basal
ganglia of the forebrain [2].

Figure 4.2: Representative slices from the volumetric SPECT images of a
healthy patient and one affected by PD, where the color scale represents the
amount of dopaminergic activity [42]
Given the underlying biological process, SPECT imaging allows the retrieval of
3D information related to the binding of the dopamine transporters (DaTs) with
¹²³I-Ioflupane, which is administered to the patient through an intravenous
injection. In particular, it provides quantitative information about the
spatial distribution of dopamine transporters, which is very useful for making
a correct diagnosis [42].

4.2 Graph neural networks for Parkinson's disease classification

A wide variety of machine learning techniques have been employed in the
context of Parkinson's disease classification, with a recent focus on
artificial neural networks. These have in fact been employed on different
kinds of data, ranging from genomic data [48] to acoustic measures [5] and
handwritten dynamics [39].
Concerning the application of deep learning to methylation data, few
approaches have been considered until now. An interesting one consists of a
modular framework that uses variational autoencoders to construct embeddings
of the high-dimensional data and then performs downstream classification [31].
Another work proposed instead the use of convolutional neural networks on a
matrix of one-hot encodings of DNA fragments centered at methylation sites
[45]. However, the majority of the studies applying deep learning to
methylation data have been performed in the context of cancer-related data.
Several attempts have been made to apply traditional deep learning techniques
to SPECT images from Parkinson's disease datasets. Convolutional neural
networks can be applied directly to the full 3D image, as explored in [9], or
to specific smaller voxels selected a priori from the whole volumetric image
[24]. Moreover, in [44], a first attempt to integrate methylation and SPECT
data has been proposed in the context of Parkinson's disease.
More recently, the field of deep learning has been extended with the
introduction of neural networks applied to graph-structured data. To the
author's best knowledge, these models have not yet been applied to either DNA
methylation data or SPECT images.

Traditional deep learning frameworks, such as convolutional neural networks,
rely on the properties that nearby pixels in an image are correlated with each
other and that images have an intrinsic hierarchical structure. Thanks to
their architecture, described in section 2.1, these networks are able to take
advantage of the spatial information given by the image structure. When
dealing with biological data, ranging from gene expression to DNA methylation,
the features are instead often treated as unrelated to each other. However, it
is known that, in intra-cellular processes, genes and other molecules are
highly related to each other, deeply influencing the cell's functions.
The process of gene expression constitutes a good example of this
interconnectivity inside the cell. Its core process involves two stages:
initially, messenger RNA molecules are produced through the transcription of
genes; subsequently, the mRNA undergoes a translation process that finally
synthesizes proteins. The starting gene is said to code for the resulting
protein, and it is known that some genes code for multiple proteins. Some of
the resulting proteins play an important role in the transcription of other
genes, thus generating a network of transcriptional dynamics between genes,
usually defined as a gene regulatory network. Moreover, the proteins produced
by the whole gene expression process interact with each other inside the cell
in complex biological pathways, which can be described by a protein-protein
interaction (PPI) network.
For this reason, new works are emerging that aim to integrate these biological
networks into the structure of deep learning models, as done for instance in
[40]. The idea is that the information about gene interactions can be used to
impose a biological bias on the model, in the same way as convolutional
networks impose a spatial bias on image data [12]. Since the process of DNA
methylation appears to be deeply related to gene expression, and thus to
protein production, in this work we construct a graph neural network that
processes methylation data after arranging them on a protein-protein
interaction network. This introduces biological knowledge into the model by
letting it know which features in the input are biologically connected to each
other [4]. The introduction of prior validated knowledge can be particularly
useful in the case of high-dimensional data with a restricted number of
samples, which is the case for most datasets containing DNA methylation data.

When considering 2-dimensional or 3-dimensional images instead, convolutional
neural networks have proved to be the state-of-the-art method, and it is
unlikely that a graph neural network can achieve better results when applied
to the same input. What a standard neural network cannot do, however, is work
directly with higher-order representations of an image, such as the set of
superpixels of which it is composed. The main advantage that can be derived
from working on superpixels is a reduction in the size of the model, which can
be especially important when the input images are large [25].
Moreover, in the context of brain imaging, not only are the volumetric images
large, but it is also common to have a very small number of samples for
training the model.
In this situation, to avoid training a large model, various strategies can be
followed. The extraction of patches that might relate to the problem to be
solved is one of them, but it often requires some prior biological knowledge.
Alternatively, the dimensionality of the input can be reduced by downsampling,
but this strategy often does not give good results in the medical context. By
extracting superpixels instead, we are able to maintain information on the
geometrical structures inside the images [25].

4.3 Dataset overview


The data used in this work were obtained from the Parkinson’s Progression
Markers Initiative (PPMI) database (www.ppmi-info.org/data). The PPMI
database contains information from different data modalities, among which are
clinical data, imaging data and various types of genetic information.
In the next sections, the purpose will be to apply the described models to the
SPECT images and DNA methylation data present in the dataset. There are 450
patients in the PPMI cohort for which both methylation samples and SPECT data
have been registered. Of these patients, 317 have been diagnosed with PD,
while 133 are healthy controls.

The methylation profiling was performed using the Illumina Human Methylation
EPIC array on whole-blood extracted DNA samples. Each sample was analyzed at
single-nucleotide resolution and the analysis returned the methylation
profiles of 864067 CpG sites across the genome. The methylation dataset is
therefore composed of a matrix of dimensions 450 × 864067, where each element
represents the methylation state of a particular CpG site in a specific
sample. In particular, for each site, the reported value is a β-value that
estimates the methylation level as:

β = M / (M + U)

where M is the number of cells in the sample where the analyzed CpG site is
methylated and U is the number of cells for which the DNA molecule was not
methylated in that position. The β-values are therefore intensity estimates,
with β ∈ [0, 1]. As can be noticed from figure 4.3, a large portion of the
probes tends to assume values at the extremes of the interval.

Figure 4.3: Histogram representing frequencies of β-values in a representative
sample

The SPECT dataset consists instead of 3D grayscale images that have been
aligned together and saved in the NIfTI-1 data format. Each image is
represented by a 91 × 109 × 91 tensor, with values normalized to the range [0, 1].

4.4 Data preprocessing

4.4.1 DNA methylation data

As a first explorative step, an unsupervised analysis of the methylation
matrix has been performed through principal component analysis. The
projections of the data on the first two principal components are shown in
figure 4.4.

Figure 4.4: Principal component analysis of methylation data

From these projections, it is possible to notice that the data seem to be
divided into two clusters, related to the biological sex of the patients. PD
patients and healthy controls appear to be equally distributed between the two
clusters. For this reason, all the probes related to DNA positions situated on
sex chromosomes have been excluded from the analysis.
Moreover, [52] identified a number of probes that, because of the specific
region of DNA they refer to, should be excluded during probe selection; these
probes are also pre-filtered from the data matrix.
A further selection step is performed considering the final goal of embedding
the features in a network of genes: only the CpGs located in a DNA position
corresponding to a gene are kept in the analysis.
Subsequently, the 10000 probes most relevant for Parkinson's disease
classification have been selected on the basis of univariate ANOVA statistical
tests.
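The following is a minimal sketch of these selection steps using scikit-learn; the variable names (beta, y, keep_mask) and the way the filtering mask is built are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# beta: (n_samples, n_probes) matrix of beta-values; y: binary PD / control labels.
# keep_mask: boolean mask dropping sex-chromosome probes, the blacklisted probes
# of [52] and probes not mapped to a gene (built from the array annotation).

# explorative PCA of the methylation matrix (figure 4.4)
principal_components = PCA(n_components=2).fit_transform(beta)

# probe filtering followed by univariate ANOVA (F-test) selection
beta_filtered = beta[:, keep_mask]
selector = SelectKBest(score_func=f_classif, k=10000)
beta_selected = selector.fit_transform(beta_filtered, y)
selected_probe_indices = selector.get_support(indices=True)
```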
As previously said, the goal is to embed the selected features on a
biologically meaningful graph. There are many different biological graphs that
can be considered, which differ in the number of genes covered and in the
amount and type of relationships between them. For the purpose of this work we
decided to use the protein-protein interaction graph. In particular, we refer
to the interaction data registered in the STRING database [13], which contains
information on gene associations recovered from multiple sources, such as
biological experiments, literature information and computational prediction.
We decided to focus in particular on the associations that have been
biologically proven, in order to avoid adding further uncertainty to the
problem.
The resulting graph dataset is composed of 450 graphs, each with 10000 nodes,
where each node represents a location in the DNA (a CpG site). The edges
between nodes are instead constructed based on the biological interactions
between the proteins encoded by the genes corresponding to the CpG locations.
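A minimal sketch of how the shared edge structure can be built is given below; the column names of the interaction table and the mapping from CpGs to gene symbols are assumptions made for the example, since the actual STRING download uses protein identifiers that first need to be mapped to genes.

```python
import numpy as np
import pandas as pd

# ppi: DataFrame of STRING interactions with illustrative columns
# ("gene_a", "gene_b", "experimental_score"); probe_genes: list mapping each of
# the 10000 selected CpGs to the symbol of the gene it falls in.
def build_cpg_edge_index(ppi: pd.DataFrame, probe_genes, min_score=1):
    gene_to_probes = {}
    for idx, gene in enumerate(probe_genes):
        gene_to_probes.setdefault(gene, []).append(idx)
    # keep only the biologically proven associations
    proven = ppi[ppi["experimental_score"] >= min_score]
    edges = []
    for ga, gb in zip(proven["gene_a"], proven["gene_b"]):
        for i in gene_to_probes.get(ga, []):
            for j in gene_to_probes.get(gb, []):
                edges.append((i, j))
    # the same edge_index is shared by the graphs of all 450 patients
    return np.array(edges, dtype=np.int64).T
```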

4.4.2 SPECT imaging data


The volumetric SPECT data are first pre-processed using a specific algorithm
for superpixel generation. Different algorithms have been proposed for this
purpose, but one of the fastest and most effective is the simple linear
iterative clustering (SLIC) algorithm. SLIC generates superpixels by
clustering pixels based on their color similarity and proximity in the image
[1]. An example of the results given by SLIC superpixels, applied to a slice
of a SPECT image, is shown in figure 4.5. The SLIC algorithm is performed
using the skimage.segmentation.slic function on the 3D images, attempting to
create around 1000 superpixels for each image.
We then want to construct a graph based on this division into superpixels:

• each superpixel will be associated with a node in the graph;

• the feature vector of each node will be constituted by four features: the
  first one is the mean of the intensity values of the elements in the
  superpixel, while the other three are the coordinates of the center of mass
  z ∈ R³ of the superpixel itself;

• edges between nodes are formed based on the euclidean distance between
  centers of mass, according to the following formula [11]:

  W_ij = exp( −‖z_i − z_j‖² / σ² )

  with σ = 0.1π. Only for the k = 20 nearest nodes is the edge actually
  maintained in the final graph.

Figure 4.5: Slice of a SPECT image on the left; resulting boundaries given by
the SLIC algorithm on the right

Starting from an initial input size of 91 × 109 × 91 = 902629 voxels, the
dimensionality of the input is thus reduced to a set of around 4 × 1000 node
features, while the average number of edges in the newly created graphs is
around 20000.
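A minimal sketch of this construction is given below; the function name and default arguments are illustrative, and older scikit-image versions use multichannel=False instead of channel_axis=None for grayscale volumes.

```python
import numpy as np
from skimage.segmentation import slic
from scipy.spatial import cKDTree

def spect_to_graph(volume, n_segments=1000, k=20, sigma=0.1 * np.pi):
    """Turn a (91, 109, 91) SPECT volume with values in [0, 1] into a
    superpixel graph: node features are mean intensity + center of mass,
    edges connect each node to its k nearest neighbours with Gaussian weights."""
    labels = slic(volume, n_segments=n_segments, channel_axis=None)
    feats, centers = [], []
    for s in np.unique(labels):
        mask = labels == s
        z = np.argwhere(mask).mean(axis=0)          # center of mass of the superpixel
        feats.append(np.concatenate(([volume[mask].mean()], z)))
        centers.append(z)
    feats, centers = np.array(feats), np.array(centers)
    # k-nearest-neighbour edges weighted by the Gaussian kernel of the distances
    dist, nbrs = cKDTree(centers).query(centers, k=k + 1)  # first neighbour is the node itself
    rows = np.repeat(np.arange(len(centers)), k)
    cols = nbrs[:, 1:].reshape(-1)
    weights = np.exp(-dist[:, 1:].reshape(-1) ** 2 / sigma ** 2)
    return feats, np.stack([rows, cols]), weights
```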

4.5 Experiments and results

The code has been implemented in Python, using the libraries PyTorch [38] and
PyTorch Geometric [14] for the neural network components.
The model constructed for classifying the methylation graphs consists of 3
convolutional layers followed by 2 fully connected layers. The size of the
hidden node representations has been set to 4 after some experiments. Skip
connections after each convolutional layer are also included to make the model
more flexible. The large size of the underlying graph limits the construction
of a deeper and more complex architecture.
The network has been trained with the RMSProp (Root Mean Square Propagation)
algorithm, using a standard cross-entropy loss function. Training is performed
for 200 epochs, with an early stopping option if the accuracy does not improve
for a certain number of epochs. The hyperparameters used during training are
summarized in table 4.1.

Training hyperparameters (methylation)

    Learning rate              10^-3
    Batch size                 16
    Number of epochs           200
    Early stopping (epochs)    50
    L2 weight decay            10^-1

Table 4.1: Values of hyperparameters used during training with RMSProp
(methylation data classifier)
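A minimal PyTorch Geometric sketch of this architecture is shown below; it uses the GraphSAGE convolution (the best-performing variant in the comparison that follows), and the exact form of the skip connections and read-out are assumptions made for the example.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv, global_mean_pool

class MethylationGNN(torch.nn.Module):
    """Sketch of the methylation-graph classifier: three convolutional layers
    with skip connections, a mean read-out and two fully connected layers.
    Hidden size 4 follows the text; the linear skip projections are an
    assumption about how the skip connections are realized."""

    def __init__(self, in_channels, hidden=4, n_classes=2):
        super().__init__()
        dims = [in_channels, hidden, hidden]
        self.convs = torch.nn.ModuleList([SAGEConv(d, hidden) for d in dims])
        self.skips = torch.nn.ModuleList([torch.nn.Linear(d, hidden) for d in dims])
        self.fc1 = torch.nn.Linear(hidden, hidden)
        self.fc2 = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        for conv, skip in zip(self.convs, self.skips):
            x = F.relu(conv(x, edge_index) + skip(x))  # skip connection after each layer
        g = global_mean_pool(x, batch)                  # graph-level representation
        return self.fc2(F.relu(self.fc1(g)))

# Training follows table 4.1, e.g.
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, weight_decay=1e-1)
# loss = F.cross_entropy(model(x, edge_index, batch), labels)
```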

The experiments have been performed with a stratified 3-fold cross-validation,
across different types of convolutional layers:

• ChebNet refers to the convolutional operation described in formula (3.16),
  with r = 3;

• GCN refers to the convolution described in (3.20);

• GraphSAGE refers to the operation in (3.23);

• GAT is instead the attentional convolution in (3.32), with 3 heads.

The results obtained with the different layer types are reported in figure
4.6, in terms of the Area Under the ROC (Receiver Operating Characteristic)
curve. A ROC curve summarizes the performance of a binary classification model
on the positive class by plotting the True Positive Rate against the False
Positive Rate while varying the threshold parameter of the classifier. The
Area Under the ROC Curve (AUC) is a good metric for evaluating the expressive
power of a classifier and is preferable to accuracy in the case of imbalanced
datasets with binary labels.

Figure 4.6: Average value of AUC obtained using different types of convolu-
tional layers in the model
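As an illustration, the evaluation loop can be sketched as follows; the predict_proba callback stands for whatever routine trains the chosen model on one split and returns scores for the held-out graphs, and the further split of the held-out data into validation and test sets is omitted for brevity.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validated_auc(y, predict_proba, n_splits=3, seed=0):
    """Stratified k-fold evaluation reporting the average ROC-AUC.
    `predict_proba(train_idx, test_idx)` is an illustrative callback that
    trains on the training indices and returns positive-class scores for
    the test indices."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(np.zeros(len(y)), y):
        scores = predict_proba(train_idx, test_idx)
        aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs)), float(np.std(aucs))
```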

We can observe that none of the classifiers provides a high average ROC-AUC
value. However, the one composed of GraphSAGE convolutional layers seems to
perform slightly better than the rest, providing an average AUC of 0.74 and an
average classification accuracy of 75.8%.
The attentive model proposed in section 3.3 has also been tested on the
methylation graphs; however, it does not seem able to capture the information
contained in the methylation signal.
There are several possible reasons for the low performance observed in figure
4.6. First of all, the limited amount of data available, together with the
high dimensionality of the dataset and the amount of noise it contains, makes
it difficult to train any model in a 3-fold cross-validation fashion. Only
66.6% of the data is in fact used during training, while the remaining samples
are split equally between validation and test sets.
Secondly, we made a strong assumption by choosing a specific biological
structure for the underlying graph. Different experiments should be performed
to analyse whether the results might improve with different biological graphs.
Moreover, the large size of the graphs made the training of a deeper
convolutional network difficult, while at the same time the feature selection
step was also strongly affected by the few samples available for such noisy
data.

Differently from the classifier used for the methylation data, the models
constructed for classifying the SPECT graphs exploit the framework introduced
in section 3.3. They consist of 6 convolutional layers and, also in this case,
three different types of convolution (ChebConv, GraphSAGE and GAT) are
analyzed. The latent feature vectors are this time composed of 10 values,
which is also the dimension of the graph representation g produced by the
fully connected layers attached at the end of the convolutional block. The
attentive mechanism described in equation (3.38) is applied every two layers
and the attention function used is (3.39).
This time the model does not include a skip-connection mechanism, as the
attentive connections already allow the information contained in the first
layers to be considered directly when building the final vector of features.
The optimization algorithm used during training is Adam, an adaptation of the
classical stochastic gradient descent procedure, and the hyperparameters used
are reported in table 4.2.

Training hyperparameters (SPECT)

    Learning rate              5 · 10^-3
    Batch size                 16
    Number of epochs           100
    Early stopping (epochs)    20

Table 4.2: Values of hyperparameters used during training with Adam
(SPECT data classifier)

Different experiments have been carried out concerning the final integration
of the attentive vectors, as sketched in the example below:

• case 1: the attentive vectors {g_a^l}_{l∈L} are averaged together and the
  resulting vector is used for the final classification;

• case 2: the vectors {g_a^l}_{l∈L} are concatenated together before
  performing the final classification;

• case 3: the global vector g and the vectors {g_a^l}_{l∈L} are averaged
  together and then used for classification.
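The three strategies can be summarized in a few lines of PyTorch; the function and argument names are illustrative.

```python
import torch

def combine_attentive_vectors(g, attentive_vectors, mode="case1"):
    """Combine the global vector g with the attentive representations
    {g_a^l} according to the three strategies described above."""
    if mode == "case1":    # average of the attentive vectors
        return torch.stack(attentive_vectors).mean(dim=0)
    if mode == "case2":    # concatenation of the attentive vectors
        return torch.cat(attentive_vectors, dim=-1)
    if mode == "case3":    # average including the global vector g
        return torch.stack(attentive_vectors + [g]).mean(dim=0)
    raise ValueError(f"unknown mode: {mode}")
```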

Figure 4.7: Average value of AUC obtained using different kind of convolu-
tional layers in the model, with different integration of attentive vectors

The results of the experiments, analyzed with the different convolutional
layers, are shown in figure 4.7. They suggest that considering the global
vector g together with {g_a^l}_{l∈L} probably brings more noise into the
model, resulting in the worst performance consistently across layer types. The
model that instead seems to perform best is the one that averages the features
of the attentive vectors at the different scales.
Even if more experiments should be performed regarding a more systematic
hyperparameter selection, adapted to each model separately, we decided to
focus on the model that performed best in the previous analysis: the 6-layer
model with ChebConv convolutional layers. A comparison is carried out to
verify what happens when the attentive mechanism is applied after each of the
first three layers (early attention) or after each of the last three layers
(late attention).

Attention model     Accuracy (%)     ROC-AUC

Early attention     80.81 ± 2.03     0.887 ± 0.014
Late attention      79.86 ± 1.05     0.844 ± 0.018

Table 4.3: Results from SPECT graph classifier

Interestingly, the results are different from what was expected: the attention
model seems to work slightly better, especially in terms of AUC, when applied
after the first layers of the network. In the case of convolutional networks
applied to Euclidean images, [22] noticed that attention models attached to
deeper layers led to better results. This was easily explained by the fact
that in images from the natural world, such as those analyzed in the paper
(CIFAR-100), the last layers are the ones that convey the most information
about the objects present in the image. However, in our case the setting is
quite different: all the images represent the same object (a patient's brain)
and it is more likely that the details important for classification are found
at smaller scales. Moreover, the division into superpixels already caused a
coarsening of the input data, so that each superpixel and its immediate
neighbours already contain aggregated information ready to be used by the
network.
More experiments concerning the analysis of network depth and variations of
the attention functions should be considered in order to further optimize the
models.
Chapter 5

Conclusions and future works

In this work an extensive introduction to graph neural networks has been


provided, with a description of the main methods developed in this emerging
field. After describing how the convolutional operations, typical of CNNs,
can be extended to graphs, a particular focus has been given to attentive
mechanisms. In the context of graph classification, a new method for com-
puting multi-scale self-attentive features has been introduced. Attention
mechanisms have previously been proposed in various structures inside graph
neural networks; however, no mechanism allowed attention to be integrated in a
multi-scale fashion, taking into account representations at different scales
simultaneously and allowing them to attend over one another.
The presented methods have been applied to graphs constructed from biological
data, and an analysis of how different models perform on different data has
been carried out. The search for new models able to integrate biological
knowledge into standard machine learning pipelines is a field that is just
starting to be explored and may bring many important advantages. On the one
hand, the integration of biological biases may help to increase both the
accuracy and the interpretability of the models. On the other hand, insights
given by interpretable models may serve as input for generating new biological
hypotheses, thus opening new research directions. One area in which graph
neural networks may be useful in the future is the search for important
biological pathways correlated with specific diseases and, in this context,
attentive mechanisms on graphs could play an important role.
However, when adding biological networks to the analysis, it is important to
verify the intrinsic value of the underlying graphs selected to be used as a
bias. An idea for future work lies in the analysis of how different prior
graphs affect the final results and which ones bring the most valuable
knowledge for the task at hand.
Concerning DNA methylation data, it could be interesting to analyze whether
different gene interaction graphs, such as gene regulatory networks or
co-expression ones, could add more value to the analysis. Moreover, it could
be interesting to approach the problem with a multiplex framework, where nodes
representing genes are connected by multiple edge types in a unified
multi-layer structure.
In the context of volumetric images, it is important to notice that analyses
at different scales may bring different results. Low levels of abstraction
tend to maintain information about the details of the images, while with
bigger superpixels only the global information is preserved. For this reason
it could be interesting to integrate different levels of abstraction in the
same graph neural network, as proposed by [25], and analyze how this feature
would improve the results in the context of biomedical imaging.
It is also important to notice that, in the context of Parkinson's disease,
the number of available multimodal datasets is slowly increasing. Among these
data, functional magnetic resonance imaging (fMRI) data are naturally
predisposed to be represented as graphs. After the necessary preprocessing
steps, specific regions of interest in the brain can be encoded as nodes and
then, by computing the correlation of the signals between these regions, a
correlation matrix can be constructed. The final edges are derived from the
correlation matrix by applying a task-specific thresholding operation. The
resulting connectivity graphs, describing functional relationships between
regions of the brain, represent the perfect input for graph neural networks
[32]. In this setting, an attentive model could potentially bring many
insights into the parts of the brain that relate to specific functional
problems, like the ones connected to Parkinson's disease.
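A minimal sketch of this construction, under the assumption of region-averaged time series and a fixed correlation threshold, could look as follows.

```python
import numpy as np

def fmri_connectivity_graph(timeseries, threshold=0.5):
    """Build a functional connectivity graph from fMRI data.
    `timeseries`: (n_regions, n_timepoints) array of region-averaged signals;
    `threshold`: illustrative task-specific value on the absolute correlation."""
    corr = np.corrcoef(timeseries)                   # region-by-region correlation matrix
    adj = (np.abs(corr) > threshold).astype(float)   # keep only strong functional links
    np.fill_diagonal(adj, 0)                         # remove self-loops
    edge_index = np.array(np.nonzero(adj))           # edge list usable by a graph neural network
    return corr, edge_index
```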
In the context of data integration in the biomedical field, at least at the
current stage, one of the main limitations is that not all data types are
available for every patient, so classical data-integration techniques may fail
because of the small amount of data in the intersection of the datasets. It
could therefore be interesting to explore how data encoders, trained
separately on different data modalities, would work in a downstream
integration task when presented with new data that have not been seen before.
Finally, integration with other datasets and the exploration of transfer
learning techniques might be among the most promising directions to explore,
with the goal of expanding the available knowledge in the context of
Parkinson's disease.
Bibliography

[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pas-
cal Fua, and Sabine Susstrunk. Slic superpixels, June 2010.

[2] Moran Artzi, Einat Even-Sapir, Hedva Shacham, Avner Thaler, Avi
Urterger, Susan Bressman, Karen Marder, Talma Hendler, Nir Giladi,
Dafna Ben Bashat, and Anat Mirelman. Dat-spect assessment depicts
dopamine depletion among asymptomatic g2019s lrrk2 mutation carri-
ers. PLOS ONE, 12, 04 2017.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural ma-
chine translation by jointly learning to align and translate. CoRR,
abs/1409.0473, 2014.

[4] Paul Bertin, Mohammad Hashir, Martin Weiß, Geneviève Boucher, Vin-
cent Frappier, and Joseph Paul Cohen. Analysis of gene interaction
graphs for biasing machine learning models. ArXiv, abs/1905.02295,
2019.

[5] Lucijano Berus, Simon Klancnik, Miran Brezocnik, and Mirko Ficko.
Classifying parkinson’s disease based on acoustic measures using artifi-
cial neural networks. Sensors, 19:16, 12 2018.


[6] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and
Pierre Vandergheynst. Geometric deep learning: Going beyond eu-
clidean data. IEEE Signal Processing Magazine, 34:18–42, 2017.

[7] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun.
Spectral networks and locally connected networks on graphs. CoRR,
abs/1312.6203, 2013.

[8] Catalina Cangea, Petar Velickovic, Nikola Jovanovic, Thomas Kipf,


and Pietro Liò. Towards sparse hierarchical graph classifiers. ArXiv,
abs/1811.01287, 2018.

[9] Hongyoon Choi, Seunggyun Ha, Hyung-Jun Im, Sun Paek, and Dong
Lee. Refining diagnosis of parkinson’s disease with deep learning-based
interpretation of dopamine transporter imaging. NeuroImage: Clinical,
16, 09 2017.

[10] Yu-Hsuan Chuang, Kimberly Paul, Jeff Bronstein, Yvette Bordelon,


Steve Horvath, and Beate Ritz. Parkinson’s disease is associated with
dna methylation levels in human blood and saliva. Genome Medicine,
9, 12 2017.

[11] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convo-


lutional neural networks on graphs with fast localized spectral filtering.
In NIPS, 2016.

[12] Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko,
and Yoshua Bengio. Towards gene expression convolutions using gene
interaction graphs. ArXiv, abs/1806.06975, 2018.

[13] Damian Szklarczyk et al. String v11: protein-protein associa-


tion networks with increased coverage, supporting functional dis-
covery in genome-wide experimental datasets. Nucleic Acids Res.,
47(D1):D607–D613, 2019.

[14] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning
with pytorch geometric. ArXiv, abs/1903.02428, 2019.

[15] Adriana Galvan, Annaelle Devergnas, and Thomas Wichmann. Alter-


ations in neuronal activity in basal ganglia-thalamocortical circuits in
the parkinsonian state. Frontiers in Neuroanatomy, 9:5, 2015.

[16] Hongyang Gao and Shuiwang Ji. Graph u-nets. In ICML, 2019.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016. http://www.deeplearningbook.org.

[18] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in


graph domains. In Proceedings. 2005 IEEE International Joint Confer-
ence on Neural Networks, 2005., volume 2, pages 729–734 vol. 2, July
2005.

[19] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive repre-
sentation learning on large graphs. In NIPS, 2017.

[20] David K. Hammond, Pierre Vandergheynst, and Rémi Gribonval.


Wavelets on graphs via spectral graph theory. ArXiv, abs/0912.3848,
2008.

[21] Steve Horvath and Ken Raj. Dna methylation-based biomarkers and
the epigenetic clock theory of ageing. Nature Reviews Genetics, 19, 04
2018.

[22] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr.
Learn to pay attention. ArXiv, abs/1804.02391, 2018.

[23] Thomas Kipf and Max Welling. Semi-supervised classification with


graph convolutional networks. ArXiv, abs/1609.02907, 2016.

[24] Ivan S. Klyuzhin, Nikolay Shenkov, Arman Rahmim, and Vesna Sossi.
Use of deep convolutional neural networks to predict parkinson’s disease
progression from datscan spect images. 2018.

[25] Boris Knyazev, Xiao Lin, Mohammed Abdel Rahman Amer, and Gra-
ham W. Taylor. Image classification with hierarchical multigraph net-
works. ArXiv, abs/1907.09000, 2019.

[26] Boris Knyazev, Graham W. Taylor, and Mohammed Abdel Rahman


Amer. Understanding attention and generalization in graph neural net-
works. In NeurIPS, 2019.

[27] Antonina Kouli, Kelli M. Torsney, and Wei-Li Kuan. Parkinson’s dis-
ease: Etiology, neuropathology, and pathogenesis. In Parkinson’s Dis-
ease: Pathogenesis and Clinical Aspects. 2018.

[28] John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification
using structural attention. In Proceedings of the 24th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining, KDD
’18, page 1666–1674, New York, NY, USA, 2018. Association for Com-
puting Machinery.

[29] John Boaz Lee, Ryan A. Rossi, Sungchul Kim, Nesreen Ahmed,
and Eunyee Koh. Attention models in graphs: A survey. ArXiv,
abs/1807.07984, 2018.

[30] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pool-
ing. In ICML, 2019.

[31] Joshua J. Levy, Alexander J. Titus, Curtis L. Petersen, Youdinghuan


Chen, Lucas A. Salas, and Brock C. Christensen. Methylnet: An auto-
mated and modular deep learning approach for dna methylation analy-
sis. bioRxiv, 2019.

[32] Xiaoxiao Li, Nicha C. Dvornek, Yuan Zhou, Juntang Zhuang, Pamela
Ventola, and James S. Duncan. Graph neural network for interpreting
task-fmri biomarkers. In MICCAI, 2019.

[33] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel.
Gated graph sequence neural networks. CoRR, abs/1511.05493, 2016.

[34] Ernesto Miranda-Morales, Karin Meier, Ada Sandoval-Carrillo, José


Salas-Pacheco, Paola Vázquez-Cárdenas, and Oscar Arias-Carrión. Im-
plications of dna methylation in parkinson’s disease. Frontiers in Molec-
ular Neuroscience, 10:225, 2017.

[35] Volodymyr Mnih, Nicolas Manfred Otto Heess, Alex Graves, and Koray
Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

[36] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà,


Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on
graphs and manifolds using mixture model cnns. 2017 IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), pages 5425–
5434, 2016.

[37] Lisa Moore, Thuc Le, and Guoping Fan. Dna methylation and its basic
function. Neuropsychopharmacology : official publication of the Ameri-


can College of Neuropsychopharmacology, 38, 07 2012.

[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward
Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga,
and Adam Lerer. Automatic differentiation in pytorch. 2017.

[39] R. Pereira, Silke Weber, Christian Hook, Gustavo de Rosa, and João
Papa. Deep learning-aided parkinson’s disease diagnosis from handwrit-
ten dynamics. 06 2016.

[40] SungMin Rhee, Seokjun Seo, and Sun Kim. Hybrid approach of relation
network and localized graph convolutional filtering for breast cancer
subtype classification. In IJCAI, 2017.

[41] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfar-


dini. The graph neural network model. IEEE Transactions on Neural
Networks, 20(1):61–80, Jan 2009.

[42] Seong-Jin Son, Mansu Kim, and Hyunjin Park. Imaging analysis of
parkinson’s disease patients using spect and tractography. Scientific
Reports, 6(1):38070, 2016.

[43] Sven Suwijn, Caroline Boheemen, Rob de Haan, Gerrit Tissingh, Jan
Booij, and Rob Bie. The diagnostic accuracy of dopamine transporter
spect imaging to detect nigrostriatal cell loss in patients with parkin-
son’s disease or clinically uncertain parkinsonism: A systematic review.
EJNMMI research, 5:12, 12 2015.

[44] Devin Taylor, Simeon E. Spasov, and Pietro Lió. Co-attentive cross-
modal deep learning for medical evidence synthesis and decision making.
ArXiv, abs/1909.06442, 2019.

[45] Qi Tian, Jianxiao Zou, Jianxiong Tang, Yuan Fang, Zhongli Yu, and
Shicai Fan. Mrcnn: a deep learning model for regression of genome-
wide dna methylation. BMC Genomics, 20(2):192, 2019.

[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In NIPS, 2017.

[47] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana


Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks.
ArXiv, abs/1710.10903, 2017.

[48] Qianfan Wu, Adel Boueiz, Alican Bozkurt, Arya Masoomi, Allan Wang,
Dawn DeMeo, Scott Weiss, and Weiliang Qiu. Deep learning methods
for predicting disease status using genomic data. Journal of biometrics
and biostatistics, 9, 01 2018.

[49] Rex Ying, Jiaxuan You, Cynthia. Morris, Xiang Ren, William L. Hamil-
ton, and Jure Leskovec. Hierarchical graph representation learning with
differentiable pooling. In NeurIPS, 2018.

[50] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A
survey. ArXiv, abs/1812.04202, 2018.

[51] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and
Maosong Sun. Graph neural networks: A review of methods and appli-
cations. ArXiv, abs/1812.08434, 2018.

[52] Wanding Zhou, Peter Laird, and Hui Shen. Comprehensive charac-
terization, annotation and innovative use of infinium dna methylation
beadchip probes. Nucleic acids research, 45, 10 2016.
