Tesis 7
Master Thesis
Supervisors:
Prof. Pietro Liò
Prof. Elisa Ficarra
Candidate:
Chiara Sopegno
March 2020
To those who have supported me throughout these years
Abstract
Graph neural networks have emerged in recent years as very promising methods for the analysis of graph-structured data. Useful insights can in fact be extracted by allowing learning models to take into account relationships between entities in a graph. The main methods used in the context of graph neural networks are described and compared here, with a focus on the extension of convolutional layers to graph structures. Afterwards, an analysis of how attention mechanisms integrate with graph neural networks is introduced. In this context a new architecture is proposed, with a self-attentive mechanism that allows a graph neural network to attend over its own input in the context of graph classification. Finally, an application of these methods to biomedical data is presented, with an example in the field of Parkinson’s disease classification.
Contents
1 Introduction 1
3 Methods 15
3.1 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Graph Convolutional Layers . . . . . . . . . . . . . . . 18
3.1.2 Graph Pooling Layers . . . . . . . . . . . . . . . . . . 28
3.2 Attention Mechanism in Graph Neural Networks . . . . . . . . 31
3.3 Self-Attentive Graph Classifier . . . . . . . . . . . . . . . . . . 35
4 Model application 39
4.1 Parkinson’s disease . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.1 DNA Methylation . . . . . . . . . . . . . . . . . . . . . 40
4.1.2 SPECT Images . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Graph neural networks for Parkinson’s disease classification . . 43
4.3 Dataset overview . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4.1 DNA methylation data . . . . . . . . . . . . . . . . . . 47
4.4.2 SPECT imaging data . . . . . . . . . . . . . . . . . . . 49
4.5 Experiments and results . . . . . . . . . . . . . . . . . . . . . 51
Bibliography 61
Chapter 1
Introduction
The field of deep learning has become very popular in recent years, leading to the creation of neural network models capable of reaching state-of-the-art performance in many machine learning tasks. The majority of deep learning methods analyse data that naturally lie in Euclidean spaces. However, in many learning tasks the input data inherently contain relationships between their entities. This is for example the case of users in social networks, neurons in brain connectomes or interacting particles in physical systems, all of which constitute systems of naturally interconnected items. By disregarding the connectivity between entities in such networks, relevant information can be lost during the automated learning process. For this reason, new models have been introduced to take into account the underlying structure of this kind of data. The architectures of these models, called graph neural networks, are built directly on graph structures, where nodes represent the analyzed items and edges encode information about the relationships between them.
Graph neural networks were first introduced in [41] as an extension of recursive neural networks, but recently the field has expanded with several models extending the concepts of convolutional neural networks to graph-structured data. Useful insights can in fact be extracted by allowing learning models to take into account relationships between the items in input and to capture this information through a message-passing framework between the nodes of the graphs.
At the same time, the mechanism of attention in deep learning models has recently started to gain increasing popularity. Initially introduced in the context of neural machine translation, it is starting to be integrated in other fields such as computer vision. The basic idea behind these mechanisms is to allow the model to focus on the most important parts of the input during training, while disregarding underlying noise in the signal. An important advantage of such mechanisms lies in the interpretability of the resulting models. It would therefore be desirable to introduce the characteristics of attentive mechanisms also in the context of signals defined on non-Euclidean domains.
• Activation layer: the layer that introduces non-linearity inside the neural network model, applying a non-linear function to the feature map. The most commonly used are the ReLU, hyperbolic tangent and sigmoid functions.
• Pooling layer: the layer that reduces the size of the feature map by retaining only the most important information in a spatial neighbourhood of its input. The most common pooling strategies apply simple mathematical functions, such as summing the elements in the selected patch or taking their average.
where $f(x) = (f_1(x), \dots, f_p(x))$ is the $p$-dimensional input to the network, with $f \in L^2(\Omega)$, $C$ and $P$ denote convolutional and pooling layers respectively, and $\Theta = (\Gamma^{(1)}, \dots, \Gamma^{(K)})$ is the hyper-vector of network parameters.
The generic convolutional layer $g = C_\Gamma(f)$, with compactly supported filters $\Gamma = (\gamma_{l,l'})$, can be described by the following equation:
$$g_l(x) = \xi\left(\sum_{l'=1}^{p} (f_{l'} \star \gamma_{l,l'})(x)\right), \qquad l = 1, \dots, q \tag{2.2}$$
• Loss evaluation: given the $i$-th sample, the probabilities of class membership $\hat{y}^{(i)}$ are compared with the one-hot encoding $y^{(i)}$ of the true label, in order to compute the loss function. A common example is the cross-entropy loss function:
$$L(\hat{y}^{(i)}, y^{(i)}) = -\left(y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})\right) \tag{2.5}$$
The training process usually involves many iterations, with the goal of finding a configuration of the parameter space that minimizes the loss function. Each complete presentation of the dataset to the network is called an epoch. In most cases it has been observed that presenting the input samples to the network in batches helps to stabilize the training procedure and improves the overall results.
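As an illustration of the training procedure just described, the following is a minimal PyTorch sketch of mini-batch training with a cross-entropy loss; the model, the synthetic data and the hyperparameter values are placeholders used only for the example and do not correspond to any model used later in this work.

```python
# Minimal sketch (not the thesis code) of mini-batch training with a
# cross-entropy loss in PyTorch; model, data and hyperparameters are
# placeholders chosen only for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

x = torch.randn(128, 16)                  # dummy samples with 16 features
y = torch.randint(0, 2, (128,))           # dummy binary labels
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

for epoch in range(5):                    # each full pass over the data is one epoch
    for xb, yb in loader:                 # samples are presented in batches
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)   # compare predictions with the true labels
        loss.backward()                   # back-propagate the gradients
        optimizer.step()                  # update the parameters
```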
The success of CNNs is primarily due to their ability to exploit the stationarity, locality and intrinsic hierarchical structure of the data [6].
In the visual domain, stationarity refers to the fact that many of the properties of an image are shift-invariant. Therefore, the use of filters that are convolved across the whole input makes it possible to recognize identical features independently of their spatial locations. This also allows weights to be shared across neurons, thus diminishing the number of parameters and reducing the risk of overfitting.
Secondly, in images small sets of adjacent pixels are able on their own to encode significant information for the downstream task. This allows the use of filters whose support is much smaller than the size of the input, further diminishing the number of required parameters.
Moreover, by using a multi-layer structure that alternates pooling and convolutional operators, CNNs are able to learn multi-scale latent features in a hierarchical way. This last aspect works particularly well in capturing the intrinsic hierarchical structure of data like images, allowing the detection of different aspects of the image at different layers of the CNN: the first layers detect edges and simpler shapes, while the last ones are able to detect entire objects and perform classification.
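The following sketch shows this alternating convolution/pooling structure in PyTorch; the layer sizes and input resolution are arbitrary and only serve to illustrate how the spatial resolution shrinks while the number of channels grows.

```python
# Minimal sketch of the alternating convolution/pooling structure described
# above; all sizes are illustrative and do not correspond to a specific model.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # small filters shared across positions
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling halves the spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper filters see a larger context
    nn.ReLU(),
    nn.MaxPool2d(2),
)

out = cnn(torch.randn(1, 3, 32, 32))
print(out.shape)   # torch.Size([1, 32, 8, 8])
```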
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})} \tag{2.9}$$
where the inputs $e_{ij}$ to the softmax function are the alignment coefficients [3]:
$$e_{ij} = a(s_{i-1}, h_j) \tag{2.10}$$
with $s_{i-1}$ the hidden state of the decoder and $h_j$ the annotation of the $j$-th input position.
Equation (2.10) applies the alignment model to vectors that are computed from the input and output sentences by using recurrent neural networks. However, it is also possible to apply a self-attention mechanism by computing the alignment model directly between different positions of the input and output sequences, as proposed in the Transformer model [46]. The authors propose in fact to separate the output sequence into a set of queries, represented by a matrix $Q \in \mathbb{R}^{q \times d}$, and the input one into a set of keys, encoded in $K \in \mathbb{R}^{k \times d}$, where $q$ is the number of queries, $k$ the number of keys and $d$ is the shared dimension of queries and keys. The attention scores are computed through the alignment model:
$$A = \mathrm{SoftMax}\!\left(\frac{QK^T}{\sqrt{d}}\right) \tag{2.11}$$
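A minimal implementation of this scaled dot-product attention could look as follows; the tensor sizes are illustrative and the function name is introduced here only for the example.

```python
# Minimal sketch of the scaled dot-product attention of equation (2.11); the
# softmax is taken row-wise over the keys, and all sizes are illustrative.
import torch

def scaled_dot_product_attention(Q, K, V):
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # alignment model Q K^T / sqrt(d)
    A = torch.softmax(scores, dim=-1)             # attention scores, one row per query
    return A @ V, A                               # attention-weighted sum of the values

Q = torch.randn(4, 8)     # q = 4 queries of shared dimension d = 8
K = torch.randn(6, 8)     # k = 6 keys
V = torch.randn(6, 16)    # one value per key
out, A = scaled_dot_product_attention(Q, K, V)    # out: (4, 16), A: (4, 6)
```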
CNNs after these were fully trained. The objective was to understand the relative importance that features in a layer had in the creation of the following feature maps. However, using attention in a post-hoc formulation prevents it from participating in, and improving, the learning process.
and the concatenation of the attentive vectors $\{g_a^s\}_{s \in S}$ substitutes the vector $g$ as the global representation of the image.
This method allows for an integration of global and local features, with the global representation attending over the local ones, thus creating a multi-scale attention framework. The attention scores represent how well a region of the intermediate feature map relates to the global representation of the image itself, and they enable the learning process to amplify the influence of the most significant regions while discarding the information coming from noisy regions. Moreover, the use of the attention mechanism at different layers makes it possible to focus on different spatial resolutions.
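The following sketch illustrates this mechanism under simplifying assumptions: a single-layer additive alignment model, a flattened intermediate feature map and matching global and local dimensions; none of these choices are prescribed by [22], they only make the idea concrete.

```python
# Minimal sketch: a global descriptor g attends over the spatial positions of
# an intermediate feature map, and the attention-weighted sum g_a replaces g.
# All dimensions and the alignment model are illustrative assumptions.
import torch
import torch.nn as nn

n_regions, f_local, f_global = 49, 64, 64
local_feats = torch.randn(n_regions, f_local)    # intermediate feature map, one row per region
g = torch.randn(f_global)                        # global representation of the image

align = nn.Linear(f_local + f_global, 1)         # single-layer alignment model
e = align(torch.cat([local_feats, g.expand(n_regions, -1)], dim=1)).squeeze(-1)
alpha = torch.softmax(e, dim=0)                  # one score per spatial region
g_a = (alpha.unsqueeze(-1) * local_feats).sum(dim=0)   # attentive descriptor g_a
```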
Chapter 3
Methods
$$W_{ij} = 0 \ \ \text{if } (i,j) \notin E, \qquad W_{ij} > 0 \ \ \text{if } (i,j) \in E \tag{3.1}$$
The set of nodes represents the units involved in the graph, for example people in social networks or interacting particles in a physical system. Edges represent instead specific relationships between the defined entities, though they may have different interpretations depending on the context. The optional weight $W_{ij}$ associated with an edge $(i,j)$ usually encodes the strength of that specific relationship.
The works that have been proposed in the field of deep learning applied to graphs can generally be divided into two categories [41]:
In [7], graph neural networks were introduced in a CNN-like structure for the first time. First of all, this involves giving a definition of what it means to apply convolutions to graph-structured data. The propagation of information across elements is in fact at the basis of any convolutional neural network. Then, even if not strictly necessary, an extension of the idea of a pooling layer should also be given, in order to allow the network to exploit a possible hierarchical structure of the data. Batch normalization and activation layers can instead be trivially extended from the ones used in standard convolutional neural networks. Moreover, the training can be performed using the back-propagation algorithm with chosen task-specific losses.
Given a signal $x \in \mathbb{R}^n$ over the nodes of the graph, we can define the smoothness functional $\|\nabla x\|_W$ as:
$$\|\nabla x\|_W = \sum_i \|\nabla x\|_W(i) = \sum_i \sum_j w_{ij}\,[x(i) - x(j)]^2 \tag{3.8}$$
where the notation $\|\nabla x\|_W(i)$ denotes the smoothness functional at node $i$.
Given this definition, we have that:
$$u_0 = \underset{x \in \mathbb{R}^n,\ \|x\|=1}{\arg\min}\ \|\nabla x\|_W^2 = \frac{1}{\sqrt{n}}\,\mathbf{1}_n$$
is the smoothest signal on the graph structure (it is in fact constant over the nodes), while the remaining eigenvectors of $L$ are in general given by:
$$u_i = \underset{x \in \mathbb{R}^n,\ \|x\|=1,\ x \perp \{u_0, \dots, u_{i-1}\}}{\arg\min}\ \|\nabla x\|_W^2$$
which together form a set of orthonormal eigenvectors. These vectors are also called the Fourier modes of the graph, and their associated non-negative eigenvalues $\{\lambda_i\}_{i=0}^{n-1}$ identify the frequencies of the graph.
The graph Fourier transform is defined as the projection onto the eigenbasis
of the graph Laplacian operator:
$$\hat{x} = U^T x$$
$$x \star y = U\left((U^T x) \odot (U^T y)\right) = U \operatorname{diag}(\hat{y}_1, \dots, \hat{y}_n)\,\hat{x} \tag{3.9}$$
$$x' = g_\theta \star x = U\,G(\Lambda)\,U^T x \tag{3.10}$$
that the resulting feature matrix $X^{(k+1)}$ will have dimensions $n \times f_{k+1}$. The idea is similar to classical CNNs: the input signals are passed through these learnable filters, which aggregate the information, and subsequently the non-linearity $\xi$ is applied to the output.
However, often only the smoothest part of the signal is useful in the analysis, since it is the one that is less likely to carry signal noise. It can therefore be useful to consider only the first $d$ eigenvectors of the Laplacian and substitute $U$ with $U_d = [u_0, \dots, u_{d-1}] \in \mathbb{R}^{n \times d}$. The cutoff value $d$ depends on the specific application, on the regularity of the graph and on the sample size. It is important to notice that the individual high-frequency eigenvectors often do not bring much information on their own, but much more information can be obtained by allowing the network to combine them.
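The following NumPy sketch makes this spectral pipeline concrete: it builds the Laplacian of a small random graph, keeps only the first $d$ Fourier modes and applies a set of spectral multipliers; in an actual network the multipliers would be learnable parameters rather than random numbers.

```python
# Minimal sketch of spectral filtering with a truncated eigenbasis; all
# numbers are illustrative, and theta stands in for learnable multipliers.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((8, 8)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)   # weighted adjacency
L = np.diag(W.sum(axis=1)) - W                                      # graph Laplacian

eigvals, U = np.linalg.eigh(L)          # Fourier modes, eigenvalues in ascending order
d = 4
Ud = U[:, :d]                           # keep only the d smoothest modes

x = rng.random(8)                       # signal over the nodes
theta = rng.random(d)                   # spectral multipliers
x_filtered = Ud @ (theta * (Ud.T @ x))  # U_d G(Lambda) U_d^T x
```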
We have seen that equation (3.11) allows us to define a translation operation in the spectral domain. By defining the convolution in this domain we are now able to share weights across different locations of the graph; however, the filters used are not spatially localized. It can be observed that, in the Euclidean grid, spatial localization translates into smoothness in the spectral domain. So, in order to learn filters that are actually localized in the original spatial domain, it is necessary to learn spectral multipliers that give a smooth signal in the Fourier domain. This can be done also in graphs, by selecting only a subset of multipliers and using interpolation kernels to obtain the rest, thus allowing localization of the filters in the spatial domain. In particular, the smooth filters can be described by:
$$\operatorname{diag}(G_{i,j}^{(k)}) = B\,\alpha_{i,j}^{(k)} \tag{3.12}$$
$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_j(x) = 2x\,T_{j-1}(x) - T_{j-2}(x) \tag{3.14}$$
where the Chebyshev coefficients $\theta \in \mathbb{R}^r$ are again learnable and the diagonal eigenvalue matrix is rescaled through $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_n$, so that its elements lie in $[-1, 1]$.
With the filters defined in (3.15), the convolution operation (3.11) on a signal $x \in \mathbb{R}^n$ can be rewritten as follows:
$$x' = g_\theta \star x = U G_\theta(\Lambda) U^T x = \sum_{j=0}^{r-1} \theta_j\, U\, T_j(\tilde{\Lambda})\, U^T x = \sum_{j=0}^{r-1} \theta_j\, T_j(\tilde{L})\, x = \sum_{j=0}^{r-1} \theta_j\, \bar{x}_j \tag{3.16}$$
where $\tilde{L} = 2L/\lambda_{\max} - I_n$ is the rescaled Laplacian and the values $\bar{x}_j = T_j(\tilde{L})x$ can be computed using the recursive relation:
$$\bar{x}_j = 2\tilde{L}\,\bar{x}_{j-1} - \bar{x}_{j-2}$$
with $\bar{x}_0 = x$ and $\bar{x}_1 = \tilde{L}x$. Thanks to the compactly supported filters, we have thus defined filters whose computation requires a complexity of $O(|E|)$, i.e. linear in the number of edges of the graph. This means that for sparse graphs the computation can be performed in linear time with respect to the input size $n$. We can thus see how avoiding an explicit use of the Fourier transform brings a great computational advantage.
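A minimal sketch of this Chebyshev filtering, using only sparse matrix-vector products, is given below; it is written for a single-channel signal and assumes that $\lambda_{\max}$ is provided (the value used in the example is just a placeholder).

```python
# Minimal sketch of Chebyshev filtering: sum_j theta_j T_j(L_tilde) x, built
# with the recurrence T_j(x) = 2 x T_{j-1}(x) - T_{j-2}(x).
import numpy as np
import scipy.sparse as sp

def cheb_filter(W, x, theta, lam_max):
    """W: sparse weighted adjacency, x: node signal, theta: r >= 2 coefficients."""
    n = W.shape[0]
    degrees = np.asarray(W.sum(axis=1)).ravel()
    L = sp.diags(degrees) - W                        # combinatorial Laplacian
    L_tilde = (2.0 / lam_max) * L - sp.identity(n)   # rescaled so eigenvalues lie in [-1, 1]

    x_prev, x_curr = x, L_tilde @ x                  # x_bar_0 and x_bar_1
    out = theta[0] * x_prev + theta[1] * x_curr
    for j in range(2, len(theta)):
        x_next = 2 * (L_tilde @ x_curr) - x_prev     # Chebyshev recurrence
        out = out + theta[j] * x_next
        x_prev, x_curr = x_curr, x_next
    return out

# e.g. a filter of order r = 3 on a small random graph (lam_max is a placeholder):
W = sp.random(20, 20, density=0.2, random_state=0); W = (W + W.T) / 2
x_out = cheb_filter(W, np.random.default_rng(0).random(20), np.array([0.5, 0.3, 0.2]), lam_max=2.0)
```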
with parameters that are shared across the different locations of the graph.
Since we have taken a further approximation of the Chebyshev expansion,
these filters will aggregate the signal only across a 1-hop neighbourhood of
the graph. Nevertheless, the consecutive application of different filters will
allow to aggregate information from a larger neighbourhood.
If we further introduce the constraint $\theta^* = \theta_0 = -\theta_1$, we obtain the following filter:
$$g_\theta \star x = \theta^*\,(I_n + D^{-1/2} W D^{-1/2})\,x \tag{3.19}$$
However, since the matrix $I_n + D^{-1/2} W D^{-1/2}$ has eigenvalues in $[0, 2]$, we can encounter problems of vanishing/exploding gradients. For this reason, a renormalization of $W$ is introduced:
$$\tilde{W} = W + I_n, \qquad \tilde{D}_{ii} = \sum_j \tilde{W}_{ij} \tag{3.20}$$
so that the resulting convolutional layer can be written as:
$$X^{(k+1)} = \xi\left(\tilde{D}^{-1/2}\,\tilde{W}\,\tilde{D}^{-1/2}\,X^{(k)}\,\Theta\right) \tag{3.21}$$
where $\Theta \in \mathbb{R}^{f_k \times f_{k+1}}$ is the matrix of filter parameters, with $f_{k+1}$ output channels.
It should be noted that, when constructing the new feature matrix with this type of convolutions, we give the same importance to the features of the node itself and to those of its neighbours. However, in some applications it might be useful to introduce a trade-off parameter $\mu$ in the definition of $\tilde{W}$:
$$\tilde{W} = W + \mu I_n$$
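A possible dense implementation of this propagation rule, including the optional trade-off parameter $\mu$, is sketched below; it is written for readability rather than efficiency, and the ReLU is an arbitrary choice for the non-linearity $\xi$.

```python
# Minimal dense sketch of the renormalised propagation rule discussed above,
# with a trade-off parameter mu for the self-loops; sizes are illustrative.
import torch

def gcn_layer(X, W, Theta, mu=1.0):
    n = W.size(0)
    W_tilde = W + mu * torch.eye(n)              # W_tilde = W + mu * I_n
    D_inv_sqrt = torch.diag(W_tilde.sum(dim=1).pow(-0.5))
    A_hat = D_inv_sqrt @ W_tilde @ D_inv_sqrt    # D_tilde^{-1/2} W_tilde D_tilde^{-1/2}
    return torch.relu(A_hat @ X @ Theta)         # xi(A_hat X Theta)

X = torch.randn(5, 8)                            # n = 5 nodes, f_k = 8 features
W = torch.rand(5, 5); W = (W + W.T) / 2; W.fill_diagonal_(0)
Theta = torch.randn(8, 16)                       # f_k x f_{k+1} filter parameters
X_next = gcn_layer(X, W, Theta, mu=1.0)
```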
The main problem of the spectral approach defined by (3.11) can be found in the definition of a convolution that depends on the Fourier basis: this dependence implies that a model learnt on a specific domain is not easily transferable to a different one [36]. For this reason, different approaches have been developed in order to allow the convolution operator to deal with differently sized neighbourhoods [51].
One of the methods that overcome this limitation is introduced in [19]. The basic idea of this method is to learn a function able to generate node embeddings only by looking at the neighbourhood of the single node. The method is called GraphSAGE, which stands for SAmple and aggreGatE, and a schematic representation can be found in figure 3.1.1. The main difference from previous methods is that it doesn't learn a different embedding for each node, but rather an aggregation operator over a subset of the neighbouring nodes:
$$x_{\mathcal{N}(v)}^{(k+1)} = s\left(\{x_u^{(k)}, \forall u \in \mathcal{N}(v)\}\right) \tag{3.22}$$
$$x_v^{(k+1)} = \xi\left(W_k \cdot [\,x_v^{(k)} \,\|\, x_{\mathcal{N}(v)}^{(k+1)}\,]\right) \tag{3.23}$$
In [19], different functions $s$ are proposed. The simplest idea is to use the mean over the vectors of the neighbouring set. Alternatively, a pooling aggregator can be used, in the form:
$$s\left(\{x_u^{(k)}\}\right) = \max\left(\{\xi(W_{\mathrm{pool}}\, x_u^{(k)} + b),\ \forall u \in \mathcal{N}(v)\}\right) \tag{3.24}$$
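The following sketch illustrates the GraphSAGE-style update of equations (3.22)-(3.23) with the mean aggregator; the neighbour lists, the feature sizes and the explicit loop are simplifications made only for clarity.

```python
# Minimal sketch of a GraphSAGE-style update with the mean aggregator;
# sizes and the looped implementation are illustrative simplifications.
import torch
import torch.nn as nn

def sage_update(X, neighbours, lin):
    """X: (n, f) node features; neighbours: list of neighbour-index lists; lin: nn.Linear(2f, f')."""
    out = []
    for v, nbrs in enumerate(neighbours):
        agg = X[nbrs].mean(dim=0) if nbrs else torch.zeros(X.size(1))  # x_N(v) = s({x_u})
        out.append(torch.relu(lin(torch.cat([X[v], agg]))))            # xi(W_k [x_v || x_N(v)])
    return torch.stack(out)

X = torch.randn(4, 8)
neighbours = [[1, 2], [0], [0, 3], [2]]   # the graph given as neighbour lists
lin = nn.Linear(16, 8)
X_next = sage_update(X, neighbours, lin)
```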
Extending the idea of pooling layers from CNNs to graph domains is particularly challenging. This is due to two main factors: the lack of a definition of spatial locality and, in the setting of multiple domains, the variable number of nodes in different data samples. In the context of graphs, in fact, there is no clear definition of what a patch is, and it is therefore complicated to decide which nodes should be pooled together.
Despite the challenge, a notion of pooling is often necessary when dealing with graph classification. It was therefore initially introduced in the form of a readout layer where global pooling is performed: all the node embeddings are pooled together and a simple permutation-invariant function is applied to them. This kind of layer is usually inserted after the message-passing and aggregation steps performed by the convolutional layers, and it is usually followed by one or more fully connected layers that perform the final classification.
The downside of these methods is that they are intrinsically flat and do not allow capturing any hierarchical structure that might be present in the network. This is the reason why the authors of [49] proposed a pooling layer in which a soft cluster assignment is performed and afterwards the node features of each cluster are aggregated together. This structure allows the creation of a smaller graph structure in the next layer, where the information regarding similar nodes has been aggregated together.
The model is based on learning a cluster assignment matrix through a graph neural network module of convolutional layers. This matrix can be denoted by $S^{(k)} \in \mathbb{R}^{n_k \times n_{k+1}}$, where $n_k$ is the number of nodes at layer $k$ and $n_{k+1}$ is the number of clusters that we want to generate (which coincides with the number of nodes of the input to the next layer). If we denote by $X^{(k)} \in \mathbb{R}^{n_k \times f_k}$ and $W^{(k)} \in \mathbb{R}^{n_k \times n_k}$ the matrix of node features and the weighted adjacency matrix at layer $k$ respectively, we can compute the soft assignment as:
$$X^{(k+1)} = S^{(k)T} X^{(k)} \in \mathbb{R}^{n_{k+1} \times f_k} \tag{3.25}$$
where we can note that the number of node features fk remains unchanged
in the pooling process. Subsequently the graph structure between the new
cluster nodes can be generated through:
$$W^{(k+1)} = S^{(k)T} W^{(k)} S^{(k)} \in \mathbb{R}^{n_{k+1} \times n_{k+1}} \tag{3.26}$$
where the elements of $W^{(k+1)}$ denote the strength of the connections between each pair of clusters [49]. This method, called DiffPool, introduces a graph coarsening procedure that allows the size of the graph to be reduced when going deeper into the neural network, and this helps to detect possible hierarchical structures in the graph.
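The coarsening step itself reduces to two matrix products, as the following sketch shows; here the soft assignment $S$ is generated randomly, whereas in DiffPool it would be produced by a dedicated GNN module of convolutional layers.

```python
# Minimal sketch of the DiffPool coarsening step of equations (3.25)-(3.26);
# the assignment S is random here and only stands in for the learnt one.
import torch

n_k, n_k1, f_k = 12, 4, 8
X = torch.randn(n_k, f_k)                          # node features at layer k
W = torch.rand(n_k, n_k)                           # weighted adjacency at layer k
S = torch.softmax(torch.randn(n_k, n_k1), dim=1)   # soft cluster assignment matrix

X_next = S.T @ X        # equation (3.25): n_{k+1} x f_k pooled features
W_next = S.T @ W @ S    # equation (3.26): n_{k+1} x n_{k+1} coarsened adjacency
```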
While the previous approach works really well for small graphs, it can encounter problems when processing bigger structures. First of all, because of the soft-clustering procedure, the resulting graph will not be sparse even if the starting one had few connections [50]. Secondly, it requires storing at each step the soft-clustering matrix, which at the beginning of the network
$$y^{(k)} = \frac{X^{(k)} p^{(k)}}{\|p^{(k)}\|}, \qquad \mathbf{i} = \tau(y^{(k)}, r) \tag{3.27}$$
where $\tau$ is a function that selects only the indices of the top $\lceil rn \rceil$ elements in $y^{(k)}$ and $\mathbf{i}$ is the vector denoting the retained indices. The selection step is then performed by:
$$X^{(k+1)} = \left(X^{(k)} \odot \tanh(y^{(k)})\right)_{\mathbf{i}}, \qquad W^{(k+1)} = W^{(k)}_{\mathbf{i},\mathbf{i}} \tag{3.28}$$
where $\odot$ is the Hadamard product and $(\cdot)_{\mathbf{i}}$ is the indexing operation based on the vector $\mathbf{i}$. The idea behind this method is to relegate the aggregation of information between nodes to the convolutional layers, and to perform just a selection of the most important nodes in the pooling layer. This selection, performed simply by slicing the feature matrix and the adjacency matrix, allows sparsity to be maintained in the graph [8], a characteristic that was missing in DiffPool.
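A minimal sketch of this projection-based pooling is given below; the tanh gating of the retained features is a commonly used choice in this family of layers, and all sizes are illustrative.

```python
# Minimal sketch of the projection-based pooling of equation (3.27) and the
# selection step that follows it; all sizes are illustrative.
import math
import torch

def topk_pool(X, W, p, r):
    y = (X @ p) / p.norm()                             # projection score for each node
    k = math.ceil(r * X.size(0))                       # keep the top ceil(rn) nodes
    idx = torch.topk(y, k).indices                     # i = tau(y, r)
    X_new = X[idx] * torch.tanh(y[idx]).unsqueeze(-1)  # gated slicing of the features
    W_new = W[idx][:, idx]                             # slice rows and columns of W
    return X_new, W_new, idx

X, W, p = torch.randn(10, 6), torch.rand(10, 10), torch.randn(6)
X_new, W_new, idx = topk_pool(X, W, p, r=0.5)          # the coarsened graph stays sparse
```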
with $m$ representing the number of attention heads and $x_i^{(k+1)} \in \mathbb{R}^{m f_{k+1}}$.
The model described in (3.33) helps to obtain more stable results, besides being parallelizable and computationally efficient. Moreover, differently from many layers described in 3.1.1, it does not depend on knowing the graph structure upfront, while at the same time taking into consideration all the neighbours of a certain node when computing its embedding.
For what concerns graph embeddings, instead, the goal is to learn a function $\phi: \mathcal{G} \to \mathbb{R}^m$ that maps the whole graph to a low-dimensional vector. In the context of attention mechanisms this is done by learning an attention function that gives different importance scores to different subregions of the graph. The embedding can then be fed into a standard neural network to perform what is called attention-based graph classification [29].
One of the first works that proposed to use attention in the context of graph classification is GAM [28], which is based on attention-guided walks over a selected number of informative nodes. In particular, GAM uses an attention mechanism over the neighbourhood of a node in order to decide which step to take at each iteration. This mechanism is trained through a reinforcement learning approach and constitutes a model that is able to learn which are the important parts of the graph. Moreover, in the case of very large graphs, multiple walks can be initialized and performed in parallel.
An attention mechanism has also been introduced in the context of a read-out layer in [33]. In fact, given the node feature vectors after the last convolutional layer $\{x_1^{(K)}, \dots, x_n^{(K)}\}$, an embedding of the entire graph $G$ can be computed following the attention model:
$$h_G = \xi\left(\sum_{i \in V} \alpha_i \tanh\!\left(g(x_i^{(K)})\right)\right) \tag{3.34}$$
where $\tilde{D}$ and $\tilde{W}$ are defined as in (3.20), $X$ is the feature matrix at layer $k$ and $p \in \mathbb{R}^{f_k}$ is the only vector of learnable parameters in the layer. The attention scores learnt in this way can then be used to modulate the information passed to the next layer, or can be used in (3.36) to create a pooling layer. Alternatively, other convolutional operations can also be used in (3.37).
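A sketch of such an attention-score layer is given below; it assumes a single GCN-style propagation followed by a softmax normalisation over the nodes, which is one plausible reading of the layer described above rather than its exact definition.

```python
# Minimal sketch of an attention-score layer: one GCN-style propagation with a
# single learnable vector p, giving one score per node. The softmax over the
# nodes is an assumption made here for illustration; dense matrices are used
# only for readability.
import torch

def node_attention_scores(X, W, p):
    n = W.size(0)
    W_tilde = W + torch.eye(n)                              # renormalisation as in (3.20)
    d_inv_sqrt = torch.diag(W_tilde.sum(dim=1).pow(-0.5))
    A_hat = d_inv_sqrt @ W_tilde @ d_inv_sqrt
    return torch.softmax(A_hat @ X @ p, dim=0)              # one attention score per node

X = torch.randn(6, 4)
W = torch.rand(6, 6); W = (W + W.T) / 2
p = torch.randn(4)                                          # the only learnable parameters
alpha = node_attention_scores(X, W, p)
X_gated = alpha.unsqueeze(-1) * X                           # modulate the node features
```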
3.3 Self-Attentive Graph Classifier
The proposed method aims to extend the ideas introduced in [22] to the case
of graph inputs. The authors of [22] presented an attentive framework in
the context of convolutional neural networks for image classification. The
core idea consisted in re-purposing the global representation of the image
as a query that could attend over the hidden representations generated at
intermediate layers of the network. Since every image classification task can
be seen as a specific case of a graph classification problem, the main idea
introduced in this work is to extend their concept to the case of graph classi-
fication. Therefore, an attentive framework that can be applied to different
graph neural network architectures is here proposed.
In the setting of graph classification, the input to the problem is given by a
set of labeled graphs S = {(Gi , yi )}i∈I , where each Gi ∈ G represents a graph
and the corresponding yi ∈ Y is the label representing the class to which the
graph belongs. The cardinality of Y represents the number of different classes
in the dataset, while G is the set of input graphs. The main objective of the
learning process is to find an approximation fˆ of the theoretical function
f : G → Y that maps each graph to its corresponding label.
As previously said, a generic input graph $G$ is characterized by its structure $(V, E, W)$ and by a set of node feature vectors, which are summarized in the input feature matrix $X^{(0)} \in \mathbb{R}^{n_0 \times f_0}$, where $n_0$ is the number of nodes and $f_0$ the number of features. The more general setting, which also includes a feature vector for every edge, is excluded here.
The starting feature matrix, together with the graph structure, is fed into the neural network, which processes them through a sequence of convolutional and pooling layers. After each layer $k$, a new feature matrix $X^{(k)} \in \mathbb{R}^{n_k \times f_k}$ is produced, and, while in the case of convolutional layers the graph structure remains the same, after pooling layers the triplet $(V^{(k)}, E^{(k)}, W^{(k)})$ is also recomputed.
At each layer, the rows of $X^{(k)}$ represent the feature vectors of the nodes present in that layer: $\{x_1, \dots, x_{n_k}\} \in \mathbb{R}^{f_k}$. Before the first layer, these vectors represent only the information pertaining to the corresponding node, but as the layers go on, the node features come to represent the aggregated information of a portion of neighbouring nodes in the graph.
When arriving at the last layer, in order to perform the final graph classification, it is necessary to produce a global vector $g \in \mathbb{R}^m$, in which the processed information of the whole graph is aggregated. This vector can be obtained through different kinds of read-out layers. Commonly used strategies include taking the max or mean values over the feature vectors of all the nodes, or simply summing them up. This produces a global feature vector whose number of features $m$ equals the number $f_k$ of node features in the last convolutional layer of the network. Alternatively, a read-out layer like the one in (3.34) can be exploited.
The proposed method uses the global representation of the input, represented by the vector $g$, as a query in the attention mechanism. Keys and values are instead constituted by the hidden representations of nodes in the intermediate layers (or by linear transformations of them). The basic idea is to use this mechanism at different intermediate layers along the architecture, in order to gather information from different scales.
Given a chosen layer $l$, the global vector $g$ is used to attend over the set of feature vectors $\{x_1, \dots, x_{n_l}\} \in \mathbb{R}^{f_l}$ learnt at that layer. This process produces attention scores $\{\alpha_i^l\}_{i=1,\dots,n_l}$ through the application of a general attention function. The scores are computed through the classical attentive
framework:
$$\alpha_i^l = \frac{e_i^l}{\sum_{i=1}^{n_l} e_i^l}, \qquad e_i^l = a(g, x_i^l) \tag{3.38}$$
where $a$ is a generic function computing the alignment model. This function returns a value representing how well the feature vector of node $i$ at layer $l$ relates to the global representation of the graph, thus creating a multi-scale attention mechanism.
The function $a$ can be a single-layer neural network taking as input a concatenation of the two vectors:
where, in the case $m \neq f_l$, it is first necessary to transform the input feature vectors through a shared projection into a space of dimension $m$, with $W \in \mathbb{R}^{m \times f_l}$. The same has to be done if a multiplicative attention mechanism is used, following:
$$a(g, x_i^l) = g^T (W x_i^l) \tag{3.41}$$
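Putting the pieces together, the following sketch shows how this attentive read-out could be realised for a single layer $l$ with the multiplicative alignment of (3.41); forming the attentive vector from the projected node features, normalising the scores with a softmax and all the dimensions used are illustrative assumptions rather than fixed choices of the method.

```python
# Minimal sketch of the attentive read-out for one layer l with the
# multiplicative alignment of (3.41); choices and sizes are illustrative.
import torch
import torch.nn as nn

n_l, f_l, m = 20, 16, 10
X_l = torch.randn(n_l, f_l)            # node features at the chosen layer l
g = torch.randn(m)                     # global representation from the read-out layer
proj = nn.Linear(f_l, m, bias=False)   # shared projection W into dimension m

e = proj(X_l) @ g                      # e_i^l = g^T (W x_i^l)
alpha = torch.softmax(e, dim=0)        # attention scores over the n_l nodes
g_a = (alpha.unsqueeze(-1) * proj(X_l)).sum(dim=0)   # attentive vector g_a^l for layer l
```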
Chapter 4
Model application
certain pesticides, may play an important role. It is also known that the risk is higher in men than in women and that the disease usually develops in elderly people [34].
Parkinson’s disease can’t be cured, but the symptoms can be controlled and contained for a fairly long period. Medications that increase the level of dopamine can be useful for improving movement and diminishing tremors, while a healthy lifestyle can help to reduce many problems connected to PD, such as constipation, lack of flexibility and high levels of anxiety [27].
Brain imaging is one of the most commonly used techniques for the diagno-
sis of Parkinson’s disease. In particular, dopamine transporter single-photon
emission computed tomography (DaT SPECT) is one of the most effective
tional network impose a spatial bias on image data [12]. Since the process
of DNA methylation seems to be deeply related to gene expression, and thus to protein production, in this work we try to construct a graph neural network that processes methylation data after arranging them on a protein-protein interaction network. This makes it possible to introduce biological knowledge in the model, by letting it know which features in the input are biologically connected to each other [4]. The introduction of prior validated knowledge can be particularly useful in the case of high-dimensional data with a restricted number of samples, which is the case of most datasets containing DNA methylation data.
$$\beta = \frac{M}{M + U}$$
where $M$ is the number of cells in the sample where the analyzed CpG site is methylated and $U$ is the number of cells for which the site is not methylated.
The SPECT dataset consists instead of 3D grayscale images that have been
aligned together and saved in the NifTI-1 data format. Each image is repre-
sented by a 91 × 109 × 91 tensor, with values normalized to the range [0, 1].
From those projections, it is possible to notice how the data seem to be divided into two clusters, related to the biological sex of the patients. PD and control patients seem to be equally distributed between the two clusters. For this reason, all the probes related to DNA positions situated on sex chromosomes have been excluded from the analysis.
Moreover, [52] identified a number of probes that, because of the specific region of DNA they refer to, should be excluded from the analyses. We therefore further pre-filter those probes from the data matrix.
A further selection step is performed considering the final goal of embedding
the features in a network of genes: only the CpGs located in a DNA position
corresponding to a gene are kept in the analysis.
Subsequently, the 10000 probes most informative for Parkinson’s disease classification have been selected on the basis of univariate ANOVA statistical tests.
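A selection of this kind can be obtained, for instance, with scikit-learn; in the sketch below the methylation matrix and the labels are random placeholders, and only the selection step itself reflects the procedure described above.

```python
# Minimal sketch (not the thesis pipeline) of univariate ANOVA-based probe
# selection; `betas` and `labels` are random placeholders standing in for the
# preprocessed methylation matrix and the diagnosis labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
betas = rng.random((450, 20000))          # samples x probes (beta values), sizes illustrative
labels = rng.integers(0, 2, size=450)     # PD vs. control

selector = SelectKBest(score_func=f_classif, k=10000)   # ANOVA F-test per probe
betas_selected = selector.fit_transform(betas, labels)
kept_probes = selector.get_support(indices=True)        # indices of the retained probes
```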
As previously said, the goal is to embed the selected features on a biologically meaningful graph. There are many different biological graphs that can be considered, which differ in the number of genes considered and in the amount and type of relationships between them. For the purpose of this work we decided to use the protein-protein interaction graph. In particular, we refer to the interaction data registered in the STRING database [13], which contains information on gene associations recovered from multiple sources, such as biological experiments, literature information and computational prediction. We decided to focus in particular on the associations that have been biologically proved, in order to avoid adding more uncertainty to the problem.
The resulting graph dataset is composed of 450 graphs, each with 10000 nodes: each node represents a location in the DNA (a CpG site). The edges between nodes are instead constructed based on the biological interactions between the proteins encoded by the genes in which the CpG sites are located.
Figure 4.5: Slice of a SPECT image on the left; resulting boundaries given by the SLIC algorithm on the right

• the feature vector of each node is constituted by four features: the first one is the mean of the intensity values of the elements in the superpixel, while the other three are the coordinates of the center of mass $z \in \mathbb{R}^3$ of the superpixel itself;
• edges between nodes are formed based on the Euclidean distance between centers of mass, following the formula [11]:
$$W_{ij} = \exp\!\left(-\frac{\|z_i - z_j\|^2}{\sigma^2}\right)$$
with $\sigma = 0.1\pi$. Only for the $k = 20$ nearest nodes is the edge actually maintained in the final graph.
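The following sketch illustrates this construction with SciPy's k-d tree; the superpixel centres and intensities are random placeholders standing in for the SLIC output, and only the k-nearest-neighbour search and the Gaussian weighting reflect the procedure described above.

```python
# Minimal sketch of the superpixel-graph construction described above; the
# centres of mass and mean intensities are random placeholders.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n_superpixels = 500
centres = rng.random((n_superpixels, 3))        # centre of mass z of each superpixel
intensity = rng.random((n_superpixels, 1))      # mean intensity inside each superpixel
features = np.hstack([intensity, centres])      # the 4 node features

sigma, k = 0.1 * np.pi, 20
dist, idx = cKDTree(centres).query(centres, k=k + 1)      # +1 because each point finds itself
rows = np.repeat(np.arange(n_superpixels), k)             # source node of each edge
cols = idx[:, 1:].ravel()                                 # its k nearest neighbours
weights = np.exp(-dist[:, 1:].ravel() ** 2 / sigma ** 2)  # W_ij = exp(-||z_i - z_j||^2 / sigma^2)
```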
The code has been implemented in Python, using the libraries PyTorch [38] and PyTorch Geometric [14] for the neural network components.
The model constructed for classifying the methylation graphs consists of 3 convolutional layers, followed by 2 fully connected layers. The size of the hidden node representations has been set to 4 after some experiments. Skip connections after each convolutional layer are also included to make the model more flexible. The large size of the underlying graph constitutes a limit in the construction of a deeper and more complex architecture.
The network has been trained with the RMSProp (Root Mean Square Propagation) algorithm, using a standard cross-entropy loss function. The training is performed for 200 epochs, with an early stopping option if the accuracy does not improve for a certain number of epochs. The hyperparameters used during training are summarized in table 4.1.
The results obtained with different layer types are reported in figure 4.6. The results are given in terms of Area Under the ROC (Receiver Operating Characteristic) curve. The ROC curve summarizes the performance of a binary classification model on the positive class, plotting the True Positive Rate against the False Positive Rate as the threshold parameter of the classifier varies. The Area Under the ROC Curve (AUC) is a good metric for evaluating the discriminative power of a classifier, and it is preferable to the accuracy in the case of imbalanced datasets with binary labels.
Figure 4.6: Average value of AUC obtained using different types of convolu-
tional layers in the model
We can observe that none of the classifiers provides a high average ROC-AUC value. However, the one composed of GraphSAGE convolutional layers seems to perform slightly better than the rest, providing an average AUC of 0.74 and an average classification accuracy of 75.8%.
The attentive model proposed in 3.3 has also been tested on the methylation graphs, but it does not seem able to capture the information carried by the methylation signal.
The reasons for the low performance observed in figure 4.6 can be various. First of all, the limited amount of data available, together with the high dimensionality of the dataset and the amount of noise that it contains, makes it difficult to train any model in a 3-fold cross-validation fashion. Only 66.6% of the data is in fact used during training, while the remaining samples are equally split between validation and test sets.
Secondly, we made a strong assumption by choosing a specific biological structure for the underlying graph. Different experiments should be performed to analyse whether, using different biological graphs, the results might improve. Moreover, the large size of the graphs made the training of a deeper convolutional network difficult, while at the same time the feature selection step was also strongly affected by the few samples available for such noisy data.
Differently from the classifier used for the methylation data, the models constructed for classifying the SPECT graphs exploit the framework introduced in section 3.3. They consist of 6 convolutional layers and, also in this case, three different types of convolution (ChebConv, GraphSAGE and GAT) are analyzed. The latent feature vectors are this time composed of 10 values, which is also the number of variables that constitute the graph representation $g$, produced after the fully connected layers attached at the end of the network.
Different experiments have been carried out in the context of the final integration of the attentive vectors:
• case 1: the attentive vectors $\{g_a^l\}_{l \in L}$ are averaged together and the resulting vector is used for the final classification;
• case 2: the vectors $\{g_a^l\}_{l \in L}$ are concatenated together before performing the final classification;
• case 3: the global vector $g$ and the vectors $\{g_a^l\}_{l \in L}$ are averaged together and then used for classification.
Figure 4.7: Average value of AUC obtained using different kinds of convolutional layers in the model, with different integrations of the attentive vectors
Interestingly, the results are different from what was expected: the attention model seems to work slightly better, especially in terms of AUC, when applied after the first layers of the network. In the case of convolutional networks applied to Euclidean images, [22] noticed that attention models attached to deeper layers brought better results. This was easily explained by the fact that in images from the natural world, such as the ones analyzed in that paper (CIFAR-100), the last layers are the ones that convey the most information about the objects present in the image. However, in our case the setting is quite different: all the images represent the same object (a patient's brain) and it is more probable that important details for classification can be found at smaller scales. Moreover, the division into superpixels already caused a coarsening of the input data, so that each superpixel and its immediate neighbours already contain aggregated information ready to be used by the network.
More experiments concerning the analysis of network depth and variations of the attention functions should be considered in order to further optimize the models.
Chapter 5
Conclusions and future works
data, functional magnetic resonance imaging (fMRI) is a kind of data that is naturally predisposed to be represented as graphs. After the necessary preprocessing steps, specific regions of interest in the brain can be encoded as nodes and then, by computing the correlation of signals between these regions, a correlation matrix can be constructed. The final edges are derived from the correlation matrix by applying a task-specific thresholding operation. The resulting connectivity graphs, describing functional relationships between regions of the brain, represent the perfect input for graph neural networks [32]. In this setting, an attentive model could potentially bring many insights on the parts of the brain that relate to specific functional problems, like the ones connected to Parkinson's disease.
In the context of data integration in the biomedical field, at least at the current stage, one of the main limitations is given by the fact that not all the data types are available for every patient, so classical data-integration techniques may fail to work because of the small amount of data in the intersection of the datasets. It could therefore be interesting to explore how data encoders, trained separately on different data modalities, would work in a downstream integration task when presented with new data that have not been seen before.
Finally, integration with other datasets and exploration of transfer learning
techniques might be one of the most promising fields to explore, with the goal
of expanding the available knowledge in the context of Parkinson’s disease.
Bibliography
[1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pas-
cal Fua, and Sabine Susstrunk. Slic superpixels, June 2010.
[2] Moran Artzi, Einat Even-Sapir, Hedva Shacham, Avner Thaler, Avi
Urterger, Susan Bressman, Karen Marder, Talma Hendler, Nir Giladi,
Dafna Ben Bashat, and Anat Mirelman. Dat-spect assessment depicts
dopamine depletion among asymptomatic g2019s lrrk2 mutation carri-
ers. PLOS ONE, 12, 04 2017.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural ma-
chine translation by jointly learning to align and translate. CoRR,
abs/1409.0473, 2014.
[4] Paul Bertin, Mohammad Hashir, Martin Weiß, Geneviève Boucher, Vin-
cent Frappier, and Joseph Paul Cohen. Analysis of gene interaction
graphs for biasing machine learning models. ArXiv, abs/1905.02295,
2019.
[5] Lucijano Berus, Simon Klancnik, Miran Brezocnik, and Mirko Ficko.
Classifying parkinson’s disease based on acoustic measures using artifi-
cial neural networks. Sensors, 19:16, 12 2018.
[6] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and
Pierre Vandergheynst. Geometric deep learning: Going beyond eu-
clidean data. IEEE Signal Processing Magazine, 34:18–42, 2017.
[7] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun.
Spectral networks and locally connected networks on graphs. CoRR,
abs/1312.6203, 2013.
[9] Hongyoon Choi, Seunggyun Ha, Hyung-Jun Im, Sun Paek, and Dong
Lee. Refining diagnosis of parkinson’s disease with deep learning-based
interpretation of dopamine transporter imaging. NeuroImage: Clinical,
16, 09 2017.
[12] Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko,
and Yoshua Bengio. Towards gene expression convolutions using gene
interaction graphs. ArXiv, abs/1806.06975, 2018.
[14] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning
with pytorch geometric. ArXiv, abs/1903.02428, 2019.
[16] Hongyang Gao and Shuiwang Ji. Graph u-nets. In ICML, 2019.
[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.
MIT Press, 2016. http://www.deeplearningbook.org.
[19] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive repre-
sentation learning on large graphs. In NIPS, 2017.
[21] Steve Horvath and Ken Raj. Dna methylation-based biomarkers and
the epigenetic clock theory of ageing. Nature Reviews Genetics, 19, 04
2018.
[22] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr.
Learn to pay attention. ArXiv, abs/1804.02391, 2018.
[24] Ivan S. Klyuzhin, Nikolay Shenkov, Arman Rahmim, and Vesna Sossi.
Use of deep convolutional neural networks to predict parkinson’s disease
progression from datscan spect images. 2018.
[25] Boris Knyazev, Xiao Lin, Mohammed Abdel Rahman Amer, and Gra-
ham W. Taylor. Image classification with hierarchical multigraph net-
works. ArXiv, abs/1907.09000, 2019.
[27] Antonina Kouli, Kelli M. Torsney, and Wei-Li Kuan. Parkinson’s dis-
ease: Etiology, neuropathology, and pathogenesis. In Parkinson’s Dis-
ease: Pathogenesis and Clinical Aspects. 2018.
[28] John Boaz Lee, Ryan Rossi, and Xiangnan Kong. Graph classification
using structural attention. In Proceedings of the 24th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining, KDD
’18, page 1666–1674, New York, NY, USA, 2018. Association for Com-
puting Machinery.
[29] John Boaz Lee, Ryan A. Rossi, Sungchul Kim, Nesreen Ahmed,
and Eunyee Koh. Attention models in graphs: A survey. ArXiv,
abs/1807.07984, 2018.
[30] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-attention graph pool-
ing. In ICML, 2019.
[32] Xiaoxiao Li, Nicha C. Dvornek, Yuan Zhou, Juntang Zhuang, Pamela
Ventola, and James S. Duncan. Graph neural network for interpreting
task-fmri biomarkers. In MICCAI, 2019.
[33] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel.
Gated graph sequence neural networks. CoRR, abs/1511.05493, 2016.
[35] Volodymyr Mnih, Nicolas Manfred Otto Heess, Alex Graves, and Koray
Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
[37] Lisa Moore, Thuc Le, and Guoping Fan. Dna methylation and its basic
[38] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward
Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga,
and Adam Lerer. Automatic differentiation in pytorch. 2017.
[39] Pereira R., Silke Weber, Christian Hook, Gustavo de Rosa, and João
Papa. Deep learning-aided parkinson’s disease diagnosis from handwrit-
ten dynamics. 06 2016.
[40] SungMin Rhee, Seokjun Seo, and Sun Kim. Hybrid approach of relation
network and localized graph convolutional filtering for breast cancer
subtype classification. In IJCAI, 2017.
[42] Seong-Jin Son, Mansu Kim, and Hyunjin Park. Imaging analysis of
parkinson’s disease patients using spect and tractography. Scientific
Reports, 6(1):38070, 2016.
[43] Sven Suwijn, Caroline Boheemen, Rob de Haan, Gerrit Tissingh, Jan
Booij, and Rob Bie. The diagnostic accuracy of dopamine transporter
spect imaging to detect nigrostriatal cell loss in patients with parkin-
son’s disease or clinically uncertain parkinsonism: A systematic review.
EJNMMI research, 5:12, 12 2015.
[44] Devin Taylor, Simeon E. Spasov, and Pietro Lió. Co-attentive cross-
modal deep learning for medical evidence synthesis and decision making.
ArXiv, abs/1909.06442, 2019.
[45] Qi Tian, Jianxiao Zou, Jianxiong Tang, Yuan Fang, Zhongli Yu, and
Shicai Fan. Mrcnn: a deep learning model for regression of genome-
wide dna methylation. BMC Genomics, 20(2):192, 2019.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In NIPS, 2017.
[48] Qianfan Wu, Adel Boueiz, Alican Bozkurt, Arya Masoomi, Allan Wang,
Dawn DeMeo, Scott Weiss, and Weiliang Qiu. Deep learning methods
for predicting disease status using genomic data. Journal of biometrics
and biostatistics, 9, 01 2018.
[49] Rex Ying, Jiaxuan You, Cynthia. Morris, Xiang Ren, William L. Hamil-
ton, and Jure Leskovec. Hierarchical graph representation learning with
differentiable pooling. In NeurIPS, 2018.
[50] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A
survey. ArXiv, abs/1812.04202, 2018.
[51] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and
Maosong Sun. Graph neural networks: A review of methods and appli-
cations. ArXiv, abs/1812.08434, 2018.
[52] Wanding Zhou, Peter Laird, and Hui Shen. Comprehensive charac-
terization, annotation and innovative use of infinium dna methylation
beadchip probes. Nucleic acids research, 45, 10 2016.